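# [gperftools] 40% Faster Without Changing a Line of Code, Just by Linking One More Library

It sounds like fantasy, but it really did happen this afternoon.

I am working on an extremely performance-sensitive physics simulation program that uses [OpenVolumeMesh](https://www.openvolumemesh.org) to handle volumetric meshes. `OpenVolumeMesh` is very pleasant to use, but it depends heavily on STL containers (mainly `std::vector`). Not only is its data stored in STL containers internally; its iteration and traversal helpers also often just return a `vector` and call it a day. As you can imagine, memory gets allocated very frequently at run time. On top of that, I use [Intel TBB](https://www.threadingbuildingblocks.org) for multithreaded computation, so with 8 threads each dynamically allocating swarms of small `vector`s, the scene is too "beautiful" to watch.

I had long heard that `TCMALLOC` copes well with this kind of fragmented allocation, and that it reportedly speeds up STL operations and similar allocation-heavy code by a considerable margin. Today I finally had time to try it, and the result was surprisingly good. `TCMALLOC` is part of the [gperftools](https://github.com/gperftools/gperftools) suite released by [a certain company that does not exist](https://www.google.com). The official source repository is the URL above (even though it looks like a knock-off).

# Installation

I have been using a Mac lately, so installation was simple: `brew install gperftools` and done.

Installation on Linux is odd. The tutorials I found all say you need to install something called `libunwind` because of a bug in `glibc`, but those tutorials date from around 2013. Newer tutorials are scarce; [this one](http://pkxpp.github.io/2017/03/30/gperftools%E5%AE%89%E8%A3%85/) is fairly recent and does not mention `libunwind`. On current [[linux:manjaro|Manjaro]], installing it with `pacman` just works, and I have not run into any problems so far.

For Windows, the source repository contains a `README_windows.txt`.

# CMake

Upstream does not ship anything for `CMake`. According to the documentation, most use cases need no source changes at all, only linking against the right library, so writing the `CMake` part by hand is easy. [Here is a find module that looks quite good to me](https://github.com/vast-io/vast/blob/master/cmake/FindGperftools.cmake):

    # Tries to find Gperftools.
    #
    # Usage of this module as follows:
    #
    #     find_package(Gperftools)
    #
    # Variables used by this module, they can change the default behaviour and need
    # to be set before calling find_package:
    #
    #  Gperftools_ROOT_DIR  Set this variable to the root installation of
    #                       Gperftools if the module has problems finding
    #                       the proper installation path.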
    #
    # Variables defined by this module:
    #
    #  GPERFTOOLS_FOUND              System has Gperftools libs/headers
    #  GPERFTOOLS_LIBRARIES          The Gperftools libraries (tcmalloc & profiler)
    #  GPERFTOOLS_INCLUDE_DIR        The location of Gperftools headers

    find_library(GPERFTOOLS_TCMALLOC
      NAMES tcmalloc
      HINTS ${Gperftools_ROOT_DIR}/lib)

    find_library(GPERFTOOLS_PROFILER
      NAMES profiler
      HINTS ${Gperftools_ROOT_DIR}/lib)

    find_library(GPERFTOOLS_TCMALLOC_AND_PROFILER
      NAMES tcmalloc_and_profiler
      HINTS ${Gperftools_ROOT_DIR}/lib)

    find_path(GPERFTOOLS_INCLUDE_DIR
      NAMES gperftools/heap-profiler.h
      HINTS ${Gperftools_ROOT_DIR}/include)

    set(GPERFTOOLS_LIBRARIES ${GPERFTOOLS_TCMALLOC_AND_PROFILER})

    include(FindPackageHandleStandardArgs)
    find_package_handle_standard_args(
      Gperftools
      DEFAULT_MSG
      GPERFTOOLS_LIBRARIES
      GPERFTOOLS_INCLUDE_DIR)

    mark_as_advanced(
      Gperftools_ROOT_DIR
      GPERFTOOLS_TCMALLOC
      GPERFTOOLS_PROFILER
      GPERFTOOLS_TCMALLOC_AND_PROFILER
      GPERFTOOLS_LIBRARIES
      GPERFTOOLS_INCLUDE_DIR)

Put this file (as `FindGperftools.cmake`) in a `CustomCMake` folder in the project tree, add that folder to the CMake module path with `set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} ${CMAKE_CURRENT_SOURCE_DIR}/CustomCMake)`, and then write the following in the project's `CMakeLists.txt`:

    if (Simulator_TcMalloc OR Simulator_Profile)  # either option needs the find module
        find_package(Gperftools REQUIRED)
        if(GPERFTOOLS_FOUND)
            include_directories(${GPERFTOOLS_INCLUDE_DIR})
        else()
            message(FATAL_ERROR "Gperftools not found. Set Gperftools_ROOT_DIR to correct path.")
        endif()
    endif()

That is already a complete usage, and it is only 8 lines. Since the project includes both the `profile` and `tcmalloc` modules, I also added two options:

    option(Simulator_TcMalloc "Use google tcmalloc instead of default malloc." ON)
    option(Simulator_Profile "Use google CPU profiler to record performance details."
           OFF)

    # Any other setup goes here

    if(Simulator_TcMalloc AND Simulator_Profile)
        target_link_libraries(srmSimulator ${GPERFTOOLS_TCMALLOC_AND_PROFILER})
    elseif(Simulator_TcMalloc AND (NOT Simulator_Profile))
        target_link_libraries(srmSimulator ${GPERFTOOLS_TCMALLOC})
    elseif((NOT Simulator_TcMalloc) AND Simulator_Profile)
        target_link_libraries(srmSimulator ${GPERFTOOLS_PROFILER})
    endif()

Then clean the project and rebuild, and that is it.

## Testing

I originally wanted to measure with the `pprof` tool that ships with `gperftools`, but for some reason the damn thing's output is missing most function names. I do not know whether that is a compatibility problem or because I built the `release` configuration (`nm` does show the function names, though), and I cannot rule out my heavy use of `lambda` expressions either. The `debug` build takes half an hour per run because of the `STL`, and I did not want to spend the time.

So I had to do it by hand. To keep things tidy, I tied the timing code to the `Simulator_Profile` CMake option, so that no profiling code ends up in the production build.

On the CMake side:

    if (Simulator_Profile)
        add_definitions(-DSIMULATOR_PROFILE)
    endif()

On the C++ side:

    #ifdef SIMULATOR_PROFILE
    #include <chrono>
    #endif

    // Inside some function
    #ifdef SIMULATOR_PROFILE
        auto evolveStart = std::chrono::steady_clock::now();
    #endif
        // ... do one thing ...
    #ifdef SIMULATOR_PROFILE
        auto evolveEnd = std::chrono::steady_clock::now();
        logger->info("[Profile] Evolving duration: {0} ms",
                     std::chrono::duration_cast<std::chrono::milliseconds>(evolveEnd - evolveStart).count());
    #endif

    #ifdef SIMULATOR_PROFILE
        auto reinitialStart = std::chrono::steady_clock::now();
    #endif
        for (int r = 0; r < levelSetReinitialPasses; ++r) {
            // ... do another thing ...
        }
    #ifdef SIMULATOR_PROFILE
        auto reinitialEnd = std::chrono::steady_clock::now();
        // Time the reinitialization loop itself, then average over the passes.
        double ms = std::chrono::duration_cast<std::chrono::milliseconds>(reinitialEnd - reinitialStart).count();
        logger->info("[Profile] Averaged reinitial duration: {0} ms", ms / levelSetReinitialPasses);
    #endif

With that done, run it and see. With the `Simulator_TcMalloc` option turned off:

```
[info] LSM round 1
[info] [Profile] Evolving duration: 2993 ms
[info] [Profile] Averaged reinitial duration: 598.6 ms
[info] LSM round 2
[info] [Profile] Evolving duration: 3523 ms
[info] [Profile] Averaged reinitial duration: 704.6 ms
[info] LSM round 3
[info] [Profile] Evolving duration: 3613 ms
[info] [Profile] Averaged reinitial duration: 722.6 ms
[info] LSM round 4
[info] [Profile] Evolving duration: 3504 ms
[info] [Profile] Averaged reinitial duration: 700.8 ms
[info] LSM round 5
[info] [Profile] Evolving duration: 3520 ms
[info] [Profile] Averaged reinitial duration: 704 ms
[info] LSM round 6
[info] [Profile] Evolving duration: 3621 ms
[info] [Profile] Averaged reinitial duration: 724.2 ms
[info] LSM round 7
[info] [Profile] Evolving duration: 3527 ms
[info] [Profile] Averaged reinitial duration: 705.4 ms
[info] LSM round 8
[info] [Profile] Evolving duration: 3563 ms
[info] [Profile] Averaged reinitial duration: 712.6 ms
[info] LSM round 9
[info] [Profile] Evolving duration: 3532 ms
[info] [Profile] Averaged reinitial duration: 706.4 ms
[info] LSM round 10
[info] [Profile] Evolving duration: 3522 ms
[info] [Profile] Averaged reinitial duration: 704.4 ms
[info] Simulation complete.
```

With the `Simulator_TcMalloc` option turned on:

```
[info] LSM round 1
[info] [Profile] Evolving duration: 1901 ms
[info] [Profile] Averaged reinitial duration: 380.2 ms
[info] LSM round 2
[info] [Profile] Evolving duration: 1904 ms
[info] [Profile] Averaged reinitial duration: 380.8 ms
[info] LSM round 3
[info] [Profile] Evolving duration: 2136 ms
[info] [Profile] Averaged reinitial duration: 427.2 ms
[info] LSM round 4
[info] [Profile] Evolving duration: 2225 ms
[info] [Profile] Averaged reinitial duration: 445 ms
[info] LSM round 5
[info] [Profile] Evolving duration: 2267 ms
[info] [Profile] Averaged reinitial duration: 453.4 ms
[info] LSM round 6
[info] [Profile] Evolving duration: 2066 ms
[info] [Profile] Averaged reinitial duration: 413.2 ms
[info] LSM round 7
[info] [Profile] Evolving duration: 2062 ms
[info] [Profile] Averaged reinitial duration: 412.4 ms
[info] LSM round 8
[info] [Profile] Evolving duration: 2135 ms
[info] [Profile] Averaged reinitial duration: 427 ms
[info] LSM round 9
[info] [Profile] Evolving duration: 2494 ms
[info] [Profile] Averaged reinitial duration: 498.8 ms
[info] LSM round 10
[info] [Profile] Evolving duration: 2179 ms
[info] [Profile] Averaged reinitial duration: 435.8 ms
[info] Simulation complete.
```

In both logs every "Averaged reinitial duration" is exactly the evolving duration divided by 5, so these runs evidently measured `evolveEnd - evolveStart` in both places; only the evolving durations should be compared.

The difference is obvious at a glance!
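Of course, an ordinary program probably will not see a speedup this dramatic. This program benefits so much for two reasons. First, STL operations make up a large share of its work, so memory is allocated very frequently. Second, it is heavily multithreaded, and STL allocation (which by default ends up in the system `malloc`) reportedly takes a lock, so when two threads ask for memory at the same time one of them has to wait in line. `TCMALLOC` (thread-caching malloc), by contrast, was designed specifically with multithreading in mind.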