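# [gperftools] Not a line changed, one extra library linked, and the program runs 40% faster
It sounds like fantasy, but this really did happen this afternoon.
I'm working on an extremely performance-sensitive physics simulation program that uses [OpenVolumeMesh](https://www.openvolumemesh.org) to handle volumetric meshes. `OpenVolumeMesh` is very pleasant to use, but unfortunately it leans heavily on STL containers (mainly `std::vector`). Not only is its internal data stored in STL containers, its iteration and traversal helpers also tend to just return a `vector` and call it a day. As you can imagine, memory gets allocated very frequently at run time. On top of that I use [Intel TBB](https://www.threadingbuildingblocks.org) for multithreading, so with 8 threads running, each dynamically allocating swarms of small `vector`s, it is not a pretty sight.
I had long heard that `TCMALLOC` handles this kind of fragmented allocation pattern well, reportedly speeding up STL-heavy or similarly allocation-heavy code by some impressive factor. Today I finally had time to try it, and the result was surprisingly good. `TCMALLOC` is part of the [gperftools](https://github.com/gperftools/gperftools) suite released by [a certain company that does not exist](https://www.google.com). The official source repository is the link above (even though it looks like a knockoff).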
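# Installation
Since I've been on a Mac lately, installation is simple: `brew install gperftools` and done.
Installing on Linux is odd. The tutorials I found all say to install something called `libunwind`, because of some bug in `glibc`. But those tutorials are all from around 2013. Newer ones are rare; [this one](http://pkxpp.github.io/2017/03/30/gperftools%E5%AE%89%E8%A3%85/) is relatively recent and does not mention `libunwind`.
On current Manjaro, installing it with `pacman` just works, and I have not run into any problems so far.
On Windows, the source repository has a `README_windows.txt`.
# CMake
The project does not officially provide anything `CMake`-related. According to the documentation, most use cases need no source changes at all, only linking against the appropriate library, so hand-writing the `CMake` part is easy too. [Here is a find module that looks quite good to me](https://github.com/vast-io/vast/blob/master/cmake/FindGperftools.cmake):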
```cmake
# Tries to find Gperftools.
#
# Usage of this module as follows:
#
#     find_package(Gperftools)
#
# Variables used by this module, they can change the default behaviour and need
# to be set before calling find_package:
#
#  Gperftools_ROOT_DIR  Set this variable to the root installation of
#                       Gperftools if the module has problems finding
#                       the proper installation path.
#
# Variables defined by this module:
#
#  GPERFTOOLS_FOUND              System has Gperftools libs/headers
#  GPERFTOOLS_LIBRARIES          The Gperftools libraries (tcmalloc & profiler)
#  GPERFTOOLS_INCLUDE_DIR        The location of Gperftools headers

find_library(GPERFTOOLS_TCMALLOC
  NAMES tcmalloc
  HINTS ${Gperftools_ROOT_DIR}/lib)

find_library(GPERFTOOLS_PROFILER
  NAMES profiler
  HINTS ${Gperftools_ROOT_DIR}/lib)

find_library(GPERFTOOLS_TCMALLOC_AND_PROFILER
  NAMES tcmalloc_and_profiler
  HINTS ${Gperftools_ROOT_DIR}/lib)

find_path(GPERFTOOLS_INCLUDE_DIR
  NAMES gperftools/heap-profiler.h
  HINTS ${Gperftools_ROOT_DIR}/include)

set(GPERFTOOLS_LIBRARIES ${GPERFTOOLS_TCMALLOC_AND_PROFILER})

include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(
  Gperftools
  DEFAULT_MSG
  GPERFTOOLS_LIBRARIES
  GPERFTOOLS_INCLUDE_DIR)

mark_as_advanced(
  Gperftools_ROOT_DIR
  GPERFTOOLS_TCMALLOC
  GPERFTOOLS_PROFILER
  GPERFTOOLS_TCMALLOC_AND_PROFILER
  GPERFTOOLS_LIBRARIES
  GPERFTOOLS_INCLUDE_DIR)
```
Put this file in a `CustomCMake` folder in the project tree, make sure to call `set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} ${CMAKE_CURRENT_SOURCE_DIR}/CustomCMake)` so that CMake can find it, and then in the project's `CMakeLists.txt`:
```cmake
if (Simulator_TcMalloc OR Simulator_Profile)
  find_package(Gperftools REQUIRED)
  if(GPERFTOOLS_FOUND)
    include_directories(${GPERFTOOLS_INCLUDE_DIR})
  else()
    message(FATAL_ERROR "Gperftools not found. Set Gperftools_ROOT_DIR to correct path.")
  endif()
endif()
```
That is already a complete usage, and it is only 8 lines. Since the project contains two modules, `profile` and `tcmalloc`, I also added two options:
```cmake
option(Simulator_TcMalloc "Use google tcmalloc instead of default malloc." ON)
option(Simulator_Profile "Use google CPU profiler to record performance details." OFF)

# ... whatever else the project needs goes here ...

if(Simulator_TcMalloc AND Simulator_Profile)
  target_link_libraries(srmSimulator ${GPERFTOOLS_TCMALLOC_AND_PROFILER})
elseif(Simulator_TcMalloc AND (NOT Simulator_Profile))
  target_link_libraries(srmSimulator ${GPERFTOOLS_TCMALLOC})
elseif((NOT Simulator_TcMalloc) AND Simulator_Profile)
  target_link_libraries(srmSimulator ${GPERFTOOLS_PROFILER})
endif()
```
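Then clean the project and rebuild, and that's it.
## Testing
I originally wanted to test with the `pprof` tool that comes with `gperftools`, but for some reason the damn thing's output lacks most function names. I don't know whether it is a compatibility issue or because I built the `release` configuration (`nm` can see the function names just fine), and I can't rule out my heavy use of `lambda` expressions either. Because of `STL`, one run of the `debug` build takes half an hour, and I did not want to spend the time.
So I had to do it by hand. To keep things tidy, the timing code is tied to the CMake `Simulator_Profile` option, so that the release build carries no `profile` behavior:
CMake side: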
```cmake
if (Simulator_Profile)
  add_definitions(-DSIMULATOR_PROFILE)
endif()
```
C++ side:
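```cpp
#ifdef SIMULATOR_PROFILE
#include <chrono>
#endif

// Inside some function
#ifdef SIMULATOR_PROFILE
auto evolveStart = std::chrono::steady_clock::now();
#endif
// ... do the evolving step ...
#ifdef SIMULATOR_PROFILE
auto evolveEnd = std::chrono::steady_clock::now();
logger->info("[Profile] Evolving duration: {0} ms",
             std::chrono::duration_cast<std::chrono::milliseconds>(evolveEnd - evolveStart).count());
#endif

#ifdef SIMULATOR_PROFILE
auto reinitialStart = std::chrono::steady_clock::now();
#endif
for (int r = 0; r < levelSetReinitialPasses; ++r) {
    // ... do one reinitialization pass ...
}
#ifdef SIMULATOR_PROFILE
auto reinitialEnd = std::chrono::steady_clock::now();
// Time the reinitialization loop itself. The logs below apparently came from a
// build that subtracted evolveEnd - evolveStart here, which is why their
// "reinitial" figures equal the evolving duration divided by the pass count.
double ms = std::chrono::duration_cast<std::chrono::milliseconds>(reinitialEnd - reinitialStart).count();
logger->info("[Profile] Averaged reinitial duration: {0} ms", ms / levelSetReinitialPasses);
#endif
```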
Then run it and take a look. With the `Simulator_TcMalloc` option off:
```
[info] LSM round 1
[info] [Profile] Evolving duration: 2993 ms
[info] [Profile] Averaged reinitial duration: 598.6 ms
[info] LSM round 2
[info] [Profile] Evolving duration: 3523 ms
[info] [Profile] Averaged reinitial duration: 704.6 ms
[info] LSM round 3
[info] [Profile] Evolving duration: 3613 ms
[info] [Profile] Averaged reinitial duration: 722.6 ms
[info] LSM round 4
[info] [Profile] Evolving duration: 3504 ms
[info] [Profile] Averaged reinitial duration: 700.8 ms
[info] LSM round 5
[info] [Profile] Evolving duration: 3520 ms
[info] [Profile] Averaged reinitial duration: 704 ms
[info] LSM round 6
[info] [Profile] Evolving duration: 3621 ms
[info] [Profile] Averaged reinitial duration: 724.2 ms
[info] LSM round 7
[info] [Profile] Evolving duration: 3527 ms
[info] [Profile] Averaged reinitial duration: 705.4 ms
[info] LSM round 8
[info] [Profile] Evolving duration: 3563 ms
[info] [Profile] Averaged reinitial duration: 712.6 ms
[info] LSM round 9
[info] [Profile] Evolving duration: 3532 ms
[info] [Profile] Averaged reinitial duration: 706.4 ms
[info] LSM round 10
[info] [Profile] Evolving duration: 3522 ms
[info] [Profile] Averaged reinitial duration: 704.4 ms
[info] Simulation complete.
```
With the `Simulator_TcMalloc` option on:
```
[info] LSM round 1
[info] [Profile] Evolving duration: 1901 ms
[info] [Profile] Averaged reinitial duration: 380.2 ms
[info] LSM round 2
[info] [Profile] Evolving duration: 1904 ms
[info] [Profile] Averaged reinitial duration: 380.8 ms
[info] LSM round 3
[info] [Profile] Evolving duration: 2136 ms
[info] [Profile] Averaged reinitial duration: 427.2 ms
[info] LSM round 4
[info] [Profile] Evolving duration: 2225 ms
[info] [Profile] Averaged reinitial duration: 445 ms
[info] LSM round 5
[info] [Profile] Evolving duration: 2267 ms
[info] [Profile] Averaged reinitial duration: 453.4 ms
[info] LSM round 6
[info] [Profile] Evolving duration: 2066 ms
[info] [Profile] Averaged reinitial duration: 413.2 ms
[info] LSM round 7
[info] [Profile] Evolving duration: 2062 ms
[info] [Profile] Averaged reinitial duration: 412.4 ms
[info] LSM round 8
[info] [Profile] Evolving duration: 2135 ms
[info] [Profile] Averaged reinitial duration: 427 ms
[info] LSM round 9
[info] [Profile] Evolving duration: 2494 ms
[info] [Profile] Averaged reinitial duration: 498.8 ms
[info] LSM round 10
[info] [Profile] Evolving duration: 2179 ms
[info] [Profile] Averaged reinitial duration: 435.8 ms
[info] Simulation complete.
```
The difference is plain to see!
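Of course, an ordinary program probably won't see a speedup this dramatic. This program gains so much for two reasons. First, `STL` operations make up a large share of the work, so memory is allocated very frequently. Second, it leans heavily on multithreading, and I'm told the default allocator behind STL takes a lock, so when two threads want memory at the same time one of them has to wait in line, whereas `TCMALLOC` (thread-caching malloc) puts real effort into the multithreaded case.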