[gperftools] Not one line changed, one extra library linked, and the program runs 40% faster

It sounds like fantasy, but it really did happen this afternoon.

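I am working on an extremely performance-sensitive physics simulation program that uses OpenVolumeMesh to handle volumetric meshes. OpenVolumeMesh is very pleasant to use, but it leans heavily on STL containers (mainly std::vector). Not only is its data stored in STL containers internally, its iteration and traversal helpers also tend to simply return a vector. As you can imagine, memory gets allocated very frequently at run time. On top of that I use Intel TBB for multithreaded computation, so with 8 threads running, each one dynamically allocating swarm after swarm of small vectors, the picture is not a pretty one.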
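The allocation pattern described above can be pictured with a small sketch. It uses std::thread rather than Intel TBB to stay self-contained, and neighborsOf is a hypothetical stand-in for a mesh query that returns a std::vector by value; it is not OpenVolumeMesh's actual API.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a mesh query that returns its result by value.
// Each call heap-allocates one small vector.
std::vector<int> neighborsOf(int cell) {
    return {cell, cell + 1, cell + 2, cell + 3};
}

// One worker: asks for many small vectors and discards each one right away.
std::size_t visitCells(int first, int last) {
    std::size_t touched = 0;
    for (int c = first; c < last; ++c) {
        std::vector<int> n = neighborsOf(c);  // one allocation and one free per iteration
        touched += n.size();
    }
    return touched;
}

// Runs the same loop on several threads, so the allocator is hit concurrently.
std::size_t visitInParallel(int threads, int cellsPerThread) {
    std::vector<std::size_t> partial(threads, 0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&partial, t, cellsPerThread] {
            partial[t] = visitCells(t * cellsPerThread, (t + 1) * cellsPerThread);
        });
    }
    for (auto& th : pool) th.join();
    std::size_t total = 0;
    for (std::size_t p : partial) total += p;
    return total;
}
```

The useful work per iteration is tiny, so in a loop like this the cost of the allocator can easily dominate, which is the situation the rest of this post is about.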

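I had long heard that TCMALLOC copes very well with this kind of fragmented allocation pattern, and it is rumored to speed up STL operations, or similarly allocation-heavy code, by some impressive factor. Today I finally had time to try it, and the result was surprisingly good. TCMALLOC is part of the gperftools suite released by a certain company that "does not exist". The official source repository is the URL above (even though it looks like a knockoff).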

Installation

I have been on a Mac lately, so installation is trivial: brew install gperftools and that is it.

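Installing on Linux is odd. The tutorials I found all say you need to install something called libunwind because of some glibc bug. However, those tutorials all date from around 2013. Newer tutorials are rare; this one is relatively recent and does not mention libunwind. On current Manjaro, at least, installing it directly with pacman just works, and I have not run into any problems so far.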

On Windows, the source repository contains a README_windows.txt.

CMake

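Upstream does not ship any CMake support. According to the documentation, most use cases need no source changes at all, only linking the appropriate library, so writing the CMake by hand is easy too. Here is a find module that looks quite good to me: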

FindGperftools.cmake
# Tries to find Gperftools.
#
# Usage of this module as follows:
#
#     find_package(Gperftools)
#
# Variables used by this module, they can change the default behaviour and need
# to be set before calling find_package:
#
#  Gperftools_ROOT_DIR  Set this variable to the root installation of
#                       Gperftools if the module has problems finding
#                       the proper installation path.
#
# Variables defined by this module:
#
#  GPERFTOOLS_FOUND              System has Gperftools libs/headers
#  GPERFTOOLS_LIBRARIES          The Gperftools libraries (tcmalloc & profiler)
#  GPERFTOOLS_INCLUDE_DIR        The location of Gperftools headers
 
find_library(GPERFTOOLS_TCMALLOC
  NAMES tcmalloc
  HINTS ${Gperftools_ROOT_DIR}/lib)
 
find_library(GPERFTOOLS_PROFILER
  NAMES profiler
  HINTS ${Gperftools_ROOT_DIR}/lib)
 
find_library(GPERFTOOLS_TCMALLOC_AND_PROFILER
  NAMES tcmalloc_and_profiler
  HINTS ${Gperftools_ROOT_DIR}/lib)
 
find_path(GPERFTOOLS_INCLUDE_DIR
  NAMES gperftools/heap-profiler.h
  HINTS ${Gperftools_ROOT_DIR}/include)
 
set(GPERFTOOLS_LIBRARIES ${GPERFTOOLS_TCMALLOC_AND_PROFILER})
 
include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(
  Gperftools
  DEFAULT_MSG
  GPERFTOOLS_LIBRARIES
  GPERFTOOLS_INCLUDE_DIR)
 
mark_as_advanced(
  Gperftools_ROOT_DIR
  GPERFTOOLS_TCMALLOC
  GPERFTOOLS_PROFILER
  GPERFTOOLS_TCMALLOC_AND_PROFILER
  GPERFTOOLS_LIBRARIES
  GPERFTOOLS_INCLUDE_DIR)

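Put this file in a CustomCMake folder in the project tree, make sure to call set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} ${CMAKE_CURRENT_SOURCE_DIR}/CustomCMake) so that the folder is on the CMake module path, and then add this to the project's CMakeLists.txt: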

if (Simulator_TcMalloc OR Simulator_Profile)
	find_package(Gperftools REQUIRED)
	if(GPERFTOOLS_FOUND)
		include_directories(${GPERFTOOLS_INCLUDE_DIR})
	else()
		message(FATAL_ERROR "Gperftools not found. Set Gperftools_ROOT_DIR to correct path.")
	endif()
endif()

That is already a complete setup, and it is only 8 lines. Because the project uses two modules, the profiler and tcmalloc, I also added two options:

option(Simulator_TcMalloc "Use google tcmalloc instead of default malloc." ON)
option(Simulator_Profile "Use google CPU profiler to record performance details." OFF)
 
# ... whatever else needs doing
 
if(Simulator_TcMalloc AND Simulator_Profile)
	target_link_libraries(srmSimulator ${GPERFTOOLS_TCMALLOC_AND_PROFILER})
elseif(Simulator_TcMalloc AND (NOT Simulator_Profile))
	target_link_libraries(srmSimulator ${GPERFTOOLS_TCMALLOC})
elseif((NOT Simulator_TcMalloc) AND Simulator_Profile)
	target_link_libraries(srmSimulator ${GPERFTOOLS_PROFILER})
endif()

Then clean the project and rebuild, and that is it.

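I originally wanted to measure with the pprof tool that ships with gperftools, but for some reason the blasted thing's output was missing most function names. I do not know whether that is a compatibility problem or because I built a release binary (nm can see the function names, in any case), and I cannot rule out my heavy use of lambda expressions either. The debug build, thanks to the STL, takes half an hour per run, and I did not want to spend the time.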

So I had to do it by hand. To keep things tidy, the Simulator_Profile option in CMake is tied to the timing code, so that release builds do no profiling:

CMake side:

if (Simulator_Profile)
	add_definitions(-DSIMULATOR_PROFILE)
endif()

C++ side:

#ifdef SIMULATOR_PROFILE
#include <chrono>
#endif
 
// inside some function
 
#ifdef SIMULATOR_PROFILE
		auto evolveStart = std::chrono::steady_clock::now();
#endif
		doOneThing();  // placeholder: the evolve step goes here
#ifdef SIMULATOR_PROFILE
		auto evolveEnd = std::chrono::steady_clock::now();
		logger->info("[Profile] Evolving duration: {0} ms",
			std::chrono::duration_cast<std::chrono::milliseconds>(evolveEnd - evolveStart).count());
#endif
 
#ifdef SIMULATOR_PROFILE
		auto reinitialStart = std::chrono::steady_clock::now();
#endif
		for (int r = 0; r < levelSetReinitialPasses; ++r) {
			doAnotherThing();  // placeholder: one reinitialization pass
		}
#ifdef SIMULATOR_PROFILE
		auto reinitialEnd = std::chrono::steady_clock::now();
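		// Time the reinitialization loop itself. (The logs below appear to have been
		// recorded with evolveEnd - evolveStart here: each "Averaged reinitial" value
		// is exactly the evolving duration of that round divided by 5.)
		double ms = std::chrono::duration_cast<std::chrono::milliseconds>(reinitialEnd - reinitialStart).count();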
		logger->info("[Profile] Averaged reinitial duration: {0} ms", ms / levelSetReinitialPasses);
#endif
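The paired start/end blocks above are easy to mix up when copied around, so one possible tidy-up is a small RAII timer. This is only a sketch: ScopedTimer and its callback signature are hypothetical names introduced here, not something the original program contains.

```cpp
#include <chrono>
#include <functional>
#include <string>
#include <utility>

// Hypothetical helper: measures its own lifetime and hands the elapsed
// milliseconds to a callback when it goes out of scope.
class ScopedTimer {
public:
    using Report = std::function<void(const std::string&, double)>;

    ScopedTimer(std::string label, Report report)
        : label_(std::move(label)),
          report_(std::move(report)),
          start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        const auto end = std::chrono::steady_clock::now();
        const std::chrono::duration<double, std::milli> elapsed = end - start_;
        if (report_) report_(label_, elapsed.count());  // skip if no callback was given
    }

    // Copying would report the same interval twice, so it is disabled.
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    std::string label_;
    Report report_;
    std::chrono::steady_clock::time_point start_;
};
```

In the simulator the callback would forward to logger->info, and the timer declaration would still sit inside #ifdef SIMULATOR_PROFILE so that release builds carry no timing code. Each timer owns its own start point, which makes it hard to subtract the wrong pair of timestamps.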

With that done, run it and take a look. With the Simulator_TcMalloc option turned off:

[info] LSM round 1
[info] [Profile] Evolving duration: 2993 ms
[info] [Profile] Averaged reinitial duration: 598.6 ms
[info] LSM round 2
[info] [Profile] Evolving duration: 3523 ms
[info] [Profile] Averaged reinitial duration: 704.6 ms
[info] LSM round 3
[info] [Profile] Evolving duration: 3613 ms
[info] [Profile] Averaged reinitial duration: 722.6 ms
[info] LSM round 4
[info] [Profile] Evolving duration: 3504 ms
[info] [Profile] Averaged reinitial duration: 700.8 ms
[info] LSM round 5
[info] [Profile] Evolving duration: 3520 ms
[info] [Profile] Averaged reinitial duration: 704 ms
[info] LSM round 6
[info] [Profile] Evolving duration: 3621 ms
[info] [Profile] Averaged reinitial duration: 724.2 ms
[info] LSM round 7
[info] [Profile] Evolving duration: 3527 ms
[info] [Profile] Averaged reinitial duration: 705.4 ms
[info] LSM round 8
[info] [Profile] Evolving duration: 3563 ms
[info] [Profile] Averaged reinitial duration: 712.6 ms
[info] LSM round 9
[info] [Profile] Evolving duration: 3532 ms
[info] [Profile] Averaged reinitial duration: 706.4 ms
[info] LSM round 10
[info] [Profile] Evolving duration: 3522 ms
[info] [Profile] Averaged reinitial duration: 704.4 ms
[info] Simulation complete.

With the Simulator_TcMalloc option turned on:

[info] LSM round 1
[info] [Profile] Evolving duration: 1901 ms
[info] [Profile] Averaged reinitial duration: 380.2 ms
[info] LSM round 2
[info] [Profile] Evolving duration: 1904 ms
[info] [Profile] Averaged reinitial duration: 380.8 ms
[info] LSM round 3
[info] [Profile] Evolving duration: 2136 ms
[info] [Profile] Averaged reinitial duration: 427.2 ms
[info] LSM round 4
[info] [Profile] Evolving duration: 2225 ms
[info] [Profile] Averaged reinitial duration: 445 ms
[info] LSM round 5
[info] [Profile] Evolving duration: 2267 ms
[info] [Profile] Averaged reinitial duration: 453.4 ms
[info] LSM round 6
[info] [Profile] Evolving duration: 2066 ms
[info] [Profile] Averaged reinitial duration: 413.2 ms
[info] LSM round 7
[info] [Profile] Evolving duration: 2062 ms
[info] [Profile] Averaged reinitial duration: 412.4 ms
[info] LSM round 8
[info] [Profile] Evolving duration: 2135 ms
[info] [Profile] Averaged reinitial duration: 427 ms
[info] LSM round 9
[info] [Profile] Evolving duration: 2494 ms
[info] [Profile] Averaged reinitial duration: 498.8 ms
[info] LSM round 10
[info] [Profile] Evolving duration: 2179 ms
[info] [Profile] Averaged reinitial duration: 435.8 ms
[info] Simulation complete.

The difference is obvious at a glance!
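To put a number on the gap, the "Evolving duration" values from the two logs can be averaged and compared. The helper names below (meanMs, timeSaved) are made up for this sketch; the data is copied from the logs.

```cpp
#include <numeric>
#include <vector>

// Arithmetic mean of a list of durations in milliseconds (assumes a non-empty list).
double meanMs(const std::vector<double>& samples) {
    return std::accumulate(samples.begin(), samples.end(), 0.0) / samples.size();
}

// Fraction of run time saved when going from `before` to `after`.
double timeSaved(double before, double after) {
    return 1.0 - after / before;
}

// "Evolving duration" per round with the default malloc (Simulator_TcMalloc off).
const std::vector<double> kDefaultMalloc = {2993, 3523, 3613, 3504, 3520,
                                            3621, 3527, 3563, 3532, 3522};

// "Evolving duration" per round with tcmalloc (Simulator_TcMalloc on).
const std::vector<double> kTcMalloc = {1901, 1904, 2136, 2225, 2267,
                                       2066, 2062, 2135, 2494, 2179};
```

The means come to 3491.8 ms and 2136.9 ms, so timeSaved(meanMs(kDefaultMalloc), meanMs(kTcMalloc)) is about 0.39, in line with the roughly 40% in the title. Expressed as a ratio, the tcmalloc build is about 1.63 times as fast on this step.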

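Of course, an ordinary program probably will not see such a dramatic speedup. This program gains so much for two reasons. First, STL operations make up a large share of the work, so memory is allocated extremely often. Second, it relies heavily on multithreading, and the default allocator behind the STL reportedly takes a lock, so when two threads both want memory they have to queue up, whereas TCMALLOC went out of its way to handle the multithreaded case.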
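The idea behind the multithreading advantage can be shown with a toy model: each thread keeps a private cache of blocks and only visits the shared, locked pool in batches. This is an illustration of the thread-caching principle only, not how tcmalloc is actually implemented; SharedPool, ThreadCache and the constants are all hypothetical.

```cpp
#include <cstddef>
#include <mutex>
#include <new>
#include <vector>

constexpr std::size_t kBlockSize = 64;  // toy: every block has the same size
constexpr std::size_t kBatch = 32;      // blocks moved per trip to the shared pool

// Shared pool: every access takes the lock, like a simple global allocator.
class SharedPool {
public:
    ~SharedPool() { for (void* p : free_) ::operator delete(p); }

    // Appends `n` blocks to `out`, creating new ones if the pool runs dry.
    void take(std::vector<void*>& out, std::size_t n) {
        std::lock_guard<std::mutex> lock(mutex_);
        ++trips_;
        for (std::size_t i = 0; i < n; ++i) {
            if (free_.empty()) {
                out.push_back(::operator new(kBlockSize));
            } else {
                out.push_back(free_.back());
                free_.pop_back();
            }
        }
    }

    // Moves up to `n` blocks from the back of `in` into the pool.
    void give(std::vector<void*>& in, std::size_t n) {
        std::lock_guard<std::mutex> lock(mutex_);
        ++trips_;
        for (std::size_t i = 0; i < n && !in.empty(); ++i) {
            free_.push_back(in.back());
            in.pop_back();
        }
    }

    // Number of times the lock was taken to move blocks.
    std::size_t lockedTrips() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return trips_;
    }

private:
    mutable std::mutex mutex_;
    std::vector<void*> free_;
    std::size_t trips_ = 0;
};

// Per-thread cache: the common path touches only thread-private data.
// The pool must outlive every cache that refers to it.
class ThreadCache {
public:
    explicit ThreadCache(SharedPool& pool) : pool_(pool) {}
    ~ThreadCache() { pool_.give(local_, local_.size()); }

    void* allocate() {
        if (local_.empty()) pool_.take(local_, kBatch);  // slow path: one lock per batch
        void* p = local_.back();                         // fast path: no lock
        local_.pop_back();
        return p;
    }

    void deallocate(void* p) {
        local_.push_back(p);
        if (local_.size() > 2 * kBatch) pool_.give(local_, kBatch);  // do not hoard
    }

private:
    SharedPool& pool_;
    std::vector<void*> local_;
};
```

With one ThreadCache per thread, a loop that allocates and frees a block a thousand times takes the shared lock once to fill the cache and once more when the cache is destroyed, instead of two thousand times. A real allocator also has to deal with many size classes and with blocks freed on a different thread than the one that allocated them, which this sketch ignores.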

  • Last modified: 2019/05/13 14:05