Recently, while working on server memory optimization, I ran into a very puzzling problem. Our authentication server (AuthServer) talks to third-party channel SDKs, and because it uses curl in blocking mode it runs 128 threads. Oddly, the process occupied about 2.3 GB of virtual memory right after startup, then grew by 64 MB every time it handled messages, until the growth stopped at 4.4 GB. Since we pre-allocate our buffers, no large blocks are allocated inside the threads at all, so where was this memory coming from? I was completely baffled.
1. Exploration
My first step was to rule out a memory leak; leaking exactly 64 MB every single time would be too much of a coincidence. To prove the point, I started with valgrind:
valgrind --leak-check=full --track-fds=yes --log-file=./AuthServer.vlog ./AuthServer &
Then I started the test and let it run until the memory stopped growing. Sure enough, valgrind reported no leaks at all. I repeated the experiment many times, always with the same result.
After getting nowhere with valgrind, I began to suspect the program was using mmap or something similar internally, so I used strace to watch mmap, brk and related system calls:
strace -f -e "brk,mmap,munmap" -p $(pidof AuthServer)
The output looked like this:
[pid 19343] mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f53c8ca9000
[pid 19343] munmap(0x7f53c8ca9000, 53833728) = 0
[pid 19343] munmap(0x7f53d0000000, 13275136) = 0
[pid 19343] mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f53d04a8000
Process 19495 attached
Going through the trace file, I found no large volume of mmap activity, and the growth caused by brk was small as well. I was starting to feel lost. Next I wondered whether the file cache was eating the virtual memory, so I commented out all of the logging code, but the virtual memory kept increasing, which ruled that out too.
2. A Flash of Insight
Later I started reducing the number of threads for testing, and during those tests I stumbled on something very odd: if the process creates a thread and that thread allocates even a tiny 1 KB block, the process's virtual memory immediately grows by 64 MB; further allocations in that thread do not increase it any more. The test code is as follows:
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

using namespace std;

volatile bool start = false;

// Worker thread: every time the flag is set, allocate a small 1 KB block.
void* thread_run(void*)
{
    while (1)
    {
        if (start)
        {
            cout << "Thread malloc" << endl;
            char* buf = new char[1024];   // deliberately not freed; we only care about virtual memory
            (void)buf;
            start = false;
        }
        sleep(1);
    }
}

int main()
{
    pthread_t th;

    // First key press: observe the baseline; second key press: create the thread.
    getchar();
    getchar();
    pthread_create(&th, 0, thread_run, 0);

    // Each further key press triggers one 1 KB allocation in the thread.
    while (getchar() != EOF)
    {
        start = true;
    }

    return 0;
}
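To reproduce this, I build and run it with something like the command below (the source file name is my own choice; -pthread links in the pthread library) and watch the process in top:

g++ -pthread test.cpp -o main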
The run went as shown in the figure below. At startup the process occupies 14 MB of virtual memory. After I type 0 to create the child thread, it reaches 23 MB; the roughly 10 MB increase is the thread's stack (the thread stack size can be viewed and changed with ulimit -s). The first time I type 1, the program allocates 1 KB and the whole process gains 64 MB of virtual memory. Typing 2 and then 3 allocates another 1 KB each time, but the memory no longer changes.
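Instead of watching top, the same figure can be read from /proc/self/status. Below is a small sketch of my own (not part of the original test program) that prints the process's current VmSize; calling it before and after the first 1 KB allocation shows the 64 MB jump directly:

#include <fstream>
#include <iostream>
#include <string>

// Print the VmSize line from /proc/self/status, i.e. the process's
// current virtual memory size in kB (Linux-specific).
void print_vm_size()
{
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line))
    {
        if (line.compare(0, 7, "VmSize:") == 0)
        {
            std::cout << line << std::endl;
            break;
        }
    }
}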
This result thrilled me. I had studied Google's tcmalloc before, in which every thread has its own cache to avoid contention when multiple threads allocate memory, and I guessed that newer versions of glibc had picked up the same trick. So I ran pmap $(pidof main) to inspect the memory layout, as follows:
Note the line showing 65404. All the evidence suggests that this line, plus the one just above it (132 here), is the extra 64 MB: 65404 KB + 132 KB = 65536 KB = 64 MB. When I increased the number of threads, a matching number of new 65404 KB blocks appeared.
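Rather than eyeballing the pmap output, the mappings can also be scanned programmatically from /proc/self/maps. The sketch below is my own (the 60 MB cut-off is arbitrary) and simply prints every mapping of roughly arena size:

#include <cstdio>
#include <cstdlib>

// Walk /proc/self/maps and report mappings of roughly arena size
// (>= 60 MB); on 64-bit glibc each malloc arena reserves 64 MB of
// address space.
void dump_large_mappings()
{
    FILE* maps = fopen("/proc/self/maps", "r");
    if (!maps)
        return;

    char line[512];
    while (fgets(line, sizeof(line), maps))
    {
        unsigned long start = 0, end = 0;
        if (sscanf(line, "%lx-%lx", &start, &end) == 2)
        {
            unsigned long size_kb = (end - start) / 1024;
            if (size_kb >= 60 * 1024)
                printf("%lu kB  %s", size_kb, line);
        }
    }
    fclose(maps);
}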
3. Getting to the Bottom of It
After some searching and reading of the source, I finally learned that glibc's malloc was the culprit. glibc 2.11 and later behave this way. From the official Red Hat documentation:
Red Hat Enterprise Linux 6 features version 2.11 of glibc, providing many features and enhancements, including... An enhanced dynamic memory allocation (malloc) behaviour enabling higher scalability across many sockets and cores. This is achieved by assigning threads their own memory pools and by avoiding locking in some situations. The amount of additional memory used for the memory pools (if any) can be controlled using the environment variables MALLOC_ARENA_TEST and MALLOC_ARENA_MAX. MALLOC_ARENA_TEST specifies that a test for the number of cores is performed once the number of memory pools reaches this value. MALLOC_ARENA_MAX sets the maximum number of memory pools used, regardless of the number of cores.
The developer, Ulrich Drepper, has a much deeper explanation on his blog:
Before, malloc tried to emulate a per-core memory pool. Every time when contention for all existing memory pools was detected a new pool is created. Threads stay with the last used pool if possible... This never worked 100% because a thread can be descheduled while executing a malloc call. When some other thread tries to use the memory pool used in the call it would detect contention. A second problem is that if multiple threads on multiple core/sockets happily use malloc without contention memory from the same pool is used by different cores/on different sockets. This can lead to false sharing and definitely additional cross traffic because of the meta information updates. There are more potential problems not worth going into here in detail.
The changes which are in glibc now create per-thread memory pools. This can eliminate false sharing in most cases. The meta data is usually accessed only in one thread (which hopefully doesn’t get migrated off its assigned core). To prevent the memory handling from blowing up the address space use too much the number of memory pools is capped. By default we create up to two memory pools per core on 32-bit machines and up to eight memory pools per core on 64-bit machines. The code delays testing for the number of cores (which is not cheap, we have to read /proc/stat) until there are already two or eight memory pools allocated, respectively.
While these changes might increase the number of memory pools which are created (and thus increase the address space they use) the number can be controlled. Because using the old mechanism there could be a new pool being created whenever there are collisions the total number could in theory be higher. Unlikely but true, so the new mechanism is more predictable.
... Memory use is not that much of a premium anymore and most of the memory pool doesn’t actually require memory until it is used, only address space... We have done internally some measurements of the effects of the new implementation and they can be quite dramatic.
The Hadoop developers ran into the same behaviour: New versions of glibc present in RHEL6 include a new arena allocator design. In several clusters we've seen this new allocator cause huge amounts of virtual memory to be used, since when multiple threads perform allocations, they each get their own memory arena. On a 64-bit system, these arenas are 64M mappings, and the maximum number of arenas is 8 times the number of cores. We've observed a DN process using 14GB of vmem for only 300M of resident set. This causes all kinds of nasty issues for obvious reasons.
Setting MALLOC_ARENA_MAX to a low number will restrict the number of memory arenas and bound the virtual memory, with no noticeable downside in performance - we've been recommending MALLOC_ARENA_MAX=4. We should set this in hadoop-env.sh to avoid this issue as RHEL6 becomes more and more common.
To sum up: for allocation performance, glibc uses a number of memory pools called arenas. By default, on 64-bit systems each arena is 64 MB and a process can have at most cores * 8 arenas. If your machine has 4 cores, that is up to 4 * 8 = 32 arenas, or 32 * 64 MB = 2048 MB of virtual memory. You can also change the number of arenas through an environment variable, for example export MALLOC_ARENA_MAX=1.
Hadoop recommends setting this value to 4. Of course, since arenas were introduced precisely to reduce contention when multiple threads allocate memory on a multi-core machine, setting it to the number of CPU cores is probably also a reasonable choice. Whichever value you pick, it is worth load-testing your program afterwards to see whether changing the number of arenas affects its performance.
If you want to set this in code, you can call mallopt(M_ARENA_MAX, xxx). Since our AuthServer uses pre-allocation and does not allocate memory inside the worker threads, it does not need this optimization, so at initialization we simply switch it off with mallopt(M_ARENA_MAX, 1). (Setting the value to 0 means the system picks the number of arenas automatically based on the CPUs.)
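For completeness, here is a minimal sketch of setting the arena limit programmatically. M_ARENA_MAX is a glibc-specific mallopt parameter, and the call should happen early, before the worker threads start allocating; the sysconf alternative is my own suggestion, not something AuthServer does:

#include <malloc.h>    // glibc-specific: mallopt, M_ARENA_MAX
#include <unistd.h>    // sysconf

int main()
{
    // Limit malloc to a single arena, as AuthServer does at initialization...
    mallopt(M_ARENA_MAX, 1);

    // ...or, alternatively, cap the arena count at the number of online cores:
    // mallopt(M_ARENA_MAX, (int)sysconf(_SC_NPROCESSORS_ONLN));

    // ... create worker threads and run the server here ...
    return 0;
}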
4. An Unexpected Discovery
I remembered that in tcmalloc only small objects are served from a thread's own cache, while large blocks still come from the central allocator, and I wondered how glibc was designed here. So I changed the per-allocation size in the program above from 1 KB to 1 MB. Sure enough, after the initial 64 MB appeared, the virtual memory kept growing by 1 MB on every allocation. It seems the new glibc has indeed borrowed tcmalloc's ideas.
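For reference, the only change is the thread body, roughly as in the sketch below (my own variant of the earlier test program). The 1 MB-per-allocation growth is also consistent with glibc's mmap threshold, which by default serves requests larger than 128 KB from their own mmap'd region:

#include <iostream>
#include <unistd.h>

using namespace std;

extern volatile bool start;   // the same flag as in the earlier test program

// Variant of thread_run: allocate 1 MB per trigger instead of 1 KB.
// Each allocation now adds roughly 1 MB of new mappings on top of the
// initial 64 MB arena.
void* thread_run_large(void*)
{
    while (1)
    {
        if (start)
        {
            cout << "Thread malloc 1MB" << endl;
            char* buf = new char[1024 * 1024];
            (void)buf;        // deliberately leaked for the experiment
            start = false;
        }
        sleep(1);
    }
}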
After several days, the problem was finally solved and my mood lifted considerably. Today's issue reminded me that a server programmer who does not understand the compiler and the operating system kernel is simply not qualified; I need to study these areas harder.