Because many CVM instances date back to the early days of the cloud service, a large stock of machines still run the 3.10 kernel as their operating system kernel. The 3.10 kernel has many known issues, and one of the most frequent offenders is trouble triggered by memory reclaim when memory runs short. Memory reclaim and OOM handling have long been criticized areas of Linux; the kernel keeps optimizing these paths, so in theory a newer kernel should always do better than an older one. This article builds test cases to compare how the 3.10 and 5.4 kernels behave under memory pressure, and shows that 5.4 performs better in low-memory scenarios.
【One-sentence conclusion】
When memory runs short, 5.4 spends less time in direct reclaim than 3.10, so it enters the OOM logic sooner and frees memory by killing processes. 3.10, in contrast, can hang outright, or reclaim so slowly that memory is freed late. 5.4 therefore behaves far better than 3.10 under memory pressure.
【Test data】
Two IT5 CVMs (64 cores, 256 GB RAM, 100M network) were used: one running an internal 3.10 kernel, the other TencentOS 5.4.119-19-0009, with identical memory-related kernel parameters.
The test program spawns one process per second; each process allocates 10 GB of memory and then spins in a while(1) loop without ever freeing it. Once less than 1 GB of memory is left, the program allocates 10 MB per second and keeps allocating until the system OOM-kills it.
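A minimal sketch of such a workload is shown below. It is illustrative only: the actual test tool may differ, and the 1 GB / 10 MB thresholds are simply taken from the description above.

/* hog.c - illustrative memory-pressure generator, not the exact test tool */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define GB (1024UL * 1024 * 1024)
#define MB (1024UL * 1024)

/* Read MemFree (in kB) from /proc/meminfo. */
static unsigned long mem_free_kb(void)
{
    char line[256];
    unsigned long kb = 0;
    FILE *f = fopen("/proc/meminfo", "r");

    while (f && fgets(line, sizeof(line), f))
        if (sscanf(line, "MemFree: %lu kB", &kb) == 1)
            break;
    if (f)
        fclose(f);
    return kb;
}

/* Allocate 'bytes' and touch every byte so the pages are really resident. */
static void eat(size_t bytes)
{
    char *p = malloc(bytes);

    if (p)
        memset(p, 1, bytes);
}

int main(void)
{
    /* Phase 1: spawn one child per second; each child pins 10 GB
     * and then spins in while(1) without ever freeing it. */
    while (mem_free_kb() > GB / 1024) {
        if (fork() == 0) {
            eat(10 * GB);
            while (1)
                ;       /* hold the memory forever */
        }
        sleep(1);
    }

    /* Phase 2: less than 1 GB left -> allocate 10 MB per second
     * until the OOM killer terminates us. */
    while (1) {
        eat(10 * MB);
        sleep(1);
    }
}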
With only the test program running, the time to enter the OOM logic was:
3.10: 30-31 s
5.4: 29-30 s
With heavy network traffic (receive only), the time to enter the OOM logic was:
3.10: 37-38 s
5.4: 27-28 s
With heavy network traffic (receive and send), the time to enter the OOM logic was:
3.10: 35 s; in addition, one out of five test runs hung and never entered OOM
5.4: 30-31 s
See the "trace data analysis" section for details.
In the full analysis below, the vmcore analysis is fairly long-winded; readers who are not interested in it can jump straight to the data in subsections 2-4 of "trace data analysis" and to the code comparison in "3.10 vs 5.4 memory reclaim path code comparison".
The analysis starts from a hang case reported by a business team.
This case did not yield the full logic of the problem; we only examined the various "traces" left in the machine's memory at the time of the hang and drew some inferences from them.
To begin, the serial console log at the time of the hang showed the machine stuck in a memory allocation on the network stack.
For 3.10 machines hung in memory reclaim, the usual approach is to trigger a crash via the sysrq magic key (sysrq-c) to generate a dump file for further analysis.
With the dump file in hand, load it into crash for analysis.
First, the overall memory state: memory is exhausted and the page cache is low:
crash> kmem -i
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM  65821943     251.1 GB         ----
         FREE    135924       531 MB    0% of TOTAL MEM
         USED  65686019     250.6 GB   99% of TOTAL MEM
       SHARED     10711      41.8 MB    0% of TOTAL MEM
      BUFFERS       831       3.2 MB    0% of TOTAL MEM
       CACHED      5255      20.5 MB    0% of TOTAL MEM
         SLAB     89428     349.3 MB    0% of TOTAL MEM

   TOTAL HUGE         0            0         ----
    HUGE FREE         0            0    0% of TOTAL HUGE

   TOTAL SWAP         0            0         ----
    SWAP USED         0            0    0% of TOTAL SWAP
    SWAP FREE         0            0    0% of TOTAL SWAP

 COMMIT LIMIT  32910971     125.5 GB         ----
    COMMITTED  71816972       274 GB  218% of TOTAL LIMIT
crash> p min_free_kbytes
min_free_kbytes = $1 = 64900
Next, the hung (uninterruptible, UN state) processes:
crash> ps -l | grep UN | tail
[2856682881102] [UN] PID: 9876 TASK: ffff883f39544d70 CPU: 47 COMMAND: "sap1007"
[2856682877615] [UN] PID: 8579 TASK: ffff883f5d31b3a0 CPU: 27 COMMAND: "m3dbnode"
[2856682861078] [UN] PID: 1320 TASK: ffff881f8ea10000 CPU: 59 COMMAND: "nscd"
[2856682846246] [UN] PID: 8657 TASK: ffff881f816499d0 CPU: 61 COMMAND: "m3dbnode"
[2856682836259] [UN] PID: 8804 TASK: ffff883f5effcd70 CPU: 34 COMMAND: "m3dbnode"
[2856682831462] [UN] PID: 8945 TASK: ffff883f445633a0 CPU: 37 COMMAND: "m3dbnode"
[2856682820876] [UN] PID: 1859 TASK: ffff881f816899d0 CPU: 56 COMMAND: "crond"
[2856682815523] [UN] PID: 3913 TASK: ffff881f817b8000 CPU: 49 COMMAND: "agentWorker"
[2856682800170] [UN] PID: 1844 TASK: ffff881f8c1d0000 CPU: 32 COMMAND: "in:imjournal"
[2753630505936] [UN] PID: 266 TASK: ffff883f62614d70 CPU: 51 COMMAND: "kworker/51:0"
The kworker that has been hung the longest is waiting for kthreadd to create a kernel thread:
crash> bt ffff883f62614d70
PID: 266 TASK: ffff883f62614d70 CPU: 51 COMMAND: "kworker/51:0"
#0 [ffff883f62623bb0] __schedule at ffffffff819fcb9e
#1 [ffff883f62623c18] schedule at ffffffff819fd0e9
#2 [ffff883f62623c28] schedule_timeout at ffffffff819fb629
#3 [ffff883f62623cb8] wait_for_completion at ffffffff819fd597
#4 [ffff883f62623d18] kthread_create_on_node at ffffffff810734d3
#5 [ffff883f62623dd0] create_worker at ffffffff8106c824
#6 [ffff883f62623e20] manage_workers at ffffffff8106ca8d
#7 [ffff883f62623e68] worker_thread at ffffffff8106cfa3
#8 [ffff883f62623ec8] kthread at ffffffff8107366f
#9 [ffff883f62623f50] ret_from_fork at ffffffff81ad7d48
crash> task -R sched_info ffff883f62614d70
PID: 266 TASK: ffff883f62614d70 CPU: 51 COMMAND: "kworker/51:0"
sched_info = {
pcount = 24,
run_delay = 15283,
last_arrival = 2753630505936,
last_queued = 0
},
crash> log | grep sysrq
[ 2856.304036] IP: [<ffffffff813c5196>] sysrq_handle_crash+0x16/0x20
[ 2856.304036] RIP: 0010:[<ffffffff813c5196>] [<ffffffff813c5196>] sysrq_handle_crash+0x16/0x20
[ 2856.304036] [<ffffffff813c5952>] __handle_sysrq+0xa2/0x170
[ 2856.304036] [<ffffffff813c5da5>] sysrq_filter+0x335/0x420
[ 2856.304036] RIP [<ffffffff813c5196>] sysrq_handle_crash+0x16/0x20
crash> pd 2856-2753
$2 = 103
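So at the time of the sysrq-triggered crash (~2856 s) this kworker had been off the CPU for roughly 103 seconds (last_arrival ≈ 2753 s). The wait itself follows from how kthread_create_on_node() works on 3.10: the caller queues a request and blocks on a completion that only kthreadd can signal, so once kthreadd stalls, every thread-creation request hangs behind it. Simplified from the 3.10 source (argument handling and error paths omitted):

struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
                                           void *data, int node,
                                           const char namefmt[], ...)
{
    struct kthread_create_info create;

    create.threadfn = threadfn;
    create.data = data;
    create.node = node;
    init_completion(&create.done);

    /* Queue the request for kthreadd to pick up ... */
    spin_lock(&kthread_create_lock);
    list_add_tail(&create.list, &kthread_create_list);
    spin_unlock(&kthread_create_lock);

    wake_up_process(kthreadd_task);
    /* ... and block until kthreadd has actually forked the new thread.
     * This is the wait_for_completion frame in the kworker's stack. */
    wait_for_completion(&create.done);

    return create.result;
}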
kthreadd itself is reclaiming memory and spinning on a spinlock:
crash> bt 2
PID: 2 TASK: ffff881f931599d0 CPU: 24 COMMAND: "kthreadd"
#0 [ffff881fbfd05e70] crash_nmi_callback at ffffffff810294d2
#1 [ffff881fbfd05e80] nmi_handle at ffffffff81a00869
#2 [ffff881fbfd05ec8] do_nmi at ffffffff81a00980
#3 [ffff881fbfd05ef0] end_repeat_nmi at ffffffff819ffe91
[exception RIP: _raw_spin_lock+26]
RIP: ffffffff819ff3ca RSP: ffff881f93187a20 RFLAGS: 00000246
RAX: 0000000000000001 RBX: ffff883f62168000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff881f93187bb0 RDI: ffffffff823c4120
RBP: ffff881f93187a20 R8: 0000000000000000 R9: 0000000000000040
R10: 0000000000000000 R11: 0000000000000000 R12: ffff883f62168000
R13: 0000000000000000 R14: ffff881f93187bb0 R15: ffff883f621683b0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
--- <NMI exception stack> ---
#4 [ffff881f93187a20] _raw_spin_lock at ffffffff819ff3ca
#5 [ffff881f93187a28] put_super at ffffffff8118e9e9
#6 [ffff881f93187a40] prune_super at ffffffff8118fbae
#7 [ffff881f93187a78] shrink_slab at ffffffff81134de1
#8 [ffff881f93187b18] do_try_to_free_pages at ffffffff811381dd
#9 [ffff881f93187ba0] try_to_free_pages at ffffffff81138405
#10 [ffff881f93187c48] __alloc_pages_nodemask at ffffffff8112d654
#11 [ffff881f93187d80] copy_process at ffffffff81049204
#12 [ffff881f93187e08] do_fork at ffffffff8104a7f0
#13 [ffff881f93187e78] kernel_thread at ffffffff8104ab56
#14 [ffff881f93187e88] kthreadd at ffffffff81074125
#15 [ffff881f93187f50] ret_from_fork at ffffffff81ad7d48
The spinlock here is the global lock sb_lock (how the lock argument was recovered from the registers is omitted).
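For reference, in 3.10's fs/super.c this is a single global spinlock shared by all superblocks, and put_super() takes it unconditionally (simplified from the 3.10 source):

/* fs/super.c (3.10, simplified): one global lock for all superblocks */
static DEFINE_SPINLOCK(sb_lock);

static void put_super(struct super_block *sb)
{
    spin_lock(&sb_lock);    /* the _raw_spin_lock kthreadd is spinning on */
    __put_super(sb);
    spin_unlock(&sb_lock);
}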
From the stack above, combined with the code, the memory allocation path is:
__alloc_pages_nodemask -> __alloc_pages_slowpath -> __alloc_pages_direct_reclaim
Only after direct reclaim returns does the allocation get a chance to enter the OOM logic.
kthreadd, however, is stuck inside shrink_slab in the direct reclaim stage (besides dropping page cache, the other main action of direct reclaim is reclaiming slab, and the overall memory state above shows there is little cache left to drop).
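That ordering can be seen in the 3.10 __alloc_pages_slowpath(): the OOM killer is only considered after direct reclaim has returned without making progress, so an allocation that blocks inside direct reclaim never reaches the OOM logic. A heavily simplified sketch of that control flow (arguments trimmed):

/* mm/page_alloc.c (3.10), heavily simplified control flow */
static struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, ...)
{
    unsigned long did_some_progress = 0;
    struct page *page;

    /* ... wake kswapd, retry with lower watermarks, try compaction ... */

    /* Direct reclaim: synchronously shrink page cache and slab.
     * If this blocks (e.g. inside shrink_slab), we never get past it. */
    page = __alloc_pages_direct_reclaim(gfp_mask, order, /* ... */
                                        &did_some_progress);
    if (page)
        return page;

    /* Only when direct reclaim made no progress may the OOM killer run. */
    if (!did_some_progress &&
        (gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY))
        page = __alloc_pages_may_oom(gfp_mask, order /* ... */);

    /* ... otherwise loop back and retry reclaim ... */
    return page;
}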
Check the disassembly to confirm the call chain:
crash> dis -r ffffffff8112d654 | tail
0xffffffff8112d63f <__alloc_pages_nodemask+1711>: mov %r14,%rcx
0xffffffff8112d642 <__alloc_pages_nodemask+1714>: mov %r13d,%edx
0xffffffff8112d645 <__alloc_pages_nodemask+1717>: mov %r12,%rdi
0xffffffff8112d648 <__alloc_pages_nodemask+1720>: mov %rax,0x828(%r15)
0xffffffff8112d64f <__alloc_pages_nodemask+1727>: callq 0xffffffff81138300 <try_to_free_pages>
0xffffffff8112d654 <__alloc_pages_nodemask+1732>: andl $0xfffff7ff,0x14(%r15)
Confirm that the call came in from direct reclaim:
crash> dis -r ffffffff811381dd | tail
0xffffffff811381c5 <do_try_to_free_pages+1045>: mov -0x48(%rbp),%rax
0xffffffff811381c9 <do_try_to_free_pages+1049>: test %rax,%rax
0xffffffff811381cc <do_try_to_free_pages+1052>: jne 0xffffffff81138178 <do_try_to_free_pages+968>
0xffffffff811381ce <do_try_to_free_pages+1054>: mov (%r15),%rsi
0xffffffff811381d1 <do_try_to_free_pages+1057>: mov -0x70(%rbp),%rdi
0xffffffff811381d5 <do_try_to_free_pages+1061>: mov %r13,%rdx
0xffffffff811381d8 <do_try_to_free_pages+1064>: callq 0xffffffff81134d30 <shrink_slab>
0xffffffff811381dd <do_try_to_free_pages+1069>: mov -0x68(%rbp),%rdx
crash> dis -r ffffffff81134de1 | tail
0xffffffff81134dd1 <shrink_slab+161>: mov %r13,%rsi
0xffffffff81134dd4 <shrink_slab+164>: mov %r15,%rdi
0xffffffff81134dd7 <shrink_slab+167>: test %rbx,%rbx
0xffffffff81134dda <shrink_slab+170>: cmove %rax,%rbx
0xffffffff81134dde <shrink_slab+174>: callq *(%r15)
0xffffffff81134de1 <shrink_slab+177>: movslq %eax,%r14
Inside shrink_slab, the kernel walks the list of registered shrinkers and calls them one by one to reclaim slab objects.
Most registered shrinkers belong to filesystems and reclaim that filesystem's metadata (the same thing echo 3 > /proc/sys/vm/drop_caches does).
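On 3.10 this loop looks roughly as follows; for a filesystem the registered ->shrink callback is prune_super(), and both grab_super_passive() and the put_super() it eventually calls take the global sb_lock shown earlier, so every task doing direct reclaim ends up serializing on that one lock (simplified from the 3.10 source, bookkeeping omitted):

/* mm/vmscan.c (3.10, simplified): walk every registered shrinker */
unsigned long shrink_slab(struct shrink_control *shrink,
                          unsigned long nr_pages_scanned,
                          unsigned long lru_pages)
{
    struct shrinker *shrinker;

    if (!down_read_trylock(&shrinker_rwsem))
        return 1;   /* assume we can shrink next time */

    list_for_each_entry(shrinker, &shrinker_list, list) {
        /* Indirect call -- the "callq *(%r15)" in the disassembly.
         * For filesystem shrinkers this is prune_super(). */
        (*shrinker->shrink)(shrinker, shrink);
        /* ... nr_to_scan batching and object accounting omitted ... */
    }

    up_read(&shrinker_rwsem);
    return 0;
}

/* fs/super.c (3.10, simplified): the per-filesystem shrinker callback */
static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
{
    struct super_block *sb = container_of(shrink, struct super_block,
                                          s_shrink);

    if (!grab_super_passive(sb))    /* takes the global sb_lock */
        return -1;

    /* ... scan and free this filesystem's dentries and inodes ... */

    drop_super(sb);                 /* calls put_super(): sb_lock again */
    return 0;
}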
Count the registered shrinkers:
crash> list -H shrinker_list shrinker.list | wc -l
42
crash> p shrinker_rwsem
shrinker_rwsem = $1 = {
count = 50,
wait_lock = {
raw_lock = {
lock = 0 '