Using Trace Events to Solve a Problem Where the System Cannot Enter Deep Sleep
This article is reposted from the WeChat public account 相遇Linux, written by JeffXie. To repost it, please contact the 相遇Linux public account.
I recently ran into a problem: the system could not sleep down to C7s, only to C3. (In CPU C-state terms, C0 is the running state and all other C-states are idle states; the deeper the sleep, the larger the C-state number.)
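(The post does not show how the C-state ceiling was observed in the first place; one common way, assuming an Intel machine with the turbostat utility installed, is to watch the per-CPU C-state residency columns while the box is otherwise idle:)

```bash
# Measure idle-state residency over a 10 second window. With the problem
# present, deep-state columns such as CPU%c7 stay near 0 while CPU%c3
# dominates (column names vary by platform and turbostat version).
turbostat sleep 10
```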
My first instinct was that the system was simply busy, so I used perf top to look at the CPU-hungry processes and hot functions:
- perf top -E 100 --stdio > perf-top.txt
- 19.85% perf [.] __symbols__insert
- 7.68% perf [.] rb_next
- 4.60% libc-2.26.so [.] __strcmp_sse2_unaligned
- 4.20% libelf-0.168.so [.] gelf_getsym
- 3.92% perf [.] dso__load_sym
- 3.86% libc-2.26.so [.] _int_malloc
- 3.60% libc-2.26.so [.] __libc_calloc
- 3.30% libc-2.26.so [.] vfprintf
- 2.95% perf [.] rb_insert_color
- 2.61% [kernel] [k] prepare_exit_to_usermode
- 2.51% perf [.] machine__map_x86_64_entry_trampolines
- 2.31% perf [.] symbol__new
- 2.22% [kernel] [k] do_syscall_64
- 2.11% libc-2.26.so [.] __strlen_avx2
It turned out that the only notable CPU consumer on the system was the perf tool itself :(
The next suspicion was that some process was preventing the CPU from sleeping down to C7s, so I turned to trace events and monitored the sched_switch event. Using the trace-cmd tool, I recorded sched_switch (task switch) events on all CPUs for 30 seconds:
- #trace-cmd record -e sched:sched_switch -M -1 sleep 30
- CPU0 data recorded at offset=0x63e000
- 102400 bytes in size
- CPU1 data recorded at offset=0x657000
- 8192 bytes in size
- CPU2 data recorded at offset=0x659000
- 20480 bytes in size
- CPU3 data recorded at offset=0x65e000
- 20480 bytes in size
trace-cmd report shows the recorded results, but raw data like this is hard to read at a glance, and there is no per-task statistic of how often each task was switched in:
- #trace-cmd report
- cpus=4
- trace-cmd-19794 [001] 225127.464466: sched_switch: trace-cmd:19794 [120] S ==> swapper/1:0 [120]
- trace-cmd-19795 [003] 225127.464601: sched_switch: trace-cmd:19795 [120] S ==> swapper/3:0 [120]
- sleep-19796 [002] 225127.464792: sched_switch: sleep:19796 [120] S ==> swapper/2:0 [120]
- <idle>-0 [003] 225127.471948: sched_switch: swapper/3:0 [120] R ==> rcu_sched:11 [120]
- rcu_sched-11 [003] 225127.471950: sched_switch: rcu_sched:11 [120] W ==> swapper/3:0 [120]
- <idle>-0 [003] 225127.479959: sched_switch: swapper/3:0 [120] R ==> rcu_sched:11 [120]
- rcu_sched-11 [003] 225127.479960: sched_switch: rcu_sched:11 [120] W ==> swapper/3:0 [120]
- <idle>-0 [003] 225127.487959: sched_switch: swapper/3:0 [120] R ==> rcu_sched:11 [120]
- rcu_sched-11 [003] 225127.487961: sched_switch: rcu_sched:11 [120] W ==> swapper/3:0 [120]
- <idle>-0 [002] 225127.491959: sched_switch: swapper/2:0 [120] R ==> kworker/2:2:19735 [120]
- kworker/2:2-19735 [002] 225127.491972: sched_switch: kworker/2:2:19735 [120] W ==> swapper/2:0 [120]
So I filtered the trace-cmd report output with a regular expression, then sorted and counted:
- trace-cmd report | grep -o '==> [^ ]\+:\?' | sort | uniq -c
- 3 ==> irqbalance:1034
- 3 ==> khugepaged:43
- 20 ==> ksoftirqd/0:10
- 1 ==> ksoftirqd/1:18
- 18 ==> ksoftirqd/3:30
- 1 ==> kthreadd:19798
- 1 ==> kthreadd:2
- 4 ==> kworker/0:0:19785
- 1 ==> kworker/0:1:19736
- 5 ==> kworker/0:1:19798
- 5 ==> kworker/0:1H:364
- 53 ==> kworker/0:2:19614
- 19 ==> kworker/1:1:7665
- 30 ==> tuned:19498
- ...
A suspicious thread stood out: tuned was switched in 30 times within 30 seconds, while everything else in the list was a routine kernel thread.
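(A small variant of the same pipeline, with a final numeric sort so the most frequently switched-in tasks float to the top; this helps such outliers stand out on busier systems:)

```bash
# Count switch-in targets and list the busiest first.
trace-cmd report | grep -o '==> [^ ]\+' | sort | uniq -c | sort -rn | head -20
```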
Next, check whether the tuned service is enabled on this system:
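(The original shows this check as a screenshot; on a systemd-based distro it would look roughly like this:)

```bash
# Is the tuned service active, and which process owns the tuned thread?
systemctl status tuned
ps -eLf | grep '[t]uned'
```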
Sure enough, the tuned service was running, and it is what spawned the thread named tuned.
Take a look at the configuration file of the active tuned profile:
- localhost:/home/jeff # tuned-adm active
- Current active profile: sap-hana
- localhost:/home/jeff # cat /usr/lib/tuned/sap-hana/tuned.conf
- [main]
- summary=Optimize for SAP NetWeaver, SAP HANA and HANA based products
- [cpu]
- force_latency = 70
The [cpu] section sets a forced latency of 70 microseconds (force_latency = 70), an optimization intended for the HANA database.
How does force_latency actually take effect? After some digging, it turns out this value is written into /dev/cpu_dma_latency.
Running lsof /dev/cpu_dma_latency confirms that the tuned process is indeed holding this file open:
- #lsof /dev/cpu_dma_latency
- COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
- tuned 18734 root 9w CHR 10,60 0t0 11400 /dev/cpu_dma_latency
The Linux kernel documentation also describes /dev/cpu_dma_latency: to set it, you open the file, write the value, and keep the descriptor open; as soon as the file descriptor is released, the setting reverts to its default. That is exactly why lsof /dev/cpu_dma_latency above still shows tuned holding the file.
- https://github.com/torvalds/linux/blob/v5.8/Documentation/trace/coresight/coresight-cpu-debug.rst
- As specified in the PM QoS documentation the requested parameter
- will stay in effect until the file descriptor is released. For example:
- # exec 3<> /dev/cpu_dma_latency; echo 0 >&3
- ...
- Do some work...
- ...
- # exec 3<>-
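Following that pattern, the 70 µs request that tuned makes can be reproduced by hand from a root shell. A sketch, with one caveat: on the kernels I have checked, a text write that is not exactly 4 binary bytes appears to be parsed as a hexadecimal string, so the value is given in hex here.

```bash
# Hold /dev/cpu_dma_latency open on fd 3 and request a 70us limit
# ("46" in hex == 70 decimal). The request lasts only while fd 3 stays open.
exec 3<> /dev/cpu_dma_latency
echo 46 >&3

# Reading back returns the effective (minimum) of all active requests,
# so this should now show 46 00 00 00 (or less, if another requester is stricter).
cat /dev/cpu_dma_latency | hexdump -Cv

# Closing the descriptor drops our request; the value reverts to the default
# unless some other process (e.g. tuned) still holds its own request.
exec 3>&-
```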
Reading the contents of /dev/cpu_dma_latency confirms the value is indeed 70, matching force_latency = 70:
- localhost:/home/jeff # cat /dev/cpu_dma_latency | hexdump -Cv
- 00000000 46 00 00 00 |F...|
- localhost:/home/jeff # echo $((0x46))
- 70
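(The value is a raw little-endian signed 32-bit integer in microseconds; rather than eyeballing the hexdump, od can decode it directly, for example:)

```bash
# Print the 4-byte value as a signed decimal integer.
od -An -td4 /dev/cpu_dma_latency
#       70
```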
Now look at the description and exit-latency values of each CPU idle state on this system:
- # cd /sys/devices/system/cpu/cpu0/cpuidle/
- # for state in * ; do
- echo -e \
- "STATE: $state\t\
- DESC: $(cat $state/desc)\t\
- NAME: $(cat $state/name)\t\
- LATENCY: $(cat $state/latency)\t\
- RESIDENCY: $(cat $state/residency)"
- done
The C3 state has an exit latency of 33 microseconds, while the next deeper state, C6, already has an exit latency of 133 microseconds, so with force_latency = 70 the system can sleep no deeper than C3. (The latency value is the time it takes to wake from that idle state back to the running state; the sketch after the listing below makes the cutoff concrete.)
- STATE: state0 DESC: CPUIDLE CORE POLL IDLE NAME: POLL LATENCY: 0 RESIDENCY: 0
- STATE: state1 DESC: MWAIT 0x00 NAME: C1 LATENCY: 2 RESIDENCY: 2
- STATE: state2 DESC: MWAIT 0x01 NAME: C1E LATENCY: 10 RESIDENCY: 20
- STATE: state3 DESC: MWAIT 0x10 NAME: C3 LATENCY: 33 RESIDENCY: 100
- STATE: state4 DESC: MWAIT 0x20 NAME: C6 LATENCY: 133 RESIDENCY: 400
- STATE: state5 DESC: MWAIT 0x32 NAME: C7s LATENCY: 166 RESIDENCY: 500
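To make the cutoff concrete, here is a rough check of which states a 70 µs PM QoS limit still permits; this mimics only the governor's exit-latency comparison, not its full heuristics:

```bash
# Compare each cpu0 idle state's exit latency against a 70us limit.
limit=70
cd /sys/devices/system/cpu/cpu0/cpuidle/
for state in state*; do
    name=$(cat $state/name)
    lat=$(cat $state/latency)
    if [ "$lat" -le "$limit" ]; then
        echo "$name (latency ${lat}us): still allowed"
    else
        echo "$name (latency ${lat}us): blocked by the ${limit}us limit"
    fi
done
```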
Now turn off the tuned profile and read /dev/cpu_dma_latency again: the value is back to the default of 2000 seconds (2,000,000,000 microseconds):
- localhost:/home/jeff # tuned-adm off
- localhost:/home/jeff # cat /dev/cpu_dma_latency | hexdump -Cv
- 00000000 00 94 35 77 |..5w|
- localhost:/home/jeff # echo $((0x77359400))
- 2000000000
Finally, verify the result (one possible check is sketched below): the system can now sleep down to C7s, and the problem is solved :)
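One way to verify, reusing the cpuidle sysfs files from above (state5 corresponds to C7s on this machine), is to sample that state's usage counter before and after an idle period and confirm it is now increasing:

```bash
# Snapshot the C7s entry counter on cpu0, idle for a while, then compare.
before=$(cat /sys/devices/system/cpu/cpu0/cpuidle/state5/usage)
sleep 10
after=$(cat /sys/devices/system/cpu/cpu0/cpuidle/state5/usage)
echo "C7s entries during the interval: $((after - before))"
```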
Solving this problem relied mainly on the trace events that the Linux kernel itself provides.
So never underestimate any single feature. That is how the kernel works: features that look mundane at first glance, once polished by engineers who take them seriously, turn out to have enormous potential :)