記一次 .NET 某電商定向爬蟲的內(nèi)存碎片化分析
一、背景
1. 講故事
上個(gè)月有位朋友wx找到我,說(shuō)他的程序存在內(nèi)存泄漏問題,尋求如何解決? 如下圖所示:
從截圖中可以看出,這位朋友對(duì) windbg 的操作還是有些熟悉的,可能缺乏一定的實(shí)操經(jīng)驗(yàn),所以用了幾個(gè)命令之后就不知道怎么排查下去了。
既然找到我,那就以我的個(gè)人經(jīng)驗(yàn)在他的dump上繼續(xù)分析尋找罪魁禍?zhǔn)?,閑話不多說(shuō),上windbg說(shuō)話。
二、Windbg 分析
1. 真的存在內(nèi)存泄漏嗎?
追這個(gè)系列的朋友應(yīng)該知道,我無(wú)數(shù)次的用 !address -summary 和 !eeheap -gc 這兩個(gè)命令來(lái)判斷當(dāng)前的內(nèi)存泄漏是屬于托管層還是非托管層。
- 0:000> !address -summary
- --- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
- Free 358 7dfc`67f60000 ( 125.986 TB) 98.43%
- <unknown> 1087 203`88b6e000 ( 2.014 TB) 99.99% 1.57%
- Image 1532 0`09f11000 ( 159.066 MB) 0.01% 0.00%
- Heap 249 0`03453000 ( 52.324 MB) 0.00% 0.00%
- Stack 66 0`01fc0000 ( 31.750 MB) 0.00% 0.00%
- Other 10 0`001d1000 ( 1.816 MB) 0.00% 0.00%
- TEB 22 0`0002c000 ( 176.000 kB) 0.00% 0.00%
- PEB 1 0`00001000 ( 4.000 kB) 0.00% 0.00%
- --- Type Summary (for busy) ------ RgnCount ----------- Total Size -------- %ofBusy %ofTotal
- MEM_MAPPED 183 200`00d06000 ( 2.000 TB) 99.30% 1.56%
- MEM_PRIVATE 1252 3`8d479000 ( 14.207 GB) 0.69% 0.01%
- MEM_IMAGE 1532 0`09f11000 ( 159.066 MB) 0.01% 0.00%
- --- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
- MEM_FREE 358 7dfc`67f60000 ( 125.986 TB) 98.43%
- MEM_RESERVE 749 200`28a9b000 ( 2.001 TB) 99.33% 1.56%
- MEM_COMMIT 2218 3`6f5f5000 ( 13.740 GB) 0.67% 0.01%
- 0:000> !eeheap -gc
- Number of GC Heaps: 1
- generation 0 starts at 0x00000026DA8DA928
- generation 1 starts at 0x00000026DA7FC348
- generation 2 starts at 0x00000024C4691000
- ephemeral segment allocation context: none
- segment begin allocated size
- 00000024C4690000 00000024C4691000 00000024D468FF28 0xfffef28(268431144)
- 00000024EECF0000 00000024EECF1000 00000024FECF0000 0xffff000(268431360)
- 000000248D6F0000 000000248D6F1000 000000249D6EFEF8 0xfffeef8(268431096)
- ...
- 00000026D66D0000 00000026D66D1000 00000026DBA3CA30 0x536ba30(87472688)
- Large object heap starts at 0x00000024D4691000
- segment begin allocated size
- 00000024D4690000 00000024D4691000 00000024DC67C318 0x7feb318(134132504)
- 00000024E60F0000 00000024E60F1000 00000024EE0637C8 0x7f727c8(133638088)
- 0000002482140000 0000002482141000 000000248A08F338 0x7f4e338(133489464)
- 00000024A6770000 00000024A6771000 00000024AE76F6C0 0x7ffe6c0(134211264)
- ...
- 000000278E6D0000 000000278E6D1000 000000279635F2D0 0x7c8e2d0(130605776)
- 00000029233E0000 00000029233E1000 000000292AF672F8 0x7b862f8(129524472)
- 000000292B3E0000 000000292B3E1000 0000002931A5ED60 0x667dd60(107470176)
- 000000299B3E0000 000000299B3E1000 00000029A20095B0 0x6c285b0(113411504)
- 000000281E6D0000 000000281E6D1000 0000002825CD3F58 0x7602f58(123744088)
- 00000028266D0000 00000028266D1000 000000282D5CAD50 0x6ef9d50(116366672)
- 000000282E6D0000 000000282E6D1000 0000002833CA0880 0x55cf880(89979008)
- 00000029A33E0000 00000029A33E1000 00000029A684D300 0x346c300(54969088)
- Total Size: Size: 0x353f96d88 (14293757320) bytes.
- ------------------------------
- GC Heap Size: Size: 0x353f96d88 (14293757320) bytes.
從輸出看,當(dāng)前進(jìn)程占用 MEM_COMMIT=13.7G,托管堆內(nèi)存占用 14293757320 = 13.3G,很明顯這屬于簡(jiǎn)單模式的 托管內(nèi)存泄漏,根據(jù)經(jīng)驗(yàn),托管堆上可能有什么大對(duì)象,這里用 !dumpheap -stat 命令。
- 0:000> !dumpheap -stat
- Statistics:
- MT Count TotalSize Class Name
- 00007ff9ed6ea268 3956842 94964208 System.Collections.Generic.Dictionary`2+KeyCollection[[System.String, System.Private.CoreLib],[Serilog.Events.LogEventPropertyValue, Serilog]]
- 00007ff9ed5e6d28 3842435 166405016 Serilog.Parsing.MessageTemplateToken[]
- 00007ff9ed5e5e28 3842434 184436832 Serilog.Events.MessageTemplate
- 00007ff9ecccf090 4011012 203304420 System.Int32[]
- 00007ff9ed647078 3956849 253238336 Serilog.Events.LogEvent
- 00007ff9ed6e7b48 3956849 284893128 System.Collections.Generic.Dictionary`2[[System.String, System.Private.CoreLib],[Serilog.Events.LogEventPropertyValue, Serilog]]
- 00007ff9ed5e74e8 9259598 296307136 Serilog.Parsing.TextToken
- 00007ff9ed6471b0 12551808 301243392 Serilog.Events.ScalarValue
- 00007ff9ed6e8308 3956849 729078048 System.Collections.Generic.Dictionary`2+Entry[[System.String, System.Private.CoreLib],[Serilog.Events.LogEventPropertyValue, Serilog]][]
- 00007ff9eccb1e18 16546412 3987811940 System.String
- 00000024c3b8faf0 82904 7382993568 Free
我去,托管堆最大的對(duì)象居然是 Free,大概占用 7.3G, 這就牛了,不按套路出牌哈,為了更好的理解,先來(lái)簡(jiǎn)要解釋下 free 對(duì)象。
2. 簡(jiǎn)單解釋 free
其實(shí)簡(jiǎn)而言之,free對(duì)象是被gc標(biāo)記成已回收的空閑塊但并未提交給操作系統(tǒng)釋放的對(duì)象,那怎么去證明呢? 可以先到托管堆上找一個(gè)free塊。
- 0:000> !dumpheap -type Free
- Address MT Size
- 00000024c4691000 00000024c3b8faf0 24 Free
- 00000024c46a2448 00000024c3b8faf0 24 Free
- 00000024c46b26d8 00000024c3b8faf0 40 Free
- 00000024c47e4418 00000024c3b8faf0 40 Free
- 00000024c4925680 00000024c3b8faf0 40 Free
- 00000024c49284a8 00000024c3b8faf0 64 Free
- 00000024c4947a90 00000024c3b8faf0 192 Free
- 00000024c4951f70 00000024c3b8faf0 48 Free
- 000000249d6ea3a8 00000024c3b8faf0 640 Free
(1)用 !do 命令查看是否標(biāo)記為free塊
- 0:000> !do 000000249d6ea3a8
- Free Object
- Size: 640(0x280) bytes
(2)用 dc 看一下這個(gè) free 塊的內(nèi)容,是否有g(shù)c回收的痕跡。
- 0:000> dc 000000249d6ea3a8 000000249d6ea3a8+0x280
- 00000024`9d6ea3a8 c3b8faf0 00000024 00000268 00000000 ....$...h.......
- 00000024`9d6ea3b8 9d6ea6d0 00000024 00000000 00000000 ..n.$...........
- 00000024`9d6ea3c8 ed3ae2b8 00007ff9 9d6ea3a8 00000024 ..:.......n.$...
- 00000024`9d6ea3d8 00000000 00000000 edcc9118 00007ff9 ................
- 00000024`9d6ea3e8 00000000 00000000 00000000 00000000 ................
- 00000024`9d6ea3f8 00000000 00000000 00000000 00000000 ................
- 00000024`9d6ea408 eeb07a50 00007ff9 9d6ea3c8 00000024 Pz........n.$...
- 00000024`9d6ea418 00000000 00000000 ef292ee8 00007ff9 ..........).....
- 00000024`9d6ea428 9d6ea408 00000024 00000000 00000000 ..n.$...........
- ...
- 00000024`9d6ea4a8 eeb0a158 00007ff9 9d6ea420 00000024 X....... .n.$...
- 00000024`9d6ea4b8 00000000 00000000 00000000 00000000 ................
- 00000024`9d6ea4c8 ef293818 00007ff9 9d6ea4a8 00000024 .8).......n.$...
- 00000024`9d6ea4d8 00000000 00000000 ee8357e0 00007ff9 .........W......
- ...
- 00000024`9d6ea508 eed37b40 00007ff9 00000000 00000000 @{..............
- 00000024`9d6ea518 00000000 00000000 00000000 00000000 ................
- 00000024`9d6ea528 c4699b48 00000024 00000000 00000000 H.i.$...........
- 00000024`9d6ea538 00000000 07000440 00000001 00000000 ....@...........
- 00000024`9d6ea548 00000000 00000000 00000000 00000000 ................
- 00000024`9d6ea558 00000000 00000000 ef2af6e0 00007ff9 ..........*.....
- 00000024`9d6ea568 00000000 00000000 00000000 00000000 ................
- 00000024`9d6ea578 00000000 00000000 c4699b48 00000024 ........H.i.$...
- 00000024`9d6ea588 00000000 00000000 00000000 07000400 ................
- ...
- 00000024`9d6ea628 ef2afd08 ..*.
可以看到,這個(gè)free塊生前是有一些殘留的內(nèi)容字符,好了,對(duì) free 塊有基本了解后,接下來(lái)我們繼續(xù)探究下。
3. 到底是什么阻止了free塊的合并?
按照正常邏輯,大多free塊會(huì)在gc回收完之后合并內(nèi)存時(shí),那些被清空后的segment會(huì)被操作系統(tǒng)釋放的,但這個(gè)dump并沒有,也就說(shuō)明這里有什么東西阻止了free塊的合并,那到底是什么呢? 有經(jīng)驗(yàn)的朋友會(huì)說(shuō),可以觀察下 gc 的句柄表,命令為 !gchandles -stat 。
- 0:000> !gchandles -stat
- Statistics:
- MT Count TotalSize Class Name
- ...
- 00007ff9ed15c0f0 1008 72576 System.Reflection.Emit.DynamicResolver
- 00007ff9ecbf6618 38 409344 System.Object[]
- Total 1784 objects
- Handles:
- Strong Handles: 233
- Pinned Handles: 16
- Async Pinned Handles: 18
- Ref Count Handles: 1
- Weak Long Handles: 1327
- Weak Short Handles: 144
- Dependent Handles: 45
從輸出看,這里并沒有什么可疑的地方,那怎么辦呢? 實(shí)操經(jīng)驗(yàn)多的話,這里還是有一些經(jīng)驗(yàn)值得分享的,比如觀察 free 在 heap 上的布局特征,往往就有重大發(fā)現(xiàn)。
4. 查看 free 塊的布局特征
為了簡(jiǎn)化輸出結(jié)果,我把范圍限定到 heap 上某一個(gè) segment 上,比如這里的: 00000029233E0000 000000292AF672F8,所以命令就是 !dumpheap 00000029233E0000 000000292AF672F8。
- 0:000> !dumpheap 00000029233E0000 000000292AF672F8
- Address MT Size
- 00000029233e1000 00000024c3b8faf0 8291896 Free
- 0000002923bc9638 00007ff9eccb1e18 108448
- 0000002923be3dd8 00000024c3b8faf0 29931248 Free
- 000000292586f4c8 00007ff9eccb1e18 301328
- 00000029258b8dd8 00000024c3b8faf0 41384784 Free
- 0000002928030928 00007ff9eccb1e18 301328
- 000000292807a238 00000024c3b8faf0 2542664 Free
- 00000029282e6e80 00007ff9eccb1e18 108448
- 0000002928301620 00000024c3b8faf0 29915032 Free
- 0000002929f88db8 00007ff9eccb1e18 301328
- 0000002929fd26c8 00000024c3b8faf0 2746688 Free
- 000000292a271008 00007ff9eccb1e18 291304
- 000000292a2b81f0 00000024c3b8faf0 1019600 Free
- 000000292a3b10c0 00007ff9eccb1e18 108448
- 000000292a3cb860 00000024c3b8faf0 10601048 Free
- 000000292ade7ab8 00007ff9eccb1e18 301328
- 000000292ae313c8 00000024c3b8faf0 280808 Free
- 000000292ae75cb0 00007ff9eccb1e18 280854
- 000000292aeba5c8 00000024c3b8faf0 416584 Free
- 000000292af20110 00007ff9eccb1e18 291304
- Statistics:
- MT Count TotalSize Class Name
- 00007ff9eccb1e18 10 2394118 System.String
- 00000024c3b8faf0 10 127130352 Free
- Total 20 objects
真是一看嚇一跳,free 和 object 呈交替狀,這就是為什么free塊不能被合并的真正原因,說(shuō)實(shí)話這種教科書式的 內(nèi)存碎片化 dump,真是可遇不可求,接下來(lái)就抽幾個(gè) free 之間的 object 對(duì)象,看看到底是被什么引用著導(dǎo)致gc回收不掉。
5. 尋找 object 的引用鏈
要想查看 object 到底被誰(shuí)引用著,可以用 !gcroot 命令,這里我抽二個(gè)看看。
- 0:000> !gcroot 0000002923bc9638
- Thread 1878:
- 00000024C39BE4B0 00007FFA4C0B3522 System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2922]
- rbp+10: 00000024c39be520
- -> 00000024C48AD6E0 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib],[Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions+<RunAsync>d__4, Microsoft.Extensions.Hosting.Abstractions]]
- -> 00000024C48AD200 Microsoft.Extensions.Hosting.Internal.Host
- -> 00000024C48AC538 Microsoft.Extensions.Logging.Logger`1[[Microsoft.Extensions.Hosting.Internal.Host, Microsoft.Extensions.Hosting]]
- -> 00000024C48AC658 Microsoft.Extensions.Logging.Logger
- -> 00000024C48AC680 Microsoft.Extensions.Logging.LoggerInformation[]
- -> 00000024C48AC6E0 Serilog.Extensions.Logging.SerilogLogger
- -> 00000024C48AC808 Serilog.Core.Logger
- -> 00000024C48AC760 Serilog.Core.Logger
- -> 00000024C47AD468 Serilog.Core.Logger
- -> 00000024C47ABF08 Serilog.Core.Sinks.SafeAggregateSink
- -> 00000024C47ABF20 Serilog.Core.ILogEventSink[]
- -> 00000024C479C548 Serilog.Sinks.Grafana.Loki.LokiSink
- -> 00000024C479C778 Serilog.Sinks.Grafana.Loki.Infrastructure.BoundedQueue`1[[Serilog.Events.LogEvent, Serilog]]
- -> 00000024C479C7B8 System.Collections.Generic.Queue`1[[Serilog.Events.LogEvent, Serilog]]
- -> 00000026E8C1A800 Serilog.Events.LogEvent[]
- -> 00000026148D3308 Serilog.Events.LogEvent
- -> 00000026148D4EF0 Serilog.Events.MessageTemplate
- -> 0000002923BC9638 System.String
- Found 1 unique roots (run '!gcroot -all' to see all roots).
- 0:000> !gcroot 000000292586f4c8
- Thread 1878:
- 00000024C39BE4B0 00007FFA4C0B3522 System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2922]
- rbp+10: 00000024c39be520
- -> 00000024C48AD6E0 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib],[Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions+<RunAsync>d__4, Microsoft.Extensions.Hosting.Abstractions]]
- -> 00000024C48AD200 Microsoft.Extensions.Hosting.Internal.Host
- -> 00000024C48AC538 Microsoft.Extensions.Logging.Logger`1[[Microsoft.Extensions.Hosting.Internal.Host, Microsoft.Extensions.Hosting]]
- -> 00000024C48AC658 Microsoft.Extensions.Logging.Logger
- -> 00000024C48AC680 Microsoft.Extensions.Logging.LoggerInformation[]
- -> 00000024C48AC6E0 Serilog.Extensions.Logging.SerilogLogger
- -> 00000024C48AC808 Serilog.Core.Logger
- -> 00000024C48AC760 Serilog.Core.Logger
- -> 00000024C47AD468 Serilog.Core.Logger
- -> 00000024C47ABF08 Serilog.Core.Sinks.SafeAggregateSink
- -> 00000024C47ABF20 Serilog.Core.ILogEventSink[]
- -> 00000024C479C548 Serilog.Sinks.Grafana.Loki.LokiSink
- -> 00000024C479C778 Serilog.Sinks.Grafana.Loki.Infrastructure.BoundedQueue`1[[Serilog.Events.LogEvent, Serilog]]
- -> 00000024C479C7B8 System.Collections.Generic.Queue`1[[Serilog.Events.LogEvent, Serilog]]
- -> 00000026E8C1A800 Serilog.Events.LogEvent[]
- -> 0000002614BB7AC8 Serilog.Events.LogEvent
- -> 0000002616D3CC40 Serilog.Events.MessageTemplate
- -> 000000292586F4C8 System.String
- Found 1 unique roots (run '!gcroot -all' to see all roots).
從引用鏈看,這些 object 都是通過(guò) Serilog 日志組件發(fā)送給 Grafana.Loki ,通過(guò)對(duì)引用鏈對(duì)象的追蹤,我發(fā)現(xiàn) System.Collections.Generic.Queue 有重大問題,截圖如下:
對(duì),這個(gè) queue 居然有 395w 的積壓,到底積壓了什么東西,可以用 !wdo 看一下string內(nèi)容。
看樣子這是一個(gè)失敗重試,分析到這里思路大概就清晰了,就是 Grafana.Loki 或者 Serilog 組件出了什么問題,導(dǎo)致日志發(fā)送不到 Loki 中或者發(fā)送速度過(guò)慢,然后不斷的積壓所致,接下來(lái)把分析到的這些信息和朋友做了一個(gè)溝通,截圖如下:
三、總結(jié)
本次內(nèi)存碎片化的主要原因在于 Serilog 對(duì)接 Loki 的過(guò)程中產(chǎn)生的395w的queue積壓所致,但我也只能分析到這里了,至于為什么有積壓,這個(gè)還得朋友進(jìn)一步調(diào)試分析,我相信這個(gè)問題很快就能得到解決。
本文轉(zhuǎn)載自微信公眾號(hào)「一線碼農(nóng)聊技術(shù)」,可以通過(guò)以下二維碼關(guān)注。轉(zhuǎn)載本文請(qǐng)聯(lián)系一線碼農(nóng)聊技術(shù)公眾號(hào)。