這次生產(chǎn)事故主要是由于高峰期請求量過大,由于某種原因 Socket 連接關(guān)閉,導(dǎo)致了大量的異步回調(diào)異常。解決方法在調(diào)用端做好限流,據(jù)朋友說減少了不必要的 https://api.xxxx/peer_messages 調(diào)用,目前沒有出現(xiàn) CPU 爆高現(xiàn)象。
?一、背景
1.講故事
前段時間有位朋友加微信,說他的程序直接 CPU=100%,每次只能手工介入重啟,讓我?guī)兔聪碌降自趺椿厥?,哈哈,這種CPU打滿的事故,程序員壓力會非常大, 我讓朋友在 CPU 高的時候抓 2 個 dump 下來,然后發(fā)給我分析。
二、WinDbg 分析
1. CPU 真的被打滿了嗎?
為了防止南轅北轍,一定要用 !tp 命令去驗證下是不是真的 CPU 爆高。
0:000> !tp
CPU utilization: 100%
Worker Thread: Total: 21 Running: 7 Idle: 0 MaxLimit: 32767 MinLimit: 4
Work Request in Queue: 3
AsyncTimerCallbackCompletion TimerInfo@00000000042d2430
AsyncTimerCallbackCompletion TimerInfo@00000000042d2f90
AsyncTimerCallbackCompletion TimerInfo@000000000420c150
--------------------------------------
Number of Timers: 0
--------------------------------------
Completion Port Thread:Total: 18 Free: 9 MaxFree: 8 CurrentLimit: 18 MaxLimit: 1000 MinLimit: 4
從卦中看確實 100%,太牛了,而且 WorkRequest 還有任務(wù)堆積現(xiàn)象,確認無疑后,接下來看下是誰引發(fā)的?
2. 誰導(dǎo)致的 CPU 爆高
根據(jù)慣例首先懷疑是不是 GC 觸發(fā)所致,可以用 !t? 查看下線程列表,觀察下有沒有 GC 字樣。
:000> !t
ThreadCount: 53
UnstartedThread: 0
BackgroundThread: 53
PendingThread: 0
DeadThread: 0
Hosted Runtime: no
Lock
ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception
4 1 1240 00000000021cdf30 2a220 Preemptive 0000000000000000:0000000000000000 00000000021d94c0 0 MTA
23 2 4db4 00000000041cdaa0 2b220 Preemptive 0000000000000000:0000000000000000 00000000021d94c0 0 MTA (Finalizer)
...
65 156 22f4 000000000b1a3f60 8029220 Preemptive 00000004527751F0:0000000452775EE8 00000000021d94c0 0 MTA (Threadpool Completion Port)
66 205 2ef8 000000000b1a1080 8029220 Preemptive 0000000157641DE0:00000001576435B0 00000000021d94c0 0 MTA (Threadpool Completion Port)
...
從卦中看沒有 GC? 字樣,也表明這個程序并不是由 GC 觸發(fā)所致,接下來該怎么排查呢?一般來說 CPU 的爆高是由線程抬起來的,所以接下來就是看下 CPU 的檔次和各個線程棧,看有沒有什么新線索, 可以使用 ~*e !clrstack 。
0:000> !cpuid
CP F/M/S Manufacturer MHz
0 6,79,1 <unavailable> 2299
1 6,79,1 <unavailable> 2299
2 6,79,1 <unavailable> 2299
3 6,79,1 <unavailable> 2299
0:000> ~*e !clrstack
OS Thread Id: 0x2cc4 (68)
Child SP IP Call Site
000000000c14e758 00007ffadeb86e4a [GCFrame: 000000000c14e758]
000000000c14e840 00007ffadeb86e4a [GCFrame: 000000000c14e840]
000000000c14e878 00007ffadeb86e4a [HelperMethodFrame: 000000000c14e878] System.Threading.Monitor.Enter(System.Object)
000000000c14e970 00007ffaceb40491 System.Net.ConnectionGroup.Disassociate(System.Net.Connection) [f:\dd\NDP\fx\src\net\System\Net\_ConnectionGroup.cs @ 148]
000000000c14e9d0 00007ffaceb3fc93 System.Net.Connection.PrepareCloseConnectionSocket(System.Net.ConnectionReturnResult ByRef) [f:\dd\NDP\fx\src\net\System\Net\_Connection.cs @ 3048]
000000000c14eaa0 00007ffacf139bfb System.Net.Connection.HandleError(Boolean, Boolean, System.Net.WebExceptionStatus, System.Net.ConnectionReturnResult ByRef) [f:\dd\NDP\fx\src\net\System\Net\_Connection.cs @ 3119]
000000000c14eb00 00007ffacebc4118 System.Net.Connection.ReadComplete(Int32, System.Net.WebExceptionStatus) [f:\dd\NDP\fx\src\net\System\Net\_Connection.cs @ 3387]
000000000c14eb80 00007ffacebe3dc5 System.Net.LazyAsyncResult.Complete(IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_LazyAsyncResult.cs @ 415]
000000000c14ebe0 00007ffacebe3d07 System.Net.LazyAsyncResult.ProtectedInvokeCallback(System.Object, IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_LazyAsyncResult.cs @ 368]
000000000c14ec20 00007ffacf3a476f System.Net.Security._SslStream.StartFrameBody(Int32, Byte[], Int32, Int32, System.Net.AsyncProtocolRequest)
000000000c14ec80 00007ffacebb3ed8 System.Net.Security._SslStream.ReadHeaderCallback(System.Net.AsyncProtocolRequest) [f:\dd\NDP\fx\src\net\System\Net\SecureProtocols\_SslStream.cs @ 1007]
000000000c14ece0 00007ffacebae5ee System.Net.AsyncProtocolRequest.CompleteRequest(Int32) [f:\dd\NDP\fx\src\net\System\Net\SecureProtocols\_HelperAsyncResults.cs @ 142]
000000000c14ed10 00007ffacf3a3567 System.Net.FixedSizeReader.CheckCompletionBeforeNextRead(Int32)
000000000c14ed40 00007ffacebae507 System.Net.FixedSizeReader.ReadCallback(System.IAsyncResult) [f:\dd\NDP\fx\src\net\System\Net\SecureProtocols\_FixedSizeReader.cs @ 148]
000000000c14ed90 00007ffacebe3dc5 System.Net.LazyAsyncResult.Complete(IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_LazyAsyncResult.cs @ 415]
000000000c14edf0 00007ffadcbe3a63 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean) [f:\dd\ndp\clr\src\BCL\system\threading\executioncontext.cs @ 954]
000000000c14eec0 00007ffadcbe38f4 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean) [f:\dd\ndp\clr\src\BCL\system\threading\executioncontext.cs @ 902]
000000000c14eef0 00007ffadcbe38c2 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [f:\dd\ndp\clr\src\BCL\system\threading\executioncontext.cs @ 891]
000000000c14ef40 00007ffaceba60cf System.Net.ContextAwareResult.Complete(IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_ContextAwareResult.cs @ 463]
000000000c14ef90 00007ffacebe3d07 System.Net.LazyAsyncResult.ProtectedInvokeCallback(System.Object, IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_LazyAsyncResult.cs @ 368]
000000000c14efd0 00007ffaceba5e2f System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*) [f:\dd\NDP\fx\src\net\System\Net\Sockets\_BaseOverlappedAsyncResult.cs @ 399]
000000000c14f040 00007ffadcc2ffef System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*) [f:\dd\ndp\clr\src\BCL\system\threading\overlapped.cs @ 135]
000000000c14f1f0 00007ffade9a6d93 [GCFrame: 000000000c14f1f0]
OS Thread Id: 0x5d6c (70)
Child SP IP Call Site
000000000943e920 00007ffacf139bfb System.Net.Connection.HandleError(Boolean, Boolean, System.Net.WebExceptionStatus, System.Net.ConnectionReturnResult ByRef)
OS Thread Id: 0x4ad4 (75)
Child SP IP Call Site
...
000000000c94e5a0 00007ffacf139bfb System.Net.Connection.HandleError(Boolean, Boolean, System.Net.WebExceptionStatus, System.Net.ConnectionReturnResult ByRef)
...
OS Thread Id: 0x1d70 (80)
Child SP IP Call Site
...
000000000d24e3a0 00007ffacf139bfb System.Net.Connection.HandleError(Boolean, Boolean, System.Net.WebExceptionStatus, System.Net.ConnectionReturnResult ByRef) [f:\dd\NDP\fx\src\net\System\Net\_Connection.cs @ 3119]
...
從線程棧上看,這個 CPU 是 4個核,剛好對應(yīng)著 4 個 HandleError? 報錯,看樣子是什么網(wǎng)絡(luò)出問題了,接下來切到 80 號線程看一下有沒有什么異常類。
0:000> ~80s
clr!AwareLock::Contention+0x194:
00007ffa`deb86e40 4883e801 sub rax,1
0:080> !mdso
Thread 80:
Location Object Type
------------------------------------------------------------
000000000d24e098 000000015765e028 System.Net.WebException
000000000d24e0f8 0000000340b07110 System.Collections.ArrayList
000000000d24e110 000000015765e2b8 System.Net.HttpWebRequest[]
000000000d24e1c0 0000000340b070b8 System.Net.ConnectionGroup
000000000d24e258 0000000144a79678 System.Net.Connection
0:080> !mdt 000000015765e028
000000015765e028 (System.Net.WebException)
_className:NULL (System.String)
_exceptionMethod:NULL (System.Reflection.MethodBase)
_exceptionMethodString:NULL (System.String)
_message:000000015765df70 (System.String) Length=77, String="The underlying connection was closed: The connection was closed unexpectedly."
...
果然看到了 System.Net.WebException?, 從異常信息看貌似是 連接關(guān)閉了?,到這里我就有了一個大膽的猜測,是不是高頻的異常輸出導(dǎo)致的 CPU 爆高呢?為了驗證,可以到托管堆上找下 WebException 的個數(shù)。
0:080> !dumpheap -stat
Statistics:
MT Count TotalSize Class Name
...
00007ffacecc38b0 13315 2343440 System.Net.WebException
00007ffadcdf6570 11369 1909992 System.IO.IOException
00007ffadcdf5fb8 13380 2247840 System.ObjectDisposedException
...
看到這么多異常還是挺嚇人的,剛好朋友抓了兩個dump可以做個比較。
0:048> !dumpheap -stat
Statistics:
MT Count TotalSize Class Name
00007ffacecc38b0 26745 4707120 System.Net.WebException
00007ffadcdf6570 26722 4489296 System.IO.IOException
00007ffadcdf5fb8 28745 4829160 System.ObjectDisposedException
可以看到,2 min 之內(nèi)異常增加了合計 4w? 左右,這就驗證了程序確實是瘋狂的拋異常,在 Windows 平臺上不管是硬件異常?還是軟件異常都由 Windows SEH 異常處理框架統(tǒng)一處理,會出現(xiàn)用戶態(tài)和內(nèi)核態(tài)的切換,這樣瘋狂的拋出,必然會導(dǎo)致 CPU 爆高,終于找到原因了,接下來就是尋找誘發(fā)因素。
3. 異常是誰誘發(fā)的
再回頭看 HandleError? 函數(shù)的調(diào)用棧都是底層的庫函數(shù),從線程棧的 PerformIOCompletionCallback? 函數(shù)來看是 IO線程? 誘發(fā)的,能被 IO 線程兜到是因為這是做了異步處理,既然是 異步?,自然 OverlappedData 也會非常多。
0:080> !gchandles -stat
Statistics:
MT Count TotalSize Class Name
00007ffadc6f7b98 14511 1625232 System.Threading.OverlappedData
Total 17550 objects
Handles:
Strong Handles: 426
Pinned Handles: 23
Async Pinned Handles: 14511
Ref Count Handles: 24
Weak Long Handles: 2430
Weak Short Handles: 132
SizedRef Handles: 4
說明此時有大概 1.5w 的異步請求待回頭,請求量還是蠻大的,但還是沒找到異常的用戶代碼,只能找下到底是誰發(fā)起了什么請求。
0:080> !mdso
Thread 80:
Location Object Type
------------------------------------------------------------
...
000000000d24e488 0000000358c57918 System.Net.HttpWebRequest
000000000d24e2e8 00000001407b5b40 System.String "net_io_readfailure"
...
0:080> !mdt -r:2 0000000358c57918
0000000358c57918 (System.Net.HttpWebRequest)
_Uri:0000000358c57210 (System.Uri)
m_String:00000002407ee430 (System.String) Length=98, String="https://api.xxxx/peer_messages"
....
可以看到請求的連接是 https://api.xxxx/peer_messages?,是一個第三方的API接口, 由于底層的連接關(guān)閉,導(dǎo)致了最后 net_io_readfailure。
把所有的信息整合一下就是:
當請求量大了之后,訪問 https://api.xxxx/peer_messages? 會出問題,對方關(guān)閉了底層連接,導(dǎo)致客戶端這邊請求出現(xiàn)了大量 IO 回調(diào)異常:IOException: Unable to read data from the transport connection: The connection was closed.?,2min之間多達合計 4w? 的異常拋出,進而引發(fā) CPU 爆高,將信息告訴了朋友,讓朋友重點關(guān)注下 https://api.xxxx/peer_messages 這個連接。
三、總結(jié)
這次生產(chǎn)事故主要是由于高峰期請求量過大,由于某種原因 Socket 連接關(guān)閉,導(dǎo)致了大量的異步回調(diào)異常。
解決方法在調(diào)用端做好限流,據(jù)朋友說減少了不必要的 https://api.xxxx/peer_messages 調(diào)用,目前沒有出現(xiàn) CPU 爆高現(xiàn)象。