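Crash analysis of a .NET business system at a liquor company
I. Background
1. The story
A few days ago a friend came to me saying that his program crashes every time it is closed, and that he had never been able to find the cause, so he asked me to take a look. This is the second time this friend has reached out. After analyzing the dump I found it to be a fairly classic case, so I am sharing it here.
II. WinDbg analysis
1. Why did it crash
Finding the crash reason is fairly simple: run the !analyze -v command and read the output.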
0:040> !analyze -v
CONTEXT: (.ecxr)
eax=0afdf5dc ebx=0698ade8 ecx=00000001 edx=00000000 esi=0698ade8 edi=7eec0000
eip=7753c5af esp=0afdf5dc ebp=0afdf62c iopl=0 nv up ei pl nz na po nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000202
KERNELBASE!RaiseException+0x58:
7753c5af c9 leave
Resetting default scope
EXCEPTION_RECORD: (.exr -1)
ExceptionAddress: 7753c5af (KERNELBASE!RaiseException+0x00000058)
ExceptionCode: c0020001
ExceptionFlags: 00000001
NumberParameters: 1
Parameter[0]: 8007042b
PROCESS_NAME: xxx.exe
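According to the output, the crash code is c0020001. Looking it up in the code table gives "invalid string binding", as in the screenshot below:
[Image]
That is not much help, so next let's look at the thread stack.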
0:040> .ecxr
eax=0afdf5dc ebx=0698ade8 ecx=00000001 edx=00000000 esi=0698ade8 edi=7eec0000
eip=7753c5af esp=0afdf5dc ebp=0afdf62c iopl=0 nv up ei pl nz na po nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000202
KERNELBASE!RaiseException+0x58:
7753c5af c9 leave
0:040> k
*** Stack trace for last set context - .thread/.cxr resets it
# ChildEBP RetAddr
00 0afdf62c 70e75e0b KERNELBASE!RaiseException+0x58
01 0afdf648 70f63bf5 clr!COMPlusThrowBoot+0x1a
02 0afdf654 70b6f1da clr!UMThunkStubRareDisableWorker+0x25
03 0afdf67c 77a9571e clr!UMThunkStubRareDisable+0x9
04 0afdf6bc 77a80f0b ntdll!RtlpTpTimerCallback+0x7a
05 0afdf6e0 77a809b1 ntdll!TppTimerpExecuteCallback+0x10f
06 0afdf830 75c4344d ntdll!TppWorkerThread+0x562
07 0afdf83c 77a69802 kernel32!BaseThreadInitThunk+0xe
08 0afdf87c 77a697d5 ntdll!__RtlUserThreadStart+0x70
09 0afdf894 00000000 ntdll!_RtlUserThreadStart+0x1b
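The thread stack shows a Windows thread pool timer callback. When the callback reached the clr, the clr deliberately raised an exception.
2. Why does it throw deliberately
To answer this we need to read the clr source. Simplified, it looks like this: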
// Disable from a place that is calling into managed code via a UMEntryThunk.
extern "C" VOID __stdcall UMThunkStubRareDisableWorker(Thread * pThread, UMEntryThunk * pUMEntryThunk, Frame * pFrame)
{
    // Check for ShutDown scenario. This happens only when we have initiated shutdown
    // and someone is trying to call in after the CLR is suspended. In that case, we
    // must either raise an unmanaged exception or return an HRESULT, depending on the
    // expectations of our caller.
    if (!CanRunManagedCode())
    {
        pThread->m_fPreemptiveGCDisabled = 0;
        COMPlusThrowBoot(E_PROCESS_SHUTDOWN_REENTRY);
    }
}

BOOL CanRunManagedCode(BOOL fCannotRunIsUserError, HINSTANCE hInst)
{
    // If we are shutting down the runtime, then we cannot run code.
    if (g_fForbidEnterEE == TRUE)
        return FALSE;

    // If we are finaling live objects or processing ExitProcess event,
    // we can not allow managed method to run unless the current thread
    // is the finalizer thread
    if ((g_fEEShutDown & ShutDown_Finalize2) && !GCHeap::GetGCHeap()->IsCurrentThreadFinalizer())
        return FALSE;

    // If pre-loaded objects are not present, then no way.
    if (g_pPreallocatedOutOfMemoryException == NULL)
        return FALSE;

    return TRUE;
}
Based on the source above, the throw should be caused by CanRunManagedCode() returning false. But did it really return false? We can check with WinDbg by reading the g_fForbidEnterEE variable.
0:040> dp clr!g_fForbidEnterEE L1
712a2684 00000001
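The variable is 1 (true), which means the CLR is currently shutting down. The main thread has most likely called the Exit method, and a quick look at its stack in windbg confirms it: clr!SystemNative::Exit is at the bottom of the listing.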
0:000> k
00 0028d3b0 77549cd4 ntdll!NtQueryAttributesFile+0x12
01 0028d3b0 70bf560b KERNELBASE!GetFileAttributesW+0x71
02 0028d3c8 710602a5 clr!CheckFileExistence+0x1a
...
39 0028ebc0 70d2684b clr!WaitForEndOfShutdown_OneIteration+0x81
3a 0028ebc8 70d300e2 clr!WaitForEndOfShutdown+0x1b
3b 0028ec08 70d1329e clr!EEShutDown+0xad
3c 0028ec14 70d132fb clr!HandleExitProcessHelper+0x4d
3d 0028ec70 70d2ff99 clr!EEPolicy::HandleExitProcess+0x50
3e 0028ec70 7115af3b clr!ForceEEShutdown+0x31
3f 0028ec70 702a9faf clr!SystemNative::Exit+0x4f
Next, let's find out which managed method the callback was about to enter. The answer is in the UMEntryThunk.m_pManagedTarget field. The relevant source is:
class UMEntryThunk
{
private:
    // The start of the managed code
    const BYTE* m_pManagedTarget;
    // This is used for profiling.
    PTR_MethodDesc m_pMD;
};
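With this background it is easy to dig the method out with windbg. The second argument of UMThunkStubRareDisableWorker is pUMEntryThunk, which is 00580a38 in the kb output; its first pointer-sized field is m_pManagedTarget.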
0:040> kb 5
# ChildEBP RetAddr Args to Child
00 0afdf62c 70e75e0b c0020001 00000001 00000001 KERNELBASE!RaiseException+0x58
01 0afdf648 70f63bf5 006e0fe0 0afdf67c 70b6f1da clr!COMPlusThrowBoot+0x1a
02 0afdf654 70b6f1da 0698ade8 00580a38 0698ade8 clr!UMThunkStubRareDisableWorker+0x25
03 0afdf67c 77a9571e 00000000 00000001 7d723ac9 clr!UMThunkStubRareDisable+0x9
04 0afdf6bc 77a80f0b 0afdf71c 006e0fe0 006f6c10 ntdll!RtlpTpTimerCallback+0x7a
0:040> dp 00580a38 L2
00580a38 00386580 008f2eb8
0:040> !U 00386580
Unmanaged code
00386580 e9ab390000 jmp 00389f30
...
0:040> !ip2md 00389f30
MethodDesc: 0018af94
Method Name: xxx._checkInput1(IntPtr, Boolean)
Class: 00435a7c
MethodTable: 0018afd8
mdToken: 06000034
Module: 0018a6a8
IsJitted: yes
CodeAddr: 00389f30
Transparency: Critical
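The hop from the thunk at 00386580 to the method body at 00389f30 can be checked by hand, which is handy when !U is not available. The sketch below assumes a 32-bit process and a 5-byte near jmp or call (opcode e9 or e8 followed by a little-endian 32-bit displacement); the helper names rel32_target and le32 are mine, not from the CLR or WinDbg.

```cpp
#include <cassert>
#include <cstdint>

// Assemble the little-endian displacement from the four bytes after the opcode.
constexpr std::uint32_t le32(std::uint8_t b0, std::uint8_t b1,
                             std::uint8_t b2, std::uint8_t b3) {
    return std::uint32_t(b0) | (std::uint32_t(b1) << 8) |
           (std::uint32_t(b2) << 16) | (std::uint32_t(b3) << 24);
}

// Target of a 5-byte x86 near jmp/call (e9/e8 + rel32):
// address of the next instruction plus the signed displacement.
// Unsigned 32-bit wrap-around yields the same result as signed addition.
constexpr std::uint32_t rel32_target(std::uint32_t insn_addr, std::uint32_t rel32) {
    return insn_addr + 5u + rel32;
}

// 00386580 e9ab390000 jmp 00389f30
static_assert(rel32_target(0x00386580u, le32(0xab, 0x39, 0x00, 0x00)) == 0x00389f30u,
              "thunk jump target should match the !U output");
```

The same arithmetic works for backward calls, where the displacement is negative and the unsigned addition simply wraps.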
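After this round of decoding, the target does turn out to be a managed callback function. At this point I was really pleased and felt the answer was close. A careful look through the code confirmed that it creates a timer event through the Windows thread pool, as the screenshots below show:
[Image]
[Image]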
Now everything is clear: before the process exits, the C# Dispose() method must be called first so that the unmanaged Timer is shut down. Otherwise this kind of sporadic crash will occur.
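The ordering problem can be illustrated with a small portable model. This is only a sketch under my own assumptions, not my friend's code and not a Windows or CLR API: PeriodicTimer is a hypothetical stand-in for the native timer, FakeRuntime stands in for the CLR, and its shutting_down flag plays the role of g_fForbidEnterEE. The point is that stop() both cancels the timer and waits for any in-flight callback before the runtime is flagged as shutting down.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for the runtime; shutting_down mimics g_fForbidEnterEE.
struct FakeRuntime {
    std::atomic<bool> shutting_down{false};
    std::atomic<int> accepted_calls{0};
    std::atomic<int> rejected_calls{0};  // the real CLR raises an exception instead

    void enter_managed_code() {
        if (shutting_down.load()) { ++rejected_calls; return; }
        ++accepted_calls;
    }
};

// Hypothetical native periodic timer that fires a callback on its own thread.
class PeriodicTimer {
public:
    PeriodicTimer(std::chrono::milliseconds period, std::function<void()> callback)
        : period_(period), callback_(std::move(callback)), worker_([this] { run(); }) {}

    // Plays the role of Dispose(): stop scheduling and wait for in-flight callbacks.
    void stop() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        if (worker_.joinable()) worker_.join();
    }

    ~PeriodicTimer() { stop(); }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        // wait_for returns false on timeout (fire the callback), true once stopping_.
        while (!wake_.wait_for(lock, period_, [this] { return stopping_; })) {
            lock.unlock();
            callback_();
            lock.lock();
        }
    }

    std::chrono::milliseconds period_;
    std::function<void()> callback_;
    std::mutex mutex_;
    std::condition_variable wake_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so it starts after the other members exist
};

// Orderly exit: stop the timer first, then flag the runtime as shutting down.
inline int run_and_shut_down_cleanly() {
    FakeRuntime runtime;
    PeriodicTimer timer(std::chrono::milliseconds(1),
                        [&runtime] { runtime.enter_managed_code(); });
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    timer.stop();                       // no callback can run after this returns
    runtime.shutting_down.store(true);  // only now is teardown safe
    return runtime.rejected_calls.load();
}
```

Swapping the last two steps, setting the flag before stop(), is the situation in the dump: a callback can still arrive after shutdown has started and gets turned away.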
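3. Some side notes
The error codes in this dump are very misleading. One is the outer c0020001 and the other is the inner 8007042Bh. Searching for the inner 8007042Bh in particular will lead you astray, with advice such as repairing system files. In fact it is just a hard-coded constant and tells you nothing about the real cause. See the assembly below.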
0:000> ub 70f63bf5
clr!UMThunkStubRareDisableWorker+0x7:
70f63bd7 c9 leave
70f63bd8 e8d47fc3ff call clr!CanRunManagedCode (70b9bbb1)
70f63bdd 8b7508 mov esi,dword ptr [ebp+8]
70f63be0 85c0 test eax,eax
70f63be2 7511 jne clr!UMThunkStubRareDisableWorker+0x25 (70f63bf5)
70f63be4 b92b040780 mov ecx,8007042Bh
70f63be9 c7460800000000 mov dword ptr [esi+8],0
70f63bf0 e8f721f1ff call clr!COMPlusThrowBoot (70e75dec)
So let the code do the talking, and rely less on hearsay that leads you down the wrong path.
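To see why 8007042Bh is only a constant, it helps to split it into its HRESULT fields. The sketch below mirrors what the HRESULT_FACILITY, HRESULT_CODE and HRESULT_FROM_WIN32 macros do; the function names are mine. As far as I know, Win32 error 1067 is ERROR_PROCESS_ABORTED, and the clr defines E_PROCESS_SHUTDOWN_REENTRY from it, which would explain why a web search for this value turns up unrelated service and system-file advice.

```cpp
#include <cassert>
#include <cstdint>

// The top bit of an HRESULT is the severity (1 = failure).
constexpr bool hr_failed(std::uint32_t hr) { return (hr >> 31) != 0; }

// The facility sits above the low 16 bits; 7 is FACILITY_WIN32.
constexpr std::uint32_t hr_facility(std::uint32_t hr) { return (hr >> 16) & 0x1FFFu; }

// The low 16 bits carry the error code itself.
constexpr std::uint32_t hr_code(std::uint32_t hr) { return hr & 0xFFFFu; }

// Same mapping as HRESULT_FROM_WIN32 for a non-zero Win32 error code.
constexpr std::uint32_t hresult_from_win32(std::uint32_t win32_error) {
    return 0x80000000u | (7u << 16) | (win32_error & 0xFFFFu);
}

// Parameter[0] from the exception record: a Win32 facility wrapping error 1067 (0x42B).
static_assert(hr_facility(0x8007042Bu) == 7u, "FACILITY_WIN32");
static_assert(hr_code(0x8007042Bu) == 1067u, "Win32 error 1067");
static_assert(hresult_from_win32(1067u) == 0x8007042Bu, "round trip");
```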
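III. Summary
To be honest, this dump is fairly hard to analyze. You need a basic understanding of the Windows thread pool and of the clr source implementation, otherwise it is very difficult to build a complete chain of evidence.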