記一次 .NET某工業(yè)視覺(jué)軟件崩潰分析
一、背景
1. 講故事
前兩天給訓(xùn)練營(yíng)里的一位學(xué)員分析了一個(gè)dump,學(xué)員因?yàn)榕艘徽煲矝](méi)找到禍根,被我一下子弄出來(lái)了,極度想看看我是怎么分析的?由于在微信上不能一言?xún)烧Z(yǔ)表盡,干脆寫(xiě)一篇文章出來(lái)詳細(xì)的講講吧,哈哈,訓(xùn)練營(yíng)里的學(xué)員得有求必應(yīng)哈。。。話(huà)不多說(shuō),我們一起探索下這個(gè)程序的崩潰之路吧。
二、WinDbg分析
1. 為什么會(huì)崩潰
這個(gè)比較簡(jiǎn)單,因?yàn)槟J(rèn)會(huì)自動(dòng)定位到崩潰的線(xiàn)程,使用 .ecxr 切到崩潰前的上下文即可,然后使用 k 觀(guān)察崩潰點(diǎn),參考如下:
0:044> .ecxr
rax=0000000000000000 rbx=00000217a6c78f90 rcx=0000000000000000
rdx=0000000000000000 rsi=000000242c46fdd0 rdi=00000217ff1594a0
rip=00007ffdb35eb699 rsp=000000242c46fb40 rbp=000000242c46fc70
r8=00007ffdb35eb699 r9=00000217ea61fc08 r10=00000000e0434352
r11=0000000000000001 r12=00000217a6c78f90 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei pl nz na pe nc
cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000202
KERNELBASE!RaiseException+0x69:
00007ffd`b35eb699 0f1f440000 nop dword ptr [rax+rax]
0:044> k
*** Stack trace for last set context - .thread/.cxr resets it
# Child-SP RetAddr Call Site
00 00000024`2c46fb40 00007ffd`51bf8c6f KERNELBASE!RaiseException+0x69
01 00000024`2c46fc20 00007ffd`51890108 clr!UMThunkStubRareDisableWorker+0x3f
02 00000024`2c46fc50 00007ffc`dc826748 clr!UMThunkStub+0x128
03 00000024`2c46fce0 00007ffc`dc827406 MVCAMSDK_X64+0x6748
04 00000024`2c46fd80 00007ffc`dc9d21eb MVCAMSDK_X64!CameraGrabber_GetCameraDevInfo+0xb96
05 00000024`2c46fe50 00007ffc`dc9d227f MVCAMSDK_X64!CameraGigeEnumerateDevice+0x167f2b
06 00000024`2c46fe80 00007ffd`b5a07374 MVCAMSDK_X64!CameraGigeEnumerateDevice+0x167fbf
07 00000024`2c46feb0 00007ffd`b5ebcc91 kernel32!BaseThreadInitThunk+0x14
08 00000024`2c46fee0 00000000`00000000 ntdll!RtlUserThreadStart+0x21
從卦中可以看到程序是在clr 的 UMThunkStubRareDisableWorker 函數(shù)中故意拋出的異常,接下來(lái)觀(guān)察下該函數(shù)代碼。
extern "C" VOID __stdcall UMThunkStubRareDisableWorker(Thread* pThread, UMEntryThunk* pUMEntryThunk, Frame* pFrame)
{
if (!CanRunManagedCode())
{
pThread->m_fPreemptiveGCDisabled = 0;
COMPlusThrowBoot(E_PROCESS_SHUTDOWN_REENTRY);
}
}
BOOL CanRunManagedCode(BOOL fCannotRunIsUserError, HINSTANCE hInst)
{
if (g_fForbidEnterEE == TRUE)
return FALSE;
if ((g_fEEShutDown & ShutDown_Finalize2) && !GCHeap::GetGCHeap()->IsCurrentThreadFinalizer())
return FALSE;
if (g_pPreallocatedOutOfMemoryException == NULL)
return FALSE;
return TRUE;
}
從卦中的代碼看是因?yàn)?CanRunManagedCode() 返回false導(dǎo)致的,到內(nèi)存中觀(guān)察這幾個(gè)變量即可。
0:044> dp clr!g_fForbidEnterEE L1
00007ffd`5201d548 00000000`00000000
0:044> dp clr!g_fEEShutDown L1
00007ffd`520140b0 00000000`00000007
0:044> dp clr!g_pPreallocatedOutOfMemoryException L1
00007ffd`52013e08 00000217`e4f513d8
看到卦中的幾個(gè)變量大概就知道了,原來(lái)當(dāng)前程序正在等待 終結(jié)器線(xiàn)程 執(zhí)行完畢的 EEShutDown 狀態(tài)。
2. 真的處于關(guān)閉狀態(tài)嗎
剛才只是通過(guò)變量觀(guān)察,接下來(lái)我們觀(guān)察線(xiàn)程棧來(lái)進(jìn)一步求證,分別切到主線(xiàn)程和終結(jié)器線(xiàn)程觀(guān)察調(diào)用棧,參考如下:
0:000> ~0s;k;~2s;k
win32u!NtUserMsgWaitForMultipleObjectsEx+0x14:
00007ffd`b3dfa104 c3 ret
# Child-SP RetAddr Call Site
00 00000024`2773f578 00007ffd`b3f40c8e win32u!NtUserMsgWaitForMultipleObjectsEx+0x14
...
08 00000024`2773f9c0 00007ffd`518828ff clr!CLREventBase::WaitEx+0xab
09 00000024`2773fa30 00007ffd`51882842 clr!WaitForEndOfShutdown_OneIteration+0xb3
0a 00000024`2773fac0 00007ffd`5171f703 clr!WaitForEndOfShutdown+0x1a
0b 00000024`2773faf0 00007ffd`518243f9 clr!EEShutDown+0xd3
0c 00000024`2773fb40 00007ffd`51823f34 clr!HandleExitProcessHelper+0x29
0d 00000024`2773fb70 00007ffd`51823e14 clr!_CorExeMainInternal+0xf8
0e 00000024`2773fc00 00007ffd`96d7d6ea clr!CorExeMain+0x14
0f 00000024`2773fc40 00007ffd`96c3ac42 mscoreei!CorExeMain+0xfa
10 00000024`2773fca0 00007ffd`b5a07374 mscoree!CorExeMain_Exported+0x72
11 00000024`2773fcd0 00007ffd`b5ebcc91 kernel32!BaseThreadInitThunk+0x14
12 00000024`2773fd00 00000000`00000000 ntdll!RtlUserThreadStart+0x21
paddle_inference_c!paddle::CreatePaddlePredictor<paddle::AnalysisConfig,3>+0x180be:
00007ffc`d39a1a9e 488b01 mov rax,qword ptr [rcx] ds:00000217`89df37b0=00007ffcd7791ff0
# Child-SP RetAddr Call Site
00 00000024`27eff080 00007ffc`d39a1a8f paddle_inference_c!paddle::CreatePaddlePredictor<paddle::AnalysisConfig,3>+0x180be
01 00000024`27eff0b0 00007ffc`d39a1a8f paddle_inference_c!paddle::CreatePaddlePredictor<paddle::AnalysisConfig,3>+0x180af
02 00000024`27eff0e0 00007ffc`d39a1a8f paddle_inference_c!paddle::CreatePaddlePredictor<paddle::AnalysisConfig,3>+0x180af
...
0e 00000024`27eff3b0 00007ffc`f20cbbde paddle_inference_c!PD_PredictorDestroy+0x39
0f 00000024`27eff3e0 00007ffc`f943796a 0x00007ffc`f20cbbde
10 00000024`27eff4b0 00007ffc`f9437930 Sdcb_PaddleInference!Sdcb.PaddleInference.PaddlePredictor.Dispose+0x1a
11 00000024`27eff4e0 00007ffd`518915e6 Sdcb_PaddleInference!Sdcb.PaddleInference.PaddlePredictor.Finalize+0x10
12 00000024`27eff510 00007ffd`5173b6c6 clr!FastCallFinalizeWorker+0x6
13 00000024`27eff540 00007ffd`5173bd73 clr!FastCallFinalize+0x5a
14 00000024`27eff580 00007ffd`5173bcae clr!MethodTable::CallFinalizer+0xb7
15 00000024`27eff5c0 00007ffd`5173b8e7 clr!CallFinalizer+0x5e
...
1e 00000024`27effa50 00000000`00000000 ntdll!RtlUserThreadStart+0x21
其實(shí)到這里邏輯全部清楚了,用戶(hù)點(diǎn)擊了關(guān)閉程序,但此時(shí)終結(jié)器線(xiàn)程正在執(zhí)行,所以執(zhí)行引擎需要等待它執(zhí)行完,在這期間是禁止除終結(jié)器線(xiàn)程之外的任何線(xiàn)程再對(duì)托管方法的調(diào)用,一旦有就會(huì)拋出異常,畫(huà)個(gè)簡(jiǎn)圖如下:
圖片
3. call托管方法在哪里
在錯(cuò)誤的時(shí)間出現(xiàn)了一個(gè)對(duì)的人,這是一種意難平,接下來(lái)把這個(gè)對(duì)的人給找出來(lái),即 UMThunkStubRareDisableWorker方法的第二個(gè)參數(shù)pUMEntryThunk,因?yàn)檫@里面的 m_pManagedTarget 即托管方法的入口點(diǎn),模型如下:
class UMEntryThunk
{
private:
// The start of the managed code
const BYTE* m_pManagedTarget;
// This is used for profiling.
PTR_MethodDesc m_pMD;
}
那如何挖呢?觀(guān)察UMThunkStubRareDisableWorker方法匯編代碼即可。
0:044> k
*** Stack trace for last set context - .thread/.cxr resets it
# Child-SP RetAddr Call Site
00 00000024`2c46fb40 00007ffd`51bf8c6f KERNELBASE!RaiseException+0x69
01 00000024`2c46fc20 00007ffd`51890108 clr!UMThunkStubRareDisableWorker+0x3f
02 00000024`2c46fc50 00007ffc`dc826748 clr!UMThunkStub+0x128
03 00000024`2c46fce0 00007ffc`dc827406 MVCAMSDK_X64+0x6748
04 00000024`2c46fd80 00007ffc`dc9d21eb MVCAMSDK_X64!CameraGrabber_GetCameraDevInfo+0xb96
05 00000024`2c46fe50 00007ffc`dc9d227f MVCAMSDK_X64!CameraGigeEnumerateDevice+0x167f2b
06 00000024`2c46fe80 00007ffd`b5a07374 MVCAMSDK_X64!CameraGigeEnumerateDevice+0x167fbf
07 00000024`2c46feb0 00007ffd`b5ebcc91 kernel32!BaseThreadInitThunk+0x14
08 00000024`2c46fee0 00000000`00000000 ntdll!RtlUserThreadStart+0x21
0:044> uf 00007ffd`51890108
00007ffd`5188ffe0 4154 push r12
00007ffd`5188ffe2 55 push rbp
00007ffd`5188ffe3 4883ec78 sub rsp,78h
00007ffd`5188ffe7 488d6c2420 lea rbp,[rsp+20h]
00007ffd`5188ffec c6454000 mov byte ptr [rbp+40h],0
00007ffd`5188fff0 e8abe0ffff call clr!GetThread (00007ffd`5188e0a0)
00007ffd`5188fff5 4885c0 test rax,rax
00007ffd`5188fff8 746f je clr!UMThunkStub+0x89 (00007ffd`51890069) Branch
...
00007ffd`518900f9 4c895548 mov qword ptr [rbp+48h],r10
00007ffd`518900fd 498bcc mov rcx,r12
00007ffd`51890100 498bd2 mov rdx,r10
00007ffd`51890103 e8288b3600 call clr!UMThunkStubRareDisableWorker (00007ffd`51bf8c30)
根據(jù)x64調(diào)用協(xié)定,參數(shù)是放在 r10 寄存器中,而r10剛好放在 rbp+48h 中,所以挖出這個(gè)棧位置即可,簡(jiǎn)單計(jì)算為。
0:044> ? 00000024`2c46fce0-0x8-0x8-0x8-0x78+0x20
Evaluate expression: 155361672304 = 00000024`2c46fc70
0:044> dp 00000024`2c46fc70+0x48 L1
00000024`2c46fcb8 00000217`ff1594a0
0:044> dp 00000217`ff1594a0 L1
00000217`ff1594a0 00007ffc`f2f49380
0:044> !U 00007ffc`f2f49380
Unmanaged code
00007ffc`f2f49380 e98bbd0500 jmp 00007ffc`f2fa5110
00007ffc`f2f49385 5f pop rdi
00007ffc`f2f49386 0f ???
00007ffc`f2f49387 25e8e35194 and eax,9451E3E8h
00007ffc`f2f4938c 5e pop rsi
00007ffc`f2f4938d 5e pop rsi
00007ffc`f2f4938e 1124e8 adc dword ptr [rax+rbp*8],esp
00007ffc`f2f49391 db5194 fist dword ptr [rcx-6Ch]
00007ffc`f2f49394 5e pop rsi
00007ffc`f2f49395 5e pop rsi
0:044> !ip2md 00007ffc`f2fa5110
MethodDesc: 00007ffcf33bd9e8
Method Name: baslerCamera.MVCam.CameraGrabberFrameCallback(IntPtr, IntPtr, baslerCamera.tSdkFrameHead ByRef, IntPtr)
Class: 00007ffcf33c1c98
MethodTable: 00007ffcf33be828
mdToken: 0000000006000067
Module: 00007ffcf2352f60
IsJitted: yes
CodeAddr: 00007ffcf2fa5110
Transparency: Critical
從 baslerCamera 來(lái)看是一個(gè)工業(yè)相機(jī),接下來(lái)趕緊尋找源代碼,截圖如下:
圖片
圖中的 m_FrameCallback 是一個(gè)delegate函數(shù),并通過(guò) CameraGrabber_SetRGBCallback 方法傳給了C++,所以在關(guān)閉程序的時(shí)候一定要先釋放掉這個(gè) calllback 來(lái)和C++撇清關(guān)系,否則就會(huì)有災(zāi)難的降臨。
三、總結(jié)
在我的dump分析之旅中曾遇到過(guò)一次相似的生產(chǎn)故障,本篇文章主要還是對(duì)訓(xùn)練營(yíng)里這位朋友的有求必應(yīng)吧,不過(guò)說(shuō)實(shí)話(huà),多分析幾個(gè)像這樣的dump,會(huì)極大的提升你的高級(jí)調(diào)試能力。