記一次生產(chǎn)環(huán)境卡頓優(yōu)化過程:大事務并發(fā)回滾
概述
最近生產(chǎn)環(huán)境有這么個現(xiàn)象,平時的訂單調(diào)度只需要2s內(nèi)可以出結果,但是多個人調(diào)度就會卡住,超過15分鐘都沒有結果出來,有時還會失敗然后導致數(shù)據(jù)不準確。

下面記錄一下生產(chǎn)環(huán)境卡頓時排查的過程。
1、獲取ASH報告
- SQL> @?/rdbms/admin/ashrpt.sql
- --To specify absolute begin time:
- --[MM/DD/YY]] HH24:MI[:SS]
- --08/09/19 08:40:00




2、ASH分析
(1)Top User Events

(2)相關sql
Top SQL with Top Events

sql明細

(3)存儲過程

(4)TOP sessions

從上面分析可以看到兩個明顯的等待事件:wait for stopper event to be increased 等待事件和wait for a undo record 等待事件,這個應該是批量任務調(diào)度的時候產(chǎn)生了大量的大事務,產(chǎn)生了一些回滾造成了嚴重的資源消耗
3、處理大事務并發(fā)回滾
一般情況下wait for stopper event to be increased 等待事件是跟wait for a undo record 等待事件聯(lián)系起來的。
對于這個等待事件metalink上面有一篇文檔
- 464246.1
- Sometimes Parallel Rollback of Large Transaction may become very slow. After killing a large running transaction
- (either by killing the shadow process or aborting the database) then database seems to hang, or smon and parallel query servers
- taking all the available cpu.
- In fast-start parallel rollback, the background process Smon acts as a coordinator and rolls back a set of transactions in parallel
- using multiple server processes. Fast start parallel rollback is mainly useful when a system has transactions that run a long time
- before comitting, especially parallel Inserts, Updates, Deletes operations. When Smon discovers that the amount of recovery work is
- above a certain threshold, it automatically begins parallel rollback by dispersing the work among several parallel processes.
- There are cases where parallel transaction recovery is not as fast as serial transaction recovery, because the pq slaves are interfering
- with each other. It looks like the changes made by this transaction cannot be recovered in parallel without causing a performance problem.
- The parallel rollback slave processes are most likely contending for the same resource, which results in even worse rollback performance
- compared to a serial rollback.
解決的辦法:
- --關掉并發(fā)回滾,變成串行回滾(直接重啟解決)
- sql> alter system set fast_start_parallel_rollback = false scope=spfile;

通常,如果有很多并發(fā)進程,可以根據(jù)v$px_session視圖去查看,查看v$px_session視圖,發(fā)現(xiàn)所有的并發(fā)進程都是由smon進程導致(即qcsid列為smon進程的session id)
而smon進程的等待事件為wait for stopper event to be increased
即smon進程在做大事務的回滾,默認參數(shù)fast_start_parallel_rollback參數(shù)為low,即回滾時會啟動2*CPU個數(shù) 個并發(fā)進程。而由于是使用并發(fā),所以可能由于并發(fā)之間相互使用共同的資源,導致回滾速度更慢。因為是生產(chǎn)環(huán)境,不能隨便重啟,所以我用了下面的方法來修改這個參數(shù):
(1)查找smon進程ID
- select pid,spid,pname,username,tracefile from v$process where pname='SMON'

(2)禁用smon進程的事務清理(Disable SMON transaction cleanup)
- oradebug setorapid 'SMON's Oracle PID';
- oradebug event 10513 trace name context forever, level 2

(3)查詢V$FAST_START_SERVERS視圖,將所有smon啟用的并發(fā)進程殺掉

(4)修改fast_start_parallel_rollback參數(shù)
- alter system set fast_start_parallel_rollback=false;
(5)啟用smon進程的事務清理(enable transaction recovery)
- oradebug setorapid 'SMON's Oracle PID';
- oradebug event 10513 trace name context off
(6)獲得tracefile name
- oradebug tracefile_name

(7)驗證

4、業(yè)務驗證
修改后去業(yè)務驗證,到高峰期還是有卡頓現(xiàn)象,不過頻率減少了很多,報錯之類的也沒有了,同時觀察新的報告可以發(fā)現(xiàn)并發(fā)回滾之類的等待事件已經(jīng)沒有了。