
RAS: Intel MCA-CMCI, Are You Familiar with It?

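CMCI is an enhancement to MCA. It is mainly used to report hardware faults such as CE and UCNA to software through an interrupt. On receiving the interrupt, software runs the handler intel_threshold_interrupt(), which records the error information to /dev/mcelog in either irq mode or poll mode; user space can then read the hardware fault information from /dev/mcelog.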

Corrected machine-check error interrupt (CMCI) is an enhancement to MCA that provides threshold-based error reporting. In this mode, software programs a threshold for corrected MC errors; once the number of CEs (Corrected Errors) seen by the hardware reaches that threshold, an interrupt is raised to notify software.

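It is worth noting that CMCI was added to MCA later; at first, CE information could only be obtained by software polling. The advantage of CMCI interrupt notification is that, with a threshold of 1, every CE passes through the IRQ handler and none is lost, whereas polling may lose CEs because the polling frequency is low or storage space is limited. That does not make CMCI the best choice in every case: its drawback is that a large number of CEs produces an interrupt storm that hurts machine performance. Unfortunately, CE storms are fairly common on cloud servers. How do current Intel servers deal with this? That is covered below.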

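CMCI Mechanism

CMCI is disabled by default. Software first checks IA32_MCG_CAP[10] (MCG_CMCI_P), a read-only capability bit that reports whether the processor supports CMCI.

Software then uses the IA32_MCi_CTL2 MSR to enable or disable CMCI for the corresponding bank.

The threshold is set through IA32_MCi_CTL2 bits 14:0. If a non-zero value is written, the configured threshold is used; if the bank does not support CMCI, the field reads back as all zeros.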
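As a rough illustration of the IA32_MCi_CTL2 layout described above, the following userspace C sketch packs and unpacks the CMCI enable bit and the threshold field. The bit positions (bit 30 for enable, bits 14:0 for the threshold) mirror the kernel macros MCI_CTL2_CMCI_EN and MCI_CTL2_CMCI_THRESHOLD_MASK. The helper names ctl2_set_threshold, ctl2_threshold and ctl2_cmci_enabled are invented for this example, and no MSR is actually read or written.

```c
#include <stdbool.h>
#include <stdint.h>

/* Field layout of IA32_MCi_CTL2 (values mirror the Linux kernel macros). */
#define MCI_CTL2_CMCI_EN             (1ULL << 30)  /* bit 30: CMCI enable */
#define MCI_CTL2_CMCI_THRESHOLD_MASK 0x7fffULL     /* bits 14:0: threshold */

/* Hypothetical helper: return val with CMCI enabled and the given threshold. */
uint64_t ctl2_set_threshold(uint64_t val, uint16_t threshold)
{
    val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
    val |= (uint64_t)threshold & MCI_CTL2_CMCI_THRESHOLD_MASK;
    return val | MCI_CTL2_CMCI_EN;
}

/* Hypothetical helper: extract the 15-bit threshold field. */
uint16_t ctl2_threshold(uint64_t val)
{
    return (uint16_t)(val & MCI_CTL2_CMCI_THRESHOLD_MASK);
}

/* Hypothetical helper: is the CMCI enable bit set? */
bool ctl2_cmci_enabled(uint64_t val)
{
    return (val & MCI_CTL2_CMCI_EN) != 0;
}
```

For instance, ctl2_set_threshold(0, 1) yields 0x40000001: CMCI enabled with a threshold of 1.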

The CMCI mechanism is shown in the figure below.

[Figure: CMCI mechanism]

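The hardware compares IA32_MCi_CTL2 bits 14:0 with IA32_MCi_STATUS bits 52:38 (the corrected error count). When the two values are equal, an overflow event is sent to the CMCI LVT entry of the APIC. If an MC error involves multiple processors, the CMCI interrupt is delivered to all of them at the same time; for example, if a cache shared by two CPUs sees a CE, both CPUs receive a CMCI.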
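The comparison can be modeled in a few lines of userspace C. This is only a sketch of the rule as stated above, not of the hardware itself: the function names status_ce_count and cmci_threshold_reached are hypothetical, and treating a zero threshold as "never fires" is an assumption made for the model.

```c
#include <stdbool.h>
#include <stdint.h>

#define CTL2_THRESHOLD_MASK   0x7fffULL  /* IA32_MCi_CTL2[14:0] */
#define STATUS_CE_COUNT_SHIFT 38         /* IA32_MCi_STATUS[52:38] */
#define STATUS_CE_COUNT_MASK  0x7fffULL  /* 15-bit corrected error count */

/* Hypothetical helper: extract the corrected error count from a status value. */
uint16_t status_ce_count(uint64_t status)
{
    return (uint16_t)((status >> STATUS_CE_COUNT_SHIFT) & STATUS_CE_COUNT_MASK);
}

/*
 * Model of the overflow condition: the count equals the programmed threshold.
 * Assumption for this sketch: a threshold of 0 never triggers.
 */
bool cmci_threshold_reached(uint64_t ctl2, uint64_t status)
{
    uint16_t threshold = (uint16_t)(ctl2 & CTL2_THRESHOLD_MASK);

    return threshold != 0 && status_ce_count(status) == threshold;
}
```

With a threshold of 3, a status value whose count field holds 2 does not trigger, while a count of 3 does.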

CMCI Initialization

Taking Linux v6.3 as an example, the kernel enables CMCI with the following code:

C++
/* arch/x86/kernel/cpu/mce/intel.c */
void intel_init_cmci(void)
{
        int banks;

        if (!cmci_supported(&banks))
                return;

        mce_threshold_vector = intel_threshold_interrupt;
        cmci_discover(banks);
        /*
         * For CPU #0 this runs with still disabled APIC, but that's
         * ok because only the vector is set up. We still do another
         * check for the banks later for CPU #0 just to make sure
         * to not miss any events.
         */
        apic_write(APIC_LVTCMCI, THRESHOLD_APIC_VECTOR|APIC_DM_FIXED);
        cmci_recheck();
}

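1. cmci_supported() mainly does the following:

• Checks the kernel boot parameters "mce=no_cmci" and "mce=ignore_ce" to decide whether CMCI and CE reporting are turned on

• Checks whether the hardware supports CMCI (CPU vendor and local APIC checks)

• Reads the MCG_CMCI_P bit to determine whether the processor implements CMCI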
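The capability part of this check can be sketched in userspace C. The masks mirror the kernel's MCG_BANKCNT_MASK and MCG_CMCI_P; struct cmci_options and cmci_supported_model are hypothetical names for this example, the IA32_MCG_CAP value is passed in rather than read from the MSR, and the vendor check, the APIC check and the clamp to the kernel's maximum bank count are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCG_BANKCNT_MASK 0xffULL       /* IA32_MCG_CAP[7:0]: number of banks */
#define MCG_CMCI_P       (1ULL << 10)  /* IA32_MCG_CAP[10]: CMCI present */

/* Hypothetical stand-in for the mce=no_cmci and mce=ignore_ce boot options. */
struct cmci_options {
    bool cmci_disabled;
    bool ignore_ce;
};

/*
 * Simplified model of cmci_supported(): honours the two boot options,
 * reports the bank count, and returns whether MCG_CMCI_P is set.
 */
bool cmci_supported_model(uint64_t mcg_cap, struct cmci_options opts,
                          unsigned *banks)
{
    if (opts.cmci_disabled || opts.ignore_ce)
        return false;

    *banks = (unsigned)(mcg_cap & MCG_BANKCNT_MASK);
    return (mcg_cap & MCG_CMCI_P) != 0;
}
```

In this model, a capability value with bit 10 set and a bank count of 20 is reported as supported with 20 banks, unless one of the two options is set.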

2. mce_threshold_vector = intel_threshold_interrupt; installs intel_threshold_interrupt() as the CMCI interrupt handler.

3. cmci_discover() mainly does the following:

• Iterates over all banks and enables CMCI on each of them by programming the IA32_MCi_CTL2 register;

C++
                rdmsrl(MSR_IA32_MCx_CTL2(i), val);
                ...
                val |= MCI_CTL2_CMCI_EN;
                wrmsrl(MSR_IA32_MCx_CTL2(i), val);
                rdmsrl(MSR_IA32_MCx_CTL2(i), val);

• Sets the CMCI threshold, with the following code:

C++
#define CMCI_THRESHOLD 1

                if (!mca_cfg.bios_cmci_threshold) {
                        val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
                        val |= CMCI_THRESHOLD;
                } else if (!(val & MCI_CTL2_CMCI_THRESHOLD_MASK)) {
                        /*
                         * If bios_cmci_threshold boot option was specified
                         * but the threshold is zero, we'll try to initialize
                         * it to 1.
                         */
                        bios_zero_thresh = 1;
                        val |= CMCI_THRESHOLD;
                }

If the user did not pass the boot parameter "mce=bios_cmci_threshold", the threshold field of val is set to CMCI_THRESHOLD, i.e. 1.

If "mce=bios_cmci_threshold" was passed, the BIOS is expected to have programmed the threshold already. When val & MCI_CTL2_CMCI_THRESHOLD_MASK is non-zero, the else if branch is skipped and the BIOS value is kept; when the BIOS left the field at zero, the driver initializes the threshold to 1.
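The three cases can be exercised with a small userspace model of the same branch. The function name choose_threshold and its parameters are hypothetical; the body follows the kernel snippet quoted above, with the boot option passed in as a plain boolean.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCI_CTL2_CMCI_THRESHOLD_MASK 0x7fffULL
#define CMCI_THRESHOLD               1

/*
 * Hypothetical model of the threshold selection in cmci_discover().
 * bios_cmci_threshold: true when "mce=bios_cmci_threshold" was given.
 * *bios_zero_thresh is set when the BIOS was trusted but left a zero.
 */
uint64_t choose_threshold(uint64_t val, bool bios_cmci_threshold,
                          bool *bios_zero_thresh)
{
    *bios_zero_thresh = false;

    if (!bios_cmci_threshold) {
        /* Default: discard whatever is there and use a threshold of 1. */
        val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
        val |= CMCI_THRESHOLD;
    } else if (!(val & MCI_CTL2_CMCI_THRESHOLD_MASK)) {
        /* BIOS value requested but zero: fall back to 1. */
        *bios_zero_thresh = true;
        val |= CMCI_THRESHOLD;
    }
    /* Otherwise the non-zero BIOS threshold is kept unchanged. */
    return val;
}
```

Given a BIOS-programmed threshold of 0x10, the model returns 1 without the option and 0x10 with it; given a zero field with the option set, it returns 1 and raises the bios_zero_thresh flag.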

4. cmci_recheck()

cmci_recheck() calls machine_check_poll() to check whether any error events were missed on CPU #0.

CMCI Handling

The CMCI interrupt handler is intel_threshold_interrupt(), defined in arch/x86/kernel/cpu/mce/intel.c:

C++
/*
 * The interrupt handler. This is called on every event.
 * Just call the poller directly to log any events.
 * This could in theory increase the threshold under high load,
 * but doesn't for now.
 */
static void intel_threshold_interrupt(void)
{
        if (cmci_storm_detect())
                return;

        machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned));
}

1. cmci_storm_detect() handles CMCI storms. The code is as follows:

C++
static bool cmci_storm_detect(void)
{
        unsigned int cnt = __this_cpu_read(cmci_storm_cnt);
        unsigned long ts = __this_cpu_read(cmci_time_stamp);
        unsigned long now = jiffies;
        int r;

        if (__this_cpu_read(cmci_storm_state) != CMCI_STORM_NONE)
                return true;

        if (time_before_eq(now, ts + CMCI_STORM_INTERVAL)) {
                cnt++;
        } else {
                cnt = 1;
                __this_cpu_write(cmci_time_stamp, now);
        }
        __this_cpu_write(cmci_storm_cnt, cnt);

        if (cnt <= CMCI_STORM_THRESHOLD)
                return false;

        cmci_toggle_interrupt_mode(false);
        __this_cpu_write(cmci_storm_state, CMCI_STORM_ACTIVE);
        r = atomic_add_return(1, &cmci_storm_on_cpus);
        mce_timer_kick(CMCI_STORM_INTERVAL);
        this_cpu_write(cmci_backoff_cnt, INITIAL_CHECK_INTERVAL);

        if (r == 1)
                pr_notice("CMCI storm detected: switching to poll mode\n");
        return true;
}

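Using jiffies, this function checks whether the number of CMCIs seen within a fixed interval (CMCI_STORM_INTERVAL) exceeds CMCI_STORM_THRESHOLD (15). If not, it returns false and the interrupt is handled normally. Otherwise a CMCI storm is in progress: the function calls cmci_toggle_interrupt_mode() to turn CMCI off and switches to poll mode, so that events are collected by polling.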
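To make the counting rule concrete, here is a userspace model of the windowed counter. It is a sketch under stated assumptions: time is passed in as a tick value instead of being read from jiffies, the interval is taken to be one second's worth of ticks with an assumed tick rate of 1000, the per-CPU variables are collapsed into one struct, and struct storm_state and storm_detect are names invented for this example.

```c
#include <stdbool.h>

#define HZ                   1000u  /* assumed tick rate for this model */
#define CMCI_STORM_INTERVAL  HZ     /* window length: one second of ticks */
#define CMCI_STORM_THRESHOLD 15u

/* Hypothetical stand-in for the kernel's per-CPU storm variables. */
struct storm_state {
    unsigned long window_start;  /* tick at which the current window began */
    unsigned int  count;         /* interrupts seen in the current window */
    bool          storm_active;  /* true once poll mode has been entered */
};

/* Returns true if the caller should skip interrupt-driven logging. */
bool storm_detect(struct storm_state *s, unsigned long now)
{
    if (s->storm_active)
        return true;

    if (now - s->window_start <= CMCI_STORM_INTERVAL) {
        s->count++;              /* still inside the current window */
    } else {
        s->count = 1;            /* window expired: start a new one */
        s->window_start = now;
    }

    if (s->count <= CMCI_STORM_THRESHOLD)
        return false;

    /* The kernel would disable CMCI and kick the poll timer here. */
    s->storm_active = true;
    return true;
}
```

Starting from a zeroed state, fifteen calls inside one window return false and the sixteenth returns true; calls spaced further apart than the interval never trip the detector.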

2. When there is no CMCI storm, machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned)) fetches and records the fault information.

The first parameter is defined as follows; MCP_TIMESTAMP means that the current TSC is recorded.


C++
enum mcp_flags {
        MCP_TIMESTAMP   = BIT(0),       /* log time stamp */
        MCP_UC          = BIT(1),       /* log uncorrected errors */
        MCP_DONTLOG     = BIT(2),       /* only clear, don't log */
};

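machine_check_poll() reads the IA32_MCG_STATUS and IA32_MCi_STATUS registers together with related CPU information (the instruction pointer is deliberately not captured on this path, as a comment in the code notes), classifies the fault, and records CE events and other loggable event types to /dev/mcelog. Users can read /dev/mcelog to obtain the error records.

The flow is shown below, with the steps explained in the code comments: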

C++
bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
{
        /* ... */
        if (flags & MCP_TIMESTAMP)
                m.tsc = rdtsc();        /* record the current TSC */

        /* ... for each bank ... */

                /* Log CE errors */
                /* If this entry is not valid, ignore it */
                if (!(m.status & MCI_STATUS_VAL))
                        continue;

                /*
                 * If we are logging everything (at CPU online) or this
                 * is a corrected error, then we must log it.
                 */
                if ((flags & MCP_UC) || !(m.status & MCI_STATUS_UC))
                        goto log_it;

                /* Log UCNA errors */
                /*
                 * Log UCNA (SDM: 15.6.3 "UCR Error Classification")
                 * UC == 1 && PCC == 0 && S == 0
                 */
                if (!(m.status & MCI_STATUS_PCC) && !(m.status & MCI_STATUS_S))
                        goto log_it;

                /* ... */

                /* Record the fault information through mce_log() */
log_it:
                /*
                 * Don't get the IP here because it's unlikely to
                 * have anything to do with the actual error location.
                 */
                if (!(flags & MCP_DONTLOG) && !mca_cfg.dont_log_ce)
                        mce_log(&m);
                else if (mce_usable_address(&m)) {
                        /*
                         * Although we skipped logging this, we still want
                         * to take action. Add to the pool so the registered
                         * notifiers will see it.
                         */
                        if (!mce_gen_pool_add(&m))
                                mce_schedule_work();
                }
        }
        /* ... */
}
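The filtering in the excerpt can be restated as a pure function, which makes the CE and UCNA cases easy to check by hand. This sketch covers only the branches quoted above; the kernel function contains further checks that the excerpt elides. The bit positions follow the kernel's MCI_STATUS_* macros, and poll_would_log is a name invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* IA32_MCi_STATUS bits used by the excerpt. */
#define MCI_STATUS_VAL (1ULL << 63)  /* register holds a valid error */
#define MCI_STATUS_UC  (1ULL << 61)  /* error was not corrected */
#define MCI_STATUS_PCC (1ULL << 57)  /* processor context corrupt */
#define MCI_STATUS_S   (1ULL << 56)  /* signaled via machine check */

#define MCP_UC (1u << 1)             /* log uncorrected errors too */

/* Hypothetical model: would the quoted branches reach log_it? */
bool poll_would_log(uint64_t status, unsigned flags)
{
    if (!(status & MCI_STATUS_VAL))
        return false;                /* nothing valid in this bank */

    if ((flags & MCP_UC) || !(status & MCI_STATUS_UC))
        return true;                 /* logging everything, or a CE */

    if (!(status & MCI_STATUS_PCC) && !(status & MCI_STATUS_S))
        return true;                 /* UCNA: UC=1, PCC=0, S=0 */

    return false;                    /* not handled by these branches */
}
```

A status with only VAL set is a corrected error and is logged; VAL plus UC is a UCNA and is logged; adding S or PCC makes the function return false unless MCP_UC is passed.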

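To summarize: CMCI is an enhancement to MCA, mainly used to report hardware faults such as CE and UCNA to software through an interrupt. On receiving the interrupt, software runs the handler intel_threshold_interrupt(), which records the error information to /dev/mcelog in either irq mode or poll mode; user space can then read the hardware fault information from /dev/mcelog.

Reference: Intel® 64 and IA-32 Architectures Software Developer's Manual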

Editor: 武曉燕  Source: Linux閱碼場