The Process Load-Balancing Mechanism of the Linux Kernel

Author: 金慶輝 2019-04-10 13:43:19



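Overview

On a multi-core system, the process scheduler spreads the process load as evenly as possible across the CPUs in order to exploit their parallelism. In the implementation, choosing the target CPU for a migrating process means weighing not only the load on each CPU but also cache utilization. The case where process A wakes process B is also handled specially. This analysis is based on the CentOS kernel 3.10.0-975 source.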

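The SMP Load-Balancing Model

The Problem

If the only goal were to spread load evenly across the CPUs, scheduling domains would be unnecessary. But because of caches and NUMA memory, a process is best migrated to a CPU that is 'closer' to the one it ran on before.

Take the common Intel x86 as an example. The basic cache layout is shown in the figure below: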

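From the point of view of cache and memory access, suppose load balancing has to migrate process A to another CPU:

If the target CPU and A's previous CPU are on the same core of the same physical CPU (hyper-threading siblings), cache utilization is best: the data in L1, L2 and L3 is still 'hot'.
If the target CPU and A's previous CPU are on the same physical CPU but on different cores (multi-core), cache utilization is second best: L3 still holds 'hot' data.
If the target CPU and A's previous CPU are in the same NUMA node but on different physical CPUs (a multi-NUMA system), the cache is 'cold', but at least memory accesses stay within the local NUMA node.
If the target CPU and A's previous CPU are in different NUMA nodes, the cache is 'cold' and cross-node memory access carries a penalty; memory access is slowest in this case.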
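The four cases above amount to a ranking of migration targets. The following is a minimal sketch of that ranking, assuming each logical CPU can be described by a core id, a package (socket) id and a NUMA node id. The `cpu_topo` struct, the `migrate_distance` enum and `classify_migration()` are hypothetical names used only for illustration; they are not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of where a logical CPU sits in the machine. */
struct cpu_topo {
    int core_id;    /* physical core inside the package */
    int package_id; /* physical CPU (socket) */
    int node_id;    /* NUMA node */
};

/* Ordered from cheapest to most expensive migration. */
enum migrate_distance {
    DIST_SAME_CORE,    /* SMT siblings: L1/L2/L3 still hot */
    DIST_SAME_PACKAGE, /* other core, same socket: L3 still hot */
    DIST_SAME_NODE,    /* other socket, same node: cache cold, memory local */
    DIST_REMOTE_NODE   /* other node: cache cold, remote memory penalty */
};

/* Classify a migration by comparing the coarsest level first. */
static enum migrate_distance classify_migration(const struct cpu_topo *from,
                                                const struct cpu_topo *to)
{
    assert(from != NULL && to != NULL);

    if (from->node_id != to->node_id)
        return DIST_REMOTE_NODE;
    if (from->package_id != to->package_id)
        return DIST_SAME_NODE;
    if (from->core_id != to->core_id)
        return DIST_SAME_PACKAGE;
    return DIST_SAME_CORE;
}
```

A balancer that prefers the smallest enum value among its candidate CPUs follows the ordering described in the list.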

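SMP Organization

To make better use of the cache, the kernel organizes CPUs into scheduling domains. The unit is the logical CPU when hyper-threading is enabled, and the physical core otherwise.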

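Logical View

Assume a 2-socket machine whose CPUs each have 4 cores and 8 hardware threads (hyper-threading enabled). Logically, its scheduling domains look like the figure below:

Two NUMA nodes is the simplest case. With four NUMA nodes, the picture at the NUMA level becomes much more complicated, because cross-node access latency varies with the distance between nodes. This part is discussed last.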

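Layered View

All CPUs are organized into three levels: SMT, MC and NUMA. Each level covers all CPUs but partitions them at a different granularity. Scheduling domains are drawn according to cache and memory affinity, and the CPUs inside a domain are further divided into scheduling groups. Domains get smaller toward the bottom and larger toward the top. Load balancing is resolved inside the lowest-level domain whenever possible, which gives the best cache utilization.

From the layered view, the figure below shows how the scheduling domains are actually organized. Each level has per-CPU arrays that hold the scheduling domain and scheduling group of each CPU; their memory is allocated in advance at initialization. Two points are worth noting:

The domain structure of every CPU contains valid data. At the SMT level, for example, CPU0 and CPU1 have separate domain structures whose contents are identical.
The group structure of a CPU does not necessarily contain valid data. At the MC level, for example, CPU0 and CPU1 point to different struct sched_domain instances, but sched_domain->groups points to the same group structures, and these groups are linked into a ring.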

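Per-CPU View

From the point of view of a single CPU, the figure below shows how the scheduling domains are actually organized.

Each CPU's run queue has a member that points to the scheduling domain the CPU belongs to, starting at the lowest level and leading up to the highest.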
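The per-CPU chain can be pictured as a singly linked list that starts at the run queue and climbs through parent pointers. The sketch below is a simplified userspace model under that assumption: `domain_sketch` and `rq_sketch` are illustrative stand-ins for the kernel's struct sched_domain and struct rq, keeping only the fields needed to show the walk, and the bitmask spans are made-up example values.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct sched_domain (illustrative only). */
struct domain_sketch {
    const char *name;             /* "SMT", "MC" or "NUMA" */
    unsigned long span;           /* bitmask of the CPUs this domain covers */
    struct domain_sketch *parent; /* next larger domain, NULL at the top */
};

/* Simplified stand-in for the run queue: it only records the lowest domain. */
struct rq_sketch {
    struct domain_sketch *sd;
};

/* Count the domain levels a CPU belongs to by walking upward. */
static int domain_levels(const struct rq_sketch *rq)
{
    int levels = 0;

    assert(rq != NULL);
    for (const struct domain_sketch *sd = rq->sd; sd != NULL; sd = sd->parent)
        levels++;
    return levels;
}

/* Return the span of the outermost domain, or 0 if the CPU has no domain. */
static unsigned long top_level_span(const struct rq_sketch *rq)
{
    unsigned long span = 0;

    assert(rq != NULL);
    for (const struct domain_sketch *sd = rq->sd; sd != NULL; sd = sd->parent)
        span = sd->span;
    return span;
}
```

For CPU0 of the example machine, a chain SMT (CPUs 0-1), MC (CPUs 0-7), NUMA (CPUs 0-15) would give three levels, with the top-level span covering every CPU.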

The directory /proc/sys/kernel/sched_domain/cpuX/ shows how many scheduling domains a CPU actually uses, along with the name and the configuration parameters of each domain.

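When Load Balancing Happens

Periodically: the scheduler tick path scheduler_tick()->trigger_load_balance() triggers load balancing through a softirq.
When a CPU has no runnable process: before __schedule() switches to the idle process, it tries to pull a batch of processes over from other CPUs.

Periodic Load Balancing

The run queue structure of each CPU records the time of the next periodic load balance. Once that time has passed, the SCHED_SOFTIRQ softirq is raised to perform load balancing.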

void trigger_load_balance(struct rq *rq, int cpu) 
{ 
        /* Don't need to rebalance while attached to NULL domain */ 
        if (time_after_eq(jiffies, rq->next_balance) && 
            likely(!on_null_domain(cpu))) 
                raise_softirq(SCHED_SOFTIRQ); 
#ifdef CONFIG_NO_HZ_COMMON 
        if (nohz_kick_needed(rq) && likely(!on_null_domain(cpu))) 
                nohz_balancer_kick(cpu); 
#endif 
}
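The test against rq->next_balance uses time_after_eq(), which stays correct when the jiffies counter wraps around. Below is a userspace sketch of the idea, assuming the usual signed-difference trick; `time_after_eq_sketch()` and `balance_due()` are illustrative names rather than kernel functions.

```c
#include <assert.h>
#include <limits.h>

/*
 * Compare two tick counters by the sign of their difference. The unsigned
 * subtraction wraps, so the result stays meaningful across a counter
 * overflow as long as the two values are less than half the range apart.
 */
static int time_after_eq_sketch(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}

/* Hypothetical helper mirroring the first test in trigger_load_balance(). */
static int balance_due(unsigned long now, unsigned long next_balance)
{
    return time_after_eq_sketch(now, next_balance);
}
```

A plain `now >= next_balance` comparison would give the wrong answer for a deadline set shortly before the counter overflows and checked shortly after, which is the case the signed difference handles.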

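The core flow of rebalance_domains() is shown below. Note that the balancing interval of each level is not fixed: it is computed on the fly and lies between a minimum and a maximum that can be configured through the proc interface.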
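How an interval might move between its two bounds can be modelled in a few lines. This is a simplified sketch, not the kernel's code: it assumes the interval falls back to the minimum after a successful balance and doubles, capped at the maximum, after an attempt that moved nothing. The function name `next_balance_interval()` and its parameters are hypothetical.

```c
#include <assert.h>

/*
 * Simplified model of a per-domain balancing interval.
 * cur:   interval currently in use
 * min:   configured lower bound
 * max:   configured upper bound
 * moved: non-zero if the last balance attempt migrated at least one task
 */
static unsigned long next_balance_interval(unsigned long cur,
                                           unsigned long min,
                                           unsigned long max,
                                           int moved)
{
    assert(min > 0 && min <= max);

    if (moved)
        return min;   /* useful work was done: check again soon */

    if (cur < max)
        cur *= 2;     /* nothing to move: back off */

    /* Keep the result inside the configured range. */
    if (cur > max)
        cur = max;
    if (cur < min)
        cur = min;
    return cur;
}
```

With bounds of 2 and 4, for example, an unsuccessful attempt at interval 2 moves the interval to 4, where it stays until a balance succeeds.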

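Below is the core flow of load_balance(), which is called for each level of a CPU's scheduling domains. Its purpose is to migrate some processes to a given CPU (in this scenario, the current CPU).

Taking my server as an example, here are the balancing interval ranges of the different domain levels. The unit is jiffies. (The kernel treats these values as milliseconds and converts them with msecs_to_jiffies(), so the two units coincide on a kernel built with HZ=1000.)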

Level	min_interval	max_interval
SMT	2	4
MC	40	80
NUMA	80	160

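As the table shows, the SMT level is balanced most frequently, and the frequency drops toward the higher levels. This matches the hardware: the lower the level, the cheaper a migration is (better cache utilization), so it can happen more often.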

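Load Balancing Before a CPU Goes Idle

Just before the scheduling function __schedule() switches to the idle process, one round of load balancing is attempted so that the current CPU does not go idle.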

static void __sched __schedule(void) 
{ 
        ... 
        if (unlikely(!rq->nr_running)) 
                idle_balance(cpu, rq); 
 
        ... 
}

The core function is idle_balance(). It, too, tries to balance within the lower-level scheduling domains as far as possible.

/*
 * idle_balance is called by schedule() if this_cpu is about to become
 * idle. Attempts to pull tasks from other CPUs.
 */
void idle_balance(int this_cpu, struct rq *this_rq) 
{ 
    unsigned long next_balance = jiffies + HZ; 
    struct sched_domain *sd; 
    int pulled_task = 0; 
    u64 curr_cost = 0; 
 
    this_rq->idle_stamp = rq_clock(this_rq); 
 
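    /*
     * Skip balancing if this CPU's average idle time is shorter than the
     * value configured through /proc (sysctl_sched_migration_cost), or if no
     * run queue in the root domain is overloaded, so there is nothing to pull.
     */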
    if (this_rq->avg_idle < sysctl_sched_migration_cost || 
        !this_rq->rd->overload) { 
        rcu_read_lock(); 
        sd = rcu_dereference_check_sched_domain(this_rq->sd); 
        if (sd) 
            update_next_balance(sd, 0, &next_balance); 
        rcu_read_unlock(); 
 
        goto out; 
    } 
 
    /*
     * Drop the rq->lock, but keep IRQ/preempt disabled.
     */
    raw_spin_unlock(&this_rq->lock); 
 
    update_blocked_averages(this_cpu); 
    rcu_read_lock(); 
    /* Walk the domains from the bottom up; leave the loop as soon as a task has been pulled. */
    for_each_domain(this_cpu, sd) { 
        int should_balance; 
        u64 t0, domain_cost; 
 
        if (!(sd->flags & SD_LOAD_BALANCE)) 
            continue; 
 
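        /*
         * If the balancing cost accumulated so far plus the largest cost ever
         * seen at this level already exceeds the CPU's average idle time,
         * balancing is not worthwhile. Note that sd->max_newidle_lb_cost
         * decays slowly during periodic load balancing.
         */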
        if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) { 
            update_next_balance(sd, 0, &next_balance); 
            break; 
        } 
 
        /* On my machine SD_BALANCE_NEWIDLE is always set. */
        if (sd->flags & SD_BALANCE_NEWIDLE) { 
            t0 = sched_clock_cpu(this_cpu); 
 
            pulled_task = load_balance(this_cpu, this_rq, 
                           sd, CPU_NEWLY_IDLE, 
                           &should_balance); 
            
            domain_cost = sched_clock_cpu(this_cpu) - t0; 
            if (domain_cost > sd->max_newidle_lb_cost) 
                sd->max_newidle_lb_cost = domain_cost; 
 
            /* Accumulate the balancing cost spent so far. */
            curr_cost += domain_cost; 
        } 
 
        update_next_balance(sd, 0, &next_balance); 
 
        /*
         * Stop searching for tasks to pull if there are
         * now runnable tasks on this rq.
         */
        if (pulled_task || this_rq->nr_running > 0) { 
            this_rq->idle_stamp = 0; 
            break; 
        } 
    } 
    rcu_read_unlock(); 
 
    raw_spin_lock(&this_rq->lock); 
 
out: 
    /* Move the next balance forward */ 
    if (time_after(this_rq->next_balance, next_balance)) 
        this_rq->next_balance = next_balance; 
 
    if (curr_cost > this_rq->max_idle_balance_cost) 
        this_rq->max_idle_balance_cost = curr_cost; 
}
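The cost check inside the loop is the heart of idle_balance(): each level is attempted only if the expected idle time can still pay for it. The following userspace sketch isolates that rule. It assumes the caller supplies, for each level, the worst cost seen so far, the cost of the attempt and whether it pulls a task; `newidle_level` and `idle_balance_sketch()` are hypothetical names, and unlike the kernel the sketch never updates the recorded maximum.

```c
#include <assert.h>
#include <stddef.h>

/* One scheduling-domain level, reduced to what the cost check needs. */
struct newidle_level {
    unsigned long long max_newidle_lb_cost; /* worst cost seen at this level (ns) */
    unsigned long long attempt_cost;        /* cost of balancing now (ns), sketch input */
    int pulls_task;                         /* non-zero if this attempt pulls a task */
};

/*
 * Walk the levels bottom-up. Stop when the accumulated cost plus the worst
 * known cost of the next level exceeds avg_idle, or when a task was pulled.
 * Returns the index of the level that pulled a task, or -1 if none did.
 * *spent receives the total cost of the attempts that were made.
 */
static int idle_balance_sketch(const struct newidle_level *levels, int nr_levels,
                               unsigned long long avg_idle,
                               unsigned long long *spent)
{
    unsigned long long curr_cost = 0;
    int pulled_at = -1;

    assert(levels != NULL && spent != NULL);

    for (int i = 0; i < nr_levels; i++) {
        if (avg_idle < curr_cost + levels[i].max_newidle_lb_cost)
            break;                  /* not enough idle time left for this level */

        curr_cost += levels[i].attempt_cost;
        if (levels[i].pulls_task) {
            pulled_at = i;
            break;                  /* the CPU has work now */
        }
    }

    *spent = curr_cost;
    return pulled_at;
}
```

The effect is that a CPU which is usually idle only briefly tries the cheap low levels at most, while a CPU with long idle periods can afford to search up to the NUMA level.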

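Other Situations That Use the SMP Load-Balancing Model

While the kernel runs, a few other situations also use the SMP load-balancing model to choose the best CPU to run on:

When process A wakes process B, try_to_wake_up() considers which CPU B will run on.
When a process calls the execve() system call.
When a forked child process is scheduled to run for the first time.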

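When a Process Is Woken Up

When process A wakes process B, and both are ordinary processes (handled by the fair scheduling class), the call chain is try_to_wake_up()->select_task_rq()->select_task_rq_fair()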

/*
 * sched_balance_self: balance the current task (running on cpu) in domains
 * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
 * SD_BALANCE_EXEC.
 *
 * Balance, ie. select the least loaded group.
 *
 * Returns the target CPU number, or the same CPU if no balancing is needed.
 *
 * preempt must be disabled.
 */
/*
 * Process A picks a CPU for itself or for process B:
 * 1: A wakes B
 * 2: A fork()s B and lets B run
 * 3: A calls execve() and re-selects the CPU it will run on
 */
static int 
select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags) 
{ 
    struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; 
    int cpu = smp_processor_id(); 
    int new_cpu = cpu; 
    int want_affine = 0; 
    int sync = wake_flags & WF_SYNC; 
 
    /* When A wakes B, this function is entered from try_to_wake_up() with SD_BALANCE_WAKE set. */
    if (sd_flag & SD_BALANCE_WAKE) {
        /* On wakeup, B should run on a CPU as close as possible to A's CPU. */
        if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) 
            want_affine = 1; 
        new_cpu = prev_cpu; 
        record_wakee(p); 
    } 
 
    rcu_read_lock(); 
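    /*
     * In the A-wakes-B case, find the lowest-level domain that contains both
     * A's CPU (cpu) and the CPU B slept on (prev_cpu). A and B very likely
     * exchange data, so waking B on a CPU close to A gives the best performance.
     * Otherwise, find the highest-level domain that has sd_flag set.
     */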
    for_each_domain(cpu, tmp) { 
        if (!(tmp->flags & SD_LOAD_BALANCE)) 
            continue; 
 
        /*
         * If both cpu and prev_cpu are part of this domain,
         * cpu is a valid SD_WAKE_AFFINE target.
         */
        if (want_affine && (tmp->flags & SD_WAKE_AFFINE) && 
            cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) { 
            affine_sd = tmp; 
            break; 
        } 
 
        if (tmp->flags & sd_flag) 
            sd = tmp; 
    } 
 
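    /*
     * In the A-wakes-B case, look for a suitable CPU inside the lowest-level
     * domain that contains both A's CPU and B's prev_cpu.
     */
    if (affine_sd) {
        /*
         * wake_affine() compares the load of A's CPU with that of the CPU B
         * slept on and decides whether B should be woken up closer to A.
         */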
        if (cpu != prev_cpu && wake_affine(affine_sd, p, sync)) 
            prev_cpu = cpu; 
 
        /*
         * Look for an idle CPU among those sharing the LLC with prev_cpu; if
         * none is found, prev_cpu is returned. This decides which CPU B runs
         * on after the wakeup.
         */
        new_cpu = select_idle_sibling(p, prev_cpu); 
        goto unlock; 
    } 
 
    /* At this point A and B have essentially no affinity, so cache affinity between the two need not be considered. */
    while (sd) { 
        int load_idx = sd->forkexec_idx; 
        struct sched_group *group; 
        int weight; 
 
        if (!(sd->flags & sd_flag)) { 
            sd = sd->child; 
            continue; 
        } 
 
        if (sd_flag & SD_BALANCE_WAKE) 
            load_idx = sd->wake_idx; 
 
        group = find_idlest_group(sd, p, cpu, load_idx); 
        if (!group) { 
            sd = sd->child; 
            continue; 
        } 
 
        new_cpu = find_idlest_cpu(group, p, cpu); 
        if (new_cpu == -1 || new_cpu == cpu) { 
            /* Now try balancing at a lower domain level of cpu */ 
            sd = sd->child; 
            continue; 
        } 
 
        /* Now try balancing at a lower domain level of new_cpu */ 
        cpu = new_cpu; 
        weight = sd->span_weight; 
        sd = NULL; 
        for_each_domain(cpu, tmp) { 
            if (weight <= tmp->span_weight) 
                break; 
            if (tmp->flags & sd_flag) 
                sd = tmp; 
        } 
        /* while loop will break here if sd == NULL */ 
    } 
unlock: 
    rcu_read_unlock(); 
 
    return new_cpu; 
}
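The first loop of select_task_rq_fair() searches for the smallest domain that both allows affine wakeups and covers the CPU the woken task last ran on. A reduced userspace sketch of that search follows, assuming domains are given as an array ordered from lowest to highest level with CPU spans stored as bitmasks. `wake_domain`, `find_affine_level()` and the `SKETCH_` flag values are illustrative and not taken from the kernel headers.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Illustrative flag bits for the sketch only. */
#define SKETCH_SD_LOAD_BALANCE 0x0001u
#define SKETCH_SD_WAKE_AFFINE  0x0020u

/* One domain level as seen from the waker's CPU. */
struct wake_domain {
    unsigned long span;  /* bitmask of the CPUs in this domain */
    unsigned int flags;  /* SKETCH_SD_* bits */
};

/*
 * Return the index of the lowest level that balances, allows affine
 * wakeups and contains prev_cpu, or -1 if no level qualifies.
 * levels[] must be ordered from the lowest level to the highest.
 */
static int find_affine_level(const struct wake_domain *levels, int nr_levels,
                             int prev_cpu)
{
    assert(levels != NULL);
    assert(prev_cpu >= 0 && prev_cpu < (int)(sizeof(unsigned long) * CHAR_BIT));

    for (int i = 0; i < nr_levels; i++) {
        if (!(levels[i].flags & SKETCH_SD_LOAD_BALANCE))
            continue;   /* this level does not take part in balancing */

        if ((levels[i].flags & SKETCH_SD_WAKE_AFFINE) &&
            (levels[i].span & (1UL << prev_cpu)))
            return i;   /* smallest domain shared by waker and wakee */
    }
    return -1;
}
```

Because the walk starts at the bottom, the first hit is also the closest relationship between the two CPUs: an SMT sibling is found before a core in the same package, and so on.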

/*
 * Try and locate an idle CPU in the sched_domain.
 */
/* Find the idle CPU closest to the target CPU (closest in cache or memory distance). */
static int select_idle_sibling(struct task_struct *p, int target) 
{ 
    struct sched_domain *sd; 
    struct sched_group *sg; 
    int i = task_cpu(p); 
     
    /* The target CPU itself is idle; nothing is closer to it than itself. */
    if (idle_cpu(target)) 
        return target; 
 
    /*
     * If the prevous cpu is cache affine and idle, don't be stupid.
     */
    /*
     * The CPU that p is on shares a cache with the target CPU (only the SMT
     * and MC levels have this relationship) and is idle, so use it. Sharing
     * a cache means the two CPUs are very close.
     */
    if (i != target && cpus_share_cache(i, target) && idle_cpu(i)) 
        return i; 
 
    /*
     * Otherwise, iterate the domains and find an elegible idle cpu.
     */
    /*
     * Look for an idle CPU in the domains that share the LLC with the target
     * CPU. Note that on x86 only the SMT and MC domains share a cache.
     */
    sd = rcu_dereference(per_cpu(sd_llc, target));
    /* On my machine the domains are traversed in the order MC, SMT. */
    for_each_lower_domain(sd) { 
        sg = sd->groups; 
        do { 
            if (!cpumask_intersects(sched_group_cpus(sg), 
                        tsk_cpus_allowed(p))) 
                goto next; 
 
            /* A group qualifies only if all of its CPUs are idle. */
            for_each_cpu(i, sched_group_cpus(sg)) { 
                if (i == target || !idle_cpu(i)) 
                    goto next; 
            } 
 
            /* Pick the first allowed CPU of the group whose CPUs are all idle. */
            target = cpumask_first_and(sched_group_cpus(sg), 
                    tsk_cpus_allowed(p)); 
            goto done; 
next: 
            sg = sg->next; 
        } while (sg != sd->groups); 
    } 
done: 
    return target; 
}
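The inner loop of select_idle_sibling() accepts a scheduling group only when every CPU in it is idle. The sketch below models a single domain level under simplifying assumptions: groups, the idle set and the task's allowed set are plain bitmasks, and `pick_cpu_in_idle_group()` is a hypothetical helper, not a kernel function. The kernel repeats the same scan for each lower domain, which the sketch leaves out.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_MAX_CPUS ((int)(sizeof(unsigned long) * CHAR_BIT))

/*
 * groups:  one bitmask per scheduling group of the domain
 * idle:    bitmask of the CPUs that are currently idle
 * allowed: bitmask of the CPUs the task may run on
 * target:  the CPU chosen so far, returned when no group qualifies
 */
static int pick_cpu_in_idle_group(const unsigned long *groups, int nr_groups,
                                  unsigned long idle, unsigned long allowed,
                                  int target)
{
    assert(groups != NULL);
    assert(target >= 0 && target < SKETCH_MAX_CPUS);

    for (int g = 0; g < nr_groups; g++) {
        unsigned long grp = groups[g];
        unsigned long candidates = grp & allowed;

        if (candidates == 0)
            continue;   /* the task may not run anywhere in this group */
        if (grp & (1UL << target))
            continue;   /* the group that holds target is skipped */
        if ((grp & idle) != grp)
            continue;   /* at least one CPU of the group is busy */

        for (int cpu = 0; cpu < SKETCH_MAX_CPUS; cpu++)
            if (candidates & (1UL << cpu))
                return cpu;   /* first allowed CPU of the idle group */
    }
    return target;      /* nothing better found */
}
```

Requiring the whole group to be idle means that, at the MC level, the task lands on a core whose hyper-threading sibling is idle as well, so it does not have to share the core's execution resources.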

When the execve() System Call Is Invoked

/*
 * sched_exec - execve() is a valuable balancing opportunity, because at
 * this point the task has the smallest effective memory and cache footprint.
 */
void sched_exec(void) 
{ 
    struct task_struct *p = current; 
    unsigned long flags; 
    int dest_cpu; 
 
    raw_spin_lock_irqsave(&p->pi_lock, flags); 
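    /*
     * Choose the most suitable CPU. After execve() the old cache contents of
     * the process are useless, so cache distance need not be considered when
     * picking the target CPU.
     */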
    dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0); 
    if (dest_cpu == smp_processor_id()) 
        goto unlock; 
 
    if (likely(cpu_active(dest_cpu))) { 
        struct migration_arg arg = { p, dest_cpu }; 
 
        raw_spin_unlock_irqrestore(&p->pi_lock, flags); 
        stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg); 
        return; 
    } 
unlock: 
    raw_spin_unlock_irqrestore(&p->pi_lock, flags); 
}

When a Forked Child Is Scheduled to Run for the First Time

do_fork()->wake_up_new_task() 
 
/*
 * wake_up_new_task - wake up a newly created task for the first time.
 *
 * This function will do some initial scheduler statistics housekeeping
 * that must be done for every newly created context, then puts the task
 * on the runqueue and wakes it.
 */
void wake_up_new_task(struct task_struct *p) 
{ 
    unsigned long flags; 
    struct rq *rq; 
 
    raw_spin_lock_irqsave(&p->pi_lock, flags); 
#ifdef CONFIG_SMP 
    /*
     * Fork balancing, do it here and not earlier because:
     *  - cpus_allowed can change in the fork path
     *  - any previously selected cpu might disappear through hotplug
     */
    /* Choose the most suitable CPU for the newly forked child through the SD_BALANCE_FORK path. */
    set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0)); 
#endif 
 
    /* Initialize new task's runnable average */ 
    init_task_runnable_average(p); 
    rq = __task_rq_lock(p); 
    activate_task(rq, p, 0); 
    p->on_rq = TASK_ON_RQ_QUEUED; 
    trace_sched_wakeup_new(p, true); 
    check_preempt_curr(rq, p, WF_FORK); 
#ifdef CONFIG_SMP 
    if (p->sched_class->task_woken) 
        p->sched_class->task_woken(rq, p); 
#endif 
    task_rq_unlock(rq, p, &flags); 
}

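Configuring the SMP Load-Balancing Model

The files under /proc/sys/kernel/sched_domain/cpuX/ configure the scheduling domains of a given CPU at each level.

They fall into three categories:

Domain level name: the name file.
Features supported by the domain: the flags file, whose value combines bits such as SD_LOAD_BALANCE, SD_BALANCE_NEWIDLE and SD_BALANCE_EXEC. It determines whether the functions above skip this domain when they walk the scheduling domains.
Domain tuning parameters: all other files.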
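Since the flags file holds a bitmask, reading it usually means decoding the bits back into names. The sketch below does this for the six flags that appear in this article. The numeric values are an assumption: they follow my recollection of include/linux/sched.h in 3.10-era kernels and should be checked against the headers of the kernel actually in use. `sd_flag_names` and `sd_flags_describe()` are hypothetical helpers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct sd_flag_name {
    unsigned int bit;
    const char *name;
};

/* Assumed values; verify against your kernel's include/linux/sched.h. */
static const struct sd_flag_name sd_flag_names[] = {
    {0x0001u, "SD_LOAD_BALANCE"},
    {0x0002u, "SD_BALANCE_NEWIDLE"},
    {0x0004u, "SD_BALANCE_EXEC"},
    {0x0008u, "SD_BALANCE_FORK"},
    {0x0010u, "SD_BALANCE_WAKE"},
    {0x0020u, "SD_WAKE_AFFINE"},
};

/*
 * Write the names of the known flags set in 'flags' into buf, separated
 * by '|'. Returns how many names were written completely; decoding stops
 * at the first name that does not fit into the buffer.
 */
static int sd_flags_describe(unsigned int flags, char *buf, size_t len)
{
    size_t used = 0;
    int found = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof(sd_flag_names) / sizeof(sd_flag_names[0]); i++) {
        int n;

        if (!(flags & sd_flag_names[i].bit))
            continue;

        n = snprintf(buf + used, len - used, "%s%s",
                     found ? "|" : "", sd_flag_names[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break;      /* output would be truncated */

        used += (size_t)n;
        found++;
    }
    return found;
}
```

A domain whose decoded flags lack SD_LOAD_BALANCE is one that the loops shown earlier skip with `continue`.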

Editor: 龐桂玉. Source: Tencent Cloud, Cloud+ Community
