How the Master Allocates Resources and Launches Executors on Workers: A Line-by-Line Annotated Walkthrough

Author: wangkai 2021-10-29 10:58:10




This article is reposted from the WeChat official account "KK架構", written by wangkai. To repost it, please contact the KK架構 official account.

1. A recap of the previous article

The previous article read through SparkContext initialization. Before going further, here is a quick review of what has been covered so far.

One assumption holds throughout: the Spark cluster is started in Standalone mode, and the job is submitted to that Spark standalone cluster.

First the Spark cluster has to be started. The start-all.sh script starts the Master (active and standby) and then the Workers, one after another.

Once the cluster is up, the job is submitted with the spark-submit command.

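First, on the machine that submits the job, the java command starts a JVM and runs the main method of the SparkSubmit class as the entry point.
Then a different client class is instantiated depending on the kind of cluster being targeted. For standalone this is a ClientApp, which wraps the java DriverWrapper command in a RequestSubmitDriver message and sends that message to the Master;
The Master picks a random Worker that satisfies the resource requirements to launch the Driver, which in practice means running the main method of DriverWrapper in a JVM;
The Worker then starts the Driver. On startup the Driver runs the main method in the jar the user submitted, which kicks off SparkContext initialization and creates three important instances in the Driver: DAGScheduler, TaskScheduler, and SchedulerBackend. It also starts DriverEndpoint and ClientEndpoint, which are used to communicate with the Workers and the Master.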

2. How the Master handles application registration

Picking up where the last article stopped: once ClientEndpoint has started, it sends a RegisterApplication message to the Master, and the Master begins processing it.

The place to look is the branch of the Master class that handles the RegisterApplication message.

That handler builds an ApplicationInfo from the application description and the Driver's reference, then registers it.

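Registering the application is simple: the Master adds various bookkeeping entries to its in-memory state. The important part is that the ApplicationInfo is appended to the waitingApps collection. The schedule() method later walks this list, allocates resources to each application, and gets it running.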

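The Master then persists the application's information through its persistence engine (ZooKeeper, when ZooKeeper-based recovery is configured) and sends the Driver a RegisteredApplication message to confirm that the application has been registered.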

Next comes schedule(). As covered last time, it walks two lists: waitingDrivers, to launch Drivers, and waitingApps, to launch applications.

The waitingDrivers list was already dealt with when the client asked for a Driver to be launched, so this time the focus is on the following method:

startExecutorsOnWorkers()
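The registration steps in this section can be modelled in a few lines. This is only a toy sketch: ToyMaster, AppDesc, AppInfo, and the string "messages" are hypothetical stand-ins, not the real Spark classes, and the app id format is made up. It only shows the order of operations described above (create, add to waitingApps, persist, reply to the Driver, schedule).

```scala
import scala.collection.mutable.ArrayBuffer

// Toy stand-ins for the real Spark classes; every name here is hypothetical.
object RegistrationSketch {
  final case class AppDesc(name: String, maxCores: Int)
  final case class AppInfo(id: String, desc: AppDesc)

  class ToyMaster {
    val waitingApps = ArrayBuffer.empty[AppInfo]  // the queue that schedule() walks
    val persisted = ArrayBuffer.empty[String]     // stands in for the persistence engine
    val sentToDriver = ArrayBuffer.empty[String]  // messages "sent" back to the Driver
    var scheduleCalls = 0
    private var nextAppNumber = 0

    // Same order as in the text: create, register, persist, reply, schedule.
    def handleRegisterApplication(desc: AppDesc): AppInfo = {
      val app = AppInfo(f"app-$nextAppNumber%04d", desc)
      nextAppNumber += 1
      waitingApps += app
      persisted += app.id
      sentToDriver += s"RegisteredApplication(${app.id})"
      schedule()
      app
    }

    // The real schedule() launches drivers and executors; here it only counts calls.
    def schedule(): Unit = scheduleCalls += 1
  }
}
```

Calling handleRegisterApplication once leaves exactly one entry in waitingApps, one persisted id, one reply for the Driver, and one schedule() call.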

3. How the Master schedules resources

The steps are as follows:

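Iterate over every app in waitingApps;
If the cores an app still needs are fewer than the cores of a single Executor, no new Executor is allocated for that app;
Keep only the workers that still have schedulable CPU and memory, sort them by free cores in descending order, and call the result usableWorkers;
Work out how many CPU cores to assign on each of the usableWorkers;
Finally, iterate over the usable Workers, allocate the resources, and launch the Executors.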
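The filtering and sorting step can be tried on made-up data. The sketch below is an assumption-laden simplification: Worker is a hypothetical record with only the fields needed here, the `alive` flag stands in for a worker state check, and custom resources are ignored.

```scala
object UsableWorkersSketch {
  // Hypothetical, simplified worker record; the real WorkerInfo carries more state.
  final case class Worker(id: String, alive: Boolean, coresFree: Int, memoryFreeMb: Int)

  // Keep live workers that can host at least one executor, most free cores first.
  def usableWorkers(
      workers: Seq[Worker],
      coresPerExecutor: Int,
      memoryPerExecutorMb: Int): Seq[Worker] =
    workers
      .filter(_.alive)
      .filter(w => w.coresFree >= coresPerExecutor && w.memoryFreeMb >= memoryPerExecutorMb)
      .sortBy(_.coresFree)
      .reverse
}
```

With four workers where one is dead and one has too little memory, only the remaining two come back, ordered by free cores from largest to smallest.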

In the source, this starts at the last line of the Master class's schedule() method, the call to startExecutorsOnWorkers():

The main job of this method is to work out how many executors go on each worker and with what resources, and then to launch those executors.

/** 
 * Schedule and launch executors on workers 
 */ 
private def startExecutorsOnWorkers(): Unit = { 
    // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app 
    // in the queue, then the second app, etc. 
 
    for (app <- waitingApps) { 
        val coresPerExecutor = app.desc.coresPerExecutor.getOrElse(1) 
        // If the cores left is less than the coresPerExecutor,the cores left will not be allocated 
        if (app.coresLeft >= coresPerExecutor) { 
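            // A worker is usable only if: 
            //   1. its free memory is at least the memory one executor needs 
            //   2. its free cores are at least the cores one executor needs 
            // Usable workers are then sorted by free cores, largest first 
            // Filter out workers that don't have enough resources to launch an executor 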
            val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE) 
                .filter(canLaunchExecutor(_, app.desc)) 
                .sortBy(_.coresFree).reverse 
            val appMayHang = waitingApps.length == 1 && 
                waitingApps.head.executors.isEmpty && usableWorkers.isEmpty 
            if (appMayHang) { 
                logWarning(s"App ${app.id} requires more resource than any of Workers could have.") 
            } 
            // Compute how many cores to assign on each Worker 
            val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps) 
             
            // Now that we've decided how many cores to allocate on each worker, let's allocate them 
            for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) { 
                allocateWorkerResourceToExecutors( 
                    app, assignedCores(pos), app.desc.coresPerExecutor, usableWorkers(pos)) 
            } 
        } 
    } 
}

(1) Iterate over waitingApps. Allocation only proceeds for an app if the number of CPU cores it still needs is at least the number of cores per executor.

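(2) Filter the usable workers. A worker must be in the ALIVE state and must pass canLaunchExecutor. Condition one: its free memory is at least the memory a single executor needs. Condition two: its free CPU cores are at least the cores a single executor needs. The remaining workers are then sorted by free CPU cores, largest first.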

(3) The next two methods are the key ones:

scheduleExecutorsOnWorkers() computes how many CPU cores to assign on each Worker;

allocateWorkerResourceToExecutors() does the actual allocation of Executors on a Worker.
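The body of allocateWorkerResourceToExecutors() is not shown in this article. As a rough sketch of the idea, and assuming it divides the cores assigned on a worker evenly by coresPerExecutor when that is set and otherwise gives everything to a single executor, the split could look like this. The object and function names are hypothetical.

```scala
object ExecutorSplitSketch {
  // Returns the core count of each executor to launch on one worker.
  // Assumption: with coresPerExecutor set, cores are divided evenly with no remainder used;
  // without it, one executor takes all the cores assigned on this worker.
  def splitIntoExecutors(assignedCores: Int, coresPerExecutor: Option[Int]): Seq[Int] = {
    val numExecutors = coresPerExecutor.map(assignedCores / _).getOrElse(1)
    val coresEach = coresPerExecutor.getOrElse(assignedCores)
    Seq.fill(numExecutors)(coresEach)
  }
}
```

For example, 12 assigned cores with 4 cores per executor yields three 4-core executors, while 12 assigned cores with no per-executor setting yields one 12-core executor.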

4. scheduleExecutorsOnWorkers: computing the cores assigned on each Worker

This method is long. Start with its doc comment, which says roughly the following:

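When the number of CPU cores per executor (spark.executor.cores) is set explicitly, several executors from the same application can be launched on one worker, provided the worker has enough cores and memory. When it is not set, each executor takes as many cores as the worker can offer, so only one executor per application is launched on each worker in a single scheduling round.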

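Even without spark.executor.cores, one application can still end up with several executors on the same worker. Suppose appA and appB each have one executor running on worker1, and appA still needs more cores. When appB finishes and releases its cores on worker1, the next scheduling round launches a new executor for appA that takes all the free cores on worker1, so appA now has more than one executor on worker1.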

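It is important to assign coresPerExecutor (spark.executor.cores) cores to a worker at a time rather than 1 core at a time. Consider this example: the cluster has 4 workers with 16 cores each, and the user requests 3 executors (spark.cores.max = 48, spark.executor.cores = 16). If cores were handed out one at a time, round-robin across the workers, each of the 4 workers would end up with 12 assigned cores. That is 48 cores in total, but 12 < 16 on every worker, so no executor could be launched at all.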
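The arithmetic in this example can be checked with a small simulation. This is a stripped-down sketch, not the Spark method: it only models round-robin (spread-out) assignment over workers, ignores memory and executor limits, and the `step` parameter (how many cores are handed out per visit) is an illustrative knob introduced here.

```scala
object CoreStepSketch {
  // Round-robin assignment that hands out `step` cores per visit to a worker.
  def assign(freeCores: Seq[Int], coresWanted: Int, step: Int): Vector[Int] = {
    val assigned = Array.fill(freeCores.length)(0)
    var remaining = math.min(coresWanted, freeCores.sum)
    // A worker can take another step if cores remain and it has room for them.
    def canTake(i: Int): Boolean =
      remaining >= step && freeCores(i) - assigned(i) >= step
    var candidates = freeCores.indices.filter(canTake)
    while (candidates.nonEmpty) {
      candidates.foreach { i =>
        if (canTake(i)) {
          remaining -= step
          assigned(i) += step
        }
      }
      candidates = candidates.filter(canTake)
    }
    assigned.toVector
  }

  // How many executors of `coresPerExecutor` cores fit into each worker's share.
  def executorsPerWorker(assigned: Seq[Int], coresPerExecutor: Int): Seq[Int] =
    assigned.map(_ / coresPerExecutor)
}
```

With four 16-core workers and 48 cores wanted, a step of 1 gives every worker 12 cores, which fits zero 16-core executors. A step of 16 gives three workers 16 cores each and the fourth none, which fits the three executors the user asked for.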

If that is still hard to follow, here is the short version:

If spark.executor.cores is not set, each Worker launches at most one Executor for the app per scheduling round, and that Executor takes every CPU core the Worker can offer;
If it is set explicitly, each Worker can launch several Executors;

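The source follows, with a comment on nearly every statement. In the middle is a nested method that decides whether another Executor can still be placed on a given Worker.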

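The key part is the block after that nested method, which loops over the Workers and assigns CPU cores. When spread-out mode is off, it keeps assigning on the same Worker until that Worker's resources are used up.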

private def scheduleExecutorsOnWorkers( 
    app: ApplicationInfo, 
    usableWorkers: Array[WorkerInfo], 
    spreadOutApps: Boolean): Array[Int] = { 
    // Cores per executor, if the user configured it 
    val coresPerExecutor = app.desc.coresPerExecutor 
    // Minimum cores per executor; falls back to 1 when unset 
    val minCoresPerExecutor = coresPerExecutor.getOrElse(1) 
    // One executor per Worker? True exactly when coresPerExecutor is unset 
    val oneExecutorPerWorker = coresPerExecutor.isEmpty 
    // Memory per executor, in MB 
    val memoryPerExecutor = app.desc.memoryPerExecutorMB 
    // Custom resource requirements per executor 
    val resourceReqsPerExecutor = app.desc.resourceReqsPerExecutor 
    // Number of usable Workers 
    val numUsable = usableWorkers.length 
    val assignedCores = new Array[Int](numUsable) // Number of cores to give to each worker 
    val assignedExecutors = new Array[Int](numUsable) // Number of new executors on each worker 
    // Cores still to hand out: the smaller of what the app still needs 
    // and what all usable workers can offer in total 
    var coresToAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum) 
 
    /** Return whether the specified worker can launch an executor for this app. */ 
    def canLaunchExecutorForApp(pos: Int): Boolean = { 
        // Keep going only while the cores left to assign cover at least one minimal executor 
        val keepScheduling = coresToAssign >= minCoresPerExecutor 
        // Enough cores: this worker's free cores minus the cores already assigned on it 
        // in this round must be at least the minimum cores per executor 
        val enoughCores = usableWorkers(pos).coresFree - assignedCores(pos) >= minCoresPerExecutor 
        // Number of new executors assigned to this worker so far 
        val assignedExecutorNum = assignedExecutors(pos) 
 
        // If we allow multiple executors per worker, then we can always launch new executors. 
        // Otherwise, if there is already an executor on this worker, just give it more cores. 
 
        // True if a worker may host several executors, or this worker has no new executor yet 
        val launchingNewExecutor = !oneExecutorPerWorker || assignedExecutorNum == 0 
        if (launchingNewExecutor) { 
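            // Memory already assigned to new executors on this worker 
            val assignedMemory = assignedExecutorNum * memoryPerExecutor 
            // Enough memory: the worker's free memory minus the memory already assigned 
            // must be at least the memory one executor needs 
            val enoughMemory = usableWorkers(pos).memoryFree - assignedMemory >= memoryPerExecutor 
            // Custom resources already assigned to new executors on this worker 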
            val assignedResources = resourceReqsPerExecutor.map { 
                req => req.resourceName -> req.amount * assignedExecutorNum 
            }.toMap 
            val resourcesFree = usableWorkers(pos).resourcesAmountFree.map { 
                case (rName, free) => rName -> (free - assignedResources.getOrElse(rName, 0)) 
            } 
            val enoughResources = ResourceUtils.resourcesMeetRequirements( 
                resourcesFree, resourceReqsPerExecutor) 
            // Newly assigned executors plus the app's existing executors must stay below the app's executor limit 
            val underLimit = assignedExecutors.sum + app.executors.size < app.executorLimit 
            keepScheduling && enoughCores && enoughMemory && enoughResources && underLimit 
        } else { 
            // We're adding cores to an existing executor, so no need 
            // to check memory and executor limits 
            keepScheduling && enoughCores 
        } 
    } 
 
    // Keep launching executors until no more workers can accommodate any 
    // more executors, or if we have reached this application's limits 
    // Indices of the workers that can still launch an executor 
    var freeWorkers = (0 until numUsable).filter(canLaunchExecutorForApp) 
 
    while (freeWorkers.nonEmpty) { 
        // Visit each such worker in turn 
        freeWorkers.foreach { pos => 
            var keepScheduling = true 
            while (keepScheduling && canLaunchExecutorForApp(pos)) { 
                coresToAssign -= minCoresPerExecutor 
                assignedCores(pos) += minCoresPerExecutor 
 
                // If we are launching one executor per worker, then every iteration assigns 1 core 
                // to the executor. Otherwise, every iteration assigns cores to a new executor. 
                if (oneExecutorPerWorker) { 
                    assignedExecutors(pos) = 1 
                } else { 
                    assignedExecutors(pos) += 1 
                } 
 
                // Spreading out an application means spreading out its executors across as 
                // many workers as possible. If we are not spreading out, then we should keep 
                // scheduling executors on this worker until we use all of its resources. 
                // Otherwise, just move on to the next worker. 
                if (spreadOutApps) { 
                    keepScheduling = false 
                } 
            } 
        } 
        freeWorkers = freeWorkers.filter(canLaunchExecutorForApp) 
    } 
    assignedCores 
}
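To see what the spreadOutApps flag changes, the double loop above can be reduced to its core-counting skeleton. The sketch below is a simplification written for illustration: it assumes a fixed coresPerExecutor, ignores memory, custom resources, and the executor limit, and takes plain free-core counts instead of WorkerInfo objects. The object name and signature are hypothetical.

```scala
object SpreadOutSketch {
  // Returns the cores assigned on each worker, in the same order as freeCores.
  def assignCores(
      freeCores: Seq[Int],
      coresWanted: Int,
      coresPerExecutor: Int,
      spreadOut: Boolean): Vector[Int] = {
    val assigned = Array.fill(freeCores.length)(0)
    var remaining = math.min(coresWanted, freeCores.sum)
    // Simplified canLaunchExecutorForApp: cores only.
    def canLaunch(pos: Int): Boolean =
      remaining >= coresPerExecutor && freeCores(pos) - assigned(pos) >= coresPerExecutor
    var free = freeCores.indices.filter(canLaunch)
    while (free.nonEmpty) {
      free.foreach { pos =>
        var keepScheduling = true
        while (keepScheduling && canLaunch(pos)) {
          remaining -= coresPerExecutor
          assigned(pos) += coresPerExecutor
          // Spread-out mode places one executor, then moves on to the next worker.
          if (spreadOut) keepScheduling = false
        }
      }
      free = free.filter(canLaunch)
    }
    assigned.toVector
  }
}
```

With three workers that each have 8 free cores, 12 cores wanted, and 2 cores per executor, spread-out mode assigns 4 cores on every worker, while the non-spread-out mode fills the first worker with 8 cores and puts the remaining 4 on the second.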

Next comes the actual launch of Executors on the Workers, which happens in the launchExecutor method:

private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = { 
    logInfo("Launching executor " + exec.fullId + " on worker " + worker.id) 
    worker.addExecutor(exec) 
    worker.endpoint.send(LaunchExecutor(masterUrl, exec.application.id, exec.id, 
        exec.application.desc, exec.cores, exec.memory, exec.resources)) 
    exec.application.driver.send( 
        ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory)) 
}