
A Brief Analysis of the Kubelet Eviction Mechanism

Author: 海的瀾色 2021-08-30 09:44:47



To protect the node, the Kubelet can evict Pods running on it when the node runs short of resources. I recently studied the Kubelet's eviction mechanism and found a lot worth learning in it, so I am summarizing it here to share.

Kubelet configuration

Eviction has to be enabled in the Kubelet configuration, along with the eviction thresholds. The eviction-related parameters in the Kubelet configuration are:

type KubeletConfiguration struct { 
    ... 
  // Map of signal names to quantities that defines hard eviction thresholds. For example: {"memory.available": "300Mi"}. 
  EvictionHard map[string]string 
  // Map of signal names to quantities that defines soft eviction thresholds.  For example: {"memory.available": "300Mi"}. 
  EvictionSoft map[string]string 
  // Map of signal names to quantities that defines grace periods for each soft eviction signal. For example: {"memory.available": "30s"}. 
  EvictionSoftGracePeriod map[string]string 
  // Duration for which the kubelet has to wait before transitioning out of an eviction pressure condition. 
  EvictionPressureTransitionPeriod metav1.Duration 
  // Maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met. 
  EvictionMaxPodGracePeriod int32 
  // Map of signal names to quantities that defines minimum reclaims, which describe the minimum 
  // amount of a given resource the kubelet will reclaim when performing a pod eviction while 
  // that resource is under pressure. For example: {"imagefs.available": "2Gi"} 
  EvictionMinimumReclaim map[string]string 
  ... 
}

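Here, EvictionHard defines hard eviction: once a threshold is reached, Pods are evicted immediately. EvictionSoft defines soft eviction: a grace period can be set with EvictionSoftGracePeriod, and eviction starts only after the threshold has been exceeded for longer than that period. EvictionMinimumReclaim sets the minimum amount of a resource (imagefs, for example) that the Kubelet reclaims when it evicts Pods while that resource is under pressure.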
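The difference between hard and soft thresholds can be illustrated with a minimal sketch. The threshold type and the shouldEvict function below are hypothetical names invented for this illustration, not the Kubelet's own types; the only assumption is that a threshold with no grace period behaves as a hard threshold.

```go
package main

import (
	"fmt"
	"time"
)

// threshold is a simplified, hypothetical model of one eviction threshold.
// A zero gracePeriod stands for a hard threshold.
type threshold struct {
	signal      string
	gracePeriod time.Duration
}

// shouldEvict reports whether eviction should start, given how long the
// threshold has been continuously observed as crossed.
func shouldEvict(t threshold, observedFor time.Duration) bool {
	if t.gracePeriod == 0 {
		return true // hard threshold: act as soon as it is crossed
	}
	return observedFor >= t.gracePeriod // soft threshold: wait out the grace period
}

func main() {
	hard := threshold{signal: "memory.available"}
	soft := threshold{signal: "memory.available", gracePeriod: 30 * time.Second}

	fmt.Println(shouldEvict(hard, 0))              // true
	fmt.Println(shouldEvict(soft, 10*time.Second)) // false
	fmt.Println(shouldEvict(soft, 45*time.Second)) // true
}
```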

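The eviction signals that can be configured are:

memory.available: node.status.capacity[memory] - node.stats.memory.workingSet, the memory available on the node
nodefs.available: node.stats.fs.available, the free capacity of the filesystem used by the Kubelet
nodefs.inodesFree: node.stats.fs.inodesFree, the number of free inodes on the filesystem used by the Kubelet
imagefs.available: node.stats.runtime.imagefs.available, the free capacity of the filesystem the container runtime uses for images and container writable layers
imagefs.inodesFree: node.stats.runtime.imagefs.inodesFree, the number of free inodes on the filesystem the container runtime uses for images and container writable layers
allocatableMemory.available: the memory still available for allocation to Pods
pid.available: node.stats.rlimit.maxpid - node.stats.rlimit.curproc, the PIDs still available for allocation to Pods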
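As a small worked example of the formulas in the list above, the sketch below computes memory.available and pid.available from sample numbers. The nodeStats struct and its field names are hypothetical stand-ins for the node statistics, and the values are made up for illustration.

```go
package main

import "fmt"

// nodeStats holds hypothetical sample values for a node.
type nodeStats struct {
	memoryCapacity   uint64 // bytes, stands in for node.status.capacity[memory]
	memoryWorkingSet uint64 // bytes, stands in for node.stats.memory.workingSet
	maxPID           int64  // stands in for node.stats.rlimit.maxpid
	curProc          int64  // stands in for node.stats.rlimit.curproc
}

// memoryAvailable implements capacity - workingSet, clamped at zero.
func memoryAvailable(s nodeStats) uint64 {
	if s.memoryWorkingSet > s.memoryCapacity {
		return 0
	}
	return s.memoryCapacity - s.memoryWorkingSet
}

// pidAvailable implements maxpid - curproc.
func pidAvailable(s nodeStats) int64 {
	return s.maxPID - s.curProc
}

// belowThreshold reports whether the available amount has dropped under
// the configured eviction threshold.
func belowThreshold(available, threshold uint64) bool {
	return available < threshold
}

func main() {
	const mi = uint64(1) << 20
	s := nodeStats{
		memoryCapacity:   8192 * mi,
		memoryWorkingSet: 7992 * mi,
		maxPID:           32768,
		curProc:          1200,
	}
	avail := memoryAvailable(s)
	fmt.Println(avail/mi, "Mi available")                        // 200 Mi available
	fmt.Println("under 300Mi:", belowThreshold(avail, 300*mi)) // under 300Mi: true
	fmt.Println("pids available:", pidAvailable(s))             // pids available: 31568
}
```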

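How the Eviction Manager works

Most of the Eviction Manager's work happens in the synchronize function. Two things trigger synchronize: the monitor task, which fires every 10s, and the notifier tasks, which are started according to the eviction signals the user configured and listen for kernel events.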
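The two triggers can be pictured as a single loop that waits on a timer and on a notification channel. This is only a sketch of the pattern, not the Kubelet's actual control flow: runLoop, its parameters, and the reason strings are hypothetical, and a 10ms ticker stands in for the 10s monitor interval so the example finishes quickly.

```go
package main

import (
	"fmt"
	"time"
)

// runLoop calls synchronize whenever the periodic tick or a notifier
// event arrives, and returns once stop is closed.
func runLoop(tick <-chan time.Time, notify <-chan struct{}, stop <-chan struct{}, synchronize func(reason string)) {
	for {
		select {
		case <-stop:
			return
		case <-tick:
			synchronize("monitor")
		case <-notify:
			synchronize("notifier")
		}
	}
}

func main() {
	ticker := time.NewTicker(10 * time.Millisecond) // stands in for the 10s interval
	defer ticker.Stop()

	notify := make(chan struct{}, 1)
	stop := make(chan struct{})
	notify <- struct{}{} // pretend the kernel reported a crossed threshold

	count := 0
	runLoop(ticker.C, notify, stop, func(reason string) {
		fmt.Println("synchronize triggered by", reason)
		count++
		if count == 3 {
			close(stop) // end the demo after a few rounds
		}
	})
}
```

Using one loop for both sources keeps synchronize single-threaded, so the periodic check and the kernel-driven check never run concurrently.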

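notifier

A notifier is started by a thresholdNotifier in the eviction manager. Each eviction signal configured by the user has its own thresholdNotifier, and the thresholdNotifier and the notifier communicate over a channel: when the notifier sends a message on the channel, the corresponding thresholdNotifier triggers one run of the synchronize logic.

The notifier relies on the kernel's cgroups memory thresholds. cgroups let a user-space process use an eventfd to ask the kernel for a notification when memory.usage_in_bytes reaches a given threshold. This is done by writing "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to cgroup.event_control.

The notifier's initialization code is shown below (some unrelated code has been removed for readability). It mainly obtains watchfd, the file descriptor of memory.usage_in_bytes, and controlfd, the file descriptor of cgroup.event_control, and then registers the cgroup memory threshold.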

func NewCgroupNotifier(path, attribute string, threshold int64) (CgroupNotifier, error) { 
  var watchfd, eventfd, epfd, controlfd int 
  var err error // the error checks after each call are omitted in this excerpt 
 
  watchfd, err = unix.Open(fmt.Sprintf("%s/%s", path, attribute), unix.O_RDONLY|unix.O_CLOEXEC, 0) 
  defer unix.Close(watchfd) 
   
  controlfd, err = unix.Open(fmt.Sprintf("%s/cgroup.event_control", path), unix.O_WRONLY|unix.O_CLOEXEC, 0) 
  defer unix.Close(controlfd) 
   
  eventfd, err = unix.Eventfd(0, unix.EFD_CLOEXEC) 
  defer func() { 
    // Close eventfd if we get an error later in initialization 
    if err != nil { 
      unix.Close(eventfd) 
    } 
  }() 
   
  epfd, err = unix.EpollCreate1(unix.EPOLL_CLOEXEC) 
  defer func() { 
    // Close epfd if we get an error later in initialization 
    if err != nil { 
      unix.Close(epfd) 
    } 
  }() 
   
  config := fmt.Sprintf("%d %d %d", eventfd, watchfd, threshold) 
  _, err = unix.Write(controlfd, []byte(config)) 
 
  return &linuxCgroupNotifier{ 
    eventfd: eventfd, 
    epfd:    epfd, 
    stop:    make(chan struct{}), 
  }, nil 
}

When it starts, the notifier also uses epoll to watch the eventfd above. When it receives an event from the kernel, memory usage has crossed the threshold, and it sends a signal on the channel.

func (n *linuxCgroupNotifier) Start(eventCh chan<- struct{}) { 
  err := unix.EpollCtl(n.epfd, unix.EPOLL_CTL_ADD, n.eventfd, &unix.EpollEvent{ 
    Fd:     int32(n.eventfd), 
    Events: unix.EPOLLIN, 
  }) 
  if err != nil { 
    klog.InfoS("Eviction manager: error adding epoll eventfd", "err", err) 
    return 
  } 
 
  for { 
    select { 
    case <-n.stop: 
      return 
    default: 
    } 
    event, err := wait(n.epfd, n.eventfd, notifierRefreshInterval) 
    if err != nil { 
      klog.InfoS("Eviction manager: error while waiting for memcg events", "err", err) 
      return 
    } else if !event { 
      // Timeout on wait.  This is expected if the threshold was not crossed 
      continue 
    } 
    // Consume the event from the eventfd 
    buf := make([]byte, eventSize) 
    _, err = unix.Read(n.eventfd, buf) 
    if err != nil { 
      klog.InfoS("Eviction manager: error reading memcg events", "err", err) 
      return 
    } 
    eventCh <- struct{}{} 
  } 
}

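Each time synchronize runs, it checks whether the notifier has been updated within the last 10s; if not, it updates the threshold and restarts the notifier. The cgroup memory threshold is computed as total memory minus the eviction threshold set by the user.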
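The refresh check and the threshold arithmetic described above can be written out as follows. This is a sketch of the description in this article only: needsRefresh and memcgThreshold are hypothetical helper names, and the real Kubelet code works with resource quantities rather than plain integers.

```go
package main

import (
	"fmt"
	"time"
)

// notifierRefreshInterval mirrors the 10s interval mentioned in the text.
const notifierRefreshInterval = 10 * time.Second

// needsRefresh reports whether the notifier's threshold is older than
// the refresh interval and should be recomputed.
func needsRefresh(sinceLastUpdate time.Duration) bool {
	return sinceLastUpdate > notifierRefreshInterval
}

// memcgThreshold returns the usage level at which the kernel should
// notify: total memory minus the configured eviction threshold.
func memcgThreshold(capacityBytes, evictionThresholdBytes int64) int64 {
	return capacityBytes - evictionThresholdBytes
}

func main() {
	const mi = int64(1) << 20
	capacity := 8192 * mi // 8Gi node, sample value
	eviction := 300 * mi  // memory.available: 300Mi

	fmt.Println(needsRefresh(12 * time.Second))        // true
	fmt.Println(memcgThreshold(capacity, eviction)/mi) // 7892
}
```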

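synchronize

synchronize, the main logic of the Eviction Manager, has many details, so I will not paste the source here. It boils down to the following steps:

build a ranking function for each signal;
update the thresholds and restart the notifiers;
get the node's current resource usage (from cgroup information) and all active pods;
for each signal, check whether the node's current resource usage has reached the eviction threshold, and exit the current round if none has;
sort the signals by priority, with memory-related signals handled first;
send an eviction event to the apiserver;
rank all active pods by priority;
evict pods in the ranked order.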
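The step that handles memory-related signals first can be sketched with a stable sort. The helper names below are hypothetical, and treating memory.available and allocatableMemory.available as the memory-related signals is an assumption made for this illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// isMemorySignal reports whether a signal is memory-related
// (an assumption for this sketch).
func isMemorySignal(signal string) bool {
	return signal == "memory.available" || signal == "allocatableMemory.available"
}

// sortByEvictionPriority moves memory-related signals to the front and
// keeps the relative order of everything else.
func sortByEvictionPriority(signals []string) {
	sort.SliceStable(signals, func(i, j int) bool {
		return isMemorySignal(signals[i]) && !isMemorySignal(signals[j])
	})
}

func main() {
	signals := []string{
		"nodefs.available",
		"memory.available",
		"imagefs.available",
		"allocatableMemory.available",
	}
	sortByEvictionPriority(signals)
	for _, s := range signals {
		fmt.Println(s)
	}
}
```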

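Computing the eviction order

The order in which pods are evicted depends mainly on three factors:

whether the pod's resource usage exceeds its requests;
the pod's priority value;
the pod's memory usage;

The three factors are evaluated in the order in which they are registered with orderedBy. The multi-level sort implemented by orderedBy is also one of the pieces of Kubernetes worth studying (and borrowing); interested readers can look it up in the source.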

// rankMemoryPressure orders the input pods for eviction in response to memory pressure. 
// It ranks by whether or not the pod's usage exceeds its requests, then by priority, and 
// finally by memory usage above requests. 
func rankMemoryPressure(pods []*v1.Pod, stats statsFunc) { 
  orderedBy(exceedMemoryRequests(stats), priority, memory(stats)).Sort(pods) 
}
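For readers who want the idea without opening the source, here is a minimal sketch of a multi-level sort in the same spirit. It is a simplified re-implementation for illustration, not the Kubelet's code: the pod struct, the comparator signatures, and the sample values are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a hypothetical, flattened stand-in for a Pod plus its stats.
type pod struct {
	name                 string
	memUsage, memRequest int64
	priority             int32
}

// cmpFunc returns <0 if a should be evicted before b, >0 if after, 0 on a tie.
type cmpFunc func(a, b *pod) int

type multiSorter struct {
	pods []*pod
	cmps []cmpFunc
}

func orderedBy(cmps ...cmpFunc) *multiSorter { return &multiSorter{cmps: cmps} }

func (ms *multiSorter) Sort(pods []*pod) { ms.pods = pods; sort.Sort(ms) }
func (ms *multiSorter) Len() int         { return len(ms.pods) }
func (ms *multiSorter) Swap(i, j int)    { ms.pods[i], ms.pods[j] = ms.pods[j], ms.pods[i] }

// Less walks the comparators in order and lets the first non-tie decide.
func (ms *multiSorter) Less(i, j int) bool {
	for _, cmp := range ms.cmps {
		if r := cmp(ms.pods[i], ms.pods[j]); r != 0 {
			return r < 0
		}
	}
	return false
}

// exceedsRequest ranks pods that use more than they requested first.
func exceedsRequest(a, b *pod) int {
	ea, eb := a.memUsage > a.memRequest, b.memUsage > b.memRequest
	switch {
	case ea == eb:
		return 0
	case ea:
		return -1
	}
	return 1
}

// byPriority ranks lower-priority pods first.
func byPriority(a, b *pod) int { return int(a.priority) - int(b.priority) }

// byUsageAboveRequest ranks pods with more usage above their request first.
func byUsageAboveRequest(a, b *pod) int {
	da, db := a.memUsage-a.memRequest, b.memUsage-b.memRequest
	switch {
	case da > db:
		return -1
	case da < db:
		return 1
	}
	return 0
}

func main() {
	pods := []*pod{
		{"a", 100, 200, 0},
		{"b", 300, 200, 100},
		{"c", 500, 200, 100},
		{"d", 250, 200, 0},
	}
	orderedBy(exceedsRequest, byPriority, byUsageAboveRequest).Sort(pods)
	for _, p := range pods {
		fmt.Print(p.name, " ") // d c b a
	}
	fmt.Println()
}
```

Each comparator only has to answer one question, and the order of the arguments to orderedBy decides which question is asked first, which is what makes the ranking rules easy to rearrange.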

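Evicting Pods

Next comes the implementation of Pod eviction. The Eviction Manager evicts a Pod with a clean, straightforward kill. I will not go into the details of that implementation here, but note the check made before eviction: if IsCriticalPod returns true, the Pod is not evicted.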

func (m *managerImpl) evictPod(pod *v1.Pod, gracePeriodOverride int64, evictMsg string, annotations map[string]string) bool { 
  // If the pod is marked as critical and static, and support for critical pod annotations is enabled, 
  // do not evict such pods. Static pods are not re-admitted after evictions. 
  // https://github.com/kubernetes/kubernetes/issues/40573 has more details. 
  if kubelettypes.IsCriticalPod(pod) { 
    klog.ErrorS(nil, "Eviction manager: cannot evict a critical pod", "pod", klog.KObj(pod)) 
    return false 
  } 
  // record that we are evicting the pod 
  m.recorder.AnnotatedEventf(pod, annotations, v1.EventTypeWarning, Reason, evictMsg) 
  // this is a blocking call and should only return when the pod and its containers are killed. 
  klog.V(3).InfoS("Evicting pod", "pod", klog.KObj(pod), "podUID", pod.UID, "message", evictMsg) 
  err := m.killPodFunc(pod, true, &gracePeriodOverride, func(status *v1.PodStatus) { 
    status.Phase = v1.PodFailed 
    status.Reason = Reason 
    status.Message = evictMsg 
  }) 
  if err != nil { 
    klog.ErrorS(err, "Eviction manager: pod failed to evict", "pod", klog.KObj(pod)) 
  } else { 
    klog.InfoS("Eviction manager: pod is evicted successfully", "pod", klog.KObj(pod)) 
  } 
  return true 
}

Now let's look at the code of IsCriticalPod:

func IsCriticalPod(pod *v1.Pod) bool { 
  if IsStaticPod(pod) { 
    return true 
  } 
  if IsMirrorPod(pod) { 
    return true 
  } 
  if pod.Spec.Priority != nil && IsCriticalPodBasedOnPriority(*pod.Spec.Priority) { 
    return true 
  } 
  return false 
} 
 
// IsMirrorPod returns true if the passed Pod is a Mirror Pod. 
func IsMirrorPod(pod *v1.Pod) bool { 
  _, ok := pod.Annotations[ConfigMirrorAnnotationKey] 
  return ok 
} 
 
// IsStaticPod returns true if the pod is a static pod. 
func IsStaticPod(pod *v1.Pod) bool { 
  source, err := GetPodSource(pod) 
  return err == nil && source != ApiserverSource 
} 
 
func IsCriticalPodBasedOnPriority(priority int32) bool { 
  return priority >= scheduling.SystemCriticalPriority 
}

As the code shows, a Pod is not evicted if it is a Static, Mirror, or Critical Pod. Static and Mirror are both determined from the Pod's annotations, while Critical is determined from the Pod's Priority value: a Pod whose Priority is system-cluster-critical or system-node-critical counts as a Critical Pod.
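The priority comparison can be made concrete with the numeric values behind the two built-in priority classes. The lower-case constant and function names below are local to this sketch; the numbers are the values Kubernetes assigns to system-cluster-critical (2000000000) and system-node-critical (2000001000).

```go
package main

import "fmt"

const (
	// Highest priority value a user-defined PriorityClass may use.
	highestUserDefinablePriority int32 = 1000000000
	// Threshold at or above which a Pod counts as critical.
	systemCriticalPriority = 2 * highestUserDefinablePriority
	// Values of the two built-in critical priority classes.
	systemClusterCritical = systemCriticalPriority
	systemNodeCritical    = systemCriticalPriority + 1000
)

// isCriticalByPriority mirrors the comparison in IsCriticalPodBasedOnPriority.
func isCriticalByPriority(priority int32) bool {
	return priority >= systemCriticalPriority
}

func main() {
	fmt.Println(isCriticalByPriority(systemClusterCritical))        // true
	fmt.Println(isCriticalByPriority(systemNodeCritical))           // true
	fmt.Println(isCriticalByPriority(highestUserDefinablePriority)) // false
}
```

Because user-defined priority classes are capped below the critical threshold, an ordinary workload can only reach critical status by using one of the two system priority classes.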

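One thing to note, however: the official documentation on Critical Pods says that a non-static Pod marked as critical is not fully guaranteed against eviction: https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods . So it may well be that the community has not yet settled whether such Pods should be evicted, and this logic could change later; it is also possible that the documentation simply has not been updated.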

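Summary

This article analyzed the Kubelet's Eviction Manager, including how it listens for Linux CGroup events and how it decides the priority of Pod eviction. With this understanding, we can set priorities according to how important our own applications are, or even mark them as Critical Pods.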

Editor: 武曉燕. Source: CS實驗室
