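Reading the Pause Container Source Code
This article is reposted from the WeChat public account 「董澤潤的技術筆記」 (Dong Zerun's Tech Notes), written by Dong Zerun. To repost it, please contact that account.
Everyone knows that the smallest scheduling unit in k8s is the POD, and that every POD has a so-called infra container, Pause, which sets up the relevant namespaces and starts before the other containers in the POD. So what exactly is the Pause container? What does it look like, and what does it do?
Analyzing the source
Without further ado, here is the source, taken from the official pause.c[1]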
- #include <signal.h>
- #include <stdio.h>
- #include <stdlib.h>
- #include <string.h>
- #include <sys/types.h>
- #include <sys/wait.h>
- #include <unistd.h>
- #define STRINGIFY(x) #x
- #define VERSION_STRING(x) STRINGIFY(x)
- #ifndef VERSION
- #define VERSION HEAD
- #endif
- static void sigdown(int signo) {
- psignal(signo, "Shutting down, got signal");
- exit(0);
- }
- static void sigreap(int signo) {
- while (waitpid(-1, NULL, WNOHANG) > 0)
- ;
- }
- int main(int argc, char **argv) {
- int i;
- for (i = 1; i < argc; ++i) {
- if (!strcasecmp(argv[i], "-v")) {
- printf("pause.c %s\n", VERSION_STRING(VERSION));
- return 0;
- }
- }
- if (getpid() != 1)
- /* Not an error because pause sees use outside of infra containers. */
- fprintf(stderr, "Warning: pause should be the first process\n");
- if (sigaction(SIGINT, &(struct sigaction){.sa_handler = sigdown}, NULL) < 0)
- return 1;
- if (sigaction(SIGTERM, &(struct sigaction){.sa_handler = sigdown}, NULL) < 0)
- return 2;
- if (sigaction(SIGCHLD, &(struct sigaction){.sa_handler = sigreap,
- .sa_flags = SA_NOCLDSTOP},
- NULL) < 0)
- return 3;
- for (;;)
- pause();
- fprintf(stderr, "Error: infinite loop terminated\n");
- return 42;
- }
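As you can see, the Pause container does two things:
- It registers signal handlers for two kinds of signals: termination signals and the child signal. On SIGINT or SIGTERM it exits immediately. On SIGCHLD it calls waitpid(-1, NULL, WNOHANG) in a loop to reap every exited process without blocking
- The main process calls pause() in a for loop, which puts the process to sleep until it is terminated or receives a signal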
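To make the reaping behavior concrete, here is a minimal sketch that reuses the same handler pattern as sigreap. The names on_sigchld and reap_children are hypothetical and not part of pause.c; the helper forks a few short-lived children and waits until the handler has reaped all of them. It assumes a Linux or other POSIX system where SIGCHLD is not already blocked by the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Number of children reaped by the handler (hypothetical counter). */
static volatile sig_atomic_t reaped = 0;

/* Same idea as sigreap in pause.c: reap every exited child, never block. */
static void on_sigchld(int signo) {
  (void)signo;
  while (waitpid(-1, NULL, WNOHANG) > 0)
    reaped++;
}

/* Hypothetical helper: fork n children that exit at once and return how
 * many the SIGCHLD handler reaped, or -1 if setup fails. */
int reap_children(int n) {
  struct sigaction sa = {.sa_handler = on_sigchld, .sa_flags = SA_NOCLDSTOP};
  struct sigaction old_sa;
  sigset_t block, old_mask, wait_mask;
  int i, result;

  reaped = 0;
  if (sigaction(SIGCHLD, &sa, &old_sa) < 0)
    return -1;

  /* Block SIGCHLD while forking so the counter is only touched while we
   * sleep in sigsuspend below. */
  sigemptyset(&block);
  sigaddset(&block, SIGCHLD);
  if (sigprocmask(SIG_BLOCK, &block, &old_mask) < 0)
    return -1;
  wait_mask = old_mask;
  sigdelset(&wait_mask, SIGCHLD);

  for (i = 0; i < n; ++i) {
    pid_t pid = fork();
    if (pid < 0)
      break;
    if (pid == 0)
      _exit(0); /* the child stays a zombie until the parent reaps it */
  }

  /* i children were actually created; sleep until all are reaped.
   * sigsuspend unblocks SIGCHLD and sleeps atomically. */
  while (reaped < i)
    sigsuspend(&wait_mask);
  result = (i == n) ? (int)reaped : -1;

  /* Restore the previous signal mask and SIGCHLD disposition. */
  sigprocmask(SIG_SETMASK, &old_mask, NULL);
  sigaction(SIGCHLD, &old_sa, NULL);
  return result;
}
```

Calling reap_children(3) from a main function should return 3, and no zombie entries should remain afterwards. The sketch uses sigsuspend rather than a bare pause() only so that the counter can be checked without a race; pause.c itself has no such counter and simply loops on pause().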
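The suspicious waitpid
My C fundamentals are evidently not solid enough: I had always assumed waitpid was simply how a parent waits for and reaps an exited child. But is that really all it does?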
- zerun.dong$ man waitpid
- WAIT(2) BSD System Calls Manual WAIT(2)
- NAME
- wait, wait3, wait4, waitpid -- wait for process termination
- SYNOPSIS
- #include <sys/wait.h>
On the Mac, the man page does indeed say "wait for process termination". Now log in to Ubuntu 18.04 and take a look
- :~# man waitpid
- WAIT(2) Linux Programmer's Manual WAIT(2)
- NAME
- wait, waitpid, waitid - wait for process to change state
In the Linux man page this becomes "wait for process to change state": the call waits for a change in the process's state!
- All of these system calls are used to wait for state changes in a child of the calling process, and obtain information about the child whose
- state has changed. A state change is considered to be: the child terminated; the child was stopped by a signal; or the child was resumed by
- a signal. In the case of a terminated child, performing a wait allows the system to release the resources associated with the child; if a
- wait is not performed, then the terminated child remains in a "zombie" state (see NOTES below).
The man page also thoughtfully provides test code
- #include <sys/wait.h>
- #include <stdlib.h>
- #include <unistd.h>
- #include <stdio.h>
- int main(int argc, char *argv[])
- {
- pid_t cpid, w;
- int wstatus;
- cpid = fork();
- if (cpid == -1) {
- perror("fork");
- exit(EXIT_FAILURE);
- }
- if (cpid == 0) { /* Code executed by child */
- printf("Child PID is %ld\n", (long) getpid());
- if (argc == 1)
- pause(); /* Wait for signals */
- _exit(atoi(argv[1]));
- } else { /* Code executed by parent */
- do {
- w = waitpid(cpid, &wstatus, WUNTRACED | WCONTINUED);
- if (w == -1) {
- perror("waitpid");
- exit(EXIT_FAILURE);
- }
- if (WIFEXITED(wstatus)) {
- printf("exited, status=%d\n", WEXITSTATUS(wstatus));
- } else if (WIFSIGNALED(wstatus)) {
- printf("killed by signal %d\n", WTERMSIG(wstatus));
- } else if (WIFSTOPPED(wstatus)) {
- printf("stopped by signal %d\n", WSTOPSIG(wstatus));
- } else if (WIFCONTINUED(wstatus)) {
- printf("continued\n");
- }
- } while (!WIFEXITED(wstatus) && !WIFSIGNALED(wstatus));
- exit(EXIT_SUCCESS);
- }
- }
The child stays in pause(), while the parent calls waitpid to wait for the child's state to change. Run the code in one session and send signals from another
- ~$ ./a.out
- Child PID is 70718
- stopped by signal 19
- continued
- stopped by signal 19
- continued
- ^C
- ~# ps aux | grep a.out
- zerun.d+ 70717 0.0 0.0 4512 744 pts/0 S+ 06:48 0:00 ./a.out
- zerun.d+ 70718 0.0 0.0 4512 72 pts/0 S+ 06:48 0:00 ./a.out
- root 71155 0.0 0.0 16152 1060 pts/1 S+ 06:49 0:00 grep --color=auto a.out
- ~#
- ~# kill -STOP 70718
- ~#
- ~# ps aux | grep a.out
- zerun.d+ 70717 0.0 0.0 4512 744 pts/0 S+ 06:48 0:00 ./a.out
- zerun.d+ 70718 0.0 0.0 4512 72 pts/0 T+ 06:48 0:00 ./a.out
- root 71173 0.0 0.0 16152 1060 pts/1 S+ 06:49 0:00 grep --color=auto a.out
- ~#
- ~# kill -CONT 70718
- ~#
- ~# ps aux | grep a.out
- zerun.d+ 70717 0.0 0.0 4512 744 pts/0 S+ 06:48 0:00 ./a.out
- zerun.d+ 70718 0.0 0.0 4512 72 pts/0 S+ 06:48 0:00 ./a.out
- root 71296 0.0 0.0 16152 1056 pts/1 R+ 06:49 0:00 grep --color=auto a.out
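The child is controlled by sending it the STOP and CONT signals, and the parent's waitpid reports each stop and each continue.
So the two man pages summarize the same call differently, and the Linux wording is the more complete one: with WUNTRACED and WCONTINUED, waitpid reports stopped and continued children, not only terminated ones. Much ado about nothing; I'm just a noob :(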
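The two-session experiment can also be driven from a single process. The following is a rough sketch, assuming Linux: the parent sends SIGSTOP, SIGCONT and SIGKILL to its own child and records what waitpid reports after each one. The names observe_child, next_event and the EV_* codes are made up for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical event codes stored in the events array. */
enum { EV_STOPPED = 1, EV_CONTINUED = 2, EV_SIGNALED = 3, EV_EXITED = 4 };

/* Wait for the next state change of pid and classify it; -1 on error. */
static int next_event(pid_t pid) {
  int status;
  if (waitpid(pid, &status, WUNTRACED | WCONTINUED) < 0)
    return -1;
  if (WIFSTOPPED(status))
    return EV_STOPPED;
  if (WIFCONTINUED(status))
    return EV_CONTINUED;
  if (WIFSIGNALED(status))
    return EV_SIGNALED;
  if (WIFEXITED(status))
    return EV_EXITED;
  return -1;
}

/* Fork a child that sleeps in pause(), then stop, continue and kill it,
 * recording what waitpid reports after each signal. Returns 0 on success. */
int observe_child(int events[3]) {
  pid_t pid = fork();
  if (pid < 0)
    return -1;
  if (pid == 0) {
    for (;;)
      pause(); /* child: sleep until a signal arrives */
  }
  kill(pid, SIGSTOP);
  events[0] = next_event(pid); /* reported because of WUNTRACED */
  kill(pid, SIGCONT);
  events[1] = next_event(pid); /* reported because of WCONTINUED */
  kill(pid, SIGKILL);
  events[2] = next_event(pid); /* termination also reaps the child */
  return 0;
}
```

On Linux the three recorded events should be EV_STOPPED, EV_CONTINUED and EV_SIGNALED, matching the "stopped", "continued" and "killed by signal" lines that the man page example prints.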
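Which namespaces are shared
Anyone who has worked with PODs knows that containers in the same POD can reach each other simply through localhost. If you think of a k8s cluster as a distributed operating system, then a POD corresponds to a process group, and its members must share something. So which namespaces are shared by default?
Set up an environment with minikube, and first look at the POD definition file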
- apiVersion: v1
- kind: Pod
- metadata:
- name: nginx
- spec:
- shareProcessNamespace: true
- containers:
- - name: nginx
- image: nginx
- - name: shell
- image: busybox
- securityContext:
- capabilities:
- add:
- - SYS_PTRACE
- stdin: true
- tty: true
The shareProcessNamespace field, stable since Kubernetes 1.17, controls whether the containers in a POD share one PID namespace. It defaults to false, so set it explicitly when you need sharing.
- ~$ kubectl attach -it nginx -c shell
- If you don't see a command prompt, try pressing enter.
- / # ps aux
- PID USER TIME COMMAND
- 1 root 0:00 /pause
- 8 root 0:00 nginx: master process nginx -g daemon off;
- 41 101 0:00 nginx: worker process
- 42 root 0:00 sh
- 49 root 0:00 ps aux
After attaching to the shell container, we can see every process in the POD, and only the pause process is PID 1 (init).
- / # kill -HUP 8
- / # ps aux
- PID USER TIME COMMAND
- 1 root 0:00 /pause
- 8 root 0:00 nginx: master process nginx -g daemon off;
- 42 root 0:00 sh
- 50 101 0:00 nginx: worker process
- 51 root 0:00 ps aux
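As a test, send HUP to the nginx master: the worker process is restarted (its PID changes from 41 to 50).
If the PID namespace is not shared, the main process of each container is PID 1 in its own namespace. What are the effects of sharing the PID namespace? See this document[2]
The container process no longer has PID 1. Some containers refuse to start without PID 1 (for example, containers using systemd) or run commands like kill -HUP 1 to signal the container process. In a pod with a shared process namespace, kill -HUP 1 will signal the pod sandbox (/pause in the example above).
Processes are visible to other containers in the pod. This includes all information visible in /proc, such as passwords that were passed as arguments or environment variables. These are protected only by regular Unix permissions.
Container filesystems are visible to other containers in the pod through the /proc/$pid/root link. This makes debugging easier, but it also means that filesystem secrets are protected only by filesystem permissions.
On the host, find the process IDs of nginx and sh, then inspect their namespace IDs through /proc/<pid>/ns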
- ~# ls -l /proc/140756/ns
- total 0
- lrwxrwxrwx 1 root root 0 May 6 09:08 cgroup -> 'cgroup:[4026531835]'
- lrwxrwxrwx 1 root root 0 May 6 09:08 ipc -> 'ipc:[4026532497]'
- lrwxrwxrwx 1 root root 0 May 6 09:08 mnt -> 'mnt:[4026532561]'
- lrwxrwxrwx 1 root root 0 May 6 09:08 net -> 'net:[4026532500]'
- lrwxrwxrwx 1 root root 0 May 6 09:08 pid -> 'pid:[4026532498]'
- lrwxrwxrwx 1 root root 0 May 6 09:08 pid_for_children -> 'pid:[4026532498]'
- lrwxrwxrwx 1 root root 0 May 6 09:08 user -> 'user:[4026531837]'
- lrwxrwxrwx 1 root root 0 May 6 09:08 uts -> 'uts:[4026532562]'
- ~# ls -l /proc/140879/ns
- total 0
- lrwxrwxrwx 1 root root 0 May 6 09:08 cgroup -> 'cgroup:[4026531835]'
- lrwxrwxrwx 1 root root 0 May 6 09:08 ipc -> 'ipc:[4026532497]'
- lrwxrwxrwx 1 root root 0 May 6 09:08 mnt -> 'mnt:[4026532563]'
- lrwxrwxrwx 1 root root 0 May 6 09:08 net -> 'net:[4026532500]'
- lrwxrwxrwx 1 root root 0 May 6 09:08 pid -> 'pid:[4026532498]'
- lrwxrwxrwx 1 root root 0 May 6 09:08 pid_for_children -> 'pid:[4026532498]'
- lrwxrwxrwx 1 root root 0 May 6 09:08 user -> 'user:[4026531837]'
- lrwxrwxrwx 1 root root 0 May 6 09:08 uts -> 'uts:[4026532564]'
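Here ipc, net and pid carry the same IDs for both processes, so these are the namespaces shared inside the POD, while mnt and uts differ. cgroup and user also match, but 4026531835 and 4026531837 are the IDs of the host's initial cgroup and user namespaces, so those two are simply not isolated from the host. This holds only for this test case.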
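The comparison done by eye above can be scripted. Below is a small sketch, assuming Linux with /proc mounted and enough privilege to read the target processes' ns entries. It reads the /proc/<pid>/ns/<type> links and compares their text, which has the form type:[inode]. The function names ns_link and same_namespace are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read the link text of /proc/<pid>/ns/<type>, e.g. "net:[4026532500]".
 * Returns 0 on success, -1 on error. len must be at least 1. */
static int ns_link(pid_t pid, const char *type, char *buf, size_t len) {
  char path[64];
  ssize_t n;

  snprintf(path, sizeof path, "/proc/%ld/ns/%s", (long)pid, type);
  n = readlink(path, buf, len - 1);
  if (n < 0)
    return -1;
  buf[n] = '\0'; /* readlink does not terminate the string */
  return 0;
}

/* Hypothetical helper: 1 if both processes are in the same namespace of
 * the given type ("net", "pid", "ipc", ...), 0 if not, -1 on error. */
int same_namespace(pid_t a, pid_t b, const char *type) {
  char link_a[64], link_b[64];

  if (ns_link(a, type, link_a, sizeof link_a) < 0 ||
      ns_link(b, type, link_b, sizeof link_b) < 0)
    return -1;
  return strcmp(link_a, link_b) == 0;
}
```

With the two host PIDs from the listing above, same_namespace(140756, 140879, "net") would be expected to return 1 and same_namespace(140756, 140879, "mnt") to return 0, provided the caller may read both processes' /proc entries.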
Killing the Pause container
Now test how k8s handles the POD when the Pause container is killed. Set up an environment with minikube, and first look at the POD definition file
- apiVersion: v1
- kind: Pod
- metadata:
- name: nginx
- spec:
- shareProcessNamespace: false
- containers:
- - name: nginx
- image: nginx
- - name: shell
- image: busybox
- securityContext:
- capabilities:
- add:
- - SYS_PTRACE
- stdin: true
- tty: true
After the POD starts, find the pause process ID on the host, kill it, and then check the POD events
- ~$ kubectl describe pod nginx
- ......
- Events:
- Type Reason Age From Message
- ---- ------ ---- ---- -------
- Normal SandboxChanged 3m1s (x2 over 155m) kubelet Pod sandbox changed, it will be killed and re-created.
- Normal Killing 3m1s (x2 over 155m) kubelet Stopping container nginx
- Normal Killing 3m1s (x2 over 155m) kubelet Stopping container shell
- Normal Pulling 2m31s (x3 over 156m) kubelet Pulling image "nginx"
- Normal Pulling 2m28s (x3 over 156m) kubelet Pulling image "busybox"
- Normal Created 2m28s (x3 over 156m) kubelet Created container nginx
- Normal Started 2m28s (x3 over 156m) kubelet Started container nginx
- Normal Pulled 2m28s kubelet Successfully pulled image "nginx" in 2.796081224s
- Normal Created 2m25s (x3 over 156m) kubelet Created container shell
- Normal Started 2m25s (x3 over 156m) kubelet Started container shell
- Normal Pulled 2m25s kubelet Successfully pulled image "busybox" in 2.856292466s
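When k8s detects that the pause container has exited, it kills and re-creates the POD sandbox and restarts the POD's containers. This is not hard to understand: whether or not the PID namespace is shared, once the infra container exits the POD has to be restarted, because the POD's lifecycle is tied to that of the infra container.
References
[1] pause.c: https://github.com/kubernetes/kubernetes/blob/master/build/pause/linux/pause.c
[2] Share process namespace: https://kubernetes.io/zh/docs/tasks/configure-pod-container/share-process-namespace/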