Go Cmd 服務(wù)無法退出的小坑
本文轉(zhuǎn)載自微信公眾號(hào)「董澤潤(rùn)的技術(shù)筆記」,作者董澤潤(rùn)。轉(zhuǎn)載本文請(qǐng)聯(lián)系董澤潤(rùn)的技術(shù)筆記公眾號(hào)。
上家公司的案例。先說下使用背景,服務(wù)在每臺(tái)服務(wù)器上啟動(dòng) agent, 用戶會(huì)在指定機(jī)器上執(zhí)行任務(wù),并將結(jié)果返回到網(wǎng)頁上。執(zhí)行任務(wù)由用戶自定義腳本,一般也都是 shell 或是 python,會(huì)不斷的產(chǎn)生子進(jìn)程,孫進(jìn)程,直到執(zhí)行完畢或是超時(shí)被 kill。
問題
最近發(fā)現(xiàn)經(jīng)常有任務(wù),一直處于運(yùn)行中,但實(shí)際上己經(jīng)超時(shí)被 kill,并未將輸出寫到系統(tǒng),看不到任務(wù)的執(zhí)行情況
登錄機(jī)器,發(fā)現(xiàn)執(zhí)行腳本進(jìn)程己經(jīng)殺掉,但是有子腳本卡在某個(gè) http 調(diào)用。再看下這個(gè)腳本,python requests 默認(rèn)沒有設(shè)置超時(shí)...
總結(jié)一下現(xiàn)象:agent 用 go cmd 啟動(dòng)子進(jìn)程,子進(jìn)程還會(huì)啟動(dòng)孫進(jìn)程,孫進(jìn)程因某種原因阻塞。此時(shí),如果子進(jìn)程因超時(shí)被 agent kill 殺掉, agent 卻仍然處于 wait 狀態(tài)
復(fù)現(xiàn)
環(huán)境是 go version go1.16.5 linux/amd64, agent 使用 exec.CommandContext 啟動(dòng)任務(wù),設(shè)置 ctx 超時(shí) 30s,并將結(jié)果寫到 bytes.Buffer, 最后打印。簡(jiǎn)化例子如下:
- ~/zerun.dong/code/gotest# cat wait.go
- package main
- import (
- "bytes"
- "context"
- "fmt"
- "os/exec"
- "time"
- )
- func main() {
- ctx, cancelFn := context.WithTimeout(context.Background(), time.Second*30)
- defer cancelFn()
- cmd := exec.CommandContext(ctx, "./sleep.sh")
- var b bytes.Buffer
- cmd.Stdout = &b //劇透,坑在這里
- cmd.Stderr = &b
- cmd.Start()
- cmd.Wait()
- fmt.Println("recive: ", b.String())
- }
這個(gè)是 sleep.sh,模擬子進(jìn)程
- #!/bin/sh
- echo "in sleep"
- sh ./sleep1.sh
這是 sleep1.sh 模擬孫進(jìn)程,sleep 1000 阻塞在這里
- #!/bin/sh
- sleep 1000
###現(xiàn)象 啟動(dòng)測(cè)試 wait 程序,查看 ps axjf | less查看
- ppid pid pgid
- 2468 32690 32690 32690 ? -1 Ss 0 0:00 \_ sshd: root@pts/6
- 32690 32818 32818 32818 pts/6 28746 Ss 0 0:00 | \_ -bash
- 32818 28531 28531 32818 pts/6 28746 S 0 0:00 | \_ strace ./wait
- 28531 28543 28531 32818 pts/6 28746 Sl 0 0:00 | | \_ ./wait
- 28543 28559 28531 32818 pts/6 28746 S 0 0:00 | | \_ /bin/sh /root/dongzerun/sleep.sh
- 28559 28560 28531 32818 pts/6 28746 S 0 0:00 | | \_ sh /root/dongzerun/sleep1.sh
- 28560 28563 28531 32818 pts/6 28746 S 0 0:00 | | \_ sleep 1000
等過了 30s,通過 ps axjf | less 查看
- 2468 32690 32690 32690 ? -1 Ss 0 0:00 \_ sshd: root@pts/6
- 32690 32818 32818 32818 pts/6 36192 Ss 0 0:00 | \_ -bash
- 32818 28531 28531 32818 pts/6 36192 S 0 0:00 | \_ strace ./wait
- 28531 28543 28531 32818 pts/6 36192 Sl 0 0:00 | | \_ ./wait
- 1 28560 28531 32818 pts/6 36192 S 0 0:00 sh /root/dongzerun/sleep1.sh
- 28560 28563 28531 32818 pts/6 36192 S 0 0:00 \_ sleep 1000
通過上面的 case,可以看到 sleep1.sh 成了孤兒進(jìn)程,被 init 1 認(rèn)領(lǐng),但是 28543 wait 并沒有退出,那他在做什么???
分析
這個(gè)時(shí)候僵住了,祭出我們的 strace 大法,查看 wait 程序
- epoll_ctl(4, EPOLL_CTL_DEL, 6, {0, {u32=0, u64=0}}) = 0
- close(6) = 0
- futex(0xc420054938, FUTEX_WAKE, 1) = 1
- waitid(P_PID, 28559, {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=28559, si_status=SIGKILL, si_utime=0, si_stime=0}, WEXITED|WNOWAIT, NULL) = 0
- 卡在這里約 30s
- --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=28559, si_status=SIGKILL, si_utime=0, si_stime=0} ---
- rt_sigreturn() = 0
- futex(0x9a0378, FUTEX_WAKE, 1) = 1
- futex(0x9a02b0, FUTEX_WAKE, 1) = 1
- wait4(28559, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL}], 0, {ru_utime={0, 0}, ru_stime={0, 0}, ...}) = 28559
- futex(0x9a0b78, FUTEX_WAIT, 0, NULL
通過 go 源碼可以看到 go exec wait 時(shí),會(huì)先執(zhí)行 waitid, 阻塞在這里,然后再來一次 wait4 等待最終退出結(jié)果
不太明白為什么兩次 wait... 但是最后卡在了 futex 這里,看著像在等待什么資源???
打開 golang pprof, 再次運(yùn)行程序,并 pprof
- go func() {
- err := http.ListenAndServe(":6060", nil)
- if err != nil {
- fmt.Printf("failed to start pprof monitor:%s", err)
- }
- }()
- curl http://127.0.0.1:6060/debug/pprof/goroutine?debug=2
- goroutine 1 [chan receive]:
- os/exec.(*Cmd).Wait(0xc42017a000, 0x7c3d40, 0x0)
- /usr/local/go/src/os/exec/exec.go:454 +0x135
- main.main()
- /root/dongzerun/wait.go:32 +0x167
程序沒有退出,并不可思議的卡在了 exec.go:454 行代碼,查看源碼:
- // Wait releases any resources associated with the Cmd.
- func (c *Cmd) Wait() error {
- ......
- state, err := c.Process.Wait()
- if c.waitDone != nil {
- close(c.waitDone)
- }
- c.ProcessState = state
- var copyError error
- for range c.goroutine {
- //卡在了這里
- if err := <-c.errch; err != nil && copyError == nil {
- copyError = err
- }
- }
- c.closeDescriptors(c.closeAfterWait)
- ......
- return copyError
- }
通過源代碼分析,程序 wait 卡在了 <-c.errch 獲取 chan 數(shù)據(jù)。那么 errch 是如何生成的呢?
查看 cmd.Start 源碼,go 將 cmd.Stdin, cmd.Stdout, cmd.Stderr 組織成 *os.File,并依次寫到數(shù)組childFiles 中,這個(gè)數(shù)組索引就對(duì)應(yīng)子進(jìn)程的 0,1,2 文描術(shù)符,即子進(jìn)程的標(biāo)準(zhǔn)輸入,輸出,錯(cuò)誤
- type F func(*Cmd) (*os.File, error)
- for _, setupFd := range []F{(*Cmd).stdin, (*Cmd).stdout, (*Cmd).stderr} {
- fd, err := setupFd(c)
- if err != nil {
- c.closeDescriptors(c.closeAfterStart)
- c.closeDescriptors(c.closeAfterWait)
- return err
- }
- c.childFiles = append(c.childFiles, fd)
- }
- c.childFiles = append(c.childFiles, c.ExtraFiles...)
- var err error
- c.Process, err = os.StartProcess(c.Path, c.argv(), &os.ProcAttr{
- Dir: c.Dir,
- Files: c.childFiles,
- Env: dedupEnv(c.envv()),
- Sys: c.SysProcAttr,
- })
在執(zhí)行 setupFd 時(shí),會(huì)有一個(gè)關(guān)鍵的操作,打開 pipe 管道,封裝一個(gè)匿名 func,功能就是將子進(jìn)程的輸出結(jié)果寫到 pipe 或是將 pipe 數(shù)據(jù)寫到子進(jìn)程標(biāo)準(zhǔn)輸入,最后關(guān)閉 pipe
這個(gè)匿名函數(shù)最終在 Start 時(shí)執(zhí)行
- func (c *Cmd) stdin() (f *os.File, err error) {
- if c.Stdin == nil {
- f, err = os.Open(os.DevNull)
- if err != nil {
- return
- }
- c.closeAfterStart = append(c.closeAfterStart, f)
- return
- }
- if f, ok := c.Stdin.(*os.File); ok {
- return f, nil
- }
- pr, pw, err := os.Pipe()
- if err != nil {
- return
- }
- c.closeAfterStart = append(c.closeAfterStart, pr)
- c.closeAfterWait = append(c.closeAfterWait, pw)
- c.goroutine = append(c.goroutine, func() error {
- _, err := io.Copy(pw, c.Stdin)
- if skip := skipStdinCopyError; skip != nil && skip(err) {
- err = nil
- }
- if err1 := pw.Close(); err == nil {
- err = err1
- }
- return err
- })
- return pr, nil
- }
重新運(yùn)行測(cè)試 case,并用 lsof 查看進(jìn)程打開了哪些資源
- root@nb1963:~/dongzerun# ps aux |grep wait
- root 4531 0.0 0.0 122180 6520 pts/6 Sl 17:24 0:00 ./wait
- root 4726 0.0 0.0 10484 2144 pts/6 S+ 17:24 0:00 grep --color=auto wait
- root@nb1963:~/dongzerun#
- root@nb1963:~/dongzerun# ps aux |grep sleep
- root 4543 0.0 0.0 4456 688 pts/6 S 17:24 0:00 /bin/sh /root/dongzerun/sleep.sh
- root 4548 0.0 0.0 4456 760 pts/6 S 17:24 0:00 sh /root/dongzerun/sleep1.sh
- root 4550 0.0 0.0 5928 748 pts/6 S 17:24 0:00 sleep 1000
- root 4784 0.0 0.0 10480 2188 pts/6 S+ 17:24 0:00 grep --color=auto sleep
- root@nb1963:~/dongzerun#
- root@nb1963:~/dongzerun# lsof -p 4531
- COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
- wait 4531 root 0w CHR 1,3 0t0 1029 /dev/null
- wait 4531 root 1w REG 8,1 94371 4991345 /root/dongzerun/nohup.out
- wait 4531 root 2w REG 8,1 94371 4991345 /root/dongzerun/nohup.out
- wait 4531 root 3u IPv6 2005568215 0t0 TCP *:6060 (LISTEN)
- wait 4531 root 4u 0000 0,10 0 9076 anon_inode
- wait 4531 root 5r FIFO 0,9 0t0 2005473170 pipe
- root@nb1963:~/dongzerun# lsof -p 4543
- COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
- sleep.sh 4543 root 0r CHR 1,3 0t0 1029 /dev/null
- sleep.sh 4543 root 1w FIFO 0,9 0t0 2005473170 pipe
- sleep.sh 4543 root 2w FIFO 0,9 0t0 2005473170 pipe
- sleep.sh 4543 root 10r REG 8,1 55 4993949 /root/dongzerun/sleep.sh
- root@nb1963:~/dongzerun# lsof -p 4550
- COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
- sleep 4550 root mem REG 8,1 1607664 9179617 /usr/lib/locale/locale-archive
- sleep 4550 root 0r CHR 1,3 0t0 1029 /dev/null
- sleep 4550 root 1w FIFO 0,9 0t0 2005473170 pipe
- sleep 4550 root 2w FIFO 0,9 0t0 2005473170 pipe
原因總結(jié)
孫進(jìn)程啟動(dòng)后,默認(rèn)會(huì)繼承父進(jìn)程打開的文件描述符,即 node 2005473170 的 pipe
那么當(dāng)父進(jìn)程被 kill -9 后會(huì)清理資源,關(guān)閉打開的文件,但是 close 只是引用計(jì)數(shù)減 1。實(shí)際上 孫進(jìn)程 仍然打開著 pipe?;仡^看 agent 代碼
- c.goroutine = append(c.goroutine, func() error {
- _, err := io.Copy(pw, c.Stdin)
- if skip := skipStdinCopyError; skip != nil && skip(err) {
- err = nil
- }
- if err1 := pw.Close(); err == nil {
- err = err1
- }
- return err
- })
那么當(dāng)子進(jìn)程執(zhí)行結(jié)束后,go cmd 執(zhí)行這個(gè)匿名函數(shù)的 io.Copy 來讀取子進(jìn)程輸出數(shù)據(jù),永遠(yuǎn)沒有數(shù)據(jù)可讀,也沒有超時(shí),阻塞在 copy 這里
解決方案
原因找到了,解決方法也就有了。
- 子進(jìn)程啟動(dòng)孫進(jìn)程時(shí),增加 CloseOnEec 標(biāo)記,但不現(xiàn)實(shí),還要看孫進(jìn)程的輸出日志
- io.Copy 改寫,增加超時(shí)調(diào)用,理論上可行,但是要改源碼
- 超時(shí) kill, 不單殺子進(jìn)程,而是殺掉進(jìn)程組,此時(shí) pipe 會(huì)被真正的關(guān)閉,觸發(fā) io.Copy 返回
最終采用方案 3,簡(jiǎn)化代碼如下,主要改動(dòng)點(diǎn)有兩處:
SysProcAttr 配置 Setpgid,讓子進(jìn)程與孫進(jìn)程,擁有獨(dú)立的進(jìn)程組id,即子進(jìn)程的 pid
Syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL) 殺進(jìn)程時(shí)指定進(jìn)程組
- func Run(instance string, env map[string]string) bool {
- var (
- cmd *exec.Cmd
- proc *Process
- sysProcAttr *syscall.SysProcAttr
- )
- t := time.Now()
- sysProcAttr = &syscall.SysProcAttr{
- Setpgid: true, // 使子進(jìn)程擁有自己的 pgid,等同于子進(jìn)程的 pid
- Credential: &syscall.Credential{
- Uid: uint32(uid),
- Gid: uint32(gid),
- },
- }
- // 超時(shí)控制
- ctx, cancel := context.WithTimeout(context.Background(), time.Duration(j.Timeout)*time.Second)
- defer cancel()
- if j.ShellMode {
- cmd = exec.Command("/bin/bash", "-c", j.Command)
- } else {
- cmd = exec.Command(j.cmd[0], j.cmd[1:]...)
- }
- cmd.SysProcAttr = sysProcAttr
- var b bytes.Buffer
- cmd.Stdout = &b
- cmd.Stderr = &b
- if err := cmd.Start(); err != nil {
- j.Fail(t, instance, fmt.Sprintf("%s\n%s", b.String(), err.Error()), env)
- return false
- }
- waitChan := make(chan struct{}, 1)
- defer close(waitChan)
- // 超時(shí)殺掉進(jìn)程組 或正常退出
- go func() {
- select {
- case <-ctx.Done():
- log.Warnf("timeout kill job %s-%s %s ppid:%d", j.Group, j.ID, j.Name, cmd.Process.Pid)
- syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
- case <-waitChan:
- }
- }()
- if err := cmd.Wait(); err != nil {
- j.Fail(t, instance, fmt.Sprintf("%s\n%s", b.String(), err.Error()), env)
- return false
- }
- return true
- }
但這種方式,也有個(gè)局限,目前只適用于類 linux 平臺(tái)
小結(jié)
大家也可以看到,只要權(quán)限足夠,問題穩(wěn)定復(fù)現(xiàn),沒有查不出來的問題。套路也都差不多,回歸問題開始,python request 庫不寫 timeout 的比比皆是 ...