Exploring Go BIO/NIO: How Go netpoll Works

Author: 趙帥虎 2023-03-07 08:00:12

When we talk about netpoll, we usually mean the mechanism in the Go runtime that uses epoll to watch many sockets at once and wake the right goroutine when data arrives. The code lives in runtime/netpoll.go and runtime/netpoll_epoll.go (only Linux is considered here). For this purpose the runtime provides two groups of functions:

Group 1: called by the Go runtime.

netpoll: checks for sockets with pending events and returns the list of goroutines that have become ready (pdReady); built on epoll_wait.
netpollBreak: writes one byte to netpollBreakWr. The byte travels through the pipe to netpollBreakRd, epoll_wait sees the event on the read end of the pipe and returns immediately.

Group 2: called by internal/poll, net, net/http, and so on.

poll_runtime_pollServerInit (netpollGenericInit): initializes the poller; built on epoll_create1.
poll_runtime_pollOpen: adds a socket to the watch list; built on epoll_ctl.
poll_runtime_pollWait: waits for an event on a socket and may park (gopark) the current goroutine, with the help of netpollblock.
poll_runtime_pollUnblock: wakes the goroutines blocked on a descriptor, for example when the descriptor is being closed.
poll_runtime_pollClose: removes a socket from the watch list; built on epoll_ctl.
poll_runtime_pollReset: used by prepareRead/prepareWrite in non-blocking mode to reset the descriptor's state before the next I/O operation.

These functions are all linked to internal/poll.runtime_xxx, for example runtime_pollServerInit and runtime_pollOpen.

The sections below walk through the main ones.

netpollGenericInit: initializing the poller

netpollGenericInit makes sure the poller is initialized; the atomic variable netpollInited guarantees that this happens only once.

func netpollGenericInit() {
  if atomic.Load(&netpollInited) == 0 {
    lockInit(&netpollInitLock, lockRankNetpollInit)
    lock(&netpollInitLock)
    if netpollInited == 0 {
      netpollinit()
      atomic.Store(&netpollInited, 1)
    }
    unlock(&netpollInitLock)
  }
}
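The check, lock, re-check shape of netpollGenericInit can be reproduced in ordinary Go. The following is a minimal sketch, not the runtime's code: genericInit, initRuns and runConcurrently are hypothetical names, and a counter stands in for the call to netpollinit.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

var (
	inited   uint32     // 0 = not initialized, 1 = initialized
	initLock sync.Mutex // serializes the slow path
	initRuns int32      // how many times the real initialization ran
)

// genericInit is a hypothetical user-space analogue of netpollGenericInit.
func genericInit() {
	if atomic.LoadUint32(&inited) == 0 { // fast path: no lock once initialized
		initLock.Lock()
		if inited == 0 { // re-check while holding the lock
			atomic.AddInt32(&initRuns, 1) // stands in for netpollinit()
			atomic.StoreUint32(&inited, 1)
		}
		initLock.Unlock()
	}
}

// runConcurrently calls genericInit from n goroutines and reports how many
// times the underlying initialization actually ran.
func runConcurrently(n int) int32 {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			genericInit()
		}()
	}
	wg.Wait()
	return atomic.LoadInt32(&initRuns)
}

func main() {
	fmt.Println("initialization runs:", runConcurrently(50))
}
```

In application code sync.Once is the usual way to get the same guarantee; the explicit version is shown only because it mirrors the structure of the runtime function.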

This function is only a shell. The initialization logic is in netpollinit, whose implementation is platform-specific. On Linux, init does the following:

Create the epoll fd with the epoll_create1 system call.
Create a pipe with a read end and a write end. A property of a pipe is that data written to the write end can be read back from the read end.
Add the fd of the read end of the pipe to the watch list with epoll_ctl.

With this dedicated pipe in place, the runtime can interrupt epoll_wait on demand and make netpoll return immediately.

func netpollinit() {
  epfd = epollcreate1(_EPOLL_CLOEXEC)
  if epfd < 0 {
    epfd = epollcreate(1024)
    if epfd < 0 {
      println("runtime: epollcreate failed with", -epfd)
      throw("runtime: netpollinit failed")
    }
    closeonexec(epfd)
  }
  r, w, errno := nonblockingPipe()
  if errno != 0 {
    println("runtime: pipe failed with", -errno)
    throw("runtime: pipe failed")
  }
  ev := epollevent{
    events: _EPOLLIN,
  }
  *(**uintptr)(unsafe.Pointer(&ev.data)) = &netpollBreakRd
  errno = epollctl(epfd, _EPOLL_CTL_ADD, r, &ev)
  if errno != 0 {
    println("runtime: epollctl failed with", -errno)
    throw("runtime: epollctl failed")
  }
  netpollBreakRd = uintptr(r)
  netpollBreakWr = uintptr(w)
}
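The wake-up pipe idea can be tried outside the runtime. The sketch below is an illustration under simplifying assumptions: instead of registering the read end with epoll, a goroutine simply blocks in Read on it, and wakeViaPipe is a hypothetical helper name. The short sleep only makes it likely that the reader is already blocked when the byte is written.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// wakeViaPipe blocks a goroutine on the read end of a pipe and wakes it by
// writing one byte to the write end. It returns how many bytes the waiter read.
func wakeViaPipe() (int, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return 0, err
	}
	defer r.Close()
	defer w.Close()

	type result struct {
		n   int
		err error
	}
	done := make(chan result, 1)
	go func() {
		buf := make([]byte, 16)
		n, err := r.Read(buf) // blocks until a byte arrives
		done <- result{n, err}
	}()

	time.Sleep(10 * time.Millisecond) // demo only: let the reader block first
	if _, err := w.Write([]byte{0}); err != nil {
		return 0, err
	}
	res := <-done
	return res.n, res.err
}

func main() {
	n, err := wakeViaPipe()
	fmt.Println("bytes seen by the waiter:", n, "error:", err)
}
```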

The netpoll function

netpoll checks for network connections that are ready. Its workflow (happy path) is:

Create an array of 128 epollevent entries to receive events.
Call epollwait to wait for events; this relies on the epoll_wait system call.
Iterate over the epoll events. For each event, recover the pollDesc stored in the event's data field and call netpollready, which finds the goroutine blocked on it and marks the semaphore pdReady.
Return the list of goroutines that are ready to run (gList).

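struct pollDesc contains two semaphore fields, rg and wg, each of which is in one of four states:

pdReady: an io-ready notification is pending. A goroutine consumes the notification by changing the state to nil.
pdWait: a goroutine is about to block on the semaphore but has not parked yet. If the goroutine parks via gopark, the state becomes a G pointer; if a concurrent io-ready notification arrives, it becomes pdReady; if a concurrent timeout/close notification arrives, it becomes nil.
G pointer: a goroutine is blocked on the semaphore. It can be woken by two kinds of events: an io-ready notification changes the state to pdReady, and a timeout/close notification changes it to nil.
nil: none of the above.

The corresponding helper functions:

netpollblock moves the semaphore from nil to pdWait and then parks the current goroutine with gopark; if the state is already pdReady, it consumes the notification and returns without parking.
netpollunblock moves the semaphore from pdWait or a G pointer to pdReady (io ready) or nil (timeout/close), and returns the goroutine to wake, if any.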
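The wake-up side of this state machine can be sketched with a compare-and-swap loop. This is a simplified model based on the state descriptions above, not the runtime source: semNil, semReady, semWait and unblock are hypothetical names, and any value other than those three constants stands for the handle of a parked goroutine.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Simplified stand-ins for the semaphore states.
const (
	semNil   uintptr = 0 // nil
	semReady uintptr = 1 // pdReady
	semWait  uintptr = 2 // pdWait
)

// unblock moves the semaphore to semReady (ioready) or semNil (timeout/close)
// and returns the handle of the parked waiter, or 0 if nobody must be woken.
func unblock(sem *uintptr, ioready bool) uintptr {
	for {
		old := atomic.LoadUintptr(sem)
		if old == semReady {
			return 0 // a notification is already pending
		}
		if old == semNil && !ioready {
			return 0 // nothing to cancel
		}
		next := semNil
		if ioready {
			next = semReady
		}
		if atomic.CompareAndSwapUintptr(sem, old, next) {
			if old == semWait {
				return 0 // the waiter had not parked yet
			}
			return old // 0 for semNil, otherwise the parked waiter
		}
		// Lost a race with a concurrent state change: retry.
	}
}

func main() {
	sem := uintptr(42) // hypothetical handle of a parked goroutine
	woken := unblock(&sem, true)
	fmt.Println("woken:", woken, "state is ready:", sem == semReady)
}
```

The loop retries only when the compare-and-swap fails, which is how the pdWait case can be resolved concurrently by either the parking goroutine or the notifier.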

The code of netpoll is in runtime/netpoll_epoll.go. Part of it is shown below:

func netpoll(delay int64) gList {
  // epfd is -1, so there is nothing to poll
  if epfd == -1 {
    return gList{}
  }
  var waitms int32
  // ... (some code omitted)
  var events [128]epollevent
retry:
  n := epollwait(epfd, &events[0], int32(len(events)), waitms)
  if n < 0 {
    if n != -_EINTR {
      println("runtime: epollwait on fd", epfd, "failed with", -n)
      throw("runtime: netpoll failed")
    }
    // If a timed sleep was interrupted, just return to
    // recalculate how long we should sleep now.
    if waitms > 0 {
      return gList{}
    }
    goto retry
  }
  var toRun gList
  for i := int32(0); i < n; i++ {
    ev := &events[i]
    if ev.events == 0 {
      continue
    }

    if *(**uintptr)(unsafe.Pointer(&ev.data)) == &netpollBreakRd {
      // ... the read end of the pipe has data (netpollBreak was called);
      // no goroutine needs to be woken up
      continue
    }

    var mode int32
    if ev.events&(_EPOLLIN|_EPOLLRDHUP|_EPOLLHUP|_EPOLLERR) != 0 {
      mode += 'r'
    }
    if ev.events&(_EPOLLOUT|_EPOLLHUP|_EPOLLERR) != 0 {
      mode += 'w'
    }
    if mode != 0 {
      pd := *(**pollDesc)(unsafe.Pointer(&ev.data))
      pd.setEventErr(ev.events == _EPOLLERR)
      // mark the goroutine pdReady
      // and append it to toRun (*gList)
      netpollready(&toRun, pd, mode)
    }
  }
  return toRun
}

Note: netpollready uses netpollunblock to change the goroutine's state and adds the goroutine to the list of io-ready goroutines.
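The mode value passed to netpollready is built by adding the characters 'r' and 'w'. The sketch below isolates that computation; modeFor is a hypothetical helper, and the constants are the Linux epoll flag values from sys/epoll.h written out by hand so the example does not depend on the syscall package.

```go
package main

import "fmt"

// Linux epoll event flags (values from <sys/epoll.h>).
const (
	epollIn    uint32 = 0x1
	epollOut   uint32 = 0x4
	epollErr   uint32 = 0x8
	epollHup   uint32 = 0x10
	epollRdHup uint32 = 0x2000
)

// modeFor maps epoll event flags to the 'r' / 'w' / 'r'+'w' mode used to
// decide whether readers, writers, or both should be woken.
func modeFor(events uint32) int32 {
	var mode int32
	if events&(epollIn|epollRdHup|epollHup|epollErr) != 0 {
		mode += 'r'
	}
	if events&(epollOut|epollHup|epollErr) != 0 {
		mode += 'w'
	}
	return mode
}

func main() {
	fmt.Println(modeFor(epollIn) == 'r')
	fmt.Println(modeFor(epollOut) == 'w')
	fmt.Println(modeFor(epollHup) == 'r'+'w')
}
```

Error and hang-up events set both bits, so a goroutine blocked on either direction gets a chance to observe the failure.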

When the runtime calls netpoll it usually does so in non-blocking mode (delay=0). Only in the last stage of findrunnable does it check whether a separate M (the M in GMP) is already doing net polling; if none is, it blocks in netpoll for the time given by the delay parameter.

The netpollBreak function

What netpollBreak does is simple, but its implementation is interesting. It interacts with netpoll through the variable netpollWakeSig; because the two can run concurrently, every operation on that variable is atomic.

// netpollBreak interrupts an epollwait.
func netpollBreak() {
  if atomic.Cas(&netpollWakeSig, 0, 1) {
    for {
      var b byte
      n := write(netpollBreakWr, unsafe.Pointer(&b), 1)
      if n == 1 {
        break
      }
      if n == -_EINTR {
        continue
      }
      if n == -_EAGAIN {
        return
      }
      println("runtime: netpollBreak write failed with", -n)
      throw("runtime: netpollBreak write failed")
    }
  }
}
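The effect of the atomic.Cas guard is that many concurrent callers produce a single write to the pipe. The sketch below models only that de-duplication: waker, wake, drain and stampede are hypothetical names, a counter stands in for the write to netpollBreakWr, and the reset in drain is an assumption about what the consumer side does after it has handled the wake-up.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// waker coalesces wake-up requests, in the spirit of netpollWakeSig.
type waker struct {
	pending uint32 // 1 while a wake-up is in flight
	writes  int32  // how many "pipe writes" actually happened
}

// wake requests a wake-up; only the caller that flips pending 0 -> 1 writes.
func (w *waker) wake() {
	if atomic.CompareAndSwapUint32(&w.pending, 0, 1) {
		atomic.AddInt32(&w.writes, 1) // stands in for write(netpollBreakWr, ...)
	}
}

// drain is called by the consumer once it has handled the wake-up.
func (w *waker) drain() {
	atomic.StoreUint32(&w.pending, 0)
}

// writeCount reports how many writes were issued so far.
func (w *waker) writeCount() int32 {
	return atomic.LoadInt32(&w.writes)
}

// stampede calls wake from n goroutines at once and waits for all of them.
func stampede(w *waker, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			w.wake()
		}()
	}
	wg.Wait()
}

func main() {
	w := &waker{}
	stampede(w, 100)
	fmt.Println("writes after 100 concurrent wake calls:", w.writeCount())
}
```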

The poll_runtime_pollOpen function

The logic of poll_runtime_pollOpen has four steps:

Allocate memory for a pollDesc.
Initialize the pollDesc object.
Register the fd with epoll through netpollopen (on Linux, netpollopen is epoll_ctl).
Return the pollDesc object.

poll_runtime_pollOpen is implemented in runtime/netpoll.go. The main logic is:

//go:linkname poll_runtime_pollOpen internal/poll.runtime_pollOpen
func poll_runtime_pollOpen(fd uintptr) (*pollDesc, int) {
  pd := pollcache.alloc()
  lock(&pd.lock)
  wg := pd.wg.Load()
  if wg != 0 && wg != pdReady {
    throw("runtime: blocked write on free polldesc")
  }
  rg := pd.rg.Load()
  if rg != 0 && rg != pdReady {
    throw("runtime: blocked read on free polldesc")
  }
  pd.fd = fd
  // ... (part of the initialization omitted)
  unlock(&pd.lock)

  errno := netpollopen(fd, pd)
  if errno != 0 {
    pollcache.free(pd)
    return nil, int(errno)
  }
  return pd, 0
}

// in runtime/netpoll_epoll.go
func netpollopen(fd uintptr, pd *pollDesc) int32 {
  var ev epollevent
  ev.events = _EPOLLIN | _EPOLLOUT | _EPOLLRDHUP | _EPOLLET
  *(**pollDesc)(unsafe.Pointer(&ev.data)) = pd
  return -epollctl(epfd, _EPOLL_CTL_ADD, int32(fd), &ev)
}

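The poll_runtime_pollWait function

poll_runtime_pollWait is only a wrapper around netpollblock that adds error handling. Note that it is not triggered by the runtime itself but by user programs, for example by a Read on a connection that has no data yet.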

func poll_runtime_pollWait(pd *pollDesc, mode int) int {
  errcode := netpollcheckerr(pd, int32(mode))
  if errcode != pollNoError {
    return errcode
  }
  // As for now only Solaris, illumos, and AIX use level-triggered IO.
  if GOOS == "solaris" || GOOS == "illumos" || GOOS == "aix" {
    netpollarm(pd, mode)
  }
  for !netpollblock(pd, int32(mode), false) {
    errcode = netpollcheckerr(pd, int32(mode))
    if errcode != pollNoError {
      return errcode
    }
    // Can happen if timeout has fired and unblocked us,
    // but before we had a chance to run, timeout has been reset.
    // Pretend it has not happened and retry.
  }
  return pollNoError
}
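From the caller's side, waiting on the poller usually sits inside a try, wait, retry loop around a non-blocking operation. The sketch below shows that pattern with stand-ins only: fakeFD, tryRead, waitRead and readWithWait are hypothetical, errAgain plays the role of EAGAIN, and waitRead merely counts calls where the real function would park the goroutine.

```go
package main

import (
	"errors"
	"fmt"
)

// errAgain stands in for EAGAIN from a non-blocking read.
var errAgain = errors.New("resource temporarily unavailable")

// fakeFD is a hypothetical non-blocking descriptor that is not ready for
// the first notReady calls to tryRead.
type fakeFD struct {
	notReady int
	waits    int
}

// tryRead models a non-blocking read system call.
func (f *fakeFD) tryRead() (string, error) {
	if f.notReady > 0 {
		f.notReady--
		return "", errAgain
	}
	return "hello", nil
}

// waitRead stands in for waiting on the poller; the real wait parks the
// goroutine until the descriptor is reported readable.
func (f *fakeFD) waitRead() error {
	f.waits++
	return nil
}

// readWithWait retries the non-blocking read, waiting whenever the
// descriptor reports that no data is available yet.
func readWithWait(f *fakeFD) (string, error) {
	for {
		s, err := f.tryRead()
		if errors.Is(err, errAgain) {
			if werr := f.waitRead(); werr != nil {
				return "", werr
			}
			continue
		}
		return s, err
	}
}

func main() {
	f := &fakeFD{notReady: 2}
	s, err := readWithWait(f)
	fmt.Println(s, err, "waits:", f.waits)
}
```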

Next, let us see how a user program reaches the poll_runtime_xxx functions. First, sockets fall into two kinds: LISTEN sockets (server sockets) and ESTABLISHED sockets (TCPConn):

A LISTEN socket is created with the system calls socket/bind/listen.
An ESTABLISHED socket is created with the system call accept.

LISTEN socket (server socket)

From the point of view of an http server, the call chain that registers the LISTEN socket with epoll is:

// net/http/server.go
func ListenAndServe(addr string, handler Handler) error

// net/http/server.go
func (srv *Server) ListenAndServe() error

// net/dial.go
func Listen(network, address string) (Listener, error) {
  var lc ListenConfig
  return lc.Listen(context.Background(), network, address)
}

// net/dial.go
func (lc *ListenConfig) Listen(ctx context.Context, network, address string) (Listener, error)

// net/tcpsock_posix.go
func (sl *sysListener) listenTCP(ctx context.Context, laddr *TCPAddr) (*TCPListener, error)

// net/ipsock_posix.go
func internetSocket(ctx context.Context, net string, laddr, raddr sockaddr, sotype, proto int, mode string, ctrlFn func(string, string, syscall.RawConn) error) (fd *netFD, err error) 

// net/sock_posix.go
func socket(ctx context.Context, net string, family, sotype, proto int, ipv6only bool, laddr, raddr sockaddr, ctrlFn func(string, string, syscall.RawConn) error) (fd *netFD, err error)

// net/sock_posix.go
func (fd *netFD) listenStream(laddr sockaddr, backlog int, ctrlFn func(string, string, syscall.RawConn) error) error {
  // ... (part of the code omitted)
  if err = fd.init(); err != nil {
    return err
  }
  // ... (part of the code omitted)
}

// net/fd_unix.go
func (fd *netFD) init() error {
  // fd.pfd has type poll.FD
  return fd.pfd.Init(fd.net, true)
}

// internal/poll/fd_unix.go
func (fd *FD) Init(net string, pollable bool) error {
  // We don't actually care about the various network types.
  if net == "file" {
    fd.isFile = true
  }
  if !pollable {
    fd.isBlocking = 1
    return nil
  }
  err := fd.pd.init(fd)
  if err != nil {
    // If we could not initialize the runtime poller,
    // assume we are using blocking mode.
    fd.isBlocking = 1
  }
  return err
}

// internal/poll/fd_poll_runtime.go
func (pd *pollDesc) init(fd *FD) error {
  serverInit.Do(runtime_pollServerInit)
  ctx, errno := runtime_pollOpen(uintptr(fd.Sysfd))
  if errno != 0 {
    return errnoErr(syscall.Errno(errno))
  }
  pd.runtimeCtx = ctx
  return nil
}
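At the user level none of this machinery is visible: the program calls Accept, Read and Write as if they were blocking, and only the calling goroutine is parked. The following is a small loopback sketch using the standard net package; echoOnce is a hypothetical helper that starts a listener on an ephemeral port, echoes one connection, and returns what the client read back.

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// echoOnce starts a TCP listener on a free loopback port, echoes a single
// connection, and returns what the client received.
func echoOnce(msg string) (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer ln.Close()

	go func() {
		conn, err := ln.Accept() // the goroutine waits here until a client connects
		if err != nil {
			return
		}
		defer conn.Close()
		io.Copy(conn, conn) // echo until the client closes its side
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return "", err
	}
	defer conn.Close()

	if _, err := conn.Write([]byte(msg)); err != nil {
		return "", err
	}
	buf := make([]byte, len(msg))
	if _, err := io.ReadFull(conn, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	got, err := echoOnce("ping")
	fmt.Println(got, err)
}
```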

ESTABLISHED socket (TCPConn)

The http server accepts a new TCP connection:

// net/http/server.go
func (srv *Server) Serve(l net.Listener) error {
  for {
    rw, err := l.Accept()

// net/tcpsock.go
func (l *TCPListener) Accept() (Conn, error)

// net/tcpsock_posix.go
func (ln *TCPListener) accept() (*TCPConn, error) {
  fd, err := ln.fd.accept()

// net/fd_unix.go
func (fd *netFD) accept() (netfd *netFD, err error) {
  d, rsa, errcall, err := fd.pfd.Accept()
  // (part of the code omitted)
  if err = netfd.init(); err != nil
  // (part of the code omitted)


// internal/poll/fd_unix.go
func (fd *FD) Init(net string, pollable bool) error

// internal/poll/fd_poll_runtime.go
func (pd *pollDesc) init(fd *FD) error

About the net.netFD struct

netFD wraps a socket (a network file descriptor). For a server socket, the accept method obtains a new TCP connection (an ESTABLISHED socket) from the server socket (the LISTEN socket). The ESTABLISHED socket returned by Linux's accept system call is an int value; newFD and init wrap it into a complete netFD struct, which is later wrapped into a net.TCPConn.

To the operating system, a LISTEN socket and an ESTABLISHED socket are both just file descriptors of type int, with no essential difference. The system calls accept and read both read from a socket (a LISTEN socket becomes readable when a connection is pending), so epoll watches both kinds in the same set.
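That both kinds of socket are plain descriptors can be observed from user code through SyscallConn, which *net.TCPListener and *net.TCPConn both provide. The sketch below is an illustration only: rawFD and listenerAndConnFDs are hypothetical helpers, and on non-Unix systems the reported value is a handle rather than a small integer.

```go
package main

import (
	"fmt"
	"net"
	"syscall"
)

// rawFD returns the OS-level descriptor behind anything exposing SyscallConn.
func rawFD(c syscall.Conn) (uintptr, error) {
	rc, err := c.SyscallConn()
	if err != nil {
		return 0, err
	}
	var fd uintptr
	if err := rc.Control(func(f uintptr) { fd = f }); err != nil {
		return 0, err
	}
	return fd, nil
}

// listenerAndConnFDs opens a loopback listener, accepts one connection and
// returns the descriptors of the LISTEN and the ESTABLISHED socket.
func listenerAndConnFDs() (uintptr, uintptr, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, 0, err
	}
	defer ln.Close()

	client, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return 0, 0, err
	}
	defer client.Close()

	accepted, err := ln.Accept() // the ESTABLISHED socket on the server side
	if err != nil {
		return 0, 0, err
	}
	defer accepted.Close()

	lfd, err := rawFD(ln.(*net.TCPListener))
	if err != nil {
		return 0, 0, err
	}
	cfd, err := rawFD(accepted.(*net.TCPConn))
	if err != nil {
		return 0, 0, err
	}
	return lfd, cfd, nil
}

func main() {
	lfd, cfd, err := listenerAndConnFDs()
	fmt.Println("listen fd:", lfd, "accepted fd:", cfd, "error:", err)
}
```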

Here are the definition of netFD and the implementation of its accept method:

// Network file descriptor.
type netFD struct {
  pfd poll.FD

  // immutable until Close
  family      int
  sotype      int
  isConnected bool // handshake completed or use of association with peer
  net         string
  laddr       Addr
  raddr       Addr
}

func (fd *netFD) accept() (netfd *netFD, err error) {
  d, rsa, errcall, err := fd.pfd.Accept()
  if err != nil {
    if errcall != "" {
      err = wrapSyscallError(errcall, err)
    }
    return nil, err
  }

  if netfd, err = newFD(d, fd.family, fd.sotype, fd.net); err != nil {
    poll.CloseFunc(d)
    return nil, err
  }
  if err = netfd.init(); err != nil {
    netfd.Close()
    return nil, err
  }
  lsa, _ := syscall.Getsockname(netfd.pfd.Sysfd)
  netfd.setAddr(netfd.addrFunc()(lsa), netfd.addrFunc()(rsa))
  return netfd, nil
}

net.netFD relies on poll.FD for its polling. As the names suggest, net.netFD wraps network-specific functionality, while poll.FD is a more general FD that wraps the operations available on a file descriptor. Its definition is:

// FD is a file descriptor. The net and os packages use this type as a
// field of a larger type representing a network connection or OS file.
type FD struct {
  // Lock sysfd and serialize access to Read and Write methods.
  fdmu fdMutex

  // System file descriptor. Immutable until Close.
  Sysfd int

  // I/O poller.
  pd pollDesc

  // Writev cache.
  iovecs *[]syscall.Iovec

  // Semaphore signaled when file is closed.
  csema uint32

  // Non-zero if this file has been set to blocking mode.
  isBlocking uint32

  // Whether this is a streaming descriptor, as opposed to a
  // packet-based descriptor like a UDP socket. Immutable.
  IsStream bool

  // Whether a zero byte read indicates EOF. This is false for a
  // message based socket connection.
  ZeroReadIsEOF bool

  // Whether this is a file rather than a network socket.
  isFile bool
}

poll.FD relies on poll.pollDesc for its polling. poll.pollDesc implements IO polling; its methods, such as init, wait, close and prepare, are all wrappers around the runtime_pollXXX family of functions. Part of pollDesc is shown below:

type pollDesc struct {
  runtimeCtx uintptr
}

var serverInit sync.Once

func (pd *pollDesc) init(fd *FD) error {
  serverInit.Do(runtime_pollServerInit)
  ctx, errno := runtime_pollOpen(uintptr(fd.Sysfd))
  if errno != 0 {
    return errnoErr(syscall.Errno(errno))
  }
  pd.runtimeCtx = ctx
  return nil
}