
Hands-On Experiments Plus Source Code Analysis: Understanding Linux Network Namespaces Thoroughly

Author: 張彥飛allen (Zhang Yanfei), 2021-10-26 00:17:21


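Hi everyone, I'm Fei Ge!

On Linux, veth lets us create as many virtual network devices as we like, and a Bridge, which emulates an Ethernet switch, lets those devices talk to each other. Virtualization needs one more essential ingredient, though: isolation. In Docker terms, container A must not be able to use container B's devices, or even see them. Only then can containers share the same hardware without interfering with one another.

The mechanism Linux uses for isolation is the namespace. Namespaces can isolate a container's process IDs, filesystem mount points, hostname, and several other resources. Today's focus is the network namespace, netns for short. It gives each namespace a logically independent network stack: network devices, routing tables, ARP tables, iptables rules, sockets, and so on. Each network namespace therefore behaves as if it were running on its own separate network.

If you are as curious as I am about how Linux implements network isolation under the hood, read on: today we dig into the internals of netns.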

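1. How to use netns

Day-to-day development rarely touches network namespaces, so let's first see how they are used. We will create a new namespace called net1, create a veth pair, and move one end of the pair into net1. Then we will compare the iptables rules, devices, and so on in the host and in net1, and finally let the two namespaces talk to each other.

Here are the steps in detail. First, create a new network namespace, net1.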

# ip netns add net1

Now look at its routing table, iptables rules, and network devices.

# ip netns exec net1 route 
Kernel IP routing table 
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface 
 
# ip netns exec net1 iptables -L 
Chain INPUT (policy ACCEPT) 
target     prot opt source               destination 
...... 
 
# ip netns exec net1 ip link list 
lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT qlen 1 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

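Because this netns was just created, its routing table and iptables rules are empty. It does come with a lo loopback device from the start, but that device is DOWN (not yet brought up) by default.

Next, create a veth pair and hand one end to the new namespace.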

# ip link add veth1 type veth peer name veth1_p 
# ip link set veth1 netns net1

Listing the devices on the host now shows only veth1_p; the veth1 interface is no longer visible.

# ip link list 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 ... 
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ... 
3: eth1: <BROADCAST,MULTICAST> mtu 1500 ... 
45: veth1_p@if46: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000 
    link/ether 0e:13:18:0a:98:9c brd ff:ff:ff:ff:ff:ff link-netnsid 0

That is because veth1 now lives in the net1 namespace.

# ip netns exec net1 ip link list 
1: lo: <LOOPBACK> mtu 65536 ... 
46: veth1@if45: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000 
    link/ether 7e:cd:ec:1c:5d:7a brd ff:ff:ff:ff:ff:ff link-netnsid 0

Assign an IP address to each end of the veth pair and bring both ends up. Note that veth1_p is still on the host, so only the veth1 commands go through ip netns exec.

# ip addr add 192.168.0.100/24 dev veth1_p 
# ip netns exec net1 ip addr add 192.168.0.101/24 dev veth1 
# ip link set dev veth1_p up  
# ip netns exec net1 ip link set dev veth1 up

Run ifconfig on the host and inside net1 to see which devices are now up.

# ifconfig 
eth0: ... 
lo: ... 
veth1_p: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500 
        inet 192.168.0.100  netmask 255.255.255.0  broadcast 0.0.0.0 
        ... 
 
# ip netns exec net1 ifconfig 
veth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500 
        inet 192.168.0.101  netmask 255.255.255.0  broadcast 0.0.0.0 
        ...

Now let's try to reach the host from net1.

# ip netns exec net1 ping 192.168.0.100 -I veth1 
PING 192.168.0.100 (192.168.0.100) from 192.168.0.101 veth1: 56(84) bytes of data. 
64 bytes from 192.168.0.100: icmp_seq=1 ttl=64 time=0.027 ms 
64 bytes from 192.168.0.100: icmp_seq=2 ttl=64 time=0.010 ms

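That completes the experiment of creating a new network namespace. Inside it, the network devices, routing table, ARP table, and iptables rules are all independent: they neither conflict with the host's nor interfere with those of any other namespace. And through veth, the namespace can still communicate with networks in other namespaces.

If you want to reproduce the experiment quickly, you can use a makefile I wrote: https://github.com/yanfeizhang/coder-kung-fu/tree/main/tests/network/test05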
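To check from a program which network namespace it is running in, one option is to read the /proc/self/ns/net symbolic link, which on Linux resolves to a string of the form net:[inode]. The helpers below are a minimal sketch of that idea; the function names read_netns_id, is_netns_link, and print_own_netns are made up for this illustration and are not part of any library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read a namespace link such as "/proc/self/ns/net" into buf.
 * Returns 0 on success (buf is NUL-terminated), -1 on failure.
 */
int read_netns_id(const char *ns_path, char *buf, size_t len)
{
    assert(ns_path != NULL && buf != NULL);
    if (len == 0)
        return -1;

    ssize_t n = readlink(ns_path, buf, len - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';   /* readlink() does not terminate the string */
    return 0;
}

/* True if the string looks like a network namespace link target. */
int is_netns_link(const char *id)
{
    return strncmp(id, "net:[", 5) == 0;
}

/* Print the network namespace identity of the calling process. */
int print_own_netns(void)
{
    char id[64];

    if (read_netns_id("/proc/self/ns/net", id, sizeof id) != 0) {
        perror("readlink /proc/self/ns/net");
        return -1;
    }
    printf("current netns: %s\n", id);
    return 0;
}
```

A small program that calls print_own_netns() from main should print different inode numbers when started directly on the host and when started through ip netns exec net1, since the two processes live in different network namespaces.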

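2. How namespaces are defined in the kernel

Many kernel components are tied to a namespace. We first look at how that association is expressed, then at the detailed structure of the namespace itself.

2.1 Things that belong to a namespace

In Linux, many familiar objects belong to one particular network namespace: processes, network devices, sockets, and so on.

Every process (thread) in Linux is represented by a task_struct. Each task_struct points to a namespace proxy object, nsproxy, which in turn contains the netns. Network devices and sockets, by contrast, record their owner directly in a member of their own.

Take network devices: a device shows up in ifconfig only if it belongs to the current netns; otherwise it is invisible. Let's go through the definitions of these data structures, starting with the process.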

//file:include/linux/sched.h 
struct task_struct { 
 /* namespaces */ 
 struct nsproxy *nsproxy; 
 ...... 
}

The core namespace data structure is the struct nsproxy referenced above. The per-process namespaces shown here (hostname, IPC, filesystem mount points, PIDs, and the network stack) all hang off this structure.

//file: include/linux/nsproxy.h 
struct nsproxy { 
 struct uts_namespace *uts_ns; // hostname 
 struct ipc_namespace *ipc_ns; // IPC 
 struct mnt_namespace *mnt_ns; // filesystem mount points 
 struct pid_namespace *pid_ns; // process IDs 
 struct net       *net_ns;  // network stack 
};

Here struct net *net_ns is the network namespace we are discussing today; its full definition comes a little later. Next is struct net_device, which represents a network device and also belongs to one network namespace.

//file: include/linux/netdevice.h 
struct net_device{ 
 // device name 
 char   name[IFNAMSIZ]; 
 
 // owning network namespace 
 struct net  *nd_net; 
 
 ... 
}

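A device created from the host, as in our experiment, starts out in the host's default network namespace. It can then be moved to another namespace with ip link set <device> netns <namespace>. In the experiment above, once veth1 was moved into net1 it "disappeared" from the host and became visible inside net1.

Sockets, which we use all the time, also belong to one network namespace.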

//file: 
struct sock_common { 
 struct net   *skc_net; 
}
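The ownership relations described in this subsection can be sketched in user-space C. This is a toy model, not kernel code: the struct names below are simplified stand-ins for task_struct, nsproxy, net, and net_device, and visible_devices is a hypothetical helper that mimics why a device only shows up for tasks in the same namespace.

```c
#include <assert.h>
#include <stddef.h>

struct net { int id; };                        /* stand-in for struct net */
struct nsproxy { struct net *net_ns; };        /* stand-in for struct nsproxy */
struct task { struct nsproxy *nsproxy; };      /* stand-in for task_struct */
struct netdev { const char *name; struct net *nd_net; };

/* Namespace a task lives in: task -> nsproxy -> net_ns. */
struct net *task_net(const struct task *t)
{
    assert(t != NULL && t->nsproxy != NULL);
    return t->nsproxy->net_ns;
}

/*
 * Collect the names of the devices a task can see: only those whose
 * nd_net points at the task's own namespace. names_out must have room
 * for n entries. Returns the number of visible devices.
 */
size_t visible_devices(const struct task *t, const struct netdev *devs,
                       size_t n, const char **names_out)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (devs[i].nd_net == task_net(t))
            names_out[count++] = devs[i].name;
    }
    return count;
}
```

With the device layout from the experiment (lo, eth0, and veth1_p owned by the host namespace, lo and veth1 owned by net1), a host task would see three devices and a task in net1 would see two, which matches the ip link list output earlier.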

2.2 Definition of the network namespace

This subsection looks at struct net, the main data structure of a network namespace.

Each struct net carries its own routing tables, iptables tables, kernel parameter settings, and so on. Here is the code.

//file:include/net/net_namespace.h 
struct net { 
 // every net has its own loopback device 
 struct net_device       *loopback_dev;          /* The loopback */ 
 
 // routing tables and netfilter tables live here 
 struct netns_ipv4 ipv4; 
 ...... 
 
 unsigned int  proc_inum; 
}

As the definition shows, every netns has a loopback_dev. That is the underlying reason a freshly created namespace in section 1 already contained an lo device.

The most important member of a network namespace is struct netns_ipv4 ipv4. This structure holds the routing tables, netfilter (iptables) tables, and assorted kernel parameters that are private to each network namespace.

//file: include/net/netns/ipv4.h 
struct netns_ipv4 { 
 // routing tables 
 struct fib_table *fib_local; 
 struct fib_table *fib_main; 
 struct fib_table *fib_default; 
 
 // iptables / arptables tables 
 struct xt_table  *iptable_filter; 
 struct xt_table  *iptable_raw; 
 struct xt_table  *arptable_filter; 
 
 // kernel parameters 
 long sysctl_tcp_mem[3]; 
 ... 
}

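3. Creating a network namespace

In section 1, the experiment boiled down to creating a netns and adding a veth device to it. This section looks at what happens inside the kernel during those steps.

3.1 Processes and network namespaces

Linux has a default network namespace, and process 1 starts out in it. All other user-space processes descend from process 1, and unless something else is requested when they are cloned, they all share this default network namespace.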

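clone accepts flags for the new process, all prefixed with CLONE_. The namespace-related ones include CLONE_NEWIPC, CLONE_NEWNET, CLONE_NEWNS, CLONE_NEWPID, and others. If CLONE_NEWNET is set when a process is created, that process creates and uses a new netns.

The kernel actually offers three ways to manipulate namespaces: clone, setns, and unshare. This article uses clone as the example; ip netns add uses unshare, which works on the same principle.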
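The unshare route can be tried from user space. The sketch below forks a child, has the child request a new network namespace, and reports how many interfaces the child sees there. It makes a few assumptions: it invokes the unshare system call through syscall() with the CLONE_NEWNET constant from <linux/sched.h> (glibc's unshare() wrapper would be the more usual spelling), it needs CAP_SYS_ADMIN (typically root), and the function name count_ifaces_in_new_netns is invented for this example.

```c
#include <assert.h>
#include <linux/sched.h>   /* CLONE_NEWNET */
#include <net/if.h>        /* if_nameindex(), if_freenameindex() */
#include <sys/syscall.h>   /* SYS_unshare */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Count the entries of an if_nameindex() array (terminated by index 0). */
static int count_entries(const struct if_nameindex *list)
{
    assert(list != NULL);
    int n = 0;
    while (list[n].if_index != 0)
        n++;
    return n;
}

/*
 * Fork a child, move it into a brand-new network namespace, and return
 * the number of interfaces visible there. Returns -1 if the namespace
 * could not be created (for example EPERM without CAP_SYS_ADMIN) or on
 * any other failure. The parent's namespace is left untouched.
 */
int count_ifaces_in_new_netns(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: ask the kernel for a fresh netns for this process only. */
        if (syscall(SYS_unshare, CLONE_NEWNET) != 0)
            _exit(255);
        struct if_nameindex *list = if_nameindex();
        if (list == NULL)
            _exit(255);
        int n = count_entries(list);
        if_freenameindex(list);
        _exit(n < 255 ? n : 254);   /* report the count via exit status */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    int code = WEXITSTATUS(status);
    return code == 255 ? -1 : code;
}
```

Run as root, the function should return a small positive number, because a fresh namespace contains the lo device (plus, on some systems, a few automatically created tunnel devices). Run without privileges it is expected to return -1.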

Let's first look at how the default network namespace is initialized.

//file: init/init_task.c 
struct task_struct init_task = INIT_TASK(init_task); 
 
//file: include/linux/init_task.h 
#define INIT_TASK(tsk)  \
{ \
 ... \
  .nsproxy = &init_nsproxy, \
}

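The code above sets up init_task, the kernel's very first task (PID 0), from which process 1 is later forked and so inherits the same nsproxy. As you can see, that nsproxy is the pre-built init_nsproxy. Here is how init_nsproxy is defined.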

//file: kernel/nsproxy.c 
struct nsproxy init_nsproxy = { 
 .uts_ns = &init_uts_ns, 
 .ipc_ns = &init_ipc_ns, 
 .mnt_ns = NULL, 
 .pid_ns = &init_pid_ns, 
 .net_ns = &init_net, 
};

The initial init_nsproxy initializes several namespaces. The one we care about, the network namespace, points to the default network namespace init_net, which is created when the system boots.

//file: net/core/net_namespace.c 
struct net init_net = { 
 .dev_base_head = LIST_HEAD_INIT(init_net.dev_base_head), 
}; 
EXPORT_SYMBOL(init_net); 
 
//file: net/core/net_namespace.c 
static int __init net_ns_init(void) 
{ 
 ... 
 setup_net(&init_net, &init_user_ns); 
 ... 
 register_pernet_subsys(&net_ns_ops); 
 return 0; 
}

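The setup_net function above initializes this default network namespace.

That explains how the namespaces of the initial task, and therefore of process 1, are set up. User-space processes are all created, directly or indirectly, by process 1. If CLONE_NEWNET is not specified when a child is created, the child simply keeps using the default network namespace.

If CLONE_NEWNET is specified, a new network namespace is allocated. The key function is copy_net_ns, reached through the call chain do_fork => copy_process => copy_namespaces => create_new_namespaces => copy_net_ns.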

//file: net/core/net_namespace.c 
struct net *copy_net_ns(unsigned long flags, 
   struct user_namespace *user_ns, struct net *old_net) 
{ 
 struct net *net; 
 int rv; 
 
 // Important! 
 // Without CLONE_NEWNET no new network namespace is created 
 if (!(flags & CLONE_NEWNET)) 
  return get_net(old_net); 
 
 // allocate a new network namespace and initialize it 
 net = net_alloc(); 
 rv = setup_net(net, user_ns); 
 ... 
}

Remember that setup_net is what initializes a network namespace; we will meet it again shortly.
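The share-or-allocate decision in copy_net_ns can be modelled in a few lines of user-space C. This is only a sketch under simplifying assumptions: struct toy_net keeps nothing but a reference count, the toy_ prefixed names are invented for the illustration, and TOY_CLONE_NEWNET merely reuses the bit value of the real flag.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_CLONE_NEWNET 0x40000000UL   /* same bit as the real CLONE_NEWNET */

struct toy_net {
    int refcount;   /* how many users share this namespace */
};

/* Counterpart of get_net(): share the namespace and bump its refcount. */
struct toy_net *toy_get_net(struct toy_net *net)
{
    net->refcount++;
    return net;
}

/*
 * Counterpart of copy_net_ns(): without the flag the child shares the
 * parent's namespace; with it a fresh one is allocated. Returns NULL
 * only if the allocation fails. The caller frees a fresh namespace.
 */
struct toy_net *toy_copy_net_ns(unsigned long flags, struct toy_net *old_net)
{
    assert(old_net != NULL);
    if (!(flags & TOY_CLONE_NEWNET))
        return toy_get_net(old_net);

    struct toy_net *net = calloc(1, sizeof(*net));
    if (net != NULL)
        net->refcount = 1;   /* the new namespace has exactly one user */
    return net;
}
```

Calling toy_copy_net_ns with flags set to 0 hands back the very same object with its count raised, while passing TOY_CLONE_NEWNET yields a distinct object. The real kernel function additionally runs setup_net on the new namespace, which this sketch leaves out.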

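3.2 Initializing the network subsystems inside a namespace

Every component inside a namespace is initialized from setup_net: the routing tables, TCP's proc pseudo-files, the iptables tables, and so on. That makes this subsection an important one.

Because the kernel's networking code is so complex, it is split into subsystems. Each subsystem defines a struct pernet_operations: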

//file: include/net/net_namespace.h 
struct pernet_operations { 
 // list linkage 
 struct list_head list; 
 
 // per-namespace init function of the subsystem 
 int (*init)(struct net *net); 
 
 // per-namespace exit functions of the subsystem 
 void (*exit)(struct net *net); 
 void (*exit_batch)(struct list_head *net_exit_list); 
 int *id; 
 size_t size; 
};

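Each subsystem calls register_pernet_subsys or register_pernet_device to add its init function to pernet_list, the global list kept by the network namespace core. Searching the kernel source tree for these two functions turns up the registration code of each subsystem.

Taking register_pernet_subsys as the example, here is a brief look at how it adds a subsystem to pernet_list.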

//file: net/core/net_namespace.c 
static struct list_head *first_device = &pernet_list; 
int register_pernet_subsys(struct pernet_operations *ops) 
{ 
 error =  register_pernet_operations(first_device, ops); 
 ... 
}

register_pernet_operations in turn calls __register_pernet_operations.

//file: include/net/net_namespace.h 
#define for_each_net(VAR)    \ 
 list_for_each_entry(VAR, &net_namespace_list, list) 
 
//file: net/core/net_namespace.c 
static int __register_pernet_operations(struct list_head *list, 
     struct pernet_operations *ops) 
{ 
 struct net *net; 
 int error; 
 
 list_add_tail(&ops->list, list); 
 if (ops->init || (ops->id && ops->size)) { 
  for_each_net(net) { 
   error = ops_init(ops, net); 
   ... 
  } 
 } 
 ... 
}

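The list_add_tail line links the struct pernet_operations *ops passed in by the subsystem into pernet_list. Note also that for_each_net walks every existing network namespace and runs ops_init for the subsystem in each of them.

That initialization happens when a subsystem registers. The reverse case is handled too: when a new namespace is created, the kernel walks the global pernet_list and runs the init function each subsystem registered. This brings us back to the setup_net function mentioned in 3.1.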

//file: net/core/net_namespace.c 
static __net_init int setup_net(struct net *net, struct user_namespace *user_ns) 
{ 
 const struct pernet_operations *ops; 
 int error; 
 
 list_for_each_entry(ops, &pernet_list, list) { 
  error = ops_init(ops, net); 
  ... 
 } 
 ... 
} 
 
//file: net/core/net_namespace.c 
static int ops_init(const struct pernet_operations *ops, struct net *net) 
{ 
 int err = 0; 
 ... 
 if (ops->init) 
  err = ops->init(net); 
 ... 
 return err; 
}

When a new namespace is created and setup_net runs, it finds every network subsystem through pernet_list and calls init on each one.

Take the routing tables: the routing subsystem registers fib_net_ops through register_pernet_subsys.

//file: net/ipv4/fib_frontend.c 
static struct pernet_operations fib_net_ops = { 
 .init = fib_net_init, 
 .exit = fib_net_exit, 
}; 
 
void __init ip_fib_init(void) 
{ 
 register_pernet_subsys(&fib_net_ops); 
 ... 
}

So whenever a new namespace is created, fib_net_init is called to build an independent set of routing tables for it.

The nat table of iptables works the same way: whenever a new namespace is created, iptable_nat_net_init is called to create a new table for it.

//file: net/ipv4/netfilter/iptable_nat.c 
static struct pernet_operations iptable_nat_net_ops = { 
 .init = iptable_nat_net_init, 
 .exit = iptable_nat_net_exit, 
}; 
static int __init iptable_nat_init(void) 
{ 
 err = register_pernet_subsys(&iptable_nat_net_ops); 
 ... 
}
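The two directions of initialization described in this subsection (a subsystem registering against namespaces that already exist, and a new namespace running every subsystem already registered) can be sketched as a toy registry. Everything here is an illustrative assumption: fixed-size arrays replace the kernel's linked lists, exit handlers are omitted, and all toy_ names are invented.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OPS  8
#define MAX_NETS 8

struct toy_net {
    int has_routes;   /* set by the toy routing subsystem's init */
    int has_nat;      /* set by the toy NAT subsystem's init */
};

struct toy_pernet_ops {
    int (*init)(struct toy_net *net);
};

static struct toy_pernet_ops *pernet_list[MAX_OPS];   /* registered subsystems */
static size_t nr_ops;
static struct toy_net *net_list[MAX_NETS];            /* existing namespaces */
static size_t nr_nets;

/* Like register_pernet_subsys(): remember ops, then init every existing netns. */
int toy_register_pernet_subsys(struct toy_pernet_ops *ops)
{
    assert(ops != NULL);
    if (nr_ops == MAX_OPS)
        return -1;
    pernet_list[nr_ops++] = ops;
    for (size_t i = 0; i < nr_nets; i++) {
        if (ops->init && ops->init(net_list[i]) != 0)
            return -1;
    }
    return 0;
}

/* Like setup_net(): run every registered init on the new netns. */
int toy_setup_net(struct toy_net *net)
{
    assert(net != NULL);
    if (nr_nets == MAX_NETS)
        return -1;
    for (size_t i = 0; i < nr_ops; i++) {
        if (pernet_list[i]->init && pernet_list[i]->init(net) != 0)
            return -1;
    }
    net_list[nr_nets++] = net;
    return 0;
}

/* Toy counterparts of fib_net_init and iptable_nat_net_init. */
int toy_fib_net_init(struct toy_net *net) { net->has_routes = 1; return 0; }
int toy_nat_net_init(struct toy_net *net) { net->has_nat = 1; return 0; }
```

Whatever the order of registration and namespace creation, every namespace ends up initialized by every subsystem. A namespace set up after the routing ops were registered gets has_routes immediately, and a NAT subsystem registered later is still applied to all namespaces that already exist.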

3.3 Adding a device

When a device, veth included, is first allocated, it is attached to the default network namespace init_net. It can be reassigned to a different network namespace afterwards.

For veth, the assignment to init_net happens in alloc_netdev_mqs during creation (code path: veth_newlink => rtnl_create_link => alloc_netdev_mqs).

//file: net/core/dev.c 
struct net_device *alloc_netdev_mqs(...) 
{ 
 dev_net_set(dev, &init_net); 
} 
 
//file: include/linux/netdevice.h 
void dev_net_set(struct net_device *dev,struct net *net) 
{ 
 release_net(dev->nd_net); 
 dev->nd_net = hold_net(net); 
}

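When a device's namespace is changed, dev->nd_net is repointed at the new netns. A veth consists of two devices, and the two can sit in different namespaces. That is the foundation on which a Docker container communicates with its host or with other containers.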

//file: net/core/dev.c 
int dev_change_net_namespace(struct net_device *dev, struct net *net, ...) 
{ 
 ... 
 dev_net_set(dev, net); 
 ... 
}
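Why a veth pair can bridge two namespaces becomes clearer with a toy model: each end stores its own nd_net pointer plus a pointer to its peer, so moving one end changes only that end's namespace. The code below is a user-space sketch with invented toy_ names; the transmit function only counts frames and reports the namespace in which they would be received.

```c
#include <assert.h>
#include <stddef.h>

struct toy_net { const char *name; };

struct toy_dev {
    const char *name;
    struct toy_net *nd_net;     /* owning namespace, like net_device.nd_net */
    struct toy_dev *peer;       /* the other end of the veth pair */
    unsigned long rx_packets;   /* frames delivered to this end */
};

/* Like dev_net_set(): repoint the device at a namespace. */
void toy_dev_net_set(struct toy_dev *dev, struct toy_net *net)
{
    dev->nd_net = net;
}

/* Create both ends of a pair in the same starting namespace. */
void toy_veth_create(struct toy_dev *a, struct toy_dev *b,
                     const char *name_a, const char *name_b,
                     struct toy_net *net)
{
    *a = (struct toy_dev){ .name = name_a, .peer = b };
    *b = (struct toy_dev){ .name = name_b, .peer = a };
    toy_dev_net_set(a, net);
    toy_dev_net_set(b, net);
}

/*
 * Transmit on one end: the frame shows up on the peer, and is therefore
 * received in whatever namespace the peer currently belongs to.
 */
struct toy_net *toy_veth_xmit(struct toy_dev *dev)
{
    assert(dev != NULL && dev->peer != NULL);
    dev->peer->rx_packets++;
    return dev->peer->nd_net;
}
```

Replaying the experiment from section 1 in this model: create veth1 and veth1_p in the host namespace, move veth1 to net1 with toy_dev_net_set, and a frame sent on veth1 lands in the host namespace while a frame sent on veth1_p lands in net1.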

4. Sending and receiving packets in a namespace

The previous section showed how the kernel creates a netns and how a network device is moved into another namespace. This section asks: once network namespaces are in the picture, what does packet transmission and reception look like?

4.1 Sockets and network namespaces

The first thing to consider is the familiar socket. Every socket belongs to one network namespace, an association already mentioned in 2.1.

Which netns that is depends on the netns of the process that creates the socket. When a process creates a socket, the kernel looks up the current process's nsproxy->net_ns and stores it in the socket's network namespace member, skc_net.

By default, the sockets we create all belong to the default network namespace init_net.

Let's look more closely at how a socket is placed into a network namespace. The member that records the association is skc_net, shown below.

//file: include/net/sock.h 
struct sock_common { 
 ... 
 struct net   *skc_net; 
}

When a socket is created, the kernel obtains the current process's netns through current->nsproxy->net_ns and eventually ties the socket's sk_net member (an alias for skc_net inside struct sock) to that namespace.

//file: net/socket.c 
int sock_create(int family, int type, int protocol, struct socket **res) 
{ 
 return __sock_create(current->nsproxy->net_ns, family, type, protocol, res, 0); 
}

See current->nsproxy->net_ns in sock_create? That is where the process's netns is fetched. The call then passes through __sock_create => inet_create => sk_alloc, and when sock_net_set is reached, the association between the new socket and the netns is established.

//file: include/net/sock.h 
static inline 
void sock_net_set(struct sock *sk, struct net *net) 
{ 
 write_pnet(&sk->sk_net, net); 
}
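The way a socket picks up its namespace can be reduced to a short user-space sketch. The types and toy_ functions below are invented stand-ins for task_struct, nsproxy, struct sock, sock_create, and sock_net; the point is only that the namespace is copied from the creating task at creation time and stored on the socket itself.

```c
#include <assert.h>
#include <stddef.h>

struct toy_net { int id; };
struct toy_nsproxy { struct toy_net *net_ns; };
struct toy_task { struct toy_nsproxy *nsproxy; };
struct toy_sock { struct toy_net *sk_net; };

/* Like sock_create(): stamp the creator's namespace on the new socket. */
void toy_sock_create(const struct toy_task *current, struct toy_sock *sk)
{
    assert(current != NULL && current->nsproxy != NULL && sk != NULL);
    sk->sk_net = current->nsproxy->net_ns;
}

/* Like sock_net(): read the namespace back from the socket. */
struct toy_net *toy_sock_net(const struct toy_sock *sk)
{
    return sk->sk_net;
}
```

One consequence of storing the pointer on the socket is that a socket keeps the namespace it was created in. If the task's nsproxy later points at a different namespace, sockets created afterwards use the new one, while the earlier socket still reports the old one.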

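4.2 Sending and receiving packets

Packet reception and transmission were covered in detail in two earlier articles, "Illustrated: the Linux network packet receive path" and "25 diagrams, 10,000 words: dissecting the Linux network packet send path".

This subsection does not repeat that material. Instead it takes the routing step of the send path as an example of how netns is used while a packet is in flight. The other steps of sending and receiving work along the same lines.

The basic idea is that a socket records the network namespace it belongs to. Before a route lookup, the kernel first finds that namespace, then the routing tables inside it, and only then performs the lookup. Routing in each namespace is thereby isolated from every other namespace.

Now for the route lookup source. As described in the send-path article, the IP-layer send function ip_queue_xmit calls ip_route_output_ports to find a route.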

//file: net/ipv4/ip_output.c 
int ip_queue_xmit(struct sk_buff *skb, struct flowi *fl) 
{ 
 rt = ip_route_output_ports(sock_net(sk), fl4, sk, 
     daddr, inet->inet_saddr, 
     ...); 
}

Note the sock_net(sk) call above: it retrieves the namespace recorded on the socket, struct net *sk_net.

//file: include/net/sock.h 
static inline struct net *sock_net(const struct sock *sk) 
{ 
 return read_pnet(&sk->sk_net); 
}

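Once the namespace has been found, it is passed along, as a struct net * pointer, to every function further down. As described in the earlier article on local network communication over 127.0.0.1, route lookup eventually reaches fib_lookup; its source is shown below.

The call chain is rather long: ip_route_output_ports => ip_route_output_flow => __ip_route_output_key => ip_route_output_key_hash => ip_route_output_key_hash_rcu, which ends up in fib_lookup.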

//file: include/net/ip_fib.h 
static inline int fib_lookup(struct net *net, ...) 
{ 
 struct fib_table *table; 
 table = fib_get_table(net, RT_TABLE_LOCAL); 
 table = fib_get_table(net, RT_TABLE_MAIN); 
 ... 
} 
 
static inline struct fib_table *fib_get_table(struct net *net, u32 id) 
{ 
 struct hlist_head *ptr; 
 
 ptr = id == RT_TABLE_LOCAL ? 
  &net->ipv4.fib_table_hash[TABLE_LOCAL_INDEX] : 
  &net->ipv4.fib_table_hash[TABLE_MAIN_INDEX]; 
 return hlist_entry(ptr->first, struct fib_table, tb_hlist); 
}

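As the code shows, routes are looked up in the namespace, struct net *net, that was determined in the earlier steps. Each namespace has its own net object, so each netns can naturally be configured with its own routing tables.

The other steps of sending and receiving work the same way: wherever isolation is required, the lookup goes through the namespace (struct net *).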
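The per-namespace route lookup can be illustrated with a toy version of fib_lookup. The sketch assumes a simple array-backed table with linear longest-prefix matching, which is far simpler than the kernel's real FIB structures, and all toy_ names and the ip4 helper are invented. What it keeps from the real code is the shape of the lookup: the namespace is the first argument, the local table is consulted before the main table, and nothing outside that namespace is ever searched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_route {
    uint32_t prefix;      /* network address, host byte order */
    int prefix_len;       /* 0..32 */
    const char *dev;      /* outgoing device name */
};

struct toy_fib_table {
    const struct toy_route *routes;
    size_t n;
};

struct toy_net {
    struct toy_fib_table fib_local;   /* consulted first */
    struct toy_fib_table fib_main;    /* consulted second */
};

/* Netmask for a prefix length; a length of 0 matches everything. */
static uint32_t prefix_mask(int len)
{
    return len == 0 ? 0 : 0xFFFFFFFFu << (32 - len);
}

/* Longest-prefix match inside one table; NULL if nothing matches. */
static const struct toy_route *table_lookup(const struct toy_fib_table *t,
                                            uint32_t daddr)
{
    const struct toy_route *best = NULL;

    for (size_t i = 0; i < t->n; i++) {
        const struct toy_route *r = &t->routes[i];
        if ((daddr & prefix_mask(r->prefix_len)) != r->prefix)
            continue;
        if (best == NULL || r->prefix_len > best->prefix_len)
            best = r;
    }
    return best;
}

/* Like fib_lookup(): the namespace decides which tables are searched. */
const struct toy_route *toy_fib_lookup(const struct toy_net *net, uint32_t daddr)
{
    assert(net != NULL);
    const struct toy_route *r = table_lookup(&net->fib_local, daddr);
    return r != NULL ? r : table_lookup(&net->fib_main, daddr);
}

/* Helper: build an IPv4 address from its four octets. */
uint32_t ip4(unsigned a, unsigned b, unsigned c, unsigned d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) | ((uint32_t)c << 8) | (uint32_t)d;
}
```

If the host namespace is given a route for 192.168.0.0/24 via veth1_p plus a default route via eth0, and net1 only a route for 192.168.0.0/24 via veth1, then the same destination resolves differently depending on which struct toy_net is passed in, and an address outside 192.168.0.0/24 has no route at all in net1.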

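5. Conclusion

It is often said that Linux network namespaces provide isolated, independent protocol stacks. Strictly speaking that is not quite accurate: there is only one copy of the kernel networking code, and the code itself is not isolated.

What the kernel does is create a separate struct net object for each namespace. Every struct net has its own routing tables, iptables tables, and other data structures, and every device and every socket carries a pointer saying which netns it belongs to. Logically, the result looks as if there really were multiple protocol stacks.

In this way several logical protocol stacks are created on a single physical machine, which is what made Docker containers possible.

Two containers, say Docker1 and Docker2, can each have their own network devices and configure their own routing rules and iptables rules, so their networking does not interfere with each other.

So, do you understand network namespaces a bit better now?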

Editor: 武曉燕. Source: 開發內功修煉
