A Brief Analysis of Linux Shared Memory and the tmpfs Filesystem
Introduction
Shared memory is mainly used for inter-process communication. Linux has two shared memory mechanisms:
(1) **System V shared memory (shmget/shmat/shmdt)**
Original shared memory mechanism, still widely used. Sharing between unrelated processes.
(2) **POSIX shared memory (shm_open/shm_unlink)**
Sharing between unrelated processes, without overhead of filesystem I/O. Intended to be simpler and better than older APIs.
In addition, memory mappings on Linux deserve a mention, since they can also be used for inter-process communication:
**Shared mappings – mmap(2)**
* Shared anonymous mappings: Sharing between related processes only (related via fork())
* Shared file mappings: Sharing between unrelated processes, backed by file in filesystem
System V shared memory has a long history, is widely used, and is supported by many Unix-like systems. In general, it is also the mechanism we usually reach for when writing programs. How to use these APIs is not discussed here; for a detailed introduction to POSIX shared memory, see here (1) and here (2).
**After all that, the question is: what does shared memory have to do with tmpfs?**
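To make the "related via fork()" case concrete, here is a minimal sketch of a shared anonymous mapping. The helper name share_via_anonymous_mapping is hypothetical and error handling is kept short; it only illustrates that a page mapped with MAP_SHARED | MAP_ANONYMOUS before fork() is visible to both parent and child.

```c
#define _DEFAULT_SOURCE 1   /* exposes MAP_ANONYMOUS under strict -std modes */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: the child stores `value` in a shared anonymous
 * mapping and the parent reads it back after the child has exited.
 * Returns the value seen by the parent, or -1 on error.
 */
int share_via_anonymous_mapping(int value)
{
    int status;
    int seen;
    pid_t pid;
    /* No file behind the mapping; MAP_SHARED keeps it shared across fork() */
    int *cell = mmap(NULL, sizeof(*cell), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    if (cell == MAP_FAILED)
        return -1;
    *cell = 0;

    pid = fork();
    if (pid < 0) {
        munmap(cell, sizeof(*cell));
        return -1;
    }
    if (pid == 0) {
        /* Child: write into the shared page, then exit immediately */
        *cell = value;
        _exit(0);
    }

    /* Parent: wait for the child, then read what it wrote */
    if (waitpid(pid, &status, 0) < 0) {
        munmap(cell, sizeof(*cell));
        return -1;
    }
    seen = *cell;
    munmap(cell, sizeof(*cell));
    return seen;
}
```

With MAP_PRIVATE instead of MAP_SHARED, the child's write would land in its own copy-on-write page and the parent would still read 0.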
The POSIX shared memory object implementation on Linux 2.4 makes use of a dedicated filesystem, which is normally mounted under /dev/shm.
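This shows that POSIX shared memory is implemented on top of tmpfs. In fact, going one step further, not only PSM (POSIX shared memory) but also SSM (System V shared memory) is implemented on top of tmpfs inside the kernel.
Introduction to tmpfs
Below is the description of tmpfs from the kernel documentation: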
tmpfs has the following uses:
1) There is always a kernel internal mount which you will not see at all. This is used for shared anonymous mappings and SYSV shared memory.
This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not set, the user visible part of tmpfs is not build. But the internal mechanisms are always present.
2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for POSIX shared memory (shm_open, shm_unlink). Adding the following line to /etc/fstab should take care of this:
tmpfs /dev/shm tmpfs defaults 0 0
Remember to create the directory that you intend to mount tmpfs on if necessary.
This mount is not needed for SYSV shared memory. The internal mount is used for that. (In the 2.3 kernel versions it was necessary to mount the predecessor of tmpfs (shm fs) to use SYSV shared memory)
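From this we can see that tmpfs has two main uses:
(1) SYSV shared memory and shared anonymous mappings; this part is managed by the kernel and is not visible to users;
(2) POSIX shared memory; the user is responsible for mounting it, usually at /dev/shm, and it depends on CONFIG_TMPFS.
At this point we understand the difference between SSM and PSM, and also the role of /dev/shm.
Let's run some tests:
Tests
We set the tmpfs at /dev/shm to 64M:
- # mount -o remount,size=64M /dev/shm
- # df -lh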
- Filesystem Size Used Avail Use% Mounted on
- tmpfs 64M 0 64M 0% /dev/shm
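The figures that df prints can also be read from a program with statvfs(3). The following is a small sketch; the helper name fs_size_bytes is an assumption made for illustration, and the path to inspect (for example "/dev/shm") is passed in by the caller.

```c
#include <sys/statvfs.h>

/*
 * Hypothetical helper: report the total and available size, in bytes,
 * of the filesystem that contains `path`.
 * Returns 0 on success, -1 if statvfs() fails.
 */
int fs_size_bytes(const char *path,
                  unsigned long long *total,
                  unsigned long long *avail)
{
    struct statvfs st;

    if (statvfs(path, &st) != 0)
        return -1;
    /* f_blocks and f_bavail are counted in units of f_frsize bytes */
    *total = (unsigned long long)st.f_blocks * st.f_frsize;
    *avail = (unsigned long long)st.f_bavail * st.f_frsize;
    return 0;
}
```

Calling fs_size_bytes("/dev/shm", &total, &avail) after the remount above should report a total of about 64M, matching the Size column of df.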
The maximum size of a SYSV shared memory segment is 32M:
- # cat /proc/sys/kernel/shmmax
- 33554432
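The same check can be done from C before calling shmget(). This is only a sketch: the helper names read_ull_from_file and fits_in_shmmax are hypothetical, and the path is a parameter so that the caller decides which file is read (normally /proc/sys/kernel/shmmax).

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse one unsigned decimal integer from a text
 * file such as /proc/sys/kernel/shmmax.
 * Returns 0 on success, -1 on any error.
 */
int read_ull_from_file(const char *path, unsigned long long *out)
{
    char buf[64];
    char *end;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);

    errno = 0;
    *out = strtoull(buf, &end, 10);
    if (errno != 0 || end == buf)
        return -1;
    return 0;
}

/*
 * Would a segment of `size` bytes pass the shmmax limit stored in
 * `shmmax_path`? Returns 1 if it fits, 0 if it is too large, -1 on error.
 */
int fits_in_shmmax(const char *shmmax_path, unsigned long long size)
{
    unsigned long long max;

    if (read_ull_from_file(shmmax_path, &max) != 0)
        return -1;
    return size <= max ? 1 : 0;
}
```

With shmmax at 33554432, fits_in_shmmax("/proc/sys/kernel/shmmax", 68157440ULL) returns 0, which corresponds to the "Invalid argument" failure shown in step (1) below.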
(1) Creating a 65M System V shared memory segment fails:
- # ipcmk -M 68157440
- ipcmk: create share memory failed: Invalid argument
This is expected, because 65M exceeds shmmax.
(2) Raise shmmax to 65M
- # echo 68157440 > /proc/sys/kernel/shmmax
- # cat /proc/sys/kernel/shmmax
- 68157440
- # ipcmk -M 68157440
- Shared memory id: 0
- # ipcs -m
- ------ Shared Memory Segments --------
- key shmid owner perms bytes nattch status
- 0xef46b249 0 root 644 68157440 0
As we can see, the size of System V shared memory is not affected by the size of /dev/shm.
(3) Create a POSIX shared memory object
- /* gcc -o shmopen shmopen.c -lrt */
- #include <unistd.h>
- #include <fcntl.h>
- #include <sys/stat.h>
- #include <sys/types.h>
- #include <sys/mman.h>
- #include <stdio.h>
- #include <stdlib.h>
- #define MAP_SIZE 68157440 /* 65M */
- int main(int argc, char *argv[])
- {
-     int fd;
-     fd = shm_open("/shm1", O_RDWR|O_CREAT, 0644);
-     if (fd < 0) {
-         printf("shm_open failed\n");
-         exit(1);
-     }
-     /* Set the object size; this does not allocate any pages yet */
-     if (ftruncate(fd, MAP_SIZE) < 0) {
-         printf("ftruncate failed\n");
-         exit(1);
-     }
-     return 0;
- }
- # ./shmopen
- # ls -lh /dev/shm/shm1
- -rw-r--r-- 1 root root 65M Mar 3 06:19 /dev/shm/shm1
Although /dev/shm is only 64M, creating a 65M POSIX shared memory object still succeeds.
(4) Write data to the POSIX shared memory object
- /* gcc -o shmwrite shmwrite.c -lrt */
- #include <unistd.h>
- #include <fcntl.h>
- #include <sys/stat.h>
- #include <sys/types.h>
- #include <sys/mman.h>
- #include <stdio.h>
- #include <stdlib.h>
- #include <string.h> /* memset */
- #define MAP_SIZE 68157440 /* 65M */
- int main(int argc, char *argv[])
- {
-     int fd;
-     void *result;
-     fd = shm_open("/shm1", O_RDWR|O_CREAT, 0644);
-     if (fd < 0) {
-         printf("shm_open failed\n");
-         exit(1);
-     }
-     if (ftruncate(fd, MAP_SIZE) < 0) {
-         printf("ftruncate failed\n");
-         exit(1);
-     }
-     result = mmap(NULL, MAP_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
-     if (result == MAP_FAILED) {
-         printf("mapped failed\n");
-         exit(1);
-     }
-     /* ... operate result pointer */
-     printf("memset\n");
-     /* Touching every page forces tmpfs to allocate all 65M */
-     memset(result, 0, MAP_SIZE);
-     //shm_unlink("/shm1");
-     return 0;
- }
- # ./shmwrite
- memset
- Bus error
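As we can see, writing 65M of data fails with a Bus error: the tmpfs at /dev/shm runs out of blocks once more than 64M worth of pages have been touched.
However, new files can still be created in /dev/shm:
- # ls -lh /dev/shm/
- total 64M
- -rw-r--r-- 1 root root 65M Mar 3 15:23 shm1
- -rw-r--r-- 1 root root 65M Mar 3 15:24 shm2
This is expected: ls shows inode->size, not the number of blocks actually allocated.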
- # stat /dev/shm/shm2
- File: "/dev/shm/shm2"
- Size: 68157440 Blocks: 0 IO Block: 4096 regular file
- Device: 10h/16d Inode: 217177 Links: 1
- Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
- Access: 2015-03-03 15:24:28.025985167 +0800
- Modify: 2015-03-03 15:24:28.025985167 +0800
- Change: 2015-03-03 15:24:28.025985167 +0800
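Since ftruncate() only sets the size and leaves Blocks at 0, one way to avoid a later Bus error is to reserve the blocks before mapping. The sketch below does this with posix_fallocate(3); the helper name map_with_reserved_blocks is hypothetical, and it assumes the filesystem behind the descriptor reports a shortage as an error from posix_fallocate() (on tmpfs this is expected to be ENOSPC).

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: size the object behind `fd`, reserve all of its
 * blocks, then map it shared. `fd` would typically come from shm_open().
 * Returns the mapping, or NULL with errno set if any step fails.
 */
void *map_with_reserved_blocks(int fd, size_t size)
{
    int err;
    void *p;

    if (ftruncate(fd, (off_t)size) < 0)
        return NULL;

    /* posix_fallocate() returns an error number and does not set errno */
    err = posix_fallocate(fd, 0, (off_t)size);
    if (err != 0) {
        errno = err;
        return NULL;
    }

    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Used in place of the ftruncate() and mmap() calls of shmwrite.c, this should turn the 65M case into an ordinary error return before memset() is reached, instead of a signal in the middle of the write.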
(5) Write data to SYS V shared memory
Set the maximum size of a System V shared memory segment to 65M (/dev/shm is still 64M).
- # cat /proc/sys/kernel/shmmax
- 68157440
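- /* gcc -o shmv shmv.c */
- #include <sys/ipc.h>
- #include <sys/shm.h>
- #include <sys/types.h>
- #include <stdio.h>
- #include <string.h>
- #include <unistd.h>
- #define MAP_SIZE 68157440 /* 65M */
- int main(int argc, char **argv)
- {
-     int shm_id;
-     key_t key;
-     char *p_map;
-     const char *name = "/dev/shm/shm3"; /* must already exist for ftok() */
-     key = ftok(name, 1); /* proj_id must be nonzero */
-     if (key == -1) {
-         perror("ftok error");
-         return 1;
-     }
-     shm_id = shmget(key, MAP_SIZE, IPC_CREAT | 0644);
-     if (shm_id == -1) {
-         perror("shmget error");
-         return 1;
-     }
-     p_map = (char *)shmat(shm_id, NULL, 0);
-     if (p_map == (char *)-1) {
-         perror("shmat error");
-         return 1;
-     }
-     /* Touch every page of the 65M segment */
-     memset(p_map, 0, MAP_SIZE);
-     if (shmdt(p_map) == -1)
-         perror("detach error");
-     return 0;
- }
- # ./shmv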
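This, however, runs without error.
(6) Conclusion
Although System V and POSIX shared memory are both implemented on tmpfs, they are subject to different limits. That is, /proc/sys/kernel/shmmax only affects SYS V shared memory, and the size of /dev/shm only affects POSIX shared memory. In fact, System V and POSIX shared memory use two different tmpfs instances.
Kernel analysis
During initialization, the kernel automatically mounts a tmpfs filesystem and stores the mount as shm_mnt: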
- // mm/shmem.c
- static struct file_system_type shmem_fs_type = {
-     .owner = THIS_MODULE,
-     .name = "tmpfs",
-     .get_sb = shmem_get_sb,
-     .kill_sb = kill_litter_super,
- };
- int __init shmem_init(void)
- {
-     ...
-     error = register_filesystem(&shmem_fs_type);
-     if (error) {
-         printk(KERN_ERR "Could not register tmpfs\n");
-         goto out2;
-     }
-     /* mount the internal tmpfs (used for SYS V) */
-     shm_mnt = vfs_kern_mount(&shmem_fs_type, MS_NOUSER, shmem_fs_type.name, NULL);
-     ...
- }
Mounting /dev/shm follows the same path as mounting any ordinary filesystem, so it is not discussed further. Note, however, that the default size of /dev/shm is 1/2 of physical memory:
- shmem_get_sb –> shmem_fill_super
- // mm/shmem.c
- int shmem_fill_super(struct super_block *sb, void *data, int silent)
- {
-     ...
- #ifdef CONFIG_TMPFS
-     /*
-      * Per default we only allow half of the physical ram per
-      * tmpfs instance, limiting inodes to one per page of lowmem;
-      * but the internal instance is left unlimited.
-      */
-     if (!(sb->s_flags & MS_NOUSER)) { /* the kernel's internal mount sets MS_NOUSER */
-         sbinfo->max_blocks = shmem_default_max_blocks();
-         sbinfo->max_inodes = shmem_default_max_inodes();
-         if (shmem_parse_options(data, sbinfo, false)) {
-             err = -EINVAL;
-             goto failed;
-         }
-     }
-     sb->s_export_op = &shmem_export_ops;
- #else
-     ...
- #ifdef CONFIG_TMPFS
- static unsigned long shmem_default_max_blocks(void)
- {
-     return totalram_pages / 2;
- }
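As we can see, because the kernel passes MS_NOUSER when mounting its internal tmpfs, that instance has no size limit, so the memory available to SYS V shared memory is limited only by /proc/sys/kernel/shmmax. The user-mounted /dev/shm, by contrast, defaults to 1/2 of physical memory.
Note that this size-limit code is only compiled in when CONFIG_TMPFS is set.
In addition, creating a file in /dev/shm goes through the VFS interface, whereas SYS V shared memory and shared anonymous mappings are set up through shmem_file_setup.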
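The "half of physical memory" default can be estimated from user space. The sketch below mirrors the division in shmem_default_max_blocks(); the function names are hypothetical, and it assumes that sysconf(_SC_PHYS_PAGES) is close enough to the kernel's totalram_pages for an estimate.

```c
#include <unistd.h>

/* Same arithmetic as shmem_default_max_blocks(): half of the RAM pages. */
unsigned long default_tmpfs_blocks(unsigned long total_ram_pages)
{
    return total_ram_pages / 2;
}

/*
 * Estimated default tmpfs size of this machine in bytes,
 * or 0 if the page counts are not available.
 */
unsigned long long default_tmpfs_bytes(void)
{
    long pages = sysconf(_SC_PHYS_PAGES);    /* approximates totalram_pages */
    long page_size = sysconf(_SC_PAGESIZE);

    if (pages <= 0 || page_size <= 0)
        return 0;
    return (unsigned long long)default_tmpfs_blocks((unsigned long)pages)
           * (unsigned long long)page_size;
}
```

On a machine with 4G of RAM and 4096-byte pages, this comes to roughly 2G, which is what df would show for a /dev/shm mounted with default options.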
SIGBUS
When an application accesses the address range of a shared memory mapping and the corresponding physical page has not been allocated yet, the fault method is called. If the allocation fails, it returns an OOM or SIGBUS error:
- static const struct vm_operations_struct shmem_vm_ops = {
-     .fault = shmem_fault,
- #ifdef CONFIG_NUMA
-     .set_policy = shmem_set_policy,
-     .get_policy = shmem_get_policy,
- #endif
- };
- static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
- {
-     struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
-     int error;
-     int ret = VM_FAULT_LOCKED;
-     error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);
-     if (error)
-         return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);
-     return ret;
- }
shmem_getpage -> shmem_getpage_gfp:
- /*
-  * shmem_getpage_gfp - find page in cache, or get from swap, or allocate
-  *
-  * If we allocate a new one we do not mark it dirty. That's up to the
-  * vm. If we swap it in we mark it dirty since we also free the swap
-  * entry since a page cannot live in both the swap and page cache
-  */
- static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
-     struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type)
- {
-     ...
-     if (sbinfo->max_blocks) { /* set for /dev/shm, 0 for the internal mount */
-         if (percpu_counter_compare(&sbinfo->used_blocks, sbinfo->max_blocks) >= 0) {
-             error = -ENOSPC;
-             goto unacct;
-         }
-         percpu_counter_inc(&sbinfo->used_blocks);
-     }
-     /* allocate a physical page */
-     page = shmem_alloc_page(gfp, info, index);
-     if (!page) {
-         error = -ENOMEM;
-         goto decused;
-     }
-     SetPageSwapBacked(page);
-     __set_page_locked(page);
-     error = mem_cgroup_cache_charge(page, current->mm, gfp & GFP_RECLAIM_MASK); /* mem_cgroup check */
-     if (!error)
-         error = shmem_add_to_page_cache(page, mapping, index, gfp, NULL);
-     ...
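The max_blocks check in the excerpt above explains both test results. Below is a toy user-space model of just that check, not kernel code: struct toy_sbinfo and both functions are invented for illustration, and a 4096-byte page size is assumed in the figures that follow.

```c
#include <errno.h>

/* Toy stand-in for the two superblock fields used by the check. */
struct toy_sbinfo {
    unsigned long max_blocks;   /* 0 means "no limit" (internal mount) */
    unsigned long used_blocks;
};

/* Account for one newly faulted page: 0 on success, -ENOSPC when full. */
int toy_charge_block(struct toy_sbinfo *sb)
{
    if (sb->max_blocks) {
        if (sb->used_blocks >= sb->max_blocks)
            return -ENOSPC;     /* the fault handler turns this into SIGBUS */
        sb->used_blocks++;
    }
    return 0;
}

/*
 * Fault in `pages` pages one at a time, as memset() would.
 * Returns how many pages were charged before the first failure.
 */
unsigned long toy_touch_pages(struct toy_sbinfo *sb, unsigned long pages)
{
    unsigned long done = 0;

    while (done < pages && toy_charge_block(sb) == 0)
        done++;
    return done;
}
```

A 64M /dev/shm corresponds to max_blocks = 16384, and the 65M object needs 16640 pages, so toy_touch_pages() stops after 16384 pages; that is the point where the real program receives SIGBUS. With max_blocks = 0, as on the internal mount used by System V shared memory, all 16640 pages are charged.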
Shared memory and cgroups
Currently, shared memory pages are charged to the first cgroup that touches them. See:
* http://lwn.net/Articles/516541/
* https://www.kernel.org/doc/Documentation/cgroups/memory.txt
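POSIX shared memory and Docker
At the time of writing, Docker limits /dev/shm to 64M and provides no option to change it, which is a poor choice. Any application that uses large POSIX shared memory objects will inevitably run into problems. (Later Docker releases added a --shm-size option to docker run.) See:
* https://github.com/docker/docker/issues/2606
* https://github.com/docker/docker/pull/4981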
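Summary
(1) POSIX shared memory and SYS V shared memory are both implemented on tmpfs inside the kernel, but they use two different tmpfs instances that are independent of each other.
(2) /proc/sys/kernel/shmmax limits the maximum size of a single SYS V shared memory segment; the size of /dev/shm limits the total size of all POSIX shared memory objects.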