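Docker Fundamentals: A Source-Code Analysis of Linux Namespaces
In the previous article we took a brief, code-based look at the six namespaces Linux provides, from the perspective of process clone. This article goes further into the kernel source, to give you a deeper understanding of the isolation mechanism behind Docker namespaces. I am using Linux 4.1.19. The namespace code changes relatively rarely, so kernels from 3.0 onward are broadly similar, although some details (for example a few field names) differ between versions.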
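Starting from the kernel process descriptor task_struct
Linux namespaces exist to isolate the resources a process can see, so the process descriptor must carry some namespace-related information. That is where we start reading the code.
First find task_struct, the structure that describes a process, and in it the pointer to the namespace structure, struct nsproxy *nsproxy (sched.h):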
    struct task_struct {
        ......
        /* namespaces */
        struct nsproxy *nsproxy;
        ......
    };

The nsproxy structure is defined in nsproxy.h:
    /*
     * A structure to contain pointers to all per-process
     * namespaces - fs (mount), uts, network, sysvipc, etc.
     *
     * 'count' is the number of tasks holding a reference.
     * The count for each namespace, then, will be the number
     * of nsproxies pointing to it, not the number of tasks.
     *
     * The nsproxy is shared by tasks which share all namespaces.
     * As soon as a single namespace is cloned or unshared, the
     * nsproxy is copied.
     */
    struct nsproxy {
        atomic_t count;
        struct uts_namespace *uts_ns;
        struct ipc_namespace *ipc_ns;
        struct mnt_namespace *mnt_ns;
        struct pid_namespace *pid_ns_for_children;
        struct net *net_ns;
    };
    extern struct nsproxy init_nsproxy;

One nsproxy is shared by all tasks that share the same set of namespaces. As soon as any single namespace is cloned or unshared, the nsproxy itself is copied. Note that the user namespace does not appear in this structure. It is coupled to the other namespaces rather than stored here: as copy_namespaces() shows later, it is taken from the task's credentials and passed into the code that creates each of the other namespaces.
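The sharing rule in the kernel comment above can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: the toy_* types and functions are hypothetical names invented for illustration, they track a single namespace (UTS) instead of five, and they use a plain int where the kernel uses atomic_t.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a namespace object (not the kernel type). */
struct toy_uts_ns {
    int refs;              /* number of nsproxies pointing here */
    char nodename[65];
};

/* Hypothetical stand-in for struct nsproxy, reduced to one namespace. */
struct toy_nsproxy {
    int count;             /* number of tasks sharing this nsproxy */
    struct toy_uts_ns *uts_ns;
};

struct toy_task {
    struct toy_nsproxy *nsproxy;
};

/* No new namespace requested: the child shares the parent's nsproxy. */
static void toy_share(struct toy_task *child, const struct toy_task *parent)
{
    child->nsproxy = parent->nsproxy;
    child->nsproxy->count++;
}

/*
 * One namespace is unshared: the task gets a private nsproxy that points
 * to a copy of the old UTS namespace. Returns 0 on success, -1 on
 * allocation failure. The model never frees the old nsproxy; the kernel
 * frees it once its count drops to zero.
 */
static int toy_unshare_uts(struct toy_task *t)
{
    struct toy_nsproxy *old = t->nsproxy;
    struct toy_nsproxy *np = malloc(sizeof(*np));
    struct toy_uts_ns *uts = malloc(sizeof(*uts));

    if (!np || !uts) {
        free(np);
        free(uts);
        return -1;
    }
    *uts = *old->uts_ns;   /* start from the old namespace's contents */
    uts->refs = 1;
    np->count = 1;
    np->uts_ns = uts;
    old->count--;          /* this task no longer uses the shared nsproxy */
    t->nsproxy = np;
    return 0;
}
```

After toy_share() both tasks point at the same toy_nsproxy and its count is 2. After toy_unshare_uts() the child has its own nsproxy and its own copy of the namespace, so changing the child's nodename no longer affects the parent.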
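nsproxy.h also declares a number of operations on namespaces, including copy_namespaces:

    int copy_namespaces(unsigned long flags, struct task_struct *tsk);
    void exit_task_namespaces(struct task_struct *tsk);
    void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
    void free_nsproxy(struct nsproxy *ns);
    int unshare_nsproxy_namespaces(unsigned long, struct nsproxy **,
                                   struct fs_struct *);

The relationship between these structures is simple: each task_struct points to one nsproxy, and the nsproxy points to one instance of each namespace type.
[Figure: relationship between task_struct, nsproxy, and the individual namespaces]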
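Initialization of each namespace
Each namespace type has a statically defined initial instance (init_uts_ns, init_ipc_ns, and so on), and nsproxy likewise has an initial instance, init_nsproxy. These are data objects, not functions. init_nsproxy is attached to the first task when that task is set up, and it in turn points to the initial instance of each namespace, as follows.
In the INIT_TASK macro (init_task.h):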
    /*
     * INIT_TASK is used to set up the first task table, touch at
     * your own risk!. Base=0, limit=0x1fffff (=2MB)
     */
    #define INIT_TASK(tsk) \
    { \
        ...... \
        .nsproxy = &init_nsproxy, \
        ...... \
    }

Following init_nsproxy into nsproxy.c:
    struct nsproxy init_nsproxy = {
        .count = ATOMIC_INIT(1),
        .uts_ns = &init_uts_ns,
    #if defined(CONFIG_POSIX_MQUEUE) || defined(CONFIG_SYSVIPC)
        .ipc_ns = &init_ipc_ns,
    #endif
        .mnt_ns = NULL,
        .pid_ns_for_children = &init_pid_ns,
    #ifdef CONFIG_NET
        .net_ns = &init_net,
    #endif
    };

As you can see, init_nsproxy statically initializes the uts, ipc, pid, and net namespaces, but leaves the mount namespace as NULL (it is set up later during boot, once the root filesystem is mounted).
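Because every task ultimately descends from the one that starts with init_nsproxy, a process that never asks for a new namespace stays in the initial ones. From user space you can observe which namespace a process is in through the entries under /proc/self/ns: two processes are in the same namespace when the corresponding entries have the same inode number. The helper below is a minimal sketch; ns_inode is a name made up for this example, and it assumes a Linux system with /proc mounted.

```c
#include <stdio.h>
#include <sys/stat.h>

/*
 * Store the inode number of /proc/self/ns/<name> in *ino.
 * <name> is a namespace entry such as "uts", "ipc", "net", "pid" or "mnt".
 * Returns 0 on success, -1 if the entry cannot be inspected.
 */
static int ns_inode(const char *name, unsigned long *ino)
{
    char path[64];
    struct stat st;
    int n = snprintf(path, sizeof(path), "/proc/self/ns/%s", name);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;          /* name too long for the buffer */
    if (stat(path, &st) != 0)
        return -1;          /* no /proc, or no such namespace entry */
    *ino = (unsigned long)st.st_ino;
    return 0;
}
```

Calling ns_inode("uts", &a) in a parent and in a child created by a plain fork() should give the same value, since fork() passes none of the CLONE_NEW* flags.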
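Creating a new namespace
With initialization covered, let's look at how a new namespace is created. From the previous article we know this is done through clone. Inside the kernel, the fork, vfork, and clone system calls are all thin wrappers around the same function, do_fork():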
    #ifdef __ARCH_WANT_SYS_FORK
    SYSCALL_DEFINE0(fork)
    {
    #ifdef CONFIG_MMU
        return do_fork(SIGCHLD, 0, 0, NULL, NULL);
    #else
        /* can not support in nommu mode */
        return -EINVAL;
    #endif
    }
    #endif

    #ifdef __ARCH_WANT_SYS_VFORK
    SYSCALL_DEFINE0(vfork)
    {
        return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
                       0, NULL, NULL);
    }
    #endif

    #ifdef __ARCH_WANT_SYS_CLONE
    #ifdef CONFIG_CLONE_BACKWARDS
    SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
                    int __user *, parent_tidptr,
                    int, tls_val,
                    int __user *, child_tidptr)
    #elif defined(CONFIG_CLONE_BACKWARDS2)
    SYSCALL_DEFINE5(clone, unsigned long, newsp, unsigned long, clone_flags,
                    int __user *, parent_tidptr,
                    int __user *, child_tidptr,
                    int, tls_val)
    #elif defined(CONFIG_CLONE_BACKWARDS3)
    SYSCALL_DEFINE6(clone, unsigned long, clone_flags, unsigned long, newsp,
                    int, stack_size,
                    int __user *, parent_tidptr,
                    int __user *, child_tidptr,
                    int, tls_val)
    #else
    SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
                    int __user *, parent_tidptr,
                    int __user *, child_tidptr,
                    int, tls_val)
    #endif
    {
        return do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr);
    }
    #endif

As you can see, fork(), vfork(), and clone() all end up calling do_fork(); they differ only in the flags they pass:
    /*
     * Ok, this is the main fork-routine.
     *
     * It copies the process, and if successful kick-starts
     * it and waits for it to finish using the VM if required.
     */
    long do_fork(unsigned long clone_flags,
                 unsigned long stack_start,
                 unsigned long stack_size,
                 int __user *parent_tidptr,
                 int __user *child_tidptr)
    {
        // Pointer to the new process descriptor
        struct task_struct *p;
        int trace = 0;
        long nr;

        /*
         * Determine whether and which event to report to ptracer. When
         * called from kernel_thread or CLONE_UNTRACED is explicitly
         * requested, no event is reported; otherwise, report if the event
         * for the type of forking is enabled.
         */
        if (!(clone_flags & CLONE_UNTRACED)) {
            if (clone_flags & CLONE_VFORK)
                trace = PTRACE_EVENT_VFORK;
            else if ((clone_flags & CSIGNAL) != SIGCHLD)
                trace = PTRACE_EVENT_CLONE;
            else
                trace = PTRACE_EVENT_FORK;

            if (likely(!ptrace_event_enabled(current, trace)))
                trace = 0;
        }

        // Copy the process descriptor; the return value is a task_struct
        p = copy_process(clone_flags, stack_start, stack_size,
                         child_tidptr, NULL, trace);
        /*
         * Do this prior waking up the new thread - the thread pointer
         * might get invalid after that point, if the thread exits quickly.
         */
        if (!IS_ERR(p)) {
            struct completion vfork;
            struct pid *pid;

            trace_sched_process_fork(current, p);

            // Get the pid of the new process
            pid = get_task_pid(p, PIDTYPE_PID);
            nr = pid_vnr(pid);

            if (clone_flags & CLONE_PARENT_SETTID)
                put_user(nr, parent_tidptr);

            // For vfork(), set up the completion the parent will wait on
            if (clone_flags & CLONE_VFORK) {
                p->vfork_done = &vfork;
                init_completion(&vfork);
                get_task_struct(p);
            }

            // Put the new task on a run queue so the scheduler can run it
            wake_up_new_task(p);

            /* forking complete and child started to run, tell ptracer */
            if (unlikely(trace))
                ptrace_event_pid(trace, pid);

            // For vfork(), the parent sleeps until the child signals the completion
            if (clone_flags & CLONE_VFORK) {
                if (!wait_for_vfork_done(p, &vfork))
                    ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
            }

            put_pid(pid);
        } else {
            nr = PTR_ERR(p);
        }
        return nr;
    }

do_fork() first calls copy_process() to copy the parent's information into the child. If CLONE_VFORK is set, it initializes the completion used for vfork. It then calls wake_up_new_task() to hand the new task to the scheduler. Finally, in the vfork case only, the parent waits until the child has finished using the shared address space, which happens when the child calls exec or exits. On failure, copy_process() returns an error encoded in the pointer, and do_fork() returns that error code instead of a pid.
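do_fork() relies on the kernel convention of returning either a valid pointer or a small negative errno value encoded in the pointer, tested with IS_ERR() and decoded with PTR_ERR(). The following user-space sketch mimics that convention. The toy_* names are hypothetical, the fake "task" is just an int holding a made-up pid, and the only detail borrowed from the kernel is the 4095 upper bound on errno values.

```c
#include <errno.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095   /* same bound as the kernel's MAX_ERRNO */

/* Encode a negative errno value in a pointer. */
static inline void *toy_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* Decode the errno value stored in an error pointer. */
static inline long toy_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* True if p lies in the top 4095 addresses, i.e. it encodes an errno. */
static inline int toy_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

static int toy_task = 1234;  /* stands in for a task_struct; value is a fake pid */

/* Like copy_process(): a valid pointer on success, an error pointer on failure. */
static void *toy_copy_process(int simulate_failure)
{
    if (simulate_failure)
        return toy_err_ptr(-ENOMEM);
    return &toy_task;
}

/* Like the tail of do_fork(): return the pid, or the negative errno. */
static long toy_do_fork(int simulate_failure)
{
    void *p = toy_copy_process(simulate_failure);

    if (toy_is_err(p))
        return toy_ptr_err(p);
    return *(int *)p;
}
```

The benefit of this convention is that one return value carries both the result and the reason for failure, so callers such as do_fork() need no extra out-parameter.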
copy_process():
    static struct task_struct *copy_process(unsigned long clone_flags,
                                            unsigned long stack_start,
                                            unsigned long stack_size,
                                            int __user *child_tidptr,
                                            struct pid *pid,
                                            int trace)
    {
        int retval;
        // Pointer to the new process descriptor
        struct task_struct *p;

        // Validate the clone flags, e.g. CLONE_NEWNS and CLONE_FS are mutually exclusive
        if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
            return ERR_PTR(-EINVAL);

        if ((clone_flags & (CLONE_NEWUSER|CLONE_FS)) == (CLONE_NEWUSER|CLONE_FS))
            return ERR_PTR(-EINVAL);

        /*
         * Thread groups must share signals as well, and detached threads
         * can only be started up within the thread group.
         */
        if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
            return ERR_PTR(-EINVAL);

        /*
         * Shared signal handlers imply shared VM. By way of the above,
         * thread groups also imply shared VM. Blocking this case allows
         * for various simplifications in other code.
         */
        if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))
            return ERR_PTR(-EINVAL);

        /*
         * Siblings of global init remain as zombies on exit since they are
         * not reaped by their parent (swapper). To solve this and to avoid
         * multi-rooted process trees, prevent global and container-inits
         * from creating siblings.
         */
        // With CLONE_PARENT, reject the call if the current task is flagged
        // SIGNAL_UNKILLABLE (an init process), so that init cannot create siblings
        if ((clone_flags & CLONE_PARENT) &&
            current->signal->flags & SIGNAL_UNKILLABLE)
            return ERR_PTR(-EINVAL);

        /*
         * If the new process will be in a different pid or user namespace
         * do not allow it to share a thread group or signal handlers or
         * parent with the forking task.
         */
        if (clone_flags & CLONE_SIGHAND) {
            if ((clone_flags & (CLONE_NEWUSER | CLONE_NEWPID)) ||
                (task_active_pid_ns(current) !=
                 current->nsproxy->pid_ns_for_children))
                return ERR_PTR(-EINVAL);
        }

        retval = security_task_create(clone_flags);
        if (retval)
            goto fork_out;

        retval = -ENOMEM;
        // Duplicate the current task_struct
        p = dup_task_struct(current);
        if (!p)
            goto fork_out;

        ftrace_graph_init_task(p);

        rt_mutex_init_task(p);

    #ifdef CONFIG_PROVE_LOCKING
        DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
        DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
    #endif
        retval = -EAGAIN;
        // Check the per-user process limit (RLIMIT_NPROC)
        if (atomic_read(&p->real_cred->user->processes) >=
            task_rlimit(p, RLIMIT_NPROC)) {
            if (p->real_cred->user != INIT_USER &&
                !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
                goto bad_fork_free;
        }
        current->flags &= ~PF_NPROC_EXCEEDED;

        retval = copy_creds(p, clone_flags);
        if (retval < 0)
            goto bad_fork_free;

        /*
         * If multiple threads are within copy_process(), then this check
         * triggers too late. This doesn't hurt, the check is only there
         * to stop root fork bombs.
         */
        retval = -EAGAIN;
        // Check the system-wide limit max_threads, which is derived from the amount of memory
        if (nr_threads >= max_threads)
            goto bad_fork_cleanup_count;

        // ......

        // Initialize the I/O accounting counters
        task_io_accounting_init(&p->ioac);
        acct_clear_integrals(p);

        // Initialize the CPU timers
        posix_cpu_timers_init(p);

        // ......

        // Initialize the scheduler-related data, assign the task to a CPU,
        // and set its state to TASK_RUNNING
        /* Perform scheduler related setup. Assign this task to a CPU. */
        retval = sched_fork(clone_flags, p);
        if (retval)
            goto bad_fork_cleanup_policy;

        retval = perf_event_init_task(p);
        if (retval)
            goto bad_fork_cleanup_policy;
        retval = audit_alloc(p);
        if (retval)
            goto bad_fork_cleanup_perf;
        /* copy all the process information */
        // Copy all process information: open files, filesystem info,
        // signal handlers, signals, memory management, and so on
        shm_init_task(p);
        retval = copy_semundo(clone_flags, p);
        if (retval)
            goto bad_fork_cleanup_audit;
        retval = copy_files(clone_flags, p);
        if (retval)
            goto bad_fork_cleanup_semundo;
        retval = copy_fs(clone_flags, p);
        if (retval)
            goto bad_fork_cleanup_files;
        retval = copy_sighand(clone_flags, p);
        if (retval)
            goto bad_fork_cleanup_fs;
        retval = copy_signal(clone_flags, p);
        if (retval)
            goto bad_fork_cleanup_sighand;
        retval = copy_mm(clone_flags, p);
        if (retval)
            goto bad_fork_cleanup_signal;
        // !!! Copy the namespaces
        retval = copy_namespaces(clone_flags, p);
        if (retval)
            goto bad_fork_cleanup_mm;
        retval = copy_io(clone_flags, p);
        if (retval)
            goto bad_fork_cleanup_namespaces;
        // Set up the child's kernel stack and register state
        retval = copy_thread(clone_flags, stack_start, stack_size, p);
        if (retval)
            goto bad_fork_cleanup_io;

        // Allocate a new pid for the new process
        if (pid != &init_struct_pid) {
            pid = alloc_pid(p->nsproxy->pid_ns_for_children);
            if (IS_ERR(pid)) {
                retval = PTR_ERR(pid);
                goto bad_fork_cleanup_io;
            }
        }

        // ......

        // Return the new process p
        return p;
    }

For our purposes copy_process() has three main steps. First, it calls dup_task_struct() to duplicate the current process descriptor (task_struct) and allocate a new kernel stack for the new process. Second, it calls sched_fork() to initialize the scheduler-related data, assign the task to a CPU, and set its state to TASK_RUNNING. Third, among the other copy_*() calls, it calls copy_namespaces() to copy the namespaces. We focus on that last step, copy_namespaces():
    /*
     * called from clone.  This now handles copy for nsproxy and all
     * namespaces therein.
     */
    int copy_namespaces(unsigned long flags, struct task_struct *tsk)
    {
        struct nsproxy *old_ns = tsk->nsproxy;
        struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
        struct nsproxy *new_ns;

        if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
                              CLONE_NEWPID | CLONE_NEWNET)))) {
            get_nsproxy(old_ns);
            return 0;
        }

        if (!ns_capable(user_ns, CAP_SYS_ADMIN))
            return -EPERM;

        /*
         * CLONE_NEWIPC must detach from the undolist: after switching
         * to a new ipc namespace, the semaphore arrays from the old
         * namespace are unreachable.  In clone parlance, CLONE_SYSVSEM
         * means share undolist with parent, so we must forbid using
         * it along with CLONE_NEWIPC.
         */
        if ((flags & (CLONE_NEWIPC | CLONE_SYSVSEM)) ==
            (CLONE_NEWIPC | CLONE_SYSVSEM))
            return -EINVAL;

        new_ns = create_new_namespaces(flags, tsk, user_ns, tsk->fs);
        if (IS_ERR(new_ns))
            return PTR_ERR(new_ns);

        tsk->nsproxy = new_ns;
        return 0;
    }

If none of the CLONE_NEW* flags is set, copy_namespaces() simply takes another reference on the existing nsproxy and returns. Otherwise it checks for CAP_SYS_ADMIN and builds a "new" set of namespaces from the "old" one. The core of that work is create_new_namespaces():
    /*
     * Create new nsproxy and all of its the associated namespaces.
     * Return the newly created nsproxy.  Do not attach this to the task,
     * leave it to the caller to do proper locking and attach it to task.
     */
    static struct nsproxy *create_new_namespaces(unsigned long flags,
            struct task_struct *tsk, struct user_namespace *user_ns,
            struct fs_struct *new_fs)
    {
        struct nsproxy *new_nsp;
        int err;

        // Create the new nsproxy
        new_nsp = create_nsproxy();
        if (!new_nsp)
            return ERR_PTR(-ENOMEM);

        // Create the mnt namespace
        new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, user_ns, new_fs);
        if (IS_ERR(new_nsp->mnt_ns)) {
            err = PTR_ERR(new_nsp->mnt_ns);
            goto out_ns;
        }

        // Create the uts namespace
        new_nsp->uts_ns = copy_utsname(flags, user_ns, tsk->nsproxy->uts_ns);
        if (IS_ERR(new_nsp->uts_ns)) {
            err = PTR_ERR(new_nsp->uts_ns);
            goto out_uts;
        }

        // Create the ipc namespace
        new_nsp->ipc_ns = copy_ipcs(flags, user_ns, tsk->nsproxy->ipc_ns);
        if (IS_ERR(new_nsp->ipc_ns)) {
            err = PTR_ERR(new_nsp->ipc_ns);
            goto out_ipc;
        }

        // Create the pid namespace
        new_nsp->pid_ns_for_children =
            copy_pid_ns(flags, user_ns, tsk->nsproxy->pid_ns_for_children);
        if (IS_ERR(new_nsp->pid_ns_for_children)) {
            err = PTR_ERR(new_nsp->pid_ns_for_children);
            goto out_pid;
        }

        // Create the network namespace
        new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
        if (IS_ERR(new_nsp->net_ns)) {
            err = PTR_ERR(new_nsp->net_ns);
            goto out_net;
        }

        return new_nsp;

    // Error handling
    out_net:
        if (new_nsp->pid_ns_for_children)
            put_pid_ns(new_nsp->pid_ns_for_children);
    out_pid:
        if (new_nsp->ipc_ns)
            put_ipc_ns(new_nsp->ipc_ns);
    out_ipc:
        if (new_nsp->uts_ns)
            put_uts_ns(new_nsp->uts_ns);
    out_uts:
        if (new_nsp->mnt_ns)
            put_mnt_ns(new_nsp->mnt_ns);
    out_ns:
        kmem_cache_free(nsproxy_cachep, new_nsp);
        return ERR_PTR(err);
    }

create_new_namespaces() first calls create_nsproxy() to allocate the nsproxy structure, then calls copy_mnt_ns(), copy_utsname(), copy_ipcs(), copy_pid_ns(), and copy_net_ns() in turn to set up the mnt, uts, ipc, pid, and net namespaces. Each of these receives the clone flags along with the old namespace. If a later step fails, the labels at the bottom release what the earlier steps acquired, in reverse order.
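The chain of out_* labels at the end of create_new_namespaces() is a common C idiom: acquire resources in order and, on failure, jump to the label that releases only what has already been acquired. The sketch below reproduces the control flow in user space. It is an illustration under assumptions, not kernel code: toy_create_all and toy_put are hypothetical names, no real resources are involved, and the "release" just appends a name to a log string so the order can be inspected.

```c
#include <string.h>

/* Record that a resource was released by appending its name to log. */
static void toy_put(const char *name, char *log, size_t n)
{
    if (log[0] != '\0')
        strncat(log, ",", n - strlen(log) - 1);
    strncat(log, name, n - strlen(log) - 1);
}

/*
 * Pretend to create mnt, uts, ipc, pid and net (steps 0..4) in that order.
 * fail_at is the step that should fail, or -1 for success.
 * log must point to a buffer of n > 0 bytes; on failure it lists what was
 * released, in release order. Returns 0 on success, -1 on failure.
 */
static int toy_create_all(int fail_at, char *log, size_t n)
{
    log[0] = '\0';

    if (fail_at == 0)
        goto out_ns;    /* mnt failed: only the nsproxy itself exists */
    if (fail_at == 1)
        goto out_uts;   /* uts failed: mnt must be released */
    if (fail_at == 2)
        goto out_ipc;   /* ipc failed: uts and mnt must be released */
    if (fail_at == 3)
        goto out_pid;
    if (fail_at == 4)
        goto out_net;
    return 0;

out_net:
    toy_put("pid", log, n);
out_pid:
    toy_put("ipc", log, n);
out_ipc:
    toy_put("uts", log, n);
out_uts:
    toy_put("mnt", log, n);
out_ns:
    toy_put("nsproxy", log, n);
    return -1;
}
```

With fail_at set to 2 the log reads "uts,mnt,nsproxy": the labels fall through from the point of failure, so each earlier resource is released exactly once and in the reverse of the order in which it was acquired.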
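We will not go into the individual copy functions here. At this point we have walked the whole path: from the creation of the child process, through the initialization of its filesystem, CPU, and memory-management state, to the creation of each namespace.
[Figure: code flow of namespace creation]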
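As a recap of the flag handling seen in copy_process() and copy_namespaces(), the purely flag-based checks can be collected into one small user-space function. This is a sketch under assumptions: the toy_* names are invented, the CLONE_* constants come from the Linux header linux/sched.h, and the checks that depend on kernel state (the SIGNAL_UNKILLABLE test, the pid_ns_for_children comparison, and the CAP_SYS_ADMIN check) are deliberately left out.

```c
#include <errno.h>
#include <linux/sched.h>   /* CLONE_* flag values */

/*
 * Return 0 if the flag combination passes the flag-only checks shown in
 * copy_process() and copy_namespaces(), or -EINVAL if it is rejected.
 */
static int toy_check_clone_flags(unsigned long flags)
{
    /* A new mount or user namespace cannot share fs_struct. */
    if ((flags & (CLONE_NEWNS | CLONE_FS)) == (CLONE_NEWNS | CLONE_FS))
        return -EINVAL;
    if ((flags & (CLONE_NEWUSER | CLONE_FS)) == (CLONE_NEWUSER | CLONE_FS))
        return -EINVAL;

    /* Threads must share signal handlers; shared handlers need a shared VM. */
    if ((flags & CLONE_THREAD) && !(flags & CLONE_SIGHAND))
        return -EINVAL;
    if ((flags & CLONE_SIGHAND) && !(flags & CLONE_VM))
        return -EINVAL;

    /* A new pid or user namespace cannot share signal handlers. */
    if ((flags & CLONE_SIGHAND) && (flags & (CLONE_NEWUSER | CLONE_NEWPID)))
        return -EINVAL;

    /* A new ipc namespace cannot share the semaphore undo list. */
    if ((flags & (CLONE_NEWIPC | CLONE_SYSVSEM)) ==
        (CLONE_NEWIPC | CLONE_SYSVSEM))
        return -EINVAL;

    return 0;
}

/* Nonzero if copy_namespaces() would leave its fast path and build a new nsproxy. */
static int toy_needs_new_nsproxy(unsigned long flags)
{
    return (flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
                     CLONE_NEWPID | CLONE_NEWNET)) != 0;
}
```

Note that CLONE_NEWUSER is absent from the mask in toy_needs_new_nsproxy(), mirroring the kernel code: the user namespace is not stored in nsproxy, so requesting only a new user namespace does not by itself cause the nsproxy to be copied.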
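The full flowchart and further details (including how each individual namespace is created) are available on my WeChat public account, where the reading experience is also better.
PS: if you are interested in cloud computing, you can follow my WeChat public account, aCloudDeveloper, which focuses on cloud computing topics.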