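As we know, under dockerd's current architecture, starting a container involves three processes: containerd, containerd-shim, and the container process (that is, the container's main process). How do these three processes depend on one another? This analysis covers that question.
Note that the commands in different shell snippets were not run in one continuous session, so process IDs may differ between snippets.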

Overall relationship

First, let's look at the relationship between containerd, containerd-shim, and the container process:

root 2156 1733 0 13:17 pts/0 00:00:00 ./bin/containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --shim /home/fankang/docker/containerd-0.2.4/src/github.com/docker/containerd/bin/containerd-shim --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --runtime docker-runc
root 2198 2156 0 13:45 pts/0 00:00:00 /home/fankang/docker/containerd-0.2.4/src/github.com/docker/containerd/bin/containerd-shim nginx /home/fankang/mycontainer runc
root 2214 2198 0 13:45 ? 00:00:00 /usr/bin/python /usr/bin/supervisord

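As shown, containerd is the parent of containerd-shim, and containerd-shim is the parent of the container process.
After the containerd process is killed, containerd-shim and the container process are still there; containerd-shim simply becomes an orphan and is adopted by process 1: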

root 2301 1 0 13:50 pts/0 00:00:00 /home/fankang/docker/containerd-0.2.4/src/github.com/docker/containerd/bin/containerd-shim nginx /home/fankang/mycontainer runc
root 2317 2301 1 13:50 ? 00:00:00 /usr/bin/python /usr/bin/supervisord

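So, to simplify the relationship among the three processes, we analyze the following four cases:

  1. containerd is running, and the containerd-shim process is killed;
  2. containerd is running, and the container process is killed;
  3. containerd is not running, the containerd-shim process is killed, and then containerd is started;
  4. containerd is not running, the container process is killed, and then containerd is started.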

Case 1

Case 1: containerd is running, and the containerd-shim process is killed
With containerd running, containerd-shim and the container process look like this:

root 2414 2383 0 14:02 pts/0 00:00:00 /home/fankang/docker/containerd-0.2.4/src/github.com/docker/containerd/bin/containerd-shim nginx /home/fankang/mycontainer runc
root 2429 2414 1 14:02 ? 00:00:00 /usr/bin/python /usr/bin/supervisord

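Now kill the containerd-shim process with kill -9 2414.
The result: the container process exits. That is, with containerd running, killing containerd-shim causes the container process to exit.

So let's see why the container process exits.
As analyzed previously, creating a container calls the container's Start() method, defined in containerd/runtime/container.go: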

func (c *container) Start(checkpointPath string, s Stdio) (Process, error) {
//***processRoot: /var/run/docker/libcontainerd/containerd/mynginx/init***//
processRoot := filepath.Join(c.root, c.id, InitProcessID)
if err := os.Mkdir(processRoot, 0755); err != nil {
return nil, err
}
//***Build cmd; the binary it invokes is containerd-shim***//
//***docker-containerd-shim nginx /home/fankang/mycontainer runc***//
cmd := exec.Command(c.shim,
c.id, c.bundle, c.runtime,
)
cmd.Dir = processRoot
cmd.SysProcAttr = &syscall.SysProcAttr{
Setpgid: true,
}
//***Read config.json from the bundle directory***//
spec, err := c.readSpec()
if err != nil {
return nil, err
}
//***InitProcessID = "init"***//
config := &processConfig{
checkpoint: checkpointPath,
root: processRoot,
id: InitProcessID,
c: c,
stdio: s,
spec: spec,
processSpec: specs.ProcessSpec(spec.Process),
}
//***Create the process***//
p, err := newProcess(config)
if err != nil {
return nil, err
}
//***Run cmd***//
if err := c.createCmd(InitProcessID, cmd, p); err != nil {
return nil, err
}
return p, nil
}

Start() in turn calls createCmd() to run the command:

func (c *container) createCmd(pid string, cmd *exec.Cmd, p *process) error {
p.cmd = cmd
//***Run cmd***//
if err := cmd.Start(); err != nil {
close(p.cmdDoneCh)
if exErr, ok := err.(*exec.Error); ok {
if exErr.Err == exec.ErrNotFound || exErr.Err == os.ErrNotExist {
return fmt.Errorf("%s not installed on system", c.shim)
}
}
return err
}
// We need the pid file to have been written to run
//***Runs in a defer***//
defer func() {
//***Start a goroutine that waits for the shim to exit***//
go func() {
//***Wait for cmd to finish***//
err := p.cmd.Wait()
if err == nil {
p.cmdSuccess = true
}
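//***Reached when ctr kill is called or the shim process is killed directly; this is the handling done when the shim exits***//
//***Compare the process start time seen by the system with the one recorded in memory to check whether it is the same process***//
//***On a normal exit the process no longer exists on the Linux system, so its start time there is empty***//
//***On an abnormal exit, e.g. kill -9 on the shim, the process still exists on the Linux system, so same is true***//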
if same, err := p.isSameProcess(); same && p.pid > 0 {
// The process changed its PR_SET_PDEATHSIG, so force
// kill it
logrus.Infof("containerd: %s:%s (pid %v) has become an orphan, killing it", p.container.id, p.id, p.pid)
err = unix.Kill(p.pid, syscall.SIGKILL)
if err != nil && err != syscall.ESRCH {
logrus.Errorf("containerd: unable to SIGKILL %s:%s (pid %v): %v", p.container.id, p.id, p.pid, err)
} else {
for {
err = unix.Kill(p.pid, 0)
if err != nil {
break
}
time.Sleep(5 * time.Millisecond)
}
}
}
close(p.cmdDoneCh)
}()
}()
//***Wait for creation to complete***//
if err := c.waitForCreate(p, cmd); err != nil {
return err
}
c.processes[pid] = p
return nil
}

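As shown, after starting the process, createCmd() launches a goroutine from a defer. If containerd-shim exits abnormally, the blocking cmd.Wait() call returns, and if the container process still exists, unix.Kill(p.pid, syscall.SIGKILL) is executed to kill it.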

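So, with containerd running, when the containerd-shim process is killed manually, the container process is killed by the goroutine that containerd left behind when it created the container.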
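The wait-then-kill pattern in createCmd() can be reduced to a small sketch. Everything here is hypothetical: shimWatcher and its function fields stand in for cmd.Wait(), isSameProcess() and unix.Kill(), so the control flow can be run without real processes. It illustrates the idea only and is not containerd's code.

```go
package main

import (
	"errors"
	"fmt"
)

// shimWatcher is a hypothetical stand-in for the bookkeeping in createCmd().
// The function fields replace cmd.Wait(), isSameProcess() and unix.Kill().
type shimWatcher struct {
	waitShim       func() error // blocks until the shim exits
	containerAlive func() bool  // true if the container process is still running
	killContainer  func() error // force-kills the container process
	done           chan struct{}
}

// watch starts a goroutine that waits for the shim and, if the container
// outlived it, kills the container. done is closed when the goroutine ends.
func (w *shimWatcher) watch() {
	go func() {
		defer close(w.done)
		_ = w.waitShim() // returns once the shim is gone, for any reason
		if w.containerAlive() {
			_ = w.killContainer()
		}
	}()
}

func main() {
	killed := false
	w := &shimWatcher{
		waitShim:       func() error { return errors.New("signal: killed") },
		containerAlive: func() bool { return true },
		killContainer:  func() error { killed = true; return nil },
		done:           make(chan struct{}),
	}
	w.watch()
	<-w.done
	fmt.Println("container killed:", killed)
}
```

Closing done after the kill plays the same role as close(p.cmdDoneCh): anyone waiting on the channel knows the shim's exit has been fully handled.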

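Case 2

Case 2: containerd is running, and the container process is killed
On one hand, when the container process exits, containerd-shim also catches the signal and exits; this is analyzed in detail under case 4.
On the other hand, when the container process exits, the monitor in containerd catches the event and triggers the container exit flow; this is what this subsection analyzes in detail.
As analyzed previously, the monitor puts the container exit event into its exits channel, in containerd/supervisor/monitor_linux.go: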

func (m *Monitor) start() {
var events [128]syscall.EpollEvent
for {
//***EpollWait() collects the events that have already occurred among those epoll is monitoring***//
n, err := archutils.EpollWait(m.epollFd, events[:], -1)
if err != nil {
if err == syscall.EINTR {
continue
}
logrus.WithField("error", err).Fatal("containerd: epoll wait")
}
// process events
for i := 0; i < n; i++ {
fd := int(events[i].Fd)
m.m.Lock()
r := m.receivers[fd]
switch t := r.(type) {
//***Process type***//
case runtime.Process:
if events[i].Events == syscall.EPOLLHUP {
delete(m.receivers, fd)
if err = syscall.EpollCtl(m.epollFd, syscall.EPOLL_CTL_DEL, fd, &syscall.EpollEvent{
Events: syscall.EPOLLHUP,
Fd: int32(fd),
}); err != nil {
logrus.WithField("error", err).Error("containerd: epoll remove fd")
}
if err := t.Close(); err != nil {
logrus.WithField("error", err).Error("containerd: close process IO")
}
EpollFdCounter.Dec(1)
//***Put it into the exits channel***//
m.exits <- t
}
//***OOM-killed***//
case runtime.OOM:
// always flush the event fd
t.Flush()
if t.Removed() {
delete(m.receivers, fd)
// epoll will remove the fd from its set after it has been closed
t.Close()
EpollFdCounter.Dec(1)
} else {
//***Put it into the ooms channel***//
m.ooms <- t.ContainerID()
}
}
m.m.Unlock()
}
}
}

When containerd's supervisor starts, it launches exitHandler(), in containerd/supervisor/supervisor.go:

func New(stateDir string, runtimeName, shimName string, runtimeArgs []string, timeout time.Duration, retainCount int) (*Supervisor, error) {
startTasks := make(chan *startTask, 10)
if err := os.MkdirAll(stateDir, 0755); err != nil {
return nil, err
}
machine, err := CollectMachineInformation()
if err != nil {
return nil, err
}
monitor, err := NewMonitor()
if err != nil {
return nil, err
}
s := &Supervisor{
stateDir: stateDir,
containers: make(map[string]*containerInfo),
startTasks: startTasks,
machine: machine,
subscribers: make(map[chan Event]struct{}),
tasks: make(chan Task, defaultBufferSize),
monitor: monitor,
runtime: runtimeName,
runtimeArgs: runtimeArgs,
shim: shimName,
timeout: timeout,
}
//***Set up the event log***//
if err := setupEventLog(s, retainCount); err != nil {
return nil, err
}
go s.exitHandler()
go s.oomHandler()
if err := s.restore(); err != nil {
return nil, err
}
return s, nil
}
func (s *Supervisor) exitHandler() {
for p := range s.monitor.Exits() {
e := &ExitTask{
Process: p,
}
s.SendTask(e)
}
}

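As shown, exitHandler() consumes events from the monitor's exits channel, wraps each one in an ExitTask, and sends it to the supervisor's tasks channel for further processing.
So the exit of the container process triggers containerd's exit handling for the container. Exit handling in turn invokes delete handling; we won't go into the details here.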
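The hand-off from the exits channel to the tasks channel is an ordinary channel pipeline. The following is a minimal sketch, assuming string process IDs in place of runtime.Process; exitTask, forwardExits and collect are hypothetical names used only for illustration.

```go
package main

import "fmt"

// exitTask is a simplified, hypothetical version of containerd's ExitTask.
type exitTask struct {
	processID string
}

// forwardExits drains exits, wraps every process ID in an exitTask and
// pushes it onto tasks. It closes tasks once exits has been closed.
func forwardExits(exits <-chan string, tasks chan<- exitTask) {
	for id := range exits {
		tasks <- exitTask{processID: id}
	}
	close(tasks)
}

// collect runs the pipeline over a fixed list of exit events and returns
// the tasks in the order they arrived.
func collect(ids []string) []exitTask {
	exits := make(chan string)
	tasks := make(chan exitTask)
	go forwardExits(exits, tasks)
	go func() {
		for _, id := range ids {
			exits <- id
		}
		close(exits)
	}()
	var out []exitTask
	for t := range tasks {
		out = append(out, t)
	}
	return out
}

func main() {
	for _, t := range collect([]string{"nginx:init", "nginx:exec-1"}) {
		fmt.Println("exit task for", t.processID)
	}
}
```

Because a single goroutine forwards the events, tasks arrive in the same order in which the monitor reported the exits.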

So, with containerd running, when the container process is killed, containerd-shim exits on its own and containerd raises an exit event to clean up the container.

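Case 3

Case 3: containerd is not running, the containerd-shim process is killed, and then containerd is started
The container is now running and containerd has been shut down. The processes are as follows: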

root 2522 1 0 15:33 pts/0 00:00:00 /home/fankang/docker/containerd-0.2.4/src/github.com/docker/containerd/bin/containerd-shim nginx /home/fankang/mycontainer runc
root 2537 2522 0 15:33 ? 00:00:00 /usr/bin/python /usr/bin/supervisord

Now run kill -9 2522 to kill process 2522. The container process is still there; it has become an orphan and been adopted by process 1.

root 2537 1 0 15:33 ? 00:00:00 /usr/bin/python /usr/bin/supervisord
root 2571 2537 0 15:33 ? 00:00:00 /usr/sbin/sshd -D

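Start containerd, and the container process disappears.

So on startup, containerd cleans up leftover container processes (those whose containerd-shim no longer exists).

How does this cleanup work? When the supervisor starts, it calls restore(), which is defined in containerd/supervisor/supervisor.go: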

func (s *Supervisor) restore() error {
dirs, err := ioutil.ReadDir(s.stateDir)
if err != nil {
return err
}
for _, d := range dirs {
if !d.IsDir() {
continue
}
id := d.Name()
container, err := runtime.Load(s.stateDir, id, s.shim, s.timeout)
if err != nil {
return err
}
processes, err := container.Processes()
if err != nil {
return err
}
ContainersCounter.Inc(1)
s.containers[id] = &containerInfo{
container: container,
}
if err := s.monitor.MonitorOOM(container); err != nil && err != runtime.ErrContainerExited {
logrus.WithField("error", err).Error("containerd: notify OOM events")
}
logrus.WithField("id", id).Debug("containerd: container restored")
var exitedProcesses []runtime.Process
for _, p := range processes {
if p.State() == runtime.Running {
if err := s.monitorProcess(p); err != nil {
return err
}
} else {
exitedProcesses = append(exitedProcesses, p)
}
}
if len(exitedProcesses) > 0 {
// sort processes so that init is fired last because that is how the kernel sends the
// exit events
sortProcesses(exitedProcesses)
for _, p := range exitedProcesses {
e := &ExitTask{
Process: p,
}
s.SendTask(e)
}
}
}
return nil
}
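
The second half of restore() splits a container's processes into running and exited ones, then fires exit events with init last. The body of sortProcesses() is not shown in this post, so the ordering below is only an assumption based on its comment; proc and splitProcesses are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// proc is a hypothetical, minimal process record.
type proc struct {
	id      string
	running bool
}

// splitProcesses separates running processes from exited ones and orders
// the exited ones so that "init" comes last.
func splitProcesses(ps []proc) (running, exited []proc) {
	for _, p := range ps {
		if p.running {
			running = append(running, p)
		} else {
			exited = append(exited, p)
		}
	}
	sort.SliceStable(exited, func(i, j int) bool {
		// Anything that is not init sorts before init.
		return exited[i].id != "init" && exited[j].id == "init"
	})
	return running, exited
}

func main() {
	running, exited := splitProcesses([]proc{
		{id: "init", running: false},
		{id: "exec-1", running: false},
		{id: "exec-2", running: true},
	})
	fmt.Println("monitor:", running)
	fmt.Println("exit events, in order:", exited)
}
```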

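restore() reads each container directory under containerd's state directory and calls runtime.Load() to load the container. If a process of the container is not in the Running state, an exit event is triggered for it.
So the key is how a container is loaded. runtime.Load() is defined in containerd/runtime/container.go: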

// Load return a new container from the matchin state file on disk.
func Load(root, id, shimName string, timeout time.Duration) (Container, error) {
var s state
//***StateFile = "state.json"***//
f, err := os.Open(filepath.Join(root, id, StateFile))
if err != nil {
return nil, err
}
defer f.Close()
if err := json.NewDecoder(f).Decode(&s); err != nil {
return nil, err
}
c := &container{
root: root,
id: id,
bundle: s.Bundle,
labels: s.Labels,
runtime: s.Runtime,
runtimeArgs: s.RuntimeArgs,
shim: s.Shim,
noPivotRoot: s.NoPivotRoot,
processes: make(map[string]*process),
timeout: timeout,
}
if c.shim == "" {
c.shim = shimName
}
dirs, err := ioutil.ReadDir(filepath.Join(root, id))
if err != nil {
return nil, err
}
//***One directory per process***//
for _, d := range dirs {
if !d.IsDir() {
continue
}
pid := d.Name()
s, err := readProcessState(filepath.Join(root, id, pid))
if err != nil {
return nil, err
}
p, err := loadProcess(filepath.Join(root, id, pid), pid, c, s)
if err != nil {
logrus.WithField("id", id).WithField("pid", pid).Debug("containerd: error loading process %s", err)
continue
}
c.processes[pid] = p
}
return c, nil
}

Load() loads the processes under the container directory via loadProcess(), which is defined in containerd/runtime/process.go:

//***Restore the process from process.json***//
func loadProcess(root, id string, c *container, s *ProcessState) (*process, error) {
p := &process{
root: root,
id: id,
container: c,
spec: s.ProcessSpec,
stdio: Stdio{
Stdin: s.Stdin,
Stdout: s.Stdout,
Stderr: s.Stderr,
},
state: Stopped,
}
startTime, err := ioutil.ReadFile(filepath.Join(p.root, StartTimeFile))
if err != nil && !os.IsNotExist(err) {
return nil, err
}
p.startTime = string(startTime)
if _, err := p.getPidFromFile(); err != nil {
return nil, err
}
//***The ExitStatus() call here reaches p.updateExitStatusFile(128 + uint32(syscall.SIGKILL)) in handleSigkilledShim()***//
//***i.e. it writes data into exit***//
//***When ExitStatus() is later called in exit.go, the data in exit can be read back***//
if _, err := p.ExitStatus(); err != nil {
if err == ErrProcessNotExited {
exit, err := getExitPipe(filepath.Join(root, ExitFile))
if err != nil {
return nil, err
}
p.exitPipe = exit
control, err := getControlPipe(filepath.Join(root, ControlFile))
if err != nil {
return nil, err
}
p.controlPipe = control
p.state = Running
return p, nil
}
return nil, err
}
return p, nil
}

The most important call in loadProcess() is p.ExitStatus(). If it fails with ErrProcessNotExited, the process state is set to Running. So let's look at ExitStatus():

//***Use the exit status file to determine whether the process has exited***//
func (p *process) ExitStatus() (rst uint32, rerr error) {
data, err := ioutil.ReadFile(filepath.Join(p.root, ExitStatusFile))
defer func() {
if rerr != nil {
rst, rerr = p.handleSigkilledShim(rst, rerr)
}
}()
if err != nil {
if os.IsNotExist(err) {
return UnknownStatus, ErrProcessNotExited
}
return UnknownStatus, err
}
if len(data) == 0 {
return UnknownStatus, ErrProcessNotExited
}
p.stateLock.Lock()
p.state = Stopped
p.stateLock.Unlock()
i, err := strconv.ParseUint(string(data), 10, 32)
return uint32(i), err
}
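
ExitStatus() combines named results with a defer, which lets the deferred function rewrite the values after the return statement has set them. The pattern in isolation can be sketched as follows; recoverStatus is a hypothetical stand-in for handleSigkilledShim(), and the file content is passed in as a string instead of being read from disk.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

var errNotExited = errors.New("process has not exited")

// recoverStatus is a hypothetical stand-in for handleSigkilledShim(): it
// turns the "not exited" error into a concrete exit status.
func recoverStatus(status uint32, err error) (uint32, error) {
	if err == errNotExited {
		return 137, nil // 128 + SIGKILL (9)
	}
	return status, err
}

// exitStatus mimics the shape of ExitStatus(): the deferred function sees
// the values assigned by return and may overwrite them before the caller
// receives them.
func exitStatus(fileContent string) (rst uint32, rerr error) {
	defer func() {
		if rerr != nil {
			rst, rerr = recoverStatus(rst, rerr)
		}
	}()
	if fileContent == "" {
		return 255, errNotExited
	}
	i, err := strconv.ParseUint(fileContent, 10, 32)
	return uint32(i), err
}

func main() {
	fmt.Println(exitStatus(""))  // the defer replaces (255, errNotExited)
	fmt.Println(exitStatus("0")) // no error, so the defer leaves the result alone
}
```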

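ExitStatus() reads the exit status file (ExitStatusFile). At this point the file holds no data, so the call fails. The named return values of ExitStatus() are worth noting: rerr first receives the error from the main flow of ExitStatus(); the defer then hands rerr to handleSigkilledShim(), and whatever error handleSigkilledShim() returns becomes the final rerr. Control now passes to handleSigkilledShim():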

func (p *process) handleSigkilledShim(rst uint32, rerr error) (uint32, error) {
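if p.cmd == nil || p.cmd.Process == nil {
//***Send signal 0 to the container process***//
e := unix.Kill(p.pid, 0)
//***On the second pass the container process no longer exists; ESRCH means the process or process group specified by pid does not exist***//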
if e == syscall.ESRCH {
logrus.Warnf("containerd: %s:%s (pid %d) does not exist", p.container.id, p.id, p.pid)
// The process died while containerd was down (probably of
// SIGKILL, but no way to be sure)
return p.updateExitStatusFile(UnknownStatus)
}
// If it's not the same process, just mark it stopped and set
// the status to the UnknownStatus value (i.e. 255)
if same, err := p.isSameProcess(); !same {
logrus.Warnf("containerd: %s:%s (pid %d) is not the same process anymore (%v)", p.container.id, p.id, p.pid, err)
// Create the file so we get the exit event generated once monitor kicks in
// without having to go through all this process again
return p.updateExitStatusFile(UnknownStatus)
}
ppid, err := readProcStatField(p.pid, 4)
if err != nil {
return rst, fmt.Errorf("could not check process ppid: %v (%v)", err, rerr)
}
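//***If the container process's parent is 1, the shim that guards the container exited unexpectedly***//
if ppid == "1" {
logrus.Warnf("containerd: %s:%s shim died, killing associated process", p.container.id, p.id)
//***This is where the container process actually gets killed***//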
unix.Kill(p.pid, syscall.SIGKILL)
if err != nil && err != syscall.ESRCH {
return UnknownStatus, fmt.Errorf("containerd: unable to SIGKILL %s:%s (pid %v): %v", p.container.id, p.id, p.pid, err)
}
// wait for the process to die
for {
e := unix.Kill(p.pid, 0)
if e == syscall.ESRCH {
break
}
time.Sleep(5 * time.Millisecond)
}
// Create the file so we get the exit event generated once monitor kicks in
// without having to go through all this process again
return p.updateExitStatusFile(128 + uint32(syscall.SIGKILL))
}
return rst, rerr
}
// Possible that the shim was SIGKILLED
e := unix.Kill(p.cmd.Process.Pid, 0)
if e != syscall.ESRCH {
return rst, rerr
}
// Ensure we got the shim ProcessState
<-p.cmdDoneCh
shimStatus := p.cmd.ProcessState.Sys().(syscall.WaitStatus)
if shimStatus.Signaled() && shimStatus.Signal() == syscall.SIGKILL {
logrus.Debugf("containerd: ExitStatus(container: %s, process: %s): shim was SIGKILL'ed reaping its child with pid %d", p.container.id, p.id, p.pid)
rerr = nil
rst = 128 + uint32(shimStatus.Signal())
p.stateLock.Lock()
p.state = Stopped
p.stateLock.Unlock()
}
return rst, rerr
}

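The flow of the if p.cmd == nil || p.cmd.Process == nil branch of handleSigkilledShim() is as follows:

  1. If the container process does not exist, record UnknownStatus via updateExitStatusFile() and return;
  2. If the container process is no longer the same process, record UnknownStatus so that the monitor handles it, and return;
  3. If the container process's parent is process 1, the shim has exited: kill the container process, call updateExitStatusFile() to write the exit status, and return;
  4. Otherwise, return the original result unchanged.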
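
Step 3 depends on readProcStatField(p.pid, 4); field 4 of /proc/<pid>/stat is the parent PID. Its implementation is not shown in this post, so the parser below is only a sketch of how such a field could be extracted. statField is a hypothetical name, and the sample line is made up to resemble the orphaned container process from case 3.

```go
package main

import (
	"fmt"
	"strings"
)

// statField returns the n-th (1-based) field of a /proc/<pid>/stat line.
// The command name (field 2) may contain spaces or parentheses, so the
// line is split around the last ')' rather than on whitespace alone.
func statField(line string, n int) (string, error) {
	open := strings.IndexByte(line, '(')
	end := strings.LastIndexByte(line, ')')
	if open < 0 || end < open {
		return "", fmt.Errorf("malformed stat line")
	}
	fields := []string{strings.TrimSpace(line[:open]), line[open+1 : end]}
	fields = append(fields, strings.Fields(line[end+1:])...)
	if n < 1 || n > len(fields) {
		return "", fmt.Errorf("field %d out of range", n)
	}
	return fields[n-1], nil
}

func main() {
	// Hypothetical sample: pid 2537, command supervisord, state S, ppid 1.
	line := "2537 (supervisord) S 1 2537 2537 0 -1 4194560"
	ppid, err := statField(line, 4)
	fmt.Println(ppid, err)
}
```

A parent PID of "1" is what tells handleSigkilledShim() that the shim is gone and the container process has been re-parented.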

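Following the flow analyzed here, handleSigkilledShim() reaches step 3. Because rerr in ExitStatus() receives the return value of handleSigkilledShim(), rerr is nil, so the process state is not Running.

So the supervisor's restore() performs the exit operation on this container.

The exit operation also calls ExitStatus(), but by then exit has content. It also enters the handleSigkilledShim() flow, but returns at step 1, because the container process was already killed earlier in the flow.

If both containerd-shim and the container process of the container exist, the function returns from step 4.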
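
The existence checks in handleSigkilledShim() all use signal 0, which asks the kernel to run its existence and permission checks without delivering a signal. Here is a minimal sketch of that probe, assuming a Unix-like system and using syscall.Kill from the standard library rather than the unix.Kill call in the original code; processExists and selfAlive are hypothetical helper names.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// processExists reports whether a process with the given pid exists.
// Signal 0 delivers nothing; only the error result is of interest.
func processExists(pid int) bool {
	err := syscall.Kill(pid, 0)
	// EPERM means the process exists but belongs to another user.
	return err == nil || err == syscall.EPERM
}

// selfAlive probes the current process, which must exist.
func selfAlive() bool {
	return processExists(os.Getpid())
}

func main() {
	fmt.Println("current process exists:", selfAlive())
}
```

An ESRCH error is the case that handleSigkilledShim() treats as "the container process is gone".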

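Case 4

Case 4: containerd is not running, the container process is killed, and then containerd is started
When the container process is killed, the containerd-shim process exits on its own. containerd then performs the exit operation on the container in restore().

Here is a demo showing how Go starts processes with the exec package: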

package main
import (
"os"
"os/signal"
"os/exec"
"syscall"
)
func main() {
signals := make(chan os.Signal, 2048)
signal.Notify(signals)
cmd1 := exec.Command("/bin/sh", "-c", "sleep 50")
cmd1.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
cmd1.Start()
cmd2 := exec.Command("/bin/sh", "-c", "sleep 50")
cmd2.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
cmd2.Start()
select {
case <-signals:
syscall.Kill(-cmd1.Process.Pid, syscall.SIGKILL)
syscall.Kill(-cmd2.Process.Pid, syscall.SIGKILL)
}
}

Compiling and running it gives:

root 5838 1733 0 17:16 pts/0 00:00:00 ./test
root 5843 5838 0 17:16 pts/0 00:00:00 /bin/sh -c sleep 50
root 5844 5838 0 17:16 pts/0 00:00:00 /bin/sh -c sleep 50

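After running kill 5843, none of the processes remain.
So in this demo the exit of a child process brings down the parent. This is not a Go default: signal.Notify(signals) with no signal arguments relays every incoming signal, including the SIGCHLD delivered when a child terminates, so the select returns, both process groups are killed, and main exits.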

End of analysis.