Docker Daemon Crashed - Troubleshooting

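At first, for reasons I didn't understand, I managed to crash Docker, and running systemctl restart docker over and over didn't help.

Thinking it over, though, all I had just done was docker build and docker run. Could a broken container be what was taking the Docker daemon down?

So I started working through the fix below.

Here are a few excerpts from the logs.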

Terminal 1

$ docker ps
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

$ systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: inactive (dead) (Result: exit-code) since Tue 2019-12-24 13:40:18 CST; 2min 8s ago
     Docs: https://docs.docker.com
  Process: 21233 ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited, status=1/FAILURE)
 Main PID: 21233 (code=exited, status=1/FAILURE)
    Tasks: 0
   Memory: 0B
   CGroup: /system.slice/docker.service

Dec 24 13:40:18 tgfc-220 systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Dec 24 13:40:18 tgfc-220 systemd[1]: Failed to start Docker Application Container Engine.
Dec 24 13:40:18 tgfc-220 systemd[1]: Unit docker.service entered failed state.
Dec 24 13:40:18 tgfc-220 systemd[1]: docker.service failed.
Dec 24 13:40:18 tgfc-220 systemd[1]: Stopped Docker Application Container Engine.
Dec 24 13:42:24 tgfc-220 systemd[1]: Dependency failed for Docker Application Container Engine.
Dec 24 13:42:24 tgfc-220 systemd[1]: Job docker.service/start failed with result 'dependency'.
# The information above does not seem very useful

$ journalctl -f
...()...
Dec 24 13:52:20 tgfc-220 systemd[1]: Failed unmounting /var.
Dec 24 13:52:20 tgfc-220 systemd[1]: Failed unmounting /var.
Dec 24 13:52:20 tgfc-220 umount[15579]: umount: /home: target is busy.
Dec 24 13:52:20 tgfc-220 umount[15579]: (In some cases useful info about processes that use
Dec 24 13:52:20 tgfc-220 umount[15579]: the device is found by lsof(8) or fuser(1))
Dec 24 13:52:20 tgfc-220 systemd[1]: Failed unmounting /home.
Dec 24 13:52:20 tgfc-220 umount[15582]: umount: /home: target is busy.
Dec 24 13:52:20 tgfc-220 umount[15582]: (In some cases useful info about processes that use
Dec 24 13:52:20 tgfc-220 umount[15582]: the device is found by lsof(8) or fuser(1))
...()...

Terminal 2

$ systemctl start docker
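
Following the whole journal with journalctl -f pulls in every unit on the machine, so the Docker-related lines are easy to miss. Here is a minimal sketch of narrowing things down, assuming a systemd host where the unit is named docker.service (as in the status output above). failure_lines is a hypothetical helper name made up for this post, not part of any tool.

```shell
# Hypothetical helper: keep only lines that look like failures.
# It reads log text on stdin, so it works on live output or on a saved file.
failure_lines() {
    grep -Ei 'fail|busy|error'
}

# Example usage (commented out because it needs a systemd host):
#   journalctl -u docker.service --no-pager -n 200 | failure_lines
#   journalctl -b --no-pager | failure_lines
```

The pattern is deliberately crude; widen or narrow it depending on how noisy the log is.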

So, what exactly did I do?

Shortly before the crash, I built an image from the following Dockerfile.

FROM centos:centos7

ENV TZ Asia/Taipei
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

ENV PY_VERSION=3.7.4
RUN set -ex && \
    yum install -y wget tar libffi-devel zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gcc make initscripts && \
    wget https://www.python.org/ftp/python/${PY_VERSION}/Python-${PY_VERSION}.tgz && \
    tar -zxvf Python-${PY_VERSION}.tgz && \
    cd Python-${PY_VERSION} && \
    ./configure prefix=/usr/local/python3 && \
    make && \
    make install && \
    make clean && \
    rm -rf /Python-${PY_VERSION}* && \
    yum install -y epel-release && \
    yum install -y python-pip && \
    yum clean all

ENV PATH "$PATH:/usr/local/python3/bin"

RUN pip3 install --upgrade pip

RUN mkdir /etc/app

COPY sys.conf /etc/app
COPY daemon.conf /etc/app

# This is where things went wrong!!
RUN sudo groupadd docker
RUN adduser --system app
RUN sudo usermod -aG docker app
RUN mkdir -p /home/app /var/app /var/log/app

# .... the rest ... (omitted) ...

CMD ["systemctl", "restart", "appd.service"]

Then, naturally, I ran it:

$ docker run -d \
   --restart=always \
   --name=monitor_site \
   --hostname=app-site-monitoring \
   --privileged=true \
   monitor_site /usr/sbin/init
# ...and then the Docker daemon crashed

### It could still be restarted, but it crashed again right after each restart
$ systemctl restart docker

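That held until I tweaked the Dockerfile above, built another image, and ran docker run again; after that, Docker was completely dead.

That is roughly how the whole incident played out.

So, how do I bring it back?

Thinking about it, if a broken container is what takes the Docker daemon down, then simply removing that container should do it!!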

$ docker rm --force monitor_site
# Damn... oh right, Docker is down, so the command can't be used

After some Googling, I found that containers are stored under /var/lib/docker.

$ ls -l /var/lib/docker/containers
drwx------. 4 root root 237 Dec 24 13:52 27edb86d856b422956434cc80c885ac5c64a598e3a5b222fffc0fafe046a0da0
drwx------. 4 root root 237 Dec 24 13:52 4e2d8099f383ee38c85e6970c00c85d50ff8f9d086f32552caa1661d1d8eb752
drwx------. 4 root root 237 Dec 24 13:52 5555ff882b9456cd8c11573f3b7d83149231629ce5df5b42f2e48870b3c60e63
drwx------. 4 root root 237 Dec 24 13:52 6e1a346731b2e1ffb524ba934f0a0a49bd4adc077e8e439aff7af6cf2d708d07
# Then go into each directory one by one to find the problematic container
# Removing that whole directory is enough. (To be safe, though, mv it somewhere else first rather than deleting it)

### Suppose 6e1a346731b2e1ffb524ba934f0a0a49bd4adc077e8e439aff7af6cf2d708d07 is the problematic one
$ cd /var/lib/docker/containers
$ mv ./6e1a346731b2e1ffb524ba934f0a0a49bd4adc077e8e439aff7af6cf2d708d07 /root/.

$ systemctl restart docker
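
Opening each directory by hand works, but the metadata files can be searched instead. The sketch below assumes the layout I know from Docker on Linux: each container directory holds a config.v2.json with the name stored as "Name":"/<name>", and a hostconfig.json with the restart policy, both written as compact JSON. The function names are made up for this post. Containers started with --restart=always are worth checking first, because the daemon brings them back up when it starts, which would fit the crash-right-after-restart behavior described earlier.

```shell
# Print the directory of the container whose recorded name is $2.
# $1: containers root (on a default install, /var/lib/docker/containers)
# $2: container name as passed to `docker run --name`
find_container_dir() {
    root=$1
    name=$2
    for cfg in "$root"/*/config.v2.json; do
        [ -f "$cfg" ] || continue
        # Assumption: the name is stored with a leading slash, compact JSON.
        if grep -Fq "\"Name\":\"/$name\"" "$cfg"; then
            dirname "$cfg"
        fi
    done
}

# Print the directories of containers whose restart policy is "always".
# $1: containers root
always_restart_dirs() {
    root=$1
    for hc in "$root"/*/hostconfig.json; do
        [ -f "$hc" ] || continue
        # Assumption: hostconfig.json contains "RestartPolicy":{"Name":"always",...}
        if grep -Fq '"RestartPolicy":{"Name":"always"' "$hc"; then
            dirname "$hc"
        fi
    done
}

# Example usage (run as root, since the directories are mode 700):
#   find_container_dir /var/lib/docker/containers monitor_site
#   always_restart_dirs /var/lib/docker/containers
```

If jq is installed, parsing the JSON properly is more robust than matching fixed strings.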

And with that, the Docker daemon was back!! Hallelujah!!

Notes

Before removing a container, make sure its contents were mapped out to the host with -v and that you have a backup... otherwise, you're on your own...
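
One cheap safeguard along these lines is to archive the container's directory before touching it. This is a sketch with plain tar; backup_container_dir and the paths in the usage line are illustrative only. Keep in mind that, as far as I know, this directory holds the container's metadata and logs rather than the files written inside the container, so it does not replace mapping data out with -v.

```shell
# Archive a container directory before moving or deleting it.
# $1: container directory (e.g. /var/lib/docker/containers/<id>)
# $2: destination directory for the archive (created if missing)
# Prints the path of the archive on success.
backup_container_dir() {
    src=$1
    dest=$2
    id=$(basename "$src")
    mkdir -p "$dest" &&
    tar -czf "$dest/$id.tar.gz" -C "$(dirname "$src")" "$id" &&
    echo "$dest/$id.tar.gz"
}

# Example usage (hypothetical destination path):
#   backup_container_dir /var/lib/docker/containers/<id> /root/container-backup
```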
