Installing TensorFlow 1.7.0 and Facenet with Docker on Ubuntu and Enabling the GPU Runtime

2021-12-28

爱搜啊

At the time of writing, only Linux let Docker containers use the GPU (through the NVIDIA container runtime), so Linux it is.

Why this approach

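TensorFlow is very sensitive to dependency versions, but a conda environment can hold only one version of each package; juggling several conda environments on one system is awkward (it would probably mean writing shell scripts, which I'm not familiar with).
Packages I may need later would need environments of their own, which could conflict with TensorFlow's.
TensorFlow 1.7.0 is a very old release and the software built for it (such as its CUDA and cuDNN versions) is long outdated, so forcing an install on a current system is likely to hit compatibility problems.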

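Docker sidesteps these problems:

Several images and containers can exist side by side, so multiple environments coexist.
Each container is isolated, so environments cannot conflict.
Compatibility is handled by the official TensorFlow image, which ships matching Python, CUDA and cuDNN versions; the host only needs an NVIDIA driver recent enough for that CUDA version.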

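Installation on the host

Quick steps:

1. Install Docker

See the Docker installation commands for Ubuntu. GPU access also requires the NVIDIA driver on the host and the NVIDIA container runtime (nvidia-docker2), which provides the --runtime=nvidia option used in step 3.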

2. Pull the image

docker pull tensorflow/tensorflow:1.7.0-gpu-py3

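3. Start a container (-v mounts the host folder /home/vision/undergraduate/ at /mnt, -p maps host ports 10022 and 18888 to container ports 22 and 8888, --dns sets the DNS server, and --privileged gives the container extended privileges on the host)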

docker container run -it --runtime=nvidia \
-v /home/vision/undergraduate/:/mnt \
-p 10022:22 \
-p 18888:8888 \
--dns 8.8.8.8 \
--privileged \
tensorflow/tensorflow:1.7.0-gpu-py3 bash
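The run command above relies on --runtime=nvidia, which only exists when the NVIDIA container runtime (nvidia-docker2) is registered with Docker; on Docker 19.03 and later with the NVIDIA Container Toolkit, --gpus all is the newer way to request GPUs. The sketch below, assuming a POSIX shell, looks for nvidia in the Runtimes line of docker info, uses --runtime=nvidia if it is there and falls back to --gpus all otherwise, then runs the same command. The names gpu_flag, run_tf_container and DRY_RUN are hypothetical, not part of Docker.

```shell
# Hypothetical helper, not part of Docker or the original post.
# Prints the GPU flag: --runtime=nvidia if Docker lists an "nvidia" runtime
# (nvidia-docker2), otherwise --gpus all (Docker 19.03+ with the NVIDIA
# Container Toolkit).
gpu_flag() {
    if docker info 2>/dev/null | grep -i '^ *Runtimes:' | grep -qw nvidia; then
        echo "--runtime=nvidia"
    else
        echo "--gpus all"
    fi
}

# Starts the container with the same options as the command above.
# Set DRY_RUN=1 to print the command instead of running it.
run_tf_container() {
    image="${1:-tensorflow/tensorflow:1.7.0-gpu-py3}"
    # $(gpu_flag) is unquoted on purpose so "--gpus all" becomes two words
    set -- docker container run -it $(gpu_flag) \
        -v /home/vision/undergraduate/:/mnt \
        -p 10022:22 \
        -p 18888:8888 \
        --dns 8.8.8.8 \
        --privileged \
        "$image" bash
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}
```

DRY_RUN=1 run_tf_container prints the full command without starting anything, which is a cheap way to see which GPU flag was chosen.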

More on common Docker operations: https://tf.wiki/zh_hans/appendix/docker.html

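To run commands inside the container, open a shell in it (docker exec also accepts a unique prefix of the container ID, such as f62279378b31, or the container name):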

docker exec -it f62279378b31a21c77240851a62db5d5d4c7125cb41faa43e7dac0f576e7192a bash
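Copying the 64-character ID by hand is error-prone. A minimal sketch, assuming exactly one running container was started from the TF image: docker ps -q --filter ancestor=IMAGE lists the IDs of running containers created from that image, and the hypothetical container_id function refuses to guess when it finds zero or several.

```shell
# Hypothetical helper: print the ID of the single running container that was
# started from the given image (default: the TF 1.7.0 GPU image).
container_id() {
    image="${1:-tensorflow/tensorflow:1.7.0-gpu-py3}"
    # -q prints only IDs; ancestor= matches containers created from the image
    ids=$(docker ps -q --filter "ancestor=$image")
    # count non-empty lines (0 when docker printed nothing)
    count=$(printf '%s\n' "$ids" | grep -c .)
    if [ "$count" -ne 1 ]; then
        echo "expected exactly one running container for $image, found $count" >&2
        return 1
    fi
    printf '%s\n' "$ids"
}
```

Usage: docker exec -it "$(container_id)" bash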

Letting VS Code manage Docker

VS Code needs to manage Docker without sudo, so follow Docker's post-installation steps for Linux: https://docs.docker.com/engine/install/linux-postinstall/

sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
sudo reboot   # reboot so the new group membership takes effect everywhere
docker run hello-world
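Until newgrp has been run in a shell or the session has been restarted, that shell is not yet in the docker group. A small check, assuming a POSIX shell; in_group is a hypothetical helper built on id -nG, which lists the group names of the current process:

```shell
# Hypothetical check: succeeds if the current process is in the given group
# (default "docker"), i.e. docker commands should work without sudo.
in_group() {
    group="${1:-docker}"
    # one group name per line, then look for an exact match
    id -nG | tr ' ' '\n' | grep -qx "$group"
}

if in_group docker; then
    echo "docker group active: sudo is no longer needed"
else
    echo "not in the docker group yet: log out and back in, or run newgrp docker"
fi
```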

Running the code

The image comes with the environment already set up, so code can run directly.

nvidia-smi

shows the GPU status; if the GPU table appears inside the container, GPU passthrough is working.
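nvidia-smi also has a query mode that prints selected fields as CSV, which is easier to read from a script than the full table. A sketch, assuming a POSIX shell; the gpu_memory function is hypothetical, while the query fields name, memory.free and memory.total and the csv,noheader,nounits format are standard nvidia-smi options:

```shell
# Hypothetical helper: print "name: free/total MiB free" for each GPU.
gpu_memory() {
    nvidia-smi --query-gpu=name,memory.free,memory.total \
               --format=csv,noheader,nounits |
    while IFS=',' read -r name free total; do
        # fields after a comma start with a space; strip it
        printf '%s: %s/%s MiB free\n' "$name" "${free# }" "${total# }"
    done
}
```

To check that TensorFlow itself (not just the driver) sees the GPU, python -c "import tensorflow as tf; print(tf.test.is_gpu_available())" inside the container should print True.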

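I use a bind mount here (-v in the run command), so the container can read and write the host folder directly and files are shared between the two.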
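With -v /home/vision/undergraduate/:/mnt, a file at /home/vision/undergraduate/X on the host appears at /mnt/X in the container. A small sketch of that mapping, handy when passing host paths to scripts that run in the container; host_to_container, HOST_DIR and CONTAINER_DIR are hypothetical names whose values are taken from the run command:

```shell
# Values from the -v option of the run command
HOST_DIR=/home/vision/undergraduate
CONTAINER_DIR=/mnt

# Hypothetical helper: translate a host path into the path seen in the container.
host_to_container() {
    case "$1" in
        "$HOST_DIR"|"$HOST_DIR"/*)
            # drop the host prefix and put the container prefix in front
            printf '%s%s\n' "$CONTAINER_DIR" "${1#"$HOST_DIR"}" ;;
        *)
            echo "not under $HOST_DIR: $1" >&2
            return 1 ;;
    esac
}
```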

Before running the Facenet code, the container needs a few small changes:

apt update                  # if this hangs, try a system-wide proxy
apt install vim             # vim is not installed by default :(
pip install opencv-python
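The commands above can also be run non-interactively, for example when setting up a fresh container. A sketch assuming it runs as root inside the container: apt-get with -y answers the confirmation prompt, and the hypothetical DRY_RUN switch only prints the commands. The image uses Python 3.5 (see the h5py path in the log below); if pip tries to build OpenCV from source there, pinning an older opencv-python release that still ships Python 3.5 wheels may help, but that is not verified here.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical one-shot version of the setup commands above; stops at the
# first failing step.
setup_container() {
    run apt-get update || return 1
    run apt-get install -y vim || return 1
    run pip install opencv-python
}
```

Usage: DRY_RUN=1 setup_container to preview, then setup_container to apply.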

Running the code then fails with: ImportError: libGL.so.1: cannot open shared object file: No such file or directory

Fix: the opencv-python package needs the system OpenGL library libGL.so.1, which the image does not include, so install it:

apt install libgl1-mesa-glx
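The error means the dynamic linker cannot find libGL.so.1. A quick way to confirm this before and after installing the package, assuming ldconfig is available (it usually is on Ubuntu): ldconfig -p prints the linker cache, and the hypothetical has_shared_lib function looks for an exact library name in it. Installing opencv-python-headless instead of opencv-python is another common way around the error, since the headless build does not depend on libGL.

```shell
# Hypothetical check: succeeds if the dynamic linker cache lists the library.
# ldconfig -p lines look like: "<tab>libGL.so.1 (libc6,x86-64) => /path/..."
has_shared_lib() {
    ldconfig -p 2>/dev/null |
        awk -v lib="$1" '$1 == lib { found = 1 } END { exit !found }'
}

if has_shared_lib libGL.so.1; then
    echo "libGL.so.1 found"
else
    echo "libGL.so.1 missing: apt install libgl1-mesa-glx"
fi
```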

After that, the code runs normally:

root@f62279378b31:/mnt/asep/2-vipiden/facenet/src# python compare.py 20180408-102900 /mnt/asep/photo-op/facecompare/face1.png /mnt/asep/photo-op/facecompare/face2.png
/usr/local/lib/python3.5/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Creating networks and loading parameters
2021-01-23 12:22:37.069440: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2021-01-23 12:22:37.446821: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-23 12:22:37.447106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties: 
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.725
pciBusID: 0000:03:00.0
totalMemory: 7.79GiB freeMemory: 7.70GiB
2021-01-23 12:22:37.447125: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2021-01-23 12:22:37.849823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-23 12:22:37.849883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2021-01-23 12:22:37.849897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2021-01-23 12:22:37.850003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7982 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:03:00.0, compute capability: 7.5)
2021-01-23 12:22:37.850882: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 7.79G (8370061312 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2021-01-23 12:22:39.624162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2021-01-23 12:22:39.624230: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-23 12:22:39.624272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2021-01-23 12:22:39.624305: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2021-01-23 12:22:39.624400: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7982 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:03:00.0, compute capability: 7.5)
Model directory: 20180408-102900
Metagraph file: model-20180408-102900.meta
Checkpoint file: model-20180408-102900.ckpt-90
Images:
0: /mnt/asep/photo-op/facecompare/face1.png
1: /mnt/asep/photo-op/facecompare/face2.png
Distance matrix
        0         1     
0    0.0000    0.6990  
1    0.6990    0.0000
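The distance matrix is pairwise: compare.py computes an embedding for each aligned face and prints the Euclidean (L2) distance between every pair, so the diagonal is 0 (each image against itself) and 0.6990 is the distance between face1 and face2. A smaller distance means the faces are more alike; the threshold for deciding "same person" is not covered in this post. The formula, d = sqrt(sum over i of (a_i - b_i)^2), sketched in awk on toy vectors (l2_distance is a hypothetical helper; real embeddings have hundreds of dimensions):

```shell
# Hypothetical helper: Euclidean distance between two comma-separated vectors.
l2_distance() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, ","); m = split(b, y, ",")
        if (n != m) { print "dimension mismatch" > "/dev/stderr"; exit 1 }
        s = 0
        for (i = 1; i <= n; i++) { d = x[i] - y[i]; s += d * d }
        printf "%.4f\n", sqrt(s)
    }'
}
```

For example, l2_distance 0,0 3,4 prints 5.0000.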

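Fixing file permissions

By default the container runs as root, so files it writes into the mounted folder end up owned by root, and the normal user on the host cannot easily modify them.

Reference: https://blog.csdn.net/easylife206/article/details/103750309

The simple fix is to create a user and group inside the container with the same UID and GID as on TrueNAS. The Dockerfile snippet: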

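# Build-time variables: UID/GID matching the owner on TrueNAS
# (a Dockerfile comment must start its own line; a trailing # is not a comment)
ARG USER_ID=1004
ARG GROUP_ID=1003
# Standard Linux commands: create the group, then the user with a home directory
RUN groupadd -g ${GROUP_ID} asep
RUN useradd -m -u ${USER_ID} -g asep asep
RUN usermod -aG sudo asep
# Everything after this line runs as asep
USER asep
# Default working directory, also where a bash shell in the container starts
WORKDIR /home/asep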
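Instead of hard-coding 1004/1003, the IDs can be passed at build time, since docker build --build-arg overrides an ARG default. A sketch assuming GNU stat (as on Ubuntu) and that the mounted folder is owned by the user the container should act as; build_with_owner_ids and the facenet-gpu tag are hypothetical:

```shell
# Hypothetical build helper: read the numeric owner and group of the shared
# folder and pass them as USER_ID / GROUP_ID to the Dockerfile ARGs.
build_with_owner_ids() {
    dir="${1:-/home/vision/undergraduate}"
    tag="${2:-facenet-gpu}"   # example tag, pick your own
    # stat -c is GNU coreutils syntax
    uid=$(stat -c %u "$dir") || return 1
    gid=$(stat -c %g "$dir") || return 1
    docker build \
        --build-arg USER_ID="$uid" \
        --build-arg GROUP_ID="$gid" \
        -t "$tag" .
}
```

Run it from the directory that contains the Dockerfile.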

The complete Dockerfile

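# Image tags are case-sensitive: the tag is 1.7.0-gpu-py3
FROM tensorflow/tensorflow:1.7.0-gpu-py3

# Update and install in one RUN so a cached, stale package index is never
# reused; -y skips the confirmation prompt during the build
RUN apt-get update && apt-get install -y \
    vim \
    libgl1-mesa-glx \
    net-tools \
    wget
RUN pip install --upgrade pip
RUN pip install opencv-python
RUN pip install django==2.2
EXPOSE 8000

# Fix garbled Chinese text
ENV LANG C.UTF-8

# Copy the cuDNN files (needs root, so do it before switching users)
COPY ./cuda/include/cudnn.h /usr/local/cuda/include/
COPY ./cuda/lib64/libcudnn* /usr/local/cuda/lib64/
RUN chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*

# Fix permissions: a user and group with the same UID/GID as on TrueNAS
ARG USER_ID=1004
ARG GROUP_ID=1003
RUN groupadd -g ${GROUP_ID} asep
RUN useradd -m -u ${USER_ID} -g asep asep
RUN usermod -aG sudo asep
# Switch users last, so the container runs as asep rather than root
USER asep
WORKDIR /home/asep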

The cuda folder unpacked from the cuDNN archive must sit next to the Dockerfile, since the COPY lines read from ./cuda.
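A missing or misplaced cuda folder only shows up as a COPY failure partway through the build. A pre-build check, assuming the layout the COPY lines expect (cuda/include/cudnn.h and at least one cuda/lib64/libcudnn* file); check_cudnn_dir is a hypothetical helper:

```shell
# Hypothetical pre-build check, run from the Dockerfile's directory
# (or pass that directory as the first argument).
check_cudnn_dir() {
    root="${1:-.}"
    if [ ! -f "$root/cuda/include/cudnn.h" ]; then
        echo "missing $root/cuda/include/cudnn.h" >&2
        return 1
    fi
    for f in "$root"/cuda/lib64/libcudnn*; do
        # if the glob matched nothing, f is the literal pattern and -e fails
        [ -e "$f" ] && return 0
    done
    echo "no libcudnn* files in $root/cuda/lib64" >&2
    return 1
}
```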

Build the image with docker build (facenet-gpu is only an example tag), then start a container from that image with docker container run:

docker build -t facenet-gpu .

docker container run -it --runtime=nvidia \
-v /home/vision/undergraduate/:/mnt \
-p 10022:22 \
-p 18888:8888 \
--dns 8.8.8.8 \
--privileged \
facenet-gpu bash

Retraining the model:

docker container run -it --runtime=nvidia \
-v /home/vision/undergraduate/:/mnt \
-p 10022:22 \
-p 18888:8888 \
--dns 8.8.8.8 \
--privileged \
tensorflow/tensorflow:1.7.0-gpu-py3 bash

This Dockerfile is also on GitHub: https://github.com/Vision0220/FaceNet-Docker

The Docker image is also on Docker Hub: https://hub.docker.com/r/vision20/facenet-gpu
