[AI] Integrating an NVIDIA GPU with Docker on Ubuntu 22

Goal

Install the latest NVIDIA driver on Ubuntu 22, then install the CUDA version that matches it.

Use the TensorFlow Jupyter image to confirm that the GPU is visible.

The steps below were used to fix the following errors.

Error messages

2024-03-31 12:27:30.432902: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2024-03-31 12:27:30.432953: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:134] retrieving CUDA diagnostic information for host: 03c8b10def1c
2024-03-31 12:27:30.432958: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:141] hostname: 03c8b10def1c
2024-03-31 12:27:30.433014: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:165] libcuda reported version is: 545.23.6
2024-03-31 12:27:30.433028: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:169] kernel reported version is: 535.161.7
2024-03-31 12:27:30.433032: E external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:251] kernel version 535.161.7 does not match DSO version 545.23.6 -- cannot find working devices in this configuration
load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
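The log shows the cause. Inside the container (host 03c8b10def1c), libcuda reports 545.23.6, but the host's kernel module reports 535.161.7. The attempted forward-compatibility fallback is also not supported on this GPU.

The sketch below pulls those two versions out of such a log and flags the mismatch. It assumes the exact message wording shown above. find_driver_mismatch is a hypothetical helper and is not part of TensorFlow.

```python
import re

# Wording of the two diagnostic lines printed by XLA's cuda_diagnostics.cc (see log above)
LIBCUDA_RE = re.compile(r"libcuda reported version is: ([\d.]+)")
KERNEL_RE = re.compile(r"kernel reported version is: ([\d.]+)")


def find_driver_mismatch(log_text):
    """Return (libcuda_version, kernel_version) when they differ, else None.

    None is also returned when either diagnostic line is missing.
    """
    lib = LIBCUDA_RE.search(log_text)
    kern = KERNEL_RE.search(log_text)
    if lib is None or kern is None:
        return None
    lib_version, kernel_version = lib.group(1), kern.group(1)
    return (lib_version, kernel_version) if lib_version != kernel_version else None


log = """
... cuda_diagnostics.cc:165] libcuda reported version is: 545.23.6
... cuda_diagnostics.cc:169] kernel reported version is: 535.161.7
"""
print(find_driver_mismatch(log))  # ('545.23.6', '535.161.7')
```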

 

Removing the existing drivers

Remove the NVIDIA driver

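sudo apt-get remove --purge '^nvidia-.*'
sudo ubuntu-drivers autoinstall
sudo apt-get update

Note that ubuntu-drivers autoinstall reinstalls Ubuntu's recommended driver right after the purge. Because the 550 driver is installed from NVIDIA's repository below, this line can be skipped.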

 

Remove CUDA and cuDNN

sudo apt-get --purge remove "*cuda*"
sudo apt-get --purge remove "*cudnn*"

 

Remove the NVIDIA container toolkit and Docker

sudo apt-get remove --purge nvidia-docker2 nvidia-container-toolkit docker docker-engine docker.io containerd runc
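You can check that the three removal steps above left nothing behind by matching the installed package names against the same patterns. The list of names can come from, for example, dpkg -l. Library packages such as libnvidia-compute-* do not match ^nvidia-.* and may need sudo apt-get autoremove.

The sketch below is a minimal version of that check. leftover_packages is a hypothetical helper.

```python
import fnmatch
import re

# Patterns from the removal commands above: '^nvidia-.*' is a regex, the rest are globs/names
REGEX_PATTERNS = [r"^nvidia-.*"]
GLOB_PATTERNS = [
    "*cuda*", "*cudnn*",
    "nvidia-docker2", "nvidia-container-toolkit",
    "docker", "docker-engine", "docker.io", "containerd", "runc",
]


def leftover_packages(installed):
    """Return the installed package names that any removal pattern still matches."""
    return [
        name for name in installed
        if any(re.match(p, name) for p in REGEX_PATTERNS)
        or any(fnmatch.fnmatchcase(name, g) for g in GLOB_PATTERNS)
    ]


installed = ["nvidia-driver-535", "libcudnn8", "cuda-toolkit-12-2", "python3", "docker.io"]
print(leftover_packages(installed))
# ['nvidia-driver-535', 'libcudnn8', 'cuda-toolkit-12-2', 'docker.io']
```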

 

Reinstalling the driver

https://developer.nvidia.com/cuda-toolkit-archive

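The driver must support the CUDA version you use. Each driver branch supports CUDA only up to a certain version, which nvidia-smi shows as "CUDA Version". Because the NVIDIA driver therefore has to be matched to CUDA, install the driver as well by following the guide on the page above.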
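In practice, the requirement is that the driver's supported CUDA version be at least the CUDA version in use. The two versions do not have to be identical. For example, the nvidia/cuda:11.8.0 check later in this post runs fine on a driver that reports CUDA 12.4.

The sketch below expresses that rule as a version comparison. It is conservative because it ignores CUDA minor-version compatibility and the forward-compatibility packages. parse_version and driver_supports are hypothetical helpers.

```python
def parse_version(text):
    """Turn a version string such as "12.4" into a comparable tuple (12, 4)."""
    return tuple(int(part) for part in text.split("."))


def driver_supports(driver_cuda_version, required_cuda_version):
    """True if the driver's CUDA version (nvidia-smi "CUDA Version") is at
    least the CUDA version the toolkit or container image needs."""
    return parse_version(driver_cuda_version) >= parse_version(required_cuda_version)


print(driver_supports("12.4", "11.8"))  # True: the nvidia/cuda:11.8.0 image used later
print(driver_supports("12.2", "12.3"))  # False: driver too old for a CUDA 12.3 build
```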

https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4
sudo apt-get install -y nvidia-driver-550-open
sudo apt-get install -y cuda-drivers-550
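The package names above encode the versions: the CUDA toolkit version is written with dashes, and the driver branch appears as a number. The sketch below generates the same commands for another version. It assumes NVIDIA's repository keeps this naming for other releases, so confirm the exact names on the download page first. install_commands is a hypothetical helper.

```python
def install_commands(cuda_version, driver_branch):
    """Return the apt-get lines above for a CUDA toolkit version such as "12.4"
    and a driver branch such as "550" (open kernel module flavor)."""
    toolkit = "cuda-toolkit-" + cuda_version.replace(".", "-")  # 12.4 -> cuda-toolkit-12-4
    return [
        "sudo apt-get update",
        f"sudo apt-get -y install {toolkit}",
        f"sudo apt-get install -y nvidia-driver-{driver_branch}-open",
        f"sudo apt-get install -y cuda-drivers-{driver_branch}",
    ]


for line in install_commands("12.4", "550"):
    print(line)
```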

 

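Installing Docker and the NVIDIA container toolkit

If Docker Engine itself is no longer installed after the removal step above, install it first (for example, sudo apt-get install -y docker.io).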

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
sudo usermod -aG docker $USER
newgrp docker
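The distribution variable in the first line is ID and VERSION_ID from /etc/os-release joined together, which gives ubuntu22.04 on Ubuntu 22.04. That string then becomes part of the repository list URL.

The sketch below reproduces the shell expression. It takes the file contents as a string, so you can pass it open("/etc/os-release").read(). distribution_id is a hypothetical helper.

```python
import shlex


def distribution_id(os_release_text):
    """Reproduce `. /etc/os-release; echo $ID$VERSION_ID` for the given file contents."""
    values = {}
    for line in os_release_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        parts = shlex.split(raw)  # strips shell quoting, e.g. "22.04" -> 22.04
        values[key] = parts[0] if parts else ""
    return values.get("ID", "") + values.get("VERSION_ID", "")


sample = 'NAME="Ubuntu"\nVERSION_ID="22.04"\nID=ubuntu\nID_LIKE=debian\n'
dist = distribution_id(sample)
print(dist)  # ubuntu22.04
print(f"https://nvidia.github.io/nvidia-docker/{dist}/nvidia-docker.list")
```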

 

Docker Configuration Result

$ docker info | grep nvidia
 Runtimes: io.containerd.runc.v2 nvidia runc

The output should list nvidia among the runtimes, as shown above.
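The same check can be scripted by reading the Runtimes line of docker info. The text can come from subprocess.run(["docker", "info"], capture_output=True, text=True).stdout.

The sketch below is a minimal version. It assumes the "Runtimes:" line format shown above. docker_runtimes is a hypothetical helper.

```python
def docker_runtimes(docker_info_text):
    """Return the runtime names on the "Runtimes:" line of `docker info` output."""
    for line in docker_info_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Runtimes:"):
            return stripped[len("Runtimes:"):].split()
    return []


output = " Runtimes: io.containerd.runc.v2 nvidia runc\n"
runtimes = docker_runtimes(output)
print("nvidia" in runtimes)  # True
```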

Editing the Docker daemon configuration file as follows is recommended. It registers the nvidia runtime and makes it the default.

/etc/docker/daemon.json

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime":"nvidia"
}
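Overwriting daemon.json by hand discards any settings already in it. The sketch below instead merges the two settings shown above into the existing file.

Writing /etc/docker/daemon.json requires root. Docker must also be restarted, or the machine rebooted as in the next step, for the change to take effect. enable_nvidia_default_runtime is a hypothetical helper.

```python
import json
from pathlib import Path


def enable_nvidia_default_runtime(path):
    """Register the nvidia runtime and make it the default in daemon.json,
    keeping whatever other settings the file already has."""
    path = Path(path)
    text = path.read_text() if path.exists() else ""
    config = json.loads(text) if text.strip() else {}
    config.setdefault("runtimes", {})["nvidia"] = {
        "path": "nvidia-container-runtime",
        "runtimeArgs": [],
    }
    config["default-runtime"] = "nvidia"
    path.write_text(json.dumps(config, indent=4) + "\n")
    return config


# For the real file (needs root):
# enable_nvidia_default_runtime("/etc/docker/daemon.json")
```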

 

Reboot

sudo reboot

 

Verifying GPU access from Docker

sudo docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
[sudo] password for steven: 
Unable to find image 'nvidia/cuda:11.8.0-base-ubuntu22.04' locally
11.8.0-base-ubuntu22.04: Pulling from nvidia/cuda
aece8493d397: Already exists 
5e3b7ee77381: Pull complete 
5bd037f007fd: Pull complete 
4cda774ad2ec: Pull complete 
775f22adee62: Pull complete 
Digest: sha256:f895871972c1c91eb6a896eee68468f40289395a1e58c492e1be7929d0f8703b
Status: Downloaded newer image for nvidia/cuda:11.8.0-base-ubuntu22.04
Sun Mar 31 13:27:28 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090 Ti     Off |   00000000:01:00.0 Off |                  Off |
|  0%   39C    P8             14W /  450W |      84MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

If the graphics card's information is printed as above, the setup works.
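The "CUDA Version" in the nvidia-smi banner is the highest CUDA version the host driver supports. It is not the CUDA version inside the image: the image here is 11.8, yet the banner shows 12.4.

The sketch below reads both values from the banner line. It assumes the layout shown above. smi_versions is a hypothetical helper.

```python
import re

HEADER_RE = re.compile(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)")


def smi_versions(smi_output):
    """Return (driver_version, cuda_version) from the nvidia-smi banner, or None."""
    match = HEADER_RE.search(smi_output)
    return match.groups() if match else None


banner = "| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |"
print(smi_versions(banner))  # ('550.54.15', '12.4')
```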

 

Running the notebook and checking GPU access

docker run --gpus all -d -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter

Connecting to the notebook and finding the token

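Open localhost:8888 in a browser. It asks for a token, which can be read from inside the container. In the command below, 8bf is the beginning of the container ID printed by docker run (docker ps also shows it).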

$ docker exec -it 8bf /bin/bash

________                               _______________
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ /
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/


WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.

To avoid this, run the container by specifying your user's userid:

$ docker run -u $(id -u):$(id -g) args...

root@8bfb03e57f8b:/tf# history
    1  jupyter server list
    2  exit
    3  history
root@8bfb03e57f8b:/tf# jupyter server list
Currently running servers:
http://8bfb03e57f8b:8888/?token=48c6439946bd0e34d389c792708f3ee35cf98d929a889c99 :: /tf

 

The value after token= is what you enter on the login page.
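The token can also be extracted from the jupyter server list output instead of being copied by hand. The input text could come from running docker exec 8bf jupyter server list on the host, assuming jupyter is on the container's PATH.

The host name in the printed URL is the container's own, so open http://localhost:8888 on the host instead. jupyter_tokens is a hypothetical helper built only on the standard library.

```python
from urllib.parse import parse_qs, urlparse


def jupyter_tokens(server_list_output):
    """Extract the token from each URL printed by `jupyter server list`."""
    tokens = []
    for line in server_list_output.splitlines():
        url = line.split(" :: ")[0].strip()  # drop the ":: /tf" notebook directory
        if not url.startswith("http"):
            continue
        query = parse_qs(urlparse(url).query)
        if "token" in query:
            tokens.append(query["token"][0])
    return tokens


listing = """Currently running servers:
http://8bfb03e57f8b:8888/?token=48c6439946bd0e34d389c792708f3ee35cf98d929a889c99 :: /tf
"""
print(jupyter_tokens(listing))  # ['48c6439946bd0e34d389c792708f3ee35cf98d929a889c99']
```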

Run the following code in the notebook. If it prints a GPU count of 1 or more, the setup works.

import tensorflow as tf

# tf.config.list_physical_devices is the stable API (TF 2.1+); the experimental alias also works
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
