[AI] Using an NVIDIA GPU with Docker on Ubuntu 22

By enumclass

2024. 3. 31. 22:34

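1. Goal

1.1. Error messages

2. Removing the existing drivers

2.1. Removing the NVIDIA driver

2.2. Removing the CUDA driver

2.3. Removing the NVIDIA toolkit and Docker

3. Reinstalling the driver

4. Installing Docker and the Docker toolkit

4.0.1. Docker Configuration Result

4.1. Reboot

5. Verifying the Docker integration

6. Installing the notebook and verifying GPU access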

1. Goal

Install the latest NVIDIA driver on Ubuntu 22, and install a CUDA version that matches it.

Then verify that the GPU is visible from the TensorFlow Jupyter image.

This guide was used to resolve errors like the ones below.

1.1. Error messages

2024-03-31 12:27:30.432902: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2024-03-31 12:27:30.432953: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:134] retrieving CUDA diagnostic information for host: 03c8b10def1c
2024-03-31 12:27:30.432958: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:141] hostname: 03c8b10def1c
2024-03-31 12:27:30.433014: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:165] libcuda reported version is: 545.23.6
2024-03-31 12:27:30.433028: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:169] kernel reported version is: 535.161.7
2024-03-31 12:27:30.433032: E external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:251] kernel version 535.161.7 does not match DSO version 545.23.6 -- cannot find working devices in this configuration

load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
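The key point in the first log is that libcuda (the user-space library) and the kernel module report different driver versions. As a rough sketch, the check can be automated by pulling both versions out of the log text. The helper names below (`driver_versions`, `versions_match`) are hypothetical and not part of TensorFlow; the regular expressions only assume the line format shown in the log above.

```python
import re

# Hypothetical helpers (not part of TensorFlow): they only assume the
# "reported version is:" wording seen in the log above.
LIBCUDA_RE = re.compile(r"libcuda reported version is: ([\d.]+)")
KERNEL_RE = re.compile(r"kernel reported version is: ([\d.]+)")


def driver_versions(log_text):
    """Return (libcuda_version, kernel_version); either may be None."""
    libcuda = LIBCUDA_RE.search(log_text)
    kernel = KERNEL_RE.search(log_text)
    return (
        libcuda.group(1) if libcuda else None,
        kernel.group(1) if kernel else None,
    )


def versions_match(log_text):
    """True only when both versions were found and are identical."""
    libcuda, kernel = driver_versions(log_text)
    return libcuda is not None and libcuda == kernel


# Two lines copied from the error log above.
sample_log = (
    "I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:165] "
    "libcuda reported version is: 545.23.6\n"
    "I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:169] "
    "kernel reported version is: 535.161.7\n"
)

print(driver_versions(sample_log))
print(versions_match(sample_log))
```

If `versions_match` returns False, the user-space libraries and the kernel module come from different driver releases, which is the situation the rest of this guide cleans up.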

2. Removing the existing drivers

2.1. Removing the NVIDIA driver

sudo apt-get remove --purge '^nvidia-.*'

sudo ubuntu-drivers autoinstall

sudo apt-get update

2.2. Removing the CUDA driver

sudo apt-get --purge remove "*cuda*"
sudo apt-get --purge remove "*cudnn*"

2.3. Removing the NVIDIA toolkit and Docker

sudo apt-get remove --purge nvidia-docker2 nvidia-container-toolkit docker docker-engine docker.io containerd runc

3. Reinstalling the driver

https://developer.nvidia.com/cuda-toolkit-archive

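The driver version and the CUDA version must match: the driver has to be a release that supports the installed CUDA toolkit. To guarantee this, install the NVIDIA driver from the same NVIDIA pages as CUDA, following the guide there, so that the two stay aligned.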

https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4

sudo apt-get install -y nvidia-driver-550-open
sudo apt-get install -y cuda-drivers-550
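To make the "driver must match CUDA" rule concrete, here is a minimal sketch that compares a driver version against the minimum driver a CUDA toolkit expects. The table `MIN_DRIVER_FOR_CUDA` and the function names are my own; the minimum versions are an assumption based on NVIDIA's CUDA release notes as I understand them, so check them against the current release notes before relying on them.

```python
# Assumption: minimum Linux driver versions per CUDA toolkit release, as
# listed in NVIDIA's CUDA release notes. Verify before relying on them.
MIN_DRIVER_FOR_CUDA = {
    "11.8": "520.61.05",
    "12.4": "550.54.14",
}


def parse_version(text):
    """Turn '550.54.15' into (550, 54, 15) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def driver_supports(driver_version, cuda_version):
    """True if the driver is at least the minimum listed for that toolkit."""
    minimum = MIN_DRIVER_FOR_CUDA[cuda_version]
    return parse_version(driver_version) >= parse_version(minimum)


# The 550 driver installed above versus the 535 driver from the error log.
print(driver_supports("550.54.15", "12.4"))
print(driver_supports("535.161.07", "12.4"))
```

Comparing tuples of integers rather than raw strings avoids mistakes such as treating "535.161" as larger than "550.54" because of character ordering.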

4. Installing Docker and the Docker toolkit

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

sudo usermod -aG docker $USER
newgrp docker

4.0.1. Docker Configuration Result

$ docker info | grep nvidia
 Runtimes: io.containerd.runc.v2 nvidia runc

The output should look like the above, with nvidia listed among the runtimes.
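If you want to script this check instead of reading the output by eye, something like the following could work. It is only a sketch: `registered_runtimes` is a hypothetical helper, and it assumes the `Runtimes:` line has the space-separated format shown above.

```python
def registered_runtimes(docker_info_text):
    """Return the runtime names from the 'Runtimes:' line of `docker info`."""
    for line in docker_info_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Runtimes:"):
            # Everything after the label is a space-separated list of names.
            return stripped[len("Runtimes:"):].split()
    return []


# The line printed by `docker info | grep nvidia` above.
sample = " Runtimes: io.containerd.runc.v2 nvidia runc"

print("nvidia" in registered_runtimes(sample))
```

In a real script the text would come from running `docker info`, for example via `subprocess.run(["docker", "info"], capture_output=True, text=True).stdout`.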

It is recommended to edit the Docker daemon configuration file as follows, so that nvidia becomes the default runtime:

/etc/docker/daemon.json

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime":"nvidia"
}
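If /etc/docker/daemon.json already contains other settings, overwriting it with the snippet above would lose them. A minimal sketch of merging instead is shown below. The function name `with_nvidia_runtime` is hypothetical, and the `log-driver` entry is just a stand-in for whatever your existing file might contain.

```python
import json


def with_nvidia_runtime(existing_json_text):
    """Return daemon.json text with the nvidia runtime set as the default."""
    # An empty or missing file is treated as an empty configuration.
    config = json.loads(existing_json_text) if existing_json_text.strip() else {}
    runtimes = config.setdefault("runtimes", {})
    runtimes["nvidia"] = {"path": "nvidia-container-runtime", "runtimeArgs": []}
    config["default-runtime"] = "nvidia"
    return json.dumps(config, indent=4)


# Hypothetical existing content; your file may be empty or differ.
existing = '{"log-driver": "json-file"}'

print(with_nvidia_runtime(existing))
```

The result keeps the existing keys and adds the same `runtimes` and `default-runtime` entries as the file above. Writing it back to /etc/docker/daemon.json requires root, and Docker has to be restarted afterwards.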

4.1. Reboot

sudo reboot

5. Verifying the Docker integration

$ sudo docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
[sudo] password for steven: 
Unable to find image 'nvidia/cuda:11.8.0-base-ubuntu22.04' locally
11.8.0-base-ubuntu22.04: Pulling from nvidia/cuda
aece8493d397: Already exists 
5e3b7ee77381: Pull complete 
5bd037f007fd: Pull complete 
4cda774ad2ec: Pull complete 
775f22adee62: Pull complete 
Digest: sha256:f895871972c1c91eb6a896eee68468f40289395a1e58c492e1be7929d0f8703b
Status: Downloaded newer image for nvidia/cuda:11.8.0-base-ubuntu22.04
Sun Mar 31 13:27:28 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090 Ti     Off |   00000000:01:00.0 Off |                  Off |
|  0%   39C    P8             14W /  450W |      84MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

If the graphics card's information is printed as above, the integration works.
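Note that the container image is CUDA 11.8 while the header reports CUDA Version 12.4: that field reflects the highest CUDA version the host driver supports, not the toolkit inside the image. As a small sketch, the header can be parsed like this; `parse_smi_header` is a hypothetical helper that assumes the header layout shown above.

```python
import re

# Assumes the header layout printed by nvidia-smi in the output above.
HEADER_RE = re.compile(
    r"NVIDIA-SMI\s+(?P<smi>[\d.]+)\s+"
    r"Driver Version:\s+(?P<driver>[\d.]+)\s+"
    r"CUDA Version:\s+(?P<cuda>[\d.]+)"
)


def parse_smi_header(output):
    """Return the nvidia-smi, driver and CUDA versions as a dict, or None."""
    match = HEADER_RE.search(output)
    return match.groupdict() if match else None


# Header line copied from the output above.
sample = (
    "| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15"
    "      CUDA Version: 12.4     |"
)

print(parse_smi_header(sample))
```

Getting None back means the header was not found, which usually indicates that nvidia-smi itself failed inside the container.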

6. Installing the notebook and verifying GPU access

docker run --gpus all -d -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter

Connecting to the notebook and finding the token

localhost:8888

$ docker exec -it 8bf /bin/bash

________                               _______________
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ /
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/


WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.

To avoid this, run the container by specifying your user's userid:

$ docker run -u $(id -u):$(id -g) args...

root@8bfb03e57f8b:/tf# history
    1  jupyter server list
    2  exit
    3  history
root@8bfb03e57f8b:/tf# jupyter server list
Currently running servers:
http://8bfb03e57f8b:8888/?token=48c6439946bd0e34d389c792708f3ee35cf98d929a889c99 :: /tf

The value after token= is what you enter on the notebook login page.
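Copying the token by hand works fine, but if you do this often, the token can be extracted from the `jupyter server list` line with the standard library. This is only a sketch; `extract_token` is a hypothetical helper that assumes the `URL :: directory` format shown above.

```python
from urllib.parse import parse_qs, urlparse


def extract_token(server_list_line):
    """Return the token from a 'URL :: directory' line, or None if absent."""
    # The URL is everything before the ' :: ' separator.
    url = server_list_line.split(" :: ")[0].strip()
    query = parse_qs(urlparse(url).query)
    tokens = query.get("token", [])
    return tokens[0] if tokens else None


# Line copied from the `jupyter server list` output above.
line = (
    "http://8bfb03e57f8b:8888/"
    "?token=48c6439946bd0e34d389c792708f3ee35cf98d929a889c99 :: /tf"
)

print(extract_token(line))
```

The container hostname in the URL is not reachable from the host, so use the token with localhost:8888 instead.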

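Run the following code in a notebook cell. If it prints the number of GPUs (1 or more), the setup succeeded.

import tensorflow as tf

# List the GPUs TensorFlow can see; an empty list means no GPU is visible in the container.
gpus = tf.config.list_physical_devices('GPU')
print("Num GPUs Available: ", len(gpus))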
