Goal
Install the latest NVIDIA driver on Ubuntu 22.04, then install the CUDA version that matches it.
Confirm that the GPU is visible from the TensorFlow Jupyter image.
I wrote this guide while fixing the errors shown below.
Error messages
2024-03-31 12:27:30.432902: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2024-03-31 12:27:30.432953: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:134] retrieving CUDA diagnostic information for host: 03c8b10def1c
2024-03-31 12:27:30.432958: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:141] hostname: 03c8b10def1c
2024-03-31 12:27:30.433014: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:165] libcuda reported version is: 545.23.6
2024-03-31 12:27:30.433028: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:169] kernel reported version is: 535.161.7
2024-03-31 12:27:30.433032: E external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:251] kernel version 535.161.7 does not match DSO version 545.23.6 -- cannot find working devices in this configuration
load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
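The key lines in the log above are the two version reports: the user-space libcuda library is 545.23.6 while the loaded kernel module is 535.161.7. As a rough sketch (the function name and regular expressions are my own, not part of TensorFlow), the following Python pulls both versions out of such a log and reports whether they differ.

```python
import re

# Patterns for the two diagnostic lines seen in the log above.
LIBCUDA_RE = re.compile(r"libcuda reported version is: ([\d.]+)")
KERNEL_RE = re.compile(r"kernel reported version is: ([\d.]+)")


def find_driver_versions(log_text):
    """Return (libcuda_version, kernel_version); either may be None if absent."""
    lib = LIBCUDA_RE.search(log_text)
    ker = KERNEL_RE.search(log_text)
    return (lib.group(1) if lib else None, ker.group(1) if ker else None)


# Shortened copies of the two relevant log lines.
sample_log = """\
cuda_diagnostics.cc:165] libcuda reported version is: 545.23.6
cuda_diagnostics.cc:169] kernel reported version is: 535.161.7
"""

libcuda, kernel = find_driver_versions(sample_log)
if libcuda and kernel and libcuda != kernel:
    print(f"Mismatch: libcuda {libcuda} vs kernel module {kernel}")
```

A mismatch like this usually means two driver packages are partially installed, which is why the steps below purge everything before reinstalling.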
Removing the existing drivers
Remove the NVIDIA driver
sudo apt-get remove --purge '^nvidia-.*'
sudo ubuntu-drivers autoinstall
sudo apt-get update
Remove CUDA and cuDNN
sudo apt-get --purge remove "*cuda*"
sudo apt-get --purge remove "*cudnn*"
Remove the NVIDIA Container Toolkit and Docker
sudo apt-get remove --purge nvidia-docker2 nvidia-container-toolkit docker docker-engine docker.io containerd runc
Reinstalling the driver
https://developer.nvidia.com/cuda-toolkit-archive
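The driver and the CUDA toolkit must be compatible: the driver has to be at least as new as the toolkit requires, and the user-space driver library (libcuda) must be the same version as the loaded kernel module, which is exactly what the error above complains about. The simplest way to guarantee both is to install the driver from the same NVIDIA repository as the toolkit, following the guide on the page above.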
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4
sudo apt-get install -y nvidia-driver-550-open
sudo apt-get install -y cuda-drivers-550
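To make the compatibility requirement concrete, here is a small Python sketch that compares a driver version against a required minimum. The helper names are hypothetical, and the minimum of 550.54.14 for CUDA 12.4 is an assumption that should be checked against NVIDIA's CUDA release notes (CUDA minor-version compatibility may also accept somewhat older drivers).

```python
def parse_version(text):
    """Turn '550.54.15' into (550, 54, 15) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_new_enough(installed, minimum):
    """True if the installed driver version is at least the required minimum."""
    return parse_version(installed) >= parse_version(minimum)


# ASSUMPTION: Linux driver version paired with CUDA 12.4; confirm in the release notes.
MIN_DRIVER_FOR_CUDA_12_4 = "550.54.14"

# Driver installed by this guide.
print(driver_is_new_enough("550.54.15", MIN_DRIVER_FOR_CUDA_12_4))
# Kernel module version from the error log.
print(driver_is_new_enough("535.161.7", MIN_DRIVER_FOR_CUDA_12_4))
```

Comparing tuples of integers avoids the trap of string comparison, where "535.161.7" and "535.161.07" would be treated as different versions.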
Installing Docker and the NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
sudo usermod -aG docker $USER
newgrp docker
Docker Configuration Result
$ docker info | grep nvidia
Runtimes: io.containerd.runc.v2 nvidia runc
The output should look like the above, with nvidia listed among the runtimes.
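If you want to script this check instead of reading the output by eye, a sketch along these lines could work. The function name is hypothetical, and it assumes the `Runtimes:` line has the space-separated layout shown above.

```python
def registered_runtimes(docker_info_text):
    """Extract runtime names from the 'Runtimes:' line of `docker info` output."""
    for line in docker_info_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Runtimes:"):
            return stripped[len("Runtimes:"):].split()
    return []


# Sample line copied from the output above.
sample = " Runtimes: io.containerd.runc.v2 nvidia runc"
runtimes = registered_runtimes(sample)
print("nvidia" in runtimes)
```

On a live system the text could come from `subprocess.run(["docker", "info"], capture_output=True, text=True).stdout`.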
I recommend editing the Docker daemon configuration file as follows.
/etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
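If /etc/docker/daemon.json already contains other settings, overwriting it with the snippet above would lose them. The following Python is a minimal sketch of merging the nvidia runtime into an existing configuration; the function name and the sample `log-driver` setting are hypothetical.

```python
import copy
import json


def with_nvidia_runtime(config):
    """Return a copy of a daemon.json dict with the nvidia runtime set as default."""
    merged = copy.deepcopy(config)
    runtimes = merged.setdefault("runtimes", {})
    runtimes["nvidia"] = {"path": "nvidia-container-runtime", "runtimeArgs": []}
    merged["default-runtime"] = "nvidia"
    return merged


# Hypothetical existing settings; a real file may contain other keys.
existing = {"log-driver": "json-file"}
print(json.dumps(with_nvidia_runtime(existing), indent=4))
```

The Docker daemon only reads this file at startup, so the change takes effect after `sudo systemctl restart docker` or a reboot.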
Reboot
sudo reboot
Verifying the Docker integration
sudo docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
[sudo] password for steven:
Unable to find image 'nvidia/cuda:11.8.0-base-ubuntu22.04' locally
11.8.0-base-ubuntu22.04: Pulling from nvidia/cuda
aece8493d397: Already exists
5e3b7ee77381: Pull complete
5bd037f007fd: Pull complete
4cda774ad2ec: Pull complete
775f22adee62: Pull complete
Digest: sha256:f895871972c1c91eb6a896eee68468f40289395a1e58c492e1be7929d0f8703b
Status: Downloaded newer image for nvidia/cuda:11.8.0-base-ubuntu22.04
Sun Mar 31 13:27:28 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090 Ti     Off |   00000000:01:00.0 Off |                  Off |
|  0%   39C    P8             14W /  450W |      84MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
If the graphics card's information is printed as above, the integration works.
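Note that the image is tagged CUDA 11.8 while the header reports CUDA 12.4: the "CUDA Version" field in nvidia-smi shows the highest CUDA version the host driver supports, not the toolkit inside the container. As an illustrative sketch (the function name and pattern are my own), the two header fields can be extracted like this:

```python
import re

# Matches the header line of nvidia-smi regardless of how many spaces separate fields.
HEADER_RE = re.compile(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)")


def parse_smi_header(smi_text):
    """Return (driver_version, cuda_version) from nvidia-smi output, or None."""
    match = HEADER_RE.search(smi_text)
    return (match.group(1), match.group(2)) if match else None


# Header line copied from the output above.
header = "| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |"
print(parse_smi_header(header))
```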
Installing the notebook and verifying GPU access
docker run --gpus all -d -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter
Connecting to the notebook and finding the token
localhost:8888
$ docker exec -it 8bf /bin/bash
________ _______________
___ __/__________________________________ ____/__ /________ __
__ / _ _ \_ __ \_ ___/ __ \_ ___/_ /_ __ /_ __ \_ | /| / /
_ / / __/ / / /(__ )/ /_/ / / _ __/ _ / / /_/ /_ |/ |/ /
/_/ \___//_/ /_//____/ \____//_/ /_/ /_/ \____/____/|__/
WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.
To avoid this, run the container by specifying your user's userid:
$ docker run -u $(id -u):$(id -g) args...
root@8bfb03e57f8b:/tf# history
1 jupyter server list
2 exit
3 history
root@8bfb03e57f8b:/tf# jupyter server list
Currently running servers:
http://8bfb03e57f8b:8888/?token=48c6439946bd0e34d389c792708f3ee35cf98d929a889c99 :: /tf
The string after token= is the value to enter on the login page at localhost:8888.
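Copying the token by hand is error-prone, so here is a small sketch that extracts it from a `jupyter server list` line using the standard library. The function name is hypothetical, and it assumes the `URL :: directory` layout shown above.

```python
from urllib.parse import parse_qs, urlparse


def extract_token(server_list_line):
    """Return the token query parameter from a `jupyter server list` line, or None."""
    url = server_list_line.split(" :: ")[0].strip()
    tokens = parse_qs(urlparse(url).query).get("token")
    return tokens[0] if tokens else None


# Line copied from the output above.
line = "http://8bfb03e57f8b:8888/?token=48c6439946bd0e34d389c792708f3ee35cf98d929a889c99 :: /tf"
token = extract_token(line)
# The container hostname is typically not resolvable from the host, so use localhost.
print(f"http://localhost:8888/?token={token}")
```

Opening the printed URL in a browser should log in directly without pasting the token.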
Run the following code in a notebook; if it prints the number of GPUs (1 or more), the setup works.
import tensorflow as tf
# list_physical_devices is the stable API; the experimental alias is no longer needed
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
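Beyond the count, it can help to see which device TensorFlow actually found. The sketch below lists device names; the function name is my own, and it returns None when TensorFlow is not installed so that it can be run outside the container without failing.

```python
def describe_gpus():
    """Return a list of GPU device names, or None if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    names = []
    for gpu in tf.config.list_physical_devices("GPU"):
        # get_device_details returns a dict; device_name may be missing on some setups.
        details = tf.config.experimental.get_device_details(gpu)
        names.append(details.get("device_name", gpu.name))
    return names


print(describe_gpus())
```

Inside the notebook from this guide, the list would be expected to contain the RTX 3090 Ti shown by nvidia-smi; an empty list means TensorFlow started but could not see the GPU.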