If I set up a conda pytorch environment like this:
conda activate pytorch-cuda
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
That works, at least insofar as I can import torch in Python. If, however, I add cuDNN:
conda install cudnn -c nvidia
Things are no longer warm and fuzzy:
(torch-cuda1) pgoetz@finglas ~$ python --version
Python 3.11.5
(torch-cuda1) pgoetz@finglas ~$ python
Python 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lusr/opt/miniconda/envs/torch-cuda1/lib/python3.11/site-packages/torch/__init__.py", line 229, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: /lusr/opt/miniconda/envs/torch-cuda1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so: undefined symbol: cudaMemPoolSetAttribute, version libcudart.so.11.0
>>>
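For anyone who wants to check a given libcudart without importing torch, here is a rough sketch using ctypes. The helper name exports_symbol is my own invention, and the check only looks for the symbol name; it ignores the symbol version tag that the dynamic linker also compares, so treat a True result as a hint rather than proof.

```python
import ctypes


def exports_symbol(library, symbol):
    """Return True if the shared library exposes `symbol`.

    `library` is a path or soname as accepted by ctypes.CDLL. Passing None
    searches the symbols already loaded into the current process.
    """
    handle = ctypes.CDLL(library)
    # ctypes resolves symbols lazily on attribute access and raises
    # AttributeError when the lookup fails, which hasattr turns into False.
    return hasattr(handle, symbol)


# Example call (the path is a placeholder for the environment's lib directory):
# exports_symbol("/path/to/env/lib/libcudart.so.11.0", "cudaMemPoolSetAttribute")
```

Running it once against each libcudart file in the environment's lib directory should show which of them actually provides the symbol.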
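What’s happening is that installing the cuDNN conda package brings in an older libcudart (11.1.74) and points the libcudart.so.11.0 symlink at it. That older runtime does not provide cudaMemPoolSetAttribute, hence the ImportError. Here is what is in /miniconda/envs/pytorch-cuda/lib before cuDNN is installed: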
# ls -l libcudart*
-rwxr-xr-x 3 root root 695712 Sep 21 2022 libcudart.so.11.8.89
Here is what it looks like after the cudnn package is installed from the nvidia channel:
# ls -l libcudart*
lrwxrwxrwx 1 root root 20 Sep 25 13:12 libcudart.so -> libcudart.so.11.1.74
lrwxrwxrwx 1 root root 20 Sep 25 13:12 libcudart.so.11.0 -> libcudart.so.11.1.74
-rwxr-xr-x 2 root root 554032 Oct 14 2020 libcudart.so.11.1.74
-rwxr-xr-x 3 root root 695712 Sep 21 2022 libcudart.so.11.8.89
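To find out which conda package each of these files came from, the per-package JSON records in the environment's conda-meta directory can be searched. The following is only a sketch: owners_of is a hypothetical helper, and it assumes each record carries the usual "name", "version" and "files" keys.

```python
import json
from pathlib import Path


def owners_of(prefix, filename):
    """Return 'name-version' for every conda package that lists `filename`."""
    owners = []
    for meta in sorted(Path(prefix, "conda-meta").glob("*.json")):
        record = json.loads(meta.read_text())
        # "files" holds the paths the package installed, relative to the prefix.
        if any(Path(f).name == filename for f in record.get("files", [])):
            owners.append(f'{record.get("name")}-{record.get("version")}')
    return owners


# Example call (the prefix is a placeholder for the environment directory):
# owners_of("/path/to/env", "libcudart.so.11.1.74")
```

Knowing the owning package should also help narrow down which channel to report the problem to.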
It looks like something similar is happening with libcusparse.so.11, and possibly other libraries; I didn’t bother trying to track them all down.
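Tracking the rest down could be automated along these lines. This is a sketch under my own assumptions: versioned_libs and suspicious are made-up names, the filename pattern only handles the plain libNAME.so.X.Y.Z form, and having two real files for one library is a warning sign rather than proof of breakage.

```python
import os
import re
from collections import defaultdict
from pathlib import Path


def versioned_libs(lib_dir):
    """Group entries in lib_dir by library stem, e.g. 'libcudart'."""
    report = defaultdict(lambda: {"files": [], "links": {}})
    for path in sorted(Path(lib_dir).iterdir()):
        match = re.match(r"(lib[^.]+)\.so(\.|$)", path.name)
        if not match:
            continue
        entry = report[match.group(1)]
        if path.is_symlink():
            # Record where the symlink points without resolving it further.
            entry["links"][path.name] = os.readlink(path)
        elif path.is_file():
            entry["files"].append(path.name)
    return dict(report)


def suspicious(report):
    """Return stems that have more than one real (non-symlink) file."""
    return sorted(stem for stem, e in report.items() if len(e["files"]) > 1)


# Example call (the path is a placeholder for the environment's lib directory):
# report = versioned_libs("/path/to/env/lib")
# for stem in suspicious(report):
#     print(stem, report[stem])
```

For the listing above this would flag libcudart, with both symlinks pointing at libcudart.so.11.1.74.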
My question is: whom should I bring this up with? The maintainers of the pytorch and nvidia channels? Possibly just the nvidia channel?
Thanks.