Conda scratch disk workflow on typical HPC systems

It’s typical for HPC clusters to have a large (and often pretty slow) network storage system, plus ephemeral node-local storage that is usually much faster. My workflow involves running Jupyter Lab with conda (well, mamba) environments stored on the network, and this can lead to excruciating wait times when you restart your kernel and re-run that one cell at the top of your notebook with all the imports.

Is there a way to copy my environments to the scratch disk transparently? In particular, I currently have several environments and use nb-conda-kernels to toggle between them inside Jupyter, and I want to make sure that setup keeps working. Does anyone have a personal workflow along these lines? Maybe setting $CONDA_PREFIX would be necessary? Thanks!

I’ve done a lot of work with conda on HPC systems and am familiar with the issues you mention. For what it’s worth, I’ve found that Lustre is the worst of the lot for conda, in that it is optimized for large-file I/O and performs poorly on the many-small-files access pattern that conda represents. GPFS is better in my experience, and NFS does better, too. Some of the HPC systems I’ve worked with have Lustre or GPFS for scratch/project space, but also have an NFS-based filesystem for home directories, and, quota permitting, it helped me to build my conda environments on NFS.
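Concretely, that just means pointing conda’s environment and package directories at the NFS mount. A minimal sketch, where `$HOME/conda` stands in for whatever your site’s NFS space actually is:

```bash
# Register NFS-backed locations for environments and the package cache.
# $HOME/conda is an illustrative path; substitute your site's NFS space.
conda config --add envs_dirs "$HOME/conda/envs"
conda config --add pkgs_dirs "$HOME/conda/pkgs"

# New environments now land on NFS, and nb-conda-kernels should still
# find them, since it discovers envs in any registered envs_dirs location.
mamba create -n analysis python=3.11 numpy jupyterlab
```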

Where that wasn’t possible, I often found it fast enough to install Miniconda/Miniforge/Mambaforge on my allocated HPC node’s local disk (or even in /dev/shm) and reconstruct my conda environment in that ephemeral conda installation. With all that scripted, I could be up and running on a compute node in a minute or two, with great performance.
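A rough sketch of what that script can look like, assuming `$TMPDIR` points at node-local disk and you keep an `environment.yml` export on the network filesystem (the names and paths are just examples):

```bash
#!/bin/bash
# Build a throwaway conda installation on node-local storage and
# recreate a known environment inside it. Adjust paths for your site.
set -euo pipefail

SCRATCH="${TMPDIR:-/dev/shm}/$USER-conda"
mkdir -p "$SCRATCH"

# Fetch and batch-install Miniforge into the scratch location.
curl -fsSL -o "$SCRATCH/miniforge.sh" \
  https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash "$SCRATCH/miniforge.sh" -b -p "$SCRATCH/miniforge3"

# Recreate the environment from a spec exported earlier with
# `mamba env export > environment.yml` (-n overrides the name in the file).
"$SCRATCH/miniforge3/bin/mamba" env create -n work -f "$HOME/environment.yml"

# Activate it in the current shell.
source "$SCRATCH/miniforge3/etc/profile.d/conda.sh"
conda activate work
```

Resolving from an `environment.yml` still downloads packages over the network, so if that step dominates, a packed tarball (below) avoids it entirely.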

If none of that works for you, you might look at conda-pack. I haven’t used it, but it claims to address conda’s portability problem (i.e. you can’t just copy an environment from one path to another and expect it to work) in a way that might be helpful to you.
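For reference, the workflow from the conda-pack documentation looks roughly like this (the environment name and paths are just examples):

```bash
# On the login node: pack the environment once from network storage.
conda pack -n analysis -o "$HOME/analysis.tar.gz"

# On the compute node: unpack onto local disk, activate, then rewrite
# the hard-coded prefixes inside the environment.
mkdir -p "$TMPDIR/analysis"
tar -xzf "$HOME/analysis.tar.gz" -C "$TMPDIR/analysis"
source "$TMPDIR/analysis/bin/activate"
conda-unpack
```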

Thank you for the reply, and for your insightful comments about the filesystems. It’s an important point that multiple filesystems are often available on the same cluster. And I like the idea of just going from scratch :). Much appreciated.