I recently joined a startup and have been tasked with setting up package management for our internal Python libraries. We work in the biotech and ML space, and a lot of the packages we use are indexed on conda channels.
Our current setup installs our internal repositories, which are built with setup.py, directly from GitHub. Initially I thought we should just use Poetry for all of our package management and, for our own private libraries, have an AWS CodeBuild/CodeArtifact setup to host them. I still think this is the best option for package management in Python.
On the other hand, too many of the packages we need are available in the conda ecosystem but not on PyPI. We’ve also noticed several dependency clashes between pip and conda when using both at the same time, so we might as well lean into the conda ecosystem completely.
To do this, I think a good approach would be to use a private conda channel for our own libraries and conda-forge for any packages we would otherwise get from pip. If we can’t find a package on conda-forge, there seems to be a fairly easy process for getting it there from PyPI.
My question is the following: Has anyone hosted their own conda channel before?
I’ve seen and tried options from:
- AWS: Private repository for runtime dependencies - Amazon SageMaker
- Azure: Create custom Conda channel for package management - Azure Synapse Analytics | Microsoft Learn
- making a channel using an S3 bucket as a web server - JFrog Help Center
- using anaconda: Working with private channels using third-party tools — Anaconda documentation
- quetz: GitHub - mamba-org/quetz: The Open-Source Server for Conda Packages
The only way to use the AWS and Azure solutions is to mount the files from S3 locally in order to use the channel correctly. This just does not seem like the right way to use a conda channel, not to mention that it involves downloading all or most of the files in the bucket in order to index the channel properly.
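
For context, here is a minimal sketch of what a conda client requests from a channel served over HTTP: one `repodata.json` index per platform subdirectory (the layout `conda index` produces). The channel URL and the `repodata_urls` helper are hypothetical, for illustration only; real clients may also request compressed or trimmed variants of the index.

```python
def repodata_urls(channel_url, subdirs=("linux-64", "noarch")):
    """Return the index URLs a client would request for each platform subdir."""
    # Drop any trailing slash so the joined URL has exactly one separator.
    base = channel_url.rstrip("/")
    return [f"{base}/{subdir}/repodata.json" for subdir in subdirs]


# Hypothetical private channel URL (an assumption, not a real endpoint).
for url in repodata_urls("https://conda.example.com/internal"):
    print(url)
```

The point is that the client side only needs plain HTTP access to these index files and the package files they list, which is why mounting the whole bucket locally feels like the wrong shape for the problem.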
The Anaconda and Quetz solutions seem like a step up from mounting S3 buckets, but they don’t allow federated logins, at least not natively, which leaves using something like Artifactory or an equivalent tool. Quetz also doesn’t seem to be in active development.
I haven’t really found any reports or guides on the standard way of doing this, which I find really surprising, because I can’t be the only one running into this. As far as I can tell, Artifactory is the most enterprise-ready solution available, but I’m curious whether there’s something I haven’t seen, and whether others have run into this problem as well.
Here is what our environment.yml file currently looks like:
name: package
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - awscli=1.27.134
  - pip
  - python=3.10.11
  - pip:
    - GitHub.com/internal_repo/version/files.tar.gz
    - other pip dependencies
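
As a rough sketch of how one might audit a file like this before moving everything to conda, the snippet below separates the conda specs from the entries in the nested `pip:` section, since those are the ones that would need a conda-forge or private-channel equivalent. The `split_dependencies` helper is hypothetical, and the dict is written out by hand to mirror the file above (it is the shape `yaml.safe_load` from PyYAML would return, assuming that library is used to read the file).

```python
def split_dependencies(env):
    """Separate conda specs from pip specs in a parsed environment.yml."""
    conda_specs, pip_specs = [], []
    for dep in env.get("dependencies", []):
        if isinstance(dep, dict):
            # A nested mapping such as {"pip": [...]} holds pip-installed requirements.
            pip_specs.extend(dep.get("pip", []))
        else:
            conda_specs.append(dep)
    return conda_specs, pip_specs


# Hand-written mirror of the environment.yml above; the
# "other pip dependencies" placeholder is left out.
env = {
    "name": "package",
    "channels": ["conda-forge", "bioconda", "defaults"],
    "dependencies": [
        "awscli=1.27.134",
        "pip",
        "python=3.10.11",
        {"pip": ["GitHub.com/internal_repo/version/files.tar.gz"]},
    ],
}

conda_specs, pip_specs = split_dependencies(env)
print("conda:", conda_specs)
print("pip (candidates to move to a conda channel):", pip_specs)
```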