How-to: Set up a Bioinformatics Environment


Time and again I've set up a new Linux environment (for various reasons) and I've found that while there are innumerable ways to go about doing this, I have a few standard steps that I always follow.

And the one program to rule them all is: conda !!

While most people probably associate conda with Python package management, conda goes well beyond that and serves as a general environment and package manager for scientific libraries.

As it says on conda's site:

A conda environment is a directory that contains a specific collection of conda packages that you have installed. For example, you may have one environment with NumPy 1.7 and its dependencies, and another environment with NumPy 1.6 for legacy testing. If you change one environment, your other environments are not affected. You can easily activate or deactivate environments, which is how you switch between them. You can also share your environment with someone by giving them a copy of your environment.yaml file.

First, download the Anaconda installer for your system (or select Miniconda if you need a smaller install) and follow your system-specific directions. Here, I will walk through the Linux install and setup.

bash Anaconda-2.x.x-Linux-x86[_64].sh

You'll want to add a few important channels, which are essentially package repositories. Order matters: each --add puts the channel at the top of the list, so the channel added last gets the highest priority (with the commands below, conda looks in conda-forge first). Do note, however, that while adding channels increases the number of packages conda can find, it also increases the time conda spends searching them for updates and resolving dependencies.

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
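
To see why conda-forge ends up first, here is a minimal sketch of what the channel list should look like after those three commands. It writes the list to a scratch file rather than your real ~/.condarc (the scratch file and the priority_order variable are only for illustration), then prints the search order. If conda is installed, the last lines ask conda for its actual setting so you can compare.

```shell
# Write a sample channel list to a scratch file (NOT your real ~/.condarc)
condarc_demo="$(mktemp)"
cat > "$condarc_demo" <<'EOF'
channels:
  - conda-forge
  - bioconda
  - defaults
EOF

# Collect the channels in priority order, highest first
priority_order="$(sed -n 's/^  - //p' "$condarc_demo" | paste -sd' ' -)"
echo "search order: $priority_order"

# On a machine with conda installed, compare against the real setting
if command -v conda >/dev/null 2>&1; then
    conda config --show channels
fi
```

If the real list comes back in a different order, re-adding a channel with --add should move it back to the top.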

From here, it's easy to install desired packages.

conda install samtools    # Note: a C-based package
conda install pysam       # a python package
conda install gffutils    # another python package

Now, I don't recommend you install all of your packages to a single environment. For instance, I like to have one environment for python3, one for python2, one for R, etc.
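
As a rough sketch of that split, the snippet below turns a small plan into conda create commands. The environment names (py3_env, py2_env, tools_env) and the package groupings are hypothetical, and it only prints the commands so you can review them before running anything.

```shell
# Hypothetical plan: one "env_name:packages" entry per line
env_plan="py3_env:python=3 pysam gffutils
py2_env:python=2.7
tools_env:samtools"

# Turn each entry into the matching conda create command
create_cmds="$(printf '%s\n' "$env_plan" | while IFS=: read -r env_name pkgs; do
    echo "conda create -n $env_name $pkgs"
done)"

# Review the commands, then paste the ones you want into your shell
printf '%s\n' "$create_cmds"
```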

As an example, let's create an R environment. My base environment is python3 (since it's my primary tool), but you may make R your primary and Python a secondary. Warning: installing R through conda works great unless you already have R or R libraries installed the traditional way. The existing R libraries seem to be the biggest issue, and removing them allows the conda R to work properly.

conda create -n mro_env r-essentials mro-base

You may want to check that you've properly set up the environment (or look up its name if you ever forget what you called it).

conda env list

On my Linux system, I had to add one line to my .bashrc file so I could activate my environments. You can try activating without adding this line, and conda will tell you if you need it. Conda used to add a path export to your .bashrc, but that is no longer recommended.

echo ". /home/lboat/anaconda3/etc/profile.d/conda.sh" >> ~/.bashrc
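
One catch with the echo line above: run it twice and the line lands in .bashrc twice. Here is a small sketch of a guard against that. The add_conda_hook function is my own helper (not part of conda), and the anaconda3 path is an assumption you should swap for your install location. The demo writes to a scratch file so nothing in your home directory changes.

```shell
# Append the conda.sh source line to an rc file only if it is not already there.
# Both arguments are illustrative; adjust the paths to your own install.
add_conda_hook() {
    rc_file="$1"
    conda_sh="$2"
    hook_line=". $conda_sh"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF "$hook_line" "$rc_file" 2>/dev/null; then
        echo "$hook_line" >> "$rc_file"
    fi
}

# Demonstrate on a scratch file instead of the real ~/.bashrc
demo_rc="$(mktemp)"
add_conda_hook "$demo_rc" "$HOME/anaconda3/etc/profile.d/conda.sh"
add_conda_hook "$demo_rc" "$HOME/anaconda3/etc/profile.d/conda.sh"   # second call adds nothing
hook_count="$(grep -c 'conda.sh' "$demo_rc")"
echo "hook lines written: $hook_count"
```

To use it for real, pass ~/.bashrc as the first argument.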

# Now activate your new environment
conda activate mro_env

This has changed the paths to all of the installed conda libraries such that you are now only looking at those programs/packages installed in this environment. This is very helpful if there are package conflicts or if a program requires an older version that you don't typically use.
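
A quick way to confirm the switch is to ask the shell what it sees. This sketch assumes that conda activate has set the CONDA_DEFAULT_ENV and CONDA_PREFIX variables, which it does in the conda versions I have used; the active_env and r_path names are mine.

```shell
# CONDA_DEFAULT_ENV and CONDA_PREFIX are set by conda activate
active_env="${CONDA_DEFAULT_ENV:-none}"
echo "active environment: $active_env"
echo "environment prefix: ${CONDA_PREFIX:-not set}"

# Show which R the shell would run
r_path="$(command -v R || true)"
echo "R resolves to: ${r_path:-not found}"
```

With mro_env active, I would expect the R path to sit under the environment prefix rather than in a system directory.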

From here, you can install any library you like (even ones you installed in a different environment). Since this is an example of my R environment, I'm going to install some Bioconductor packages.

# Install Bioconductor -- if you set the channels above, the channel parameter (-c) is optional
# NOTE: installing bioconductor this way actually caused me some problems
#       read further to see a better way to install packages
conda install -c bioconda bioconductor-biocinstaller

Now, install the R packages you want:

# conda install -c bioconda bioconductor-{package_name}
# (package names on bioconda are lowercase)
conda install -c bioconda bioconductor-edger

As with other packages and programs, conda will identify dependencies and resolve versions to the best of its ability.

Now, in my case, the edgeR package had an outdated limma dependency that prevented me from loading edgeR. This provides a great opportunity to describe an alternate way to install packages.

Most people who work in R use RStudio, and it can also be installed using conda.

conda install rstudio

rstudio

Now, you can run RStudio and install any package you like as you normally would (the library location specific to your conda installation should automatically be the default install location).

# Whether that be from CRAN
install.packages("ggplot2")
# Or Bioconductor
# biocLite() comes from the BiocInstaller package installed earlier
library(BiocInstaller)
# lib path included as an example
biocLite("edgeR", lib="~/anaconda3/envs/mro_env/lib/R/library")

Your environment path may be different, so be sure to check. With this setup, conda manages just about all of my major dependencies, from C libraries to Python and R packages. As an added benefit, all of this can be done on a computer where you don't have sudo/admin access. So, if you work at a university where you have to request package installation, conda can save you a ton of headaches.
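
The conda quote near the top mentions sharing an environment through its environment file. As a minimal sketch, here is what a hand-written file for the R environment from this post might look like. The scratch location and the env_file and dep_count names are mine, and a file produced by conda itself would pin many more packages and versions.

```shell
# Write a small environment file to a scratch directory
env_file="$(mktemp -d)/environment.yml"
cat > "$env_file" <<'EOF'
name: mro_env
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - mro-base
  - r-essentials
EOF

# Sanity check: count the dependency lines we expect to find
dep_count="$(grep -cE '^  - (mro-base|r-essentials)$' "$env_file")"
echo "dependencies listed: $dep_count"

# With conda installed, the same kind of file is produced and consumed by:
#   conda env export -n mro_env > environment.yml
#   conda env create -f environment.yml
```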

One last tip: you can also put your entire conda environment into a Singularity container and run it anywhere! See my post on Singularity containers.