How-to: Learn Bioinformatics


Throughout my time in bioinformatics, I've found that many people are interested in learning it but don't know how to get started.

First, you need to know a few things.

  1. You can learn a lot on your own. Just like you can learn a lot about frogs and salamanders without having a degree in herpetology, you can pick up the basics of bioinformatics if you're just willing to sit down and do it.

  2. You need to be willing to learn programming (probably in several languages, but start with one). You don't have to be an expert programmer, but you should be relatively comfortable in at least one language.

  3. A basic understanding of genetics is required. If you haven't taken a genetics class, there are plenty of videos on YouTube.

  4. If you really want to excel in bioinformatics, it's a good idea to work on your mathematics (like linear algebra) and statistics.

  5. Bioinformatics is just like every other field of research. I point this out because just like other fields, it takes years of experience to be an expert. Bioinformatics is much more than just plugging the appropriate files into a program and getting results.

Okay, I'm going to assume that you have basically zero experience. Your first step should be to become extremely familiar with the terminal (on Unix, Linux, or a Mac). If you don't own a computer with one of these operating systems, download a virtual machine monitor (like VirtualBox) and one of the many Linux ISOs (like Ubuntu).

I recommend you try to learn how to do everything from the terminal. Need to make a file? Do it in the terminal. Look through folders? Terminal. Kill a frozen program? You guessed it, terminal. Keyboard shortcuts are also nice to learn, but those can come later.

Since you're going to be spending a lot of time in the terminal, you're going to need a way to write and edit files. There are plenty of good text editors out there with many reasons to prefer one over another. I'm going to leave the decision up to you, but here are some options.

  • nano - an easy-to-use text editor with simple functions (used to be my preferred editor)
  • vim - a text editor (and more) with many functions, configurations, and potential add-ons. Steep learning curve but comes with a tutorial (vimtutor). This is my new preferred editor.
  • emacs - According to their webpage, "At its core is an interpreter for Emacs Lisp, a dialect of the Lisp programming language with extensions to support text editing." Like vim, it's a very powerful text editor, but it's also more than just an editor.

If you're going to take learning bioinformatics seriously, you should learn either vim or emacs (eventually). That being said, nano is a good place to start.

Next, you'll need a good programming language. Historically, Perl has been a popular choice for bioinformatics. More recently, Python has become a primary contender. In certain circles, R is the primary choice. C and C++ are good choices if you're planning on a strong programming focus. I've heard of a few other labs that prefer Java, Ruby, SAS, AWK, or even Lisp. My recommendation is that you choose a language used by others in your primary area of interest. If you don't know what that is, a well-established language is probably your best choice. The well-supported BioPerl, Biopython, and BioJava packages give a great starting point for any bioinformatics enthusiast (there's also BioRuby, but I've never used it nor heard of anyone who does). As such, I would prioritize one of those languages or R.

If you choose Python, which I highly recommend, installing the Anaconda distribution is probably the easiest way to go. It saves you from manually installing a lot of packages (pre-written libraries of code that ease your workload).

After that, you can go to the terminal and use Python interactively:

ipython

Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: 

This environment is a REPL (read-eval-print loop), which lets you see the output of a command immediately. This sort of environment is a great way to learn, and it is also available in other languages (in Java, it's jshell).

jshell

|  Welcome to JShell -- Version 10.0.2
|  For an introduction type: /help intro

jshell> 
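
To make the REPL idea concrete, here is a minimal sketch of the kind of thing you might type at the Python prompt, one line at a time, checking the result after each step. The sequence is made-up example data and the variable names are just for illustration.

```python
# A short DNA string to experiment with (made-up example data).
seq = "ATGCGCGATTAGC"

# Count how many times each nucleotide appears.
counts = {base: seq.count(base) for base in "ACGT"}
print(counts)

# GC content: the fraction of bases that are G or C.
gc_content = (counts["G"] + counts["C"]) / len(seq)
print(round(gc_content, 3))
```

In an interactive session you don't even need the print calls; typing a name like counts on its own line shows its value right away, which is what makes this style of learning so quick.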

Now, you need some problems to practice. And when I say practice, I mean actually do them. Like the vimtutor says:

It is important to remember that this tutor is set up to teach by use. That means that you need to execute the commands to learn them properly. If you only read the text, you will forget the commands!

This statement holds true for all of programming and bioinformatics. Use it or lose it!
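
In that spirit, here is a small warm-up you could type out yourself rather than just read: a sketch of a function that returns the reverse complement of a DNA string. The function name and the test sequence are my own choices for illustration, and it assumes the input only contains A, C, G, and T.

```python
def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string (A<->T, C<->G)."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    # Walk the sequence backwards and swap each base for its partner.
    # Any character other than A, C, G, or T raises a KeyError.
    return "".join(complement[base] for base in reversed(dna.upper()))


print(reverse_complement("AAAACCCGGT"))  # prints ACCGGGTTTT
```

Once it works, try changing it: handle unexpected characters gracefully, or read the sequence from a file instead of hard-coding it.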

I highly recommend visiting Rosalind. While the site assumes you'll be using Python, you can solve the problems however you like. Once you begin to feel confident in your abilities, you may even attempt to reproduce results from publications. Good luck!