How-to: Use Bioinformatics Libraries in JShell


I'm a big fan of Java's new read-evaluate-print-loop (REPL) environment, JShell. For beginners in Java, JShell provides a fantastic environment to troubleshoot issues and get comfortable with Java/new packages. As this is a bioinformatics blog, I will be demonstrating how to read FASTA files using the htsjdk (High-throughput sequencing Java Development Kit) library.

First, we have to import all of the appropriate package information. Here, I've downloaded and compiled the library using my IntelliJ IDE (integrated development environment). So, I'm exporting the CLASSPATH to the Java bytecode and designating the CLASSPATH to JShell in bash.

export CLASSPATH="/home/lboat/IdeaProjects/htsjdk/build/classes/java/main/"
jshell -c /home/lboat/IdeaProjects/htsjdk/build/classes/java/main/

Now that JShell is open, we can import all of the necessary functions.

import htsjdk.samtools.reference.ReferenceSequenceFile
import htsjdk.samtools.reference.ReferenceSequenceFileFactory
import htsjdk.samtools.reference.FastaSequenceIndexCreator
import htsjdk.samtools.reference.FastaSequenceIndex
import java.nio.file.Paths
import java.nio.file.Path

Now we get the path to the FASTA file and generate FASTA object using a FactoryBuilder (A specialized Java class for generating objects).

// Get the path to our file
Path path = Paths.get("big.fasta")

// Generate a FASTA object
ReferenceSequenceFile fasta = ReferenceSequenceFileFactory.getReferenceSequenceFile(path)

// Index our FASTA file
FastaSequenceIndex fsi = FastaSequenceIndexCreator.buildFromFasta(path)

The FASTA file is now indexed and we can readily check some specific metrics for this FASTA.

// Determine the number of sequences in our file
fsi.size()

// Get sequence-specific information
fsi.getIndexEntry("sequence_5")
   // contig sequence_5; location 5152; size 999; basesPerLine 60; bytesPerLine 61

// Notice, our FASTA isn't recognized as indexed
fasta.isIndexed()
   // false

// Write our index 
fsi.write(Paths.get("big.fa.fai"))

// Re-open the FASTA using this index
fasta = ReferenceSequenceFileFactory.getReferenceSequenceFile(path)

// Notice, our FASTA is now indexed
fasta.isIndexed()
  //  true

// You may want to check the first sequnce
ReferenceSequence rs = fasta.nextSequence()

// Look at its name
rs.getName()

// Look at the sequence
rs.getBaseString()