How-to: Process a FASTQ file in parallel


In this post, I will be demonstrating how to read FASTQ files using the htsjdk java library and then process those files using functions that allow for parallel processing.

I'll be using JShell as described in a previous post, and to begin, we'll designate the class path to the HTSJDK library.

export CLASSPATH="/home/lboat/IdeaProjects/htsjdk/build/classes/java/main/"
jshell 

Now that JShell is open, we can import all of the necessary functions.

import java.io.File;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;
import htsjdk.samtools.fastq.FastqRecord
import htsjdk.samtools.fastq.FastqReader
File file = new File("R1.fastq.gz")
FileInputStream fis = new FileInputStream(file)
GZIPInputStream gis = new GZIPInputStream(fis)
InputStreamReader isr = new InputStreamReader(gis)
BufferedReader br = new BufferedReader(isr)
// I'm adding file here (which is optional) so we can see the file name
FastqReader fr = new FastqReader(file,br)
FastqRecord record = fr.next()
record.getReadLength()
record.getBaseQualities()
Spliterator<FastqRecord> spliterator = fr.spliterator()
// Note: You can only iterate over a stream once, 
//       so typically you won't save the stream to a variable
Stream<FastqRecord> stream = StreamSupport.stream(spliterator, true)
stream.isParallel()
// Just want to print the 
//stream.forEach(n -> System.out.println(n.getBaseQualities()))
Map<String, Long> map = stream.collect(groupingBy(FastqRecord::getReadString, counting()));
// Now, we have all of our reads counted

Now we get the path to the FASTA file and generate FASTA object using a FactoryBuilder (A specialized Java class for generating objects).

// Get the path to our file
Path path = Paths.get("big.fasta")

// Generate a FASTA object
ReferenceSequenceFile fasta = ReferenceSequenceFileFactory.getReferenceSequenceFile(path)

// Index our FASTA file
FastaSequenceIndex fsi = FastaSequenceIndexCreator.buildFromFasta(path)

The FASTA file is now indexed and we can readily check some specific metrics for this FASTA.

// Determine the number of sequences in our file
fsi.size()

// Get sequence-specific information
fsi.getIndexEntry("sequence_5")
   # contig sequence_5; location 5152; size 999; basesPerLine 60; bytesPerLine 61

// Notice, our FASTA isn't recognized as indexed
fasta.isIndexed()
   # false

// Write our index 
fsi.write(Paths.get("big.fa.fai"))

// Re-open the FASTA using this index
fasta = ReferenceSequenceFileFactory.getReferenceSequenceFile(path)

// Notice, our FASTA is now indexed
fasta.isIndexed()
    true

// You may want to check the first sequnce
ReferenceSequence rs = fasta.nextSequence()

// Look at its name
rs.getName()

// Look at the sequence
rs.getBaseString()