About ten years ago, when RNA-Seq was young, we struggled to make sense of the huge quantity of data that came out of Next-Generation Sequencers. The RNA-Seq pipelines were founded on the simple scheme:
Reads -> Alignments -> Quantification
The most popular RNA-Seq alignment tool, Tophat (now Tophat2) was actually built on the Bowtie aligner to focus on transcribed genomic regions (the Transcriptome), with the optional feature of aligning reads in the whole Genome, for de-novo transcript discovery.
These alignments are stored in a wonderful (no joke) file format called SAM (or the compressed counterpart, BAM). Then, the most common Quantification pipeline has been to convert Tophat alignments into discrete gene (or transcript) counts with tools like RSEM, the R/Bioconductor RSubread or my favourite, HTSeq.
Recently however, a class of tools emerged for quantifying transcripts without the need for BAM alignments (credit to Dr. Marco Russo for pointing them out!):
Reads -> Quantification
Bypassing the alignments step has obvious advantages in terms of speed and resources (memory) required. There are three tools available to do so: kallisto, Sailfish and Salmon. A recent BMC Genomics paper by Zhang et al., has collected all quantifiers (both “old-school” and new) and compared them:
(Left) Figure 1 from Evaluation and comparison of computational tools for RNA-seq isoform quantification by Zhang et al., BMC Genomics 2018.
The paper tests several features for these quantifiers, but I personally wanted to see if the output from a Tophat/HTSeq pipeline could be compared with Salmon. I therefore ran two parallel quantification steps on 10 RNASeq samples from Arabidopsis thaliana (genome version TAIR10, transcript annotation Araport11).
The pseudocode is simple for both pipelines. For htseq, I forced to count the same reads multiple times if they mapped different overlapping transcripts (i.e. 99.9% of the times, splice variants of the same gene):
reference=arabidopsis.tair10.reference annotation=arabidopsis.araport11.gff fq=sample.fastq tophat2 -o counts $reference $fq htseq-count --nonunique all -f bam -s no -t mRNA -i ID counts/accepted_hits.bam $annotation > htseq.counts
And then Salmon, with default parameters (kmer length 31nt, quasi-mapping-based mode). Instead of a genome and an annotation, Salmon requires transcript sequences:
index=araport.cdna.index transcripts=Araport11_genes.201606.cdna.fasta salmon index -t $transcripts -i $index --type quasi -k 31 fq=sample.fastq salmon quant -i $index -l A -r $fq -o salmon.counts
The whole Salmon pipeline is >10 times faster than the Tophat/HTseq one. But how do they compare? For a single sample, these are the quantities for each Arabidopsis mRNA computed by the two (integer counts for HTSeq and “NumReads” quants for Salmon):
Overall, the gene-wise correlation between quantities computed by HTSeq and Salmon in the whole dataset (only 10 samples, but hey, budget was low).
So, these plots tell us three things
- The overall quantification process is similar, with a Spearman correlation within sample of 0.7
- The quantities inferred by Salmon are ten times bigger than those inferred by HTSeq, due most likely to tuning inferred by the Quasi-mapping mode, GC-calibration, etc.)
- For coexpression networks, things won’t change very much, however for some genes there is an inverted gene expression profile correlation for the same transcript between the two pipelines.
So: I will probably use Salmon alongside HTSeq for the nearby future.