RNA-Skim: a rapid method for RNA-Seq quantification at transcript-level

Zhaojun Zhang1 and Wei Wang2

1Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC.
2Department of Computer Science, University of California, Los Angeles, CA.

RNA-Seq has proven to be a revolutionary means of exploring the transcriptome because it provides deep coverage and base-pair-level resolution. RNA-Seq quantification is an efficient alternative to microarrays in gene expression studies and is a critical component of RNA-Seq differential expression analysis. Most existing RNA-Seq quantification tools require fragments to be aligned to either a genome or a transcriptome, which entails a time-consuming and intricate alignment step. To improve the performance of RNA-Seq quantification, an alignment-free method, Sailfish, was recently proposed; it quantifies transcript abundances using all k-mers in the transcriptome, demonstrating that an efficient alignment-free method for transcriptome quantification is feasible. Although Sailfish is substantially faster than alignment-based methods such as Cufflinks, using all k-mers in the transcriptome limits the scalability of the method. We propose a novel RNA-Seq quantification method, RNA-Skim, which partitions the transcriptome into disjoint transcript clusters based on sequence similarity and introduces the notion of sig-mers, a special type of k-mer uniquely associated with each cluster. We demonstrate that the sig-mer counts within a cluster are sufficient for estimating transcript abundances with an accuracy comparable to that of state-of-the-art methods. This enables RNA-Skim to perform transcript quantification on each cluster independently, reducing one complex optimization problem to smaller optimization tasks that can be run in parallel. As a result, RNA-Skim uses less than $4\%$ of the k-mers and less than $10\%$ of the CPU time required by Sailfish. It finishes transcriptome quantification in less than 10 minutes per sample using a single thread on a commodity computer, which represents a more than 100-fold speedup over state-of-the-art alignment-based methods, while delivering comparable or higher accuracy.
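To make the notion of sig-mers concrete, the following is a minimal sketch, not the RNA-Skim implementation. It assumes a toy transcriptome that is already partitioned into clusters, treats a sig-mer simply as a k-mer that occurs in exactly one cluster, and ignores reverse complements and any further selection among candidate sig-mers. The function names (`kmers`, `find_sig_mers`, `count_sig_mers`) and the example sequences are hypothetical and chosen for illustration only.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def find_sig_mers(clusters, k):
    """Return, for each cluster id, the k-mers found in that cluster only.

    clusters maps a cluster id to a list of transcript sequences.
    """
    owners = defaultdict(set)  # k-mer -> ids of the clusters that contain it
    for cid, transcripts in clusters.items():
        for seq in transcripts:
            for kmer in kmers(seq, k):
                owners[kmer].add(cid)
    sig = {cid: set() for cid in clusters}
    for kmer, cids in owners.items():
        if len(cids) == 1:  # unique to a single cluster
            sig[next(iter(cids))].add(kmer)
    return sig


def count_sig_mers(reads, sig, k):
    """Count sig-mer occurrences in the reads, grouped by cluster."""
    lookup = {kmer: cid for cid, group in sig.items() for kmer in group}
    counts = {cid: Counter() for cid in sig}
    for read in reads:
        for kmer in kmers(read, k):
            cid = lookup.get(kmer)
            if cid is not None:  # k-mers shared across clusters are skipped
                counts[cid][kmer] += 1
    return counts


# Toy data (assumed): two clusters that share the 4-mer "ACGT".
clusters = {
    "c1": ["ACGTAC", "ACGTAG"],
    "c2": ["TTGACGT", "TTGACC"],
}
reads = ["CGTAC", "TTGAC", "ACGTA"]

sig = find_sig_mers(clusters, k=4)
counts = count_sig_mers(reads, sig, k=4)
```

In this sketch each cluster ends up with its own table of counts, so a per-cluster abundance estimate could be computed from `counts[cid]` alone, which is the property that allows the clusters to be processed independently and in parallel.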