New, more accurate computational tool for long-read RNA sequencing – ScienceDaily

On the way from gene to protein, a nascent RNA molecule can be cut and joined, or spliced, in a number of ways before it is translated into a protein. This process, known as alternative splicing, allows a single gene to encode multiple different proteins. Alternative splicing occurs in many biological processes, such as when stem cells mature into tissue-specific cells. In the context of disease, however, alternative splicing can be dysregulated. Therefore, it is important to study the transcriptome (that is, all of the RNA molecules that can be produced from genes) to understand the root cause of a disease.

In the past, however, it was difficult to “read” RNA molecules in their entirety, since they are typically thousands of bases in length. Instead, researchers have relied on what’s known as short-read RNA sequencing, in which RNA molecules are broken up into much shorter pieces (between 200 and 600 bases, depending on the platform and protocol) and then sequenced. Computer programs are then used to reconstruct the complete sequences of the RNA molecules. Short-read RNA sequencing can provide highly accurate sequencing data, with a low per-base error rate of approximately 0.1% (meaning one base is incorrectly determined for every 1,000 bases sequenced). However, because the sequencing reads are so short, the information it can provide is limited. In many ways, short-read RNA sequencing is like breaking a large picture into many jigsaw pieces, all the same shape and size, and then trying to put the picture back together.

Recently, long-read platforms have become available that can sequence RNA molecules longer than 10,000 bases end to end. These platforms do not require RNA molecules to be fragmented before sequencing, but they have a much higher per-base error rate, typically between 5% and 20%. This well-known limitation has severely hampered the widespread adoption of long-read RNA sequencing. In particular, the high error rate has made it difficult to determine the validity of new, previously unknown RNA molecules discovered in a given condition or disease.
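To put the quoted error rates in perspective, the short Python sketch below estimates how many miscalled bases a single read is expected to contain. It is a back-of-the-envelope illustration only: it assumes that errors occur independently at each base, which is a simplification, and the function names are hypothetical rather than part of any sequencing software.

```python
def expected_errors(read_length: int, per_base_error_rate: float) -> float:
    """Expected number of miscalled bases in one read (independent-error assumption)."""
    return read_length * per_base_error_rate


def prob_error_free(read_length: int, per_base_error_rate: float) -> float:
    """Probability that a read contains no miscalled base at all (same assumption)."""
    return (1.0 - per_base_error_rate) ** read_length


if __name__ == "__main__":
    # Read lengths and error rates are the figures quoted in the article.
    scenarios = [
        ("short read", 200, 0.001),
        ("long read, 5% error", 10_000, 0.05),
        ("long read, 20% error", 10_000, 0.20),
    ]
    for label, length, rate in scenarios:
        print(
            f"{label}: about {expected_errors(length, rate):.1f} expected errors, "
            f"P(no errors) = {prob_error_free(length, rate):.3g}"
        )
```

Under these assumptions, a 200-base short read is expected to contain well under one error, while a 10,000-base long read is expected to contain hundreds, which is why the exact position of a splice site is hard to pin down from any single long read.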

To circumvent this problem, researchers at Children’s Hospital of Philadelphia (CHOP) have developed a new computational tool to more accurately discover and quantify RNA molecules from this error-prone, long-read RNA sequencing data. The tool, called ESPRESSO (Error Statistics PRomoted Evaluator of Splice Site Options), was reported today in Science Advances.

“Long-read RNA sequencing is a powerful technology that will enable us to uncover RNA variation in rare genetic diseases and other conditions such as cancer,” said Yi Xing, PhD, director of the Center for Computational and Genomic Medicine at CHOP and senior author of the study. “We are probably at a turning point in the way we discover and analyze RNA molecules. The transition from short-read to long-read RNA sequencing represents an exciting technological transformation, and computational tools that reliably interpret long-read RNA sequencing data are badly needed.”

ESPRESSO can accurately detect and quantify different RNA molecules of the same gene, known as RNA isoforms, using error-prone long-read RNA sequencing data alone. To do this, the computational tool compares all of the long RNA sequencing reads of a given gene to its corresponding genomic DNA, and then uses the error patterns of individual long reads to confidently identify splice junctions (sites where the RNA molecule was cut and joined) as well as their corresponding full-length RNA isoforms. By finding regions of perfect matches between long RNA sequencing reads and genomic DNA, as well as borrowing information across all long RNA sequencing reads of a gene, the tool is able to identify high-confidence splice junctions and RNA isoforms, including those that were not previously documented in existing databases.
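The general idea described above can be illustrated with a toy Python sketch. This is not ESPRESSO's actual algorithm or code: the data structure, function names, window size and thresholds are all hypothetical, chosen only to show how perfectly matching flanking regions might mark a junction as trustworthy and how noisier reads of the same gene might then be reconciled with it.

```python
from collections import Counter
from typing import NamedTuple, Optional


class JunctionObs(NamedTuple):
    """One splice junction as observed in one aligned long read (hypothetical format)."""
    start: int        # genomic coordinate where the intron begins
    end: int          # genomic coordinate where the intron ends
    left_flank: str   # per-base alignment result before the junction, '=' means match
    right_flank: str  # per-base alignment result after the junction, '=' means match


def is_perfect(obs: JunctionObs, window: int = 10) -> bool:
    """True if the `window` bases on both sides of the junction all match the genome."""
    left = obs.left_flank[-window:]
    right = obs.right_flank[:window]
    return len(left) == window and len(right) == window and set(left + right) == {"="}


def high_confidence_junctions(observations, window: int = 10, min_reads: int = 2) -> set:
    """Junctions supported by at least `min_reads` reads with perfect flanking matches."""
    support = Counter(
        (o.start, o.end) for o in observations if is_perfect(o, window)
    )
    return {junction for junction, n in support.items() if n >= min_reads}


def rescue(obs: JunctionObs, trusted: set, max_shift: int = 5) -> Optional[tuple]:
    """Map a noisy observation to the nearest trusted junction, or None if none is close."""
    if (obs.start, obs.end) in trusted:
        return (obs.start, obs.end)
    candidates = [
        j for j in trusted
        if abs(j[0] - obs.start) <= max_shift and abs(j[1] - obs.end) <= max_shift
    ]
    if not candidates:
        return None
    # Break distance ties by coordinate so the result is deterministic.
    return min(candidates, key=lambda j: (abs(j[0] - obs.start) + abs(j[1] - obs.end), j))


if __name__ == "__main__":
    # Made-up observations for a single gene; 'X' is a mismatch, 'D' a deletion.
    reads = [
        JunctionObs(1000, 2000, "=" * 10, "=" * 10),
        JunctionObs(1000, 2000, "=" * 10, "=" * 10),
        JunctionObs(1002, 2000, "=====X====", "==D======="),
        JunctionObs(1500, 2600, "====X=====", "=" * 10),
    ]
    trusted = high_confidence_junctions(reads)
    for read in reads:
        print((read.start, read.end), "->", rescue(read, trusted))
```

In this toy example, the two cleanly aligned reads establish one trusted junction, the third read is shifted onto it because its coordinates are within a few bases, and the fourth read is left unassigned because nothing trusted lies nearby. A real tool would also have to weigh error statistics and candidate isoforms, which this sketch does not attempt.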

The researchers evaluated ESPRESSO’s performance using simulated data and data from real biological samples. They found that ESPRESSO outperformed several currently available tools in both detecting and quantifying RNA isoforms. The researchers also generated and analyzed over 1 billion long RNA sequencing reads covering 30 human tissue types and three human cell lines, providing a useful resource for studying human transcriptome variation at the resolution of full-length RNA isoforms.

“ESPRESSO addresses a long-standing problem in long-read RNA sequencing and could open new avenues of discovery,” said Dr. Xing. “We envision that ESPRESSO will be a useful tool for researchers to explore the RNA repertoire of cells in diverse biomedical and clinical settings.”

This work was supported in part by the Immuno-Oncology Translational Network (IOTN) of the National Cancer Institute’s Cancer Moonshot Initiative (U01CA233074), other National Institutes of Health grants (R01GM088342, R01GM121827, and R56HG012310), and a National Institutes of Health T32 training grant in Computational Genomics (T32HG000046).
