Widespread Purifying Selection on RNA Structure in Mammals - Martin Smith



In the past few years it has become evident that over 85% of the human genome is processed into RNA, with less than 2% encoding proteins. The expanding compendium of non-coding RNAs identified in transcriptomic studies stands in stark contrast to how few of them have a functional annotation.
Evolutionarily conserved RNA secondary structures are a robust indicator of purifying selection and, consequently, molecular function. Evaluating their genome-wide occurrence through comparative genomics has consistently been plagued by high false-positive rates and divergent predictions.
This poster presents a novel benchmarking pipeline aimed at calibrating the precision of genome-wide scans for consensus RNA structure prediction. The benchmarking data was used to fine-tune the parameters of an optimized workflow for genomic sliding window screens.
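As a rough illustration of what a genomic sliding-window screen involves, the sketch below cuts a multiple alignment into overlapping windows that could each be passed to a structure predictor. It is not the authors' workflow: the function name `sliding_windows`, the dictionary representation of the alignment, and the 200-column window with a 100-column step are all assumptions made for this example.

```python
from typing import Dict, Iterator, Tuple


def sliding_windows(
    alignment: Dict[str, str],
    window: int = 200,   # assumed window length in alignment columns
    step: int = 100,     # assumed step between window starts
) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Yield (start_column, sub_alignment) for overlapping windows.

    `alignment` maps a species name to its gapped, aligned sequence.
    Columns after the last full step are not covered by an extra window.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()
    # An alignment shorter than the window yields one truncated window at 0.
    last_start = max(n_cols - window, 0)
    for start in range(0, last_start + 1, step):
        yield start, {
            name: seq[start:start + window] for name, seq in alignment.items()
        }


if __name__ == "__main__":
    toy = {"human": "ACGU" * 125, "mouse": "ACGA" * 125}  # 500 columns each
    for start, sub in sliding_windows(toy):
        print(start, len(sub["human"]))
```

Overlapping windows are a common choice because a structure that straddles one window boundary is still fully contained in a neighbouring window.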
When applied to consistency-based multiple genome alignments of 35 mammals, our approach confidently identifies >4 million evolutionarily constrained RNA structures using a conservative sensitivity threshold that entails historically low false discovery rates for such analyses (5–22%). These predictions comprise 13.6% of the human genome, 88% of which fall outside any known sequence-constrained element, suggesting that a large proportion of the mammalian genome is functional.
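The percentages quoted above can be combined with simple arithmetic. The short sketch below does so; the variable names are illustrative, and treating the prediction count as exactly 4 million (the text says more than 4 million) is an assumption, used only to obtain a lower bound.

```python
# Figures quoted in the abstract.
structured_fraction = 0.136   # share of the human genome covered by predictions
outside_known = 0.88          # share of predictions outside sequence-constrained elements
fdr_low, fdr_high = 0.05, 0.22

# Share of the genome covered by predictions with no known sequence constraint.
novel_fraction = structured_fraction * outside_known
# Share of the genome where predictions overlap known constrained elements.
overlap_fraction = structured_fraction * (1 - outside_known)

# Assumption: exactly 4,000,000 predictions, as a lower bound on ">4 million".
n_predictions = 4_000_000
expected_true_min = round(n_predictions * (1 - fdr_high))
expected_true_max = round(n_predictions * (1 - fdr_low))

print(f"novel structured fraction of genome: {novel_fraction:.1%}")
print(f"overlap with known constrained elements: {overlap_fraction:.1%}")
print(f"expected true predictions: {expected_true_min:,} to {expected_true_max:,}")
```

Under these figures, roughly 12% of the human genome would carry predicted structural constraint that sequence-based conservation measures do not capture.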
This work provides an extensive set of functional transcriptomic annotations that will assist researchers in uncovering the precise mechanisms underlying regulation of gene expression via ncRNAs.

Authors: Martin A. Smith, Tanja Gesel, Peter F. Stadler, John S. Mattick

Published in: Health & Medicine

  1. WIDESPREAD PURIFYING SELECTION ON RNA STRUCTURE IN MAMMALS

Martin A. Smith, Garvan Institute, Sydney, Australia [m.smith@garvan.org.au]
Tanja Gesel, Centre for Integrative Bioinformatics, Vienna, Austria
Peter F. Stadler, Interdisciplinary Centre for Bioinformatics, Leipzig, Germany
John S. Mattick, Garvan Institute, Sydney, Australia

Over 75% of the human genome is processed into RNA, with only 2% encoding proteins. Less than 10% of the genome is currently defined as evolutionarily constrained. Most identified genetic variants associated with complex diseases occur in non-coding regions of the genome that show no evidence of purifying selection. Using an optimised sliding-window approach, we report with unprecedented accuracy that a large proportion of 35 sequenced mammalian genomes harbors evolutionarily conserved RNA structure motifs. We propose that the higher-order structural components of RNA serve as a flexible and modular evolutionary platform for the diversification of genetic regulatory mechanisms, assisted by low penetrance of affected alleles and by compensatory base-pairing.

Algorithms compared:
  • SISSIz 2.0: compares a native consensus structure prediction against a background distribution of randomized alignments.
  • SISSIz 2.0 [+R]: similar to regular SISSIz, but employs a RIBOSUM substitution matrix to score compensatory mutations.
  • RNAz 2.0: employs a regression model trained on known RNA structures to classify sampled alignments as structured or non-structured.

I. Generating positive controls for algorithm benchmarking
  1. Select random RNA family
  2. Select subset of sequences randomly
  3. Emulate genomic alignment by realigning with MAFFT
  4. Use native RFAM alignments as reference
  5. Select random sub-alignment simulating a sliding window
  6. Submit to RNA structure prediction algorithms

II. Performance on benchmarking data
Data sets: partial structure alignments [RFAM] and partial sequence alignments [RFAM+MAFFT].

III. Performance on chr10

IV. Optimised genome-wide screen
A. Genomic distribution of evolutionarily conserved RNA structures: >4,000,000 high-confidence predictions covering 13.6% of the genome, shown as fold enrichment versus a uniform distribution across exonic (CDS, 5'UTR, 3'UTR, non-coding), intronic, intergenic and repeat regions.
B. Overlap of predicted 2D structures with annotated sequence-constrained elements (Gerp++, PhastCons, SyPhi-merged).

[Figures not reproduced. Recoverable axis labels: species in alignment, % false discovery rate, sensitivity, specificity, mean pairwise identity (%), G+C content (%), count (log), density, overlap between predictions, overlapping predictions (nt), average runtime for 200 nt (s). Series: SISSIz 2.0, SISSIz 2.0 [+R], RNAz 2.0, genomic background.]

Access predicted structures in UCSC Genome Browser: www.martinalexandersmith.com/ECS
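Two of the positive-control steps named on the poster (selecting a random subset of sequences and taking a random sub-alignment that simulates a sliding window) can be sketched as below, together with the mean pairwise identity that appears as an axis in the benchmarking plots. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function names are hypothetical, the realignment with MAFFT is omitted, and the identity definition used here (matches divided by columns where at least one sequence is not a gap) is one of several in common use.

```python
import random
from itertools import combinations
from typing import Dict


def subsample_sequences(alignment: Dict[str, str], n: int,
                        rng: random.Random) -> Dict[str, str]:
    """Keep n randomly chosen sequences from the alignment."""
    names = rng.sample(sorted(alignment), n)
    return {name: alignment[name] for name in names}


def random_window(alignment: Dict[str, str], window: int,
                  rng: random.Random) -> Dict[str, str]:
    """Cut one random sub-alignment of `window` columns."""
    n_cols = len(next(iter(alignment.values())))
    if window > n_cols:
        raise ValueError("window is longer than the alignment")
    start = rng.randrange(0, n_cols - window + 1)
    return {name: seq[start:start + window] for name, seq in alignment.items()}


def mean_pairwise_identity(alignment: Dict[str, str]) -> float:
    """Mean percent identity over all sequence pairs.

    Assumed definition: columns where both sequences are gaps are ignored;
    a gap opposite a nucleotide counts as a mismatch.
    """
    identities = []
    for a, b in combinations(alignment.values(), 2):
        cols = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
        if cols:
            identities.append(sum(x == y for x, y in cols) / len(cols))
    return 100.0 * sum(identities) / len(identities) if identities else 0.0


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sampling is reproducible
    family = {"s1": "ACGUACGU", "s2": "ACGAACGU", "s3": "ACGU-CGU", "s4": "ACGUACGA"}
    sample = random_window(subsample_sequences(family, 3, rng), 6, rng)
    print(sample, round(mean_pairwise_identity(sample), 1))
```

Binning sampled windows by mean pairwise identity and by number of species is one way to see how sensitivity and false discovery rate vary with alignment properties.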