
20140710 3 l_paul_ercc2.0_workshop


20140710 Lukas Paul ERCC 2.0 Workshop

Published in: Science

  1. 1. Spike-In RNA Variants: Design, Production and Application
     ERCC 2.0 Workshop, Stanford University, July 10-11, 2014
     Author: Lukas Paul
     PPT Number: TBD; Project Number: 0221; Theme: T5.2 Mixquer Transcript Quantification (WAFF)
     © Lexogen, 2013
  2. 2. ERCC 2.0 Workshop
     1. Company introduction
     2. ERCC spike-in mixes in Lexogen's R&D
     3. Design and rationale of Spike-In RNA Variants
     4. Production and application of Spike-In RNA Variants
     © Lexogen, 2014; Confidential
  3. 3. Lexogen: Company
     • Founded in 2007
     • Based in Vienna, Austria
     • 28 employees (75% in R&D)
     • Lexogen, Inc.: overnight delivery to US customers
     • Services and products with a focus on
       o Transcriptome profiling technologies
       o Complementary technologies to Next Generation Sequencing
       o Innovative solutions for transcriptome research
     Lexogen's mission is to develop innovative technologies that make it possible to resolve all complexities of the transcriptome, one of the most enigmatic and exciting areas in biology.
     www.LEXOGEN.com
  4. 4. ERCC 2.0 Workshop
     1. Company introduction
     2. ERCC spike-in mixes in Lexogen's R&D
     3. Design and rationale of Spike-In RNA Variants
     4. Production and application of Spike-In RNA Variants
  5. 5. SENSE™ mRNA-Seq Library Preparation Kit (PN0203, PPT0383)
     • Convenient, fragmentation-free workflow
     • Core technology: reverse transcription and ligation on intact RNA
     • Results in very high preservation of strand orientation
  6. 6. ERCC-based Validation of Strandedness
     • Strandedness is usually quantified by comparing the orientation of a mapped read with the genome annotation
     • Problem: the annotation is incomplete, and natural antisense transcription interferes
     Using ERCC transcripts with known orientation provides an absolute means to determine strandedness.

     Total RNA   Strand specificity (ERCCs only) [a]   False antisense reads [b]   Sense reads (genome-wide) [c]
     2 µg        99.997%                               0.003%                      99.890%
     1 µg        99.986%                               0.014%                      99.815%
     500 ng      99.997%                               0.003%                      99.821%
     50 ng       99.965%                               0.035%                      99.779%

     [a] Number of reads mapping to ERCC genes in the sense direction divided by the total number of ERCC reads.
     [b] Number of antisense reads mapping to ERCC transcripts divided by the total number of reads mapped to the ERCC genome.
     [c] Number of reads mapping to annotated genes in the sense orientation divided by the number of reads mapping in both directions. Note that this measure includes biologically relevant antisense transcription.
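
The two ERCC-based measures defined in footnotes [a] and [b] can be sketched as follows. This is a minimal illustration, not Lexogen's pipeline: the function name and the read counts are hypothetical and chosen only to show the arithmetic.

```python
def strand_specificity(sense_reads: int, antisense_reads: int) -> float:
    """Fraction of ERCC reads that map in the sense direction (footnote a)."""
    total = sense_reads + antisense_reads
    if total == 0:
        raise ValueError("no ERCC reads to evaluate")
    return sense_reads / total


# Hypothetical counts for illustration only; they are not taken from the slides.
sense, antisense = 99_997, 3
specificity = strand_specificity(sense, antisense)
false_antisense = 1.0 - specificity  # footnote b: antisense share of ERCC reads

print(f"strand specificity:    {specificity:.3%}")
print(f"false antisense reads: {false_antisense:.3%}")
```

Because the orientation of every ERCC transcript is known, any antisense ERCC read can be attributed to the library preparation rather than to biology.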
  7. 7. ERCC-validated Strandedness Determines the False-Positive Background of the Library Preparation Method
     Knowing the strandedness of the library preparation protocol makes it possible to determine whether a detected transcript is truly antisense or belongs to the false-positive background.
     [Figure: true antisense transcripts at 98% and 99.9% strandedness; values shown: 1153 and 2415]
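
One way to reason about the false-positive background is sketched below. It assumes that all reads at a locus originate from the sense transcript and that a fraction (1 - strandedness) of them is reported on the wrong strand; the `fold` threshold and all read counts are hypothetical and not taken from the slides.

```python
def expected_false_antisense(sense_reads: int, strandedness: float) -> float:
    """Antisense reads expected purely from protocol error at one locus."""
    if not 0.0 < strandedness <= 1.0:
        raise ValueError("strandedness must be in (0, 1]")
    return sense_reads * (1.0 - strandedness) / strandedness


def is_above_background(antisense_reads: int, sense_reads: int,
                        strandedness: float, fold: float = 2.0) -> bool:
    """True if observed antisense reads exceed `fold` times the expected background."""
    return antisense_reads > fold * expected_false_antisense(sense_reads, strandedness)


# Hypothetical locus: 9,800 sense reads and 50 antisense reads.
for s in (0.98, 0.999):
    print(s, is_above_background(50, 9_800, s))
```

In this toy case the same 50 antisense reads are indistinguishable from background at 98% strandedness but stand out clearly at 99.9%.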
  8. 8. "ERCC-validated" Strandedness in Lexogen's Portfolio
     • SENSE mRNA-Seq library preparation kit
     • SENSE Total RNA-Seq library preparation kit
     • QuantSeq™ 3′ mRNA library preparation kit (workflow shown on the slide); ERCCs are also used to assess the correctness of 3′ end mapping
  9. 9. Correlation Between ERCC Input and FPKM Measured
     [Figure: log-log plot of measured FPKM against the number of ERCC input molecules (×10²). SENSE: R² = 0.910; competitors: R² = 0.834]
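
An input/output correlation of this kind can be computed as the squared Pearson correlation of the log-transformed values. The sketch below assumes that the R² values were obtained on log10 scales, which the slide does not state; the data points are invented for illustration.

```python
import math


def log_log_r_squared(input_molecules, fpkm):
    """Squared Pearson correlation between log10(input) and log10(FPKM).

    Pairs with a non-positive value are skipped because they have no logarithm.
    """
    pairs = [(math.log10(x), math.log10(y))
             for x, y in zip(input_molecules, fpkm) if x > 0 and y > 0]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two positive data points")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("values must vary to compute a correlation")
    return (sxy * sxy) / (sxx * syy)


# Invented example: FPKM strictly proportional to input gives R² close to 1.
print(log_log_r_squared([1e2, 1e3, 1e4], [0.5, 5.0, 50.0]))
```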
  10. 10. Further Use for ERCC: Transcript Length Coverage
      • Native genes: interference from divergent annotations and differentially expressed transcript variants
      • Primer selectivity
      • ERCCs show seamless coverage from the first to the last nucleotide
      • Native transcripts start with high coverage, indicative of 5′-truncated annotations
      Example: SQUARE™ library prep with intrinsic over-representation of termini
      [Figure: coverage profiles for ERCC-0096 and the top 500 transcripts]
  11. 11. ERCC 2.0 Workshop
      1. Company introduction
      2. ERCC spike-in mixes in Lexogen's R&D
      3. Design and rationale of Spike-In RNA Variants
      4. Production and application of Spike-In RNA Variants
  12. 12. Spike-In RNA Variants (SIRVs): Rationale
      • ERCC spike-in controls were designed as mono-exonic RNAs without sequence overlap.
      • As a complement, we found it desirable to have a set of nucleic acids that simulate transcript variants and can be used as external spike-in controls.
      • This reference set would
        o comprise two or more transcript families, with the transcripts of a family representing reference transcript variants of the same gene,
        o enable the controlled identification and/or quantification of transcript variants in one or more samples, and
        o permit the assessment, validation and correction of bioinformatics pipelines.
  13. 13. Spike-In RNA Variants: Gene Structure
      Reference genes
      • 7 human genes selected for their diversity in exon-intron structure
      • Annotated transcripts (Ensembl database) aligned to the gene in CLC Main Workbench
      • "Master transcript" created for each gene (sequence of all transcript variants)
      [Figures: KLK5 and LDHD gene structures, CLC Main Workbench 5]
  14. 14. Addition of Transcript Variants
      • Annotated transcript variants were analyzed for alternative splicing (AS) events
      • AS events not covered by a variant within a family were incorporated into a new variant based on the master transcript
      • To cover non-splicing variants, antisense and overlapping transcripts were added (mono- and poly-exonic)
      • In addition, transcription start-site (TSS) and end-site (TES) variants were added
      [Figure: KLK5 and SIRV1]
  15. 15. Spike-In RNA Variants (SIRVs): Nucleotide Sequence
      AIM
      • The nucleotide sequences of the SIRVs should be non-homologous at least to eukaryotic genomes and transcriptomes.
      • In the best case they should not align with any naturally occurring sequence.
      SOLUTION
      • Genomic sequences from viruses were used to fill in the exon sequences.
        o This would work for external controls in eukaryotes.
      • The sequences were then inverted (flipped) to lose alignment identity.
        o The final sequences do not align with any entry in the NCBI nt collection when BLASTed with standard parameters.
        o SIRV sequences also do not align with ERCC sequences.
        o In silico experiments confirmed that NGS reads generated from the SIRVs would not map to the genome of any model organism or to the "ERCCome".
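
The inversion step can be illustrated with a short sketch. It assumes that "inverted (flipped)" means reversing the base order without complementing, which the slide does not spell out; the function names and the toy sequence are hypothetical.

```python
def invert_sequence(seq: str) -> str:
    """Reverse the base order without complementing (assumed meaning of 'inverted')."""
    return seq[::-1]


def reverse_complement(seq: str) -> str:
    """Standard reverse complement, shown only for contrast."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


toy = "GATTACAGGC"  # toy sequence, not a real SIRV or viral sequence
print(invert_sequence(toy))
print(reverse_complement(toy))
```

A plain reversal differs from the reverse complement, so under this assumption the flipped sequence would not match the source genome on either strand.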
  16. 16. Re-establishing Exon-Intron Junction Dinucleotides
      • Exon-defined intron boundaries were converted to GT-AG (97%), GC-AG (2%) or AT-AC (1%)
      • Most junctions are common, i.e. they are also annotated in the master transcript.
      • These intron sequences are currently annotated as NN (see table), so junction recognition is no problem for alignment programs

                              NN-NN          GT-AG          GC-AG        AT-AC
      SIRVs                   198 (61.11%)   116 (35.80%)   7 (2.16%)    3 (0.93%)
      SIRVs, NN-NN + GT-AG    314 (96.91%)
      ICE database                           98.70%         0.79%        0.08%

      [Figure: nucleotide conversion to conform with the GT-AG rule]
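
Classifying introns by their terminal dinucleotides, as in the table columns, could look like the sketch below. The function name and the example introns are hypothetical; only the four junction classes come from the slide.

```python
from collections import Counter

# Junction classes listed on the slide (donor dinucleotide, acceptor dinucleotide).
KNOWN_JUNCTIONS = {("NN", "NN"), ("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def classify_intron(intron_seq: str) -> str:
    """Return the junction class of an intron, e.g. 'GT-AG', or 'other'."""
    seq = intron_seq.upper().replace("U", "T")  # accept RNA notation (GU-AG)
    if len(seq) < 4:
        raise ValueError("intron too short to hold both dinucleotides")
    donor, acceptor = seq[:2], seq[-2:]
    return f"{donor}-{acceptor}" if (donor, acceptor) in KNOWN_JUNCTIONS else "other"


# Invented intron sequences for illustration.
introns = ["GTAAGTCCCAG", "gcaagtttag", "ATATCCTTAC", "GUAAGUCCAG", "CTAAAAAC"]
print(Counter(classify_intron(i) for i in introns))
```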
  17. 17. SIRV Properties: Summary
      SIRVs are modelled on mammalian sequences
      • Set of seven SIRV families with 6-18 transcript variants each
      • 74 transcript variants in total, average length 1200 nt (median 917 nt)
      • Variants include alternative splicing, start- and end-site variations, antisense and overlapping transcripts
      • GC content: 30-50% (in analogy to the ERCC standards)
      • Poly(A) tail: A(30) at the 3′ end (ERCCs: 19-25 adenosines)
      • Length: 220-2,557 nt; longer SIRVs were trimmed by exon removal
      Further modifications
      • GT-AG exon-intron junction dinucleotide rule observed
      • Homopolymer runs: ≤7 nt
      • 5′ truncation to obtain a 5′ G, needed for T7 transcription
      • No homology to NCBI nt collection entries or ERCC sequences due to sequence inversion
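
Three of the listed constraints (5′ G, GC content of 30-50%, homopolymer runs of at most 7 nt) lend themselves to a simple automated check. The sketch below is an illustration with hypothetical function names and toy sequences; it assumes the checks apply to the transcript body before the A(30) tail is appended, since the tail itself is a 30 nt homopolymer.

```python
from itertools import groupby


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    return max((len(list(run)) for _, run in groupby(seq)), default=0)


def check_design(body: str, gc_range=(0.30, 0.50), max_run=7) -> dict:
    """Check a transcript body (without poly(A) tail) against the listed constraints."""
    body = body.upper()
    return {
        "starts_with_G": body.startswith("G"),
        "gc_in_range": gc_range[0] <= gc_content(body) <= gc_range[1],
        "homopolymer_ok": longest_homopolymer(body) <= max_run,
    }


# Toy sequences for illustration; real SIRVs are 220-2,557 nt long.
print(check_design("GATTACAGTC" * 3))
print(check_design("AAAAAAAAGC"))
```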
  18. 18. SIRV Design: Overview
      1. Take the natural gene structure and annotated transcript variants
      2. Shorten transcript length to a maximum of 2500 nt
      3. Fill the gene structure with heterologous sequence
      4. Duplicate and modify to add alternative splicing variants
      5. Add transcription start-site and end-site variants
      6. Add antisense and overlapping variants
      Throughout: observe the GU-AG intron rule
      [Figure labels: cassette exon, alternative start-site, alternative end-site, alternative first exon, alternative last exon, intron retention, A5SS, A3SS, MXE, antisense, overlapping, overlapping antisense]
  19. 19. ERCC 2.0 Workshop
      1. Company introduction
      2. ERCC spike-in mixes in Lexogen's R&D
      3. Design and rationale of Spike-In RNA Variants
      4. Production and application of Spike-In RNA Variants
  20. 20. SIRV Production: In Vitro Transcription Construct
      • Synthetic constructs cloned for singularization and amplification
      • Run-off T7 transcription
      • Transcript starts with a 5′ G; cap optional
      • Poly(A) tail added
      [Figure labels, 5′ to 3′: T7 promoter, restriction site, G, sequence (220-2557 nt), A(30), restriction site]
  21. 21. SIRV Production, QC and Quantification
      Production
      • Plasmid linearization
      • T7 run-off transcription
      • Purification (essential!)
      • Storage in Na-citrate buffer
      Quality control
      • Photometric (Nanodrop): purity, quantification
      • Microfluidics (Bioanalyzer): integrity, quantification
      • Planned: qPCR for accurate quantification
  22. 22. SIRVs: Mixes & RNA-Seq Samples
      Initially, 2 mixes were prepared from 60 purified transcript variants:
      1. Equimolar: 1:1:1…
      2. Low dynamic range: 1:10:100
      3 samples were prepared from these mixes, each with Illumina TruSeq library prep without poly(A) selection:
      1. Equimolar mix, SIRVs only
      2. Equimolar mix: 30% SIRVs, 3% ERCCs, 67% UHR (Universal Human Reference RNA)
      3. Low dynamic range mix: 30% SIRVs, 3% ERCCs, 67% UHR (Universal Human Reference RNA)
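
The two mix designs can be expressed as relative molar fractions. The slide does not say how the 60 variants were distributed over the 1:10:100 levels, so the sketch below simply cycles through the levels; that assignment, the function name and the variant identifiers are assumptions made for illustration.

```python
from itertools import cycle


def relative_molar_fractions(transcript_ids, ratio_levels):
    """Assign ratio levels in a repeating pattern and normalise to molar fractions."""
    levels = cycle(ratio_levels)
    amounts = {tid: next(levels) for tid in transcript_ids}
    total = sum(amounts.values())
    return {tid: amount / total for tid, amount in amounts.items()}


# Hypothetical identifiers standing in for the 60 purified transcript variants.
ids = [f"variant_{i:02d}" for i in range(1, 61)]
equimolar = relative_molar_fractions(ids, [1])
low_range = relative_molar_fractions(ids, [1, 10, 100])

print(equimolar["variant_01"])
print(low_range["variant_01"], low_range["variant_02"], low_range["variant_03"])
```

Under this assignment each level holds 20 variants, and the most abundant variants are 100 times more concentrated than the least abundant ones.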
  23. 23. SIRVs: RNA-Seq Experiment
      • Illumina MiSeq run: 1x150 nt, 27M reads obtained
      • Mapping with TopHat (v2.0.8) against a combined transcriptomic and genomic reference (Ensembl GRCh37.75), Ambion's ERCC92, and the SIRVs

      Sample                             Total reads   Mapping reads (%)     Uniquely mapping reads (%)
      #1, equimolar SIRVs                10,246,442    8,585,641 (83.79%)    8,505,344 (83.01%)
      #2, equimolar SIRVs, ERCCs, UHR    10,119,416    8,642,852 (85.41%)    8,399,336 (83.00%)
      #3, 1:10:100 SIRVs, ERCCs, UHR      6,308,855    5,404,486 (85.67%)    5,268,757 (83.51%)

      Sample       GRCh37.75             ERCC92            SIRVs
      Sample #1    4,330 (0.05%)         11 (0.00%)        8,505,555 (99.95%)
      Sample #2    7,521,308 (89.55%)    38,031 (0.45%)    839,997 (10.00%)
      Sample #3    4,156,399 (78.89%)    22,207 (0.42%)    1,090,151 (20.69%)
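
The percentages in the first table follow directly from the read counts. The sketch below recomputes the mapping rates from the counts on the slide; the data structure and function name are illustrative choices, not part of the original analysis.

```python
# Read counts copied from the slide: (total, mapping, uniquely mapping).
SAMPLES = {
    "#1": (10_246_442, 8_585_641, 8_505_344),
    "#2": (10_119_416, 8_642_852, 8_399_336),
    "#3": (6_308_855, 5_404_486, 5_268_757),
}


def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, in percent."""
    return 100.0 * part / whole


for name, (total, mapped, unique) in SAMPLES.items():
    print(f"{name}: mapped {percent(mapped, total):.2f}%, "
          f"uniquely mapped {percent(unique, total):.2f}%")

# Total read count across the three samples, in millions.
print(f"{sum(total for total, _, _ in SAMPLES.values()) / 1e6:.1f}M reads")
```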
  24. 24. SIRV RNA-Seq: Input / Output Correlation
      [Figure: FPKM against input molecules for samples #1, #2 and #3, and FPKM of sample #1 against FPKM of sample #2]
  25. 25. SIRVs RNA-Seq: Transcript Hypotheses
      Transcript hypotheses by Cufflinks
      • Not complete: e.g., alternative 3′ splice sites (A3SS) and exons are not recognized despite multiple exon-exon reads
      [Figure: Cufflinks transcript hypotheses]
  26. 26. Spike-In RNA Variants: Short Summary
      Design & production
      • 74 transcript variants in 7 families (6-18 variants per family)
      • Mimic eukaryotic genes in length and GC content; A(30) tail
      • Include variation in alternative splicing, transcription start-sites and end-sites, sense/antisense and overlapping genes
      • No homology to NCBI nt collection entries or ERCC sequences
      • Produced from stock plasmids as T7 run-off transcripts
      Mixtures
      • 60 SIRVs were mixed in equimolar or low dynamic range (10²) concentrations
      Application in RNA-Seq
      • Mixtures showed high mappability and no cross-mapping with UHR or ERCCs
      • Low input / output correlation as determined by TopHat / Cufflinks-derived FPKM
      • Cufflinks cannot reconstruct all SIRV transcript variants, even in the equimolar mix, which leads to wrong FPKM values
  27. 27. Spike-In RNA Variants: Outlook
      Optimizing production & quantification
      • Large-scale production and purification of transcripts
      • qPCR-based quantification in addition to Nanodrop and Bioanalyzer results
      Application
      • Evaluation of software for its performance in transcript hypothesis building and transcript isoform quantification
      Open questions
      • Concentration range?
      • Sufficient variant complexity? Length? Capping? SNPs?
      • How many different mixes?
      • Pipeline validation (consortium?)
      • Sample comparison (DE)
      • Technical variation
      • Master mix vs. modules: ERCCs, SIRVs, ncRNA standards and miRNA standards (complexity, price, validation?)