Tag-based transcript sequencing: Comparison of SAGE and CAGE


Talk given at the European Meeting on Next Generation Sequencing, August 29 to September 1, 2010, at Leiden University Medical Center, The Netherlands.
Data have been published in: Hestand et al.: “Tissue-specific transcript annotation and expression profiling with complementary next-generation sequencing technologies.” Nucleic Acids Res. 2010 Sep;38(16):e165.
PMID: 20615900

Published in: Health & Medicine

  1. 1. Tag-based transcript sequencing: Comparison of SAGE and CAGE. Matthias Harbers. European Meeting on Next Generation Sequencing, August 29 to September 1, 2010, Leiden University Medical Center, The Netherlands.
  2. 2. Focusing on transcriptome analysis
     [Diagram labels: Nucleus: genomic DNA (storage of information) with promoter, “gene”, and transcript start site; transcription factors; transcription by RNA polymerase II. Processed mRNA with cap (7-methylguanosine cap) and poly(A) tail (transport of information). Cytoplasm: translation at ribosome; protein (tools to operate “functions”); non-coding RNAs (mostly regulatory functions?)]
     Transcript information is the basis for understanding genomes, proteins, and non-coding RNAs!
  3. 3. cDNA cloning and sequencing
     [Diagram: genome (exons 1a, 1b, 2, 3, 4, 5) → mRNA pool → cDNA library preparation → cDNA library → random clone picking → high-throughput sequencing → databases and clone resources]
     • Clone resources (representative clone for each gene): a great asset for the research community!
     • Functional analysis of genes
  4. 4. What did we learn from cDNA cloning projects?
     • End-sequencing of cDNA clones: 1st approach to transcript discovery
     • Great improvements in full-length cDNA cloning
     • Building of large cDNA collections (FANTOM, MGC, others…)
     • Limited by throughput of capillary sequencing (RIKEN FANTOM Pipeline: ~40,000 reads per day)
     • Limited by high cost of capillary sequencing (reagent cost alone in the US$ 1 to 1.5 range per read)
     • Hence, cDNA cloning and sequencing did not cover the entire complexity of transcriptomes
     • Other methods are needed to uncover the complexity of the transcriptome
  5. 5. Tag-based methods for high-throughput sequencing
     • Short sequences (“tags”) are sufficient for transcript identification
     • Short sequencing reads reduce cost
     • Short sequencing reads increase throughput
     • Protocol should provide 1 tag per transcript
     • Digital expression profiling by counting “tags”
     • Unbiased transcript discovery
     • Transcript annotation using reference data
     • Serial Analysis of Gene Expression (SAGE) (Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. (1995). "Serial analysis of gene expression". Science 270 (5235): 484–7.)
  6. 6. Preparation of SAGE libraries
     mRNA pool: full-length mRNA (yes), truncated mRNA (yes), non-polyadenylated mRNA (No!)
     • cDNA synthesis
     • Attach cDNA to surface
     • Cut with anchoring enzyme (frequent cutters, commonly NlaIII)
     • Linker ligation to open NlaIII site (linkers A and B; CATG/GTAC ends)
     • Cut with tagging enzyme (LongSAGE: MmeI, SuperSAGE: EcoP15I)
     • Release from surface
     • Ligation to form “ditag”
     • PCR amplification
     • Cut with anchoring enzyme
     • Concatenation/cloning, then sequencing of concatemers by capillary sequencing
     • Digital Gene Expression (DGE): new protocols for direct sequencing on high-speed sequencers
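
The tag-generation step on this slide can be mimicked in silico. The following is a minimal sketch, not the protocol used in the talk: it assumes the cDNA is held at its 3' end, so the tag is taken directly downstream of the 3'-most NlaIII site (CATG). The function name `extract_sage_tag`, the default tag length of 17 bases, and the toy sequence are illustrative assumptions; the real tag length depends on the tagging enzyme.

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # NlaIII recognition sequence (anchoring enzyme)


def extract_sage_tag(cdna, tag_length=17):
    """Return the bases directly 3' of the 3'-most anchoring site.

    tag_length is an assumed value for illustration only.
    Returns None if there is no site or the remaining sequence is too short.
    """
    seq = cdna.upper()
    pos = seq.rfind(ANCHOR_SITE)  # 3'-most CATG
    if pos == -1:
        return None
    start = pos + len(ANCHOR_SITE)
    tag = seq[start:start + tag_length]
    return tag if len(tag) == tag_length else None


def count_tags(cdnas, tag_length=17):
    """Digital expression profile: number of occurrences of each tag."""
    tags = (extract_sage_tag(c, tag_length) for c in cdnas)
    return Counter(t for t in tags if t is not None)


# Toy transcript with two CATG sites; only the 3'-most one yields the tag.
toy_cdna = ("GG" + "CATG" + "AAACCCGGGTTTAAACCCGGGTTT"
            + "CATG" + "ACGTACGTACGTACGTA" + "AAAAAA")
toy_tag = extract_sage_tag(toy_cdna)
toy_counts = count_tags([toy_cdna, toy_cdna, "AAAAAAAA"])
```

Counting identical tags across a library is what turns the tag sequences into a digital expression profile; transcripts without a CATG site produce no tag at all in this sketch.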
  7. 7. Why Cap Analysis Gene Expression (CAGE)?
     • Sequencing of 5’-ends allows discovery of Transcription Start Sites
     • 5’-end sequencing allows transcript identification
     • 5’-end sequencing allows promoter identification
     • 5’-end sequencing allows monitoring of non-polyadenylated mRNAs
     • Cap-Trapper method is very effective for 5’-end selection
     • Cap-Trapper allows library preparation directly from total RNA
     • Shift from “3’-end information” to “5’-end information”
     • Cap Analysis Gene Expression (CAGE) (Shiraki T et al. Proc Natl Acad Sci U S A. 2003 Dec 23;100(26):15776-81. Epub 2003 Dec 8)
  8. 8. Flow of CAGE projects using high-speed sequencing
     • Starting from total RNA. mRNA pool: full-length mRNA (yes), non-polyadenylated mRNA (yes), truncated mRNA (No!)
     • Random priming
     • Biotinylation of cap (cap structure at 5’ end of mRNA)
     • RNase treatment
     • Selection via biotin and streptavidin
     • CAGE library preparation: EcoP15I site, 27 bp CAGE tag, use of barcodes
     • Illumina GAIIx sequencer (36 bp reads)
     • Data processing: Illumina sequences mapped to genome
     • CAGE clusters (genome annotation): broad and narrow
     • Data visualization
     • Gene network models (genome-wide view on promoter activities; transcription factors (TF) at promoters)
  9. 9. Comparison of SAGE and CAGE data
     • Directly compare SAGE (DGE) and CAGE from the same samples
     • Use of proliferating and differentiated C2C12 myoblasts as a model
       [Images: proliferating C2C12 cells; differentiated C2C12 cells, fused into myotubes (pictures provided by Willem Hoogaars)]
     • Use of biological triplicates
     • Use of Illumina Genome Analyzer for high-speed sequencing
     • Joint work of Leiden University, Genomatix, ServiceXS, and DNAFORM (Hestand MS et al., Nucleic Acids Res. 2010 Sep;38(16):e165)
  10. 10. Flow of SAGE and CAGE data analysis
      • Samples: CAGE Prolif1-3, CAGE Diff1-3, SAGE Prolif1-3, SAGE Diff1-3
      • Illumina sequencing (1 channel/sample, 2 technical replicates) and data processing
      • Tag processing: CAGE: remove 1 base; SAGE: add CATG
      • Mapping: CAGE: 2 mismatches allowed; SAGE: 1 mismatch allowed
      • Mapped regions: CAGE: 742,355 regions; SAGE: 361,655 regions
      • Threshold set to > 2 TPM: CAGE: 41,862 regions; SAGE: 43,512 regions
      • ElDorado mouse genome annotation of the 41,862 CAGE regions: 9,957 annotated exons; 27,190 partially annotated exons; 2,368 annotated introns; 2,347 intergenic regions
      • CAGE regions matching: annotated TSS: 13,541 (32%); annotated promoter regions: 6,331 (15%); annotated 3’-end of transcripts: 8,028 (19%); FANTOM 3 CAGE data set: 31,680 (76%)
      • Assigning regions to genes (1,000 bp window): CAGE: 10,409 genes; SAGE: 10,987 genes
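
The thresholding step (> 2 TPM) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used for the talk: TPM is taken to mean tags per million mapped tags in one library, the function names are hypothetical, and the region counts are made-up toy data.

```python
def tags_per_million(counts):
    """Normalize raw tag counts per region to tags per million (TPM).

    Assumption: the denominator is the sum of all mapped tags in the library.
    """
    total = sum(counts.values())
    return {region: 1e6 * n / total for region, n in counts.items()}


def filter_regions(counts, min_tpm=2.0):
    """Keep only regions strictly above the TPM threshold."""
    tpm = tags_per_million(counts)
    return {region: value for region, value in tpm.items() if value > min_tpm}


# Toy library with exactly one million mapped tags (made-up numbers).
toy_counts = {
    "region_a": 500_000,
    "region_b": 499_997,
    "region_c": 2,  # exactly 2 TPM: not above the threshold
    "region_d": 1,
}
kept = filter_regions(toy_counts)
```

Because the comparison is strict, a region at exactly 2 TPM is dropped, which matches the "> 2 TPM" wording on the slide.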
  11. 11. SAGE and CAGE data sets

      | Sample | No. seq reads | No. aligned | % aligned |
      |---|---|---|---|
      | CAGE: Prolif-1 | 4,886,341 | 2,086,233 | 42.7 |
      | CAGE: Prolif-1 (re-sequenced) | 3,933,233 | 1,770,247 | 45.0 |
      | CAGE: Prolif-2 | 5,003,964 | 2,421,443 | 48.4 |
      | CAGE: Prolif-3 | 4,734,605 | 2,062,081 | 43.6 |
      | CAGE: Diff-1 | 4,525,321 | 1,679,081 | 37.1 |
      | CAGE: Diff-1 (re-sequenced) | 3,101,153 | 1,252,451 | 40.4 |
      | CAGE: Diff-2 | 5,060,041 | 2,195,263 | 43.4 |
      | CAGE: Diff-3 | 4,830,194 | 1,578,087 | 32.7 |
      | SAGE: Prolif-1 | 5,941,753 | 3,351,426 | 56.4 |
      | SAGE: Prolif-2 | 7,768,787 | 4,464,057 | 57.5 |
      | SAGE: Prolif-3 | 6,723,476 | 3,878,953 | 57.7 |
      | SAGE: Diff-1 | 9,467,926 | 5,811,947 | 61.4 |
      | SAGE: Diff-2 | 7,269,002 | 4,618,715 | 63.5 |
      | SAGE: Diff-3 | 4,392,416 | 2,494,618 | 56.8 |
      | CAGE average | 4.5 mil reads | 1.9 mil reads | 42% aligned |
      | SAGE average | 6.9 mil reads | 4.1 mil reads | 59% aligned |
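
The averages in the last two rows can be recomputed from the per-sample numbers on the slide. The sketch below does only that arithmetic; the function name `summarize` and the choice to compute the percentage from pooled totals are assumptions made for illustration.

```python
# (method, sample, sequenced reads, aligned reads), copied from the slide
SAMPLES = [
    ("CAGE", "Prolif-1", 4_886_341, 2_086_233),
    ("CAGE", "Prolif-1 (re-sequenced)", 3_933_233, 1_770_247),
    ("CAGE", "Prolif-2", 5_003_964, 2_421_443),
    ("CAGE", "Prolif-3", 4_734_605, 2_062_081),
    ("CAGE", "Diff-1", 4_525_321, 1_679_081),
    ("CAGE", "Diff-1 (re-sequenced)", 3_101_153, 1_252_451),
    ("CAGE", "Diff-2", 5_060_041, 2_195_263),
    ("CAGE", "Diff-3", 4_830_194, 1_578_087),
    ("SAGE", "Prolif-1", 5_941_753, 3_351_426),
    ("SAGE", "Prolif-2", 7_768_787, 4_464_057),
    ("SAGE", "Prolif-3", 6_723_476, 3_878_953),
    ("SAGE", "Diff-1", 9_467_926, 5_811_947),
    ("SAGE", "Diff-2", 7_269_002, 4_618_715),
    ("SAGE", "Diff-3", 4_392_416, 2_494_618),
]


def summarize(method):
    """Mean reads, mean aligned reads, and pooled percent aligned for a method."""
    rows = [(reads, aligned) for m, _, reads, aligned in SAMPLES if m == method]
    total_reads = sum(reads for reads, _ in rows)
    total_aligned = sum(aligned for _, aligned in rows)
    return {
        "mean_reads": total_reads / len(rows),
        "mean_aligned": total_aligned / len(rows),
        "percent_aligned": 100 * total_aligned / total_reads,
    }


cage = summarize("CAGE")
sage = summarize("SAGE")
```

Rounded to the precision used on the slide, `cage` and `sage` reproduce the reported averages (4.5 and 1.9 million reads at 42% for CAGE; 6.9 and 4.1 million reads at 59% for SAGE).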
  12. 12. MyoD (myogenic marker): Viewed in UCSC Genome Browser
      [Genome browser view with annotations: narrow CAGE peak (MyoD promoter has TATA box); “exon painting”; transcriptional activity at 3’-end; SAGE peak]
  13. 13. Reproducibility of SAGE and CAGE data
      [Scatter plots. Panel labels: sequencing replica, biological replica, differential expression; one panel is marked “(CAGE only)”. CAGE row: P=0.981, P=0.963, P=0.771. SAGE row: P=0.930, P=0.839]
  14. 14. Correlation of SAGE and CAGE data
      CAGE: 10,409 genes; SAGE: 10,987 genes
      [Venn diagrams, CAGE versus SAGE]
      • Overlap of all detectable genes: 1,169 CAGE only; 9,240 shared; 1,747 SAGE only
      • Overlap of differentially expressed genes: 2,160 CAGE only; 2,144 shared; 1,702 SAGE only
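
The Venn diagram numbers can be checked against the gene totals with a few lines of arithmetic. In this sketch the reading order of the three numbers (CAGE only, shared, SAGE only) is an assumption; for all detectable genes it is supported by the totals of 10,409 and 10,987 genes, while for the differentially expressed genes it is inferred from the layout alone. The function name `venn_summary` and the Jaccard index are added for illustration and do not appear in the talk.

```python
def venn_summary(only_cage, shared, only_sage):
    """Totals and overlap fractions for a two-set Venn diagram."""
    total_cage = only_cage + shared
    total_sage = shared + only_sage
    union = only_cage + shared + only_sage
    return {
        "total_cage": total_cage,
        "total_sage": total_sage,
        "union": union,
        "jaccard": shared / union,  # shared genes / genes seen by either method
        "cage_fraction_shared": shared / total_cage,
        "sage_fraction_shared": shared / total_sage,
    }


all_genes = venn_summary(1169, 9240, 1747)
diff_genes = venn_summary(2160, 2144, 1702)  # assignment of outer numbers assumed
```

Under these assumptions, roughly three quarters of all genes seen by either method are seen by both, while the overlap among differentially expressed genes is considerably smaller.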
  15. 15. Top 30 genes from SAGE and CAGE expression data

      | CAGE gene | Ratio | Microarray | SAGE gene | Ratio | Microarray |
      |---|---|---|---|---|---|
      | Hfe2 | 4,073 | NA | RP23-36P22.5 | 576 | NA |
      | Myom3 | 1,624 | NA | Neb | 525 | NA |
      | Lmod2 | 1,305 | NA | Mylpf | 504 | Yes |
      | Myh7 | 1,124 | Yes | Ttn | 380 | NA |
      | Mb | 908 | Yes | Myh3 | 368 | Yes |
      | RP23-36P22.5 | 735 | NA | Xirp1 | 306 | Yes |
      | Pygm | 717 | Yes | 1110002H13Rik | 263 | NA |
      | Myl4 | 614 | Yes | Tnnc1 | 232 | Yes |
      | Synpo2l | 595 | NA | Cav3 | 150 | Yes |
      | Myh1 | 561 | Yes | Cbfa2t3 | 133 | Yes |
      | …… | | | …… | | |

      CAGE: 13 out of 30 not found by microarray. SAGE: 10 out of 30 not found by microarray.
      Microarray data from the same cell line published by: Tomczak KK et al. FASEB J. 2004 Feb;18(2):403-5. Epub 2003 Dec 19. (Affymetrix mouse MG_U74Av2 and MG_U74Cv2 oligonucleotide-based GeneChips)
  16. 16. GO terms found in SAGE, CAGE and microarray data

      | CAGE GO | SAGE GO | Microarray GO |
      |---|---|---|
      | Regulation of striated muscle contraction | Regulation of muscle contraction | Cyclin-dependent protein kinase inhibitor activity |
      | Cardiac muscle contraction | Cardiac muscle contraction | Myogenesis |
      | Myogenesis | Myogenesis | Skeletal muscle development |
      | Regulation of muscle contraction | Regulation of striated muscle contraction | Myoblast differentiation |
      | Skeletal muscle development | Skeletal muscle development | 6-phosphofructokinase activity |
      | Muscle development | Myofibril assembly | Muscle development |
      | Striated muscle contraction | Muscle development | Muscle cell differentiation |
      | Myoblast differentiation | Myoblast fusion | Tumor suppressor activity |
      | Muscle cell differentiation | Striated muscle contraction | Myofibril assembly |
      | Sarcomere organization | Muscle cell differentiation | Heart development |
      | 10/10 muscle related | 10/10 muscle related | 7/10 muscle related |
  17. 17. Myl1 CAGE region as example for new TSS discovery
      [Genome browser view of the Myosin light chain 1 (Myl1) promoter. Tracks: CAGE Diff1, CAGE Prolif1; species labels: mouse, human, horse]
  18. 18. Differentially regulated TSS found in CAGE regions
      • Found 196 new differentially regulated TSS in CAGE data
      • Of these, 111 regions are upstream of known genes
      • Of these, 85 regions are downstream of known genes
      [Figure: 8 regions tested (lower Cp value = higher expression); results: + - + + + + + +]
      • 7 out of 8 regions tested could be confirmed by RT-PCR
  19. 19. Discovery of novel TSS by CAGE
  20. 20. “Exon-painting” found in CAGE libraries
      • CAGE tags found at some frequency in exons
      • Exon-painting is reproducible and gene specific
      • New re-capping activity suggested
      [Figure: examples Col1a1 and Col1a2]
  21. 21. New developments for CAGE method
      • nanoCAGE: Preparation of CAGE libraries starting from as little as 50 ng total RNA (Plessy C et al. Nat Methods. 2010 Jul;7(7):528-34. Epub 2010 Jun 13.)
      • CAGE-Scan: Use of paired-end sequencing to link new TSS to known genes (Plessy C et al. Nat Methods. 2010 Jul;7(7):528-34. Epub 2010 Jun 13.)
      • Helicos-CAGE: Use of single-molecule sequencing to reduce bias in CAGE libraries and reduce sample requirements
      • Link high-speed sequencing to cDNA cloning: Creating the resources needed to study newly discovered transcripts!
  22. 22. Summary
      • SAGE and CAGE both provide highly reproducible data sets
      • SAGE and CAGE data show great overlap in the genes covered
      • Both methods show better coverage than microarray data
      • CAGE data are more complex than SAGE data: 67% of genes have multiple CAGE regions
      • CAGE data allowed for discovery of new TSS
      • CAGE data indicate transcriptional activity at 3’-ends of annotated genes
      • CAGE data showed some exon-painting for many transcripts
  23. 23. Acknowledgements
      • Leiden University Medical Center: Matthew S. Hestand, Yavuz Ariyurek, Yolande Ramos, Gert-Jan B. van Ommen, Johan T. den Dunnen, Peter A.C. ‘t Hoen
      • Genomatix: Andreas Klingenhoff, Matthias Scherf, Thomas Werner
      • DNAFORM: Makoto Suzuki
      • ServiceXS: Wilbert van Workum
      • Omics Science Center, RIKEN Yokohama: Piero Carninci, Charles Plessy, Yoshihide Hayashizaki
  24. 24. Thank you for your interest!