Talk given at the European Meeting on Next Generation Sequencing, August 29 to September 1, 2010, at Leiden University Medical Center, The Netherlands.
Data have been published in: Hestand et al.: “Tissue-specific transcript annotation and expression profiling with complementary next-generation sequencing technologies.” Nucleic Acids Res. 2010 Sep;38(16):e165.
PMID: 20615900
Tag-based transcript sequencing: Comparison of SAGE and CAGE
1. Tag-based transcript sequencing:
Comparison of SAGE and CAGE
Matthias Harbers
European Meeting on
Next Generation Sequencing
August 29 to September 1, 2010
Leiden University Medical Center, The Netherlands
Matthias Harbers 1
2. Focusing on transcriptome analysis:
[Diagram: flow of genetic information]
Nucleus: Genomic DNA (storage of information) with promoter, transcript start site, and “gene”
Transcription factors; transcription by RNA polymerase II
Processed mRNA with cap (7-methylguanosine cap) and AAAAA tail (transport of information)
Cytoplasm: Translation at ribosome
Protein (tools to operate “functions”)
Non-coding RNAs (mostly regulatory functions?)
Transcript information is the basis to understanding genomes, proteins, and non-coding RNAs!
3. cDNA cloning and sequencing:
[Diagram: genome (exons 1a, 1b, 2, 3, 4, 5) and its mRNAs; mRNA pool, cDNA library preparation, cDNA library, random clone picking, high-throughput sequencing, databases and clone resources]
Clone resources (representative clone for each gene): Great asset for research community!
Functional analysis of genes.
4. What did we learn from cDNA cloning projects?
End-sequencing of cDNA clones: 1st approach to transcript discovery
Great improvements in full-length cDNA cloning
Building of large cDNA collections (FANTOM, MGC, others…)
Limited by throughput of capillary sequencing
(RIKEN FANTOM Pipeline: ~40,000 reads per day)
Limited by high cost of capillary sequencing
(Reagent cost alone: US$ 1 to 1.5 per read)
Hence, cDNA cloning and sequencing did not cover the entire complexity of transcriptomes
Other methods needed to uncover complexity of transcriptome
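The throughput and cost figures quoted on this slide can be combined into a rough daily reagent budget. The sketch below only multiplies the two numbers given above; the function name is a hypothetical helper, and labor, instruments, and library preparation are not included.

```python
def daily_reagent_cost(reads_per_day, cost_per_read_usd):
    """Reagent cost in US$ for one day of capillary sequencing."""
    return reads_per_day * cost_per_read_usd

# Figures from the slide: ~40,000 reads per day, US$ 1 to 1.5 per read.
READS_PER_DAY = 40_000
low = daily_reagent_cost(READS_PER_DAY, 1.0)
high = daily_reagent_cost(READS_PER_DAY, 1.5)
print(f"Reagents: US$ {low:,.0f} to {high:,.0f} per day")
```

At that rate the reagent bill alone is in the range of US$ 40,000 to 60,000 per day, which illustrates why deeper sampling by capillary sequencing was not practical.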
5. Tag-based methods for high-throughput sequencing
Short sequences (“tags”) are sufficient for transcript identification
Short sequencing reads reduce cost
Short sequencing reads increase throughput
Protocol should provide 1 tag per transcript
Digital expression profiling by counting “tags”
Unbiased transcript discovery
Transcript annotation using reference data
Serial Analysis of Gene Expression (SAGE)
(Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. (1995). "Serial analysis of gene expression". Science 270 (5235): 484–7.)
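Digital expression profiling by counting tags can be illustrated with a few lines of Python. This is a minimal sketch, assuming the tags have already been extracted from the sequencing reads; the function name and the example tag sequences are hypothetical.

```python
from collections import Counter

def count_tags(tags):
    """Return a Counter mapping each distinct tag sequence to its count.

    Tags are upper-cased so that 'acgt' and 'ACGT' count as the same tag.
    """
    return Counter(tag.upper() for tag in tags)

# Made-up tags for illustration only.
example_tags = ["CATGAAACCC", "CATGAAACCC", "catgaaaccc", "CATGTTTGGG"]
profile = count_tags(example_tags)
for tag, n in profile.most_common():
    print(tag, n)
```

The count per tag is the expression measure; mapping each tag to a reference then gives the transcript annotation.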
6. Preparation of SAGE libraries
[Diagram: SAGE library preparation workflow]
Starting material: mRNA pool with full-length and truncated polyadenylated mRNA (non-polyadenylated mRNA: No!)
cDNA synthesis
Attach cDNA to surface
Cut with anchoring enzyme (frequent cutters, commonly NlaIII; site CATG)
Linker ligation (linkers A and B) to open NlaIII site
Cut with tagging enzyme (LongSAGE: MmeI, SuperSAGE: EcoP15I)
Release from surface
Ligation to form “Ditag”
PCR amplification
Cut with anchoring enzyme
Concatenation/Cloning
Sequencing concatemers by capillary sequencing
Digital Gene Expression (DGE): New protocols for direct sequencing on high-speed sequencers
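The enzymatic steps above can be mimicked in silico to predict which tag a transcript should yield: find the 3'-most CATG site and take a fixed number of bases downstream of it. This is a minimal sketch; the function name, the default tag length of 17 bases, and the example sequence are assumptions for illustration and not values from the talk.

```python
ANCHOR_SITE = "CATG"  # NlaIII recognition site

def extract_sage_tag(cdna, tag_length=17):
    """Return CATG plus `tag_length` bases downstream of the 3'-most CATG.

    `cdna` is the sense-strand sequence written 5' to 3'.
    Returns None if there is no CATG site or the site is too close
    to the 3' end to give a full-length tag.
    """
    seq = cdna.upper()
    pos = seq.rfind(ANCHOR_SITE)  # 3'-most anchoring site
    if pos == -1:
        return None
    end = pos + len(ANCHOR_SITE) + tag_length
    if end > len(seq):
        return None
    return seq[pos:end]

# Hypothetical transcript with two CATG sites; only the 3'-most one is used.
print(extract_sage_tag("GGCATGAAAACATGTTTTCCCC", tag_length=4))
```

Because only the 3'-most site is used, each transcript gives one tag, which is what makes tag counts usable as expression values.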
7. Why Cap Analysis Gene Expression (CAGE)?
Sequencing of 5’ends allows discovery of Transcription Start Sites
5’-end sequencing allows transcript identification
5’-end sequencing allows promoter identification
5’-end sequencing allows monitoring of non-polyadenylated mRNAs
Cap-Trapper method very effective for 5’-end selection
Cap-Trapper allows library preparation directly from total RNA
Shift from “3’-end information” to “5’-end information”
Cap Analysis Gene Expression (CAGE)
(Shiraki T et al. Proc Natl Acad Sci U S A. 2003 Dec 23;100(26):15776-81. Epub 2003 Dec 8)
9. Comparison of SAGE and CAGE data
Directly compare SAGE (DGE) and CAGE from same samples
Use of proliferating and differentiated C2C12 myoblasts as a model
Proliferating C2C12 cells (picture provided by Willem Hoogaars)
Differentiated C2C12 cells: Fusion into myotubes (picture provided by Willem Hoogaars)
Use of biological triplicates
Use of Illumina Genome Analyzer for high-speed sequencing
Jointly Leiden University, Genomatix, ServiceXS, DNAFORM
(Hestand MS et al., Nucleic Acids Res. 2010 Sep;38(16):e165)
10. Flow of SAGE and CAGE data analysis
Samples: CAGE Prolif1-3, CAGE Diff1-3, SAGE Prolif1-3, SAGE Diff1-3
Illumina sequencing (1 channel/sample, 2 technical replicates) and data processing
Step | CAGE | SAGE
Tag processing | Remove 1 base | Add CATG
Mapping | 2 mismatches allowed | 1 mismatch allowed
Regions after mapping | 742,355 | 361,655
Regions above threshold (> 2 TPM) | 41,862 | 43,512
Genes assigned (1,000 bp window) | 10,409 | 10,987
Annotation of the 41,862 CAGE regions:
ElDorado mouse genome: 9,957 annotated exons
ElDorado mouse genome: 27,190 partially annotated exons
ElDorado mouse genome: 2,368 annotated introns
ElDorado mouse genome: 2,347 intergenic regions
Annotated TSS: 13,541 (32%)
Annotated promoter regions: 6,331 (15%)
Annotated 3’-end of transcripts: 8,028 (19%)
FANTOM 3 CAGE data set: 31,680 (76%)
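The thresholding step in this workflow can be sketched as follows. The code assumes that TPM means tags per million mapped tags and that a region is kept only when its value is strictly greater than 2, as the "> 2 TPM" label suggests; the function names and the example counts are hypothetical.

```python
def tags_per_million(region_counts):
    """Convert raw tag counts per region into tags per million (TPM)."""
    total = sum(region_counts.values())
    if total == 0:
        return {}
    return {region: count * 1_000_000 / total
            for region, count in region_counts.items()}

def filter_regions(region_counts, min_tpm=2.0):
    """Keep regions whose TPM is strictly greater than `min_tpm`."""
    tpm = tags_per_million(region_counts)
    return {region: value for region, value in tpm.items() if value > min_tpm}

# Made-up library of exactly one million mapped tags.
counts = {"region_1": 999_990, "region_2": 8, "region_3": 2}
kept = filter_regions(counts)
print(sorted(kept))
```

Normalizing to TPM before thresholding makes the cutoff independent of how deeply each library was sequenced.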
12. MyoD (myogenic marker): Viewed in UCSC Genome Browser
[Genome browser view, annotations:]
Narrow CAGE peak (MyoD promoter has TATA box)
“Exon Painting”
Transcriptional activity at 3’-end
SAGE peak
13. Reproducibility of SAGE and CAGE data
[Scatter plots; panels: sequencing replicates, biological replicates, differential expression (one panel marked CAGE only)]
CAGE: P=0.981, P=0.963, P=0.771
SAGE: P=0.930, P=0.839
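The slide does not define P. Assuming it denotes the Pearson correlation coefficient between the tag counts of two libraries, a value like this could be computed as sketched below; the function name and the example numbers are hypothetical and are not data from the talk.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Made-up TPM values for the same regions in two replicate libraries.
replicate_1 = [10.0, 250.0, 3.0, 40.0]
replicate_2 = [12.0, 240.0, 2.5, 45.0]
print(round(pearson(replicate_1, replicate_2), 3))
```

Expression values are often log-transformed before such a comparison so that a few highly expressed genes do not dominate the coefficient; whether that was done here is not stated on the slide.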
14. Correlation of SAGE and CAGE data
CAGE: 10,409 genes SAGE: 10,987 genes
Overlap, all detectable genes: 1,169 CAGE only; 9,240 shared; 1,747 SAGE only
Overlap, differentially expressed genes: 2,160 CAGE only; 2,144 shared; 1,702 SAGE only
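The Venn diagram numbers can be checked against the gene totals with simple arithmetic. The sketch assumes each triple reads left to right as CAGE only, shared, SAGE only; the function name is hypothetical.

```python
def venn_totals(only_cage, shared, only_sage):
    """Derive per-method totals and the shared fraction from Venn counts."""
    union = only_cage + shared + only_sage
    return {
        "cage_total": only_cage + shared,
        "sage_total": shared + only_sage,
        "union": union,
        "shared_fraction": shared / union,
    }

all_genes = venn_totals(1169, 9240, 1747)
diff_genes = venn_totals(2160, 2144, 1702)
print(all_genes["cage_total"], all_genes["sage_total"])
print(f"{all_genes['shared_fraction']:.0%} of all detected genes are shared")
print(f"{diff_genes['shared_fraction']:.0%} of differentially expressed genes are shared")
```

Under that reading, the first triple reproduces the totals of 10,409 CAGE genes and 10,987 SAGE genes, and the overlap is clearly larger for detection than for differential expression.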
15. Top 30 genes from SAGE and CAGE expression data
CAGE gene | Ratio | Microarray | SAGE gene | Ratio | Microarray
Hfe2 | 4,073 | NA | RP23-36P22.5 | 576 | NA
Myom3 | 1,624 | NA | Neb | 525 | NA
Lmod2 | 1,305 | NA | Mylpf | 504 | Yes
Myh7 | 1,124 | Yes | Ttn | 380 | NA
Mb | 908 | Yes | Myh3 | 368 | Yes
RP23-36P22.5 | 735 | NA | Xirp1 | 306 | Yes
Pygm | 717 | Yes | 1110002H13Rik | 263 | NA
Myl4 | 614 | Yes | Tnnc1 | 232 | Yes
Synpo2l | 595 | NA | Cav3 | 150 | Yes
Myh1 | 561 | Yes | Cbfa2t3 | 133 | Yes
…… | | | …… | |
CAGE: 13 out of 30 not found by microarray
SAGE: 10 out of 30 not found by microarray
Microarray data from same cell line published by: Tomczak KK et al. FASEB J. 2004 Feb;18(2):403-5. Epub 2003 Dec 19.
(Affymetrix mouse MG_U74Av2 and MG_U74Cv2 oligonucleotide-based GeneChips)
16. GO terms found in SAGE, CAGE and microarray data
CAGE GO | SAGE GO | Microarray GO
Regulation of striated muscle contraction | Regulation of muscle contraction | Cyclin-dependent protein kinase inhibitor activity
Cardiac muscle contraction | Cardiac muscle contraction | Myogenesis
Myogenesis | Myogenesis | Skeletal muscle development
Regulation of muscle contraction | Regulation of striated muscle contraction | Myoblast differentiation
Skeletal muscle development | Skeletal muscle development | 6-phosphofructokinase activity
Muscle development | Myofibril assembly | Muscle development
Striated muscle contraction | Muscle development | Muscle cell differentiation
Myoblast differentiation | Myoblast fusion | Tumor suppressor activity
Muscle cell differentiation | Striated muscle contraction | Myofibril assembly
Sarcomere organization | Muscle cell differentiation | Heart development
10/10 muscle related | 10/10 muscle related | 7/10 muscle related
17. Myl1 CAGE region as example for new TSS discovery
CAGE Diff1
CAGE Prolif1
Mouse
Human
Horse
Myosin light chain 1 (Myl1) promoter
18. Differentially regulated TSS found in CAGE regions
Found 196 new differentially regulated TSS in CAGE data
Out of which 111 regions are upstream of known genes
Out of which 85 regions are downstream of known genes
RT-PCR results for the 8 regions tested: + - + + + + + +
(Lower Cp value = higher expression)
7 out of 8 regions tested could be confirmed by RT-PCR
20. “Exon-painting” found in CAGE libraries
CAGE tags are found in exons with some frequency
Exon-painting is reproducible and gene specific
New re-capping activity suggested
Examples shown: Col1a1, Col1a2
21. New developments for CAGE method
nanoCAGE: Preparation of CAGE libraries starting from as little as 50 ng total RNA
(Plessy C et al. Nat Methods. 2010 Jul;7(7):528-34. Epub 2010 Jun 13.)
CAGE-Scan: Use of paired-end sequencing to link new TSS to known genes
(Plessy C et al. Nat Methods. 2010 Jul;7(7):528-34. Epub 2010 Jun 13.)
Helicos-CAGE: Use of single-molecule sequencing to reduce bias in CAGE libraries and to reduce sample requirements
Link high-speed sequencing to cDNA cloning: Creating the resources needed to study newly discovered transcripts!
22. Summary
SAGE and CAGE both provide highly reproducible data sets
SAGE and CAGE data show great overlap on genes covered
Both methods show better coverage than microarray data
CAGE data are more complex than SAGE data: 67% of genes have multiple CAGE regions
CAGE data allowed for discovery of new TSS
CAGE data indicate transcriptional activity at 3’-ends of annotated genes
CAGE data showed some exon-painting for many transcripts
23. Acknowledgements
Leiden University Medical Center: Matthew S. Hestand, Yavuz Ariyurek, Yolande Ramos, Gert-Jan B. van Ommen, Johan T. den Dunnen, Peter A.C. ‘t Hoen
Genomatix: Andreas Klingenhoff, Matthias Scherf, Thomas Werner
DNAFORM: Makoto Suzuki
ServiceXS: Wilbert van Workum
Omics Science Center, RIKEN Yokohama: Piero Carninci, Charles Plessy, Yoshihide Hayashizaki