Cancer Systems Biology:
RNA-Seq
August 16, 2012
Anne Deslattes Mays
Wellstein/Riegel Laboratory
Mentor: Anton Wellstein, M...

2012 august 16 systems biology rna seq v2

  • “Nothing scientific can be said about a system for which no measurements are possible at the scale of the theory.” Erwin Schrödinger, Science Theory and Man: “It really is the ultimate purpose of all schemes and models to serve as scaffolding for any observations that are at all means observable.” “It makes no sense to apply a mathematical method that either depends on or utilizes unobservable measurements.”
  • Capillary Gel Electrophoresis enabled the sequencing of the human genome faster, cheaper – spurred the completion of the human genome project
  • Sample starts with total RNA. Messenger RNA is purified by polyA selection, then chemically fragmented and converted into single-stranded cDNA using random hexamer priming. The second strand is generated to create double-stranded cDNA, which is then ready for TruSeq library construction. Blunt-ended DNA fragments are generated using a combination of fill-in reactions and exonuclease activity. An “A” base is added to the blunt ends of each strand, preparing them for ligation to the sequencing adapters.
  • TruSeq workflow: blunt-end fragments are created, an A base is added, and the fragments are prepared for indexed adapter ligation. The final product is ready for amplification on either the cBot or the Cluster Station. A pooling strategy using these adapters allows multiplexing on the HiSeq 2000. In this way paired-end sequencing can be performed; the tags are assigned to each strand, so strandedness is preserved.
  • RNA-Seq allows you to discover and profile the entire transcriptome. No probes. No primers. RNA-Seq delivers unbiased, unparalleled information about the transcriptome. Simple sequencing workflow. Illumina's optimized TruSeq RNA Sample Prep Kits.
  • Tools
  • Once you get started, there are a number of tools that allow you to visualize and understand your data.
  • I will talk more about these tools on Thursday, when I give a talk on RNA-Seq for the Systems Biology series. But let's go back to our particular problem.

    1. 1. Cancer Systems Biology: RNA-Seq August 16, 2012 Anne Deslattes Mays Wellstein/Riegel Laboratory Mentor: Anton Wellstein, MD, PhD 9/13/2013 Wellstein/Riegel Laboratory 1
    2. 2. Talk Outline • What is Systems Biology? • What is RNA-Seq? • RNA-Seq Differential Expression Analysis
    3. 3. Systems Biology is a systems approach to building testable models of biology using observation and measurement
    4. 4. Systems Biology brings together interdisciplinary fields, tools, analysis and platforms • Genomics • Epigenomics/epigenetics • Transcriptomics • Proteomics • Metabolomics • Glycomics • Lipidomics • Interactomics • NeuroElectroDynamics • Fluxomics • Biomics
    5. 5. What is the discipline of Systems Biology? A Reverse Engineering Discipline: Input -> Process -> Output. Perhaps more equivalent to a deciphering project: Alan Turing and the group of codebreakers during World War II deciphered the codes created by the Enigma machine. A biological system is communicating, and we are trying to crack the code.
    6. 6. What is Systems Biology? Systems Biology is a discipline that uses a multitude of measurement technologies to capture the entirety of a biological system's parts (genome, transcriptome, proteome, metabolome) and then attempts to reverse engineer that system's ability to dynamically remodel in response to stimuli.
    7. 7. What is Systems Biology? Measurement technologies: sequencing technologies and mass spec technologies. Parts measured: genome, transcriptome, proteome, metabolome.
    8. 8. What is Systems Biology? Technology advances spur research advances. Measurement technologies: sequencing technologies and mass spec technologies. Parts measured: genome, transcriptome, proteome, metabolome.
    9. 9. RNA-Seq
    10. 10. Here is an example RNA-Seq Workflow • Experimental Design • Sample Collection • Quality Control • Read Trimming • Differential Analysis • Transcript Identification • Pathway Analysis • Marker Discovery • Sequencing
    11. 11. Three steps to get to a fresh sequence with the Illumina Genome Sequence Analyzer • Library generation • Cluster generation • Sequencing
    12. 12. Library Construction: Messenger RNAs are poly-A selected from total RNA and fragmented, and cDNA is synthesized. Before Library Construction 1. Poly-A Selection (Total RNA -> mRNA) 2. mRNA fragmentation 3. First strand synthesis (here we stop if we want to maintain strand specificity) 4. Second strand synthesis. Other techniques 1. Ribo-Zero 2. RiboMinus
    13. 13. Library Construction: End repair and adenylation result in adapter-ligation-ready constructs. cDNA (single or double stranded) 1. cDNA is blunt end-repaired and phosphorylated (B.) 2. A-base added to prepare for indexed adapter ligation (C.)
    14. 14. Library Construction: Adapter ligation results in cluster-generation-ready constructs. Index adapter ligation and product ready for amplification on cBot or the cluster station 1. Strand specific tags are added to the A base – ligate index adapter (D) 2. Denature and amplify for final product (E)
    15. 15. Cluster Generation: In the Illumina cBot system, single molecules are isothermally amplified in a flow cell to prepare them for sequencing. Single DNA molecules hybridize to the lawn of oligos grafted to the surface of the flow cell 1. Oligo lawn 2. Oligos hybridize to the adapters that had been ligated to the library fragments, which flow through the cell
    16. 16. Cluster Generation: Bound fragments are extended to make copies, and reverse strands are cleaved and washed away. Bridge amplifications result in 100s of millions of unique clusters 1. Each fragment is clonally amplified through a series of extensions and isothermal bridge amplifications 2. Reverse strands cleaved and washed away 3. Ends are blocked 4. Sequencing primer hybridized to the DNA template 5. Libraries are ready for sequencing
    17. 17. Sequencing: 100s of millions of clusters sequenced simultaneously. 4 fluorescently labeled reversibly terminated nucleotides 1. Each base competes for addition 2. Natural competition ensures highest accuracy 3. After each round of synthesis, clusters are excited by a laser and emit a color that identifies the newly added base 4. Fluorescent label and blocking group are removed, allowing for addition of the next nucleotide 5. Proprietary (Illumina) chemistry reads a base in each cycle 6. Allows for accurate sequencing through difficult regions such as homopolymers and repetitive sequence
    18. 18. What was good for DNA is now good for RNA • Technology advances => higher throughput sequencing at lower costs • Whole Genome Sequencing has enabled Whole Transcriptome Sequencing • Workflow for DNA sequencing and RNA sequencing is similar
    19. 19. There are other ways to inquire about the Transcriptome • Array Based Technologies – Affymetrix – Agilent – Known genes and hybridization protocols • Microarray – 20,000+ array experiments on a single platform – Edge effects – False positives / false negatives • Bead-based arrays • Tiling arrays • SAGE
    20. 20. What is unique about RNA-Seq? • Allows you to discover and profile the entire transcriptome of any organism • No probes or primers to design • Novel transcripts • Novel isoforms • Alternative splice sites • Rare transcripts • cSNPs – all of this in one experiment
    21. 21. After sequencing, we must then perform RNA-Seq Data Analysis. After sequencing… 1. Quality control – trim your reads 2. Count Reads • Align to genome • Align to transcriptome 3. Interpret Data • Statistical tests (differential expression analysis) • Visualization (mapped reads) • Pathway analysis. Not so simple – big data, big compute requirements
    22. 22. How much RNA-sequencing data, how much computation power and where do you go to compute? How much RNA-sequencing data? 1. 20 million paired end reads ~ 2 GB of data 2. 100 million paired end reads ~ 10 GB of data. How much computation power? 1. More memory and more processors mean less time to compute 2. If you outsource the analysis, you will still need to store the results somewhere. Where to compute • Amazon Web Services: S3 storage, EC2 (Elastic Compute Cloud) on-demand computational facility • Georgetown University High Performance Computer Core: matrix.georgetown.edu • UPENN Galaxy services
    23. 23. A growing number of tools enable RNA-Seq analysis
    24. 24. These RNA-Seq tools are used for mapping reads, aligning reads and providing input for differential expression analysis • Tuxedo suite – Bowtie, TopHat, Cufflinks • Trinity suite – Inchworm, Chrysalis, Butterfly • RUM – RNA Unified Mapper
    25. 25. Mapping tools bias exons to known genes. What percentage of reads are covered? What percentage of reads are mapped? 3’ bias on transcript reads 1. 60-80% of reads are mapped 2. The highest percentage of mapped reads is at the 3’ end of transcripts 3. Reads need to be quality trimmed
    26. 26. Galaxy is a web-based tool committed to enabling researchers (and it is for more than just RNA-Seq)
    28. 28. How to visualize mapped results? • UCSC Genome Browser (Gbrowse) • Integrated Genome Browser (IGB) • Integrative Genomics Viewer (IGV). These share many formats, read many of the outputs generated by the analysis programs, and let you generate your own tracks
    31. 31. What do RNA-Seq reads look like for GAPDH? Repeat-masked, BLAT-aligned reads allowing 1/2 mismatched bases, viewed in IGB 6.7.2
    32. 32. RNA-Seq Differential Expression Analysis
    33. 33. What does GAPDH look like in terms of quantitation? Values per gene are listed in slide order: TOTAL BM (RPKM, 3SEQ Counts, BLAT Reads), then HPP (RPKM, 3SEQ Counts, BLAT Reads). Rows with fewer than six values are missing cells, so their column assignment is not recoverable.
        CD34 0.7 340 230 8 8 14
        BST1 19.7 5374 31 31
        CD133 0.2 173 176 16 16 33
        THY1 0 7 4 4
        A12 1 0
        A5 0 0
        ALK 0 9 24 0 0 3
        B9 0 0
        C1 0 0
        C2 0 0
        C7 0 0
        E7 0 0
        E9 2 0
        F6 0 0
        G12 0 0
        GAPDH 3013.2 727831 356289 120.8 5559 2670
        H3 0 0
        BLAT read raw counts ratio == 3SEQ counts ratio ~= 130 to 1; RPKM ratio ~= 24.3
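The GAPDH ratios quoted on the slide can be recomputed directly from the GAPDH row. This is only a sketch of the arithmetic: the dictionary keys are names chosen here for illustration, and the values are copied from the slide.

```python
# GAPDH row from the slide: TOTAL BM versus HPP.
# Key names are illustrative; only the numbers come from the slide.
gapdh = {
    "TOTAL BM": {"rpkm": 3013.2, "3seq_counts": 727831, "blat_reads": 356289},
    "HPP": {"rpkm": 120.8, "3seq_counts": 5559, "blat_reads": 2670},
}

# Ratio of TOTAL BM to HPP for each measurement.
ratios = {
    measure: gapdh["TOTAL BM"][measure] / gapdh["HPP"][measure]
    for measure in gapdh["HPP"]
}

for measure, ratio in ratios.items():
    print(f"{measure}: {ratio:.1f} to 1")
```

Both raw-count ratios come out a little above 130 to 1, while the RPKM ratio is close to 25, which shows how much the choice of normalization changes the apparent fold difference between the two samples.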
    34. 34. RNA-Seq Quantification Challenge: a problem that exists with RNA-Seq data and doesn’t exist with array data is that longer transcripts produce more reads than shorter transcripts. One solution to account for this is RPKM, reads per kilobase of exon model per million mapped reads (Cufflinks uses the fragment-based equivalent, FPKM). RPKM = 10^9 x C / (N x L), which is the fraction of mapped reads C/N scaled per kilobase of exon and per million reads. C(gene) = the number of mappable reads that fall onto a gene's exons; N = total number of mappable reads in the experiment; L(gene) = the sum of the exon lengths in base pairs. Wold (2008)
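As a minimal sketch of the RPKM formula above, the function below takes C, N and L as plain numbers. The function name, argument names and the example counts are illustrative assumptions, not values from the talk.

```python
def rpkm(read_count, total_mapped_reads, exon_length_bp):
    """Reads per kilobase of exon model per million mapped reads.

    read_count          C: mappable reads falling on the gene's exons
    total_mapped_reads  N: total mappable reads in the experiment
    exon_length_bp      L: summed exon length of the gene in base pairs
    """
    if total_mapped_reads <= 0 or exon_length_bp <= 0:
        raise ValueError("N and L must be positive")
    return 1e9 * read_count / (total_mapped_reads * exon_length_bp)


# Hypothetical example: a gene twice as long collects twice the reads,
# yet both genes get the same RPKM, which is the point of the correction.
short_gene = rpkm(500, 20_000_000, 1_000)
long_gene = rpkm(1_000, 20_000_000, 2_000)
print(short_gene, long_gene)
```

Dividing by L removes the length effect, and dividing by N makes samples sequenced to different depths comparable.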
    35. 35. Cufflinks: Transcript assembly, differential expression, and differential regulation for RNA-Seq
    36. 36. Cuffdiff produces many output files: 1. Transcript FPKM expression tracking. 2. Gene FPKM expression tracking; tracks the summed FPKM of transcripts sharing each gene_id. 3. Primary transcript FPKM tracking; tracks the summed FPKM of transcripts sharing each tss_id. 4. Coding sequence FPKM tracking; tracks the summed FPKM of transcripts sharing each p_id, independent of tss_id. 5. Transcript differential FPKM. 6. Gene differential FPKM. Tests differences in the summed FPKM of transcripts sharing each gene_id. 7. Primary transcript differential FPKM. Tests differences in the summed FPKM of transcripts sharing each tss_id. 8. Coding sequence differential FPKM. Tests differences in the summed FPKM of transcripts sharing each p_id, independent of tss_id. 9. Differential splicing tests: this tab-delimited file lists, for each primary transcript, the amount of overloading detected among its isoforms, i.e. how much differential splicing exists between isoforms processed from a single primary transcript. Only primary transcripts from which two or more isoforms are spliced are listed in this file. 10. Differential promoter tests: this tab-delimited file lists, for each gene, the amount of overloading detected among its primary transcripts, i.e. how much differential promoter use exists between samples. Only genes producing two or more distinct primary transcripts (i.e. multi-promoter genes) are listed here. 11. Differential CDS tests: this tab-delimited file lists, for each gene, the amount of overloading detected among its coding sequences, i.e. how much differential CDS output exists between samples. Only genes producing two or more distinct CDS (i.e. multi-protein genes) are listed here.
    37. 37. RNA-Seq Quantification Challenge: the DESeq method uses the geometric mean of counts in all samples. DESeq Method: Construct a "reference sample" by taking, for each gene, the geometric mean of the counts in all samples. To get the sequencing depth of a sample relative to the reference, calculate for each gene the quotient of the count in your sample divided by the count in the reference sample. Now you have, for each gene, an estimate of the depth ratio. Take the median of all the quotients to get the relative depth of the library. The 'estimateSizeFactors' function of the DESeq package does this calculation.
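The median-of-ratios procedure described above can be sketched in a few lines. This is an illustrative Python re-implementation of the described steps, not the DESeq R code itself; the function name, the input layout (gene name mapped to one raw count per sample) and the toy counts are assumptions. Genes with a zero count in any sample are skipped here because their geometric mean is zero.

```python
import math
from statistics import median


def estimate_size_factors(counts):
    """Median-of-ratios size factors, one per sample.

    counts: dict mapping gene name -> list of raw counts (one per sample).
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if any(c <= 0 for c in gene_counts):
            continue  # geometric mean would be zero; gene is not usable
        # Log of the geometric mean = mean of the logs ("reference sample").
        log_reference = sum(math.log(c) for c in gene_counts) / n_samples
        for sample, c in enumerate(gene_counts):
            log_ratios[sample].append(math.log(c) - log_reference)
    if not log_ratios[0]:
        raise ValueError("no gene has non-zero counts in every sample")
    # Median quotient per sample gives the relative library depth.
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical counts: sample 2 was sequenced exactly twice as deeply.
toy_counts = {"geneA": [10, 20], "geneB": [100, 200], "geneC": [50, 100]}
size_factors = estimate_size_factors(toy_counts)
print(size_factors)
```

Dividing each sample's raw counts by its size factor puts the samples on a common scale; using the median makes the estimate robust to a handful of strongly differentially expressed genes.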
    38. 38. DESeq: an R package that works with raw counts to determine genes differentially expressed across samples • Simon Anders
    41. 41. What is Systems Biology? Technology advances spur research advances. Systems Biology is a discipline that uses a multitude of measurement technologies (sequencing technologies, mass spec technologies) to capture the entirety of a biological system's parts (genome, transcriptome, proteome, metabolome) and then attempts to reverse engineer that system's ability to dynamically remodel in response to stimuli.
    42. 42. Resources • http://dx.doi.org/10.1038/npre.2010.4282.1 (DESeq) • http://galaxy.psu.edu/ • http://seqanswers.com/ • http://www.broadinstitute.org/igv/ • http://bioviz.org/igb/index.html • http://www.illumina.com • http://www.otogenetics.com • http://www.dnanexus.com • http://cufflinks.cbcb.umd.edu/ • http://brb.nci.nih.gov/BRB-ArrayTools.html
    43. 43. Acknowledgements Dr. Anton Wellstein Dr. Anna Riegel Dr. Marcel Schmidt Jean-Baptiste Masarati Dr. Elena Tassi The entire lab: Tibari, Ghada, Ivana, Eveline, the entire Wellstein/Riegel laboratory My Committee Dr. Yuri Gusev Dr. Anatoly Dritschilo Dr. Michael Johnson Dr. Christopher Loffredo Dr. Habtom Ressom Dr. Terry Ryan (external committee member) High Performance Core Group, Steve Moore, especially Woonki Chung Amazon Cloud Services Dr. Ann Loraine, UNC, IGB Developer Brian Haas, Author Trinity Suite
    44. 44. Given a list of differentially expressed genes, enrichment analysis should now be performed • Enrichment analysis lets the researcher draw on documented experiments that provide evidence for genes' roles in pathways and functions, and so helps determine the meaning and significance of their own results • DAVID – Gene ontology – Functional ontology • REVIGO – Output of DAVID may be placed in REVIGO for further interpretation and statistical exploration of the significance of discovered sets of genes
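To make the idea of enrichment concrete, here is a generic over-representation test based on the hypergeometric distribution. It is a sketch of the general technique only and is not the specific statistic used by DAVID or REVIGO; the function name, argument names and all gene counts are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, selected, overlap):
    """Probability of seeing at least `overlap` annotated genes by chance.

    population  total genes measured
    annotated   genes in the population that belong to the category
    selected    differentially expressed genes
    overlap     differentially expressed genes that belong to the category
    """
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 genes measured, 200 in the category,
# 300 differentially expressed, 15 of those in the category.
p = enrichment_p_value(20_000, 200, 300, 15)
print(f"p = {p:.3g}")
```

With these made-up numbers only about 3 category genes would be expected among the 300 by chance, so an overlap of 15 yields a very small p-value. In practice the p-values need correction for the many categories tested.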
    45. 45. Using differentially expressed genes, biological pathways should be explored • Differentially expressed genes are put into programs such as Pathway Studio or Ingenuity • Shortest path programs • Canonical pathway analysis • Enables a researcher to reverse engineer the pathways expressed in the course of a healthy response to a diseased response • Ideally a pathway explains the observed phenotype, connecting genotype to gene expression program to phenotype
    46. 46. FGFBP1 pathways control after induction of a conditional transgene in a mouse model: Information derived from mRNA expression pattern analysis. Anne Deslattes Mays, Elena Tassi, Anton Wellstein. Department of Oncology and Medicine, Lombardi Cancer Center, Washington DC 20057
        Abstract: Fibroblast Growth Factors (FGFs) play a significant role in embryonic development, maintenance of tissue homeostasis in the adult as well as in different diseases. FGF-binding proteins (FGF-BP) are secreted proteins that chaperone FGFs stored in the extracellular matrix to their cognate receptor, and can thus modulate FGF signaling. FGF-BP1 (BP1 a.k.a. HBp17) expression is required for embryonic survival, can modulate FGF-dependent vascular permeability in embryos and is an angiogenic switch in human cancers. To determine the function of BP1 in vivo, we generated tetracycline-regulated conditional BP1 transgenic mice. BP1 expressing mice are viable, fertile and phenotypically indistinguishable from their littermates. Five cDNA Affymetrix arrays were run on the kidneys of the FGF-BP1 transgenic mice. Two arrays were run for the animals under doxycycline diet with the transgene switched off, one array was run with induction of the FGF-BP1 transgene for 24 hours, one array was run with induction of the FGF-BP1 transgene for 336 hours representing a chronic induction of the transgene. The results indicate that when properly normalized, time series analysis of a large array can reveal the signal transduction pathways. Pattern analysis allows for a systems biology review of the data and allows for the exploration and generation of testable hypotheses.
        Figure 3 – Heatmap scaled by probe. After RMA normalization and selection of significant over- and under-expressors relative to the average of the FGFBP1 transgene being off, analysis of the heatmap reveals mutually exclusive clusters. These clusters indicate genes that are off from one state until the other. Cluster A represents those genes that are off with the FGFBP1 transgene being off and switched on when the FGFBP1 transgene is activated for 24 hours. Cluster B contains those genes that are off at 24 hours but activated when the FGFBP1 transgene is on for 48 hours. Cluster C contains those genes that are off at 48 hours but on when the FGFBP1 transgene is on for 336 hours, or chronically. Studying these genes in this order, and with this pattern, allows the exploration of the signal transduction and activation pathway in response to the activation of the FGFBP1 transgene.
        Figure 5 – Gene Details. The details for the genes found in the clusters of Figure 1 are described in tables A, B and C. The genes responding after activation of the FGFBP1 transgene for 24 hours include immunoglobulin kappa chain variable 21, 3-phosphoglycerate dehydrogenase, a zinc finger protein, neuronatin, and homeobox B8. The genes found in table B represent those genes activated after 48 hours of the FGFBP1 transgene being on. Included in this set are hemopexin and major urinary protein 3. Finally, after 336 hours, truly representing chronic activation of the FGFBP1 transgene, we have one gene, Reg3b, associated with inflammatory response (according to GO ontology).
        Figure 2 – Distinct Expression Patterns When Filtering by Thresholds at Timepoints. By creating a filter to capture the distinctive patterns that are expressing themselves at each of the separate timepoints, one can understand the major message being communicated at each timepoint. The patterns of expression are distinctive. Panel A shows the expression patterns for those genes above a threshold at 24 hours. Panel B shows the expression patterns for those genes above a threshold at 48 hours, and Panel C shows the expression patterns for those genes above a threshold at 336 hours, or at a chronic transgene expression level.
        Figure 4 – FGFBP1 pathways. Using Pathway Studio, the shortest path through the set of genes that were selected by a band pass filter at each of the time points (24 hours, 48 hours and 336 hours) was constructed. The resulting selection of diseases, cell processes, and functional classes was the result of Pathway Studio constructing the shortest path to connect the genes in the set.
        Conclusions: A systems biology approach to analyzing large data sets, such as this study, which involved five full mouse cDNA arrays, allows the researcher to capture a snapshot of the unfolding remodeling events of an organism's response to change, stress or disease. Analyzing data in this form involves filtering the biological signal from the noise. Sorting the noise in appropriate manners is essential to be able to complete the biological story. Building on the existing knowledge base, we can complete the picture as long as the proper context of the collection, normalization and analysis is maintained. High throughput technologies such as microarrays and RNA sequencing, as enabled by next generation sequencing, present the researcher with the challenge of extracting meaningful information from the measurements. Software tools and analysis techniques are not a substitute for understanding the biological context from which the data are collected. Engineering and digital signal processing have allowed us to understand how to reconstruct a signal from a continual stream of noisy analog data. Sampling frequency and proper filtering are a must to be able to sort out a meaningful signal from the noise. These same principles apply not only to communication theory but also when studying large data such as those that may be collected from high throughput systems such as an Affymetrix mouse cDNA array.
        Figure 1 – Panels 0, A, B, and C illustrate ordering based upon the expression values of the control (FGFBP1 off), 24 hour expression (FGFBP1 on 24 hours), 48 hour expression (FGFBP1 on 48 hours), and 336 hour expression (FGFBP1 on 336 hours). The insight gained from this inspection includes the ability to see the relative changes of expression at each of these time points.
        Figure 6 – Graphical Gaussian Model. Using the expression profiles, a quasi-Bayesian analysis is performed, constructing the partial correlation network among the top expressing genes. Note that C9 (complement component 9) could not be placed in the context of the data in the Pathway Studio diagram; however, using the partial correlations, we are able to place it as strongly positively correlated to Serpina3k, Cyp3a11, MUG1, Tdo2, Mup3, Hpx, weakly positively correlated to Hamp, and strongly negatively correlated to Tex10. Together these indicate the placement of C9 in the endothelial response.
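The partial correlations behind a graphical Gaussian model can be illustrated with the simplest case of three variables. This sketch is not the quasi-Bayesian procedure used for the poster; it only shows the first-order partial correlation formula, and the three expression profiles below are made-up numbers in which two genes are both driven by a shared regulator.

```python
from math import sqrt


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical profiles: gene_x and gene_y both follow the regulator.
regulator = [1, 2, 3, 4, 5]
gene_x = [4, 4, 11, 10, 16]
gene_y = [2, 8, 9, 10, 16]

raw = pearson(gene_x, gene_y)
direct = partial_correlation(gene_x, gene_y, regulator)
print(f"raw correlation {raw:.2f}, partial correlation {direct:.2f}")
```

The raw correlation between the two genes is high, but it vanishes once the regulator is accounted for. A partial correlation network therefore keeps only the direct associations, which is what allows a gene to be placed relative to its neighbors.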
    47. 47. Scientific knowledge is limited (and advanced) by the limits (and advancements) of measurement • Ilya Shmulevich, Genomic Signal Processing: “Validity of the model involves observation and measurement, scientific knowledge is limited by the limits of measurement” • Erwin Schrödinger, Science Theory and Man: “It really is the ultimate purpose of all schemes and models to serve as scaffolding for any observations that are at all means observable”
    48. 48. Before library construction, RNA quality must be assessed. Before Library Construction 1. Most vendors and cores will assess the quality of the RNA before sequencing 2. Important to determine before sequencing begins. Garbage in == garbage out
    49. 49. Cluster Generation • In the cBot cluster system, single molecules are isothermally amplified in a flow cell to prepare them for high-throughput sequencing • 8 channel genome analyzer has a dense lawn of oligos • Single DNA molecules hybridize to the lawn of oligos • Bound fragments are extended to make copies • Copies are covalently bound to the flow cell's surface • Each fragment is clonally amplified through a series of extensions and isothermal bridge amplifications, resulting in 100s of millions of unique clusters • Reverse strands cleaved and washed away • Ends are blocked • Sequencing primer hybridized to the DNA template • After cluster generation, libraries are ready for sequencing
    50. 50. Sequencing • 100s of millions of clusters sequenced simultaneously • Using 4 fluorescently labeled reversibly terminated nucleotides • Natural competition ensures highest accuracy • After each round of synthesis, clusters are excited by a laser and emit a color that identifies the newly added base • Fluorescent label and blocking group are then removed, allowing for the addition of the next nucleotide • Proprietary chemistry (Illumina) reads a base in each cycle • Allows for accurate sequencing through difficult regions such as homopolymers and repetitive sequence
    52. 52. Systems Biology History (Wikipedia) • Systems biology roots found in – Quantitative modeling of enzyme kinetics – Mathematical modeling of population growth – Simulations to study neurophysiology – Control theory and cybernetics • Theorists – Ludwig von Bertalanffy – General Systems Theory – Alan Lloyd Hodgkin and Andrew Fielding Huxley – constructed a mathematical model that explained the action potential propagating along the axon of a neuron – Denis Noble – first computer model of the heart pacemaker
    53. 53. Institutes of Systems Biology • 2000 – Institutes of Systems Biology established in Seattle and Tokyo • After completion of the Human Genome Project • NSF grand challenge for systems biology – build a mathematical model of the whole cell
