2012 august 16 systems biology rna seq v2
1. Cancer Systems Biology:
RNA-Seq
August 16, 2012
Anne Deslattes Mays
Wellstein/Riegel Laboratory
Mentor: Anton Wellstein, MD, PhD
9/13/2013 Wellstein/Riegel Laboratory 1
2. Talk Outline
• What is Systems Biology?
• What is RNA-Seq?
• RNA-Seq Differential Expression Analysis
3. Systems Biology is a systems approach to building
testable models of biology using observation and
measurement
5. What is the discipline of Systems Biology?
A Reverse Engineering Discipline
Input
Process
Output
Perhaps more equivalent to a deciphering project:
Alan Turing and the group of codebreakers during World War II
deciphered the codes created by the Enigma machine.
A biological system is communicating, and we are trying to crack the code.
6. What is Systems Biology?
Systems Biology is a discipline that uses a
multitude of measurement technologies to
capture the entirety of a biological system's
parts and then attempts to reverse engineer
that biological system’s ability to dynamically
remodel in response to stimuli
Genome
Transcriptome
Proteome
Metabolome
7. What is Systems Biology?
Sequencing technologies
Mass Spec technologies
Genome
Transcriptome
Proteome
Metabolome
8. What is Systems Biology?
Technology advances spur research advances
10. Here is an example RNA-Seq Workflow
• Experimental Design
• Sample Collection
• Quality Control
• Read Trimming
• Differential Analysis
• Transcript Identification
• Pathway Analysis
• Marker Discovery
• Sequencing
11. Three steps to get to a fresh sequence with the Illumina
Genome Sequence Analyzer
• Library generation
• Cluster generation
• Sequencing
12. Library Construction: Messenger RNAs are poly-A selected
from total RNA, fragmented, and used to synthesize cDNA
Before Library Construction
1. Poly-A selection (total RNA -> mRNA)
2. mRNA fragmentation
3. First strand synthesis (we stop here if we want to maintain strand specificity)
4. Second strand synthesis
Other techniques (rRNA depletion)
1. Ribo-Zero
2. RiboMinus
13. Library Construction: End repair and adenylation result in
adapter-ligation-ready constructs
cDNA (single or double stranded)
1. cDNA is blunt end-repaired and phosphorylated (B.)
2. A-base added to prepare for indexed adapter ligation (C.)
14. Library Construction: Adapter ligation results in
cluster-generation-ready constructs
Index adapter ligation and product ready for amplification on cBot or the cluster station
1. Strand specific tags are added to the A base – ligate index adapter (D)
2. Denature and amplify for final product (E)
15. Cluster Generation: In the Illumina cBot system, single molecules are
isothermally amplified in a flow cell to prepare them for sequencing
Single DNA molecules hybridize to the lawn of oligos grafted to the surface of the flow cell
1. Oligo lawn
2. Oligos hybridize to the adapters that had been ligated to the library fragments, which flow through the cell
16. Cluster Generation: Bound fragments are extended to make
copies, and reverse strands are cleaved and washed away
Bridge amplification results in 100s of millions of unique clusters
1. Each fragment is clonally amplified through a series of extensions and isothermal bridge amplifications
2. Reverse strands are cleaved and washed away
3. Ends are blocked
4. Sequencing primer is hybridized to the DNA template
5. Libraries are ready for sequencing
17. Sequencing: 100s of millions of clusters sequenced simultaneously
4 fluorescently labeled, reversibly terminated nucleotides
1. Each base competes for addition
2. Natural competition ensures highest accuracy
3. After each round of synthesis, clusters are excited by a laser and emit a color that identifies the newly added base
4. Fluorescent label and blocking group are removed, allowing for addition of the next nucleotide
5. Proprietary (Illumina) chemistry reads a base in each cycle
6. Allows for accurate sequencing through difficult regions such as homopolymers and repetitive sequence
18. What was good for DNA is now good for RNA
• Technology advances => higher throughput sequencing at
lower costs
• Whole Genome Sequencing has enabled Whole Transcriptome Sequencing
• Workflow for DNA sequencing and RNA sequencing is similar
19. There are other ways to Inquire about the
Transcriptome
• Array Based Technologies
– Affymetrix
– Agilent
– Known genes and hybridization protocols
• Microarray
– 20,000+ array experiments on a single platform
– Edge effects
– False positives / false negatives
• Bead-based arrays
• Tiling arrays
• SAGE
20. What is unique about RNA-Seq?
• Allows you to discover and profile the entire transcriptome of
any organism
• No probes or primers to design
• Novel transcripts
• Novel isoforms
• Alternative splice sites
• Rare transcripts
• cSNPs – all of this in one experiment
21. After sequencing, we must then perform RNA-Seq Data Analysis
After sequencing…
1. Quality control – trim your reads
2. Count reads
• Align to genome
• Align to transcriptome
3. Interpret data
• Statistical tests (differential expression analysis)
• Visualization (mapped reads)
• Pathway analysis
Not so simple – big data, big compute requirements
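The read-trimming step can be illustrated with a small sketch. This is only one simple trimming rule (cut low-quality bases from the 3' end), not the method of any particular trimming tool. It assumes FASTQ quality strings in Phred+33 encoding, and the function names and the threshold of 20 are illustrative assumptions.

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20):
    """Drop bases from the 3' end until the last base meets min_quality.

    min_quality=20 is an illustrative threshold, not a recommendation.
    Returns the trimmed (sequence, quality_string) pair.
    """
    scores = phred33_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


# 'I' encodes Phred 40 and '#' encodes Phred 2 in Phred+33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIIII##")
print(trimmed_seq, trimmed_qual)
```

A read whose bases all fall below the threshold comes back empty, so a real pipeline would also need a minimum-length filter after trimming.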
22. How much RNA-sequencing data, how much computation
power and where do you go to compute?
How much RNA-sequencing data?
1. 20 million paired end reads ~ 2 GB of data
2. 100 million paired end reads ~ 10 GB of data
How much computation power?
1. The more memory and processors, the less time it takes to compute
2. If you outsource the analysis, you will still need to store the results somewhere
Where to compute?
• Amazon Web Services
– S3 storage
– EC2 (Elastic Compute Cloud) on-demand computational facility
• Georgetown University High Performance Computer Core
– matrix.georgetown.edu
• UPENN Galaxy services
24. These RNA-Seq tools are used for mapping reads, aligning
reads and providing input for differential expression analysis
• Tuxedo suite
– Bowtie, TopHat, Cufflinks
• Trinity suite
– Inchworm, Chrysalis, Butterfly
• RUM
– RNA-Seq Unified Mapper
25. What percentage of reads are covered? What
percentage of reads are mapped?
3’ bias on transcript reads
1. 60-80% of reads are mapped
2. The highest percentage of mapped reads falls at the 3’ end of transcripts
3. Reads need to be quality trimmed
Mapping tools are biased toward the exons of known genes
28. How to visualize mapped results?
• UCSC Genome Browser (Gbrowse)
• Integrated Genome Browser (IGB)
• Integrative Genomics Viewer (IGV)
Many shared formats, reading many of the outputs generated by
the programs, ability to generate one's own tracks
34. RNA-Seq Quantification Challenge: A problem that
exists with RNA-Seq data that doesn’t exist with array
data: Longer transcripts produce more reads than
shorter transcripts
One solution to account for this is RPKM, reads per kilobase of exon per million mapped reads (Cufflinks uses FPKM, the fragment-based equivalent)
RPKM = 10^9 × C / (N × L), i.e. the read count C scaled by both the total mapped reads N (in millions) and the exon length L (in kilobases)
C(gene) = the number of mappable reads that fall onto a gene's exons
N = total number of mappable reads in the experiment
L(gene) = the summed length of the gene's exons in base pairs
Wold (2008)
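A minimal worked example of the RPKM formula above, using the same symbols C, N and L. The function name and the example numbers are made up for illustration. It shows that a gene twice as long with twice as many reads gets the same RPKM, which is the point of the length normalization.

```python
def rpkm(read_count, total_mapped_reads, exon_length_bp):
    """RPKM = 10^9 * C / (N * L), with L in base pairs."""
    if total_mapped_reads <= 0 or exon_length_bp <= 0:
        raise ValueError("N and L must be positive")
    return 1e9 * read_count / (total_mapped_reads * exon_length_bp)


# Hypothetical experiment with 20 million mappable reads.
total_reads = 20_000_000

short_gene = rpkm(read_count=500, total_mapped_reads=total_reads, exon_length_bp=2000)
long_gene = rpkm(read_count=1000, total_mapped_reads=total_reads, exon_length_bp=4000)

print(short_gene, long_gene)  # both print 12.5
```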
36. Cuffdiff produces many output files:
1. Transcript FPKM expression tracking.
2. Gene FPKM expression tracking; tracks the summed FPKM of transcripts sharing each gene_id
3. Primary transcript FPKM tracking; tracks the summed FPKM of transcripts sharing each tss_id
4. Coding sequence FPKM tracking; tracks the summed FPKM of transcripts sharing each p_id, independent
of tss_id
5. Transcript differential FPKM.
6. Gene differential FPKM. Tests differences in the summed FPKM of transcripts sharing each gene_id
7. Primary transcript differential FPKM. Tests differences in the summed FPKM of transcripts sharing each
tss_id
8. Coding sequence differential FPKM. Tests differences in the summed FPKM of transcripts sharing each p_id
independent of tss_id
9. Differential splicing tests: this tab delimited file lists, for each primary transcript, the amount of
overloading detected among its isoforms, i.e. how much differential splicing exists between isoforms
processed from a single primary transcript. Only primary transcripts from which two or more isoforms are
spliced are listed in this file.
10. Differential promoter tests: this tab delimited file lists, for each gene, the amount of overloading detected
among its primary transcripts, i.e. how much differential promoter use exists between samples. Only
genes producing two or more distinct primary transcripts (i.e. multi-promoter genes) are listed here.
11. Differential CDS tests: this tab delimited file lists, for each gene, the amount of overloading detected
among its coding sequences, i.e. how much differential CDS output exists between samples. Only genes
producing two or more distinct CDS (i.e. multi-protein genes) are listed here.
37. RNA-Seq Quantification Challenge: The DESeq method uses
the geometric mean of counts in all samples
DESeq Method:
Construct a "reference sample" by taking, for each gene, the geometric mean
of the counts in all samples.
To get the sequencing depth of a sample relative to the reference, calculate
for each gene the quotient of the counts in your sample divided by the counts
of the reference sample.
Now you have, for each gene, an estimate of the depth ratio.
Simply take the median of all the quotients to get the relative depth of the
library.
'estimateSizeFactors' function of DESeq package does this calculation.
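The steps above can be sketched in a few lines of Python. This is an illustration of the median-of-ratios idea as described on the slide, not the code of the DESeq R package. The function name, the input layout (a dict of gene to per-sample counts), and the choice to skip genes with a zero count are assumptions of the sketch.

```python
import math
from statistics import median


def estimate_size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping gene name -> list of raw counts, one per sample.
    Genes with a zero count in any sample are skipped in this sketch,
    because their geometric mean is zero.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if any(c <= 0 for c in gene_counts):
            continue
        # Log of the geometric mean across samples: the "reference sample".
        log_reference = sum(math.log(c) for c in gene_counts) / n_samples
        for sample, c in enumerate(gene_counts):
            log_ratios[sample].append(math.log(c) - log_reference)
    # The median quotient per sample is its relative sequencing depth.
    return [math.exp(median(ratios)) for ratios in log_ratios]


# Hypothetical counts: sample 2 was sequenced exactly twice as deeply.
example_counts = {
    "geneA": [10, 20],
    "geneB": [100, 200],
    "geneC": [50, 100],
    "geneD": [0, 3],  # skipped: zero count
}
size_factors = estimate_size_factors(example_counts)
print(size_factors)
```

Dividing each sample's raw counts by its size factor puts the samples on a common scale before testing for differential expression.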
38. DESeq: an R package that works with Raw Counts to
determine genes differentially expressed across samples
• Simon Anders
43. Acknowledgements
Dr. Anton Wellstein
Dr. Anna Riegel
Dr. Marcel Schmidt
Jean-Baptiste Masarati
Dr. Elena Tassi
The entire Wellstein/Riegel laboratory, including Tibari, Ghada, Ivana, and Eveline
My Committee
Dr. Yuri Gusev
Dr. Anatoly Dritschilo
Dr. Michael Johnson
Dr. Christopher Loffredo
Dr. Habtom Ressom
Dr. Terry Ryan (external committee member)
High Performance Core Group, Steve Moore, especially Woonki Chung
Amazon Cloud Services
Dr. Ann Loraine, UNC, IGB Developer
Brian Haas, Author Trinity Suite
44. Given a list of differentially expressed genes, enrichment
analysis should now be performed
• Enrichment analysis allows the researcher to leverage
documented experiments that provide evidence for genes'
roles in pathways and functions, helping the researcher
interpret the results and significance of their experiments
• DAVID
– Gene ontology
– Functional ontology
• REVIGO
– Output of DAVID may be placed in REVIGO for further
interpretation and statistical exploration of the significance of
discovered sets of genes
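The statistical idea behind enrichment analysis can be sketched as an over-representation test. The hypergeometric tail below is a common textbook choice and is an assumption of this sketch. DAVID and REVIGO use their own scoring, which is not reproduced here. All names and numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, selected, overlap):
    """Probability of seeing at least `overlap` annotated genes by chance.

    population: number of genes measured
    annotated:  genes in the population that belong to the category
    selected:   size of the differentially expressed gene list
    overlap:    genes in the list that belong to the category
    """
    total_draws = comb(population, selected)
    upper = min(annotated, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total_draws


# Hypothetical: 20,000 genes, 200 in the category, 100 differentially
# expressed genes, 10 of which fall in the category.
p = enrichment_p_value(population=20_000, annotated=200, selected=100, overlap=10)
print(p)
```

With about one category member expected by chance in a list of 100, seeing 10 gives a very small p-value. In practice the p-values for many categories would also need multiple-testing correction.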
45. Using differentially expressed genes, biological
pathways should be explored
• Differentially expressed genes are put into programs such as
Pathway Studio or Ingenuity
• Shortest path programs and canonical pathway analysis
• Enables a researcher to reverse engineer the pathways
expressed in the course of a healthy response versus a diseased
response
• Ideally a pathway explains the observed phenotype,
connecting genotype to gene expression program to
phenotype
46. FGFBP1 pathways control after induction of a conditional transgene in a mouse model:
Information derived from mRNA expression pattern analysis
Anne Deslattes Mays, Elena Tassi, Anton Wellstein
Department of Oncology and Medicine, Lombardi Cancer Center, Washington DC 20057
Abstract
Fibroblast Growth Factors (FGFs) play a significant role in embryonic development,
maintenance of tissue homeostasis in the adult as well as in different diseases. FGF-binding
proteins (FGF-BP) are secreted proteins that chaperone FGFs stored in the extracellular matrix
to their cognate receptor, and can thus modulate FGF signaling. FGF-BP1 (BP1 a.k.a. HBp17)
expression is required for embryonic survival, can modulate FGF-dependent vascular
permeability in embryos and is an angiogenic switch in human cancers. To determine the
function of BP1 in vivo, we generated tetracycline-regulated conditional BP1 transgenic mice.
BP1 expressing mice are viable, fertile and phenotypically indistinguishable from their
littermates. Five cDNA Affymetrix arrays were run on the kidneys of the FGF-BP1 transgenic
mice. Two arrays were run for the animals under doxycycline diet with the transgene switched
off, one array was run with induction of the FGF-BP1 transgene for 24 hours, one array was run
with induction for 48 hours, and one array was run with induction of the FGF-BP1 transgene for
336 hours, representing a chronic induction of the transgene. The results indicate that when
properly normalized, time series analysis of a large
array can reveal the signal transduction pathways. Pattern analysis allows for a systems
biology review of the data and allows for the exploration and generation of testable hypotheses.
Figure 3 – Heatmap scaled by probe - After RMA normalization, selection of significant over and
under expressors relative to the average of the FGFBP1 transgene being off, analysis of the heatmap
reveals mutually exclusive clusters. These clusters indicate genes that are off in one state and switched on in the
next. Cluster A represents those genes that are off with the FGFBP1 transgene being off and switched
on when the FGFBP1 transgene is activated for 24 hours. Cluster B contains those genes that are off
at 24 hours but activated when the FGFBP1 transgene is on for 48 hours. Cluster C contains those
genes that are off at 48 hours but on when the FGFBP1 transgene is on for 336 hours – or chronically.
Studying these genes in this order, and with this pattern, allows the exploration of the signal transduction
and activation pathway in response to the activation of FGFBP1 transgene.
Figure 5 – Gene Details – Details for the genes found in the clusters of Figure 1 are described
in tables A, B and C. The genes responding after activation of the FGFBP1 transgene for 24 hours
include immunoglobulin kappa chain variable 21, 3-phosphoglycerate dehydrogenase, a zinc finger
protein, neuronatin, and homeobox B8. The genes found in table B represent those genes activated
after 48 hours of the FGFBP1 transgene being on. Included in this set are hemopexin and major
urinary protein 3. Finally, after 336 hours, truly representing chronic activation of the FGFBP1
transgene, we have one gene, Reg3b, associated with inflammatory response (according to Gene
Ontology).
Figure 2 – Distinct Expression Patterns When Filtering by Thresholds at Time Points. By
creating a filter to capture the distinctive patterns expressed at each of the
separate time points, one can understand the major message being communicated at each
time point. The patterns of expression are distinctive. Panel A shows the expression patterns for
those genes above a threshold at 24 hours. Panel B shows the expression patterns for those genes
above a threshold at 48 hours, and Panel C shows the expression patterns for those genes above a
threshold at 336 hours, or at a chronic transgene expression level.
Figure 4- FGFBP1 pathways – Using Pathway Studio, the shortest path through the set of genes that
were selected from filtering by a band pass filter at each of the time points, 24 hours, 48 hours and 336
hours was constructed. The resulting selection of diseases, cell processes, and functional classes
were the result of Pathway Studio constructing the shortest path to connect those genes in the set.
Conclusions
A systems biology approach to analyzing large data sets, such as this study, which involved five full mouse
cDNA arrays, allows the researcher to capture a snapshot of the unfolding remodeling events of an
organism's response to change, stress or disease. Analyzing data in this form involves filtering the
biological signal from the noise. Filtering out the noise appropriately is essential to be able to
complete the biological story. Building on the existing knowledge base, we can complete the picture as long as
the proper context of the collection, normalization and analysis is maintained. High throughput technologies
such as microarrays and RNA sequencing, as enabled by next generation sequencing, present the
researcher with the challenge of extracting meaningful information from the measurements. Software tools
and analysis techniques are not a substitute for understanding the biological context from which the data
are collected. Engineering and digital signal processing have given us an understanding of how
to reconstruct a signal from a continual stream of noisy analog data. Sampling frequency
and proper filtering are a must to be able to sort out a meaningful signal from the noise. These same
principles apply not only to communication theory but also when studying large data sets such as those that
may be collected from high throughput systems such as an Affymetrix mouse cDNA array.
Figure 1 – Panels 0, A, B, and C illustrate ordering based upon the expression values of
the control (FGFBP1 Off), 24 hour expression (FGFBP1 On 24 hours), 48 hour expression
(FGFBP1 On 48 hours), and 336 hour expression (FGFBP1 On 336 hours). The insight
gained from this inspection includes the ability to see the relative changes of
expression at each of these time points.
Figure 6 – Graphical Gaussian Model. Using the expression profiles,
a quasi-Bayesian analysis is performed, constructing the partial
correlation network among the top expressing genes. Note that C9
(complement component 9) could not be placed in the context of the data
in the Pathway Studio diagram. However, using the partial correlations,
we are able to place it as strongly positively correlated to Serpina3k,
Cyp3a11, Mug1, Tdo2, Mup3, Hpx, weakly positively correlated to Hamp,
and strongly negatively correlated to Tex10. Together, these indicate the
placement of C9 in the endothelial response.
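To make the idea of a partial correlation concrete, here is a minimal sketch for the simplest case of three variables. A graphical Gaussian model conditions on all other genes at once, so this first-order formula is a simplified stand-in and not the analysis used for the figure. The function name and the correlation values are hypothetical.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y after removing the linear effect of z."""
    denominator = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denominator == 0.0:
        raise ValueError("z is perfectly correlated with x or y")
    return (r_xy - r_xz * r_yz) / denominator


# Hypothetical pairwise correlations between genes x, y and z.
# x and y look correlated (0.64), but both track z strongly (0.8 each),
# so the direct x-y association disappears once z is accounted for.
print(partial_correlation(r_xy=0.64, r_xz=0.8, r_yz=0.8))

# If z is unrelated to both genes, the correlation is left unchanged.
print(partial_correlation(r_xy=0.5, r_xz=0.0, r_yz=0.0))
```

This is why a partial correlation network can place a gene that a plain correlation or literature-based pathway diagram cannot: edges survive only when the association is not explained by the other measured genes.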
47. Scientific knowledge is limited (and advanced) by the
limits (and advancements) of measurement
• Ilya Shmulevich Genomic Signal Processing “Validity of the
model involves observation and measurement, scientific
knowledge is limited by the limits of measurement”
• Erwin Schrödinger Science Theory and Man: “It really is the
ultimate purpose of all schemes and models to serve as
scaffolding for any observations that are at all means
observable”
48. Before library construction, RNA quality must be assessed
Before Library Construction
1. Most vendors and cores will assess the quality of the RNA before sequencing
2. Important to determine before sequencing begins
Garbage in == garbage out
49. Cluster Generation
• In the cBot cluster system, single molecules are isothermally amplified in
a flow cell to prepare them for high-throughput sequencing
• The 8-channel flow cell of the Genome Analyzer has a dense lawn of oligos
• Single DNA molecules hybridize to the lawn of oligos
• Bound fragments are extended to make copies
• Copies are covalently bound to the flow cell's surface
• Each fragment is clonally amplified through a series of extensions
and isothermal bridge amplifications, resulting in 100s of millions of
unique clusters
• Reverse strands cleaved and washed away
• Ends are blocked
• Sequencing primer hybridized to the DNA template
• After cluster generation, libraries are ready for sequencing
50. Sequencing
• 100s of millions of clusters sequenced simultaneously
• Using 4 fluorescently labeled reversibly terminated
nucleotides
• Natural competition ensures highest accuracy
• After each round of synthesis, clusters are excited by a laser
and emit a color that identifies the newly added base
• Fluorescent label and blocking group are then removed
allowing for the addition of the next nucleotide
• Proprietary chemistry (Illumina) reads a base in each cycle
• Allows for accurate sequencing through difficult regions such
as homopolymers and repetitive sequence
52. Systems Biology History (wikipedia)
• Systems biology roots found in
– Quantitative modeling of enzyme kinetics
– Mathematical modeling of population growth
– Simulations to study neurophysiology
– Control theory and cybernetics
• Theorists
– Ludwig von Bertalanffy – General Systems Theory
– Alan Lloyd Hodgkin and Andrew Fielding Huxley – constructed a
mathematical model that explained the action potential propagating along the
axon of a neuron
– Denis Noble – first computer model of the heart pacemaker
53. Institutes of Systems Biology
• 2000 – Institutes of Systems Biology established in Seattle and
Tokyo
• After completion of Human Genome projects
• NSF grand challenge for systems biology – build a
mathematical model of the whole cell
Editor's Notes
“Nothing scientific can be said about a system for which no measurements are possible at the scale of the theory”
Erwin Schrödinger, Science Theory and Man: “It really is the ultimate purpose of all schemes and models to serve as scaffolding for any observations that are at all means observable”
“It makes no sense to apply a mathematical method that either depends on or utilizes unobservable measurements”
Capillary Gel Electrophoresis enabled the sequencing of the human genome faster, cheaper – spurred the completion of the human genome project
Sample starts with total RNA. Messenger RNA is purified by poly-A selection, then chemically fragmented and converted into single-stranded cDNA (sscDNA) using random hexamer priming. The second strand is generated to create double-stranded cDNA, which is then ready for TruSeq library construction. Blunt-ended DNA fragments are generated using a combination of fill-in reactions and exonuclease activity. An “A” base is added to the blunt ends of each strand, preparing them for ligation to the sequencing adapters.
TruSeq workflow: Blunt-end fragments are created. An A base is added to prepare for indexed adapter ligation. The final product is created, ready for amplification on either the cBot or the Cluster Station. A pooling strategy is applied to allow multiplexing on the HiSeq 2000 by using these adapters. In this way paired-end sequencing can be performed; the tags are assigned to each strand, so strandedness is preserved.
RNA-Seq allows you to discover and profile the entire transcriptome. No probes. No primers. RNA-Seq delivers unbiased, unparalleled information about the transcriptome. Simple sequencing workflow. Illumina's optimized TruSeq RNA Sample Prep Kits.
Tools
Once you get started, there are a number of tools that allow you to visualize and understand your data.
I will talk more about these tools on Thursday, when I give a talk on RNA-Seq for the Systems Biology series. But let's go back to our particular problem.