Improved Medical Education in Basic Sciences for Better Medical Practicing
ImproveMEd
Systems biology for medicine
III. How to analyze big data sets?
Systems biology studies often start with an expression profile (drug-treated versus non-treated cells, normal versus cancer cells, cells at different developmental stages), obtained using microarrays or RNA-seq. Microarrays are the cost-effective approach.
And we get this…
A microarray can fit 10 000 spots. Let’s assume that each spot is a gene: how do we organize the spots/genes in order to extract a result?
A laser scanner measures one fluorescent label, then the other, and superimposes the two images, so each spot is measured twice!
Intensity of the fluorescent signal = quantity of bound DNA
Each spot can be substituted with a number representing the relative change from ‘normal’ levels.
N = R/G (N = 1 means equal expression in both samples)
R = red fluorescence (tumor)
G = green fluorescence (normal cell)
Colors are converted to numbers, because numbers are easier to organize!
N = 1: equal expression in both samples
N > 1: induction
N < 1: repression
http://www.hhmi.org/biointeractive/how-analyze-dna-microarray-data
http://www.hhmi.org/biointeractive/scanning-lifes-matrix-genes-proteins-and-small-molecules
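The conversion from colors to numbers can be sketched in a few lines of Python. This is only an illustration: the gene names and intensity values are invented, and the log2 value is added because log-transformed ratios are what is usually clustered later.

```python
import math

def classify_spot(red, green):
    """Return (N, log2 N, label) for one spot, where N = R/G."""
    n = red / green
    if n > 1:
        label = "induction"
    elif n < 1:
        label = "repression"
    else:
        label = "equal"
    return n, math.log2(n), label

# Hypothetical (red, green) intensities for three spots.
spots = {
    "geneA": (800.0, 200.0),
    "geneB": (150.0, 600.0),
    "geneC": (300.0, 300.0),
}

results = {name: classify_spot(r, g) for name, (r, g) in spots.items()}
for name, (n, log2_n, label) in results.items():
    print(name, n, log2_n, label)
```

On the log2 scale a 4-fold induction and a 4-fold repression become +2 and -2, which is why the ratios are usually log-transformed before further analysis.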
We can compare many samples, or we can follow one sample over time, e.g. human fibroblasts stimulated with serum and followed for 24 hours (Iyer et al. 1999).
The genes are then organized so that the induced ones are clustered at one end, opposite from the repressed ones.
Such a presentation of data is called a heat map.
To extract knowledge from big data we need statistical methods!
Commonly used: limma, a package for the R statistical environment.
To identify clusters we can use cluster analysis!
The original ratios are log-transformed (base 2 or 10), and then we proceed by calculating similarity scores, using a computer program accompanying the microarray platform.
For visual presentation of the data we turn the numbers into colors again, but this time green means repression and red means induction.
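A minimal sketch of the two steps named above, log transformation followed by a similarity score. The slide does not say which score the platform software uses. Pearson correlation is assumed here as one common choice, and the time-course ratios are invented.

```python
import math

def log2_profile(ratios):
    """Log-transform (base 2) a list of expression ratios."""
    return [math.log2(r) for r in ratios]

def pearson(x, y):
    """Pearson correlation between two equally long profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical ratios (N = R/G) for three genes at four time points.
ratios = {
    "geneA": [1.0, 2.0, 4.0, 8.0],
    "geneB": [1.0, 2.0, 4.0, 16.0],
    "geneC": [1.0, 0.5, 0.25, 0.125],
}
logged = {gene: log2_profile(values) for gene, values in ratios.items()}

# Which gene has the profile most similar to geneA?
others = [gene for gene in logged if gene != "geneA"]
best_partner = max(others, key=lambda g: pearson(logged["geneA"], logged[g]))
print(best_partner)
```

Genes with high mutual similarity end up next to each other in the clustered heat map, while anti-correlated genes (here geneA and geneC) end up at opposite ends.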
Another way of presenting data is the volcano plot (common for GWS studies).
The data are presented in a scatter plot in order to quickly find the most interesting candidates, e.g. candidate genes in some disease.
It combines a measure of statistical significance (e.g., a p value from an ANOVA model) with the magnitude of the change.
It allows quick visual identification of data points (genes, etc.) that display large-magnitude changes that are also statistically significant.
Figure annotations (volcano plot):
- the border between p > 0.05 and p < 0.05
- the difference between the same parameter in two samples, presented as ‘fold change’
- in grey: changes smaller than 2x
- labels: statistical significance; interesting data
http://genomicsclass.github.io/book/pages/using_limma.html
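The selection that a volcano plot makes visible can be sketched as follows. The thresholds (2x change, p < 0.05) come from the figure annotations. The gene names, fold changes and p values are invented, and the axes are assumed to be the usual log2 fold change and -log10 p.

```python
import math

def volcano_point(fold_change, p_value):
    """Coordinates of one gene: (log2 fold change, -log10 p)."""
    return math.log2(fold_change), -math.log10(p_value)

def is_interesting(fold_change, p_value, min_fold=2.0, alpha=0.05):
    """Large change (at least min_fold up or down) AND p below alpha."""
    x, _ = volcano_point(fold_change, p_value)
    return abs(x) >= math.log2(min_fold) and p_value < alpha

# Hypothetical (fold change, p value) pairs.
genes = {
    "geneA": (4.0, 0.001),   # large and significant
    "geneB": (1.2, 0.0001),  # significant but small change
    "geneC": (0.2, 0.01),    # 5-fold repression, significant
    "geneD": (3.0, 0.2),     # large change, not significant
}

hits = [name for name, (fc, p) in genes.items() if is_interesting(fc, p)]
print(hits)
```

Using the absolute value of the log2 fold change treats induction and repression symmetrically, which is why both upper corners of the plot contain interesting data.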
Both the heat map and the volcano plot (and the statistical analysis behind them) are the first step toward identifying and ranking the genes/proteins behind an observed phenotype. The generated lists of genes, which may be responsible for the observed mechanisms or may be potential therapy targets, are further processed by different bioinformatics tools.
The gene list can be fed into: Gene Ontology, Gene Set Enrichment Analysis, Transcription Factor Analysis…
The generated lists have to use a single, unified nomenclature in order to be mutually comparable.
Gene Ontology – http://geneontology.org/
A bioinformatics resource useful for describing gene products with a standardized vocabulary and for connecting molecular changes to cellular processes.
Many genes and proteins are conserved across most living organisms and have shared functions. Finding the role of a gene in one organism can help illuminate its role in another. The Gene Ontology Consortium maintains a shared vocabulary for describing these roles.
Terms are organized according to:
-Biological process
-Molecular function
-Cellular component
The Gene Ontology Consortium, Nature Genetics, 2000.
Biological processes include cell growth, proliferation, translation or cAMP synthesis…
Figure: Gene Ontology graph for a cellular compartment, showing parent nodes and children nodes.
Figure: example annotation table with the columns systematic ORF name, standard gene name, GO biological process, molecular function, cellular component.
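The parent/children structure of the ontology can be illustrated with a toy graph. The term names and edges below are simplified examples chosen for illustration. They are not copied from a Gene Ontology release. The point is only that a term may have several parents and that annotating a gene with a term implies all of that term's ancestors.

```python
# Toy ontology: each term maps to its direct parents (illustrative edges only).
parents = {
    "biological process": [],
    "metabolic process": ["biological process"],
    "biosynthetic process": ["metabolic process"],
    "gene expression": ["metabolic process"],
    "translation": ["biosynthetic process", "gene expression"],
}

def ancestors(term, parents):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

implied = ancestors("translation", parents)
print(sorted(implied))
```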
Gene set enrichment analysis – GSEA
An analytical method designed for finding and interpreting coordinated changes in sets of genes.
Looking for genes that change together:
- determining levels of proteins participating in the same signaling pathway
- looking for molecules participating in the same biological process
A free software package with an initial database of 1,325 biologically defined gene sets.
http://software.broadinstitute.org/gsea/index.jsp
Subramanian et al. (2005) PNAS 102:15545
1. Sort the genes according to a criterion, e.g. expression level.
2. Compare your ranked list with already existing gene sets and calculate an ‘enrichment score’ for each set. The score shows whether the genes of the set are over-represented at the top or at the bottom of the list, according to a Kolmogorov-Smirnov-type statistic.
3. The Max Enrichment Score (MES) is an indicator of the relevance of an existing gene set for the new data set being investigated.
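The three steps above can be sketched with a simplified, unweighted running-sum statistic. This is an illustration of the Kolmogorov-Smirnov-type idea only: the GSEA software also weights genes by their correlation with the phenotype and estimates significance by permutation, both of which are left out here. The gene names are invented.

```python
def enrichment_score(ranked_genes, gene_set):
    """Walk down the ranked list: step up at set members, down otherwise.

    Returns the largest deviation of the running sum from zero
    (positive: set concentrated at the top, negative: at the bottom).
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Step 1: genes already sorted from most induced to most repressed (hypothetical).
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]

# Step 2: compare with existing gene sets (hypothetical).
es_top = enrichment_score(ranked, {"g1", "g2", "g3"})
es_bottom = enrichment_score(ranked, {"g6", "g7", "g8"})
es_spread = enrichment_score(ranked, {"g1", "g4", "g8"})
print(es_top, es_bottom, es_spread)
```

A set whose members sit together at one end of the list gets a score near +1 or -1, while a set scattered across the list stays close to zero.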
Transcription Factor Analysis
Genes whose expression level has changed may have been regulated by the same transcription factor.
Such genes are identified by combining omics data and prior knowledge.
The ChEA database currently links 159 transcription factors to more than 30,000 genes, a total of 361,299 interactions extracted from 157 publications.
TRANSFAC, PAINT, JASPAR: other resources for transcription factor binding analysis
Kinase Enrichment Analysis (KEA)
Web-based and command-line software that links lists of mammalian proteins with the protein kinases that likely phosphorylate them. The database contains 436 kinases and 14,374 interactions from 3,469 publications.
http://amp.pharm.mssm.edu/Enrichr/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2944209/
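Both kinds of analysis ask the same question: does the input list overlap with the known targets of one regulator more than expected by chance? A minimal sketch using a hypergeometric tail probability is shown below. The text does not state which statistic ChEA or KEA applies, so this test is an assumption, and the protein identifiers and substrate set are invented.

```python
from math import comb

def hypergeom_tail(k, K, n, N):
    """P(overlap >= k) when n items are drawn from N, K of which are targets."""
    total = comb(N, n)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)
    )
    return favourable / total

# Hypothetical background of 20 proteins, 5 of them substrates of one kinase.
background = {f"P{i}" for i in range(1, 21)}
kinase_substrates = {"P1", "P2", "P3", "P4", "P5"}

# Hypothetical input list from an experiment.
input_list = {"P1", "P2", "P3", "P10"}

overlap = len(input_list & kinase_substrates)
p_value = hypergeom_tail(
    overlap, len(kinase_substrates), len(input_list), len(background)
)
print(overlap, p_value)
```

In practice this test is repeated for every kinase or transcription factor in the database, so the resulting p values need correction for multiple testing before regulators are ranked.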
A number of transcription factors act at the same time on the same promoter…
Chromatin immunoprecipitation is the method of choice for finding all DNA sequences that interact with a given protein. Data from all ChIP-seq experiments can be fed into the same database (ChEA)…
https://galaxyproject.org/tutorials/chip/
Expression2Kinases – X2K
Software that combines different databases and tools.
INPUT: a list of differentially expressed genes
OUTPUT: protein kinases, transcription factors and protein complexes that are putative regulators of the input genes.
Using such software we can construct hypothetical regulatory pathways and protein interaction networks.
The results need experimental proof of concept!
The workflow of X2K
Chen et al. (2012) Bioinformatics 28:105
What we really want is to transform the list into a network, a form often used to present interactions between cellular components.
Euler, 1700s, Seven Bridges of Königsberg
Node = molecule
Edge = interaction
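In code, such a network is simply a set of nodes with a list of edges between them. The sketch below builds an undirected network from invented interactions and counts the number of partners (the degree) of each molecule.

```python
from collections import defaultdict

def build_network(edges):
    """Undirected network: each node maps to the set of its neighbours."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency

# Hypothetical interactions (edges) between molecules (nodes).
edges = [
    ("kinaseA", "tfB"),
    ("tfB", "geneC"),
    ("kinaseA", "geneC"),
    ("geneC", "proteinD"),
]

network = build_network(edges)
degree = {node: len(neighbours) for node, neighbours in network.items()}
hub = max(degree, key=degree.get)
print(degree, hub)
```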
Types of networks relevant to systems biology
1. Cell Signaling Networks
- cancer signaling network
doi:10.1038/psp.2013.38
2. Protein-Protein Interaction Networks
- Dystrophin protein-protein interactions
http://parendogen677s10.weebly.com/protein-protein-interactions.html
3. Gene Regulatory Networks
- Development of the Drosophila eye
http://dev.biologists.org/content/140/1/82
Genes2Networks
Lists2Networks
Combines experimental data (mRNA expression microarrays, genome-wide ChIP-X, RNAi screens, proteomics & phosphoproteomics) with a background network of all known interactions (prior biological knowledge)
http://www.lists2networks.org
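One simple way to combine an experimental list with a background network is to keep the listed molecules (seeds) plus any intermediate that links at least two of them. This rule is an assumption made for illustration and is not the algorithm of Genes2Networks or Lists2Networks. The node names and background edges are invented.

```python
def seed_subnetwork(background_edges, seeds):
    """Return the edges among seeds and intermediates touching >= 2 seeds."""
    neighbours = {}
    for a, b in background_edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seeds = set(seeds)
    # Non-seed nodes that are directly connected to at least two seeds.
    intermediates = {
        node
        for node, partners in neighbours.items()
        if node not in seeds and len(partners & seeds) >= 2
    }
    keep = seeds | intermediates
    return sorted(
        tuple(sorted((a, b)))
        for a, b in background_edges
        if a in keep and b in keep
    )

# Hypothetical background network of known interactions.
background_edges = [
    ("A", "X"), ("X", "B"), ("B", "C"),
    ("C", "Y"), ("Y", "Z"), ("A", "W"),
]

# Hypothetical experimental hit list.
subnetwork = seed_subnetwork(background_edges, ["A", "B", "C"])
print(subnetwork)
```

Here X is kept because it connects the seeds A and B, while W, Y and Z are dropped because each touches at most one seed.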
Additional software exists for visualisation and analysis of networks:
Pajek (Vladimir Batagelj & Andrej Mrvar, Ljubljana, Slovenia)
http://vlado.fmf.uni-lj.si/pub/networks/doc/gd.01/Pajek2.png
http://vlado.fmf.uni-lj.si/pub/networks/doc/pajek.pdf
Cytoscape (Trey Ideker; Shannon et al., 2003)
http://www.cytoscape.org/
SNAVI (Ma’ayan et al. 2009)
yEd…
These tools allow identification of pathways, subnetworks, clusters and special features of a network…
Molecular data could be further integrated with structural data in order to produce 3D models (macromolecular complexes, virtual cells)…
Patwardhan et al. 2017, DOI: 10.7554/eLife.25835 (erythrocytes infected with Plasmodium)
1. Statistical analysis is critical for extracting knowledge about a system from big data sets. Statistical analysis generates a list of genes/proteins/RNAs relevant for the study.
2. The list of genes can be fed into software (bioinformatics tools) and combined with prior knowledge in order to predict new pathways, subnetworks, regulatory mechanisms…
3. Integration of experimental big data and prior knowledge (multiple databases) allows multiscale understanding of physiological functions, pathophysiology or pharmacokinetics.
4. Computationally generated predictions have to be verified experimentally.
