The document outlines the general pipeline for transcriptomics analysis based on microarray experiments. It discusses the main steps which include quality control, normalization, annotation, differential expression analysis, clustering, and supplemental analyses such as functional enrichment and transcription factor binding site analysis. Key points within each step are highlighted, such as common normalization and differential expression methods, different clustering algorithms, and tools used for enrichment and transcription factor analysis.
2. GENERAL PIPELINE OF ANALYSIS
[Figure: pipeline diagram. Raw data → Quality Control → Controlled raw data → Normalisation → Normalised data → Quality Control → Controlled normalised data → Annotation → Annotated normalised data → Differential Expression → Differentially expressed genes → Clustering → Differentially expressed modules]
3. PIPELINE OF ANALYSIS
NORMALISATION STEP
4. PIPELINE OF ANALYSIS
NORMALISATION STEP
Steps of Robust Multi-array Average (RMA):
1) Background correction
2) Normalisation (across arrays)
3) Probe level intensity calculation
4) Probe set summarisation
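The across-array normalisation step of RMA is usually quantile normalisation. The following is a minimal sketch of that single step, not the full RMA procedure: it assumes background-corrected intensities are already available as one list per array, and the function name and toy values are made up for illustration.

```python
def quantile_normalise(arrays):
    """Quantile-normalise a list of arrays (one list of intensities per array).

    Every array ends up with the same empirical distribution: the value of
    rank r is replaced by the mean of the rank-r values across all arrays.
    Ties are broken by position, which is a simplification.
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # Sort each array, then average across arrays rank by rank.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n_probes)
    ]

    normalised = []
    for a in arrays:
        # order[r] is the index of the probe holding rank r in this array.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = rank_means[rank]
        normalised.append(out)
    return normalised


# Toy example: 3 arrays, 4 probes each (made-up log2 intensities).
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalised = quantile_normalise(arrays)
```

After this step every array contains the same set of values, only assigned to different probes, which is what makes the arrays comparable before summarisation.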
5. PIPELINE OF ANALYSIS
NORMALISATION STEP
Other normalisation algorithms
A few examples:
1) MAS 5.0
2) gcRMA
3) Li-Wong
…
Evaluating them can take a long time:
- implement the different algorithms
- find the best way to compare them
6. PIPELINE OF ANALYSIS
QUALITY CONTROL STEP
8. PIPELINE OF ANALYSIS
QUALITY CONTROL STEP
3 Main tests :
Mean-Average plots
[Figure: MA plots before normalisation and after normalisation]
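As an illustration of what an MA plot displays, here is a small sketch that computes the two coordinates for a pair of arrays. It assumes the usual definition (M is the log2 ratio, A is the mean log2 intensity) and raw, strictly positive intensities; the function name and values are hypothetical.

```python
import math


def ma_values(intensities_1, intensities_2):
    """Return (M, A) lists for two arrays of raw (non-log) intensities.

    M = log2(x1) - log2(x2)        log ratio, plotted on the y-axis
    A = (log2(x1) + log2(x2)) / 2  mean log intensity, plotted on the x-axis
    """
    m_values, a_values = [], []
    for x1, x2 in zip(intensities_1, intensities_2):
        l1, l2 = math.log2(x1), math.log2(x2)
        m_values.append(l1 - l2)
        a_values.append((l1 + l2) / 2)
    return m_values, a_values


# Toy intensities for three probes on two arrays.
m, a = ma_values([8.0, 16.0, 64.0], [8.0, 4.0, 256.0])
```

On well-normalised data the cloud of M values is expected to be centred on zero along the whole range of A.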
9. PIPELINE OF ANALYSIS
ANNOTATION STEP
10. PIPELINE OF ANALYSIS
ANNOTATION STEP
Step 1: Annotate Transcript Cluster (TC) probes
GENERAL CASE:
Transcript ID: 00 000 000 + Gene Accession: NM… // Gene Name // Description
ELSE:
Transcript ID: 00 000 000 + Gene Accession: - - - or NA + mRNA Accession: NM…
EXCEPTIONS:
Transcript ID: 00 000 000 + Gene Accession: - - - or NA + mRNA Accession: ?? // NONCODE // Linc…
Transcript ID: 00 000 000 + Gene Accession: - - - or NA + mRNA Accession: ?? // NONCODE // ??
KEEP ONLY Category: Main or Rescue
11. PIPELINE OF ANALYSIS
ANNOTATION STEP
Step 2: From TC probes to genes
[Diagram: TC probes fall into three cases, “single gene”, NA and “multiple probes”; multiple probes are resolved by maximum signal, followed by verification (isoforms …)]
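The "multiple probes" case can be sketched as follows. This is only one possible reading of "maximum signal": it assumes the probe with the highest mean signal across samples is kept for each gene and that unannotated (NA) probes are dropped. The probe IDs, data layout and function name are hypothetical, and the isoform verification step is not covered.

```python
def collapse_to_genes(expression, probe_to_gene):
    """Keep, for each gene, the probe with the highest mean signal.

    expression:    {probe_id: [signal per sample]}
    probe_to_gene: {probe_id: gene symbol, or None when unannotated (NA)}
    Returns {gene: (selected probe_id, signals)}.
    """
    best = {}
    for probe_id, signals in expression.items():
        gene = probe_to_gene.get(probe_id)
        if gene is None:
            continue  # NA probes are dropped in this sketch
        mean_signal = sum(signals) / len(signals)
        if gene not in best or mean_signal > best[gene][0]:
            best[gene] = (mean_signal, probe_id, signals)
    return {gene: (pid, sig) for gene, (_, pid, sig) in best.items()}


# Hypothetical probes: two map to the same gene, one is unannotated.
expression = {
    "TC001": [5.0, 6.0, 7.0],
    "TC002": [8.0, 9.0, 10.0],
    "TC003": [4.0, 4.0, 4.0],
    "TC004": [1.0, 1.0, 1.0],
}
probe_to_gene = {"TC001": "TLR7", "TC002": "TLR7", "TC003": "MYD88", "TC004": None}
genes = collapse_to_genes(expression, probe_to_gene)
```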
12. PIPELINE OF ANALYSIS
DIFFERENTIAL EXPRESSION STEP
13. PIPELINE OF ANALYSIS
DIFFERENTIAL EXPRESSION STEP
Limma: Linear Models for Microarray & RNA-seq data
- Limma is a package for the analysis of gene expression data arising from microarray or RNA-Seq technologies.
- A core capability is the use of linear models to assess differential expression in the context of multi-factor designed experiments.
- Limma provides the ability to analyse comparisons between many RNA targets simultaneously.
- It has features that make the analyses stable even for experiments with a small number of arrays.
14. PIPELINE OF ANALYSIS
DIFFERENTIAL EXPRESSION STEP
Selection of DE threshold :
Volcano-plots
[Figure: volcano plot “HIV1AZTvsNS on genes”. x-axis: log2(fold change), from −3 to 3; y-axis: −log10(adjusted P-value), from 0 to 8; threshold lines at 1% and 5%; regions labelled UP and DOWN]
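The thresholding behind a volcano plot can be sketched as below. This is an illustration under assumptions, not the pipeline's actual code: p-values are adjusted with the Benjamini-Hochberg procedure, genes are then called UP or DOWN from a fold-change cut-off and an adjusted p-value cut-off, and the default cut-offs, function names and toy values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def classify(log2_fc, adj_p, fc_cutoff=1.0, alpha=0.05):
    """Call a gene UP, DOWN or NS (not significant)."""
    if adj_p <= alpha and log2_fc >= fc_cutoff:
        return "UP"
    if adj_p <= alpha and log2_fc <= -fc_cutoff:
        return "DOWN"
    return "NS"


# Toy values for four genes.
log2_fc = [2.5, -1.8, 0.3, 1.2]
p_values = [0.005, 0.01, 0.03, 0.04]
adj = benjamini_hochberg(p_values)
calls_5pct = [classify(fc, p, alpha=0.05) for fc, p in zip(log2_fc, adj)]
calls_1pct = [classify(fc, p, alpha=0.01) for fc, p in zip(log2_fc, adj)]
```

Comparing the calls at 5% and at 1% shows how strongly the choice of threshold changes the size of the DE list.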
15. PIPELINE OF ANALYSIS
DIFFERENTIAL EXPRESSION STEP
Representation of DE probes / genes :
Heat-maps
- Derived from the heatplot() function.
- Several parameters have to be adapted manually to each case.
- Optimise the function so that it detects the best parameters automatically from the data matrix.
[Figure: heat-map with colour key from −3 to 3. Columns: 27 samples (BDCA1, BDCA3 and pDC; CTRL, JK and R5; D1 to D3). Rows: TLR10, TLR8, AIM2, TLR3, IRAK2, TICAM1, IKBKG, TRAF3, IKBKB, NFKB1, NFKB2, TANK, TRAF6, IRAK4, IFI16, MYD88, MB21D1, TLR7, AZI2, TMEM173, TLR6, TBK1, IKBKE, TLR4, IRAK3, TLR2, TLR5, ZBP1, IRAK1, IRF3]
16. PIPELINE OF ANALYSIS
CLUSTERING STEP
18. CLUSTERING
HIERARCHICAL CLUSTERING
Connectivity-based clustering: hierarchical clustering
- Hierarchical clustering (HCA) seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:
  - Agglomerative or "bottom up" approach = each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
  - Divisive or "top down" approach = all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
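The agglomerative ("bottom up") strategy can be sketched in a few lines. This is a naive illustration that assumes Euclidean distance and average linkage and stops at a requested number of clusters; it is far slower than library implementations, and the function name and toy points are made up.

```python
import math


def agglomerative(points, n_clusters):
    """Average-linkage agglomerative clustering down to n_clusters clusters.

    Returns a list of clusters, each a sorted list of point indices.
    """
    # Every observation starts in its own cluster.
    clusters = [[i] for i in range(len(points))]

    def average_distance(c1, c2):
        total = sum(math.dist(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge it.
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: average_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return sorted(clusters)


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters = agglomerative(points, n_clusters=3)
```

Recording the order and height of each merge, instead of stopping at a fixed number of clusters, would give the dendrogram.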
20. CLUSTERING
K-MEANS
Centroid-based clustering: K-means
[Figure: two plots of the within-groups sum of squares (WGSS) against the number of clusters (2 to 14), and the ordered list of DE genes split into clusters A to H]
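The WGSS curve used to pick the number of clusters can be reproduced with a small sketch. It assumes Lloyd's algorithm with a deterministic initialisation (the first k points), which is a simplification chosen to keep the example reproducible; real runs normally use several random starts. Names and toy points are hypothetical.

```python
import math


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm; the first k points are the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(x) / len(members) for x in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


def wgss(points, labels, centroids):
    """Within-groups sum of squares: squared distance of each point to its centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


# Three well-separated toy groups of two points each.
points = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0), (1.2, 0.8), (5.1, 4.9), (9.2, 1.1)]
wgss_by_k = {}
for k in range(1, 5):
    labels, centroids = kmeans(points, k)
    wgss_by_k[k] = wgss(points, labels, centroids)
```

Plotting `wgss_by_k` against k gives the elbow curve: here the drop is large up to k = 3 and small afterwards.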
21. CLUSTERING
SOM
Centroid-based clustering: self-organising map (SOM)
- Similar to the K-means algorithm.
- The centroids have a predetermined topographic ordering relationship.
- At each update, the centroids near the selected one are also updated.
[Figure: standard SOM algorithm]
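The difference with K-means (neighbouring centroids move together) can be illustrated with a one-dimensional map. This sketch is only one simple variant: it assumes a Gaussian neighbourhood on the grid and linearly decaying learning rate and radius, and every parameter name and default value is hypothetical.

```python
import math
import random


def train_som(data, n_nodes, n_epochs=50, lr0=0.5, radius0=None, seed=0):
    """Train a 1-D self-organising map; node i sits at grid position i."""
    rng = random.Random(seed)
    dim = len(data[0])
    radius0 = radius0 if radius0 is not None else n_nodes / 2
    weights = [list(rng.choice(data)) for _ in range(n_nodes)]
    for epoch in range(n_epochs):
        frac = epoch / n_epochs
        lr = lr0 * (1 - frac)                    # learning rate decays to 0
        radius = max(radius0 * (1 - frac), 0.5)  # neighbourhood shrinks
        for x in rng.sample(data, len(data)):
            # Best matching unit: the centroid closest to the observation.
            bmu = min(range(n_nodes), key=lambda i: math.dist(x, weights[i]))
            for i in range(n_nodes):
                # Nodes close to the BMU on the grid are pulled along with it.
                h = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    weights[i][d] += lr * h * (x[d] - weights[i][d])
    return weights


data = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 0.8), (0.5, 0.5)]
weights = train_som(data, n_nodes=3)
```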
22. CLUSTERING
EM
Distribution-based clustering: expectation-maximisation (EM)
- Iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables.
- An expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters.
- A maximisation (M) step, which computes parameters maximising the expected log-likelihood found in the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.
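The alternation between the E and M steps can be shown on the simplest case. This sketch assumes a mixture of two one-dimensional Gaussians with a crude initialisation and a fixed number of iterations; it is an illustration of the principle, not the model-based clustering actually used on the expression matrix.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(xs, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture by EM."""
    # Crude initialisation (an assumption of this sketch).
    means = [min(xs), max(xs)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E step: responsibility of each component for each observation,
        # computed with the current parameter estimates.
        resp = []
        for x in xs:
            dens = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M step: parameters maximising the expected log-likelihood.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var_k, 1e-6)  # floor to avoid a collapsing variance
    return weights, means, variances


xs = [0.8, 1.0, 1.2, 0.9, 1.1, 4.8, 5.0, 5.2, 4.9, 5.1]
weights, means, variances = em_two_gaussians(xs)
```

The responsibilities from the final E step give the soft cluster membership of each observation.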
23. CLUSTERING
EM
Distribution-based clustering: EM
[Figure: two plots of BIC against the number of components (2 to 8) for the models EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV and VVV, and the ordered list of DE genes split into clusters A to H]
24. CLUSTERING
DBSCAN
Density-based clustering: DBSCAN
- Finds a number of clusters starting from the estimated density distribution of the corresponding nodes.
- 2 parameters: a radius R, and the minimum number of points M that must be found within R for the points to be considered part of the same cluster.
[Figure legend: A = CORE points; B & C = DENSITY-REACHABLE points; N = NOISE point]
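The roles of core, density-reachable and noise points can be made concrete with a minimal DBSCAN. The sketch assumes Euclidean distance and counts a point among its own neighbours; `radius` and `min_points` correspond to R and M, and the toy points are made up.

```python
import math

NOISE = -1


def dbscan(points, radius, min_points):
    """Minimal DBSCAN. Returns one label per point: 0, 1, ... or NOISE.

    A point is a core point when at least min_points points (itself
    included) lie within radius of it.
    """
    def neighbours(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= radius]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_points:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # density-reachable border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_points:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    return labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.2, 1.0), (10.0, 10.0)]
labels = dbscan(points, radius=1.5, min_points=4)
```

In this toy example the first four points are core points, the fifth is density-reachable from one of them, and the last one is noise.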
25. CLUSTERING
WGCNA
“Network”-based clustering: WGCNA
- WGCNA clusters datasets by quantifying not only the correlations between individual pairs of genes, but also the extent to which these genes share the same neighbours.
- The WGCNA R package can be used through 2 methodologies, which we compared. We chose the “N-step” methodology:
  - Manual selection of parameters
  - Noise taken into account
  - Merging analysis
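The idea of "sharing the same neighbours" is usually captured by the topological overlap measure. The sketch below assumes a symmetric weighted adjacency matrix with values in [0, 1] is already available (for example from soft-thresholded correlations) and applies the commonly used formula TOM_ij = (sum_u a_iu a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij). It is written in plain Python for illustration and is not the WGCNA package code; the matrix values are made up.

```python
def topological_overlap(adjacency):
    """Topological overlap matrix from a symmetric weighted adjacency matrix."""
    n = len(adjacency)
    # Connectivity k_i: sum of the connection strengths to all other genes.
    connectivity = [
        sum(adjacency[i][u] for u in range(n) if u != i) for i in range(n)
    ]
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Strength of the neighbours shared by genes i and j.
            shared = sum(
                adjacency[i][u] * adjacency[u][j] for u in range(n) if u not in (i, j)
            )
            a_ij = adjacency[i][j]
            value = (shared + a_ij) / (min(connectivity[i], connectivity[j]) + 1 - a_ij)
            tom[i][j] = tom[j][i] = value
    return tom


# Toy adjacency for three genes: genes 0 and 1 are strongly connected.
adjacency = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
tom = topological_overlap(adjacency)
```

Clustering on 1 - TOM rather than on 1 - correlation is what produces the dendrogram that is then cut into modules.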
26. CLUSTERING
WGCNA
“Network”-based clustering: WGCNA
[Figure: dendrogram (hclust, "average"; height from 0.5 to 1.0) with two module assignments, Dynamic Tree Cut and Merging Dynamic, each with a panel covering Module 0 to Module 7]
28. SUPPLEMENTAL STEPS
VENN DIAGRAMS
Specific analysis: Venn-n comparison
INPUT
Lists of genes & data matrix
ALGORITHM
Step 1: Calculate all possible comparisons (2^n for n lists)
Step 2: Determine the membership of all the genes
Step 3: For each possible comparison
- Find the corresponding genes
- Organise them by HC using their expression values
OUTPUT
List of genes organised by comparison & sub-organised by HC
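Steps 1 to 3 can be sketched as follows. The example assumes the input is a dictionary of named gene lists and enumerates the 2^n - 1 non-empty combinations; within each region genes are simply sorted by name, so the hierarchical-clustering sub-ordering is left out. All names and gene identifiers are hypothetical.

```python
from itertools import combinations


def venn_regions(gene_lists):
    """Map each non-empty combination of list names to the genes found in
    exactly those lists and in no other."""
    names = sorted(gene_lists)
    sets = {name: set(genes) for name, genes in gene_lists.items()}
    all_genes = set().union(*sets.values())
    # Step 2: membership of every gene.
    membership = {g: tuple(n for n in names if g in sets[n]) for g in all_genes}
    # Steps 1 and 3: enumerate the combinations and collect their genes.
    regions = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            regions[combo] = sorted(g for g, m in membership.items() if m == combo)
    return regions


gene_lists = {
    "A": ["g1", "g2", "g3"],
    "B": ["g2", "g3", "g4"],
    "C": ["g3", "g5"],
}
regions = venn_regions(gene_lists)
```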
31. SUPPLEMENTAL STEPS
ENRICHMENT ANALYSIS
Functional Enrichment analysis
DAVID: Database for Annotation, Visualisation & Integrated Discovery
Some numbers about DAVID:
- More than 16 000 citations
- Around 40 Nature-branded citations
- Daily usage: 1200 gene lists/sublists & 400 unique researchers
[Figure: chart over the years 2004 to 2014, y-axis from 0 to 4000]
32. SUPPLEMENTAL STEPS
ENRICHMENT ANALYSIS
Functional Enrichment analysis
Gene Set Enrichment Analysis (GSEA)
1) Get a ranked list L of all the genes on the chip, based on a chosen measure (expression, FC…).
2) For each gene set S: find the location of each gene s in S within L.
3) Generate an enrichment score ES for S based on a running-sum statistic: “reward” the presence of s toward the top or bottom of L.
4) Analyse the significance of this Kolmogorov-Smirnov type statistic by permutation testing.
5) Multiple hypothesis testing (MHT) error control for multiple S’s using the estimated false discovery rate (FDR).
[Figure: ranked list of genes between conditions A and B, and the running sum along L]
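The running-sum statistic and its permutation test can be sketched as below. This is the simple unweighted (Kolmogorov-Smirnov like) form, with each hit adding 1/N_hits and each miss subtracting 1/N_miss, and the permutation draws random gene sets of the same size. Both choices are assumptions of this sketch rather than the exact GSEA software behaviour, and the gene names are made up.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Maximum deviation from zero of the running sum along the ranked list."""
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Empirical p-value from random gene sets of the same size."""
    rng = random.Random(seed)
    observed = abs(enrichment_score(ranked_genes, gene_set))
    size = sum(g in gene_set for g in ranked_genes)
    extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, size))
        if abs(enrichment_score(ranked_genes, random_set)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)


ranked = ["g%d" % i for i in range(1, 11)]        # g1 is top-ranked
es_top = enrichment_score(ranked, {"g1", "g2", "g3"})
es_bottom = enrichment_score(ranked, {"g8", "g9", "g10"})
p_top = permutation_p_value(ranked, {"g1", "g2", "g3"})
```

A positive score means the set is concentrated toward the top of L, a negative score toward the bottom; the p-values of many sets would then be corrected together, for instance by FDR.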
33. SUPPLEMENTAL STEPS
TF ANALYSIS
TF Binding Site analysis: oPOSSUM-3
Input:
- Gene list or gene sequences
- 116 TF binding site motifs in the ChIP-Seq JASPAR db (2010)
- Selection of distances upstream and/or downstream of the gene coding sequence
Processing:
- Find the specific sequence upstream and/or downstream of a gene sequence
- Compute over-represented motifs
- Comparison to TF binding sequences (Z and F score)
Output:
- List of TFs hierarchically organised by Z & F score
- Automatic representation of F or Z score vs %GC content
35. SUPPLEMENTAL STEPS
TF ANALYSIS
TF Binding Site analysis: other tools
Databases:
- TRANSFAC database v7.4 through GSEA (615 gene sets)
- JASPAR Core database (205 non-redundant motifs)
- UCSC database (690 gene sets) based on data from ENCODE TFBS ChIP-seq production groups (from 2007 to 2012)
Tools:
- HOMER: TRANSFAC database + own motif database through ChIP-Seq analysis
- GSEA: using TRANSFAC db on enriched motifs in a ranked list of genes
- DAVID: using UCSC database
- oPOSSUM: using JASPAR (116 Transcription Factors)
36. PIPELINE OF ANALYSIS
NETWORK STEP
General overview:
- Reverse engineering models:
  - Algorithm for the Reconstruction of Accurate Cellular Networks = ARACNe
  - Tool for Inferring Network of Genes = TINGe
- Other models:
  - Functional networks = Cytoscape software (BiNGO / ClueGO plugins)
37. PIPELINE OF ANALYSIS
NETWORK STEP
Data networks: ARACNe
- A reverse engineering method specifically designed to scale up to the complexity of regulatory networks in mammalian cells.
- ARACNe defines an edge as an irreducible statistical dependency between gene expression profiles that cannot be explained as an artifact of other statistical dependencies in the network.
- An edge is likely to identify direct regulatory interactions mediated by a transcription factor binding to a target gene's promoter region, although other types of interactions may also be identified.
38. PIPELINE OF ANALYSIS
NETWORK STEP
Data networks: TINGe
- Reverse engineers genome-scale gene networks from a large number of expression profiles, based on mutual information (MI), the data processing inequality (DPI) and permutation testing to assess the statistical significance of each inferred edge.
- TINGe can be used for directly constructing high-quality networks, or it can be used as a component along with other types of data in building probabilistic networks.
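The two ingredients named above, mutual information and the data processing inequality, can be illustrated with a small sketch. It assumes expression profiles have already been discretised into symbols and that MI values are stored per gene pair. The DPI rule shown (drop the weakest edge of every fully connected triplet) is the usual textbook formulation, not the actual ARACNe or TINGe code, and permutation testing is left out. Names and values are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(x, y):
    """MI (in nats) between two equally long sequences of discrete symbols."""
    n = len(x)
    count_x, count_y, count_xy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in count_xy.items()
    )


def apply_dpi(mi, tolerance=0.0):
    """mi: {frozenset({g1, g2}): MI}. Remove the weakest edge of each triplet.

    Every triplet is evaluated on the original MI values, so the result
    does not depend on the order in which triplets are visited.
    """
    genes = sorted(set().union(*mi))
    removed = set()
    for a, b, c in combinations(genes, 3):
        edges = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
        if all(e in mi for e in edges):
            weakest = min(edges, key=lambda e: mi[e])
            others = [e for e in edges if e != weakest]
            if mi[weakest] < min(mi[e] for e in others) * (1 - tolerance):
                removed.add(weakest)  # likely an indirect interaction
    return {e: v for e, v in mi.items() if e not in removed}


mi_identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])

# Hypothetical MI values for a chain geneA -> geneB -> geneC.
mi = {
    frozenset(("geneA", "geneB")): 0.9,
    frozenset(("geneB", "geneC")): 0.8,
    frozenset(("geneA", "geneC")): 0.3,
}
network = apply_dpi(mi)
```

In the toy chain the geneA-geneC edge is removed because its MI is lower than both of the other edges in the triplet, which is what the DPI predicts for an indirect dependency.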
39. PIPELINE OF ANALYSIS
NETWORK STEP
Functional networks: Cytoscape
- Several web tools or software packages can build them, such as DAVID and Cytoscape. They link modules of genes by their functional annotation enrichment.
- They rely on annotations and not on the expression values of the experiment. They can return some annotations that carry no information for a specific biological question.
[Figure: example of a network built with ClueGO on a list of differentially expressed genes]