Transcriptomics Pipeline Analysis

GENERAL PIPELINE OF
TRANSCRIPTOMICS ANALYSIS
(Based on micro-arrays experiments)
SANTY MARQUES-LADEIRA

GENERAL PIPELINE OF ANALYSIS
Raw data
-> Quality Control -> Controlled raw data
-> Normalisation -> Normalised data
-> Quality Control -> Controlled normalised data
-> Annotation -> Annotated normalised data
-> Differential Expression -> Diff. expressed genes
-> Clustering -> Diff. expressed modules

PIPELINE OF ANALYSIS
NORMALISATION STEP

NORMALISATION STEP
Steps of Robust Multi-array Average (= RMA)
1) Background correction
2) Normalisation (across arrays)
3) Probe level intensity calculation
4) Probe set summarisation
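
Step 2 can be illustrated with quantile normalisation, which is the across-array normalisation used in the standard RMA procedure. The sketch below is a minimal pure-Python version under simplifying assumptions: every array has the same number of probes and ties are broken by position. The function name is illustrative and does not come from any package.

```python
def quantile_normalise(arrays):
    """Quantile-normalise a list of equally long intensity lists (one list per array)."""
    n_probes = len(arrays[0])
    # Sort each array, then average the values that share the same rank across arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n_probes)
    ]
    normalised = []
    for a in arrays:
        # Order of the probes within their own array (ties broken by position).
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            # Each probe receives the mean intensity of its rank.
            out[probe] = rank_means[rank]
        normalised.append(out)
    return normalised
```

After this step every array has exactly the same distribution of values, which is why box-plots of normalised arrays line up.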

Other normalisation algorithms
A few examples
1) MAS 5.0
2) gcRMA
3) Li-Wong (dChip)
…
Evaluating them can be time-consuming:
- implement the different algorithms
- find the best way to compare them

QUALITY CONTROL STEP

3 main tests, each compared before and after normalisation:
- Distance matrices between arrays
- Box-plots
- Mean-Average (MA) plots
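
As an illustration of the MA plot, the sketch below computes the two plotted quantities for a pair of arrays: M, the log2 ratio, and A, the mean log2 intensity. It assumes raw, strictly positive intensities; the function name is illustrative.

```python
import math


def ma_values(intensities_1, intensities_2):
    """Return (M, A) lists for two arrays of raw, strictly positive intensities."""
    m_values, a_values = [], []
    for x, y in zip(intensities_1, intensities_2):
        log_x, log_y = math.log2(x), math.log2(y)
        m_values.append(log_x - log_y)          # M: log2 ratio between the arrays
        a_values.append((log_x + log_y) / 2.0)  # A: mean log2 intensity
    return m_values, a_values
```

Plotting M against A before and after normalisation shows whether an intensity-dependent bias remains: after a successful normalisation the cloud is centred on M = 0.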

ANNOTATION STEP

ANNOTATION STEP
Step 1: Annotate Transcript Cluster (TC) probes
GENERAL CASE:
Transcript ID: 00 000 000 + Gene Accession: NM… // Gene Name // Description
ELSE:
Transcript ID: 00 000 000 + Gene Accession: - - - or NA + mRNA Accession: NM…
EXCEPTIONS:
Transcript ID: 00 000 000 + Gene Accession: - - - or NA + mRNA Accession: ?? // NONCODE // Linc…
Transcript ID: 00 000 000 + Gene Accession: - - - or NA + mRNA Accession: ?? // NONCODE // ??
KEEP ONLY probes with Category: Main or Rescue
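
The general case and the fallback case can be sketched as a small selection rule. This is only an illustration: the record layout (a dict with the hypothetical keys `category`, `gene_accession` and `mrna_accession`) is an assumption, and the NONCODE exceptions are deliberately not handled because the slide does not say how they are resolved.

```python
KEPT_CATEGORIES = ("main", "rescue")


def is_missing(value):
    """True for the empty markers shown on the slide: '- - -' (or '---') and 'NA'."""
    return value is None or value.replace(" ", "") in ("", "---", "NA")


def pick_accession(record):
    """Return the accession to use for one TC probe, or None if nothing usable is found."""
    # Keep only probes whose category is Main or Rescue.
    if str(record.get("category", "")).lower() not in KEPT_CATEGORIES:
        return None
    # General case: gene accession; otherwise fall back to the mRNA accession.
    for field in ("gene_accession", "mrna_accession"):
        value = record.get(field)
        if not is_missing(value):
            return value
    return None
```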

ANNOTATION STEP
Step 2: From TC probes to genes
Cases: "single gene" / NA / "multiple probes"
For "multiple probes": keep the maximum signal, with verification (isoforms …)
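
A minimal sketch of the "multiple probes" case, assuming that "maximum signal" means the probe with the highest mean intensity across arrays (the slide does not define it further). Probes without a gene (NA) are skipped, and the isoform verification is left out. All names are illustrative.

```python
def collapse_to_genes(probe_signals, probe_to_gene):
    """Keep one probe per gene: the one with the highest mean signal.

    probe_signals: {probe_id: [value per array]}
    probe_to_gene: {probe_id: gene name, or None when the probe has no gene}
    """
    best = {}
    for probe, values in probe_signals.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # NA: no gene to attach this probe to
        mean_signal = sum(values) / len(values)
        if gene not in best or mean_signal > best[gene][0]:
            best[gene] = (mean_signal, values)
    return {gene: values for gene, (_, values) in best.items()}
```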

DIFFERENTIAL EXPRESSION STEP

Limma: Linear Models for Micro-Array & RNA-seq
- Limma is a package for the analysis of gene expression data arising
from microarray or RNA-Seq technologies.
- A core capability is the use of linear models to assess differential
expression in the context of multi-factor designed experiments.
- Limma provides the ability to analyse comparisons between many
RNA targets simultaneously.
- It has features that make the analyses stable even for experiments
with a small number of arrays.
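
To make the linear-model idea concrete, here is a sketch of the simplest design, a two-group comparison fitted gene by gene on log2 values. It is not limma: it computes an ordinary t statistic, without the empirical Bayes moderation that gives limma its stability on small experiments, and it assumes a non-zero residual variance. The function name is illustrative.

```python
import math


def two_group_fit(values, groups):
    """Fit y = b0 + b1 * group for one gene; return (log2 fold-change, t statistic).

    values: log2 expression per array; groups: 0 (reference) or 1 (test) per array.
    """
    ref = [v for v, g in zip(values, groups) if g == 0]
    test = [v for v, g in zip(values, groups) if g == 1]
    mean_ref, mean_test = sum(ref) / len(ref), sum(test) / len(test)
    # Residual sum of squares around the two fitted group means.
    rss = sum((v - mean_ref) ** 2 for v in ref) + sum((v - mean_test) ** 2 for v in test)
    residual_variance = rss / (len(values) - 2)
    standard_error = math.sqrt(residual_variance * (1 / len(ref) + 1 / len(test)))
    log2_fc = mean_test - mean_ref  # b1, the group coefficient
    return log2_fc, log2_fc / standard_error
```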

Selection of DE threshold:
Volcano plots
[Figure: volcano plot "HIV1AZTvsNS on genes"; x-axis: log2 of Fold-Change (-3 to 3); y-axis: -log10 of adjusted P-value (0 to 8); threshold lines at 1% and 5%; regions labelled UP and DOWN]
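
The thresholding behind a volcano plot can be sketched as follows. The Benjamini-Hochberg adjustment is one common way to obtain adjusted P-values; the slide does not say which adjustment was used, and the fold-change cut-off of 1 is a purely illustrative default.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values, in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down so the adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def classify(log2_fc, adjusted_p, alpha=0.05, min_abs_log2_fc=1.0):
    """Label each gene UP, DOWN or NS (not selected)."""
    labels = []
    for fc, p in zip(log2_fc, adjusted_p):
        if p <= alpha and fc >= min_abs_log2_fc:
            labels.append("UP")
        elif p <= alpha and fc <= -min_abs_log2_fc:
            labels.append("DOWN")
        else:
            labels.append("NS")
    return labels
```

Setting `alpha` to 0.01 or 0.05 corresponds to the 1% and 5% lines of the plot.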

Representation of DE probes / genes :
Heat-maps
- Derived from the heatplot() function.
- Several parameters have to be adapted manually to each case.
- To do: optimise the function so that it detects the best parameters
automatically from the data matrix.
[Figure: heat-map with colour key from -3 to 3.
Columns (27 samples): BDCA1, BDCA3 and pDC, each under CTRL, JK and R5, D1 to D3.
Rows (30 genes): TLR10, TLR8, AIM2, TLR3, IRAK2, TICAM1, IKBKG, TRAF3, IKBKB, NFKB1, NFKB2, TANK, TRAF6, IRAK4, IFI16, MYD88, MB21D1, TLR7, AZI2, TMEM173, TLR6, TBK1, IKBKE, TLR4, IRAK3, TLR2, TLR5, ZBP1, IRAK1, IRF3]

CLUSTERING STEP

CLUSTERING STEP
General overview :
- Connectivity-based clustering = Hierarchical clustering
- Centroid-based clustering = K-means
- Distribution-based clustering = Expectation-Maximisation (EM)
- “Network”-based clustering = Weighted correlation network analysis = WGCNA
- Artificial neural network = Self-organising map (SOM)
- Density-based clustering = DBSCAN

Connectivity-based clustering
Hierarchical clustering
- Hierarchical clustering (= HCA) seeks to build a hierarchy of clusters.
Strategies for hierarchical clustering generally fall into two types:
- Agglomerative or "bottom up" approach = each observation starts
in its own cluster, and pairs of clusters are merged as one moves up
the hierarchy.
- Divisive or "top down" approach = all observations start in one
cluster, and splits are performed recursively as one moves down the
hierarchy.
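
The agglomerative ("bottom up") strategy can be sketched in a few lines. This minimal version uses Euclidean distance and average linkage, both of which are assumptions here, and stops once a requested number of clusters is reached instead of building the full tree. It is quadratic at best and meant for illustration only.

```python
import math


def agglomerative(points, n_clusters):
    """Average-linkage agglomerative clustering; returns clusters as lists of point indices."""
    # Every observation starts in its own cluster.
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        # Merge the closest pair of clusters and move one level up the hierarchy.
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```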

Connectivity-based clustering
Hierarchical clustering
[Figure: hierarchical clustering of the ordered list of DE genes; groups labelled A to H]

Centroid-based clustering
K-means
[Figure: two plots of the within-groups sum of squares (WGSS) against the number of clusters (2 to 14), used to choose the number of clusters; groups labelled A to H]
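
The WGSS-versus-number-of-clusters curve can be reproduced with a small K-means sketch. To keep the example deterministic, the first k points are used as initial centroids; this is a simplification, since real implementations use random or smarter starts and several restarts.

```python
import math


def assign(points, centroids):
    """Group the points by their nearest centroid."""
    groups = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        groups[nearest].append(p)
    return groups


def kmeans(points, k, n_iter=100):
    """Return (centroids, groups, WGSS) for k clusters."""
    centroids = [tuple(p) for p in points[:k]]  # deterministic start (assumption)
    for _ in range(n_iter):
        groups = assign(points, centroids)
        # An empty group keeps its previous centroid.
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    groups = assign(points, centroids)
    wgss = sum(math.dist(p, centroids[c]) ** 2 for c, g in enumerate(groups) for p in g)
    return centroids, groups, wgss
```

Calling `kmeans` for k = 1, 2, 3, … and plotting the returned WGSS gives the curve whose "elbow" suggests the number of clusters.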

Centroid-based clustering
SOM
- Similar to the K-means algorithm.
- The centroids have a predetermined topographic ordering
relationship.
- At each update, the centroids near the winning centroid are also updated.
[Figure: standard SOM algorithm]
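
One SOM update can be sketched as below, assuming the simplest topology (nodes on a 1-D line) and a linearly decaying neighbourhood; real maps are usually 2-D grids with Gaussian neighbourhoods and shrinking learning rates. Names and defaults are illustrative.

```python
import math


def som_update(nodes, sample, learning_rate=0.5, radius=1):
    """Apply one SOM update in place on a 1-D line of nodes; return the winner index.

    nodes: list of weight vectors (lists of floats); sample: one observation.
    """
    # The winner is the node whose weights are closest to the sample.
    winner = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], sample))
    for i, weights in enumerate(nodes):
        grid_distance = abs(i - winner)
        if grid_distance <= radius:
            # Neighbours on the grid move too, but less than the winner.
            strength = learning_rate * (1 - grid_distance / (radius + 1))
            nodes[i] = [w + strength * (s - w) for w, s in zip(weights, sample)]
    return winner
```

The difference from K-means is visible in the loop: the nodes adjacent to the winner on the grid are pulled toward the sample as well, which preserves the topographic ordering.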

Distribution-based clustering
EM
- Iterative method for finding maximum likelihood or maximum a
posteriori (MAP) estimates of parameters in statistical models,
where the model depends on unobserved latent variables.
- An expectation (E) step, which creates a function for the
expectation of the log-likelihood evaluated using the current
estimate for the parameters.
- A maximisation (M) step, which computes parameters maximising
the expected log-likelihood found on the E step. These parameter
estimates are then used to determine the distribution of the latent
variables in the next E step.
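
The E and M steps can be illustrated on the smallest useful case, a mixture of two 1-D Gaussians, where the latent variable is the component each point belongs to. The starting values and the small numerical guards are assumptions of this sketch, not part of the method.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(data, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture; return (weights, means, variances)."""
    # Crude start (assumption): means at the extremes, shared variance, equal weights.
    overall = sum(data) / len(data)
    start_var = max(sum((x - overall) ** 2 for x in data) / len(data), 1e-6)
    means, variances, weights = [min(data), max(data)], [start_var, start_var], [0.5, 0.5]
    for _ in range(n_iter):
        # E step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(dens) or 1e-300  # guard against underflow
            resp.append([d / total for d in dens])
        # M step: re-estimate the parameters from the responsibilities.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-300)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # variance floor keeps the density finite
    return weights, means, variances
```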

Distribution-based clustering
EM
[Figure: two plots of BIC against the number of components (2 to 8), one curve per covariance model: EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV, VVV; groups labelled A to H]

Density-based clustering
DBSCAN
- Finds a number of clusters starting from the estimated
density distribution of the corresponding points.
- 2 parameters: a radius R, and the minimum number of points M that
have to be found within R for those points to be considered part of
the same cluster.
[Figure legend]
A = CORE points
B & C = DENSITY-REACHABLE points
N = NOISE point
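
A compact sketch of DBSCAN with the two parameters named above, here called `radius` (R) and `min_points` (M). It assumes Euclidean distance, counts a point as its own neighbour, and uses a brute-force neighbour search, so it is only suitable for small examples.

```python
import math


def dbscan(points, radius, min_points):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""

    def neighbours(i):
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= radius]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_points:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # density-reachable border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_points:
                queue.extend(reachable)  # j is itself a core point: keep expanding
    return labels
```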

“Network”-based clustering
WGCNA
- WGCNA clusters datasets by quantifying not only the correlations
between individual pairs of genes, but also the extent to which these
genes share the same neighbours.
- The WGCNA R package can be used through 2 methodologies, which we
compared. We chose the “N-step” methodology:
- Manual selection of parameters
- Noise taken into account
- Merging analysis
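
The "shared neighbours" idea is captured by the topological overlap measure (TOM). The sketch below computes an unsigned soft-threshold adjacency from a correlation matrix and then the TOM. It is a plain-Python illustration, not the WGCNA package code, and the power `beta=6` is only an example value that would normally be chosen from the data.

```python
def soft_threshold_adjacency(correlation, beta=6):
    """Unsigned adjacency a_ij = |cor_ij| ** beta, with a zero diagonal."""
    n = len(correlation)
    return [
        [0.0 if i == j else abs(correlation[i][j]) ** beta for j in range(n)]
        for i in range(n)
    ]


def topological_overlap(adjacency):
    """TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    n = len(adjacency)
    # Connectivity of each gene: sum of its adjacencies to all other genes.
    k = [sum(adjacency[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a = adjacency[i][j]
            # Strength of the connections that i and j share through third genes.
            shared = sum(adjacency[i][u] * adjacency[u][j] for u in range(n) if u not in (i, j))
            tom[i][j] = tom[j][i] = (shared + a) / (min(k[i], k[j]) + 1 - a)
    return tom
```

Modules are then typically obtained by hierarchical clustering on the dissimilarity 1 - TOM.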

“Network”-based clustering
WGCNA
[Figure: gene dendrogram (hclust, "average" linkage; height from 0.5 to 1.0) with two module assignments, Dynamic Tree Cut and Merged Dynamic; two bar charts of module sizes for Modules 0 to 7, one for Dynamic Tree Cut (scale 0 to 300) and one for Merging Dynamic (scale 0 to 400)]

SUPPLEMENTAL STEP
[Figure: the general pipeline of analysis, extended with two "Specific Analysis" steps]

SUPPLEMENTAL STEPS
VENN DIAGRAMS
Specific analysis:
Venn-n comparison
INPUT
Lists of genes & data matrix
ALGORITHM
Step 1: Calculate all 2^n possible comparisons
Step 2: Determine the membership of all the genes
Step 3: For each possible comparison
- Find the corresponding genes
- Organise them by hierarchical clustering (HC) using their
expression values
OUTPUT
List of genes organised by comparison & sub-organised
by HC
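
Steps 1 and 2, and the first half of step 3, can be sketched as follows. Each of the 2^n comparisons is represented as a tuple of booleans, one per input list; the HC ordering inside each comparison is left out. The function name and the data layout are illustrative.

```python
from itertools import product


def venn_regions(gene_lists):
    """Return (list names, {membership pattern: sorted genes}).

    gene_lists: {list name: iterable of genes}. A pattern is a tuple of booleans,
    one per list, in the order of the returned names.
    """
    names = list(gene_lists)
    sets = [set(gene_lists[name]) for name in names]
    # Step 1: enumerate all 2^n membership patterns.
    regions = {pattern: [] for pattern in product((False, True), repeat=len(names))}
    # Steps 2 and 3: compute each gene's membership and file it under its pattern.
    for gene in sorted(set().union(*sets)):
        pattern = tuple(gene in s for s in sets)
        regions[pattern].append(gene)
    return names, regions
```

The all-False pattern always stays empty here, because only genes present in at least one input list are known to the function.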

Venn-n comparison
[Figure: heat-map of normalised micro-array expression values (colour scale -3 to 3) for the DE genes; columns: BDCA1, BDCA3 and pDC samples under CTRL, JK and R5, replicates 1 to 3]

SUPPLEMENTAL STEPS
CORRELATION
X-Y plots (correlation)
[Figure: scatter plot of log2(Fold-Change) in one condition against log2(Fold-Change) in another, here HIV1 vs NS and HIV1+AZT vs NS; both axes from -2 to 4]
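
The agreement seen on such an X-Y plot can be summarised with a Pearson correlation between the two vectors of log2 fold-changes. The choice of Pearson is an assumption, since the slide does not name a coefficient, and the sketch assumes neither vector is constant.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long lists of log2 fold-changes."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```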

SUPPLEMENTAL STEPS
ENRICHMENT ANALYSIS
Functional Enrichment analysis
Database for Annotation, Visualisation & Integrated Discovery (DAVID)
Some numbers about DAVID:
- More than 16 000 citations
- Around 40 Nature-branded citations
- Daily usage: 1200 gene lists/sublists & 400 unique researchers
[Figure: bar chart by year, 2004 to 2014; y-axis from 0 to 4000]

Functional Enrichment analysis
Gene Set Enrichment Analysis (GSEA)
1) Get a ranked list L of all the genes on the chip, based on a chosen
measure (expression, FC…).
2) For each gene set S: find the location of each gene s in S within L.
3) Generate an enrichment score ES for S based on a running-sum statistic:
“reward” the presence of s toward the top or bottom of L.
4) Analyse the significance of this Kolmogorov-Smirnov type statistic by
permutation testing.
5) Multiple hypothesis testing (MHT) error control for multiple S’s using
the estimated false discovery rate (FDR).
[Figure: ranked list of genes for conditions A and B, with the running sum plotted along L]
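
The running-sum statistic and its permutation test can be sketched as below. This is the unweighted, Kolmogorov-Smirnov-like form of the score, with every hit counting equally. The permutation test draws random gene sets of the same size, which is one of the two usual schemes (the other permutes the sample labels). It is an illustration, not the GSEA software itself.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Signed maximum deviation from zero of the running sum along the ranked list."""
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up when the gene is in S, step down otherwise.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Empirical P-value of the ES against random gene sets of the same size."""
    rng = random.Random(seed)
    observed = abs(enrichment_score(ranked_genes, gene_set))
    size = sum(g in gene_set for g in ranked_genes)
    extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, size))
        if abs(enrichment_score(ranked_genes, random_set)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)
```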

SUPPLEMENTAL STEPS
TF ANALYSIS
TF Binding Site analysis
oPOSSUM-3
Input:
- Gene list or gene sequences
- 116 TF binding site motifs in ChIP-Seq JASPAR db (2010)
- Selection of distances up and/or down the gene coding sequence
Processing:
- Find the specific sequence up and/or down a gene sequence
- Compute over-represented motifs
- Comparison to TF binding sequences (Z and F scores)
Output:
- List of TFs hierarchically organised by Z & F score
- Automatic representation of F or Z score vs %GC content

oPOSSUM-3
[Figure: Fisher score (0 to 5) against TF profile %GC composition (0.0 to 0.8); labelled TFs: CEBPA, ELF5, FEV, GABPA, IRF1, Klf4, MZF1_5-13, NF-kappaB, NFATC2, RELA, RUNX1, SPI1, STAT1, Stat3, Tcfcp2l1]

Other tools
Databases :
- TRANSFAC database v7.4 through GSEA (615 gene sets)
- JASPAR Core database (205 non redundant motifs)
- UCSC database (690 gene sets) based on data from ENCODE TFBS
ChIP-seq production groups (from 2007 to 2012)
Tools:
- HOMER: TRANSFAC database + own motifs database through ChIP-Seq
analysis
- GSEA: using TRANSFAC db on enriched motifs in ranked list of genes
- DAVID: using UCSC database
- oPOSSUM: using JASPAR (116 Transcription Factors)

NETWORK STEP
General overview:
- Reverse engineering models:
- Algorithm for the reconstruction of accurate cellular
networks = ARACNe
- Tool for Inferring Network of Genes = TINGe
- Other models:
- Functional networks = Cytoscape software (BiNGO / ClueGO
plugins)

NETWORK STEP
Data networks
ARACNe
- A reverse engineering method, specifically designed to scale up to the
complexity of regulatory networks in mammalian cells.
- ARACNe defines an edge as an irreducible statistical dependency between
gene expression profiles that cannot be explained as an artifact of other
statistical dependencies in the network.
- An edge is likely to identify direct regulatory interactions mediated by a
transcription factor binding to a target gene's promoter region, although
other types of interactions may also be identified.

NETWORK STEP
Data networks
TINGe
- Reverse engineers genome-scale gene networks from a large number of
expression profiles, based on mutual information (MI), the data processing
inequality (DPI) and permutation testing to assess the statistical
significance of each inferred edge.
- TINGe can be used for directly constructing high-quality networks, or it
can be used as a component along with other types of data in building
probabilistic networks.
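
The DPI pruning used by this family of methods can be sketched as follows: in every fully connected triplet of genes, the edge with the lowest MI is treated as indirect and removed. The MI values are assumed to be already computed and filtered for significance. The data layout (a dict keyed by `frozenset` pairs) is an illustrative choice, and the tolerance parameter found in real implementations is omitted.

```python
def apply_dpi(mi):
    """Return the edges that survive the data processing inequality.

    mi: {frozenset({gene_a, gene_b}): mutual information}
    """
    genes = sorted({gene for edge in mi for gene in edge})
    removed = set()
    for i, a in enumerate(genes):
        for j in range(i + 1, len(genes)):
            b = genes[j]
            for c in genes[j + 1:]:
                triplet = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
                if all(edge in mi for edge in triplet):
                    # Mark the weakest edge; removal happens after all triplets are seen.
                    removed.add(min(triplet, key=lambda edge: mi[edge]))
    return {edge: value for edge, value in mi.items() if edge not in removed}
```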

NETWORK STEP
Functional networks
Cytoscape
- Several web tools or software packages can build them, e.g. DAVID and
Cytoscape. They link modules of genes by their functional annotation
enrichment.
- They rely on annotations and not on the expression values of the
experiment. They can return some annotations that carry no information
for a specific biological question.
[Figure: example of a network built with ClueGO on a list of differentially expressed genes]
