CDAC 2018 Pellegrini clustering ppi networks

Clustering/Community Detection in
Large Protein-Protein Interaction
networks
Marco Pellegrini
IIT-CNR
CDAC2018, May 22-25, 2018

Why clustering PPIN
• PPIN give a (static) system-level overview of
proteomic machinery.
• PPIN have a natural modularity structure
• Functional modules and protein complexes
appear as dense sub-networks (in an absolute or
relative sense)
• PPIN topology gives hints as to the role of
proteins and protein interactions (e.g. hubs).
• Diseases produce network perturbations.

Excursus on PPIN clustering algorithms
• Methods from disciplines of Computer Science/OR/Graph
Theory/Social Network Analysis (e.g. MCL, RNSC, Cfinder).
• Methods specifically designed for PPI and based on topology
only (MCODE,ClusterONE,..).
• Methods designed for PPI, based on abstracting a complex
model e.g. core-attachment (e.g. COACH, CORE,…)
• Methods incorporating additional biological annotations (e.g.
functional coherence, evolutionary conservation, co-location,
PPI 3D interface constraints, dynamics of PPI, gene co-
expression,..) (e.g. DECAFF, MATISSE,…)

Timeline
• 2000/2002 : MCL
• 2003: MCODE
• 2004: RNSC
• 2005: LCMA
• 2006: Cfinder, CDPlus
• 2007: PCP, DECAFF
• 2009: COACH, CMC, HACO, CORE, RRW
• 2010: CFA, MCL-Caw, SPICi
• 2011: HC-PIN
• 2012: ClusterONE, Prorank, Weak-Ties, CMBI
• 2013: WPNCA, ppsampler2, C-BNMF, PCIA
• 2014: RFC, C-element, … (27 and counting)

Large PPI networks and Protein Complexes
• 5 PPI-networks: DIP-yeast, Biogrid-yeast,
String-yeast, Biogrid-HS, String-HS.
• 2 Complex DB: CYC2008-yeast, Corum-HS
• Basic measurements to validate the following
working hypothesis for large PPIN:
• PC are ego-networks
• PC are highly dense
• PC may be highly overlapping

Some PPI Data sets - statistics
Database   Species    Proteins   Interactions   Av. degree
DIP        yeast         4,637         21,107          9.1
Biogrid    yeast         6,686        220,499         65.9
String     yeast         5,590        133,082         47.6
Biogrid    Homo S.      18,170        137,775         15.1
String     Homo S.      12,717        193,150         30.3

Size of PPI networks used
[Bar chart: number of proteins and number of interactions (y-axis from 0 to 250000) for DIP (yeast), Biogrid (yeast), String (yeast), Biogrid (hs), String (hs).]

Complexes data sets
Database   Species        Num complexes   Compl. size > 2   Compl. size ≤ 2   Num proteins
CORUM      Homo sapiens            1750              1257               493           2506
CYC2008    yeast                    408               236               172           1627
(The complexes of size > 2 are the ones used in the evaluation.)

Are Complexes Dense?
Data set       # CX   Min size   Max size   Mean   Δ > 0.9      Δ > 0.5
DIP-CYC2008     236          3         40   6.02    60 (25%)   131 (55%)
BG-CYC2008      236          3         81   6.67   173 (73%)   223 (94%)
STR-CYC2008     236          3         81   6.67   220 (93%)   235 (99%)
BG-CORUM       1257          3        143   6.12   516 (41%)   943 (75%)
STR-CORUM      1188          3        133   6.07   621 (52%)   981 (82%)

Are Complexes Ego-nets?
Data set       # CX   R1 > 0.9     R1 > 0.5
DIP-CYC2008     236   131 (55%)    197 (83%)
BG-CYC2008      236   216 (91%)    234 (99%)
STR-CYC2008     236   235 (99%)    236 (100%)
BG-CORUM       1257   891 (70%)   1162 (92%)
STR-CORUM      1188   923 (77%)   1139 (95%)

Overlapping structure of complexes
Data set        #P    In 1 CX      In 2 CX     In 3 CX     In > 3 CX
DIP-CYC2008   1175   1005 (86%)   134 (11%)    23 (2%)     13 (1%)
BG-CYC2008    1342   1166 (87%)   139 (10%)    24 (2%)     13 (1%)
STR-CYC2008   1341   1165 (87%)   139 (10%)    24 (2%)     13 (1%)
BG-CORUM      2227    909 (41%)   483 (21%)   233 (11%)   602 (27%)
STR-CORUM     2067    852 (42%)   430 (20%)   217 (10%)   568 (28%)
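The columns of the overlap table (proteins in 1, 2, 3, or more than 3 complexes) can be tallied with a few lines of code. This is only an illustrative sketch: the function name `membership_histogram` and the toy complexes are assumptions, not the code used for the slides.

```python
from collections import Counter

def membership_histogram(complexes):
    """Count proteins by the number of complexes they belong to.

    Returns a dict with keys 1, 2, 3 and ">3", mirroring the table columns.
    """
    # Number of complexes each protein appears in; set() guards against a
    # protein being listed twice inside the same complex.
    per_protein = Counter(p for cx in complexes for p in set(cx))
    hist = {1: 0, 2: 0, 3: 0, ">3": 0}
    for n in per_protein.values():
        hist[n if n <= 3 else ">3"] += 1
    return hist

# Toy data with hypothetical protein names.
toy = [{"A", "B", "C"}, {"B", "C", "D"}, {"C", "E"}, {"C", "F"}]
print(membership_histogram(toy))
```

Dividing each count by the total number of proteins gives the percentages shown in parentheses.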

Core & Peel
Protein complex prediction for large protein protein interaction networks with the Core&Peel
method. M Pellegrini, M Baglioni, F Geraci. BMC Bioinformatics 17 (Suppl 12), 37–58 (2016)

Some Definitions of “density”
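• G=(V,E), V = vertex set, E = edge set, d(x) = degree of x.
Average degree: $av(G) = \frac{2|E|}{|V|}$
Density: $\Delta(G) = \frac{2|E|}{|V|(|V|-1)}$
Conductance: $\phi(G) = \min_{S \subset V} \frac{|E(S, V \setminus S)|}{\min(d(S), d(V \setminus S))}$, where $d(S) = \sum_{x \in S} d(x)$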
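As a concrete reading of these three measures, here is a minimal sketch on an adjacency-set representation. The function names and the toy graph are assumptions made for illustration. Note that the conductance of G is a minimum over all cuts; the sketch only evaluates the ratio for one given set S.

```python
def edges_to_adj(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def average_degree(adj):
    """2|E| / |V|; the sum of the degrees equals 2|E|."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def density(adj):
    """2|E| / (|V| (|V| - 1)): fraction of possible edges that are present."""
    n = len(adj)
    return sum(len(nbrs) for nbrs in adj.values()) / (n * (n - 1))

def cut_conductance(adj, S):
    """|E(S, V\\S)| / min(d(S), d(V\\S)) for one node set S."""
    S = set(S)
    cut = sum(1 for u in S for w in adj[u] if w not in S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(adj[u]) for u in adj if u not in S)
    return cut / min(vol_S, vol_rest)

# Toy graph: two triangles joined by the single edge (2, 3).
toy_adj = edges_to_adj([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
print(average_degree(toy_adj), density(toy_adj), cut_conductance(toy_adj, {0, 1, 2}))
```

The same average-degree formula reproduces the statistics table: a network with 4637 proteins and 21107 interactions has average degree 2 × 21107 / 4637 ≈ 9.1.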

Algorithmic ideas behind Core&Peel
• Consider the following planted clique detection
problem (Kucera ‘95, Dekel et al. ‘10).
• Take a random graph G=G(n,p). Embed a clique Kk
in G to obtain G’(n,p,k).
• If k is of the order of magnitude of the max degree
in G, then w.h.p. the following simple algorithm works well:
(1) Sort the nodes of G’ by decreasing degree.
(2) Take the top c nodes v1,..,vc for c=3,4,5….
(3) Verify if v1,..,vc induce a complete graph.
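The three steps can be tried out on a small synthetic instance. The sketch below is an illustration under assumptions: the helper names, the parameters (n=200, p=0.05, k=40) and the seed are chosen only so that the planted clique clearly dominates the background degrees.

```python
import random
from itertools import combinations

def planted_clique_graph(n, p, k, seed=0):
    """Sample G(n, p), then plant a clique on k randomly chosen nodes."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    clique = set(rng.sample(range(n), k))
    for u, v in combinations(sorted(clique), 2):
        adj[u].add(v)
        adj[v].add(u)
    return adj, clique

def is_clique(adj, nodes):
    """True if every pair of the given nodes is adjacent."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))

def top_degree_clique(adj):
    """Steps (1)-(3): sort by degree, grow the prefix while it is a clique."""
    ranked = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    best = []
    for c in range(3, len(ranked) + 1):
        if not is_clique(adj, ranked[:c]):
            break
        best = ranked[:c]
    return set(best)

adj, planted = planted_clique_graph(n=200, p=0.05, k=40, seed=1)
found = top_degree_clique(adj)
print(len(found), found == planted)
```

With these parameters every clique node has degree at least 39, while background degrees concentrate around 10, so the degree ranking separates the two groups.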

Why does this work?
• The planted clique (dense sub-graph) of a certain
size induces a bias in an easy “measure” so that
the clique nodes emerge as higher in rank than
random non-clique nodes (background).
• Thus sorting by this measure highlights the
possible good seed candidates pushing them to
the top of the list.
• Verification of a candidate set is relatively easy
and local.

Making things less simple…
• Consider the clique number of a node v in G:
Clq(v,G), that is, the size of the largest clique in
G containing v.
• Clearly degree(v) is an upper bound to Clq(v)
(up to an additive 1, since v itself is in the clique);
it is easy to compute but often not tight.
• So we look for a better upper bound to Clq(v)
that (1) can be computed efficiently and (2) can be
easily extended to quasi-cliques.

Degeneracy and Core decomposition
• The core decomposition CD(G) assigns to each
node v in G its core number cn(v).
• cn(v)=k if k is the largest integer such that v
belongs to an induced subgraph of min degree k.
• cn(v) is the largest integer k you can assign to v,
so that at least k neighbors w of v in G have
cn(w)>=k.
• Observation 1: cn(v) is an upper bound to Clq(v)
(up to an additive 1: every node of a clique of size s
has s-1 neighbors inside it, so Clq(v) <= cn(v)+1).
• Observation 2: CD(G) can be computed in linear
time O(n+m) [Batagelj-Zaversnik, 2003].
• Thus it is a good candidate to replace the degree
as a convenient upper bound.
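A minimal sketch of how core numbers can be computed by repeatedly removing a minimum-degree node. For brevity it uses a binary heap, so it runs in O(m log n) rather than the linear-time bucket scheme of Batagelj-Zaversnik; the function name and the toy graph are assumptions.

```python
import heapq

def core_numbers(adj):
    """Return {node: core number} for an undirected graph.

    adj maps each node to the set of its neighbors. Nodes are removed in
    order of current degree; the core number of a node is the largest
    minimum degree seen up to the moment it is removed.
    """
    deg = {v: len(adj[v]) for v in adj}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    removed = set()
    core = {}
    k = 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue  # stale heap entry, a fresher one exists
        k = max(k, d)
        core[v] = k
        removed.add(v)
        for w in adj[v]:
            if w not in removed:
                deg[w] -= 1
                heapq.heappush(heap, (deg[w], w))
    return core

# Toy graph: a 4-clique {0,1,2,3} with a tail 0-4-5.
toy_graph = {
    0: {1, 2, 3, 4}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2},
    4: {0, 5}, 5: {4},
}
print(core_numbers(toy_graph))
```

In the toy graph, node 0 has degree 4 but core number 3, which matches the size of its largest clique (4) up to the additive 1: the core number is the tighter of the two bounds.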

Last phase: Charikar peeling+
• In [Charikar 2000] it is shown that iteratively
peeling the node of lowest degree in G, and keeping
the best subgraph seen, gives a (1/2)-approximation
to the densest subgraph (measured as average degree).
• The intuition is that if we start with a graph
that is sufficiently dense, then the optimum
densest subgraph is often the same object
under either definition (average degree or density).
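A minimal sketch of the plain greedy peeling described above (not the Peeling+ variant, whose details are not given here). It selects the minimum-degree node by a linear scan, so it is quadratic; the function name and the toy graph are assumptions.

```python
def charikar_peel(adj):
    """Greedy peeling for the densest subgraph (average-degree objective).

    Repeatedly removes a node of minimum degree and remembers the
    intermediate node set with the highest average degree 2|E|/|V|.
    Returns (best_node_set, best_average_degree).
    """
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    best_nodes = set(adj)
    best_avg = 2 * m / len(adj) if adj else 0.0
    while len(adj) > 1:
        v = min(adj, key=lambda x: len(adj[x]))
        m -= len(adj[v])            # edges lost with v
        for w in adj.pop(v):
            adj[w].discard(v)
        avg = 2 * m / len(adj)
        if avg > best_avg:
            best_avg, best_nodes = avg, set(adj)
    return best_nodes, best_avg

# Toy graph: a 4-clique {0,1,2,3} with a tail 0-4-5.
peel_graph = {
    0: {1, 2, 3, 4}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2},
    4: {0, 5}, 5: {4},
}
print(charikar_peel(peel_graph))
```

On the toy graph the tail nodes 5 and 4 are peeled first, and the 4-clique remains as the best subgraph, with average degree 3.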

Core&Peel in one shot
• (1) Compute the Core Decomposition of G.
• (2) Sort the vertices by decreasing core number
(break ties by |Nr(v,c(v))|).
• (3) For each node v in turn:
• (4) Extract G[Nr(v,c(v))]
• (5) If it is above 50% density, apply Peeling+; stop
when sufficiently dense
• (6) Remove from the final output list any dense
subgraph completely included in another.
• (7) Remove from the final output near-duplicates,
i.e. subgraphs with Jaccard similarity above a threshold
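The post-processing in steps (6) and (7) can be sketched as follows. This covers only the filtering stage, not steps (1)-(5). The function names, the threshold value, and the rule of keeping the larger set of a near-duplicate pair are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two non-empty sets."""
    return len(a & b) / len(a | b)

def remove_contained(clusters):
    """Step (6): drop every cluster strictly included in another one."""
    sets = list({frozenset(c) for c in clusters})  # also drops exact copies
    return [c for c in sets if not any(c < d for d in sets)]

def remove_near_duplicates(clusters, threshold):
    """Step (7): greedily keep a cluster only if its Jaccard similarity
    with every cluster kept so far is at most the threshold.
    Larger clusters are considered first (an assumed tie-breaking rule)."""
    kept = []
    for c in sorted(clusters, key=lambda s: (-len(s), sorted(s))):
        if all(jaccard(c, k) <= threshold for k in kept):
            kept.append(c)
    return kept

candidates = [{1, 2, 3, 4, 5}, {1, 2, 3}, {1, 2, 3, 4, 6}, {7, 8, 9}]
final = remove_near_duplicates(remove_contained(candidates), threshold=0.6)
print([sorted(c) for c in final])
```

Here {1,2,3} is removed by containment, and {1,2,3,4,6} is removed as a near-duplicate of {1,2,3,4,5} (Jaccard 4/6).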

Experiments with PPI networks
• 6 PPI-networks: DIP-yeast, Biogrid-yeast, String-yeast,
Biogrid-HS, Biogrid-HS-UBC, String-HS.
• 2 Complex DB: CYC2008-yeast, Corum-HS
• 10 Competitors: MCL , COACH, MCODE, MCL-Caw,
CMC, ProRank, Spici, RNSC, ClusterOne, CFinder.
• 1 Evaluation metric : F-measure from [Li et al. BMC
genomics, 2010, 11:S3], for ω=0.2.
• 3 Evaluation metrics: J-measure, PR-measure, SS-measure
from [Song et al., Bioinformatics, 2009]
• Make an Aggregated Quality Index by summing them.
• Measure GO enrichment by BH-corrected p-values in
hypergeometric tests (q-values).
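The last bullet combines two standard ingredients: a hypergeometric tail probability for each (cluster, GO term) pair and a Benjamini-Hochberg (BH) correction across all tests. The sketch below shows both in plain Python; the function names and the toy numbers are assumptions, and the actual pipeline behind the slides may differ.

```python
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """P[X >= k] for X ~ Hypergeometric(N, K, n).

    N: proteins in the background, K: background proteins with the GO term,
    n: proteins in the cluster, k: cluster proteins with the GO term.
    """
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / comb(N, n)

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values), in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q

p = hypergeom_pvalue(N=10, K=5, n=5, k=5)     # all 5 drawn items annotated
qvals = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print(p, qvals)
```

A cluster would then be counted as enriched when its best q-value falls below a chosen cutoff.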

Testing Overview
[Diagram: a PPI network is given to Prediction-Alg1, Prediction-Alg2, ..., Prediction-Algn; each prediction is passed to an Evaluator, which uses Gene Ontology and validated complexes to produce scores.]
Objective 1: pick the champion for the problem
Objective 2: hedge strengths/weaknesses of several methods

F-measure - yeast
[Bar chart: F-measure (y-axis from 0 to 0.6) on DIP-y, BG-y, STR-y for C&P, MCL, COACH, MCODE, CMC, MCL-CAW, ProRank, Spici, ClusterOne, RNSC, Cfinder, Rand1, Rand2, Rand3.]

Semantic Similarity - yeast
[Bar chart: semantic similarity (y-axis from 0 to 0.4) on DIP-y, BG-y, STR-y for C&P, MCL, COACH, MCODE, CMC, MCL-CAW, ProRank, Spici, ClusterOne, RNSC, Cfinder, Rand1, Rand2, Rand3.]

Aggregated Score - yeast
[Bar chart: aggregated score (y-axis from 0 to 1.8) on DIP-y, BG-y, STR-y for C&P, MCL, COACH, MCODE, CMC, MCL-CAW, ProRank, Spici, ClusterOne, RNSC, Cfinder, Rand1, Rand2, Rand3.]

F-measure - hs
[Bar chart: F-measure (y-axis from 0 to 0.7) on BG-hs, BG-hs-UBC, STR-hs for C&P, MCL, COACH, MCODE, CMC, MCL-CAW, ProRank, Spici, ClusterOne, RNSC, Cfinder, Rand1, Rand2, Rand3.]

Semantic Similarity - hs
[Bar chart: semantic similarity (y-axis from 0 to 0.5) for C&P, MCL, COACH, MCODE, CMC, MCL-CAW, ProRank, Spici, ClusterOne, RNSC, Cfinder, Rand1, Rand2, Rand3.]

Aggregated Score - hs
[Bar chart: aggregated score (y-axis from 0 to 1.8) for C&P, MCL, COACH, MCODE, CMC, MCL-CAW, ProRank, Spici, ClusterOne, RNSC, Cfinder, Rand1.]

DIP-y (hypergeometric q-values)
[Chart: y-axis from 0 to 2500; x-axis q-value thresholds exp(-2) to exp(-7); series C&P, MCL, Coach, Mcode, CMC, MCLCAW, PRPlus, Spici, Clusterone, RNSC, Cfinder.]

Bg-y (hypergeometric q-values)
[Chart: y-axis from 0 to 3500; series C&P, MCL, Coach, Mcode, CMC, MCLCAW, PRPlus, Spici, Clusterone, RNSC, Cfinder.]

Str-y (hypergeometric q-values)
[Chart: y-axis from 0 to 3500; series C&P, MCL, Coach, Mcode, CMC, MCLCAW, PRPlus, Spici, Clusterone, RNSC, Cfinder.]

Bg-hs (hypergeometric q-values)
[Chart: y-axis from 0 to 7000; series C&P, MCL, Coach, Mcode, CMC, PRPlus, Spici, Clusterone, RNSC, Cfinder.]

Bg-hs-UBC (hypergeometric q-values)
[Chart: y-axis from 0 to 7000; series C&P, MCL, Coach, Mcode, CMC, MCLCAW, PRPlus, Spici, Clusterone, RNSC, Cfinder.]

Str-hs (hypergeometric q-values)
[Chart: y-axis from 0 to 4500; series C&P, MCL, Coach, Mcode, CMC, MCLCAW, PRPlus, Spici, Clusterone, RNSC, Cfinder.]

Time (in seconds) in Log10 scale: STR-hs
[Chart: running time in seconds, log10 scale (y-axis from -0.5 to 4), on STR-hs.]
Runs optimizing the f-measure for each algorithm

Time (in seconds) in Log10 scale: BG-hs-UBC
[Chart: running time in seconds, log10 scale (y-axis from -0.5 to 4.5), on BG-hs-UBC.]
Runs optimizing the f-measure for each algorithm

Weighted vs unweighted PPI Networks
T. Nepusz, H. Yu, and A. Paccanaro. Detecting overlapping protein complexes in
protein-protein interaction networks. Nature Methods, vol. 9, pp. 471-472, 2012.

PPIN and Cancer studies
• Objective 1: - Identification of Cancer Modules.
• Objective 2: - Identification of Candidate Drug Targets (bridges between
modules, bottlenecks, inter-modular hubs)
• Dynamic modularity in protein interaction networks predicts breast cancer outcome. Taylor
IW, Linding R, Warde-Farley D, Liu Y, Pesquita C, Faria D, Bull S, Pawson T, Morris Q, Wrana JL.
Nat Biotechnol. 2009 Feb;27(2):199-204.
• A network module-based method for identifying cancer prognostic signatures. Guanming Wu
and Lincoln Stein, Genome Biology 2012 13:R112
• Disease biomarker identification from gene network modules for metastasized breast cancer.
Pooja Sharma, Dhruba K. Bhattacharyya, and Jugal Kalita, Sci Rep. 2017; 7: 1072.
• The OncoPPi network of cancer-focused protein–protein interactions to inform biological
insights and therapeutic strategies, Zenggang Li et al. Nature Communications, volume 8,
Article number: 14356 (2017)
• Disease Module Identification DREAM Challenge. bioRxiv 265553 (2018). doi:
https://doi.org/10.1101/265553

Conclusions /Perspectives
• Other types of Molecular networks (gene co-
expression networks, signaling networks,
metabolic networks).
• Tissue specific and disease specific
information.
• Evaluation /Ranking of clusters/communities
produced by ‘in silico’ methods.
• Disease modules vs cancer modules

Conclusions/future work
• Compare it to more tools…
• Use Inter-species PPI conservation as an
additional ranking criterion.
• Introduce co-expression data (tissue/process
specificity).
• Identify classes of complexes differentially
predicted by different tools.
• Extension to “functional modules” prediction
• Objective: study the “interactome” using large
PPI networks with efficient algorithms.
