
CDAC 2018 Pellegrini clustering ppi networks


Presentation at the 2018 Workshop and School on Cancer Development and Complexity
http://cdac2018.lakecomoschool.org

Published in: Science


  1. 1. Clustering/Community Detection in Large Protein-Protein Interaction Networks • Marco Pellegrini, IIT-CNR • CDAC2018, May 22-25, 2018
  2. 2. Support
  3. 3. Why clustering PPIN • PPIN give a (static) system-level overview of the proteomic machinery. • PPIN have a natural modularity structure. • Functional modules and protein complexes appear as dense sub-networks (in an absolute or relative sense). • PPIN topology gives hints as to the role of proteins and protein interactions (e.g. hubs). • Diseases produce network perturbations.
  4. 4. Excursus on PPIN clustering algorithms • Methods from the disciplines of Computer Science/OR/Graph Theory/Social Network Analysis (e.g. MCL, RNSC, Cfinder). • Methods specifically designed for PPI and based on topology only (MCODE, ClusterONE, ...). • Methods designed for PPI, based on abstracting a complex model, e.g. core-attachment (e.g. COACH, CORE, ...). • Methods incorporating additional biological annotations (e.g. functional coherence, evolutionary conservation, co-location, PPI 3D interface constraints, dynamics of PPI, gene co-expression, ...) (e.g. DECAFF, MATISSE, ...).
  5. 5. Timeline • 2000/2002: MCL • 2003: MCODE • 2004: RNSC • 2005: LCMA • 2006: Cfinder, CDPlus • 2007: PCP, DECAFF • 2009: COACH, CMC, HACO, CORE, RRW • 2010: CFA, MCL-Caw, SPICi • 2011: HC-PIN • 2012: ClusterONE, Prorank, Weak-Ties, CMBI • 2013: WPNCA, ppsampler2, C-BNMF, PCIA • 2014: RFC, C-element, ... (27 and counting)
  6. 6. Large PPI networks and Protein Complexes • 5 PPI networks: Biogrid-yeast, String-yeast, DIP-yeast, Biogrid-HS, String-HS. • 2 complex DBs: CYC2008-yeast, Corum-HS. • Basic measurements to validate the following working hypotheses for large PPIN: • Protein complexes (PC) are ego-networks • PC are highly dense • PC may be highly overlapping
  7. 7. Some PPI data sets: statistics
      Database  Species   Proteins  Interactions  Avg. degree
      DIP       yeast        4,637        21,107          9.1
      Biogrid   yeast        6,686       220,499         65.9
      String    yeast        5,590       133,082         47.6
      Biogrid   Homo S.     18,170       137,775         15.1
      String    Homo S.     12,717       193,150         30.3
  8. 8. Size of PPI networks used [Bar chart: number of proteins and interactions (axis 0 to 250,000) for DIP-yeast, Biogrid-yeast, String-yeast, Biogrid-HS and String-HS]
  9. 9. Complexes data sets
      Database  Species       Num. complexes  Size > 2  Size ≤ 2  Num. proteins
      CORUM     Homo sapiens            1750      1257       493           2506
      CYC2008   yeast                    408       236       172           1627
      Complexes of size > 2 are the ones used in the evaluation.
  10. 10. Are Complexes Dense?
      Data set      # CX  Min size  Max size  Mean  Δ > 0.9    Δ > 0.5
      DIP-CYC2008    236         3        40  6.02   60 (25%)  131 (55%)
      BG-CYC2008     236         3        81  6.67  173 (73%)  223 (94%)
      STR-CYC2008    236         3        81  6.67  220 (93%)  235 (99%)
      BG-CORUM      1257         3       143  6.12  516 (41%)  943 (75%)
      STR-CORUM     1188         3       133  6.07  621 (52%)  981 (82%)
  11. 11. Are Complexes Ego-nets?
      Data set      # CX  R1 > 0.9   R1 > 0.5
      DIP-CYC2008    236  131 (55%)   197 (83%)
      BG-CYC2008     236  216 (91%)   234 (99%)
      STR-CYC2008    236  235 (99%)   236 (100%)
      BG-CORUM      1257  891 (70%)  1162 (92%)
      STR-CORUM     1188  923 (77%)  1139 (95%)
  12. 12. Overlapping structure of complexes
      Data set      #P    In 1 CX     In 2 CX    In 3 CX    In > 3 CX
      DIP-CYC2008   1175  1005 (86%)  134 (11%)   23 (2%)    13 (1%)
      BG-CYC2008    1342  1166 (87%)  139 (10%)   24 (2%)    13 (1%)
      STR-CYC2008   1341  1165 (87%)  139 (10%)   24 (2%)    13 (1%)
      BG-CORUM      2227   909 (41%)  483 (21%)  233 (11%)  602 (27%)
      STR-CORUM     2067   852 (42%)  430 (20%)  217 (10%)  568 (28%)
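An overlap profile like the one in slide 12 can be obtained by counting, for each protein, how many complexes contain it. The following is a minimal sketch on toy data; the function name `overlap_profile`, the bucket labels and the toy complexes are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

def overlap_profile(complexes):
    """Count proteins by the number of complexes they belong to.

    `complexes` is an iterable of protein collections (hypothetical input
    format). Returns a dict with keys 'in 1', 'in 2', 'in 3', 'in >3'.
    """
    # Number of complexes containing each protein (duplicates inside one
    # complex are ignored by converting it to a set).
    membership = Counter(p for cx in complexes for p in set(cx))
    profile = Counter()
    for count in membership.values():
        label = f"in {count}" if count <= 3 else "in >3"
        profile[label] += 1
    return dict(profile)

# Toy example: protein C belongs to four complexes, A to two.
toy_complexes = [{"A", "B", "C"}, {"C", "D"}, {"C", "E", "A"}, {"C", "F"}]
print(overlap_profile(toy_complexes))
```

Dividing each bucket by the total number of proteins gives percentages comparable to those in the table.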
  13. 13. Core & Peel • Protein complex prediction for large protein protein interaction networks with the Core&Peel method. M Pellegrini, M Baglioni, F Geraci. BMC Bioinformatics 17(Suppl 12), 37–58 (2016)
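  14. 14. Some Definitions of “density” • G=(V,E), V = vertex set, E = edge set, d(x) = degree of x. • Average degree: $av(G) = \frac{2|E|}{|V|}$ • Density: $\Delta(G) = \frac{2|E|}{|V|(|V|-1)}$ • Conductance: $\phi(G) = \min_{S \subset V} \frac{|E(S, V \setminus S)|}{\min(\delta(S), \delta(V \setminus S))}$, where $\delta(S) = \sum_{x \in S} d(x)$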
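The three measures of slide 14 are easy to compute on a small graph. The sketch below assumes an undirected graph stored as a dict mapping each node to the set of its neighbours; the function names are illustrative. Since the conductance of a graph is a minimum over all cuts, the sketch only evaluates the ratio for one given node set S.

```python
def num_edges(adj):
    """Number of undirected edges in an adjacency dict {node: set(neighbours)}."""
    return sum(len(nb) for nb in adj.values()) // 2

def average_degree(adj):
    """2|E| / |V|."""
    return 2 * num_edges(adj) / len(adj)

def density(adj):
    """2|E| / (|V| (|V| - 1)): fraction of possible edges that are present."""
    n = len(adj)
    return 2 * num_edges(adj) / (n * (n - 1))

def cut_conductance(adj, S):
    """|E(S, V\\S)| divided by the smaller of the two degree sums."""
    S = set(S)
    cut = sum(1 for u in S for w in adj[u] if w not in S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(adj[u]) for u in adj if u not in S)
    return cut / min(vol_S, vol_rest)

# Toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(average_degree(toy), density(toy), cut_conductance(toy, {0, 1, 2}))
```

The same average-degree formula reproduces the table of slide 7, e.g. 2 × 21,107 / 4,637 ≈ 9.1 for DIP-yeast.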
  15. 15. Algorithmic ideas behind Core&Peel • Consider the following planted clique detection problem (Kucera ‘95, Dekel et al. ‘10). • Take a random graph G=G(n,p). Embed a clique K_k in G to obtain G’(n,p,k). • If k is of the order of magnitude of the max degree in G, then w.h.p. the following simple algorithm works well: (1) Sort the nodes of G’ by decreasing degree. (2) Take the top c nodes v1,..,vc for c=3,4,5,... (3) Verify whether v1,..,vc induce a complete graph.
  16. 16. Why does this work? • The planted clique (dense sub-graph) of a certain size induces a bias in an easy “measure” so that the clique nodes emerge as higher in rank than random non-clique nodes (background). • Thus sorting by this measure highlights the possible good seed candidates, pushing them to the top of the list. • Verification of a candidate set is relatively easy and local.
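The degree-sorting procedure for the planted clique problem can be tried out on synthetic data. This is a minimal sketch, not the authors' code: the helper names and the parameters n=300, p=0.1, k=60 are assumptions chosen so that the clique is much larger than the typical background degree (about 30).

```python
import random
from itertools import combinations

def planted_clique_graph(n, p, k, seed=0):
    """Random graph G(n, p) with a clique on k randomly chosen nodes embedded."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    clique = rng.sample(range(n), k)
    for u, v in combinations(clique, 2):
        adj[u].add(v)
        adj[v].add(u)
    return adj, set(clique)

def top_degree_clique(adj):
    """Longest prefix of the degree-sorted node list that induces a clique."""
    ranked = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    prefix = []
    for v in ranked:
        # v extends the clique only if it is adjacent to every node kept so far.
        if not all(u in adj[v] for u in prefix):
            break
        prefix.append(v)
    return prefix if len(prefix) >= 3 else []

graph, planted = planted_clique_graph(n=300, p=0.1, k=60)
found = top_degree_clique(graph)
print(len(found), set(found) == planted)
```

Growing the prefix one node at a time is equivalent to testing c = 3, 4, 5, ... in turn, but avoids re-checking pairs that were already verified.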
  17. 17. Making things less simple… • Consider the clique number of a node v in G: Clq(v,G), that is, the size of the largest clique in G containing v. • Clearly degree(v) + 1 is an upper bound to Clq(v); it is easy to compute but often not tight. • So we look for a better upper bound to Clq(v) that (1) can be computed efficiently and (2) can be easily extended to quasi-cliques.
  18. 18. Degeneracy and Core decomposition • The core decomposition CD(G) assigns to each node v in G its core number cn(v). • cn(v)=k if k is the largest integer such that v belongs to an induced subgraph of minimum degree k. • Equivalently, cn(v) is the largest integer k you can assign to v, so that at least k neighbors w of v in G have cn(w)>=k. • Observation 1: cn(v) + 1 is an upper bound to Clq(v), since every node of a clique of size q has q-1 neighbors inside the clique. • Observation 2: CD(G) can be computed in linear time O(n+m) [Batagelj-Zaversnik, 2003]. • Thus it is a good candidate to replace the degree as a convenient upper bound.
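Core numbers can be computed by repeatedly removing a node of minimum remaining degree. The sketch below uses a heap with lazy deletion, which runs in O(m log n); it is a simpler stand-in for the linear-time bucket-based algorithm cited on the slide, not an implementation of it. The graph is assumed to be a dict mapping each node to the set of its neighbours, with mutually comparable node labels.

```python
import heapq

def core_numbers(adj):
    """Core number of every node of an undirected graph {node: set(neighbours)}."""
    degree = {v: len(nb) for v, nb in adj.items()}
    heap = [(d, v) for v, d in degree.items()]
    heapq.heapify(heap)
    core, removed, k = {}, set(), 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != degree[v]:
            continue  # stale entry: the degree of v changed after it was pushed
        k = max(k, d)  # the core level never decreases during peeling
        core[v] = k
        removed.add(v)
        for w in adj[v]:
            if w not in removed:
                degree[w] -= 1
                heapq.heappush(heap, (degree[w], w))
    return core

# Toy graph: a 4-clique {0,1,2,3} with a path 3-4-5 attached.
toy = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3},
       3: {0, 1, 2, 4}, 4: {3, 5}, 5: {4}}
print(core_numbers(toy))
```

In the toy graph node 3 has degree 4 but core number 3, which shows how the core number can be a tighter bound on the clique size than the degree.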
  19. 19. Last phase: Charikar peeling+ • In [Charikar 2000] it is shown that repeatedly peeling the node of lowest degree in G, and keeping the best intermediate subgraph, gives a (1/2)-approximation to the densest subgraph (measured as average degree). • The intuition is that if we start with a graph that is sufficiently dense, then the optimum densest subgraph is often the same object under either definition (average degree or density).
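The basic greedy peeling can be sketched as follows: remove a minimum-degree node at each step and remember the intermediate subgraph with the highest average degree. This is a plain quadratic-time illustration of the peeling idea only; the "+" variant used in Core&Peel is not described in the slides and is not reproduced here. The function name and the toy graph are assumptions.

```python
def densest_subgraph_peeling(adj):
    """Greedy peeling on an undirected graph {node: set(neighbours)}.

    Returns (nodes, average_degree) of the best intermediate subgraph.
    """
    adj = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    m = sum(len(nb) for nb in adj.values()) // 2
    best_nodes, best_avg = set(adj), 2 * m / len(adj)
    while len(adj) > 1:
        v = min(adj, key=lambda x: len(adj[x]))  # a node of lowest degree
        m -= len(adj[v])
        for w in adj.pop(v):
            adj[w].discard(v)
        avg = 2 * m / len(adj)
        if avg > best_avg:
            best_nodes, best_avg = set(adj), avg
    return best_nodes, best_avg

# Toy graph: a 5-clique {0,...,4} with a path 4-5-6-7 attached.
toy = {v: {u for u in range(5) if u != v} for v in range(5)}
toy[4].add(5)
toy.update({5: {4, 6}, 6: {5, 7}, 7: {6}})
print(densest_subgraph_peeling(toy))
```

On the toy graph the path nodes are peeled first and the 5-clique, with average degree 4, is returned.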
  20. 20. Core&Peel in one shot • (1) Compute the Core Decomposition of G. • (2) Sort the vertices by decreasing core number (break ties by |Nr(v,c(v))|). • (3) For each node v in turn: • (4) Extract G[Nr(v,c(v))]. • (5) If it is above 50% density, apply Peeling+; stop when sufficiently dense. • (6) Remove from the final output list any dense subgraph completely included in another. • (7) Remove from the final output near-duplicates, i.e. subgraphs with Jaccard similarity above a threshold.
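The seven steps above can be put together in a compact sketch. Several details are assumptions made for illustration and are not confirmed by the slides: Nr(v,c(v)) is taken with r=1 as v plus its neighbours of core number at least c(v); ties are broken by larger neighbourhood first; Peeling+ is replaced by plain minimum-degree peeling; and the parameters `target`, `jaccard_max` and `min_size` are hypothetical.

```python
import heapq

def core_numbers(adj):
    """Core numbers by min-degree peeling (heap with lazy deletion)."""
    degree = {v: len(nb) for v, nb in adj.items()}
    heap = [(d, v) for v, d in degree.items()]
    heapq.heapify(heap)
    core, removed, k = {}, set(), 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != degree[v]:
            continue  # stale heap entry
        k = max(k, d)
        core[v] = k
        removed.add(v)
        for w in adj[v]:
            if w not in removed:
                degree[w] -= 1
                heapq.heappush(heap, (degree[w], w))
    return core

def density(adj, nodes):
    """Density of the subgraph induced by the node set `nodes`."""
    n = len(nodes)
    if n < 2:
        return 0.0
    m = sum(1 for u in nodes for w in adj[u] if w in nodes) // 2
    return 2 * m / (n * (n - 1))

def peel(adj, nodes, target):
    """Drop lowest-degree nodes until the induced density reaches `target`."""
    nodes = set(nodes)
    while len(nodes) > 2 and density(adj, nodes) < target:
        nodes.remove(min(nodes, key=lambda u: len(adj[u] & nodes)))
    return nodes

def core_and_peel(adj, start_density=0.5, target=1.0, jaccard_max=0.8, min_size=3):
    cn = core_numbers(adj)                                      # step 1
    seed = {v: {v} | {w for w in adj[v] if cn[w] >= cn[v]} for v in adj}
    order = sorted(adj, key=lambda v: (cn[v], len(seed[v])), reverse=True)  # step 2
    found = set()
    for v in order:                                             # steps 3-5
        if density(adj, seed[v]) > start_density:
            cand = peel(adj, seed[v], target)
            if len(cand) >= min_size and density(adj, cand) >= target:
                found.add(frozenset(cand))
    kept = []
    for c in sorted(found, key=len, reverse=True):              # steps 6-7
        contained = any(c <= k for k in kept)
        near_dup = any(len(c & k) / len(c | k) > jaccard_max for k in kept)
        if not contained and not near_dup:
            kept.append(c)
    return kept

# Toy graph: a 4-clique {0,1,2,3} and a triangle {4,5,6} joined by the edge 3-4.
toy = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(core_and_peel(toy))
```

With `target=1.0` the sketch reports only cliques; lowering it would accept quasi-cliques, at the cost of more overlapping candidates for steps 6 and 7 to filter.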
  21. 21. Experiments with PPI networks • 6 PPI networks: Biogrid-yeast, String-yeast, DIP-yeast, Biogrid-HS, Biogrid-HS-UBC, String-HS. • 2 complex DBs: CYC2008-yeast, Corum-HS. • 10 competitors: MCL, COACH, MCODE, MCL-Caw, CMC, ProRank, Spici, RNSC, ClusterOne, CFinder. • 1 evaluation metric: F-measure from [Li et al. BMC genomics, 2010, 11:S3], for ω=0.2. • 3 evaluation metrics: J-measure, PR-measure, SS-measure from [Song et al., Bioinformatics, 2009]. • Build an Aggregated Quality Index by summing them. • Measure GO enrichment by BH-corrected p-values in hypergeometric tests (q-values).
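As an illustration of how an F-measure with threshold ω=0.2 can be computed, the sketch below assumes the overlap score commonly used with this kind of evaluation: a predicted cluster p matches a benchmark complex b when |p ∩ b|² / (|p|·|b|) ≥ ω. The slides do not spell out the formula, so treat this definition and the function names as assumptions.

```python
def overlap_score(p, b):
    """|p ∩ b|^2 / (|p| |b|) for two non-empty sets of proteins."""
    common = len(p & b)
    return common * common / (len(p) * len(b))

def f_measure(predicted, benchmark, omega=0.2):
    """Harmonic mean of precision and recall under the overlap threshold omega."""
    matched_pred = sum(
        1 for p in predicted if any(overlap_score(p, b) >= omega for b in benchmark)
    )
    matched_bench = sum(
        1 for b in benchmark if any(overlap_score(p, b) >= omega for p in predicted)
    )
    precision = matched_pred / len(predicted)
    recall = matched_bench / len(benchmark)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy data: one of two predictions matches one of two benchmark complexes.
predicted = [{"A", "B", "C"}, {"X", "Y"}]
benchmark = [{"A", "B", "C", "D"}, {"E", "F", "G"}]
print(f_measure(predicted, benchmark))
```

Precision counts predicted clusters that match some benchmark complex, while recall counts benchmark complexes matched by some prediction, so the two counts may differ.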
  22. 22. Testing Overview [Diagram with boxes: PPI-Network; Prediction-Alg1, Prediction-Alg2, ..., Prediction-Algn, each producing a prediction; Evaluator, with Gene Ontology and Validated Complexes; scores] • Objective 1: pick the champion for the problem • Objective 2: hedge the strong and weak features of several methods
  23. 23. F-measure - yeast [Bar chart, axis 0 to 0.6; data sets: DIP-y, BG-y, STR-y; series: C&P, MCL, COACH, MCODE, CMC, MCL-CAW, ProRank, Spici, ClusterOne, RNSC, Cfinder, Rand1, Rand2, Rand3]
  24. 24. Semantic Similarity - yeast [Bar chart, axis 0 to 0.4; data sets: DIP-y, BG-y, STR-y; series: C&P, MCL, COACH, MCODE, CMC, MCL-CAW, ProRank, Spici, ClusterOne, RNSC, Cfinder, Rand1, Rand2, Rand3]
  25. 25. Aggregated Score - yeast [Bar chart, axis 0 to 1.8; data sets: DIP-y, BG-y, STR-y; series: C&P, MCL, COACH, MCODE, CMC, MCL-CAW, ProRank, Spici, ClusterOne, RNSC, Cfinder, Rand1, Rand2, Rand3]
  26. 26. F-measure - hs [Bar chart, axis 0 to 0.7; data sets: BG-hs, BG-hs-UBC, STR-hs; series: C&P, MCL, COACH, MCODE, CMC, MCL-CAW, ProRank, Spici, ClusterOne, RNSC, Cfinder, Rand1, Rand2, Rand3]
  27. 27. Semantic Similarity - hs [Bar chart, axis 0 to 0.5; data sets: BG-hs, BG-hs-UBC, STR-hs; series: C&P, MCL, COACH, MCODE, CMC, MCL-CAW, ProRank, Spici, ClusterOne, RNSC, Cfinder, Rand1, Rand2, Rand3]
  28. 28. Aggregated Score - hs [Bar chart, axis 0 to 1.8; data sets: BG-hs, BG-hs-UBC, STR-hs; series: C&P, MCL, COACH, MCODE, CMC, MCL-CAW, ProRank, Spici, ClusterOne, RNSC, Cfinder, Rand1]
  29. 29. DIP-y (hypergeometric q-values) [Chart, vertical axis 0 to 2500; thresholds exp(-2) to exp(-7); series: C&P, MCL, Coach, Mcode, CMC, MCLCAW, PRPlus, Spici, Clusterone, RNSC, Cfinder]
  30. 30. Bg-y (hypergeometric q-values) [Chart, vertical axis 0 to 3500; thresholds exp(-2) to exp(-7); series: C&P, MCL, Coach, Mcode, CMC, MCLCAW, PRPlus, Spici, Clusterone, RNSC, Cfinder]
  31. 31. Str-y (hypergeometric q-values) [Chart, vertical axis 0 to 3500; thresholds exp(-2) to exp(-7); series: C&P, MCL, Coach, Mcode, CMC, MCLCAW, PRPlus, Spici, Clusterone, RNSC, Cfinder]
  32. 32. Bg-hs (hypergeometric q-values) [Chart, vertical axis 0 to 7000; thresholds exp(-2) to exp(-7); series: C&P, MCL, Coach, Mcode, CMC, PRPlus, Spici, Clusterone, RNSC, Cfinder]
  33. 33. Bg-hs-UBC (hypergeometric q-values) [Chart, vertical axis 0 to 7000; thresholds exp(-2) to exp(-7); series: C&P, MCL, Coach, Mcode, CMC, MCLCAW, PRPlus, Spici, Clusterone, RNSC, Cfinder]
  34. 34. Str-hs (hypergeometric q-values) [Chart, vertical axis 0 to 4500; thresholds exp(-2) to exp(-7); series: C&P, MCL, Coach, Mcode, CMC, MCLCAW, PRPlus, Spici, Clusterone, RNSC, Cfinder]
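The q-values in slides 29 to 34 come from hypergeometric tests followed by Benjamini-Hochberg (BH) correction, as stated in slide 21. The sketch below shows both computations with the standard library only; the function names and the toy numbers are illustrative, and how clusters and GO terms are paired in the actual experiments is not specified in the slides.

```python
from math import comb

def hypergeom_pvalue(population, annotated, drawn, hits):
    """P[X >= hits] when `drawn` items are sampled without replacement from
    `population` items, `annotated` of which carry the GO term."""
    total = comb(population, drawn)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(hits, min(annotated, drawn) + 1)
    )
    return tail / total

def benjamini_hochberg(pvalues):
    """BH-adjusted p-values (q-values), returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

# Toy example: a cluster of 3 proteins, all 3 annotated with a term that
# covers 3 of 10 proteins in the background.
print(hypergeom_pvalue(population=10, annotated=3, drawn=3, hits=3))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

Counting the clusters whose best q-value falls below each threshold exp(-2), ..., exp(-7) would give one curve per algorithm, which is presumably what the charts display.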
  35. 35. Time (in seconds) in Log10 scale: str-hs [Bar chart, axis -0.5 to 4; data set: STR-hs] Runs optimizing the F-measure for each algorithm
  36. 36. Time (in seconds) in Log10 scale: bg-hs-UBC [Bar chart, axis -0.5 to 4.5; data set: BG-Hs-UBC] Runs optimizing the F-measure for each algorithm
  37. 37. Weighted vs unweighted PPI Networks • T. Nepusz, H. Yu, and A. Paccanaro. Detecting overlapping protein complexes in protein-protein interaction networks. Nature Methods, vol. 9, pp. 471-472, 2012.
  38. 38. PPIN and Cancer studies • Objective 1: Identification of Cancer Modules. • Objective 2: Identification of Candidate Drug Targets (bridges between modules, bottlenecks, inter-modular hubs). • Dynamic modularity in protein interaction networks predicts breast cancer outcome. Taylor IW, Linding R, Warde-Farley D, Liu Y, Pesquita C, Faria D, Bull S, Pawson T, Morris Q, Wrana JL. Nat Biotechnol. 2009 Feb;27(2):199-204. • A network module-based method for identifying cancer prognostic signatures. Guanming Wu and Lincoln Stein, Genome Biology 2012 13:R112. • Disease biomarker identification from gene network modules for metastasized breast cancer. Pooja Sharma, Dhruba K. Bhattacharyya, and Jugal Kalita, Sci Rep. 2017; 7: 1072. • The OncoPPi network of cancer-focused protein–protein interactions to inform biological insights and therapeutic strategies. Zenggang Li et al. Nature Communications, volume 8, Article number: 14356 (2017). • Disease Module Identification DREAM Challenge. bioRxiv 265553 (2018). doi: https://doi.org/10.1101/265553
  39. 39. Conclusions/Perspectives • Other types of molecular networks (gene co-expression networks, signaling networks, metabolic networks). • Tissue-specific and disease-specific information. • Evaluation/ranking of clusters/communities produced by ‘in silico’ methods. • Disease modules vs cancer modules.
  40. 40. Conclusions/future work • Compare it to more tools… • Use Inter-species PPI conservation as an additional ranking criterion. • Introduce co-expression data (tissue/process specificity). • Identify classes of complexes differentially predicted by different tools. • Extension to “functional modules” prediction • Objective: study the “interactome” using large PPI networks with efficient algorithms.
