Your SlideShare is downloading. ×
Mining Billion-node Graphs: Patterns, Generators and Tools Christos Faloutsos CMU
Thanks! <ul><li>Chris Olston </li></ul>Hadoop Summit '10 C. Faloutsos (CMU)
Our goal: <ul><li>Open source system for mining huge graphs: </li></ul><ul><li>PEGASUS project (PEta GrAph mining System) ...
Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><li>Problem#2: Too...
Graphs - why should we care? C. Faloutsos (CMU) Internet Map [lumeta.com] Food Web [Martinez ’91] Protein Interactions [ge...
Graphs - why should we care? <ul><li>IR: bi-partite graphs (doc-terms) </li></ul><ul><li>web: hyper-text graph </li></ul><...
Graphs - why should we care? <ul><li>network of companies & board-of-directors members </li></ul><ul><li>‘ viral’ marketin...
Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><ul><li>Static gra...
Problem #1 - network and graph mining <ul><li>How does the Internet look like? </li></ul><ul><li>How does FaceBook look li...
Problem #1 - network and graph mining <ul><li>How does the Internet look like? </li></ul><ul><li>How does FaceBook look li...
Problem #1 - network and graph mining <ul><li>How does the Internet look like? </li></ul><ul><li>How does FaceBook look li...
Graph mining <ul><li>Are real graphs random? </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
Laws and patterns <ul><li>Are real graphs random? </li></ul><ul><li>A: NO!! </li></ul><ul><ul><li>Diameter </li></ul></ul>...
Solution# S.1 <ul><li>Power law in the degree distribution [SIGCOMM99] </li></ul>C. Faloutsos (CMU) log(rank) log(degree) ...
Solution# S.2: Eigen Exponent  E <ul><li>A2: power law in the eigenvalues of the adjacency matrix </li></ul>C. Faloutsos (...
Solution# S.2: Eigen Exponent  E <ul><li>[Mihail, Papadimitriou ’02]: slope is ½ of rank exponent </li></ul>C. Faloutsos (...
But: <ul><li>How about graphs from other domains? </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
More power laws: <ul><li>web hit counts [w/ A. Montgomery] </li></ul>C. Faloutsos (CMU) Web Site Traffic in-degree (log sc...
epinions.com <ul><li>who-trusts-whom [Richardson + Domingos, KDD 2001] </li></ul>C. Faloutsos (CMU) (out) degree count tru...
And numerous more <ul><li># of sexual contacts </li></ul><ul><li>Income [Pareto] –’80-20 distribution’ </li></ul><ul><li>D...
Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><ul><li>Static gra...
Solution# S.3: Triangle ‘Laws’ <ul><li>Real social networks have a lot of triangles  </li></ul>C. Faloutsos (CMU) Hadoop S...
Solution# S.3: Triangle ‘Laws’ <ul><li>Real social networks have a lot of triangles </li></ul><ul><ul><li>Friends of frien...
Triangle Law: #S.3  [Tsourakakis ICDM 2008] C. Faloutsos (CMU) ASN HEP-TH Epinions X-axis: # of Triangles a node participa...
Triangle Law: #S.3  [Tsourakakis ICDM 2008] C. Faloutsos (CMU) ASN HEP-TH Epinions X-axis: # of Triangles a node participa...
Triangle Law: #S.4  [Tsourakakis ICDM 2008] C. Faloutsos (CMU) SN Reuters Epinions X-axis: degree Y-axis: mean # triangles...
Triangle Law: Computations  [Tsourakakis ICDM 2008] C. Faloutsos (CMU) But: triangles are expensive to compute (3-way join...
Triangle Law: Computations  [Tsourakakis ICDM 2008] C. Faloutsos (CMU) But: triangles are expensive to compute (3-way join...
Triangle Law: Computations  [Tsourakakis ICDM 2008] C. Faloutsos (CMU) 1000x+ speed-up, >90% accuracy details Hadoop Summi...
EigenSpokes <ul><li>B. Aditya Prakash, Mukund Seshadri, Ashwin Sridharan, Sridhar Machiraju and Christos Faloutsos:  Eigen...
EigenSpokes <ul><li>Eigenvectors of adjacency matrix  </li></ul><ul><ul><li>equivalent to singular vectors (symmetric, und...
EigenSpokes <ul><li>Eigenvectors of adjacency matrix  </li></ul><ul><ul><li>equivalent to singular vectors (symmetric, und...
EigenSpokes <ul><li>EE plot: </li></ul><ul><li>Scatter plot of scores of u1 vs u2 </li></ul><ul><li>One would expect </li>...
EigenSpokes <ul><li>EE plot: </li></ul><ul><li>Scatter plot of scores of u1 vs u2 </li></ul><ul><li>One would expect </li>...
EigenSpokes - pervasiveness <ul><li>Present in mobile social graph </li></ul><ul><ul><li>across time and space </li></ul><...
EigenSpokes - explanation <ul><li>Near-cliques, or near-bipartite-cores, loosely connected </li></ul>C. Faloutsos (CMU) Ha...
EigenSpokes - explanation <ul><li>Near-cliques , or near-bipartite-cores, loosely connected </li></ul>C. Faloutsos (CMU) H...
EigenSpokes - explanation <ul><li>Near-cliques, or  near-bipartite-cores , loosely connected </li></ul>C. Faloutsos (CMU) ...
EigenSpokes - explanation <ul><li>Near-cliques, or near-bipartite-cores, loosely connected </li></ul><ul><li>So what? </li...
Bipartite Communities! magnified bipartite community patents from same inventor(s) cut-and-paste bibliography! C. Faloutso...
Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><ul><li>Static gra...
Observations on  weighted graphs? <ul><li>A: yes - even more ‘laws’! </li></ul>C. Faloutsos (CMU) M. McGlohon, L. Akoglu, ...
Observation W.1: Fortification <ul><ul><li>Q: How do the weights  </li></ul></ul><ul><ul><li>of nodes relate to degree? </...
Observation W.1: Fortification C. Faloutsos (CMU) More donors,  more $ ? $10 $5 Hadoop Summit '10 ‘ Reagan’ ‘ Clinton’ $7
Observation W.1: fortification: Snapshot Power Law <ul><li>Weight: super-linear on in-degree  </li></ul><ul><li>exponent ‘...
Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><ul><li>Static gra...
Problem: Time evolution <ul><li>with Jure Leskovec (CMU -> Stanford) </li></ul><ul><li>and Jon Kleinberg (Cornell – sabb. ...
T.1 Evolution of the Diameter <ul><li>Prior work on Power Law graphs hints at   slowly growing diameter : </li></ul><ul><u...
T.1 Evolution of the Diameter <ul><li>Prior work on Power Law graphs hints at   slowly growing diameter : </li></ul><ul><u...
T.1 Diameter – “Patents” <ul><li>Patent citation network </li></ul><ul><li>25 years of data </li></ul><ul><li>@1999 </li><...
T.2 Temporal Evolution of the Graphs <ul><li>N(t) … nodes at time t </li></ul><ul><li>E(t) … edges at time t </li></ul><ul...
T.2 Temporal Evolution of the Graphs <ul><li>N(t) … nodes at time t </li></ul><ul><li>E(t) … edges at time t </li></ul><ul...
T.2 Densification – Patent Citations <ul><li>Citations among patents granted </li></ul><ul><li>@1999 </li></ul><ul><ul><li...
Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><ul><li>Static gra...
More on Time-evolving   graphs C. Faloutsos (CMU) M. McGlohon, L. Akoglu, and C. Faloutsos  Weighted Graphs and Disconnect...
Observation T.3: NLCC behavior <ul><ul><li>Q: How do NLCC’s emerge and join with the GCC? </li></ul></ul><ul><ul><li>(``NL...
Observation T.3: NLCC behavior <ul><li>After the gelling point, the GCC takes off, but NLCC’s remain ~constant (actually, ...
Timing for Blogs <ul><li>with Mary McGlohon (CMU) </li></ul><ul><li>Jure Leskovec (CMU->Stanford) </li></ul><ul><li>Natali...
T.4 : popularity over time C. Faloutsos (CMU) Post popularity drops-off – exponentially? lag: days after post # in links 1...
T.4 : popularity over time C. Faloutsos (CMU) Post popularity drops-off – exponentially? POWER LAW! Exponent? # in links (...
T.4 : popularity over time C. Faloutsos (CMU) <ul><li>Post popularity drops-off – exponentially? </li></ul><ul><li>POWER L...
Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><li>Problem#2: Too...
CenterPiece Subgraphs <ul><li>Hanghang TONG et al, KDD’06 </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
Ce nter- P iece  S ubgraph Discovery  [Tong+ KDD 06] Original Graph Q: Who is the most central node wrt the black nodes?  ...
Ce nter- P iece  S ubgraph Discovery  [Tong+ KDD 06] Q: How to find hub for the query nodes? Input:  original graph Output...
CePS : Example (AND Query) ? C. Faloutsos (CMU) Hadoop Summit '10 <ul><li>DBLP co-authorship network:  </li></ul><ul><ul><...
CePS : Example (AND Query) C. Faloutsos (CMU) <ul><li>DBLP co-authorship network:  </li></ul><ul><ul><li>400,000 authors, ...
Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><li>Problem#2: Too...
G raph X -Ray:  Fast Best-Effort Pattern Matching  in Large Attributed Graphs Hanghang Tong, Brian Gallagher,  Christos Fa...
Output Input Attributed Data Graph Query Graph Matching Subgraph Hadoop Summit '10 C. Faloutsos (CMU)
Effectiveness: star-query  Query  Result Hadoop Summit '10 C. Faloutsos (CMU)
Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><li>Problem#2: Too...
OddBall: Spotting  A n o m a l i e s  in  Weighted Graphs Leman Akoglu, Mary McGlohon, Christos Faloutsos Carnegie Mellon ...
Main idea <ul><li>For each node,  </li></ul><ul><li>extract ‘ego-net’ (=1-step-away neighbors) </li></ul><ul><li>Extract f...
What is an egonet? ego egonet C. Faloutsos (CMU) Hadoop Summit '10
Selected Features <ul><li>N i :  number of neighbors (degree) of ego  i </li></ul><ul><li>E i :  number of edges in egonet...
Near-Clique/Star Hadoop Summit '10 C. Faloutsos (CMU)
Near-Clique/Star C. Faloutsos (CMU) Hadoop Summit '10
Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><li>Problem#2: Too...
Outline – Algorithms & results C. Faloutsos (CMU) Hadoop Summit '10 Centralized Hadoop/PEGASUS Degree Distr. old old Pager...
HADI for diameter estimation <ul><li>Radius Plots for Mining Tera-byte Scale Graphs  U Kang , Charalampos Tsourakakis, Ana...
<ul><li>YahooWeb graph  (120Gb, 1.4B nodes, 6.6 B edges) </li></ul><ul><li>Largest publicly available graph ever studied. ...
<ul><li>YahooWeb graph  (120Gb, 1.4B nodes, 6.6 B edges) </li></ul><ul><li>Largest publicly available graph ever studied. ...
<ul><li>YahooWeb graph  (120Gb, 1.4B nodes, 6.6 B edges) </li></ul><ul><li>effective diameter: surprisingly small. </li></...
<ul><li>YahooWeb graph  (120Gb, 1.4B nodes, 6.6 B edges) </li></ul><ul><li>effective diameter: surprisingly small. </li></...
C. Faloutsos (CMU) <ul><li>YahooWeb graph  (120Gb, 1.4B nodes, 6.6 B edges) </li></ul><ul><li>effective diameter: surprisi...
Radius Plot of  GCC  of YahooWeb. C. Faloutsos (CMU) Hadoop Summit '10
Running time -  Kronecker and Erdos-Renyi  Graphs with billions edges. details
Outline – Algorithms & results C. Faloutsos (CMU) Hadoop Summit '10 Centralized Hadoop/PEGASUS Degree Distr. old old Pager...
Generalized Iterated Matrix Vector Multiplication (GIMV) C. Faloutsos (CMU) PEGASUS: A Peta-Scale Graph Mining  System - I...
Generalized Iterated Matrix Vector Multiplication (GIMV) C. Faloutsos (CMU) <ul><li>PageRank </li></ul><ul><li>proximity (...
Example: GIM-V At Work <ul><li>Connected Components </li></ul>Size Count C. Faloutsos (CMU) Hadoop Summit '10
Example: GIM-V At Work <ul><li>Connected Components </li></ul>Size Count C. Faloutsos (CMU) Hadoop Summit '10 ~0.7B  singl...
Example: GIM-V At Work <ul><li>Connected Components </li></ul>Size Count C. Faloutsos (CMU) Hadoop Summit '10
Example: GIM-V At Work <ul><li>Connected Components </li></ul>Size Count 300-size cmpt X 500. Why? 1100-size cmpt X 65. Wh...
Example: GIM-V At Work <ul><li>Connected Components </li></ul>Size Count suspicious financial-advice sites (not existing n...
GIM-V At Work <ul><li>Connected Components over Time </li></ul><ul><li>LinkedIn: 7.5M nodes and 58M edges </li></ul>Stable...
Outline – Algorithms & results C. Faloutsos (CMU) Hadoop Summit '10 Centralized Hadoop/PEGASUS Degree Distr. old old Pager...
Triangles : Computations  [Tsourakakis ICDM 2008] C. Faloutsos (CMU) But: triangles are expensive to compute (3-way join; ...
Triangle Law: #1  [Tsourakakis ICDM 2008] C. Faloutsos (CMU) ASN HEP-TH Epinions X-axis: # of Triangles a node participate...
Outline – Algorithms & results C. Faloutsos (CMU) Hadoop Summit '10 Centralized Hadoop/PEGASUS Degree Distr. old old Pager...
Visualization: ShiftR <ul><li>Supporting Ad Hoc Sensemaking: Integrating Cognitive, HCI, and Data Mining Approaches Aniket...
C. Faloutsos (CMU) Hadoop Summit '10
Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><li>Problem#2: Too...
Other topics - part#1 - tools <ul><li>Community detection – how many? </li></ul><ul><ul><li>Cross-Associations [Chakrabart...
Tensors <ul><li>Adjacency matrices, stacked (over time, and/or edge-type – ‘composite networks’) </li></ul>Hadoop Summit '...
Tensors <ul><li>Adjacency matrices, stacked (over time, and/or edge-type – ‘composite networks’) </li></ul>Hadoop Summit '...
Tensors <ul><li>Adjacency matrices, stacked (over time, and/or edge-type – ‘composite networks’) </li></ul>~ + PARAFAC ten...
Other topics – part#2 - generators <ul><li>Kronecker [PKDD’05]; </li></ul><ul><li>Random Typing [Akoglu+, PKDD’09] </li></...
Kronecker Product – a Graph <ul><li>One of most realistic generators, with  provable  properties </li></ul>Hadoop Summit '...
Other topics - part#3 – virus propagation <ul><li>Epidemic threshold for SIS: depends  only  on first eigenvalue of adjace...
More info <ul><li>Tutorial on graph mining: KDD’09  </li></ul><ul><li>(w/ Gary Miller and C. Tsourakakis) </li></ul><ul><l...
Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><li>Problem#2: Too...
OVERALL CONCLUSIONS – low level: <ul><li>Several new  patterns  (fortification, triangle-laws, conn. components, etc) </li...
OVERALL CONCLUSIONS – high level <ul><li>Large datasets may reveal patterns/outliers that would be invisible otherwise </l...
References <ul><li>Leman Akoglu, Christos Faloutsos:  RTG: A Recursive Realistic Graph Generator Using Random Typing . ECM...
References <ul><li>Deepayan Chakrabarti, Yang Wang, Chenxi Wang, Jure Leskovec, Christos Faloutsos:  Epidemic thresholds i...
References <ul><li>Christos Faloutsos, Tamara G. Kolda, Jimeng Sun:  Mining large graphs and streams using matrix and tens...
References <ul><li>T. G. Kolda and J. Sun.  Scalable Tensor Decompositions for Multi-aspect Data Mining . In: ICDM 2008, p...
References <ul><li>Jure Leskovec, Jon Kleinberg and Christos Faloutsos  Graphs over Time: Densification Laws, Shrinking Di...
References <ul><li>Jimeng Sun, Yinglian Xie, Hui Zhang, Christos Faloutsos.  Less is More: Compact Matrix Decomposition fo...
References <ul><li>Jimeng Sun, Dacheng Tao, Christos Faloutsos:  Beyond streams and graphs: dynamic tensor analysis . KDD ...
References <ul><li>Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan,  Fast Random Walk with Restart and Its Applications ...
References <ul><li>Hanghang Tong, Christos Faloutsos, Brian Gallagher, Tina Eliassi-Rad: Fast best-effort pattern matching...
Joint papers with LLNL <ul><li>Keith Henderson, Tina Eliassi-Rad, Spiros Papadimitriou, Christos Faloutsos: HCDF: A Hybrid...
Joint papers with LLNL <ul><li>Jimeng Sun, Charalampos E. Tsourakakis, Evan Hoke, Christos Faloutsos, Tina Eliassi-Rad: Tw...
Joint papers with LLNL <ul><li>Brian Gallagher, Hanghang Tong, Tina Eliassi-Rad, Christos Faloutsos: Using ghost edges for...
Project info <ul><li>www.cs.cmu.edu/~pegasus </li></ul>C. Faloutsos (CMU) Akoglu,  Leman Chau,  Polo Kang, U McGlohon,  Ma...
Upcoming SlideShare
Loading in...5
×

Mining Billion-node Graphs: Patterns, Generators and Tools__HadoopSummit2010

3,944

Published on

Hadoop Summit 2010 - Research Track
Mining Billion-node Graphs: Patterns, Generators and Tools
Christos Faloutsos, Carnegie Mellon University

Published in: Technology
1 Comment
15 Likes
Statistics
Notes
  • hanghang tong, Christos Faloutsos.
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
No Downloads
Views
Total Views
3,944
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
0
Comments
1
Likes
15
Embeds 0
No embeds

No notes for slide
  • Faloutsos
  • Faloutsos et al 06/30/10
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • A = U Sigma U^T
  • A = U Sigma U^T vec{u}_1 vec{u}_i
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos Diameter first, DPL second Check diameter formulas As the network grows the distances between nodes slowly grow
  • Faloutsos Diameter first, DPL second Check diameter formulas As the network grows the distances between nodes slowly grow
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • SOME OLD RULES
  • SOME OLD RULES
  • Faloutsos et al 06/30/10
  • Faloutsos et al 06/30/10
  • Faloutsos
  • Faloutsos et al 06/30/10
  • Faloutsos et al 06/30/10
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • lambda_1 Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos
  • Faloutsos et al 06/30/10
  • Transcript of "Mining Billion-node Graphs: Patterns, Generators and Tools__HadoopSummit2010"

    1. 1. Mining Billion-node Graphs: Patterns, Generators and Tools Christos Faloutsos CMU
    2. 2. Thanks! <ul><li>Chris Olston </li></ul>Hadoop Summit '10 C. Faloutsos (CMU)
    3. 3. Our goal: <ul><li>Open source system for mining huge graphs: </li></ul><ul><li>PEGASUS project (PEta GrAph mining System) </li></ul><ul><li>www.cs.cmu.edu/~pegasus </li></ul><ul><li>code and papers </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    4. 4. Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><li>Problem#2: Tools </li></ul><ul><li>Problem#3: Scalability </li></ul><ul><li>Conclusions </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    5. 5. Graphs - why should we care? C. Faloutsos (CMU) Internet Map [lumeta.com] Food Web [Martinez ’91] Protein Interactions [genomebiology.com] Friendship Network [Moody ’01] Hadoop Summit '10
    6. 6. Graphs - why should we care? <ul><li>IR: bi-partite graphs (doc-terms) </li></ul><ul><li>web: hyper-text graph </li></ul><ul><li>... and more: </li></ul>C. Faloutsos (CMU) Hadoop Summit '10 D 1 D N T 1 T M ... ...
    7. 7. Graphs - why should we care? <ul><li>network of companies & board-of-directors members </li></ul><ul><li>‘ viral’ marketing </li></ul><ul><li>web-log (‘blog’) news propagation </li></ul><ul><li>computer network security: email/IP traffic and anomaly detection </li></ul><ul><li>.... </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    8. 8. Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><ul><li>Static graphs </li></ul></ul><ul><ul><li>Weighted graphs </li></ul></ul><ul><ul><li>Time evolving graphs </li></ul></ul><ul><li>Problem#2: Tools </li></ul><ul><li>Problem#3: Scalability </li></ul><ul><li>Conclusions </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    9. 9. Problem #1 - network and graph mining <ul><li>How does the Internet look like? </li></ul><ul><li>How does FaceBook look like? </li></ul><ul><li>What is ‘ normal ’/‘ abnormal ’? </li></ul><ul><li>which patterns/laws hold? </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    10. 10. Problem #1 - network and graph mining <ul><li>How does the Internet look like? </li></ul><ul><li>How does FaceBook look like? </li></ul><ul><li>What is ‘ normal ’/‘ abnormal ’? </li></ul><ul><li>which patterns/laws hold? </li></ul><ul><ul><li>To spot anomalies (rarities), we have to discover patterns </li></ul></ul>C. Faloutsos (CMU) Hadoop Summit '10
    11. 11. Problem #1 - network and graph mining <ul><li>How does the Internet look like? </li></ul><ul><li>How does FaceBook look like? </li></ul><ul><li>What is ‘ normal ’/‘ abnormal ’? </li></ul><ul><li>which patterns/laws hold? </li></ul><ul><ul><li>To spot anomalies (rarities), we have to discover patterns </li></ul></ul><ul><ul><li>Large datasets reveal patterns/anomalies that may be invisible otherwise… </li></ul></ul>C. Faloutsos (CMU) Hadoop Summit '10
    12. 12. Graph mining <ul><li>Are real graphs random? </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    13. 13. Laws and patterns <ul><li>Are real graphs random? </li></ul><ul><li>A: NO!! </li></ul><ul><ul><li>Diameter </li></ul></ul><ul><ul><li>in- and out- degree distributions </li></ul></ul><ul><ul><li>other (surprising) patterns </li></ul></ul><ul><li>So, let’s look at the data </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    14. 14. Solution# S.1 <ul><li>Power law in the degree distribution [SIGCOMM99] </li></ul>C. Faloutsos (CMU) log(rank) log(degree) internet domains att.com ibm.com Hadoop Summit '10 -0.82
    15. 15. Solution# S.2: Eigen Exponent E <ul><li>A2: power law in the eigenvalues of the adjacency matrix </li></ul>C. Faloutsos (CMU) E = -0.48 Exponent = slope Eigenvalue Rank of decreasing eigenvalue May 2001 Hadoop Summit '10
    16. 16. Solution# S.2: Eigen Exponent E <ul><li>[Mihail, Papadimitriou ’02]: slope is ½ of rank exponent </li></ul>C. Faloutsos (CMU) E = -0.48 Exponent = slope Eigenvalue Rank of decreasing eigenvalue May 2001 Hadoop Summit '10
    17. 17. But: <ul><li>How about graphs from other domains? </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    18. 18. More power laws: <ul><li>web hit counts [w/ A. Montgomery] </li></ul>C. Faloutsos (CMU) Web Site Traffic in-degree (log scale) Count (log scale) Zipf ``ebay’’ Hadoop Summit '10 users sites
    19. 19. epinions.com <ul><li>who-trusts-whom [Richardson + Domingos, KDD 2001] </li></ul>C. Faloutsos (CMU) (out) degree count trusts-2000-people user Hadoop Summit '10
    20. 20. And numerous more <ul><li># of sexual contacts </li></ul><ul><li>Income [Pareto] –’80-20 distribution’ </li></ul><ul><li>Duration of downloads [Bestavros+] </li></ul><ul><li>Duration of UNIX jobs (‘mice and elephants’) </li></ul><ul><li>Size of files of a user </li></ul><ul><li>… </li></ul><ul><li>‘ Black swans’ </li></ul>Hadoop Summit '10 C. Faloutsos (CMU)
    21. 21. Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><ul><li>Static graphs </li></ul></ul><ul><ul><ul><li>degree, diameter, eigen, </li></ul></ul></ul><ul><ul><ul><li>triangles </li></ul></ul></ul><ul><ul><ul><li>cliques </li></ul></ul></ul><ul><ul><li>Weighted graphs </li></ul></ul><ul><ul><li>Time evolving graphs </li></ul></ul><ul><li>Problem#2: Tools </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    22. 22. Solution# S.3: Triangle ‘Laws’ <ul><li>Real social networks have a lot of triangles </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    23. 23. Solution# S.3: Triangle ‘Laws’ <ul><li>Real social networks have a lot of triangles </li></ul><ul><ul><li>Friends of friends are friends </li></ul></ul><ul><li>Any patterns? </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    24. 24. Triangle Law: #S.3 [Tsourakakis ICDM 2008] C. Faloutsos (CMU) ASN HEP-TH Epinions X-axis: # of Triangles a node participates in Y-axis: count of such nodes Hadoop Summit '10
    25. 25. Triangle Law: #S.3 [Tsourakakis ICDM 2008] C. Faloutsos (CMU) ASN HEP-TH Epinions X-axis: # of Triangles a node participates in Y-axis: count of such nodes Hadoop Summit '10
    26. 26. Triangle Law: #S.4 [Tsourakakis ICDM 2008] C. Faloutsos (CMU) SN Reuters Epinions X-axis: degree Y-axis: mean # triangles n friends -> ~ n 1.6 triangles Hadoop Summit '10
    27. 27. Triangle Law: Computations [Tsourakakis ICDM 2008] C. Faloutsos (CMU) But: triangles are expensive to compute (3-way join; several approx. algos) Q: Can we do that quickly? details Hadoop Summit '10
    28. 28. Triangle Law: Computations [Tsourakakis ICDM 2008] C. Faloutsos (CMU) But: triangles are expensive to compute (3-way join; several approx. algos) Q: Can we do that quickly? A: Yes! #triangles = 1/6 Sum (  i 3 ) (and, because of skewness, we only need the top few eigenvalues! details Hadoop Summit '10
    29. 29. Triangle Law: Computations [Tsourakakis ICDM 2008] C. Faloutsos (CMU) 1000x+ speed-up, >90% accuracy details Hadoop Summit '10
    30. 30. EigenSpokes <ul><li>B. Aditya Prakash, Mukund Seshadri, Ashwin Sridharan, Sridhar Machiraju and Christos Faloutsos: EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs, PAKDD 2010, Hyderabad, India, 21-24 June 2010. </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    31. 31. EigenSpokes <ul><li>Eigenvectors of adjacency matrix </li></ul><ul><ul><li>equivalent to singular vectors (symmetric, undirected graph) </li></ul></ul>C. Faloutsos (CMU) Hadoop Summit '10
    32. 32. EigenSpokes <ul><li>Eigenvectors of adjacency matrix </li></ul><ul><ul><li>equivalent to singular vectors (symmetric, undirected graph) </li></ul></ul>C. Faloutsos (CMU) Hadoop Summit '10 N N details
    33. 33. EigenSpokes <ul><li>EE plot: </li></ul><ul><li>Scatter plot of scores of u1 vs u2 </li></ul><ul><li>One would expect </li></ul><ul><ul><li>Many points @ origin </li></ul></ul><ul><ul><li>A few scattered ~randomly </li></ul></ul>C. Faloutsos (CMU) u1 u2 Hadoop Summit '10 1 st Principal component 2 nd Principal component
    34. 34. EigenSpokes <ul><li>EE plot: </li></ul><ul><li>Scatter plot of scores of u1 vs u2 </li></ul><ul><li>One would expect </li></ul><ul><ul><li>Many points @ origin </li></ul></ul><ul><ul><li>A few scattered ~randomly </li></ul></ul>C. Faloutsos (CMU) Hadoop Summit '10 u1 u2 90 o
    35. 35. EigenSpokes - pervasiveness <ul><li>Present in mobile social graph </li></ul><ul><ul><li>across time and space </li></ul></ul><ul><li>Patent citation graph </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    36. 36. EigenSpokes - explanation <ul><li>Near-cliques, or near-bipartite-cores, loosely connected </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    37. 37. EigenSpokes - explanation <ul><li>Near-cliques , or near-bipartite-cores, loosely connected </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    38. 38. EigenSpokes - explanation <ul><li>Near-cliques, or near-bipartite-cores , loosely connected </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    39. 39. EigenSpokes - explanation <ul><li>Near-cliques, or near-bipartite-cores, loosely connected </li></ul><ul><li>So what? </li></ul><ul><ul><li>Extract nodes with high s cores </li></ul></ul><ul><ul><li>high connectivity </li></ul></ul><ul><ul><li>Good “communities” </li></ul></ul>spy plot of top 20 nodes C. Faloutsos (CMU) Hadoop Summit '10
    40. 40. Bipartite Communities! magnified bipartite community patents from same inventor(s) cut-and-paste bibliography! C. Faloutsos (CMU) Hadoop Summit '10
    41. 41. Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><ul><li>Static graphs </li></ul></ul><ul><ul><ul><li>degree, diameter, eigen, </li></ul></ul></ul><ul><ul><ul><li>triangles </li></ul></ul></ul><ul><ul><ul><li>cliques </li></ul></ul></ul><ul><ul><li>Weighted graphs </li></ul></ul><ul><ul><li>Time evolving graphs </li></ul></ul><ul><li>Problem#2: Tools </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    42. 42. Observations on weighted graphs? <ul><li>A: yes - even more ‘laws’! </li></ul>C. Faloutsos (CMU) M. McGlohon, L. Akoglu, and C. Faloutsos Weighted Graphs and Disconnected Components: Patterns and a Generator. SIG-KDD 2008 Hadoop Summit '10
    43. 43. Observation W.1: Fortification <ul><ul><li>Q: How do the weights </li></ul></ul><ul><ul><li>of nodes relate to degree? </li></ul></ul>C. Faloutsos (CMU) Hadoop Summit '10
    44. 44. Observation W.1: Fortification C. Faloutsos (CMU) More donors, more $ ? $10 $5 Hadoop Summit '10 ‘ Reagan’ ‘ Clinton’ $7
    45. 45. Observation W.1: fortification: Snapshot Power Law <ul><li>Weight: super-linear on in-degree </li></ul><ul><li>exponent ‘iw’: 1.01 < iw < 1.26 </li></ul>Edges (# donors) In-weights ($) C. Faloutsos (CMU) Orgs-Candidates e.g. John Kerry, $10M received, from 1K donors More donors, even more $ $10 $5 Hadoop Summit '10
    46. 46. Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><ul><li>Static graphs </li></ul></ul><ul><ul><li>Weighted graphs </li></ul></ul><ul><ul><li>Time evolving graphs </li></ul></ul><ul><li>Problem#2: Tools </li></ul><ul><li>… </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    47. 47. Problem: Time evolution <ul><li>with Jure Leskovec (CMU -> Stanford) </li></ul><ul><li>and Jon Kleinberg (Cornell – sabb. @ CMU) </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    48. 48. T.1 Evolution of the Diameter <ul><li>Prior work on Power Law graphs hints at slowly growing diameter : </li></ul><ul><ul><li>diameter ~ O(log N) </li></ul></ul><ul><ul><li>diameter ~ O(log log N) </li></ul></ul><ul><li>What is happening in real data? </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    49. 49. T.1 Evolution of the Diameter <ul><li>Prior work on Power Law graphs hints at slowly growing diameter : </li></ul><ul><ul><li>diameter ~ O(log N) </li></ul></ul><ul><ul><li>diameter ~ O(log log N) </li></ul></ul><ul><li>What is happening in real data? </li></ul><ul><li>Diameter shrinks over time </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    50. 50. T.1 Diameter – “Patents” <ul><li>Patent citation network </li></ul><ul><li>25 years of data </li></ul><ul><li>@1999 </li></ul><ul><ul><li>2.9 M nodes </li></ul></ul><ul><ul><li>16.5 M edges </li></ul></ul>C. Faloutsos (CMU) time [years] diameter Hadoop Summit '10
    51. 51. T.2 Temporal Evolution of the Graphs <ul><li>N(t) … nodes at time t </li></ul><ul><li>E(t) … edges at time t </li></ul><ul><li>Suppose that </li></ul><ul><ul><li>N(t+1) = 2 * N(t) </li></ul></ul><ul><li>Q: what is your guess for </li></ul><ul><ul><li>E(t+1) =? 2 * E(t) </li></ul></ul>C. Faloutsos (CMU) Hadoop Summit '10
    52. 52. T.2 Temporal Evolution of the Graphs <ul><li>N(t) … nodes at time t </li></ul><ul><li>E(t) … edges at time t </li></ul><ul><li>Suppose that </li></ul><ul><ul><li>N(t+1) = 2 * N(t) </li></ul></ul><ul><li>Q: what is your guess for </li></ul><ul><ul><li>E(t+1) =? 2 * E(t) </li></ul></ul><ul><li>A: over-doubled! </li></ul><ul><ul><li>But obeying the ``Densification Power Law’’ </li></ul></ul>C. Faloutsos (CMU) Hadoop Summit '10
    53. 53. T.2 Densification – Patent Citations <ul><li>Citations among patents granted </li></ul><ul><li>@1999 </li></ul><ul><ul><li>2.9 M nodes </li></ul></ul><ul><ul><li>16.5 M edges </li></ul></ul><ul><li>Each year is a datapoint </li></ul>C. Faloutsos (CMU) N(t) E(t) 1.66 Hadoop Summit '10
    54. 54. Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><ul><li>Static graphs </li></ul></ul><ul><ul><li>Weighted graphs </li></ul></ul><ul><ul><li>Time evolving graphs </li></ul></ul><ul><li>Problem#2: Tools </li></ul><ul><li>… </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    55. 55. More on Time-evolving graphs C. Faloutsos (CMU) M. McGlohon, L. Akoglu, and C. Faloutsos Weighted Graphs and Disconnected Components: Patterns and a Generator. SIG-KDD 2008 Hadoop Summit '10
    56. 56. Observation T.3: NLCC behavior <ul><ul><li>Q: How do NLCC’s emerge and join with the GCC? </li></ul></ul><ul><ul><li>(``NLCC’’ = non-largest conn. components) </li></ul></ul><ul><ul><li>Do they continue to grow in size? </li></ul></ul><ul><ul><li>or do they shrink? </li></ul></ul><ul><ul><li>or stabilize? </li></ul></ul>C. Faloutsos (CMU) Hadoop Summit '10
    57. 57. Observation T.3: NLCC behavior <ul><li>After the gelling point, the GCC takes off, but NLCC’s remain ~constant (actually, oscillate). </li></ul>C. Faloutsos (CMU) IMDB CC size Time-stamp Hadoop Summit '10
    58. 58. Timing for Blogs <ul><li>with Mary McGlohon (CMU) </li></ul><ul><li>Jure Leskovec (CMU->Stanford) </li></ul><ul><li>Natalie Glance (now at Google) </li></ul><ul><li>Mat Hurst (now at MSR) </li></ul><ul><li>[SDM’07] </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    59. 59. T.4 : popularity over time C. Faloutsos (CMU) Post popularity drops-off – exponentially? lag: days after post # in links 1 2 3 @t @t + lag Hadoop Summit '10
    60. 60. T.4 : popularity over time C. Faloutsos (CMU) Post popularity drops-off – exponentially? POWER LAW! Exponent? # in links ( log ) 1 2 3 days after post ( log ) Hadoop Summit '10
    61. 61. T.4 : popularity over time C. Faloutsos (CMU) <ul><li>Post popularity drops-off – exponentially? </li></ul><ul><li>POWER LAW! </li></ul><ul><li>Exponent? -1.6 </li></ul><ul><li>close to -1.5: Barabasi’s stack model </li></ul><ul><li>and like the zero-crossings of a random walk </li></ul># in links ( log ) 1 2 3 -1.6 days after post ( log ) Hadoop Summit '10
    62. 62. Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><li>Problem#2: Tools </li></ul><ul><ul><li>CenterPiece Subgraphs; G-Ray </li></ul></ul><ul><ul><li>OddBall (anomaly detection) </li></ul></ul><ul><ul><li>PEGASUS </li></ul></ul><ul><li>Problem#3: Scalability </li></ul><ul><li>Conclusions </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    63. 63. CenterPiece Subgraphs <ul><li>Hanghang TONG et al, KDD’06 </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    64. 64. Ce nter- P iece S ubgraph Discovery [Tong+ KDD 06] Original Graph Q: Who is the most central node wrt the black nodes? (e.g., master-mind criminal, common advisor/collaborator, etc) Input C. Faloutsos (CMU) Hadoop Summit '10 B A C
    65. 65. Ce nter- P iece S ubgraph Discovery [Tong+ KDD 06] Q: How to find hub for the query nodes? Input: original graph Output: CePS CePS Node C. Faloutsos (CMU) A: Combine proximity scores (RWR) Hadoop Summit '10 B A C B A C
    66. 66. CePS : Example (AND Query) ? C. Faloutsos (CMU) Hadoop Summit '10 <ul><li>DBLP co-authorship network: </li></ul><ul><ul><li>400,000 authors, 2,000,000 edges </li></ul></ul><ul><li>Code at: http://www.cs.cmu.edu/~htong/soft.htm </li></ul>
    67. 67. CePS : Example (AND Query) C. Faloutsos (CMU) <ul><li>DBLP co-authorship network: </li></ul><ul><ul><li>400,000 authors, 2,000,000 edges </li></ul></ul><ul><li>Code at: http://www.cs.cmu.edu/~htong/soft.htm </li></ul>Hadoop Summit '10
    68. 68. Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><li>Problem#2: Tools </li></ul><ul><ul><li>CenterPiece Subgraphs; G-Ray </li></ul></ul><ul><ul><li>OddBall (anomaly detection) </li></ul></ul><ul><ul><li>PEGASUS </li></ul></ul><ul><li>Problem#3: Scalability </li></ul><ul><li>Conclusions </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    69. 69. G raph X -Ray: Fast Best-Effort Pattern Matching in Large Attributed Graphs Hanghang Tong, Brian Gallagher, Christos Faloutsos, Tina Eliassi-Rad KDD’07
    70. 70. Output Input Attributed Data Graph Query Graph Matching Subgraph Hadoop Summit '10 C. Faloutsos (CMU)
    71. 71. Effectiveness: star-query Query Result Hadoop Summit '10 C. Faloutsos (CMU)
    72. 72. Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><li>Problem#2: Tools </li></ul><ul><ul><li>CenterPiece Subgraphs </li></ul></ul><ul><ul><li>OddBall (anomaly detection) </li></ul></ul><ul><ul><li>Problem#3: Scalability - PEGASUS </li></ul></ul><ul><li>Conclusions </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    73. 73. OddBall: Spotting A n o m a l i e s in Weighted Graphs Leman Akoglu, Mary McGlohon, Christos Faloutsos Carnegie Mellon University School of Computer Science To appear in PAKDD 2010, Hyderabad, India
    74. 74. Main idea <ul><li>For each node, </li></ul><ul><li>extract ‘ego-net’ (=1-step-away neighbors) </li></ul><ul><li>Extract features (#edges, total weight, etc etc) </li></ul><ul><li>Compare with the rest of the population </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    75. 75. What is an egonet? ego egonet C. Faloutsos (CMU) Hadoop Summit '10
    76. 76. Selected Features <ul><li>N i : number of neighbors (degree) of ego i </li></ul><ul><li>E i : number of edges in egonet i </li></ul><ul><li>W i : total weight of egonet i </li></ul><ul><li>λ w,i : principal eigenvalue of the weighted adjacency matrix of egonet I </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    77. 77. Near-Clique/Star Hadoop Summit '10 C. Faloutsos (CMU)
    78. 78. Near-Clique/Star C. Faloutsos (CMU) Hadoop Summit '10
    79. 79. Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><li>Problem#2: Tools </li></ul><ul><ul><li>CenterPiece Subgraphs </li></ul></ul><ul><ul><li>OddBall (anomaly detection) </li></ul></ul><ul><ul><li>Problem#3: Scalability -PEGASUS </li></ul></ul><ul><li>Conclusions </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    80. 80. Outline – Algorithms & results C. Faloutsos (CMU) Hadoop Summit '10 Centralized Hadoop/PEGASUS Degree Distr. old old Pagerank old old Diameter/ANF old DONE Conn. Comp old DONE Triangles DONE Visualization STARTED
    81. 81. HADI for diameter estimation <ul><li>Radius Plots for Mining Tera-byte Scale Graphs U Kang , Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM’10 </li></ul><ul><li>Naively: diameter needs O(N**2) space and up to O(N**3) time – prohibitive (N~1B) </li></ul><ul><li>Our HADI: linear on E (~10B) </li></ul><ul><ul><li>Near-linear scalability wrt # machines </li></ul></ul><ul><ul><li>Several optimizations -> 5x faster </li></ul></ul>C. Faloutsos (CMU) Hadoop Summit '10
    82. 82. <ul><li>YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges) </li></ul><ul><li>Largest publicly available graph ever studied. </li></ul>???? ?? 19+? [Barabasi+] (‘99, O(10 6 ) nodes) C. Faloutsos (CMU) Radius Count Hadoop Summit '10
    83. 83. <ul><li>YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges) </li></ul><ul><li>Largest publicly available graph ever studied. </li></ul>???? C. Faloutsos (CMU) Radius Count Hadoop Summit '10 14 (dir.) ~7 (undir.) 19+? [Barabasi+] (‘99, O(10 6 ) nodes)
    84. 84. <ul><li>YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges) </li></ul><ul><li>effective diameter: surprisingly small. </li></ul>C. Faloutsos (CMU) Hadoop Summit '10 Shape?
    85. 85. <ul><li>YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges) </li></ul><ul><li>effective diameter: surprisingly small. </li></ul><ul><li>Multi-modality: probably mixture of cores . </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    86. 86. C. Faloutsos (CMU) <ul><li>YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges) </li></ul><ul><li>effective diameter: surprisingly small. </li></ul><ul><li>Multi-modality: probably mixture of cores . </li></ul>Hadoop Summit '10
    87. 87. Radius Plot of GCC of YahooWeb. C. Faloutsos (CMU) Hadoop Summit '10
    88. 88. Running time - Kronecker and Erdos-Renyi Graphs with billions edges. details
    89. 89. Outline – Algorithms & results C. Faloutsos (CMU) Hadoop Summit '10 Centralized Hadoop/PEGASUS Degree Distr. old old Pagerank old old Diameter/ANF old DONE Conn. Comp old DONE Triangles DONE Visualization STARTED
    90. 90. Generalized Iterated Matrix Vector Multiplication (GIMV) C. Faloutsos (CMU) PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations . U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. ( ICDM ) 2009, Miami, Florida, USA. Best Application Paper (runner-up) . Hadoop Summit '10
    91. 91. Generalized Iterated Matrix Vector Multiplication (GIMV) C. Faloutsos (CMU) <ul><li>PageRank </li></ul><ul><li>proximity (RWR) </li></ul><ul><li>Diameter </li></ul><ul><li>Connected components </li></ul><ul><li>(eigenvectors, </li></ul><ul><li>Belief Prop. </li></ul><ul><li>… ) </li></ul>Matrix – vector Multiplication (iterated) Hadoop Summit '10 details
    92. 92. Example: GIM-V At Work <ul><li>Connected Components </li></ul>Size Count C. Faloutsos (CMU) Hadoop Summit '10
    93. 93. Example: GIM-V At Work <ul><li>Connected Components </li></ul>Size Count C. Faloutsos (CMU) Hadoop Summit '10 ~0.7B singleton nodes
    94. 94. Example: GIM-V At Work <ul><li>Connected Components </li></ul>Size Count C. Faloutsos (CMU) Hadoop Summit '10
    95. 95. Example: GIM-V At Work <ul><li>Connected Components </li></ul>Size Count 300-size cmpt X 500. Why? 1100-size cmpt X 65. Why? C. Faloutsos (CMU) Hadoop Summit '10
    96. 96. Example: GIM-V At Work <ul><li>Connected Components </li></ul>Size Count suspicious financial-advice sites (not existing now) C. Faloutsos (CMU) Hadoop Summit '10
    97. 97. GIM-V At Work <ul><li>Connected Components over Time </li></ul><ul><li>LinkedIn: 7.5M nodes and 58M edges </li></ul>Stable tail slope after the gelling point C. Faloutsos (CMU) Hadoop Summit '10
    98. 98. Outline – Algorithms & results C. Faloutsos (CMU) Hadoop Summit '10 Centralized Hadoop/PEGASUS Degree Distr. old old Pagerank old old Diameter/ANF old DONE Conn. Comp old DONE Triangles DONE Visualization STARTED
    99. 99. Triangles : Computations [Tsourakakis ICDM 2008] C. Faloutsos (CMU) But: triangles are expensive to compute (3-way join; several approx. algos) Q: Can we do that quickly? A: Yes! #triangles = 1/6 Sum (  i 3 ) (and, because of skewness, we only need the top few eigenvalues! Mentioned already Hadoop Summit '10
    100. 100. Triangle Law: #1 [Tsourakakis ICDM 2008] C. Faloutsos (CMU) ASN HEP-TH Epinions X-axis: # of Triangles a node participates in Y-axis: count of such nodes Mentioned already Hadoop Summit '10
    101. 101. Outline – Algorithms & results C. Faloutsos (CMU) Hadoop Summit '10 Centralized Hadoop/PEGASUS Degree Distr. old old Pagerank old old Diameter/ANF old DONE Conn. Comp old DONE Triangles DONE Visualization STARTED
    102. 102. Visualization: ShiftR <ul><li>Supporting Ad Hoc Sensemaking: Integrating Cognitive, HCI, and Data Mining Approaches Aniket Kittur, Duen Horng (‘Polo’) Chau , Christos Faloutsos, Jason I. Hong Sensemaking Workshop at CHI 2009, April 4-5. Boston, MA, USA. </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    103. 103. C. Faloutsos (CMU) Hadoop Summit '10
    104. 104. Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><li>Problem#2: Tools </li></ul><ul><li>Problem#3: Scalability </li></ul><ul><li>(additional topics, skipped) </li></ul><ul><li>Conclusions </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    105. 105. Other topics - part#1 - tools <ul><li>Community detection – how many? </li></ul><ul><ul><li>Cross-Associations [Chakrabarti +, KDD 2004] </li></ul></ul><ul><li>Time-evolving graphs </li></ul><ul><ul><li>Tensors [Sun+, KDD’06], </li></ul></ul><ul><ul><li>[Kolda+ ICDM’05] </li></ul></ul><ul><ul><li>GraphScope [Sun+, KDD’07] </li></ul></ul><ul><li>Graph compression </li></ul><ul><ul><li>CUR decomposition [Sun+ SDM’07] </li></ul></ul>C. Faloutsos (CMU) Hadoop Summit '10
    106. 106. Tensors <ul><li>Adjacency matrices, stacked (over time, and/or edge-type – ‘composite networks’) </li></ul>Hadoop Summit '10 C. Faloutsos (CMU) keyword 1990 Author
    107. 107. Tensors <ul><li>Adjacency matrices, stacked (over time, and/or edge-type – ‘composite networks’) </li></ul>Hadoop Summit '10 C. Faloutsos (CMU) keyword 1991 1992 1990 Author
    108. 108. Tensors <ul><li>Adjacency matrices, stacked (over time, and/or edge-type – ‘composite networks’) </li></ul>~ + PARAFAC tensor decomposition (generalization of SVD) Hadoop Summit '10 C. Faloutsos (CMU) keyword 1991 1992 1990 Author
    109. 109. Other topics – part#2 - generators <ul><li>Kronecker [PKDD’05]; </li></ul><ul><li>Random Typing [Akoglu+, PKDD’09] </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    110. 110. Kronecker Product – a Graph <ul><li>One of most realistic generators, with provable properties </li></ul>Hadoop Summit '10 C. Faloutsos (CMU)
    111. 111. Other topics - part#3 – virus propagation <ul><li>Epidemic threshold for SIS: depends only on first eigenvalue of adjacency matrix </li></ul><ul><li>[Chakrabarti+, TISSEC’07] </li></ul><ul><li>Immunization policies [Tong+, under review] </li></ul><ul><li>Drinking water sensor placement [KDD’07] </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    112. 112. More info <ul><li>Tutorial on graph mining: KDD’09 </li></ul><ul><li>(w/ Gary Miller and C. Tsourakakis) </li></ul><ul><li>www.cs.cmu.edu/~christos/TALKS/09-KDD-tutorial/ </li></ul><ul><li>Tutorial on tensors: SIGMOD’07 </li></ul><ul><li>(w/ T. Kolda and J. Sun): </li></ul><ul><li>www.cs.cmu.edu/~christos/TALKS/SIGMOD-07-tutorial/ </li></ul>Hadoop Summit '10 C. Faloutsos (CMU)
    113. 113. Outline <ul><li>Introduction – Motivation </li></ul><ul><li>Problem#1: Patterns in graphs </li></ul><ul><li>Problem#2: Tools </li></ul><ul><li>Problem#3: Scalability </li></ul><ul><li>(additional topics, skipped) </li></ul><ul><li>Conclusions </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    114. 114. OVERALL CONCLUSIONS – low level: <ul><li>Several new patterns (fortification, triangle-laws, conn. components, etc) </li></ul><ul><li>New tools : </li></ul><ul><ul><li>CenterPiece Subgraphs, G-Ray, anomaly detection (OddBall), EigenSpokes </li></ul></ul><ul><li>Scalability : PEGASUS / hadoop </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    115. 115. OVERALL CONCLUSIONS – high level <ul><li>Large datasets may reveal patterns/outliers that would be invisible otherwise </li></ul><ul><li>Terrific opportunities </li></ul><ul><ul><li>Large datasets, easily(*) available PLUS </li></ul></ul><ul><ul><li>s/w and h/w developments </li></ul></ul><ul><li>Promising collaborations between DB/Sys, AI/Stat, sociology, marketing, epidemiology, ++ </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    116. 116. References <ul><li>Leman Akoglu, Christos Faloutsos: RTG: A Recursive Realistic Graph Generator Using Random Typing . ECML/PKDD (1) 2009: 13-28 </li></ul><ul><li>Deepayan Chakrabarti, Christos Faloutsos: Graph mining: Laws, generators, and algorithms . ACM Comput. Surv. 38(1): (2006) </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    117. 117. References <ul><li>Deepayan Chakrabarti, Yang Wang, Chenxi Wang, Jure Leskovec, Christos Faloutsos: Epidemic thresholds in real networks. ACM Trans. Inf. Syst. Secur. 10(4): (2008) </li></ul><ul><li>Deepayan Chakrabarti, Jure Leskovec, Christos Faloutsos, Samuel Madden, Carlos Guestrin, Michalis Faloutsos: Information Survival Threshold in Sensor and P2P Networks . INFOCOM 2007: 1316-1324 </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    118. 118. References <ul><li>Christos Faloutsos, Tamara G. Kolda, Jimeng Sun: Mining large graphs and streams using matrix and tensor tools . Tutorial, SIGMOD Conference 2007: 1174 </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    119. 119. References <ul><li>T. G. Kolda and J. Sun. Scalable Tensor Decompositions for Multi-aspect Data Mining . In: ICDM 2008, pp. 363-372, December 2008. </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    120. 120. References <ul><li>Jure Leskovec, Jon Kleinberg and Christos Faloutsos Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations , KDD 2005 (Best Research paper award). </li></ul><ul><li>Jure Leskovec, Deepayan Chakrabarti, Jon M. Kleinberg, Christos Faloutsos: Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication . PKDD 2005: 133-145 </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    121. 121. References <ul><li>Jimeng Sun, Yinglian Xie, Hui Zhang, Christos Faloutsos. Less is More: Compact Matrix Decomposition for Large Sparse Graphs , SDM, Minneapolis, Minnesota, Apr 2007. </li></ul><ul><li>Jimeng Sun, Spiros Papadimitriou, Philip S. Yu, and Christos Faloutsos, GraphScope: Parameter-free Mining of Large Time-evolving Graphs ACM SIGKDD Conference, San Jose, CA, August 2007 </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    122. 122. References <ul><li>Jimeng Sun, Dacheng Tao, Christos Faloutsos: Beyond streams and graphs: dynamic tensor analysis . KDD 2006: 374-383 </li></ul>Hadoop Summit '10 C. Faloutsos (CMU)
    123. 123. References <ul><li>Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan, Fast Random Walk with Restart and Its Applications , ICDM 2006, Hong Kong. </li></ul><ul><li>Hanghang Tong, Christos Faloutsos, Center-Piece Subgraphs: Problem Definition and Fast Solutions , KDD 2006, Philadelphia, PA </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    124. 124. References <ul><li>Hanghang Tong, Christos Faloutsos, Brian Gallagher, Tina Eliassi-Rad: Fast best-effort pattern matching in large attributed graphs. KDD 2007: 737-746 </li></ul>C. Faloutsos (CMU) Hadoop Summit '10
    125. 125. Joint papers with LLNL <ul><li>Keith Henderson, Tina Eliassi-Rad, Spiros Papadimitriou, Christos Faloutsos: HCDF: A Hybrid Community Discovery Framework. SDM 2010:754-765 </li></ul><ul><li>Hanghang Tong, Yasushi Sakurai, Tina Eliassi-Rad, Christos Faloutsos: Fast mining of complex time-stamped events. CIKM 2008:759-768 </li></ul><ul><li>Jimeng Sun, Charalampos E. Tsourakakis, Evan Hoke, Christos Faloutsos, Tina Eliassi-Rad: Two heads better than one: pattern discovery in time-evolving multi-aspect data. Data Min. Knowl. Discov. (DATAMINE) 17(1):111-128 (2008) </li></ul>Hadoop Summit '10 C. Faloutsos (CMU)
    126. 126. Joint papers with LLNL <ul><li>Jimeng Sun, Charalampos E. Tsourakakis, Evan Hoke, Christos Faloutsos, Tina Eliassi-Rad: Two Heads Better Than One: Pattern Discovery in Time-Evolving Multi-aspect Data. ECML/PKDD 2008:22 </li></ul><ul><li>Duen Horng Chau, Christos Faloutsos, Hanghang Tong, Jason I. Hong, Brian Gallagher, Tina Eliassi-Rad: GRAPHITE: A Visual Query System for Large Graphs. ICDM Workshops 2008:963-966 </li></ul>Hadoop Summit '10 C. Faloutsos (CMU)
    127. 127. Joint papers with LLNL <ul><li>Brian Gallagher, Hanghang Tong, Tina Eliassi-Rad, Christos Faloutsos: Using ghost edges for classification in sparsely labeled networks. KDD 2008:256-264 </li></ul><ul><li>Hanghang Tong, Christos Faloutsos, Brian Gallagher, Tina Eliassi-Rad: Fast best-effort pattern matching in large attributed graphs. KDD 2007:737-746 </li></ul>Hadoop Summit '10 C. Faloutsos (CMU)
    128. 128. Project info <ul><li>www.cs.cmu.edu/~pegasus </li></ul>C. Faloutsos (CMU) Akoglu, Leman Chau, Polo Kang, U McGlohon, Mary Tsourakakis, Babis Tong, Hanghang Prakash, Aditya Hadoop Summit '10 Thanks to: Yahoo (M45 + gifts + data) NSF, LLNL, CTA-INARC, IBM, SPRINT, INTEL, HP

    ×