Graph Data Mining at Scale
Nima Sarshar, Ph.D.
nima.sarshar@gmail.com
My Goals for this Talk
 You leave with your inner computer scientist tantalized:
 There is more to writing efficient Map-Reduce algorithms
than counting words and merging logs
 You get a general sense of the state of the research
 I convince you of the need for a real graph processing
package for Hadoop
 You know a bit about our work at Intuit
Plan
 Jump right to it with an example (enumerating triangles)
 Define the performance metrics (what are we optimizing
for?)
 Give a classification of known “recipes”
 The triangle example with a new trick
 Personalized PageRank, connected components
 A list of other algorithms
Finding Triangles with Map-Reduce
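[Figure: example graph on nodes 1 to 4 with edges (1,3), (2,3), (2,4), (3,4). Each edge is binned under both of its end nodes, and every pair of edges in a bin is a potential triangle.]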
5 Potential Triangles to Consider
Another round of Map Reduce jobs
will check for the existence of the
“closing” edge
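A minimal sketch of the two rounds in plain Python, with the shuffle simulated by a dictionary. The edge list is the four-edge example from the slide; the function names are illustrative and not part of the talk.

```python
from collections import defaultdict
from itertools import combinations

def potential_triangles(edges):
    """Round 1: bin every edge under both of its end nodes, then pair up
    the neighbors in each bin. Each pair is an open wedge that still
    needs its closing edge."""
    bins = defaultdict(set)
    for u, v in edges:                      # map: emit the edge under u and under v
        bins[u].add(v)
        bins[v].add(u)
    wedges = []
    for center, nbrs in bins.items():       # reduce: one bin per node
        for a, b in combinations(sorted(nbrs), 2):
            wedges.append((center, a, b))
    return wedges

def close_triangles(edges, wedges):
    """Round 2: keep only the wedges whose closing edge (a, b) exists."""
    edge_set = {frozenset(e) for e in edges}
    return [w for w in wedges if frozenset(w[1:]) in edge_set]

edges = [(1, 3), (2, 3), (2, 4), (3, 4)]
wedges = potential_triangles(edges)         # 5 potential triangles
triangles = close_triangles(edges, wedges)  # the triangle {2, 3, 4}, reported 3 times
```

The single triangle comes back once per vertex, which is the triple counting discussed next.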
Problems with this Approach
1. Each triangle will be detected 3 times – once under
each of its 3 vertices
2. Too many “potential” triangles are created in the first
reduce step.
 For a node with degree d: $\binom{d}{2} \sim O(d^2)$
 Total # of records: $\sum_{v \in V} d_v^2 = |V| \sum_k p_k k^2 = N \langle k^2 \rangle$
Modified Algorithm [Cohen ‘08]
[Figure: the same example graph, with each edge binned only under its lowest value end node.]
For each triangle exactly one potential
triangle is created (under the lowest value
node)
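A small sketch of the modified binning, assuming integer node ids so that "lowest value" is simply `min`. Both rounds are folded into one function for brevity; the function name is illustrative.

```python
from collections import defaultdict
from itertools import combinations

def triangles_lowest_node(edges):
    """Bin each edge only under its lower-numbered end, then test every
    pair in a bin for its closing edge."""
    bins = defaultdict(set)
    for u, v in edges:
        bins[min(u, v)].add(max(u, v))
    edge_set = {frozenset(e) for e in edges}
    wedges, found = [], []
    for low, nbrs in bins.items():
        for a, b in combinations(sorted(nbrs), 2):
            wedges.append((low, a, b))
            if frozenset((a, b)) in edge_set:
                found.append((low, a, b))
    return wedges, found

edges = [(1, 3), (2, 3), (2, 4), (3, 4)]
wedges, found = triangles_lowest_node(edges)
```

On the example graph this creates a single potential triangle, under node 2, instead of five.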
The quadratic problem still persists
 This is neat. At least we are not triple counting
 But the quadratic problem still exists. The number of records is still $O(N \langle k^2 \rangle)$
 We want to avoid binning edges under high degree
nodes
 The ordering of nodes is arbitrary! Let the degree of a
node define its order.
Bin an edge under its LOW DEGREE node
Break ties arbitrarily, but consistently
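A sketch of the degree-based ordering. Ties are broken by node id here, which is one arbitrary but consistent choice; the star-plus-one-edge test graph is made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

def triangles_low_degree(edges):
    """Bin each edge under its lower-degree end, so a high-degree hub
    collects few edges and creates few potential triangles."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    def rank(n):
        return (degree[n], n)               # degree first, node id breaks ties

    bins = defaultdict(set)
    for u, v in edges:
        low, high = (u, v) if rank(u) < rank(v) else (v, u)
        bins[low].add(high)

    edge_set = {frozenset(e) for e in edges}
    wedges, found = 0, []
    for low, nbrs in bins.items():
        for a, b in combinations(sorted(nbrs, key=rank), 2):
            wedges += 1
            if frozenset((a, b)) in edge_set:
                found.append(tuple(sorted((low, a, b))))
    return wedges, found

# Hub 0 with five spokes, plus one edge between two of the leaves.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 2)]
wedges, found = triangles_low_degree(edges)
```

Binning by lowest node id would put all five spokes under node 0 and create 10 potential triangles; binning by degree creates one.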
The performance
 Worst case: $\Theta(M^{3/2})$ records vs. $\Theta(M^2)$
 The same as the best serial algorithm [Suri ‘11]
 The gain for “real” graphs is fairly substantial. If a graph is reasonably random, it cuts down to: $N \langle k \rangle^2$ vs. $N \langle k^2 \rangle$
 For a heavy-tailed social graph (like our Commercial Graph), the gain can be very large, because $\langle k^2 \rangle$ is far bigger than $\langle k \rangle^2$
Enumerating Rectangles
 Triangles will tell you the friends you have in common with
another friend
 “People you May Know”: Find another node, not
connected to you, who has many friends in common with
you. That node is a good candidate for “friendship”.
 Basis of User Based or Content Based collaborative filtering
 If the graph is bipartite
Generalization to Rectangles
There are 4 classes for a rectangle, which requires a bit more work.
Ordering the nodes of a triangle gives a unique equivalence class.
[Figure: orderings of a rectangle on nodes 1 to 4, next to a triangle on nodes A, B, C.]
Performance Metrics
 Computation:
 Total computation in all mappers and reducers
 Communication:
 How many bits are shuffled from the mapper to the reducer
 Number of map-reduce steps:
 Each step adds the overhead of running a job
 You can fold this into the two costs above
“Recipes” for Graph MR Algorithms
Roughly two classes of algorithms:
1. Partition-Compute then Merge
 Create smaller sub-graphs that each fit into the memory of a single machine
 Do computation on the small graphs
 Construct the final answer from the answers to the small
sub-problems
2. Compute-in-Parallel then Merge
Partition-Compute-Merge
Finding Triangles By Partitioning
[Suri ‘11]
1. Partition the nodes into b sets: $V = V_1 \cup V_2 \cup \dots \cup V_b$, with $V_i \cap V_j = \emptyset$ for $i \neq j$
2. For every 3 sets $V_{i,j,k} = V_i \cup V_j \cup V_k$, $i < j < k$, create a reducer.
3. Send an edge to $V_{i,j,k}$ iff both its ends are in $V_{i,j,k}$
4. Detect triangles using a serial algorithm within each reducer
b=4, V1={1}, V2={2}, V3={3}, V4={4}
[Figure: the example graph with edges (1,3), (2,3), (2,4), (3,4), and the edges received by the reducers V1,2,3, V1,3,4 and V2,3,4.]
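The partitioning steps can be sketched as follows. The `part` function (node to partition index) and the simple neighbor-intersection triangle finder are assumptions made for this sketch; the map step loops over all reducers for clarity rather than speed.

```python
from itertools import combinations

def partition_triangles(edges, b, part):
    """One reducer per triple of partitions i < j < k. An edge goes to a
    reducer iff both of its ends fall in that triple."""
    reducers = {t: [] for t in combinations(range(b), 3)}
    for u, v in edges:                               # map
        pu, pv = part(u), part(v)
        for t in reducers:
            if pu in t and pv in t:
                reducers[t].append((u, v))
    found = []
    for t, local in reducers.items():                # reduce: serial finder
        adj = {}
        for u, v in local:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        for u, v in local:
            for w in adj[u] & adj[v]:
                if w > max(u, v):                    # report once per reducer
                    found.append((t, tuple(sorted((u, v, w)))))
    return reducers, found

edges = [(1, 3), (2, 3), (2, 4), (3, 4)]
# b = 4 with one node per partition, as in the example: node n -> index n - 1
reducers, found = partition_triangles(edges, 4, lambda n: n - 1)
```

Here the reducer for partitions (1, 2, 3), which is V2,3,4 in the slide's 1-based naming, receives three edges and reports the triangle {2, 3, 4}.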
Analysis
 Every triangle is detected. All 3 of its vertices are guaranteed to fall together in at least one set $V_{i,j,k}$
 Average # edges in each reducer is $O(M/b^2)$
 Use an optimal serial triangle finder at each reducer. The total amount of work at all reducers is: $\left(\frac{M}{b^2}\right)^{3/2} \times b^3 = O(M^{3/2})$
 # of edges sent from the mappers to reducers (communication cost) is $O(bM) = O(M^{3/2})$ for $b = \sqrt{M}$
One Problem
 Each triangle may be detected multiple times. If all three vertices are mapped to the same partition, it will be detected $\binom{b-1}{2} \sim O(b^2)$ times, once in every reducer whose three sets include that partition
 This can be fixed with a similar ordering-of-nodes trick [Afrati ’12]
 Can be generalized to detect other small graph structures efficiently [Afrati ‘12]
Minimum Weight Spanning Tree
1. Partition the nodes into b sets
2. For every pair of sets create a reducer
3. Send all edges that have both their ends in one pair to
the corresponding reducer
4. Compute the minimum spanning tree for the graph in each reducer. Remove the edges that are not in that tree to sparsify the graph
5. Compute the MST for the sparsified graph
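A sketch of these five steps, assuming b >= 2, weighted edges given as (weight, u, v) tuples, and a hypothetical `part` function that maps a node to its partition index. Kruskal's algorithm stands in for whatever serial MST routine each reducer would use.

```python
from itertools import combinations

def kruskal(nodes, edges):
    """Minimum spanning forest of (weight, u, v) edges via union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

def filtered_mst(nodes, edges, b, part):
    """Steps 1-4: one reducer per pair of partitions keeps only its local
    spanning forest. Step 5: MST of the union of the surviving edges."""
    kept = set()
    for i, j in combinations(range(b), 2):
        local_nodes = [n for n in nodes if part(n) in (i, j)]
        local_edges = [e for e in edges
                       if part(e[1]) in (i, j) and part(e[2]) in (i, j)]
        kept.update(kruskal(local_nodes, local_edges))
    return kruskal(nodes, kept)

nodes = list(range(6))
edges = [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 1, 3), (7, 2, 3),
         (5, 3, 4), (6, 4, 5), (8, 2, 5), (9, 0, 3)]
mst = filtered_mst(nodes, edges, b=3, part=lambda n: n % 3)
```

An edge dropped by a reducer is the heaviest edge on some cycle, so it cannot belong to the global MST; that is why the final tree matches the one computed directly on the full graph.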
Compute-in-parallel and merge
Personalized PageRank
 Like the global PageRank:
 But the random walker jumps back to where it started with probability d
 For every v you will have a personalized PageRank vector of length N.
 We usually keep only a limited number of top personalized
PageRanks for each node.
 It finds the influential nodes in the proximity of a given
node.
Monte Carlo Approximation
Simulate many random walks from every single node. For
each walk:
1. A walk starting from node v is identified by v
 Keep track of <v, $U_{v,t}$>, where $U_{v,t}$ is the current end point at step t for the walk starting at node v
2. In each Map-Reduce step advance the walk by 1 step
 Pick a random neighbor of $U_{v,t}$
3. Count the frequency of visits to each node
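A single-machine sketch of the Monte Carlo estimate for one start node. Each pass of the outer loop stands for one Map-Reduce step. The restart with probability d follows the personalized PageRank description; the parameter defaults and the function name are assumptions for illustration.

```python
import random
from collections import Counter

def personalized_pagerank_mc(adj, source, walks=2000, steps=20, d=0.15, seed=0):
    """Estimate the personalized PageRank vector of `source` from the
    visit frequencies of many random walks that start there."""
    rng = random.Random(seed)
    ends = [source] * walks             # one record <v, U_{v,t}> per walk
    visits = Counter()
    for _ in range(steps):              # one Map-Reduce step per iteration
        advanced = []
        for u in ends:
            if rng.random() < d or not adj[u]:
                u = source              # jump back to the start node
            else:
                u = rng.choice(adj[u])  # move to a random neighbor
            visits[u] += 1
            advanced.append(u)
        ends = advanced
    total = sum(visits.values())
    return {node: count / total for node, count in visits.items()}

adj = {1: [3], 2: [3, 4], 3: [1, 2, 4], 4: [2, 3]}
ppr = personalized_pagerank_mc(adj, source=1)
```

The returned frequencies sum to 1. In practice only the top few entries per start node would be kept.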
One can do better [Das Sarma ‘08]
This takes T steps for a walk of length T
 We can cut it down to $O(T^{1/2})$ by a simple “stitching” idea
1. Do T/J random walks of length J from every node, for some J
2. To form a walk of length T, pick one of the T/J segments at random and jump to the end of the segment
3. Pick another random segment, etc.
4. If you arrive at a node twice, do not use the same segment (that’s why you need T/J segments)
Total iterations: J + T/J → minimized when J = $T^{1/2}$ → $O(T^{1/2})$
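The choice of J follows from a one-line minimization, treating J as a continuous variable (an assumption made only for this sketch):

```latex
f(J) = J + \frac{T}{J}, \qquad
f'(J) = 1 - \frac{T}{J^{2}} = 0 \;\Rightarrow\; J = \sqrt{T}, \qquad
f(\sqrt{T}) = 2\sqrt{T} = O(T^{1/2})
```

For example, a walk of length T = 100 needs about 10 + 10 = 20 iterations instead of 100.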
Exponential speed up [Bahmani ‘11]
 The stitching was done somewhat serially (at each step,
one segment was stitched to another)
 Idea: Stitch recursively, which will result in exponentially expanding the walk/segment ratio
 It takes a few more tricks to make it work, but you can bring it down to O(log T) iterations
Labeling Connected Components
 Assign the same ID to all nodes inside the same
component
How do we do it on one machine?
1. i=1
2. Pick a random node that has no id yet, assign it id=i and put it on a stack. Stop when no such node remains
3. Pop a node from the stack, push all its neighbors we have not seen before onto the stack. Assign them id=i
4. If the stack is not empty go to 3, otherwise i ← i+1 and go to 2
Time and memory complexity O(M).
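The steps above can be sketched as follows. The slide picks seeds at random; this sketch takes them in dictionary order, which does not change the result. The six-node adjacency list is made up for illustration.

```python
def label_components(adj):
    """Stack-based labeling: every node reachable from a seed gets the
    seed's component id."""
    label = {}
    i = 0
    for seed in adj:
        if seed in label:
            continue                    # already reached from an earlier seed
        i += 1
        label[seed] = i
        stack = [seed]
        while stack:
            node = stack.pop()
            for nbr in adj[node]:
                if nbr not in label:    # label on push so a node is pushed once
                    label[nbr] = i
                    stack.append(nbr)
    return label

adj = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
labels = label_components(adj)
```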
In Map-Reduce: More Parallelism
 Instead of growing a frontier zone from a single seed, start
growing it from all nodes. When two zones meet, merge them
Edge File
<v1,v2>
<v2,v3>
<v3,v4>
Zone File
<v1,z1>
<v2,z2>
<v3,z3>
<v4,z4>
Game Plan
1. Bin Zone and Edge by Node
   v1: <v1,v2>, <v1,z1>
   v2: <v2,v1>, <v2,v3>, <v2,z2>
   v3: <v3,v2>, <v3,v4>, <v3,z3>
   v4: <v4,v3>, <v4,z4>
2. Bin edge to zone map
   v1: <[v1,v2],z1>
   v2: <[v1,v2],z2>, <[v2,v3],z2>
   v3: <[v2,v3],z3>, <[v3,v4],z3>
   v4: <[v3,v4],z4>
3. Collect over edges
   <[v1,v2],z1>, <[v1,v2],z2>
   <[v2,v3],z2>, <[v2,v3],z3>
   <[v3,v4],z3>, <[v3,v4],z4>
4. A zone to zone map
   <z2,z1>
   <z3,z2>
   <z4,z3>
5. Reconcile zones
   <z2,v2>, <z2,z1>
   <z3,v3>, <z3,z2>
   <z4,v4>, <z4,z3>
6. Reassign zones to nodes
New Zone File
<v1,z1>
<v2,z1>
<v3,z2>
<v4,z3>
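One round of this plan can be sketched on a single machine as below. Zones are represented by integer ids and every zone moves to the smallest zone id it touches, which is one possible reconciliation rule and an assumption of this sketch. The function names are illustrative.

```python
def merge_round(edges, zone):
    """One round: join edges with the zone file, build the zone-to-zone
    map, then reassign every node to its zone's new id."""
    best = {}
    for u, v in edges:
        zu, zv = zone[u], zone[v]
        low = min(zu, zv)
        for z in (zu, zv):              # each zone remembers the smallest zone it meets
            best[z] = min(best.get(z, z), low)
    return {node: best.get(z, z) for node, z in zone.items()}

def label_by_zones(edges, nodes):
    """Repeat rounds until the zone file stops changing."""
    zone = {n: n for n in nodes}        # every node starts as its own zone
    rounds = 0
    while True:
        new_zone = merge_round(edges, zone)
        if new_zone == zone:
            return zone, rounds
        zone, rounds = new_zone, rounds + 1

edges = [(1, 2), (2, 3), (3, 4)]        # the path v1 - v2 - v3 - v4
first = merge_round(edges, {1: 1, 2: 2, 3: 3, 4: 4})
final, rounds = label_by_zones(edges, [1, 2, 3, 4])
```

The first round reproduces the New Zone File shown here (v1 and v2 in z1, v3 in z2, v4 in z3), and the path needs three rounds to collapse into one zone.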
Analysis
 Communication: O(M+N)
 Number of rounds: O(d) where d is the diameter of the graph.
Most real graphs have small diameters.
 Random graph: d=O(log N)
 The worst case is a “path-graph”, whose diameter grows linearly with N
 An algorithm with O(M+N) communication and O(log N) rounds exists for all graphs [Rastogi ’12]
 Uses an idea similar to MinHash
References
 Cohen, Jonathan. "Graph twiddling in a MapReduce world." Computing in Science & Engineering 11.4 (2009): 29-41.
 Suri, Siddharth, and Sergei Vassilvitskii. "Counting triangles and the curse of the last reducer." Proceedings of the 20th international conference on World wide web. ACM, 2011.
 Bahmani, Bahman, Kaushik Chakrabarti, and Dong Xin. "Fast personalized pagerank on mapreduce." Proceedings of the 37th SIGMOD international conference on Management of data. 2011.
 Das Sarma, A., S. Gollapudi, and R. Panigrahy. "Estimating PageRank on graph streams." In PODS, pages 69–78, 2008.
 Afrati, Foto N., Dimitris Fotakis, and Jeffrey D. Ullman. "Enumerating Subgraph Instances Using Map-Reduce." http://arxiv.org/abs/1208.0615, 2012.
 Lattanzi, Silvio, et al. "Filtering: a method for solving graph problems in mapreduce." 2011.
