3. Motivation
• The reason Big Data is here:
– To make it feasible to process data that is impossible
or overwhelming to process with our existing systems.
• Some graph databases may be too big for a
single machine
– Easier for a distributed system, which shares the load
• A graph database may itself be scattered around
the globe
– e.g., Google search records
4. Distributed Graph Mining
• Partition-based
• Divide the problem into independent sub-problems
– Each node of the system can process its
sub-problem independently
– Parallel processing
– Speeds up computation
– Enhances the scalability of solutions
6. Map Reduce
• A programming model for distributed
platforms.
• Proposed by Google
• Abundant open-source implementations
– Hadoop
• Divides the problem into sub-problems to be
processed on the nodes
– Map
• Combines the processing results
– Reduce
7. Map Reduce Example
• Problem: Find the frequency of a word in the documents
available on a system.
[Figure: three documents on the distributed system each
contain occurrences of "word". Map emits a <word, count>
pair per document: <word, 2>, <word, 1>, <word, 2>;
Reduce combines them: <word, 2 + 1 + 2 = 5>.]
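• A minimal Python sketch of this example (the document contents and all
names are illustrative, not from the slides): each Map call emits one
<word, count> pair per word in a document, and Reduce sums the partial
counts per word.

    from collections import defaultdict

    def map_phase(text):
        # Map: count the words of a single document, emit <word, count>.
        counts = defaultdict(int)
        for word in text.split():
            counts[word] += 1
        return counts.items()

    def reduce_phase(word, partial_counts):
        # Reduce: combine the partial counts emitted for one word.
        return word, sum(partial_counts)

    # Three "documents" spread over the distributed system.
    docs = ["word x word", "y word z", "word word y"]

    # Shuffle: group the Map outputs by key, as the framework would.
    grouped = defaultdict(list)
    for text in docs:
        for word, count in map_phase(text):
            grouped[word].append(count)

    for word, partial_counts in grouped.items():
        print(reduce_phase(word, partial_counts))  # ('word', 5), ...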
8. Graph Mining using Map Reduce
• Problem: Find the frequent sub-graphs of a graph database
using the MapReduce programming model (local support 2)
[Figure: Map distributes the graph dataset across the
distributed system; each node runs gSpan on its partition,
yielding local counts of 3 and 2 for a sub-graph; Reduce
combines them into a global count of 5.]
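• A sketch of the same pipeline for graphs, with illustrative thresholds.
Here run_gspan is only a stand-in for a real gSpan implementation (it
mines single-edge "sub-graphs", the degenerate case); the shape of the
Map and Reduce steps is the point.

    from collections import defaultdict

    def run_gspan(partition, min_support):
        # Stand-in for gSpan: a "sub-graph" is a single edge label, and
        # its support is the number of graphs containing it.
        counts = defaultdict(int)
        for graph in partition:            # graph = set of edge labels
            for edge in graph:
                counts[edge] += 1
        return {sg: c for sg, c in counts.items() if c >= min_support}

    def mine(partitions, local_support, global_support):
        grouped = defaultdict(list)
        for partition in partitions:       # Map: each node runs gSpan
            for sg, c in run_gspan(partition, local_support).items():
                grouped[sg].append(c)      # shuffle: group by sub-graph
        # Reduce: sum local counts, keep globally frequent sub-graphs.
        return {sg: sum(cs) for sg, cs in grouped.items()
                if sum(cs) >= global_support}

    # Two partitions, local support 2, as on the slide above.
    p1 = [{"a", "b"}, {"a"}, {"a", "c"}]   # 'a' appears in 3 graphs
    p2 = [{"a"}, {"a", "b"}, {"c"}]        # 'a' appears in 2 graphs
    print(mine([p1, p2], local_support=2, global_support=5))  # {'a': 5}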
9. Data Partitioning
• Performance and load balancing depend on the
mapping step
– Termed "partitioning"
– It decides which portion of the graph dataset goes
to which node
– Loss of data and load balancing depend directly on
the partitioning
• Two approaches (a rough sketch of the contrast
follows below)
– MRGP (MapReduce Partitioning)
– DGP (Density-Based Partitioning)
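• A rough sketch of what MRGP-style partitioning amounts to, under the
assumption that it simply cuts the dataset into consecutive chunks in
input order, ignoring graph characteristics (the function name is mine):

    def mrgp(graphs, n_partitions):
        # Chunk the dataset in input order; a partition may end up with
        # mostly dense (slow) or mostly sparse (fast) graphs by accident.
        size = -(-len(graphs) // n_partitions)  # ceiling division
        return [graphs[i:i + size] for i in range(0, len(graphs), size)]

    graphs = [f"G{i}" for i in range(1, 13)]
    print(mrgp(graphs, 4))  # [['G1','G2','G3'], ['G4','G5','G6'], ...]

DGP, by contrast, looks at the density of each graph first, as the DGP
slides below show.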
12. DGP cont.
• Let's say the bucket count for this demo is 2
• Next, we distribute the sorted list evenly across the two buckets.
Bucket 1: G1, G2, G3, G4, G5, G6, G7
Bucket 2: G8, G9, G10, G11, G12
• Divide each bucket into 4 non-empty sub-buckets, making
4 partitions in total.
13. DGP cont.
• Now take one sub-bucket from each bucket and form
the final partitions
Bucket 1: G1, G2, G3, G4, G5, G6, G7 (as above)
Bucket 2: G8, G9, G10, G11, G12
Final partitions:
– G1, G2, G3, G8
– G4, G5, G9, G10
– G6, G11
– G7, G12
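• A sketch of the DGP steps above, assuming the graphs can be sorted by a
density function and that buckets and sub-buckets are filled with
contiguous, near-equal splits (all names are mine; the demo on the slide
uses slightly different split sizes, so the groupings below do not match
it one-for-one):

    def chunk(items, n):
        # Split a list into n contiguous, non-empty, near-equal parts
        # (assumes len(items) >= n).
        size, extra = divmod(len(items), n)
        parts, start = [], 0
        for i in range(n):
            end = start + size + (1 if i < extra else 0)
            parts.append(items[start:end])
            start = end
        return parts

    def dgp(graphs, density, n_buckets, n_partitions):
        ordered = sorted(graphs, key=density)   # step 1: sort by density
        buckets = chunk(ordered, n_buckets)     # step 2: density ranges
        # Step 3: divide each bucket into n_partitions sub-buckets.
        subs = [chunk(b, n_partitions) for b in buckets]
        # Step 4: partition i takes the i-th sub-bucket of every bucket,
        # so each final partition mixes sparse and dense graphs.
        return [sum((s[i] for s in subs), []) for i in range(n_partitions)]

    graphs = [f"G{i}" for i in range(1, 13)]    # already density-sorted
    print(dgp(graphs, density=graphs.index, n_buckets=2, n_partitions=4))
    # [['G1','G2','G7','G8'], ['G3','G4','G9','G10'],
    #  ['G5','G11'], ['G6','G12']]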
14. Support Count
• There are two types of support counts to be
considered in distributed graph mining
– Global Support Count
– Local Support Count
• Global support means the same as in ordinary
(non-distributed) graph mining
• When each mapper runs its individual job, it
considers the local support count.
15. Local Support Count
• Each individual node holds only a partial graph
dataset.
• The support count therefore needs to be adjusted
relative to the original dataset.
• This adjusted support count is the local support
count.
• Local Support Count = (1 - Tolerance Rate) * Global
Support [the tolerance rate is between 0 and 1; a higher
tolerance lowers the local threshold, as in the worked
instance below]
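• A worked instance of the formula (the numbers are illustrative, and
the rounding choice is mine, since the slides leave it unspecified):

    import math

    def local_support(global_support, tolerance):
        # Local Support = (1 - Tolerance Rate) * Global Support, rounded
        # up so the threshold stays a whole number of graphs.
        assert 0 <= tolerance <= 1
        return math.ceil((1 - tolerance) * global_support)

    # A global support of 5 with tolerance 0.6 gives a local support
    # of 2, the value used in the gSpan example earlier.
    print(local_support(global_support=5, tolerance=0.6))  # 2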
16. Loss of Data
• Some frequent sub-graphs are lost
– e.g., a sub-graph whose occurrences are spread thinly
across the partitions can fall below the local threshold
on every node and is then never reported
• The loss can be mitigated by choosing an
optimal tolerance rate.
– Theoretically, a tolerance rate of 1 means there will
be no loss of data.
– But usually at the cost of a higher runtime.