3. Motivation
• The reason Big Data is here:
– To make it feasible to process data that is impossible
or overwhelming to process with our existing systems.
• Some graph databases may be too big for a
single machine
– Easier for a distributed system, which shares the load
• A graph database may itself be scattered around
the globe
– e.g., Google search records
4. Distributed Graph Mining
• Partition-based
• Divide the problem into independent sub-problems
– Each node of the system can process its
sub-problem independently
– Parallel processing
– Speeds up computation
– Enhances the scalability of solutions
6. Map Reduce
• A programming model for distributed
platforms.
• Proposed by Google
• Abundant open-source implementations
– Hadoop
• Divides the problem into sub-problems to be
processed on the nodes
– Map
• Combines the processing results
– Reduce
7. Map Reduce Example
• Problem: Find the frequency of a word in the documents
available on a system.
[Figure: three documents on the distributed system each
contain occurrences of "word". Map emits a <word, count>
pair per document: <word, 2>, <word, 1>, <word, 2>;
Reduce combines them: <word, 2 + 1 + 2 = 5>.]
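• A minimal Python sketch of this example (the document contents and all
names are illustrative, not from the slides): each Map call emits one
<word, count> pair per word in a document, and Reduce sums the partial
counts per word.

    from collections import defaultdict

    def map_phase(text):
        # Map: count the words of a single document, emit <word, count>.
        counts = defaultdict(int)
        for word in text.split():
            counts[word] += 1
        return counts.items()

    def reduce_phase(word, partial_counts):
        # Reduce: combine the partial counts emitted for one word.
        return word, sum(partial_counts)

    # Three "documents" spread over the distributed system.
    docs = ["word x word", "y word z", "word word y"]

    # Shuffle: group the Map outputs by key, as the framework would.
    grouped = defaultdict(list)
    for text in docs:
        for word, count in map_phase(text):
            grouped[word].append(count)

    for word, partial_counts in grouped.items():
        print(reduce_phase(word, partial_counts))  # ('word', 5), ...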
8. Graph Mining using Map Reduce
• Problem: Find the frequent sub-graphs of a graph database
using the MapReduce programming model (local support 2)
[Figure: Map distributes the graph dataset across the
distributed system; each node runs gSpan on its partition,
yielding local counts of 3 and 2 for a sub-graph; Reduce
combines them into a global count of 5.]
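• A sketch of the same pipeline for graphs, with illustrative thresholds.
Here run_gspan is only a stand-in for a real gSpan implementation (it
mines single-edge "sub-graphs", the degenerate case); the shape of the
Map and Reduce steps is the point.

    from collections import defaultdict

    def run_gspan(partition, min_support):
        # Stand-in for gSpan: a "sub-graph" is a single edge label, and
        # its support is the number of graphs containing it.
        counts = defaultdict(int)
        for graph in partition:            # graph = set of edge labels
            for edge in graph:
                counts[edge] += 1
        return {sg: c for sg, c in counts.items() if c >= min_support}

    def mine(partitions, local_support, global_support):
        grouped = defaultdict(list)
        for partition in partitions:       # Map: each node runs gSpan
            for sg, c in run_gspan(partition, local_support).items():
                grouped[sg].append(c)      # shuffle: group by sub-graph
        # Reduce: sum local counts, keep globally frequent sub-graphs.
        return {sg: sum(cs) for sg, cs in grouped.items()
                if sum(cs) >= global_support}

    # Two partitions, local support 2, as on the slide above.
    p1 = [{"a", "b"}, {"a"}, {"a", "c"}]   # 'a' appears in 3 graphs
    p2 = [{"a"}, {"a", "b"}, {"c"}]        # 'a' appears in 2 graphs
    print(mine([p1, p2], local_support=2, global_support=5))  # {'a': 5}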
9. Data Partitioning
• Performance and load balancing depend on the
mapping step
– Termed "partitioning"
– It decides which portion of the graph dataset goes
to which node
– Loss of data and load balancing depend directly on
the partitioning
• Two approaches (a rough sketch of the contrast
follows below)
– MRGP (MapReduce Partitioning)
– DGP (Density-Based Partitioning)
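• A rough sketch of what MRGP-style partitioning amounts to, under the
assumption that it simply cuts the dataset into consecutive chunks in
input order, ignoring graph characteristics (the function name is mine):

    def mrgp(graphs, n_partitions):
        # Chunk the dataset in input order; a partition may end up with
        # mostly dense (slow) or mostly sparse (fast) graphs by accident.
        size = -(-len(graphs) // n_partitions)  # ceiling division
        return [graphs[i:i + size] for i in range(0, len(graphs), size)]

    graphs = [f"G{i}" for i in range(1, 13)]
    print(mrgp(graphs, 4))  # [['G1','G2','G3'], ['G4','G5','G6'], ...]

DGP, by contrast, looks at the density of each graph first, as the DGP
slides below show.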
12. DGP cont.
• Let's say the bucket count for this demo is 2
• Next, we distribute the sorted list evenly across the two buckets.
Bucket 1: G1, G2, G3, G4, G5, G6, G7
Bucket 2: G8, G9, G10, G11, G12
• Divide each bucket into 4 non-empty sub-buckets, making
4 partitions in total.
13. DGP cont.
• Now take one sub-bucket from each bucket and form
the final partitions
Bucket 1: G1, G2, G3, G4, G5, G6, G7 (as above)
Bucket 2: G8, G9, G10, G11, G12
Final partitions:
– G1, G2, G3, G8
– G4, G5, G9, G10
– G6, G11
– G7, G12
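• A sketch of the DGP steps above, assuming the graphs can be sorted by a
density function and that buckets and sub-buckets are filled with
contiguous, near-equal splits (all names are mine; the demo on the slide
uses slightly different split sizes, so the groupings below do not match
it one-for-one):

    def chunk(items, n):
        # Split a list into n contiguous, non-empty, near-equal parts
        # (assumes len(items) >= n).
        size, extra = divmod(len(items), n)
        parts, start = [], 0
        for i in range(n):
            end = start + size + (1 if i < extra else 0)
            parts.append(items[start:end])
            start = end
        return parts

    def dgp(graphs, density, n_buckets, n_partitions):
        ordered = sorted(graphs, key=density)   # step 1: sort by density
        buckets = chunk(ordered, n_buckets)     # step 2: density ranges
        # Step 3: divide each bucket into n_partitions sub-buckets.
        subs = [chunk(b, n_partitions) for b in buckets]
        # Step 4: partition i takes the i-th sub-bucket of every bucket,
        # so each final partition mixes sparse and dense graphs.
        return [sum((s[i] for s in subs), []) for i in range(n_partitions)]

    graphs = [f"G{i}" for i in range(1, 13)]    # already density-sorted
    print(dgp(graphs, density=graphs.index, n_buckets=2, n_partitions=4))
    # [['G1','G2','G7','G8'], ['G3','G4','G9','G10'],
    #  ['G5','G11'], ['G6','G12']]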
14. Support Count
• There are two types of support counts to be
considered in distributed graph mining
– Global Support Count
– Local Support Count
• Global support means the same as in ordinary
(non-distributed) graph mining
• When each mapper runs its individual job, it
considers the local support count.
15. Local Support Count
• Each individual node holds only a partial graph
dataset.
• The support count therefore needs to be adjusted
relative to the original dataset.
• This adjusted support count is the local support
count.
• Local Support Count = (1 - Tolerance Rate) * Global
Support [the tolerance rate is between 0 and 1; a higher
tolerance lowers the local threshold, as in the worked
instance below]
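• A worked instance of the formula (the numbers are illustrative, and
the rounding choice is mine, since the slides leave it unspecified):

    import math

    def local_support(global_support, tolerance):
        # Local Support = (1 - Tolerance Rate) * Global Support, rounded
        # up so the threshold stays a whole number of graphs.
        assert 0 <= tolerance <= 1
        return math.ceil((1 - tolerance) * global_support)

    # A global support of 5 with tolerance 0.6 gives a local support
    # of 2, the value used in the gSpan example earlier.
    print(local_support(global_support=5, tolerance=0.6))  # 2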
16. Loss of Data
• Some frequent sub-graphs are lost
– e.g., a sub-graph whose occurrences are spread thinly
across the partitions can fall below the local threshold
on every node and is then never reported
• The loss can be mitigated by choosing an
optimal tolerance rate.
– Theoretically, a tolerance rate of 1 means there will
be no loss of data.
– But usually at the cost of a higher runtime.