Distributed Graph Mining 
Presented By 
Sayeed Mahmud
Motivation 
• The reason Big Data is here 
– To make it feasible to process data that is impossible or overwhelming to handle with existing systems. 
• Some graph databases might be too big for a single machine 
– Easier for a distributed system, which shares the load 
• A graph database may itself be scattered around the globe 
– e.g. Google search records.
Distributed Graph Mining 
• Partition based 
• Divide the problem into independent sub-problems 
– Each node of the system can process its sub-problem independently 
– Parallel processing 
– Speeds up computation 
– Enhances scalability of solutions
Techniques 
• MRPF 
• MapReduce 
– We are mainly interested in this
Map Reduce 
• A programming model for distributed 
platforms. 
• Proposed by Google 
• Abundant open source implementations 
– Hadoop 
• Divides the problem into sub-problems to be processed on nodes 
– Map 
• Combines the processing results 
– Reduce
Map Reduce Example 
• Problem: Find frequency of a word in documents available on 
a system. 
[Figure: three documents containing "word" 2, 1 and 2 times. Map runs on the distributed system and emits one <word, count> pair per document: <word, 2>, <word, 1>, <word, 2>. Reduce sums them: <word, 2 + 1 + 2 = 5>.]
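The word-frequency example can be sketched in a few lines of Python. This is a single-process illustration of the Map and Reduce steps, not Hadoop code. The function names and the three sample documents are assumptions made for the sketch, chosen so that the per-document counts are 2, 1 and 2 as in the figure.

```python
from collections import Counter


def map_document(text):
    """Map step: count every word in one document."""
    return Counter(text.lower().split())


def reduce_counts(partial_counts):
    """Reduce step: sum the per-document counts for each word."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)  # Counter.update adds counts together
    return total


# Hypothetical documents: "word" occurs 2, 1 and 2 times.
documents = [
    "a word and another word",
    "just one word",
    "word after word",
]

partials = [map_document(doc) for doc in documents]  # would run on separate nodes
totals = reduce_counts(partials)
print(totals["word"])  # 2 + 1 + 2 = 5
```

In a real cluster each `map_document` call would run on the node that holds the document, and only the small per-document counts would travel to the reducer.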
Graph Mining using Map Reduce 
• Problem: Find frequent sub-graphs of a graph database in a 
MapReduce programming model (Local Support 2) 
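[Figure: Map splits the graph dataset across the distributed system; each node runs gSpan on its part; the local results (3 and 2 in the figure) are combined by Reduce (5).]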
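A minimal sketch of the Reduce side of this scheme, assuming each mapper has already run a miner such as gSpan on its partition and reports a count per locally frequent sub-graph. The mining itself is not shown. The string keys standing in for sub-graphs, the function name `reduce_support`, and the global support of 5 are all hypothetical.

```python
from collections import defaultdict


def reduce_support(mapper_outputs, global_support):
    """Sum the local counts of each sub-graph and keep the globally frequent ones."""
    totals = defaultdict(int)
    for local_counts in mapper_outputs:
        for pattern, count in local_counts.items():
            totals[pattern] += count
    return {p: c for p, c in totals.items() if c >= global_support}


# Hypothetical mapper results (local support 2): one pattern is seen
# 3 times on the first node and 2 times on the second.
mapper_outputs = [
    {"A-B": 3, "A-B-C": 2},
    {"A-B": 2},
]

frequent = reduce_support(mapper_outputs, global_support=5)
print(frequent)  # {'A-B': 5}
```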
Data Partitioning 
• Performance and load balancing depend on the Map step 
– Termed “Partitioning” 
– Decides which portion of the graph dataset goes to which node 
– Loss of data and load balancing depend directly on partitioning. 
• Two approaches 
– MRGP (Map Reduced Partitioning) 
– DGP (Density Based Partitioning)
MRGP 
• Followed in common Map Reduce problems. 
• Assigned sequentially 
• Simple 
Graph   Size (KB)   Density 
G1      1           0.25 
G2      2           0.5 
G3      2           0.6 
G4      1           0.25 
G5      2           0.5 
G6      2           0.5 
G7      2           0.5 
G8      2           0.6 
G9      2           0.6 
G10     2           0.7 
G11     3           0.7 
G12     3           0.8 
4 partitions, 6 KB each: 
Partition 1: G1, G2, G3, G4 
Partition 2: G5, G6, G7 
Partition 3: G8, G9, G10 
Partition 4: G11, G12
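The sequential assignment can be sketched as follows. The slide only says that graphs are assigned sequentially into 6 KB partitions; the rule used here (start a new partition when the next graph would exceed the capacity) and the name `mrgp_partition` are assumptions.

```python
def mrgp_partition(sizes_kb, capacity_kb):
    """Assign graphs to partitions in input order, opening a new
    partition when the next graph would not fit (assumed rule)."""
    partitions, current, used = [], [], 0
    for graph, size in sizes_kb.items():
        if current and used + size > capacity_kb:
            partitions.append(current)
            current, used = [], 0
        current.append(graph)
        used += size
    if current:
        partitions.append(current)
    return partitions


# Sizes in KB from the table above.
sizes_kb = {
    "G1": 1, "G2": 2, "G3": 2, "G4": 1, "G5": 2, "G6": 2,
    "G7": 2, "G8": 2, "G9": 2, "G10": 2, "G11": 3, "G12": 3,
}

mrgp = mrgp_partition(sizes_kb, capacity_kb=6)
for part in mrgp:
    print(part)
```

With the table's sizes this yields the four partitions listed above. Density plays no role, so the dense graphs G11 and G12 end up together on one node.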
DGP 
• Goes for a balanced distribution 
• Uses intermediary Bucket 
• First graphs are sorted according to densities. 
Graph   Size (KB)   Density 
G1      1           0.25 
G2      2           0.5 
G3      2           0.6 
G4      1           0.25 
G5      2           0.5 
G6      2           0.5 
G7      2           0.5 
G8      2           0.6 
G9      2           0.6 
G10     2           0.7 
G11     3           0.7 
G12     3           0.8 
Sorted by density: G1 (0.25), G4 (0.25), G2 (0.5), G5 (0.5), G6 (0.5), G7 (0.5), G3 (0.6), G8 (0.6), G9 (0.6), G10 (0.7), G11 (0.7), G12 (0.8)
DGP cont.. 
• Let's say the bucket count for this demo is 2 
• Next we distribute the sorted list equally between the two buckets. 
Bucket 1: G1, G4, G2, G5, G6, G7 
Bucket 2: G3, G8, G9, G10, G11, G12 
Make 4 partitions in total: divide each bucket into 4 non-empty sub-buckets
DGP Cont.. 
• Now take one sub-bucket from each bucket to form each final partition 
Partition 1: G1, G2, G3, G8 
Partition 2: G4, G5, G9, G10 
Partition 3: G6, G11 
Partition 4: G7, G12
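The DGP steps (sort by density, fill the buckets, split each bucket into sub-buckets, combine one sub-bucket per bucket) can be sketched as below. The slides do not say how a bucket is split into sub-buckets; this sketch assumes contiguous chunks of near-equal length, so the helper `split_evenly` and the exact grouping it produces are assumptions and need not match the slide graph for graph.

```python
def split_evenly(items, k):
    """Split items into k contiguous chunks whose lengths differ by at most one."""
    base, extra = divmod(len(items), k)
    chunks, start = [], 0
    for i in range(k):
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def dgp_partition(densities, bucket_count, partition_count):
    """Density-based partitioning sketch: each final partition takes
    one sub-bucket from every density bucket."""
    ordered = sorted(densities, key=densities.get)  # stable sort by density
    buckets = split_evenly(ordered, bucket_count)
    sub_buckets = [split_evenly(b, partition_count) for b in buckets]
    return [
        [g for subs in sub_buckets for g in subs[i]]
        for i in range(partition_count)
    ]


# Densities from the DGP table.
densities = {
    "G1": 0.25, "G2": 0.5, "G3": 0.6, "G4": 0.25, "G5": 0.5, "G6": 0.5,
    "G7": 0.5, "G8": 0.6, "G9": 0.6, "G10": 0.7, "G11": 0.7, "G12": 0.8,
}

dgp = dgp_partition(densities, bucket_count=2, partition_count=4)
for part in dgp:
    print(part)
```

Every partition receives graphs from both the low-density and the high-density bucket, which is the balancing effect DGP aims for.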
Support Count 
• There are two types of support counts to be 
considered in distributed graph mining 
– Global Support Count 
– Local Support Count 
• Global support is the same as in normal graph mining 
• When each mapper runs its individual job, it uses the local support count.
Local Support Count 
• Each individual node has only part of the graph dataset. 
• The support count needs to be adjusted relative to the original dataset. 
• This adjusted support count is the Local Support Count. 
• Local Support Count = Tolerance Rate * Global Support [Tolerance rate is between 0 and 1]
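As a small worked example, the formula can be written out directly. The sketch takes the slide's formula at face value; the function name and the sample values (tolerance rate 0.5, global support 30) are hypothetical.

```python
def local_support(tolerance_rate, global_support):
    """Local Support Count = Tolerance Rate * Global Support."""
    if not 0 <= tolerance_rate <= 1:
        raise ValueError("tolerance rate must be between 0 and 1")
    return tolerance_rate * global_support


print(local_support(0.5, 30))  # 15.0
```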
Loss of Data 
• Some frequent sub-graphs are lost 
• The loss can be mitigated by choosing an optimal tolerance rate. 
– Theoretically, a tolerance rate of 1 means there will be no loss of data. 
– But this usually means a higher run time.
Experiment Environment 
• Language : Perl 
• MapReduce Framework : Hadoop (0.20.1) 
• Cluster Size : 5 
• Node Specification: 
– Processor AMD Opteron Quad Core 2.4 GHz 
– 4GB Main memory
Data Sets 
• Synthetic (Size Ranging from 18MB to 69GB) 
• Real 
– Chemical Compound Dataset from National 
Cancer Institute.
Loss Rate for gSpan (Support 30%)
Loss Rate for Gaston and FSG (Support 30%)
Runtime
Thank You
