Mining Quasi-Bicliques with Giraph
2013.07.02
Hsiao-Fei Liu
Sr. Engineer, CoreTech, Trend Micro Inc.
Chung-Tsai Su
Sr. Manager, CoreTech, Trend Micro Inc.
An-Chiang Chu
PostDoc, CSIE, National Taiwan University
Outline
• Preliminary
• Introduction to Giraph
• Giraph vs. chained MapReduce
• Problem
• Algorithm
• MapReduce Algorithm
• Giraph Algorithm
• Experiment results
• Conclusions
Outline
• Preliminary
• Introduction to Giraph
• Giraph vs. chained MapReduce
• Problem
• Algorithm
• MapReduce Algorithm
• Giraph Algorithm
• Experiment results
• Conclusions
What's Apache Giraph
• Distributed graph-processing system
• Based on Google's 2010 paper "Pregel"
• Efficient iterative processing of large sparse graphs
• A variation of the Bulk Synchronous Parallel (BSP) model
• Prominent user
• Facebook
• Who is contributing to Giraph?
• Facebook, Yahoo!, Twitter, LinkedIn and Trend Micro
BSP VARIANT (1/3)
• Input: a directed graph where each vertex contains
1. a state (active or inactive)
2. a value of a user-defined type
3. a set of out-edges, each of which can also have an associated value
4. a user program compute(.), which is the same for all vertices and is allowed to perform the following tasks:
• read/write its own vertex and edge values
• do local computation
• send messages
• mutate the topology
• vote to halt, i.e., change the state to inactive
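To make the vertex model concrete, here is a minimal Python sketch of it. This is a toy, single-process rendering for illustration only; Giraph's real API is Java, and all names here (Vertex, send_message, vote_to_halt) paraphrase the concepts above rather than any actual signatures.

  class Vertex:
      def __init__(self, vid, value, out_edges):
          self.id = vid
          self.value = value              # user-defined value
          self.out_edges = out_edges      # list of (target_id, edge_value)
          self.active = True              # state: active or inactive
          self.outbox = []                # messages produced this superstep

      def send_message(self, target, msg):
          self.outbox.append((target, msg))

      def vote_to_halt(self):
          self.active = False             # re-activated if a message arrives

      def compute(self, superstep, messages):
          raise NotImplementedError       # user-supplied, same for all vertices

  class MaxValueVertex(Vertex):
      # Classic example: flood the maximum vertex value through the graph.
      def compute(self, superstep, messages):
          old = self.value
          self.value = max([self.value] + list(messages))
          if superstep == 0 or self.value > old:
              for target, _ in self.out_edges:
                  self.send_message(target, self.value)
          else:
              self.vote_to_halt()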
BSP VARIANT (2/3)
• Execution: a sequence of supersteps
• In each superstep, each vertex runs its compute(.) function in parallel, with received messages as input
• messages sent to a vertex are processed in the next superstep
• topology mutations become effective in the next superstep, with conflicts resolved as follows:
1. removals take precedence over additions
2. a user-defined handler is applied to multiple requests to add the same vertex/edge with different initial values
• Barrier synchronization
• A mechanism for ensuring all computations are done and all
messages are delivered before starting the next superstep
BSP VARIANT (3/3)
• Termination criteria: all vertices become inactive and
no messages are en route.
[State diagram: a vertex moves from Active to Inactive by voting to halt, and from Inactive back to Active when messages are received]
Giraph Architecture Overview
[Architecture diagram]
• Partition rule: part_i = { v | hash(v) % N = i }
• Partition assignment: part_0 -> worker_0, part_1 -> worker_0, part_2 -> worker_1, …
• Flow: (1) the master creates the partition rule and assignment, (2) the workers copy them and respond, (3) each worker loads its data (file splits split_0, split_1, …) from HDFS, and the master starts the superstep
How Giraph Works -- Initialization
1. The user decides the partition rule for vertices
• default partition rule: part_i = { v | hash(v) mod N = i }, where N is the number of partitions
2. The master computes the partition-to-worker assignment and sends it to all workers
3. The master instructs each worker to load a split of the graph data from HDFS
• if a worker happens to load the data of a vertex it owns, it keeps it
• otherwise, it sends messages to the vertex's owner so the vertex is created at the beginning of the first superstep
4. After loading the split and delivering the messages, a worker responds to the master
5. The master starts the first superstep after all workers respond
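The default rule fits in a couple of lines of Python (illustrative only; Giraph itself implements this in Java, and the function names here are made up):

  def owner_partition(vertex_id, num_partitions):
      # part_i = { v | hash(v) mod N = i }
      return hash(vertex_id) % num_partitions

  def owner_worker(vertex_id, num_partitions, assignment):
      # assignment: dict part_index -> worker_id, computed by the master
      return assignment[owner_partition(vertex_id, num_partitions)]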
How Giraph Works -- Superstep
1. The master instructs the workers to start the superstep
2. Each worker executes compute(.) for all of its vertices
• One thread per partition
3. After finishing all computations and message deliveries, each worker responds to the master with
• the number of active vertices under it, and
• the number of messages delivered
How Giraph Works -- Synchronization
1. The master waits until all workers respond
2. If all vertices have become inactive and no messages were delivered, stop
3. Else, start the next superstep.
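Putting the superstep loop, the barrier and this termination test together, a toy master loop might look like the following Python sketch (continuing the Vertex sketch earlier; real Giraph distributes this across workers, so this is a single-process stand-in):

  def run_bsp(vertices):
      # vertices: dict vid -> Vertex (see the earlier sketch)
      inbox = {vid: [] for vid in vertices}
      superstep = 0
      while True:
          for v in vertices.values():
              if v.active or inbox[v.id]:      # messages re-activate a vertex
                  v.active = True
                  v.compute(superstep, inbox[v.id])
          # barrier: collect and deliver all messages before the next superstep
          next_inbox = {vid: [] for vid in vertices}
          delivered = 0
          for v in vertices.values():
              for target, msg in v.outbox:
                  next_inbox[target].append(msg)
                  delivered += 1
              v.outbox = []
          inbox = next_inbox
          superstep += 1
          # stop when all vertices are inactive and no messages are en route
          if delivered == 0 and all(not v.active for v in vertices.values()):
              return vertices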
Outline
• Preliminary
• Introduction to Giraph
• Giraph vs. chained MapReduce
• Problem
• Algorithm
• MapReduce Algorithm
• Giraph Algorithm
• Experiment results
• Conclusions
Giraph vs chained MapReduce
• Pros
• No need to load/shuffle/store the entire graph in each iteration
• The vertex-centric programming model is a more intuitive way to think about graphs
• Cons
• Requires the whole input graph to be loaded into memory
• Memory has to be larger than the input
• Messages are stored in memory
• Communication costs must be controlled to avoid out-of-memory errors
Outline
• Preliminary
• Introduction to Giraph
• Giraph vs. chained MapReduce
• Problem
• Algorithm
• MapReduce Algorithm
• Giraph Algorithm
• Experiment results
• Conclusions
Quasi-Biclique Mining (1/4)
• A biclique in a bipartite graph is a set of nodes sharing the same neighbors
• Informally, a quasi-biclique in a bipartite graph is a set of nodes sharing similar neighbors
• E.g. [Figure: a bipartite domain-IP graph in which the domains shop.ebay.de, video.ebay.au and fahrzeugteile.ebay.ca resolve to similar sets of IPs (66.135.205.141, 66.135.213.211, 66.135.213.215, 66.211.160.11, 66.135.202.89, 66.211.180.27), forming a quasi-biclique]
Quasi-Biclique Mining (2/4)
• E.g. C&C detection
• Given a bipartite website-client graph and a website reported to be a command and control (C&C) server.
• Hackers typically set up multiple C&C servers for high availability, and these C&C servers usually share the same bots.
• Thus, finding websites that share similar clients with the reported C&C server can help identify the remaining C&C servers.
Quasi-Biclique Mining (3/4)
• Given a bipartite graph and a threshold µ, the quasi-biclique for a node v is the set of nodes connected to at least a µ fraction of v's neighbors
• E.g. Let µ = 2/3. quasi-biclique(D2) = {D1, D2}.
[Figure: an example bipartite graph G with domains D1–D4 on the left and IP1–IP5 on the right]
Quasi-Biclique Mining (4/4)
• Given a bipartite graph G = (X, Y, E) and a threshold 0 < µ ≤ 1.
• Suppose that the objects we are interested in are represented by the nodes in X, and their associated features are represented by the nodes in Y.
• The problem is to find the quasi-bicliques of all vertices in X.
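As a sanity check on the definition, here is a direct sequential version in Python (adjacency sets as plain dicts; the example edges are invented for illustration and are not the slide's exact figure):

  def quasi_biclique(adj_x, adj_y, v, mu):
      # adj_x: x in X -> set of neighbors in Y
      # adj_y: y in Y -> set of neighbors in X
      threshold = mu * len(adj_x[v])
      counts = {}
      for y in adj_x[v]:
          for xp in adj_y[y]:
              counts[xp] = counts.get(xp, 0) + 1
      # keep nodes adjacent to at least a mu fraction of v's neighbors
      return {xp for xp, c in counts.items() if c >= threshold}

  adj_x = {"D1": {"IP1", "IP2"}, "D2": {"IP1", "IP2", "IP3"}, "D3": {"IP3"}}
  adj_y = {"IP1": {"D1", "D2"}, "IP2": {"D1", "D2"}, "IP3": {"D2", "D3"}}
  print(quasi_biclique(adj_x, adj_y, "D2", 2/3))   # {'D1', 'D2'}

Here D1 is included because it connects to 2 of D2's 3 neighbors, while D3 connects to only 1 and is excluded.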
Outline
• Preliminary
• Introduction to Giraph
• Giraph vs. chained MapReduce
• Problem
• Algorithm
• MapReduce Algorithm
• Giraph Algorithm
• Experiment results
• Conclusions
MapReduce Algorithm: Mapper
Map(keyIn, valueIn):
  /***
  keyIn: a vertex y on the right side
  valueIn: neighbors of y
  ***/
  for x in valueIn:
    keyOut := x
    valueOut := valueIn
    output (keyOut, valueOut)
E.g., with y1 adjacent to {x1, x2, x3} and y2 adjacent to {x3, x4}:
Map(y1, [x1, x2, x3]) emits (x1, [x1, x2, x3]), (x2, [x1, x2, x3]), (x3, [x1, x2, x3])
Map(y2, [x3, x4]) emits (x3, [x3, x4]), (x4, [x3, x4])
MapReduce Algorithm: Reducer
Reduce(keyIn, valueIn):
  /***
  keyIn: a vertex x on the left side
  valueIn: [ neighbors(y) | y ∈ neighbors(keyIn) ]
  ***/
  for neighbors(y) in valueIn:
    for x' in neighbors(y):
      COUNTER[x'] += 1
      if COUNTER[x'] >= µ * |valueIn|:
        add x' to Q_BICLIQUE[keyIn]
E.g., µ = 2/3, with y1 and y2 adjacent to {x1, x2} and y3 adjacent to {x1, x3}:
Map(y1, [x1, x2]) emits (x1, [x1, x2]), (x2, [x1, x2])
Map(y2, [x1, x2]) emits (x1, [x1, x2]), (x2, [x1, x2])
Map(y3, [x1, x3]) emits (x1, [x1, x3]), (x3, [x1, x3])
Reduce(x1, [[x1, x2], [x1, x2], [x1, x3]]) -> Q_BICLIQUE[x1] = {x1, x2}
Reduce(x2, [[x1, x2], [x1, x2]]) -> Q_BICLIQUE[x2] = {x1, x2}
Reduce(x3, [[x1, x3]]) -> Q_BICLIQUE[x3] = {x1, x3}
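The whole job can be simulated locally in a few lines of Python, which is a handy way to check the pseudocode above (a dict of lists stands in for Hadoop's shuffle; function names are illustrative, and µ and the toy graph come from the example):

  from collections import defaultdict

  MU = 2.0 / 3.0

  def map_fn(y, neighbors):
      for x in neighbors:                 # emit y's adjacency list to each neighbor
          yield x, neighbors

  def reduce_fn(x, neighbor_lists):
      counter = defaultdict(int)
      for lst in neighbor_lists:
          for xp in lst:
              counter[xp] += 1
      threshold = MU * len(neighbor_lists)
      return {xp for xp, c in counter.items() if c >= threshold}

  right_side = {"y1": ["x1", "x2"], "y2": ["x1", "x2"], "y3": ["x1", "x3"]}
  shuffled = defaultdict(list)            # emulate the shuffle phase
  for y, nbrs in right_side.items():
      for k, v in map_fn(y, nbrs):
          shuffled[k].append(v)

  for x in sorted(shuffled):
      print(x, reduce_fn(x, shuffled[x]))  # x1 -> {x1, x2}; x2 -> {x1, x2}; x3 -> {x1, x3}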
MapReduce Algorithm: Bottleneck
• Experimental results on one hour of web browsing logs
• Input graph size = 180 MB
• Map output bytes = 36 GB
• Map outputs are too large to shuffle efficiently!!
• The same information is copied and sent multiple times
Outline
• Preliminary
• Introduction to Giraph
• Giraph vs. chained MapReduce
• Problem
• Algorithm
• MapReduce Algorithm
• Giraph Algorithm
• Experiment results
• Conclusions
Idea for improvement
1. Partition the graph in advance into small groups composed of highly correlated nodes
• Improves data locality
• Reduces unnecessary communication cost and disk I/O
2. Utilize Giraph for efficient graph partitioning
Giraph Algorithm Overview
• Three phases:
1. Partitioning (Giraph):
• An iterative algorithm that divides the graph into smaller partitions
• The partitioning algorithm is designed to produce good-enough partitions without incurring too much communication overhead
2. Augmenting (MapReduce):
• Extend each partition with its adjacent inter-partition edges
3. Computing (MapReduce):
• Compute quasi-bicliques of augmented partitions in parallel
Partitioning
• Iteration 1: each vertex on the left sets its group ID to the hash of its neighbor list
[Figure: the domains get value = H1 = hash(IP1, IP3, IP4), H2 = hash(IP1, IP2, IP4), H3 = hash(IP2, IP3, IP4, IP7), H4 = hash(IP5, IP6, IP7) and H5 = hash(IP6, IP7); all IP vertices still have value = NULL]
Partitioning
• Iteration 2: each vertex on the right side sets its group ID to the group ID of its highest-degree neighbor
[Figure: the domains keep H1–H5; the IPs take on the values H1, H3, H3, H3, H3, H4, H4]
Partitioning
• Iteration 3: each vertex on the left side changes its group ID to the majority group ID among its neighbors, if any
[Figure: three of the five domains change their values, leaving the domains with H3, H3, H3, H4, H4; the IPs keep H1, H3, H3, H3, H3, H4, H4]
Partitioning
• Iteration 4: each vertex on the right side changes its group ID to the majority group ID among its neighbors, if any
[Figure: the IP with value H1 changes to H3, leaving the domains with H3, H3, H3, H4, H4 and the IPs with H3, H3, H3, H3, H3, H4, H4]
Partitioning
• Iteration 5: no vertex changes its group ID anymore, so the convergence criterion is met
[Figure: the converged grouping — one partition (Partition1) holds the vertices labeled H3, the other (Partition2) the vertices labeled H4]
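Condensed into sequential Python, the whole partitioning phase looks roughly like this (in Giraph it runs as a vertex program exchanging messages each superstep; tie-breaking in the majority vote and the exact termination test are assumptions of this sketch — the authors prove convergence for the targeted graph family):

  def partition(left, right):
      # left: domain -> set of IPs; right: IP -> set of domains (both non-empty)
      group = {}
      for d, nbrs in left.items():                      # iteration 1: hash adjacency lists
          group[d] = hash(tuple(sorted(nbrs)))
      for ip, nbrs in right.items():                    # iteration 2: copy the group of the
          group[ip] = group[max(nbrs, key=lambda d: len(left[d]))]  # highest-degree neighbor
      sides, i, changed = [left, right], 0, True
      while changed:                                    # iterations 3+: majority rule,
          changed = False                               # alternating sides
          for v, nbrs in sides[i % 2].items():
              votes = {}
              for n in nbrs:
                  votes[group[n]] = votes.get(group[n], 0) + 1
              g, c = max(votes.items(), key=lambda kv: kv[1])
              if c > len(nbrs) / 2 and g != group[v]:   # a strict majority differs from mine
                  group[v] = g
                  changed = True
          i += 1
      return group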
Augmenting
[Figure: a partition containing D2, D3, D5 and IP2–IP5, and the corresponding augmented partition extended with its adjacent inter-partition edges toward D1, D4, IP1 and IP6]
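A sketch of what augmenting amounts to, under one plausible reading of the figure: a partition consists of the left vertices in a group plus their neighbors, and it is extended with every edge of every right vertex it touches, so each augmented partition is self-contained (one MapReduce pass in practice; plain Python here, building on the partition() sketch above):

  def augment(left, right, group):
      # left: d -> set of IPs; right: ip -> set of Ds; group: vertex -> group ID
      parts = {}
      for d, ips in left.items():
          edges = parts.setdefault(group[d], set())
          for ip in ips:
              edges.add((d, ip))
              for d2 in right[ip]:          # adjacent inter-partition edges: every
                  edges.add((d2, ip))       # edge of every IP the partition touches
      return parts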
Computing
[Figure: eight augmented partitions assigned pairwise to four reducers (Reducer1 … Reducer4)]
• Compute quasi-bicliques for the augmented partitions
• Assign each augmented partition to a reducer
• Each reducer runs a sequential algorithm to compute quasi-bicliques for the augmented partitions assigned to it.
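Continuing the sketches above, the computing phase then runs independently per augmented partition; each reducer would execute a loop like this for the partitions it is assigned (group, augment and quasi_biclique come from the earlier sketches, so this is illustrative glue, not the paper's exact algorithm):

  def compute_phase(parts, group, mu):
      # parts: group ID -> set of (d, ip) edges from the augmenting phase
      results = {}
      for g, edges in parts.items():       # in practice, one reducer per partition
          adj_x, adj_y = {}, {}
          for d, ip in edges:
              adj_x.setdefault(d, set()).add(ip)
              adj_y.setdefault(ip, set()).add(d)
          for d in adj_x:
              if group[d] == g:            # border vertices are support data only
                  results[d] = quasi_biclique(adj_x, adj_y, d, mu)
      return results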
Outline
• Preliminary
• Introduction to Giraph
• Giraph vs. chained MapReduce
• Problem
• Algorithm
• MapReduce Algorithm
• Giraph Algorithm
• Experiment results
• Conclusions
Performance testing
• Setting
• The input graph is constructed by parsing one hour of web browsing logs
• 3.4 M vertices (1.3 M domains + 2.1 M IPs) and 2.5 M edges
• 60 MB in size
• Giraph: 15 workers (or mappers)
• MapReduce: 15 mappers and 15 reducers
• Result
• Our approach reduces CPU time by 80% and I/O load by 95%, compared with the MapReduce algorithm
• The communication cost incurred by graph partitioning is only 720 MB, and the partitioning finishes in only 1 minute
Outline
• Preliminary
• Introduction to Giraph
• Giraph vs. chained MapReduce
• Problem
• Algorithm
• MapReduce Algorithm
• Giraph Algorithm
• Experiment results
• Conclusions
Lessons learned
• Giraph is great for implementing iterative algorithms because it incurs no unnecessary I/O between iterations
• Use cases: belief propagation, PageRank, random walks, connected components, shortest paths, etc.
• Giraph requires the whole input graph to be loaded into memory
• Proper graph partitioning in advance can significantly improve the performance of subsequent graph-mining tasks
• A general graph-partitioning algorithm is hard to design because we usually don't know which nodes should belong to the same group
Future work
• Incremental Graph Mining
• Observe the communication patterns of past incremental mining tasks
• Partition the graph such that nodes that communicate often with each other are in the same group
• Subsequent incremental mining tasks will then have lower communication costs
• This is especially useful when incremental algorithms are hard to design and the easiest way is to periodically re-compute the result from scratch.
References
• Pregel: A System for Large-Scale Graph Processing. http://kowshik.github.com/JPregel/pregel_paper.pdf
• Apache Giraph. http://giraph.apache.org/
• GraphLab: A Distributed Framework for Machine Learning in the Cloud. http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf
• Kineograph: Taking the Pulse of a Fast-Changing and Connected World. http://research.microsoft.com/apps/pubs/default.aspx?id=163832
Thank You!
Editor's Notes
  1. Good morning ladies and gentlemen. Thank you for attending my presentation. My Chinese name is 劉效飛, or you can call me Ken. Today I will share with you my first research project at Trend Micro. The topic is mining quasi-bicliques with Giraph. This is joint work with my mentor Dr. Chung-Tsai Su and my friend Dr. An-Chiang Chu. The result is included in the industry track of IEEE BigData Congress 2013. I know my English is not very fluent, so please interrupt me if you don't understand what I am saying.
  2. Let us get back to the main point. My presentation is in 5 parts. First I'll introduce the programming model of Giraph and discuss the pros and cons of Giraph versus chained MapReduce. Next, I'll give you a formal problem definition and algorithms for mining quasi-bicliques. We shall study the bottleneck of the existing MapReduce algorithm and propose a new algorithm based on Giraph. The performance improvement is proved through experiments on real data. Finally, I will conclude with a summary and future work.
  3. Let’s begin with Giraph.
  4. Apache Giraph is a distributed graph processing system following the design of Google's Pregel paper. It's designed to enable efficient iterative processing of large sparse graphs, and it adopts a variation of the BSP model. The most prominent user is Facebook, which recently announced it has used Giraph in some of its production applications. In addition to Facebook, Yahoo!, Twitter and LinkedIn, and most importantly, some of my colleagues in the US are important contributors to Giraph. That's why I have to use Giraph, not other frameworks like GraphLab.
  5. Let's move on to the programming model. The input to Giraph is a directed graph where each vertex contains four members. The first is the state of the vertex, either active or inactive. The second is a modifiable value of a user-defined type. The third is a set of out-edges. Each out-edge can also have an associated value. The fourth is a user-defined program, which is allowed to do local computation, read/write its own values, send messages, mutate the topology and vote to halt. What it cannot do is modify the values of other vertices and their edges.
  6. The execution is composed of a sequence of supersteps. In each superstep each active vertex will run the user program in parallel with received messages as input. A barrier synchronization mechanism is implemented to make sure all computations and message deliveries are done before starting the next superstep.Note that the messages sent to a vertex in the present superstep are to be received in the next superstep. Similarly, the topology mutations are not effective until the beginning of the next superstep.
  7. So when will a Giraph program stop execution? Let's first look at the state diagram of a single vertex. A vertex becomes inactive after voting to halt and is re-activated after receiving new messages. The Giraph program terminates when all vertices become inactive and no messages are en route.
  8. This diagram shows the basic architecture of Giraph. The input graph is partitioned by a user-defined partition rule; the default is according to the hash values of vertices. A distinguished master node computes the partition assignment and copies the partition rule and the partition assignment to all workers. Each worker then loads a split of the input file from HDFS. If a vertex specified in the split does not belong to the worker, the worker will send messages to the owner of the vertex, telling the owner to create the vertex at the beginning of the first superstep.
  9. Here is a more detailed description of the initialization step. After loading the split and delivering the messages, each worker must respond to the master. The master starts the first superstep after receiving responses from all workers.
  10. In the execution of a superstep, each worker will create a thread per partition to run the user program for its vertices. After finishing all computations and message deliveries, each worker has to respond to the master with the number of remaining active vertices under it and the number of message deliveries.
  11. Barrier synchronization is achieved by letting the master wait until all workers respond. At the end of each superstep, using the reported information, the master can check whether any vertex is still active and whether any messages are en route, to determine whether to stop execution.
  12. We have finished the introduction to Giraph. Now we are ready to compare Giraph with chained MapReduce.
  13. The main benefit of Giraph is to avoid the unnecessary disk I/O and network traffic incurred by loading, shuffling and storing the entire graph in each iteration. Also, the vertex-centric programming model is more intuitive for thinking about graphs. However, Giraph requires a great deal of memory. In MapReduce, each map or reduce task can be executed independently, but Giraph requires the whole input graph to be loaded into memory to start execution. So the memory requirement is very high, especially when you have multiple users wanting to run Giraph applications at the same time. Even worse, the current implementation of Giraph stores all messages in memory. So the algorithm has to carefully control the communication costs to reduce memory consumption. Otherwise your job will fail due to out-of-memory errors. But it's mainly an implementation issue, not an architectural flaw.
  14. Now lets introduce the problem.
  15. Given a bipartite graph, a biclique is a set of nodes with the same neighbors. However, real data are usually full of missing or non-existing edges caused by errors. So to make the definition more suitable for real data, we have to relax the requirement and define quasi-bicliques. Informally, a quasi-biclique in a bipartite graph is a set of vertices sharing similar neighbors. In the example, the three domains resolve to a set of similar IPs and form a quasi-biclique. In practice, quasi-bicliques are useful for identifying nodes serving a similar purpose.
  16. One application of quasi-biclique mining is to identify C&C servers. Consider a bipartite graph consisting of websites and clients, where one of the websites is known to be a C&C server. Hackers typically set up multiple C&C servers for HA purposes, and these C&C servers will share similar bots. So it's possible for us to identify the remaining C&C servers by finding websites which share similar clients with the known C&C server. It's one of our target applications at Trend Micro.
  17. OK. Let's move on to the formal definition. Given a bipartite graph and a threshold mu, the quasi-biclique for a node v is the set of nodes connected to at least a mu fraction of v's neighbors. For example, let mu equal two thirds. In the graph, the quasi-biclique for D2 consists of D1 and D2. D1 is included because it connects to two thirds of D2's neighbors. D3 is not included because it only connects to one third of D2's neighbors.
  18. Suppose that the objects we are interested in finding quasi-bicliques are represented by nodes on the left side, and their associated features are represented by nodes on the right side. The problem is to find the quasi-bicliques for all vertices on the left side.
  19. Next I'd like to show you the MapReduce algorithm for solving this problem.
  20. First look at the Map function. The input to the map function is a key-value pair. The input key is a vertex y on the right side and the value is its neighbors on the left side. Each map output is also a key-value pair, where the key is a neighbor of y and the value is y's adjacency list.
  21. After aggregating the map outputs with the same key, the input to the reduce function will be a key-value pair, where the key is a vertex x on the left side, and the value is the adjacency lists of its neighbors. We then compute the quasi-biclique for x by finding the nodes which appear in a sufficient number of the adjacency lists of x's neighbors.
  22. Before Giraph was introduced to Trend Micro, we used to apply the MapReduce algorithm to mine quasi-bicliques. However, the performance was not very satisfying, so we conducted some experiments to find the root cause. The input graph is a domain-IP graph constructed using customers' web browsing logs. The logs contain the domains and corresponding IPs accessed by our customers. The input graph is only 180 megabytes in size, but the map outputs expand to 36 GB, 200 times larger than the input. Shuffling the map outputs becomes the performance bottleneck. The huge amount of map output was due to the MapReduce algorithm not utilizing the locality of the graph. As we can see in the above example, each vertex has to copy and send its adjacency list multiple times, to all of its neighbors. That makes the map outputs explode.
  23. The observation gives us an idea to improve the typical mapreduce algorithm.
  24. Our basic idea is to partition the graph into smaller groups in advance to improve data locality. Each group is composed of highly correlated nodes and can be processed by a single reducer. In this way the nodes in the same group can share one copy of the same information to reduce unnecessary communication and disk I/O. However, graph partitioning usually requires multiple iterations of processing. Chained MapReduce is not very efficient for this kind of task, so we utilized Giraph for the graph partitioning.
  25. Here is an overview of the algorithm.First, we use an iterative Giraph algorithm to divide the graph into smaller partitions. The algorithm is designed to be very lightweight to not become another bottleneck, but still produce good enough partitions. Second, we augment each partition with its adjacent inter-partition edges, so each partition is self-contained for mining quasi-bicliques. It ensures the solution produced by our improved algorithm is exactly the same as the one produced by the typical mapreduce algorithm. We do not sacrifice solution quality for performance. Finally, we compute quasi-bicliques for all augmented partitions in parallel without dependency.
  26. I'd rather not go into the details but show you a running example to get a feel for the partitioning algorithm. In the first iteration of the partitioning algorithm, each vertex on the left side sets its group ID to the hash value of its adjacency list, because we want the vertices with the same neighbors to be in the same group.
  27. In the second iteration, each vertex on the right side sets its group ID to the group ID of its highest-degree neighbor. The intention of this step is to increase the probability that correlated vertices on the right side will have the same group ID. It is based on the assumption that the graph is composed of groups with structures similar to bicliques, so that the highest-degree vertex in a group has connections to most of its group peers on the other side. If this assumption does not hold, our partitioning algorithm may not produce good results and may take a long time to converge. So I have to emphasize that this partitioning algorithm is not designed for general bipartite graphs. It works only for graphs composed of loosely coupled groups where each group has a structure similar to a biclique.
  28. From the third iteration, we apply the majority rule to adjust the group IDs. The majority rule says that a vertex changes its group ID if and only if more than half of its neighbors have a common group ID different from its current group ID.
  29. The process continues until all vertices stop changing group IDs.
  30. We also proved the convergence of our partitioning algorithm. After the algorithm converges, the vertices on the left with the same group ID and their neighbors together define a partition. Some partitions may have common right-hand-side vertices.
  31. After partitioning, there would be some missing information due to inter-partition edges. The goal of augmentation is to extend each partition with the information of its adjacent inter-partition edges. So in the example the edges D1 IP2 and D5 IP4 are added to the partition to form an augmented partition.
  32. Finally we run a mapreduce job to assign each augmented partition to a reducer. And then each reducer runs a sequential algorithm to compute quasi-bicliques for augmented partitions assigned to it.
  33. Ok next I’ll show you the performance testing result.
  34. The testing data is a domain-IP graph constructed by parsing our customers' web browsing logs, which consists of 3.4 M vertices and 2.5 M edges. The result shows that our proposed algorithm can reduce CPU time by 80% and I/O load by 95%, compared to the MapReduce algorithm. And the communication cost introduced by the graph partitioning algorithm is quite small; it takes only 1 minute to finish the graph partitioning. Actually we believe the communication cost could be significantly reduced by some simple tricks, but for now it is fast enough, so we didn't do any further optimization.
  35. Let's summarize the lessons we learned in this project. First, Giraph is great for implementing iterative algorithms because it does not incur unnecessary I/O between iterations. However, Giraph has a very high memory requirement, since it has to load the whole input graph into memory. We also learned that proper graph partitioning in advance may significantly improve the performance of subsequent graph mining tasks. However, a general graph partitioning algorithm is hard to design because we usually don't know which nodes should belong to the same group. It's usually data- and application-dependent.
  36. The potential future work includes incremental graph mining. In incremental mining, we can first observe the communication pattern of the graph mining algorithm during the past, and partition the graph so that nodes communicating often are in the same group. Then the following incremental mining tasks will have lower communication costs, particularly when the incremental algorithms are hard to design and we have to periodically re-compute the result from scratch.
  37. Here are some references.
  38. Thank you for listening.