Large scale graph processing

Deepankar Patra
IIT Madras

Goal
Running graph algorithms (e.g. shortest
path, connected components, finding the
diameter) on huge graphs (a terabyte or
more in size)

Example Graph Algorithm
● Shortest Path Algorithm
[Figure: example graph with a source vertex and a destination vertex labeled]

Why?
Many machine learning algorithms
require graph computations, and in the
real world their input graphs are too
large to fit on one machine.

Real World?
Big Graphs:
● Social Networks
● Biological Networks
● Mobile Call Networks
● Citation Networks
● World Wide Web
● Geographic Pathways
● Customer-merchant graphs (Amazon,
eBay)

Facebook Friends Graph
Src: http://wisonets.files.wordpress.com/2012/09/facebook-mutual-friends2.png

Machine Learning
Algorithms?
● Recommendation
● PageRank
● Web search
● Cyber security
● Fraud detection
● Clustering
● Shortest Path Calculation

Graph Algorithms Typically Involve
● Performing computations at each
node based on node features, edge
features, and local link structure.
● Propagating computations:
“traversing” the graph
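
A minimal sketch of both ingredients, assuming a plain adjacency-list graph held in memory on one machine. The names Graph and HopDistances are illustrative and not from the slides. The local computation here is trivial (record a hop count), and the propagation is a breadth-first traversal along out-links.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Adjacency-list graph: adj[v] lists the out-neighbours of vertex v.
using Graph = std::vector<std::vector<int>>;

// Breadth-first traversal from `source`. The local computation at each
// vertex is recording its hop distance; -1 marks unreachable vertices.
std::vector<int> HopDistances(const Graph& adj, int source) {
  assert(source >= 0 && static_cast<std::size_t>(source) < adj.size());
  std::vector<int> dist(adj.size(), -1);
  std::queue<int> frontier;
  dist[source] = 0;
  frontier.push(source);
  while (!frontier.empty()) {
    const int v = frontier.front();
    frontier.pop();
    for (int w : adj[v]) {  // propagate along v's out-links
      if (dist[w] == -1) {  // first visit: compute and keep traversing
        dist[w] = dist[v] + 1;
        frontier.push(w);
      }
    }
  }
  return dist;
}
```

On a terabyte-scale graph neither the adjacency list nor the frontier fits on one machine, which is the problem the systems in the following slides address.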

Example
Src: http://www.slideshare.net/WeiruDai

Why not MapReduce?
● Represent graphs as adjacency lists
● Perform local computations in mapper
● Pass along partial results via
outlinks, keyed by destination node
● Perform aggregation in reducer on
inlinks to a node
● Iterate until convergence, controlled
by an external “driver”
● The graph structure itself must be
passed along between iterations
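
The recipe above can be sketched as a single in-memory iteration, assuming a PageRank-style update and a std::multimap standing in for the shuffle phase. Node, Record, Map, Reduce and Iterate are hypothetical names for illustration, not a real MapReduce API. Note that the mapper has to re-emit every adjacency list so the reducer can rebuild the graph for the next round.

```cpp
#include <cassert>
#include <map>
#include <vector>

struct Node {
  double rank;
  std::vector<int> out;  // adjacency list (the graph structure)
};

// One shuffled record: either a rank contribution or a copy of the structure.
struct Record {
  bool is_structure;
  double contribution;
  std::vector<int> out;
};

using Shuffle = std::multimap<int, Record>;  // keyed by destination node

// Map: re-emit the adjacency list, then pass partial results along outlinks.
void Map(int id, const Node& node, Shuffle* shuffled) {
  shuffled->emplace(id, Record{true, 0.0, node.out});
  for (int dst : node.out) {
    shuffled->emplace(dst, Record{false, node.rank / node.out.size(), {}});
  }
}

// Reduce: aggregate the inlink contributions and restore the structure.
Node Reduce(const std::vector<Record>& records) {
  Node node{0.0, {}};
  double sum = 0.0;
  int structure_records = 0;
  for (const Record& r : records) {
    if (r.is_structure) {
      node.out = r.out;
      ++structure_records;
    } else {
      sum += r.contribution;
    }
  }
  assert(structure_records <= 1);  // at most one adjacency list per node
  node.rank = 0.15 + 0.85 * sum;
  return node;
}

// One iteration; an external driver would call this until convergence.
std::map<int, Node> Iterate(const std::map<int, Node>& graph) {
  Shuffle shuffled;
  for (const auto& entry : graph) Map(entry.first, entry.second, &shuffled);
  std::map<int, Node> next;
  for (auto it = shuffled.begin(); it != shuffled.end();) {
    const int key = it->first;
    std::vector<Record> group;  // all records that share this key
    for (; it != shuffled.end() && it->first == key; ++it) {
      group.push_back(it->second);
    }
    next[key] = Reduce(group);
  }
  return next;
}
```

In a real cluster every call to Iterate would be a separate job that reads and writes the whole graph, which is the overhead this slide points at.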

Why not Spark?
● Spark provides the GraphX library for
graph computation (alongside MLlib for
machine learning).
● However, Spark is a general-purpose
engine, not one designed specifically
for graph algorithms.
● So optimizations that apply only to
graphs may not be available.

PREGEL, Google, 2010
● Basic idea: “think like a vertex”
● Based on the Bulk Synchronous
Parallel (BSP) model
● Provides scalability
● Provides fault tolerance
● Provides flexibility to express
arbitrary graph algorithms

How does it work?
● Master/Worker architecture
● Each worker is assigned a subset of
a directed graph’s vertices
● Vertex-centric model. Each vertex
has:
  ○ An arbitrary “value” that can be
  read and written
  ○ A list of messages sent to it
  ○ A list of outgoing edges (edges
  have a value too)
  ○ A binary state (active/inactive)

Graph Partitioning
[Figure: the graph's vertices divided among Worker 1, Worker 2, and Worker 3]
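
A minimal sketch of one way to assign vertices to workers. The Pregel paper describes its default partitioning as hash(ID) mod N; the function names WorkerFor and Partition below are illustrative, and std::hash stands in for whatever hash function a real system would use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Maps a vertex ID to a worker index in [0, num_workers).
std::size_t WorkerFor(std::int64_t vertex_id, std::size_t num_workers) {
  assert(num_workers > 0);
  return std::hash<std::int64_t>{}(vertex_id) % num_workers;
}

// Groups vertex IDs by the worker that owns them.
std::vector<std::vector<std::int64_t>> Partition(
    const std::vector<std::int64_t>& ids, std::size_t num_workers) {
  std::vector<std::vector<std::int64_t>> owned(num_workers);
  for (std::int64_t id : ids) {
    owned[WorkerFor(id, num_workers)].push_back(id);
  }
  return owned;
}
```

Because the owner of a vertex can be computed from its ID alone, any worker can route a message to the right machine without a lookup table. The trade-off is that hashing ignores locality, so neighbouring vertices often land on different workers.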

Pregel execution model
The master initiates synchronous iterations (each called a
“superstep”). In every superstep:
● Each worker executes a user function on all of its
vertices, asynchronously with respect to other workers
● Vertices receive the messages sent to them in the
previous superstep
● Vertices can modify their own value, modify the values of
their edges, and change the topology of the graph
(add/remove vertices or edges)
● Vertices can send messages to other vertices, to be
received in the next superstep
● Vertices can “vote to halt”
● Execution stops when all vertices have voted to halt
and no messages are in transit
● A vote to halt is overridden by a non-empty message
queue: an incoming message reactivates the vertex
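
The superstep loop can be sketched as a single-process simulation. This is an illustration under stated assumptions, not Pregel's implementation: vertices are processed sequentially rather than by parallel workers, and the user function propagates the maximum value through the graph. Vertex, Compute and Run are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Vertex {
  int value;
  std::vector<int> out;    // indices of out-neighbours
  std::vector<int> inbox;  // messages delivered for this superstep
  bool active;             // false once the vertex has voted to halt
};

// User function: adopt the largest value seen and forward it on change.
void Compute(Vertex& v, int superstep, std::vector<std::vector<int>>& outbox) {
  int best = v.value;
  for (int m : v.inbox) best = std::max(best, m);
  const bool changed = best > v.value;
  v.value = best;
  if (superstep == 0 || changed) {
    for (int dst : v.out) {
      assert(dst >= 0 && static_cast<std::size_t>(dst) < outbox.size());
      outbox[dst].push_back(v.value);  // delivered in the next superstep
    }
  }
  v.active = false;  // vote to halt; a later message reactivates the vertex
}

// Runs supersteps until every vertex has halted and no messages are pending.
// Returns the number of supersteps in which at least one vertex ran.
int Run(std::vector<Vertex>& graph) {
  int superstep = 0;
  for (;; ++superstep) {
    std::vector<std::vector<int>> outbox(graph.size());
    bool any_ran = false;
    for (Vertex& v : graph) {
      if (!v.inbox.empty()) v.active = true;  // message overrides the vote
      if (!v.active) continue;
      Compute(v, superstep, outbox);
      any_ran = true;
    }
    if (!any_ran) break;
    // Barrier: messages sent now become the inboxes of the next superstep.
    for (std::size_t i = 0; i < graph.size(); ++i) {
      graph[i].inbox = std::move(outbox[i]);
    }
  }
  return superstep;
}
```

On a four-vertex chain with values 3, 6, 2, 1 every vertex ends up holding 6, and the loop stops once a superstep passes in which no vertex is active and no message is waiting.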

Page Rank
PageRank is a link analysis
algorithm that is used to determine
the importance of a document based on
the number of references to it and
the importance of the source
documents themselves.

Page Rank
A = A given page
T1 ... Tn = Pages that point to page
A (citations)
d = Damping factor between 0 and 1
(usually set to 0.85)
C(T) = number of links going out of T
PR(A) = the PageRank of page A
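
The formula that ties these symbols together does not appear in the text of this slide. Assuming the original (non-normalized) PageRank formulation, which matches the definitions above, it would read:

```latex
PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)
```

With d = 0.85 the constant term is 1 - d = 0.15.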

Page Rank
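class PageRankVertex
    : public Vertex<double, void, double> {
 public:
  // Called once per active vertex in every superstep.
  virtual void Compute(MessageIterator* msgs) {
    if (superstep() >= 1) {
      // Sum the rank contributions sent by in-neighbours.
      double sum = 0;
      for (; !msgs->Done(); msgs->Next())
        sum += msgs->Value();
      // (1 - d) + d * sum, with damping factor d = 0.85.
      *MutableValue() = 0.15 + 0.85 * sum;
    }
    if (superstep() < 30) {
      // Split this vertex's rank evenly across its out-edges.
      const int64 n = GetOutEdgeIterator().size();
      SendMessageToAllNeighbors(GetValue() / n);
    } else {
      // Stop after a fixed 30 supersteps.
      VoteToHalt();
    }
  }
};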

Open Source
Pregel was published as a research paper;
Google did not release an open source
implementation. As a result, many open
source implementations have appeared, and
they keep improving on the basic Pregel
model. The two most notable are:
a) Apache Giraph, used and developed
heavily by Facebook
b) CMU's GraphLab (now a company in its
own right)

One Example: GraphLab
● GraphLab is currently the best of
these options
● GraphLab modified the partitioning
strategy to reduce the network overhead
of message transfer among workers
● GraphLab has a rich library of
machine learning algorithms, and it is
growing

References
● Pregel: A System for Large-Scale Graph
Processing
● PowerGraph: Distributed Graph-Parallel
Computation on Natural Graphs
● GraphX: A Resilient Distributed Graph
System on Spark
● giraph.apache.org
● graphlab.org
