Slides for the Apache Giraph tutorial for the Data Mining class.
Sapienza, University of Rome.
Master of Science in Engineering in Computer Science
Prof. A. Anagnostopoulos, I. Chatzigiannakis, A. Gionis
Data Mining class
Fall 2016
Parallelizing with Apache Spark in Unexpected Ways (Databricks)
Out of the box, Spark provides rich and extensive APIs for performing in-memory, large-scale computation across data. Once a system has been built and tuned with Spark Datasets/DataFrames/RDDs, have you ever been left wondering if you could push the limits of Spark even further? In this session, we will cover some of the tips learned while building retail-scale systems at Target to maximize the parallelization that you can achieve from Spark in ways that may not be obvious from current documentation. Specifically, we will cover multithreading the Spark driver with Scala Futures to enable parallel job submission. We will talk about developing custom partitioners to leverage the ability to apply operations across understood chunks of data and what tradeoffs that entails. We will also dive into strategies for parallelizing scripts that might have nothing to do with Spark, to support environments where peers work in multiple languages or perhaps a different language/library is just the best tool to get the job done. Come learn how to squeeze every last drop out of your Spark job with strategies for parallelization that go off the beaten path.
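The parallel-submission pattern the abstract describes uses Scala Futures; since no code accompanies the abstract, here is a minimal sketch of the same idea in Python using `concurrent.futures`. The `run_job` function is a hypothetical stand-in for an independent Spark action (e.g. a `count()` or a write) triggered from the driver:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an independent Spark action; in a real driver
# each call would trigger its own job against the cluster.
def run_job(name, data):
    return name, sum(data)

jobs = {"clicks": range(100), "views": range(50)}

# Submitting from multiple driver threads lets independent jobs run
# concurrently instead of strictly one after another.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_job, name, data) for name, data in jobs.items()]
    results = dict(f.result() for f in futures)

print(results)  # {'clicks': 4950, 'views': 1225}
```

The same shape applies in Scala: wrap each action in a `Future`, then await the collection of futures, letting the Spark scheduler interleave the jobs.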
Hive Bucketing in Apache Spark with Tejas Patil (Databricks)
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), making successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. Bucketing can enable faster joins (e.g., a single-stage sort-merge join), the ability to short-circuit FILTER operations if the file is pre-sorted on the column in a filter predicate, and quick data sampling.
In this session, you’ll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook’s performance tests have shown bucketing to make Spark 3-5x faster when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark has resulted in 2-3x savings when compared to Hive. You’ll also hear about real-world applications of bucketing, like loading cumulative tables with daily deltas, and the characteristics that help identify candidate jobs that can benefit from bucketing.
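The mechanism can be sketched outside Spark: write-time bucketing assigns each row to a bucket by hashing the join key, so two tables bucketed the same way can be joined bucket-by-bucket with no shuffle. A minimal Python sketch, where the bucket count and sample tables are invented for illustration:

```python
NUM_BUCKETS = 4

def bucket_of(key):
    # A deterministic hash of the join key decides the bucket at write time.
    return hash(key) % NUM_BUCKETS

def write_bucketed(rows, key_col):
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[bucket_of(row[key_col])].append(row)
    return buckets

orders = [{"user": u, "amount": a} for u, a in [(1, 10), (2, 20), (3, 30)]]
users  = [{"user": u, "name": n} for u, n in [(1, "ann"), (2, "bob"), (3, "cat")]]

ob = write_bucketed(orders, "user")
ub = write_bucketed(users, "user")

# Join bucket-by-bucket: matching keys are guaranteed to share a bucket
# index, so no cross-bucket data movement (shuffle) is needed at read time.
joined = []
for i in range(NUM_BUCKETS):
    names = {r["user"]: r["name"] for r in ub[i]}
    for r in ob[i]:
        if r["user"] in names:
            joined.append((r["user"], names[r["user"]], r["amount"]))

print(sorted(joined))
```

The one-time cost is the hashing at write time; every later join or aggregation on the bucketing key skips the shuffle stage entirely.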
Compression Options in Hadoop - A Tale of Tradeoffs (DataWorks Summit)
Yahoo! is one of the most-visited web sites in the world. It runs one of the largest private cloud infrastructures, one that operates on petabytes of data every day. Being able to store and manage that data well is essential to the efficient functioning of Yahoo!'s Hadoop clusters. A key component that enables this efficient operation is data compression. With regard to compression algorithms, there is an underlying tension between compression ratio and compression performance. Consequently, Hadoop provides support for several compression algorithms, including gzip, bzip2, Snappy, LZ4 and others. This plethora of options can make it difficult for users to select appropriate codecs for their MapReduce jobs. This paper attempts to provide guidance in that regard. Performance results with Gridmix and with several corpora of data are presented. The paper also describes enhancements we have made to the bzip2 codec that improve its performance. This will be of particular interest to the increasing number of users operating on “Big Data” who require the best possible ratios. The impact of using the Intel IPP libraries is also investigated; these have the potential to improve performance significantly. Finally, a few proposals for future enhancements to Hadoop in this area are outlined.
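The ratio-versus-speed tension is easy to observe directly. A small Python sketch comparing two codec families Hadoop also supports (zlib, the DEFLATE algorithm behind gzip, and bzip2) on invented repetitive sample data:

```python
import bz2
import time
import zlib

# Highly repetitive input, like many log files.
data = b"row,of,log,data\n" * 50_000

for name, compress in [("zlib (gzip-style)", zlib.compress), ("bzip2", bz2.compress)]:
    start = time.perf_counter()
    out = compress(data)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Ratio = original size / compressed size; higher is better.
    print(f"{name}: ratio {len(data) / len(out):.0f}x in {elapsed_ms:.1f} ms")
```

On data like this, bzip2 tends to achieve a better ratio while taking noticeably longer, which is exactly the tradeoff to weigh when picking a per-job codec.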
ORC files were originally introduced in Hive, but have now migrated to an independent Apache project. This has sped up the development of ORC and simplified integrating ORC into other projects, such as Hadoop, Spark, Presto, and NiFi. There are also many new tools built on top of ORC, such as Hive’s ACID transactions and LLAP, which provides incredibly fast reads for your hot data. LLAP also provides strong security guarantees that allow each user to only see the rows and columns that they have permission for.
This talk will discuss the details of the ORC and Parquet formats and what the relevant tradeoffs are. In particular, it will discuss how to format your data and the options to use to maximize your read performance. Specifically, we’ll discuss when and how to use ORC’s schema evolution, bloom filters, and predicate pushdown. It will also show you how to use the tools to translate ORC files into human-readable formats, such as JSON, and display the rich metadata from the file, including the types in the file and the min, max, and count for each column.
File Format Benchmarks - Avro, JSON, ORC, & Parquet (Owen O'Malley)
Hadoop Summit June 2016
The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so. The use cases that we’ve examined are:
* reading all of the columns
* reading a few of the columns
* filtering using a filter predicate
* writing the data
Furthermore, it is important to benchmark on real data rather than synthetic data. We used the GitHub logs data available freely from http://githubarchive.org. We will make all of the benchmark code open source so that our experiments can be replicated.
Observabilidad: Todo lo que hay que ver (Software Guru)
The code we write lives, and finds its purpose, the moment it reaches production... How do we know how effective it is? We can only know by measuring it.
Presented by Isaac Ruiz Guerra at SG Virtual Conference 2020.
From Query Plan to Query Performance: Supercharging your Apache Spark Queries... (Databricks)
The SQL tab in the Spark UI provides a lot of information for analyzing your Spark queries, ranging from the query plan to all associated statistics. However, many new Spark practitioners get overwhelmed by the information presented and have trouble using it to their benefit. In this talk we want to give a gentle introduction to reading this SQL tab. We will first go over the common Spark operations, such as scans, projects, filters, aggregations, and joins, and how they relate to the Spark code written. In the second part of the talk we will show how to read the associated statistics to pinpoint performance bottlenecks.
A Deep Dive into Query Execution Engine of Spark SQL (Databricks)
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics-database technologies. Relational queries are compiled to executable physical plans consisting of transformations and actions on RDDs, with generated Java code. The code is compiled to Java bytecode, executed by the JVM, and optimized by the JIT compiler into native machine code at runtime. This talk takes a deep dive into the Spark SQL execution engine, covering pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, and lineage-based RDD transformations and actions.
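The whole-stage idea can be illustrated in miniature: instead of interpreting a chain of operators row by row, fuse them into one generated function and compile it. A toy Python sketch, where the "plan" and the code generator are simplified inventions, not Spark's actual Catalyst output:

```python
# A toy physical plan: filter rows, then project a computed column.
plan = [("filter", "row['x'] > 2"), ("project", "{'y': row['x'] * 10}")]

def compile_stage(plan):
    # Fuse all operators into one generated loop body, then compile it,
    # mimicking whole-stage code generation.
    body = ["def stage(rows):", "    out = []", "    for row in rows:"]
    indent = "        "
    for op, expr in plan:
        if op == "filter":
            body.append(f"{indent}if not ({expr}): continue")
        elif op == "project":
            body.append(f"{indent}row = {expr}")
    body.append(f"{indent}out.append(row)")
    body.append("    return out")
    namespace = {}
    exec("\n".join(body), namespace)  # compile generated source to bytecode
    return namespace["stage"]

stage = compile_stage(plan)
print(stage([{"x": 1}, {"x": 3}, {"x": 5}]))  # [{'y': 30}, {'y': 50}]
```

The win is the same as in Spark: one tight compiled loop with no per-operator virtual-call overhead between the filter and the projection.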
Rethinking State Management in Cloud-Native Streaming Systems (Yingjun Wu)
Current 2022 talk.
Speaker: Yingjun Wu
Title: Rethinking State Management in Cloud-Native Streaming Systems.
Abstract:
Stream processing is becoming increasingly essential for extracting business value from data in real time. To achieve strict user-defined SLAs under constantly changing workloads, modern streaming systems have started taking advantage of the cloud for scalable and resilient resources. This new demand opens new opportunities and challenges for state management, which is at the core of streaming systems. Existing approaches typically use embedded key-value storage so that each worker can access it locally to achieve high performance. However, this requires an external durable file system for checkpointing, makes it complicated and time-consuming to redistribute state during scaling and migration, and is prone to performance throttling. Therefore, we propose shared storage based on an LSM-tree. State is stored in cloud object storage, which seamlessly makes it durable, and the high bandwidth of cloud storage enables fast recovery. The location of a state partition is decoupled from compute nodes, making scaling straightforward and more efficient. Compaction in this shared LSM-tree is globally coordinated with opportunistic serverless boosting instead of relying on individual compute nodes. We design a streaming-aware compaction and caching strategy to achieve smoother and better end-to-end performance.
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P... (Databricks)
Parquet is a very popular column-based format. Spark can automatically filter out useless data by pushing down filters against parquet file statistics, such as min-max statistics. In addition, Spark users can enable the parquet vectorized reader to read parquet files in batches. These features improve Spark performance greatly and save both CPU and IO. Parquet is the default data format of the data warehouse at Bytedance. In practice, we found that parquet pushdown filters work poorly, reading far too much unnecessary data, because the statistics have no discrimination across parquet row groups (column data is out of order when written to parquet files by ETL jobs).
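The min-max mechanism the abstract refers to can be sketched: each row group stores its column min and max, and a pushdown filter skips whole groups whose range cannot match the predicate. Sorted data gives tight ranges, while unsorted data makes every group's range wide so nothing is skipped. The row-group layout below is invented for illustration:

```python
def row_groups(values, group_size=4):
    # Each "row group" keeps min/max statistics, as parquet footers do.
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [(min(g), max(g), g) for g in groups]

def read_where_equals(groups, target):
    scanned = 0
    hits = []
    for lo, hi, group in groups:
        if target < lo or target > hi:
            continue  # pushdown: skip the whole group without reading it
        scanned += 1
        hits += [v for v in group if v == target]
    return hits, scanned

sorted_groups = row_groups([1, 2, 3, 4, 5, 6, 7, 8])
shuffled_groups = row_groups([8, 1, 6, 3, 2, 7, 4, 5])

print(read_where_equals(sorted_groups, 6))    # ([6], 1): first group skipped
print(read_where_equals(shuffled_groups, 6))  # ([6], 2): both ranges overlap 6
```

This is why out-of-order ETL output defeats pushdown: with shuffled data every row group's [min, max] covers the target, so every group must be scanned.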
HBaseCon 2015 General Session: Zen - A Graph Data Model on HBase (HBaseCon)
Zen is a storage service built at Pinterest that offers a graph data model on top of HBase and potentially other storage backends. In this talk, Zen's architects go over the design motivation for Zen and describe its internals, including the API, type system, and HBase backend.
Fast, Scalable Graph Processing: Apache Giraph on YARN (DataWorks Summit)
Apache Giraph performs offline, batch processing of very large graph datasets on top of a Hadoop cluster. Giraph replaces iterative MapReduce-style solutions with Bulk Synchronous Parallel graph processing using in-memory or disk-based data sets, loosely following the model of Google's Pregel. Many recent advances have left Giraph more robust, efficient, and fast, and able to accept a variety of I/O formats typical for graph data in and out of the Hadoop ecosystem. Giraph's recent port to a pure YARN platform offers increased performance, fine-grained resource control, and scalability that Giraph atop Hadoop MRv1 cannot match, while paving the way for ports to other platforms like Apache Mesos. Come see what's on the roadmap for Giraph, what Giraph on YARN means, and how Giraph is leveraging the power of YARN to become a more robust, usable, and useful platform for processing Big Graph datasets.
Aspect-level sentiment analysis of customer reviews using Double Propagation (Hardik Dalal)
Aspect-Based Sentiment Analysis (ABSA) of customer reviews is an ongoing research area in the data mining domain. The algorithm detects aspects in reviews using Double Propagation, and uses PageRank to rank the aspects based on their occurrence.
Discover psychographic marketing and how your business can turn around within 30 days. Not many people are talking about this type of marketing, because it can be confusing.
Hopefully, this will show you how to break it down so it is easier to digest. Marketing is key to success!
Yelp Data Challenge - Discovering Latent Factors using Ratings and Reviews (Tharindu Mathew)
A restaurant's average rating and reviews on Yelp influence customers to an incredible degree. An extra half-star rating causes restaurants to sell out 19 percentage points (49%) more frequently. Despite the impact on the restaurant's business, achieving a better overall rating is not straightforward. A user may give only one star to the restaurant just because he or she found the quality of service to be abysmal, even though the food and the restaurant's location were up to his or her standard. These facts may have been mentioned in the review in detail, but the final rating would just reflect the poor quality of service. The user rating alone does not provide any additional details, and as a result the restaurant may not be able to understand which aspects create a negative impact on user experience. Another case may be that a certain popular dish makes users give the restaurant five-star ratings, but they would not be satisfied with another aspect of the restaurant, such as the dessert. The high user ratings may hide the fact that some aspects of the user experience were negative and that the restaurant has room to improve. Traditional recommender systems usually use only the aggregated ratings without considering the hidden factors in the preferences of the users and the properties of the restaurants. For the restaurant domain, this could mean main cuisine, dessert, service, staff friendliness, knowledge of staff, location, ambiance, price, and many more aspects. Without considering the ratings for individual aspects, it is likely that recommendation systems will give inaccurate predictions to restaurants as well as users.
In this project, we aim to uncover hidden details about the users' preferences with respect to restaurant properties. With this information, we can provide precise recommendations to the restaurants regarding what aspects they should concentrate on to improve user experience. Since we are backed by more meaningful information about users' preferences we can provide better recommendations to users as to which restaurants they would prefer and why. To summarize, from the results of this project, we can answer the following questions: "what does a particular user care about when dining from a restaurant?", "which aspect should the restaurant improve in order to effectively increase the rating?", and "which restaurant is the best for a particular user?"
Jed Nachman, Vice President of Sales at Yelp, presents on how companies can manage and benefit from user-generated reviews. He begins the presentation by reviewing the benefits and downsides of online reviews, and then moves on to ways apartment operators and other industries can manage a significant amount of user-generated content.
Snapchat is an awesome messaging app, but why can users not yet communicate with groups in-app? This presentation shows how Group Snaps can fit snugly into the existing Snapchat app, from start to finish in the product development cycle.
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra (DataStax Academy)
This session covers our experience with using the Spark and Shark frameworks for running real-time queries on top of Cassandra data. We will start by surveying the current Cassandra analytics landscape, including Hadoop and Hive, and touch on the use of custom input formats to extract data from Cassandra. We will then dive into Spark and Shark, two memory-based cluster computing frameworks, and how they enable often dramatic improvements in query speed and productivity over the standard solutions today.
Debugging Apache Spark - Scala & Python super happy fun times 2017 (Holden Karau)
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. And in addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to quickly use the UI to figure out if certain types of issues are occurring in our job.
Since its debut in 2010, Apache Spark has become one of the most popular Big Data technologies in the Apache open source ecosystem. In addition to enabling processing of large data sets through its distributed computing architecture, Spark provides out-of-the-box support for machine learning, streaming, and graph processing in a single framework. Spark has been supported by companies like Microsoft, Google, Amazon, and IBM, and in financial services, companies like Blackrock (http://bit.ly/1Q1DVJH) and Bloomberg (http://bit.ly/29LXbPv) have started to integrate Apache Spark into their tool chain, and the interest is growing. Unlike other big-data technologies which require intensive programming using Java etc., Spark enables data scientists to work with a big-data technology using higher-level languages like Python and R, making it accessible for conducting experiments and rapid prototyping.
In this talk, we will introduce Apache Spark and discuss the key features that differentiate Apache Spark from other technologies. We will provide examples on how Apache Spark can help scale analytics and discuss how the machine learning API could be used to solve large-scale machine learning problems using Spark’s distributed computing framework. We will also illustrate enterprise use cases for scaling analytics with Apache Spark.
Data Science at Scale: Using Apache Spark for Data Science at Bitly (Sarah Guido)
Given at Data Day Seattle 2015.
Bitly generates over 9 billion clicks on shortened links a month, as well as over 100 million unique link shortens. Analyzing data of this scale is not without its challenges. At Bitly, we have started adopting Apache Spark as a way to process our data. In this talk, I’ll elaborate on how I use Spark as part of my data science workflow. I’ll cover how Spark fits into our existing architecture, the kind of problems I’m solving with Spark, and the benefits and challenges of using Spark for large-scale data science.
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch... (Simplilearn)
This presentation on Pig will help you understand why Pig is required, what is Pig, MapReduce vs Hive vs Pig, Pig architecture, working of Pig, Pig Latin data model, Pig Execution modes, and finally a demo which shows Pig Latin scripts. Pig is a scripting platform that runs on Hadoop clusters, designed to process and analyze large datasets. It operates on various types of data like structured, semi-structured and unstructured data. Pig Latin is the procedural data flow language used in Pig to analyze data. It is easy to program using Pig Latin as it is similar to SQL.
Now, let us get started with Pig.
Below topics are explained in this Pig presentation:
1. Why Pig?
2. What is Pig?
3. MapReduce vs Hive vs Pig
4. Pig architecture
5. Working of Pig
6. Pig Latin data model
7. Pig Execution modes
8. Use case – Twitter
9. Features of Pig
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
I am shubham sharma graduated from Acropolis Institute of technology in Computer Science and Engineering. I have spent around 2 years in field of Machine learning. I am currently working as Data Scientist in Reliance industries private limited Mumbai. Mainly focused on problems related to data handing, data analysis, modeling, forecasting, statistics and machine learning, Deep learning, Computer Vision, Natural language processing etc. Area of interests are Data Analytics, Machine Learning, Machine learning, Time Series Forecasting, web information retrieval, algorithms, Data structures, design patterns, OOAD.
Big Data Analysis : Deciphering the haystack Srinath Perera
A primary outcome of Bigdata is to derive useful and actionable insights from large or challenges data collections. The goal is to run the transformations from data, to information, to knowledge, and finally to insights. This includes calculating simple analytics like Mean, Max, and Median, to derive overall understanding about data by building models, and finally to derive predictions from data. Some cases we can afford to wait to collect and processes them, while in other cases we need to know the outputs right away. MapReduce has been the defacto standard for data processing, and we will start our discussion from there. However, that is only one side of the problem. There are other technologies like Apache Spark and Apache Drill graining ground, and also realtime processing technologies like Stream Processing and Complex Event Processing. Finally there are lot of work on porting decision technologies like Machine learning into big data landscape. This talk discusses big data processing in general and look at each of those different technologies comparing and contrasting them.
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...Big Data Spain
http://www.bigdataspain.org/2014/conference/state-of-play-data-science-on-hadoop-in-2015-keynote
Machine Learning is not new. Big Machine Learning is qualitatively different: More data beats algorithm improvement, scale trumps noise and sample size effects, can brute-force manual tasks.
Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Slides: https://speakerdeck.com/bigdataspain/state-of-play-data-science-on-hadoop-in-2015-by-sean-owen-at-big-data-spain-2014
Start guide to web scraping with Scrapy, one of best python modules to do web scraping, with Scrapy everything is more easy.
This presentation covers the key concepts of scrapy and the process of criation of spiders.
It's the first draft version and will be other versions, until the last version, if you see something that you want to be improved, give feedback and I will take that in consideration.
I also talk about some alternatives to scrapy like lxml, newspapers and others.
In the final i give you acess to the code used on this presentation, so you cant test easy and fast the concepts talked on this presentation.
I hope you like it :D
Agile Data: Building Hadoop Analytics ApplicationsDataWorks Summit
Mining data requires a deep investment in people and time. How can you be sure you’re building the right models? What tools help you connect with the customer’s needs? With this hands-on presentation, you’ll learn a flexible toolset and methodology for building effective analytics applications. Agile Data (the book) shows you how to create an environment for exploring data, using lightweight tools such as Python, Apache Pig, and the D3.js (Data-Driven Documents) JavaScript library. You’ll learn an iterative approach that allows you to quickly change the kind of analysis you’re doing, as you discover what the data is telling you. All the example code in this book is available as working web applications. We will cover how to: * Build an application to mine your own email inbox * Use different data structures and algorithms to extract multiple features from a single dataset, and learn how different perspectives can yield insight * Rapidly boot your applications as simple front-ends to a document store * Add features driven by descriptive and inferential statistics, machine learning, and data visualization * Gather usage data and talk to real users to help guide your data-driven exploration
The story of how solving one problem the OpenSource way
opened doors to so much more. Talk presented by Pranav Prakash and Hari Prasanna at OSDConf 2014, New Delhi.
Introduction to GraphQL (or How I Learned to Stop Worrying about REST APIs)Hafiz Ismail
Talk for FOSSASIA 2016 (http://2016.fossasia.org)
----
This talk will give a brief and enlightening look into how GraphQL can help you address common weaknesses that you, as a web / mobile developer, would normally face with using / building typical REST API systems.
Let's stop fighting about whether we should implement the strictest interpretation of REST or how pragmatic REST-ful design is the only way to go, or debate about what REST is or what it should be.
A couple of demos (In Golang! Yay!) will be shown that are guaranteed to open up your eyes and see that the dawn of liberation for product developers is finally here.
Background: GraphQL is a data query language and runtime designed and used at Facebook to request and deliver data to mobile and web apps since 2012.
Hafiz Ismail (@sogko) is a contributor to Go / Golang implementation of GraphQL server library (https://github.com/graphql-go/graphql) and is looking to encourage fellow developers to join in the collaborative effort.
Similar to Apache Giraph: Large-scale graph processing done better (20)
Honest Reviews of Tim Han LMA Course Program.pptxtimhan337
Personal development courses are widely available today, with each one promising life-changing outcomes. Tim Han’s Life Mastery Achievers (LMA) Course has drawn a lot of interest. In addition to offering my frank assessment of Success Insider’s LMA Course, this piece examines the course’s effects via a variety of Tim Han LMA course reviews and Success Insider comments.
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdfTechSoup
In this webinar you will learn how your organization can access TechSoup's wide variety of product discount and donation programs. From hardware to software, we'll give you a tour of the tools available to help your nonprofit with productivity, collaboration, financial management, donor tracking, security, and more.
The French Revolution, which began in 1789, was a period of radical social and political upheaval in France. It marked the decline of absolute monarchies, the rise of secular and democratic republics, and the eventual rise of Napoleon Bonaparte. This revolutionary period is crucial in understanding the transition from feudalism to modernity in Europe.
For more information, visit-www.vavaclasses.com
The Roman Empire A Historical Colossus.pdfkaushalkr1407
The Roman Empire, a vast and enduring power, stands as one of history's most remarkable civilizations, leaving an indelible imprint on the world. It emerged from the Roman Republic, transitioning into an imperial powerhouse under the leadership of Augustus Caesar in 27 BCE. This transformation marked the beginning of an era defined by unprecedented territorial expansion, architectural marvels, and profound cultural influence.
The empire's roots lie in the city of Rome, founded, according to legend, by Romulus in 753 BCE. Over centuries, Rome evolved from a small settlement to a formidable republic, characterized by a complex political system with elected officials and checks on power. However, internal strife, class conflicts, and military ambitions paved the way for the end of the Republic. Julius Caesar’s dictatorship and subsequent assassination in 44 BCE created a power vacuum, leading to a civil war. Octavian, later Augustus, emerged victorious, heralding the Roman Empire’s birth.
Under Augustus, the empire experienced the Pax Romana, a 200-year period of relative peace and stability. Augustus reformed the military, established efficient administrative systems, and initiated grand construction projects. The empire's borders expanded, encompassing territories from Britain to Egypt and from Spain to the Euphrates. Roman legions, renowned for their discipline and engineering prowess, secured and maintained these vast territories, building roads, fortifications, and cities that facilitated control and integration.
The Roman Empire’s society was hierarchical, with a rigid class system. At the top were the patricians, wealthy elites who held significant political power. Below them were the plebeians, free citizens with limited political influence, and the vast numbers of slaves who formed the backbone of the economy. The family unit was central, governed by the paterfamilias, the male head who held absolute authority.
Culturally, the Romans were eclectic, absorbing and adapting elements from the civilizations they encountered, particularly the Greeks. Roman art, literature, and philosophy reflected this synthesis, creating a rich cultural tapestry. Latin, the Roman language, became the lingua franca of the Western world, influencing numerous modern languages.
Roman architecture and engineering achievements were monumental. They perfected the arch, vault, and dome, constructing enduring structures like the Colosseum, Pantheon, and aqueducts. These engineering marvels not only showcased Roman ingenuity but also served practical purposes, from public entertainment to water supply.
Instructions for Submissions thorugh G- Classroom.pptxJheel Barad
This presentation provides a briefing on how to upload submissions and documents in Google Classroom. It was prepared as part of an orientation for new Sainik School in-service teacher trainees. As a training officer, my goal is to ensure that you are comfortable and proficient with this essential tool for managing assignments and fostering student engagement.
Operation “Blue Star” is the only event in the history of Independent India where the state went into war with its own people. Even after about 40 years it is not clear if it was culmination of states anger over people of the region, a political game of power or start of dictatorial chapter in the democratic setup.
The people of Punjab felt alienated from main stream due to denial of their just demands during a long democratic struggle since independence. As it happen all over the word, it led to militant struggle with great loss of lives of military, police and civilian personnel. Killing of Indira Gandhi and massacre of innocent Sikhs in Delhi and other India cities was also associated with this movement.
Acetabularia Information For Class 9 .docxvaibhavrinwa19
Acetabularia acetabulum is a single-celled green alga that in its vegetative state is morphologically differentiated into a basal rhizoid and an axially elongated stalk, which bears whorls of branching hairs. The single diploid nucleus resides in the rhizoid.
Embracing GenAI - A Strategic ImperativePeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Francesca Gottschalk - How can education support child empowerment.pptxEduSkills OECD
Francesca Gottschalk from the OECD’s Centre for Educational Research and Innovation presents at the Ask an Expert Webinar: How can education support child empowerment?
2. Basic concepts Let’s start Get our hands dirty
Hi!
Simone Santacroce
santacroce.1542338@studenti.uniroma1.it
https://it.linkedin.com/in/simone-santacroce-272739134
Manuel Coppotelli
coppotelli.1540732@studenti.uniroma1.it
https://it.linkedin.com/in/manuelcoppotelli
George Adrian Munteanu
munteanu.1540833@studenti.uniroma1.it
https://it.linkedin.com/in/george-adrian-munteanu-707744134
Lorenzo Marconi
marconi.1494505@studenti.uniroma1.it
https://www.linkedin.com/in/lorenzo-marconi-1a2580105
Antonio La Torre
alatorre182@hotmail.it
https://www.linkedin.com/in/antonio-la-torre-768738134
Lucio Burlini
burlini.1705432@studenti.uniroma1.it
https://www.linkedin.com/in/lucio-burlini-827739134
Apache Giraph
Agenda
1 Basic concepts
• Graphs in the real world
• Challenges on graphs
• MapReduce
• Giraph
2 Let’s start
• Out-Degree & In-Degree
3 Get our hands dirty
• Simple PageRank
Graphs 101
• Graph: representation of a set of objects, G = <V, E>
• Captures pairwise relationships between objects
• Can have directions, weights, ...
Social networks
• Both physical and Internet-mediated
• Users are vertices
• Any kind of interaction generates edges
Graphs are nasty
• Graphs need processing
• Each vertex depends on its neighbors, recursively
• Recursive problems are nicely solved iteratively
So what?
Why not MapReduce?¹
MapReduce is the current standard for intensive computation over big data sets.
Repeat N times ...
¹ https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
MapReduce Drawbacks
• Each job is executed N times
• Job bootstrap cost on every iteration
• Mappers resend both the values and the graph structure every time
• Extensive I/O at input, shuffle & sort, and output
Disk I/O and job scheduling quickly dominate the algorithm's runtime
Google's Pregel²
• Developed especially for large-scale graph processing
• Intuitive API that lets you "think like a vertex"
• Bulk Synchronous Parallel (BSP) as the execution model
• Fault tolerance by checkpointing
² https://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/p135-malewicz.pdf
Think like a vertex
• Each vertex has an id, a value, a list of adjacent neighbors and the corresponding edge values
• Vertices implement algorithms by sending messages
• Messages are delivered at the start of each superstep
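To make the vertex-centric model concrete, here is a toy, in-memory sketch of the superstep loop in plain Java. It is NOT the real Giraph API (the class and method names are invented for illustration): each vertex adopts the largest vertex id it has seen and messages its neighbors whenever its value changes; when no messages are produced, the computation halts.

```java
import java.util.*;

// Toy vertex-centric BSP loop (illustrative sketch, NOT the Giraph API).
// Vertices propagate the maximum vertex id they have seen; once no value
// changes, no messages are sent and the computation terminates.
public class MaxIdDemo {

    public static Map<Long, Long> run(Map<Long, List<Long>> graph) {
        Map<Long, Long> value = new HashMap<>();
        Map<Long, List<Long>> inbox = new HashMap<>();
        // Superstep 0: each vertex takes its own id and tells its neighbors.
        for (long v : graph.keySet()) {
            value.put(v, v);
            for (long n : graph.get(v))
                inbox.computeIfAbsent(n, k -> new ArrayList<>()).add(v);
        }
        // Messages sent in superstep t are delivered at the start of t + 1.
        while (!inbox.isEmpty()) {
            Map<Long, List<Long>> next = new HashMap<>();
            for (Map.Entry<Long, List<Long>> e : inbox.entrySet()) {
                long v = e.getKey();
                long best = value.get(v);
                for (long m : e.getValue()) best = Math.max(best, m);
                if (best > value.get(v)) {       // value changed: notify neighbors
                    value.put(v, best);
                    for (long n : graph.get(v))
                        next.computeIfAbsent(n, k -> new ArrayList<>()).add(best);
                }
            }
            inbox = next;                        // synchronization barrier
        }
        return value;
    }

    public static void main(String[] args) {
        Map<Long, List<Long>> g = new HashMap<>();
        g.put(1L, List.of(2L));
        g.put(2L, List.of(1L, 3L));
        g.put(3L, List.of(2L));
        System.out.println(run(g));  // every vertex converges to 3
    }
}
```

The per-superstep message delivery plus the barrier between supersteps is the essence of BSP; a real Giraph `Computation` adds distribution, halting votes and fault tolerance on top of the same loop.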
Other things
Aggregators
• Mechanism for global communication and global computation
• A global value calculated in superstep t is available in superstep t + 1
• Pre-defined (e.g. sum, max, min) or user-defined functions³
Combiners
• User-defined function³ applied to messages before they are sent or delivered
• Similar to Hadoop combiners
• Saves network bandwidth or memory
Checkpointing
• Store work to disk at user-defined intervals (disk isn't always evil)
• Restart on failure
³ The function has to be both commutative and associative
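As a sketch of why a combiner must be commutative and associative: all messages addressed to the same vertex can be folded into one, in any grouping order, before they cross the network. The class below is illustrative plain Java (`CombinerDemo` is an invented name, not a Giraph `MessageCombiner`), using summation as the combining function.

```java
import java.util.*;

// Illustrative sketch of a sum combiner: messages bound for the same
// destination vertex are folded into a single message, cutting traffic.
// This is only safe because addition is commutative and associative,
// so the order and grouping of the fold cannot change the result.
public class CombinerDemo {

    // Collapse each per-destination message list into one combined message.
    public static Map<Long, Double> combine(Map<Long, List<Double>> outbox) {
        Map<Long, Double> combined = new HashMap<>();
        for (Map.Entry<Long, List<Double>> e : outbox.entrySet()) {
            double sum = 0.0;
            for (double m : e.getValue()) sum += m;  // fold in any order
            combined.put(e.getKey(), sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        Map<Long, List<Double>> outbox = new HashMap<>();
        outbox.put(7L, List.of(0.5, 0.25, 0.25));  // three messages for vertex 7
        System.out.println(combine(outbox));       // one message: {7=1.0}
    }
}
```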
2 Let's start
LongLongNullTextInputFormat
org.apache.giraph.io.formats.LongLongNullTextInputFormat
If there is an edge from Node 1 to Node 2, then Node 2 appears in the neighbor list of Node 1:
<NODE1 ID> <SPACE> <NEIGHBOR1 ID> <SPACE> <NEIGHBOR2 ID> ...
<NODE2 ID> <SPACE> <NEIGHBOR1 ID> <SPACE> <NEIGHBOR2 ID> ...
...
IdWithValueTextOutputFormat
org.apache.giraph.io.formats.IdWithValueTextOutputFormat
For each node, print the Node ID and the Node Value:
<NODE1 ID> <TAB> <NODE1 VALUE>
<NODE2 ID> <TAB> <NODE2 VALUE>
...
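The two text formats above are enough to dry-run the Out-Degree exercise outside of Giraph: one adjacency line per vertex comes in, and one `<id><TAB><value>` line per vertex goes out, where the value is simply the neighbor count. The sketch below is plain, illustrative Java (`OutDegreeDemo` is an invented name); the real exercise would extend a Giraph computation class instead.

```java
import java.util.*;

// Illustrative sketch of the Out-Degree exercise done outside Giraph:
// parse LongLongNullTextInputFormat-style lines ("<id> <nbr> <nbr> ...")
// and emit IdWithValueTextOutputFormat-style lines ("<id>\t<value>").
public class OutDegreeDemo {

    // Out-degree of a vertex = number of neighbor ids on its input line.
    public static List<String> outDegrees(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String[] tok = line.trim().split("\\s+");
            long id = Long.parseLong(tok[0]);      // first token is the vertex id
            out.add(id + "\t" + (tok.length - 1)); // the rest are its neighbors
        }
        return out;
    }

    public static void main(String[] args) {
        // 1 -> {2, 3}, 2 -> {3}, 3 -> {} in the input format above
        for (String line : outDegrees(List.of("1 2 3", "2 3", "3")))
            System.out.println(line);
    }
}
```

In-degree, by contrast, cannot be read off a single line: each vertex must message its neighbors and count the messages it receives in the next superstep, which is exactly why the exercise is a good first Giraph job.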
3 Get our hands dirty
Google's PageRank⁴
• The success factor of Google's search engine
• A graph algorithm computing the "importance" of webpages
  ◦ Important pages have a lot of links from other important pages
  ◦ Look at the structure of the underlying network
• Ability to conduct web-scale graph processing
⁴ http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
Simple PageRank
• Recursive definition:

  PageRank_{i+1}(v) = (1 − d)/N + d · Σ_{u→v} PageRank_i(u) / O(u)

• Where:
  ◦ d: damping factor; the fraction of PageRank transferred to the neighbors, usually 0.85
  ◦ N: total number of pages
  ◦ O(u): out-degree of u; the number of links on page u
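The recursive definition above can be iterated directly on a small in-memory graph. The sketch below is plain, illustrative Java (`PageRankDemo` is an invented name, not the Giraph example class); it starts every vertex at 1/N (the fixed point is the same whichever uniform start is chosen) and, for simplicity, assumes every vertex has at least one out-link.

```java
import java.util.*;

// Illustrative plain-Java version of the Simple PageRank update:
//   PR_{i+1}(v) = (1 - d)/N + d * sum over in-neighbors u of PR_i(u)/O(u)
public class PageRankDemo {

    public static Map<Long, Double> pageRank(Map<Long, List<Long>> graph,
                                             double d, int iterations) {
        int n = graph.size();
        Map<Long, Double> rank = new HashMap<>();
        for (long v : graph.keySet()) rank.put(v, 1.0 / n);  // uniform start

        for (int i = 0; i < iterations; i++) {
            // Base term (1 - d)/N for every vertex.
            Map<Long, Double> next = new HashMap<>();
            for (long v : graph.keySet()) next.put(v, (1 - d) / n);
            // Each vertex u "messages" rank(u)/O(u) along its out-edges.
            // (Assumes O(u) > 0; dangling vertices are not handled here.)
            for (Map.Entry<Long, List<Long>> e : graph.entrySet()) {
                double share = rank.get(e.getKey()) / e.getValue().size();
                for (long v : e.getValue())
                    next.merge(v, d * share, Double::sum);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<Long, List<Long>> g = new HashMap<>();
        g.put(1L, List.of(2L));   // 1 -> 2
        g.put(2L, List.of(3L));   // 2 -> 3
        g.put(3L, List.of(1L));   // 3 -> 1
        // A 3-cycle is symmetric, so every page keeps rank 1/3.
        System.out.println(pageRank(g, 0.85, 20));
    }
}
```

Note how the inner loop mirrors the BSP model: the "messages" produced from `rank` in one pass are only consumed when building `next`, just as a Giraph superstep delivers messages at the start of the following superstep.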
Simple PageRank Example
[Figure: a three-vertex graph, each vertex's PageRank initialized to 1.0]
Thank you for your attention
Contact us for any questions or problems
Demo code
https://github.com/manuelcoppotelli/giraph-demo
Homework
https://github.com/manuelcoppotelli/giraph-homework