Hadoop and Graph Data Management:
   Challenges and Opportunities

                 Daniel Abadi
     Assistant Professor at Yale University
           Chief Scientist at Hadapt
            Tuesday, November 8th

      (Collaboration with Jiewen Huang)
Two Major Trends
 Hadoop
   Becoming de facto standard for large-scale data processing
   Becoming more than just MapReduce
   Ecosystem growing rapidly --- lots of great tools around it
 Graph data is proliferating; huge value if analyzed
 and processed
     Social graphs
     Linked data
     Telecom
     Advertising
     Defense
Use Hadoop to Process Graph Data?
 Of course!
 BUT:
   Little automatic optimization (non-declarative
    interface)
   It’s possible to do graphs with Hadoop:
     Really badly
     Suboptimally
     Really well
   Case in point: subgraph pattern-matching
   benchmark on three different Hadoop-centered
   solutions:
     ~1000s,
     ~100s,
     < ~10s
Case study: Linked Data
Example Linked Data
 Entities (or “resources”) are nodes in the graph
 Relationships between entities are labeled directed edges in the graph
 (Resources typically have unique URI identifiers to allow for global references --- not shown in the example to the right)
 Any resource can connect to any other resource
Graph Representation
 Linked data graph can be parsed into a series of vertex-edge-vertex triples
 First entity referred to as the subject; second as the object; edge connecting them as the predicate
 We will call these “RDF triples” (see the small example below)
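A minimal illustration in Python (the resource and predicate names below are made up to match the football example later in the deck, not taken from the figure): the parsed graph is just a collection of (subject, predicate, object) tuples.

# Hypothetical parsed linked-data graph: each tuple is one labeled,
# directed edge, subject --predicate--> object.
triples = [
    ("Lionel_Messi", "type",     "footballer"),
    ("Lionel_Messi", "position", "striker"),
    ("Lionel_Messi", "playsFor", "FC_Barcelona"),
]

for subject, predicate, obj in triples:
    print(f"{subject} --{predicate}--> {obj}")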
Querying Linked Data
 Linked Data is typically queried in SPARQL
 Processing a SPARQL query is basically
 equivalent to the subgraph pattern matching
 problem
Scalable SPARQL Processing
 Single-node options are abundant, e.g.,
     Sesame
     Jena
     RDF-3X
     3store
 Fewer options that can scale across multiple
 machines
   This is where Hadoop comes in!
     One cool solution: SHARD (presented by Kurt Rohloff at
      HadoopWorld 2010)
       Uses HDFS to store graph, MapReduce jobs for
        subgraph pattern matching
       Much nicer than a naïve Hadoop solution
Another Example: Twitter Data
[Figure: a small Twitter graph. @joe_hellerstein, @daniel_abadi, and @mikeolson are connected by "follows" edges; @hadoop_is_my_life, @super_hadooper, and @hadoop_is_the_answer are connected to them by "retweeted" edges]
Example Query over Twitter Graph

Who has retweeted both @daniel_abadi and @mikeolson?

[Query graph: an unknown user (???) with "retweeted" edges to both @daniel_abadi and @mikeolson]
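The pattern match is easy to picture as a self-join over the retweet triples. A minimal Python sketch (the edges below are illustrative; the original figure does not make every retweet edge explicit):

# Hypothetical retweet triples drawn from the Twitter graph above.
retweets = [
    ("@hadoop_is_my_life",    "retweeted", "@daniel_abadi"),
    ("@super_hadooper",       "retweeted", "@daniel_abadi"),
    ("@super_hadooper",       "retweeted", "@mikeolson"),
    ("@hadoop_is_the_answer", "retweeted", "@mikeolson"),
]

# Match the two-edge pattern: ?x --retweeted--> @daniel_abadi
#                        and  ?x --retweeted--> @mikeolson
retweeted_abadi = {s for s, p, o in retweets if p == "retweeted" and o == "@daniel_abadi"}
retweeted_olson = {s for s, p, o in retweets if p == "retweeted" and o == "@mikeolson"}
print(retweeted_abadi & retweeted_olson)   # e.g. {'@super_hadooper'}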
Issues With Hadoop Defaults
 Hadoop does not default to pipelined algorithms
 between MapReduce jobs
   SHARD algorithm matches each clause of SPARQL
   subgraph in a separate MapReduce job
      Needs a full barrier synchronization between jobs, which is unnecessary
 Hadoop does not default to pipelined algorithms
 between Map and Reduce
   Each job performs a join by vertex, where the object
    of one triple is joined with the subject of another
    triple
   Joins work much faster if you can pipeline between Map and Reduce (e.g., a pipelined hash join; a sketch follows)
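A minimal sketch of the pipelined hash join idea (plain Python, not Hadoop code): build a hash table on one input once, then stream the other input past it, emitting matches as they are found instead of waiting for all input to be materialized.

def pipelined_hash_join(build_triples, probe_triples):
    # Build phase: index the build side by its subject vertex.
    table = {}
    for s, p, o in build_triples:
        table.setdefault(s, []).append((s, p, o))
    # Probe phase: stream the probe side; join its object against the
    # build side's subject and yield matches immediately (pipelined).
    for s, p, o in probe_triples:
        for match in table.get(o, []):
            yield (s, p, o) + match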
Issues with Hadoop Defaults (cont.)
 Hadoop hash partitions data across nodes
   Data in the graph is randomly partitioned across nodes
   Data close in the graph could be on a physically
    distant node
   Therefore each join requires a complete
    redistribution of the entire data set across the
    network (and there are many joins)
 Hadoop defaults to replicating all data 3 times
  with no preferences for what is replicated or
  where it is replicated to
 Hadoop defaults to using the Hadoop Distributed
  File System (HDFS) for storage, which is not
  optimized for graph data
All is not lost! Don’t throw away Hadoop!
 All we have to do is change the defaults and add a little to it
System Architecture
Partitioning
 Graphs can be represented as vertex1-edge-
  vertex2 triples
 Hash partitioning by vertex1 is straightforward
 Great for star queries like the one below, where every clause shares the same subject (see the sketch after the query):

Query: Find the names of the strikers that play for FC Barcelona.

SELECT ?name
WHERE { ?player type        footballer      .
          ?player name      ?name           .
          ?player position striker          .
          ?player playsFor FC_Barcelona . }
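A minimal sketch of subject hash partitioning (illustrative, not the actual system's code): every clause in the query above binds the same subject ?player, so all the triples needed to answer it for any given player land on one machine and no data exchange is required.

NUM_MACHINES = 3  # assumption for illustration

def machine_for(triple):
    # Hash-partition by the first vertex (the subject) of the triple;
    # within a single run the same subject always maps to the same machine.
    subject, _predicate, _obj = triple
    return hash(subject) % NUM_MACHINES

# All three facts about the same player go to the same machine.
for t in [("Lionel_Messi", "type", "footballer"),
          ("Lionel_Messi", "position", "striker"),
          ("Lionel_Messi", "playsFor", "FC_Barcelona")]:
    print(machine_for(t), t)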
The problem with hash partitioning
 …

Find football players playing for clubs in a populous region where they were born.
Graph Partitioning
 Data close together in the graph should be
  physically close together (preferably on the same
  machine)
 Subgraph pattern matching can be done without requiring huge amounts of communication via joins across the cluster
Graph Partitioning

   [Figure: the example graph partitioned across Machine 1, Machine 2, and Machine 3]
Edge/Triple Placement
●   Minimizing data exchange
    ●   Allowing data overlap
●   N-hop guarantee
    ●   The extent of data overlap
    ●   If a vertex is assigned to a machine, any vertex within n hops of it is also stored on that machine (see the sketch below)
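A minimal sketch of how an n-hop guarantee could be materialized (an illustration of the idea, not the system's code): start from the vertices a machine owns after partitioning, walk out n hops, and store those vertices' triples as well, accepting the overlap.

from collections import defaultdict

def expand_n_hop(owned_vertices, edges, n):
    # edges: (vertex1, vertex2) pairs, treated as undirected for the guarantee.
    neighbors = defaultdict(set)
    for v1, v2 in edges:
        neighbors[v1].add(v2)
        neighbors[v2].add(v1)
    stored = set(owned_vertices)
    frontier = set(owned_vertices)
    for _ in range(n):
        frontier = {w for v in frontier for w in neighbors[v]} - stored
        stored |= frontier
    # Vertices whose triples this machine stores (owned plus n-hop overlap).
    return stored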
[Figures: the example partitioning under the n-hop guarantee, showing what Machine 1 and Machine 3 store at 0 hops, 1 hop, and 2 hops]
High Degree Vertexes
●   Problem: High-degree vertexes make the
    graph well-connected and difficult to
    partition
●   Solution: Ignore them during graph
    partitioning

●   Problem: High-degree vertexes cause data explosion with an n-hop guarantee
●   Solution: Selectively weaken the n-hop guarantee (a sketch of the idea follows)
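A minimal sketch of the first fix, with an assumed degree cutoff (the real threshold is a tuning knob, not given in the talk): drop edges touching high-degree vertexes before partitioning, so hubs neither glue the partitions together nor blow up the n-hop overlap.

from collections import Counter

DEGREE_CUTOFF = 1000  # assumption for illustration only

def low_degree_edges(edges):
    degree = Counter()
    for v1, v2 in edges:
        degree[v1] += 1
        degree[v2] += 1
    # Keep only edges whose endpoints are both below the cutoff; the dropped
    # high-degree vertexes are handled separately (e.g. with a weakened guarantee).
    return [(v1, v2) for v1, v2 in edges
            if degree[v1] <= DEGREE_CUTOFF and degree[v2] <= DEGREE_CUTOFF]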
Query Processing
●   Query execution is more efficient if pushed to optimized storage (RDF-stores)
    ●   Minimizing the number of Hadoop jobs
    ●   The larger the hop guarantee, the more work is done in RDF-stores
To Exchange, or not to Exchange?
  ●   Given a query and n-hop guarantee, is
      data exchange (Hadoop job) between
      nodes needed?
      ●   Choose the “center” of the query graph
      ●   Calculate the distance from the “center” to
          the furthest edge
    ●   If the distance > n, data exchange is needed; otherwise it is not (see the sketch below)
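A minimal sketch of that decision rule (illustrative only; it measures the distance from the chosen center to the furthest query vertex with a BFS):

from collections import defaultdict, deque

def needs_data_exchange(query_edges, center, n):
    adj = defaultdict(set)
    for v1, v2 in query_edges:
        adj[v1].add(v2)
        adj[v2].add(v1)
    # BFS from the center of the query graph.
    dist = {center: 0}
    queue = deque([center])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    radius = max(dist.values())  # distance to the furthest query vertex
    return radius > n            # True => a Hadoop job (data exchange) is needed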
Data Exchange

Find football players playing for clubs in a populous region where they were born.
Experimental Setup
●   20-machine cluster
●   Lehigh University Benchmark (LUBM): 270 million triples
●   Things to compare:
    ●   Single-node RDF-3X
    ●   SHARD
    ●   Graph partitioning (the proposed system)
    ●   Hash partitioning on subjects
Performance Comparison
Speedup
●   Better than linear speedup
Analysis
 Difference between fastest implementation and
 slowest implementation was a factor of 1340!
   Using Hadoop does not mean that performance is
   fixed
 More improvements are possible
   Experiments used MapReduce whenever data
   communication was necessary
     NextGen Hadoop allows other programming paradigms
     besides MR
      MPI is a good candidate

   Still need to fix the data pipelining problem
 Factor of 1340 possible via focusing on storage --- similar in theme to Hadapt
How this fits with Hadapt

   [Architecture diagram: Flexible Query Interface (Full SQL Support, MR, JDBC) on top of Split Query Execution (Patent Pending) and Hadoop, on top of the Hadapt Storage Engine (Relational + HDFS)]

 Full SQL interface, MapReduce, and JDBC Connector
 10x-50x faster than Hadoop and Hive
   Queries go from hours to minutes, and minutes to seconds
 Analytics across structured and unstructured data in one platform
 3.5 Patents Pending
 $9.5M Series A financing, led by Norwest Venture Partners and Bessemer Venture Partners
Optimized Storage Matters
 HDFS appropriate for unstructured data
 Relational storage appropriate for relational data
 Graph storage appropriate for graph data
 Hadapt allows for pluggable storage inside
 Hadoop (amongst other things)




 Bottom line: Hadoop can be used for scalable
 graph processing, but it might need some
 Hadapting ;)
