As Hadoop rapidly becomes the universal standard for scalable data analysis and processing, it becomes increasingly important to understand its strengths and weaknesses for particular application scenarios in order to avoid inefficiency pitfalls. For example, Hadoop has great potential to perform scalable graph analysis, but recent research from Yale University has shown that a conventional approach to graph analysis is 1300 times less efficient than a more advanced approach. This session will give an overview of the advanced approach and then discuss further changes that are needed in the core Hadoop framework to take performance to the next level.
Hadoop World 2011: Hadoop and Graph Data Management: Challenges and Opportunities - Daniel Abadi - Yale University & Hadapt
1. Hadoop and Graph Data Management: Challenges and Opportunities
Daniel Abadi
Assistant Professor at Yale University
Chief Scientist at Hadapt
Tuesday, November 8th
(Collaboration with Jiewen Huang)
2. Two Major Trends
Hadoop
Becoming the de facto standard for large-scale data processing
Becoming more than just MapReduce
Ecosystem growing rapidly --- lots of great tools around it
Graph data is proliferating; huge value if analyzed and processed
Social graphs
Linked data
Telecom
Advertising
Defense
3. Use Hadoop to Process Graph Data?
Of course!
BUT:
Little automatic optimization (non-declarative interface)
It’s possible to do graphs with Hadoop:
Really badly
Suboptimally
Really well
Case in point: a subgraph pattern-matching benchmark on three different Hadoop-centered solutions: ~1000s, ~100s, and under ~10s
5. Example Linked Data
Entities (or “resources”) are nodes in the graph
Relationships between entities are labeled directed edges in the graph
(Resources typically have unique URI identifiers to allow for global references --- not shown in the example to the right)
Any resource can connect to any other resource
6. Graph Representation
A linked data graph can be parsed into a series of vertex-edge-vertex triples
The first entity is referred to as the subject; the second as the object; the edge connecting them as the predicate
We will call these “RDF triples”
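For example, the Twitter graph shown on a later slide yields triples such as the following (handles used informally as identifiers, in the same style as the SPARQL examples later in this deck):

  @daniel_abadi follows @mikeolson .        (subject: @daniel_abadi, predicate: follows, object: @mikeolson)
  @super_hadooper retweeted @mikeolson .    (subject: @super_hadooper, predicate: retweeted, object: @mikeolson)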
7. Querying Linked Data
Linked Data is typically queried in SPARQL
Processing a SPARQL query is basically equivalent to the subgraph pattern matching problem
8. Scalable SPARQL Processing
Single-node options are abundant, e.g.,
Sesame
Jena
RDF-3X
3store
Fewer options that can scale across multiple machines
This is where Hadoop comes in!
One cool solution: SHARD (presented by Kurt Rohloff at Hadoop World 2010)
Uses HDFS to store the graph and MapReduce jobs for subgraph pattern matching
Much nicer than a naïve Hadoop solution
9. Another Example: Twitter Data
[Diagram: a Twitter graph in which @joe_hellerstein, @daniel_abadi, and @mikeolson are connected by “follows” edges, and @hadoop_is_my_life, @super_hadooper, and @hadoop_is_the_answer are connected to them by “retweeted” edges]
10. Example Query over Twitter Graph
Who has retweeted both @daniel_abadi and @mikeolson?
[Query graph: an unknown vertex ??? with “retweeted” edges to both @daniel_abadi and @mikeolson]
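In the informal SPARQL style used later in this deck, the query might be written as the following sketch (the retweeted predicate and handle identifiers are assumed from the diagram):

SELECT ?user
WHERE { ?user retweeted @daniel_abadi .
        ?user retweeted @mikeolson . }

Matching both triple patterns at once is exactly the subgraph pattern-matching problem from the previous slides.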
13. Issues With Hadoop Defaults
Hadoop does not default to pipelined algorithms between MapReduce jobs
The SHARD algorithm matches each clause of the SPARQL subgraph in a separate MapReduce job
This requires full barrier synchronization between jobs, which is unnecessary
Hadoop does not default to pipelined algorithms between Map and Reduce
Each job performs a join by vertex, where the object of one triple is joined with the subject of another triple (see the sketch below)
Joins work much faster if you can pipeline between Map and Reduce (e.g. a pipelined hash join)
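A minimal single-process sketch of the join-by-vertex step, assuming an illustrative Triple record type; this is not SHARD’s actual code:

  import java.util.*;

  class TripleJoinSketch {
      record Triple(String subj, String pred, String obj) {}

      // Hash join by vertex: build a table on the object of the first
      // triple pattern, then probe it with the subject of the second.
      static List<String[]> join(List<Triple> left, List<Triple> right) {
          Map<String, List<Triple>> built = new HashMap<>();
          for (Triple t : left)
              built.computeIfAbsent(t.obj(), k -> new ArrayList<>()).add(t);
          List<String[]> out = new ArrayList<>();
          for (Triple r : right)
              for (Triple l : built.getOrDefault(r.subj(), List.of()))
                  out.add(new String[] { l.subj(), l.obj(), r.obj() });
          return out;
      }

      public static void main(String[] args) {
          List<Triple> follows = List.of(new Triple("a", "follows", "b"));
          List<Triple> retweets = List.of(new Triple("b", "retweeted", "c"));
          System.out.println(Arrays.toString(join(follows, retweets).get(0)));
      }
  }

In Hadoop’s default model the build side must be fully materialized before probing starts; a pipelined hash join would begin probing as build tuples stream in.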
14. Issues with Hadoop Defaults (cont.)
Hadoop hash partitions data across nodes
Data in the graph is randomly partitioned across nodes
Data close in the graph could be on a physically distant node
Therefore each join requires a complete redistribution of the entire data set across the network (and there are many joins)
Hadoop defaults to replicating all data 3 times, with no preferences for what is replicated or where it is replicated to
Hadoop defaults to using the Hadoop Distributed File System (HDFS) for storage, which is not optimized for graph data
15. All is not lost! Don’t throw away Hadoop!
All we have to do is change the defaults and add to it a little
17. Partitioning
Graphs can be represented as vertex1-edge-vertex2 triples
Hash partitioning by vertex1 is straightforward
Great for queries like:
Query: Find the names of the strikers that play for FC Barcelona.
SELECT ?name
WHERE { ?player type footballer .
?player name ?name .
?player position striker .
?player playsFor FC_Barcelona . }
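Since every triple pattern in this query shares the subject ?player, all the triples needed for a match land on the same machine and no data crosses the network. A minimal sketch of subject-based hash partitioning (names are illustrative):

  class PartitionSketch {
      // Assign each triple to a machine by hashing its subject (vertex1),
      // so all triples with the same subject are stored together.
      static int machineFor(String subject, int numMachines) {
          return Math.floorMod(subject.hashCode(), numMachines);
      }
  }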
18. The problem with hash partitioning…
Find football players playing for clubs in a populous region where they were born.
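In SPARQL this might look like the following sketch (all predicate names here are assumed for illustration):

SELECT ?player
WHERE { ?player type footballer .
        ?player playsFor ?club .
        ?club region ?region .
        ?player bornIn ?region .
        ?region population ?pop .
        FILTER (?pop > 2000000) }

Unlike the previous query, the patterns join on three different vertices (?player, ?club, ?region), so under subject-hash partitioning the matching triples are scattered across machines and every join crosses the network.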
19. Graph Partitioning
Data close together in the graph should be physically close together (preferably on the same machine)
Subgraph pattern matching can be done without requiring huge amounts of communication via joins across the cluster
22. Edge/Triple Placement
● Minimizing data exchange by allowing data overlap
● The n-hop guarantee determines the extent of data overlap
● If a vertex is assigned to a machine, any vertex that is within n hops of this vertex is also stored on this machine (see the sketch below)
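A minimal sketch of materializing an n-hop guarantee for one machine, assuming an adjacency-list view of the graph (illustrative, not the actual implementation):

  import java.util.*;

  class NHopSketch {
      // Expand a machine's assigned vertices outward n BFS levels;
      // every triple touching the result set is replicated to this machine.
      static Set<String> nHopSet(Set<String> assigned,
                                 Map<String, List<String>> adjacency, int n) {
          Set<String> stored = new HashSet<>(assigned);
          Set<String> frontier = new HashSet<>(assigned);
          for (int hop = 0; hop < n; hop++) {
              Set<String> next = new HashSet<>();
              for (String v : frontier)
                  for (String neighbor : adjacency.getOrDefault(v, List.of()))
                      if (stored.add(neighbor)) next.add(neighbor);
              frontier = next;
          }
          return stored;
      }

      public static void main(String[] args) {
          Map<String, List<String>> adj = Map.of("a", List.of("b"), "b", List.of("c"));
          System.out.println(nHopSet(Set.of("a"), adj, 2)); // a, b, and c
      }
  }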
23.–28. 0, 1, and 2 Hops of Machine 1 and Machine 3
[Diagrams: six slides shading the vertices within 0, 1, and 2 hops of the vertices assigned to Machine 1 (slides 23–25) and Machine 3 (slides 26–28), across Machines 1, 2, and 3]
29. High-Degree Vertexes
● Problem: High-degree vertexes make the graph well-connected and difficult to partition
● Solution: Ignore them during graph partitioning (see the sketch below)
● Problem: High-degree vertexes cause data explosion with an n-hop guarantee
● Solution: Selectively weaken the n-hop guarantee
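A sketch of the first solution, assuming an illustrative degree threshold (not a value from the talk):

  import java.util.*;

  class DegreeFilterSketch {
      // Exclude vertices whose degree exceeds a threshold from the
      // partitioner's input; their triples can be placed by hash instead.
      static Set<String> partitionableVertices(Map<String, List<String>> adjacency,
                                               int maxDegree) {
          Set<String> keep = new HashSet<>();
          for (Map.Entry<String, List<String>> e : adjacency.entrySet())
              if (e.getValue().size() <= maxDegree) keep.add(e.getKey());
          return keep;
      }
  }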
30. Query Processing
● Query execution is more efficient if pushed to optimized storage (RDF-stores)
● Minimizing the number of Hadoop jobs
● The larger the hop guarantee, the more work is done in RDF-stores
31. To Exchange, or not to Exchange?
● Given a query and an n-hop guarantee, is data exchange (a Hadoop job) between nodes needed?
● Choose the “center” of the query graph
● Calculate the distance from the “center” to the furthest edge
● If distance > n, data exchange is needed; otherwise it is not (see the sketch below)
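A minimal sketch of that decision, treating the query as a small graph over its variables and constants (names are illustrative):

  import java.util.*;

  class ExchangeSketch {
      // Compare the query graph's radius from a chosen center against the
      // n-hop guarantee; the distance to the furthest vertex stands in for
      // the distance to the furthest edge here.
      static boolean needsExchange(Map<String, List<String>> queryGraph,
                                   String center, int nHopGuarantee) {
          Map<String, Integer> dist = new HashMap<>();
          Deque<String> queue = new ArrayDeque<>();
          dist.put(center, 0);
          queue.add(center);
          int furthest = 0;
          while (!queue.isEmpty()) {
              String v = queue.poll();
              for (String nb : queryGraph.getOrDefault(v, List.of()))
                  if (!dist.containsKey(nb)) {
                      dist.put(nb, dist.get(v) + 1);
                      furthest = Math.max(furthest, dist.get(nb));
                      queue.add(nb);
                  }
          }
          return furthest > nHopGuarantee;
      }

      public static void main(String[] args) {
          Map<String, List<String>> q = Map.of(
              "?player", List.of("?club"),
              "?club", List.of("?player", "?region"),
              "?region", List.of("?club"));
          System.out.println(needsExchange(q, "?player", 1)); // true: radius 2 > 1
      }
  }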
36. Analysis
The difference between the fastest implementation and the slowest implementation was a factor of 1340!
Using Hadoop does not mean that performance is fixed
More improvements are possible
Experiments used MapReduce whenever data communication was necessary
NextGen Hadoop allows other programming paradigms besides MR
MPI is a good candidate
Still need to fix the data pipelining problem
The factor of 1340 was possible via focusing on storage --- similar in theme to Hadapt
37. How this fits with Hadapt
Flexible Query Interface (Full SQL Support, MR, JDBC): full SQL interface, MapReduce, and JDBC connector
Split Query Execution (Patent Pending): 10x-50x faster than Hadoop and Hive; queries go from hours to minutes, and minutes to seconds
Hadapt Storage Engine (Relational + HDFS): analytics across structured and unstructured data in one platform
3.5 Patents Pending
$9.5M Series A financing, led by Norwest Venture Partners and Bessemer Venture Partners
38. Optimized Storage Matters
HDFS appropriate for unstructured data
Relational storage appropriate for relational data
Graph storage appropriate for graph data
Hadapt allows for pluggable storage inside Hadoop (amongst other things)
Bottom line: Hadoop can be used for scalable graph processing, but it might need some Hadapting ;)