Hadoop and Graph Data Management: Challenges and Opportunities


Published on

HadoopWorld 2011 Presentation by Daniel Abadi

As Hadoop rapidly becomes the universal standard for scalable data analysis and processing, it is increasingly important to understand its strengths and weaknesses for particular application scenarios in order to avoid inefficiency pitfalls. For example, Hadoop has great potential to perform scalable graph analysis if it is used correctly. Recent benchmarking has shown that simple implementations can be 1300 times less efficient than a more optimal Hadoop-centered implementation. In this talk, Daniel Abadi gives an overview of a recent research project at Yale University that investigates how to perform sub-graph pattern matching within a Hadoop-centered system that is three orders of magnitude faster than a more simple approach. This talk highlights how the cleaning, transforming, and parallel processing strengths of Hadoop are combined with storage optimized for graph data analysis. It then discusses further changes that are needed in the core Hadoop framework to take performance to the next level.

Published in: Technology, Education

Hadoop and Graph Data Management: Challenges and Opportunities

  1. 1. Hadoop and Graph Data Management: Challenges and Opportunities Daniel Abadi Assistant Professor at Yale University Chief Scientist at Hadapt Tuesday, November 8th (Collaboration with Jiewen Huang)
  2. 2. Two Major Trends Hadoop  Becoming de facto standard for large scale data processing  Becoming more than just MapReduce  Ecosystem growing rapidly --- lot’s of great tools around it Graph data is proliferating; huge value if analyzed and processed  Social graphs  Linked data  Telecom  Advertising  Defense
  3. 3. Use Hadoop to Process GraphData? Of course! BUT:  Little automatic optimization (non-declarative interface)  It’s possible to do graphs with Hadoop:  Really badly  Suboptimally  Really well  Case in point: subgraph pattern-matching benchmark on three different Hadoop-centered solutions:  ~1000s,  ~100s,  < ~10s
  4. 4. Case study: Linked Data
  5. 5. Example Linked Data Entities (or “resources”) are nodes in the graph Relationships between entities are labeled directed edges in the graph (Resources typically have unique URI identifies to allow for global references --- not shown in example to the right) Any resource can connect to any other resource
  6. 6. Graph Representation Linked data graph can be parsed into a series of vertex-entity- vertex triples First entity referred to as the subject; second as the object; edge connecting them as the predicate We will call these “RDF triples”
  7. 7. Querying Linked Data Linked Data is typically queried in SPARQL Processing a SPARQL query is basically equivalent to the subgraph pattern matching problem
  8. 8. Scalable SPARQL Processing Single-node options are abundant, e.g.,  Sesame  Jena  RDF-3X  3store Fewer options that can scale across multiple machines  This is where Hadoop comes in!  One cool solution: SHARD (presented by Kurt Rohloff at HadoopWorld 2010)  Uses HDFS to store graph, MapReduce jobs for subgraph pattern matching  Much nicer than a naïve Hadoop solution
  9. 9. Another Example: Twitter Data @joe_hellerst ein follow follow s s follow s @daniel_ab @mikeolso adi n follow s retweeteretweeted retweeted retweete retweete retweete d d d d @hadoop_is_my_life @super_hadooper @hadoop_is_the answer
  10. 10. Example Query over Twitter GraphWho has retweeted both @daniel_abadi and@mikeolson? @daniel_aba @mikeolson di retweeted retweeted ???
  11. 11. Issues With Hadoop Defaults Hadoop does not default to pipelined algorithms between MapReduce jobs  SHARD algorithm matches each clause of SPARQL subgraph in a separate MapReduce job  Need full barrier synchronization between jobs which is unnecessary Hadoop does not default to pipelined algorithms between Map and Reduce  Each job performs a join by vertex, where the object of one triple is joined with the subject of another triple  Joins work much faster if you can pipeline between Map and Reduce (e.g. pipelined hash join)
  12. 12. Issues with Hadoop Defaults(cont.) Hadoop hash partitions data across nodes  Data in graph is randomly partitioned across nodes  Data close in the graph could be on a physically distant node  Therefore each join requires a complete redistribution of the entire data set across the network (and there are many joins) Hadoop defaults to replicating all data 3 times with no preferences for what is replicated or where it is replicated to Hadoop defaults to using the Hadoop Distributed File System (HDFS) for storage, which is not optimized for graph data
  13. 13. All is not lost! Don’t throw awayHadoop! All we have to do is change the defaults and add to it a little
  14. 14. System Architecture
  15. 15. Partitioning Graphs can be represented as vertex1-edge- vertex2 triples Hash partitioning by vertex1 is straightforward Great for queries like:Query: Find the names of the strikers that play for FC Barcelona.SELECT ?nameWHERE { ?player type footballer . ?player name ?name . ?player position striker . ?player playsFor FC_Barcelona . }
  16. 16. The problem with hash partitioning …Find football players playing for clubs in a populous region where he was born.
  17. 17. Graph Partitioning Data close together in the graph should be physically close together (preferably on the same machine) Subgraph pattern matching can be done without require huge amounts of communication via joins across the cluster
  18. 18. Graph Partitioning
  19. 19. Graph Partitioning Machine 1 Machine 2 Machine 3
  20. 20. Edge/Triple Placement● Minimizing data exchange ● Allowing data overlap● N-hop guarantee ● The extent of data overlap ● If a vertex is assigned to a machine, any vertex that is within n-hop of this vertex is also stored in this machine
  21. 21. 0 Hop of Machine 1 Machine 1 Machine 2 Machine 3
  22. 22. 1 Hop of Machine 1 Machine 1 Machine 2 Machine 3
  23. 23. 2 Hop of Machine 1 Machine 1 Machine 2 Machine 3
  24. 24. 0 Hop of Machine 3 Machine 1 Machine 2 Machine 3
  25. 25. 1 Hop of Machine 3 Machine 1 Machine 2 Machine 3
  26. 26. 2 Hop of Machine 3 Machine 1 Machine 2 Machine 3
  27. 27. High Degree Vertexes● Problem: High-degree vertexes make the graph well-connected and difficult to partition● Solution: Ignore them during graph partitioning● Problem: High-degree vertexes cause data explosion with a n-hop guarantee● Solution: Selectively weaken the n-hop guarantee
  28. 28. Query Processing ● Query execution is more efficient if pushed to optimized storage (RDF- stores) ● Minimizing the number of Hadoop jobs ● The larger the hop guarantee, the more work is done in RDF-stores
  29. 29. To Exchange, or not to Exchange? ● Given a query and n-hop guarantee, is data exchange (Hadoop job) between nodes needed? ● Choose the “center” of the query graph ● Calculate the distance from the “center” to the furthest edge ● If distance > n, data exchange is needed; not needed otherwise
  30. 30. Data ExchangeFind football players playing for clubs in a populous region where he was born.
  31. 31. Experimental Setup● 20-machine cluster● Leigh University Benchmark (LUBM): 270 million triples● Things to compare: ● Single-node RDF-3X ● SHARD ● Graph partitioning (the proposed system) ● Hash partitioning on subjects
  32. 32. Performance Comparison
  33. 33. Speedup● Better than linear speedup
  34. 34. Analysis Difference between fastest implementation and slowest implementation was a factor of 1340!  Using Hadoop does not mean that performance is fixed More improvements are possible  Experiments used MapReduce whenever data communication was necessary  NextGen Hadoop allows other programming paradigms besides MR  MPI is a good candidate  Still need to fix the data pipelining problem Factor of 1340 possible via focusing on storage -- - similar in theme to Hadapt
  35. 35. How this fits with Hadapt  Full SQL interface, Map Reduce, and JDBC Connector Flexible Query Interface (Full SQL Support, MR, JDBC)  10x-50x faster than Hadoop and Hive  Queries go from hours to minutes, Split Query and minutes to seconds Hadoop Execution (Patent Pending)  Analytics across structured and unstructured data in one platform Hadapt Storage Engine (Relational + HDFS)  3.5 Patents Pending  $9.5 Series A financing, lead by Norwest Venture Partners and Bessemer Venture Partners
  36. 36. Optimized Storage Matters HDFS appropriate for unstructured data Relational storage appropriate for relational data Graph storage appropriate for graph data Hadapt allows for pluggable storage inside Hadoop (amongst other things) Bottom line: Hadoop can be used for scalable graph processing, but it might need some Hadapting ;)