Hadoop and Graph Data Management:
   Challenges and Opportunities

                 Daniel Abadi
     Assistant Professor at Yale University
           Chief Scientist at Hadapt
            Tuesday, November 8th

      (Collaboration with Jiewen Huang)
Two Major Trends
 Hadoop
   Becoming de facto standard for large-scale data processing
   Becoming more than just MapReduce
   Ecosystem growing rapidly --- lots of great tools around it
 Graph data is proliferating; huge value if analyzed
 and processed
     Social graphs
     Linked data
     Telecom
     Advertising
     Defense
Use Hadoop to Process Graph Data?
 Of course!
 BUT:
   Little automatic optimization (non-declarative
    interface)
   It’s possible to do graphs with Hadoop:
     Really badly
     Suboptimally
     Really well
   Case in point: subgraph pattern-matching
   benchmark on three different Hadoop-centered
   solutions:
     ~1000s,
     ~100s,
     < ~10s
Case study: Linked Data
Example Linked Data
 Entities (or “resources”) are nodes in the graph
 Relationships between entities are labeled directed edges in the graph
 (Resources typically have unique URI identifiers to allow for global references --- not shown in the example to the right)
 Any resource can connect to any other resource
Graph Representation
 Linked data graph can be parsed into a series of vertex-edge-vertex triples
 First entity referred to as the subject; second as the object; edge connecting them as the predicate
 We will call these “RDF triples” (see the small example below)
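A minimal illustration in Python (the resource and predicate names below are made up to match the football example later in the deck, not taken from the figure): the parsed graph is just a collection of (subject, predicate, object) tuples.

# Hypothetical parsed linked-data graph: each tuple is one labeled,
# directed edge, subject --predicate--> object.
triples = [
    ("Lionel_Messi", "type",     "footballer"),
    ("Lionel_Messi", "position", "striker"),
    ("Lionel_Messi", "playsFor", "FC_Barcelona"),
]

for subject, predicate, obj in triples:
    print(f"{subject} --{predicate}--> {obj}")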
Querying Linked Data
 Linked Data is typically queried in SPARQL
 Processing a SPARQL query is basically
 equivalent to the subgraph pattern matching
 problem
Scalable SPARQL Processing
 Single-node options are abundant, e.g.,
     Sesame
     Jena
     RDF-3X
     3store
 Fewer options that can scale across multiple
 machines
   This is where Hadoop comes in!
     One cool solution: SHARD (presented by Kurt Rohloff at
      HadoopWorld 2010)
       Uses HDFS to store graph, MapReduce jobs for
        subgraph pattern matching
       Much nicer than a naïve Hadoop solution
Another Example: Twitter Data
[Figure: a small Twitter graph. @joe_hellerstein, @daniel_abadi, and @mikeolson are connected by "follows" edges; @hadoop_is_my_life, @super_hadooper, and @hadoop_is_the_answer are connected to them by "retweeted" edges]
Example Query over Twitter Graph

Who has retweeted both @daniel_abadi and @mikeolson?

[Query graph: an unknown user (???) with "retweeted" edges to both @daniel_abadi and @mikeolson]
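The pattern match is easy to picture as a self-join over the retweet triples. A minimal Python sketch (the edges below are illustrative; the original figure does not make every retweet edge explicit):

# Hypothetical retweet triples drawn from the Twitter graph above.
retweets = [
    ("@hadoop_is_my_life",    "retweeted", "@daniel_abadi"),
    ("@super_hadooper",       "retweeted", "@daniel_abadi"),
    ("@super_hadooper",       "retweeted", "@mikeolson"),
    ("@hadoop_is_the_answer", "retweeted", "@mikeolson"),
]

# Match the two-edge pattern: ?x --retweeted--> @daniel_abadi
#                        and  ?x --retweeted--> @mikeolson
retweeted_abadi = {s for s, p, o in retweets if p == "retweeted" and o == "@daniel_abadi"}
retweeted_olson = {s for s, p, o in retweets if p == "retweeted" and o == "@mikeolson"}
print(retweeted_abadi & retweeted_olson)   # e.g. {'@super_hadooper'}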
Issues With Hadoop Defaults
 Hadoop does not default to pipelined algorithms
 between MapReduce jobs
   SHARD algorithm matches each clause of SPARQL
   subgraph in a separate MapReduce job
      Needs a full barrier synchronization between jobs, which is unnecessary
 Hadoop does not default to pipelined algorithms
 between Map and Reduce
   Each job performs a join by vertex, where the object
    of one triple is joined with the subject of another
    triple
   Joins work much faster if you can pipeline between Map and Reduce (e.g., a pipelined hash join; a sketch follows)
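A minimal sketch of the pipelined hash join idea (plain Python, not Hadoop code): build a hash table on one input once, then stream the other input past it, emitting matches as they are found instead of waiting for all input to be materialized.

def pipelined_hash_join(build_triples, probe_triples):
    # Build phase: index the build side by its subject vertex.
    table = {}
    for s, p, o in build_triples:
        table.setdefault(s, []).append((s, p, o))
    # Probe phase: stream the probe side; join its object against the
    # build side's subject and yield matches immediately (pipelined).
    for s, p, o in probe_triples:
        for match in table.get(o, []):
            yield (s, p, o) + match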
Issues with Hadoop Defaults (cont.)
 Hadoop hash partitions data across nodes
   Data in the graph is randomly partitioned across nodes
   Data close in the graph could be on a physically
    distant node
   Therefore each join requires a complete
    redistribution of the entire data set across the
    network (and there are many joins)
 Hadoop defaults to replicating all data 3 times
  with no preferences for what is replicated or
  where it is replicated to
 Hadoop defaults to using the Hadoop Distributed
  File System (HDFS) for storage, which is not
  optimized for graph data
All is not lost! Don’t throw away Hadoop!
 All we have to do is change the defaults and add a little to it
System Architecture
Partitioning
 Graphs can be represented as vertex1-edge-
  vertex2 triples
 Hash partitioning by vertex1 is straightforward
 Great for star queries like the one below, where every clause shares the same subject (see the sketch after the query):

Query: Find the names of the strikers that play for FC Barcelona.

SELECT ?name
WHERE { ?player type        footballer      .
          ?player name      ?name           .
          ?player position striker          .
          ?player playsFor FC_Barcelona . }
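A minimal sketch of subject hash partitioning (illustrative, not the actual system's code): every clause in the query above binds the same subject ?player, so all the triples needed to answer it for any given player land on one machine and no data exchange is required.

NUM_MACHINES = 3  # assumption for illustration

def machine_for(triple):
    # Hash-partition by the first vertex (the subject) of the triple;
    # within a single run the same subject always maps to the same machine.
    subject, _predicate, _obj = triple
    return hash(subject) % NUM_MACHINES

# All three facts about the same player go to the same machine.
for t in [("Lionel_Messi", "type", "footballer"),
          ("Lionel_Messi", "position", "striker"),
          ("Lionel_Messi", "playsFor", "FC_Barcelona")]:
    print(machine_for(t), t)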
The problem with hash partitioning
 …

Find football players playing for clubs in a populous region where they were born.
Graph Partitioning
 Data close together in the graph should be
  physically close together (preferably on the same
  machine)
 Subgraph pattern matching can be done without requiring huge amounts of communication via joins across the cluster
Graph Partitioning

   [Figure: the example graph partitioned across Machine 1, Machine 2, and Machine 3]
Edge/Triple Placement
●   Minimizing data exchange
    ●   Allowing data overlap
●   N-hop guarantee
    ●   The extent of data overlap
    ●   If a vertex is assigned to a machine, any vertex within n hops of it is also stored on that machine (see the sketch below)
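A minimal sketch of how an n-hop guarantee could be materialized (an illustration of the idea, not the system's code): start from the vertices a machine owns after partitioning, walk out n hops, and store those vertices' triples as well, accepting the overlap.

from collections import defaultdict

def expand_n_hop(owned_vertices, edges, n):
    # edges: (vertex1, vertex2) pairs, treated as undirected for the guarantee.
    neighbors = defaultdict(set)
    for v1, v2 in edges:
        neighbors[v1].add(v2)
        neighbors[v2].add(v1)
    stored = set(owned_vertices)
    frontier = set(owned_vertices)
    for _ in range(n):
        frontier = {w for v in frontier for w in neighbors[v]} - stored
        stored |= frontier
    # Vertices whose triples this machine stores (owned plus n-hop overlap).
    return stored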
[Figures: the example partitioning under the n-hop guarantee, showing what Machine 1 and Machine 3 store at 0 hops, 1 hop, and 2 hops]
High Degree Vertexes
●   Problem: High-degree vertexes make the
    graph well-connected and difficult to
    partition
●   Solution: Ignore them during graph
    partitioning

●   Problem: High-degree vertexes cause data explosion with an n-hop guarantee
●   Solution: Selectively weaken the n-hop guarantee (a sketch of the idea follows)
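A minimal sketch of the first fix, with an assumed degree cutoff (the real threshold is a tuning knob, not given in the talk): drop edges touching high-degree vertexes before partitioning, so hubs neither glue the partitions together nor blow up the n-hop overlap.

from collections import Counter

DEGREE_CUTOFF = 1000  # assumption for illustration only

def low_degree_edges(edges):
    degree = Counter()
    for v1, v2 in edges:
        degree[v1] += 1
        degree[v2] += 1
    # Keep only edges whose endpoints are both below the cutoff; the dropped
    # high-degree vertexes are handled separately (e.g. with a weakened guarantee).
    return [(v1, v2) for v1, v2 in edges
            if degree[v1] <= DEGREE_CUTOFF and degree[v2] <= DEGREE_CUTOFF]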
Query Processing
●   Query execution is more efficient if pushed to optimized storage (RDF-stores)
    ●   Minimizing the number of Hadoop jobs
    ●   The larger the hop guarantee, the more work is done in RDF-stores
To Exchange, or not to Exchange?
  ●   Given a query and n-hop guarantee, is
      data exchange (Hadoop job) between
      nodes needed?
      ●   Choose the “center” of the query graph
      ●   Calculate the distance from the “center” to
          the furthest edge
    ●   If the distance > n, data exchange is needed; otherwise it is not (see the sketch below)
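A minimal sketch of that decision rule (illustrative only; it measures the distance from the chosen center to the furthest query vertex with a BFS):

from collections import defaultdict, deque

def needs_data_exchange(query_edges, center, n):
    adj = defaultdict(set)
    for v1, v2 in query_edges:
        adj[v1].add(v2)
        adj[v2].add(v1)
    # BFS from the center of the query graph.
    dist = {center: 0}
    queue = deque([center])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    radius = max(dist.values())  # distance to the furthest query vertex
    return radius > n            # True => a Hadoop job (data exchange) is needed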
Data Exchange

Find football players playing for clubs in a populous region where they were born.
Experimental Setup
●   20-machine cluster
●   Lehigh University Benchmark (LUBM): 270 million triples
●   Things to compare:
    ●   Single-node RDF-3X
    ●   SHARD
    ●   Graph partitioning (the proposed system)
    ●   Hash partitioning on subjects
Performance Comparison
Speedup
●   Better than linear speedup
Analysis
 Difference between fastest implementation and
 slowest implementation was a factor of 1340!
   Using Hadoop does not mean that performance is
   fixed
 More improvements are possible
   Experiments used MapReduce whenever data
   communication was necessary
     NextGen Hadoop allows other programming paradigms
     besides MR
      MPI is a good candidate

   Still need to fix the data pipelining problem
 Factor of 1340 possible via focusing on storage --- similar in theme to Hadapt
How this fits with Hadapt

   [Architecture diagram: Flexible Query Interface (Full SQL Support, MR, JDBC) on top of Split Query Execution (Patent Pending) and Hadoop, on top of the Hadapt Storage Engine (Relational + HDFS)]

 Full SQL interface, MapReduce, and JDBC Connector
 10x-50x faster than Hadoop and Hive
   Queries go from hours to minutes, and minutes to seconds
 Analytics across structured and unstructured data in one platform
 3.5 Patents Pending
 $9.5M Series A financing, led by Norwest Venture Partners and Bessemer Venture Partners
Optimized Storage Matters
 HDFS appropriate for unstructured data
 Relational storage appropriate for relational data
 Graph storage appropriate for graph data
 Hadapt allows for pluggable storage inside
 Hadoop (amongst other things)




 Bottom line: Hadoop can be used for scalable
 graph processing, but it might need some
 Hadapting ;)
