Storing and Manipulating Graphs in HBase


            Dan Lynn
          dan@fullcontact.com
              @danklynn
FullContact keeps contact information current and complete

Based in Denver, Colorado

Dan Lynn: CTO & Co-Founder

Turn Partial Contacts Into Full Contacts
Refresher: Graph Theory

[diagram: two vertices connected by an edge]

Vertex

Edge
Social Networks

Tweets

[diagram: @danklynn follows @xorlev; @xorlev is the author of the tweet "#HBase rocks", which @danklynn retweeted]
Web Links

[diagram: http://fullcontact.com/blog/ links to http://techstars.com/ via <a href="...">TechStars</a>]
Why should you care?

Vertex Influence
- PageRank
- Social Influence
- Network bottlenecks

Identifying Communities
Storage Options
neo4j

Very expressive querying (e.g. Gremlin)

Transactional

Data must fit on a single machine :-(
FlockDB

Scales horizontally

Very fast

No multi-hop query support :-(
RDBMS (e.g. MySQL, Postgres, et al.)

Transactional

Huge amounts of JOINing :-(
HBase

Massively scalable

Data model well-suited to graphs

Multi-hop querying?
Modeling Techniques

Adjacency Matrix

[diagram: triangle graph with vertices 1, 2, 3]
Adjacency Matrix

    1   2    3

1   0   1    1

2   1   0    1

3   1   1    0
Can use vectorized libraries

Requires O(n²) memory (n = number of vertices)

Hard(er) to distribute
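As a concrete illustration of the memory cost (my own sketch, not from the talk), a dense matrix allocates n × n cells up front regardless of how many edges actually exist:

// Sketch of a dense adjacency matrix; names are illustrative only.
public class AdjacencyMatrix {
    private final boolean[][] cells;

    public AdjacencyMatrix(int n) {
        cells = new boolean[n][n];   // O(n^2) space, even for sparse graphs
    }

    public void addEdge(int from, int to) {
        cells[from][to] = true;
        cells[to][from] = true;      // undirected
    }

    public boolean hasEdge(int from, int to) {
        return cells[from][to];
    }
}
                                          java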
Adjacency List

[diagram: the same triangle graph with vertices 1, 2, 3]
Adjacency List

1 → 2, 3

2 → 1, 3

3 → 1, 2
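For comparison, a minimal in-memory adjacency list (again my own sketch): storage grows with the number of edges rather than with n².

import java.util.*;

// Sketch of an in-memory adjacency list keyed by vertex id.
public class AdjacencyList {
    private final Map<Integer, Set<Integer>> neighbors = new HashMap<>();

    public void addEdge(int from, int to) {
        neighbors.computeIfAbsent(from, k -> new HashSet<>()).add(to);
        neighbors.computeIfAbsent(to, k -> new HashSet<>()).add(from);   // undirected
    }

    public Set<Integer> neighborsOf(int vertex) {
        return neighbors.getOrDefault(vertex, Collections.emptySet());
    }
}
                                          java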
Adjacency List Design in HBase

[diagram: e:dan@fullcontact.com, p:+13039316251, and t:danklynn connected as vertices]
Adjacency List Design in HBase
      row key               “edges” column family

e:dan@fullcontact.com   p:+13039316251= ...

                        t:danklynn= ...


p:+13039316251          e:dan@fullcontact.com= ...

                        t:danklynn= ...


t:danklynn              e:dan@fullcontact.com= ...

                        p:+13039316251= ...
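The e:, p:, and t: prefixes (email, phone, Twitter handle) come straight from the slides; the helper below for building those row keys is my own sketch.

import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical helper for the prefixed row keys shown above.
public class VertexKeys {
    public static byte[] email(String address)   { return Bytes.toBytes("e:" + address); }
    public static byte[] phone(String number)    { return Bytes.toBytes("p:" + number); }
    public static byte[] twitter(String handle)  { return Bytes.toBytes("t:" + handle); }
}
                                          java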
Adjacency List Design in HBase

What to store?
Custom Writables
package org.apache.hadoop.io;

public interface Writable {

    void write(java.io.DataOutput dataOutput) throws java.io.IOException;

    void readFields(java.io.DataInput dataInput) throws java.io.IOException;
}
                                                    java
Custom Writables
class EdgeValueWritable implements Writable {

    EdgeValue edgeValue

    void write(DataOutput dataOutput) {
        // serialize only the fields you need: here, just the edge weight
        dataOutput.writeDouble edgeValue.weight
    }

    void readFields(DataInput dataInput) {
        // read fields back in the same order they were written
        Double weight = dataInput.readDouble()
        edgeValue = new EdgeValue(weight)
    }

    // ...
}
                                                  groovy
Don’t get fancy with byte[]
class EdgeValueWritable implements Writable {
   EdgeValue edgeValue

    byte[] toBytes() {
        // use strings if you can help it
    }

    static EdgeValueWritable fromBytes(byte[] bytes) {
        // use strings if you can help it
    }
}
                                                     groovy
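For example, a string-backed encoding via HBase's Bytes utility keeps the values readable in the shell. This is a sketch of the approach the slide recommends, not FullContact's actual format.

import org.apache.hadoop.hbase.util.Bytes;

// Sketch: encode the edge weight as a plain string (assumed format).
class EdgeValue {
    double weight;

    EdgeValue(double weight) { this.weight = weight; }

    byte[] toBytes() {
        return Bytes.toBytes(Double.toString(weight));          // e.g. "0.75"
    }

    static EdgeValue fromBytes(byte[] bytes) {
        return new EdgeValue(Double.parseDouble(Bytes.toString(bytes)));
    }
}
                                          java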
Querying by vertex
def get = new Get(vertexKeyBytes)
get.addFamily(edgesFamilyBytes)

Result result = table.get(get);
result.noVersionMap.each {family, data ->

    // construct edge objects as needed
    // data is a Map<byte[],byte[]>
}
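One possible shape for the edge-construction step inside that loop. Edge and EdgeValue are hypothetical helper types (EdgeValue from the earlier slide, Edge my own); getFamilyMap is standard HBase API.

import java.util.*;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: each qualifier in the "edges" family is the destination vertex key,
// and each cell value is a serialized EdgeValue.
static List<Edge> edgesFrom(Result result, byte[] edgesFamilyBytes) {
    List<Edge> edges = new ArrayList<>();
    NavigableMap<byte[], byte[]> columns = result.getFamilyMap(edgesFamilyBytes);
    if (columns == null) return edges;
    for (Map.Entry<byte[], byte[]> entry : columns.entrySet()) {
        String destination = Bytes.toString(entry.getKey());
        edges.add(new Edge(destination, EdgeValue.fromBytes(entry.getValue())));
    }
    return edges;
}
                                          java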
Adding edges to a vertex
def put = new Put(vertexKeyBytes)

put.add(
    edgesFamilyBytes,
    destinationVertexBytes,
    edgeValue.toBytes() // your own implementation here
)

// if writing directly
table.put(put)


// if using TableReducer
context.write(NullWritable.get(), put)
Distributed Traversal / Indexing

[diagram: e:dan@fullcontact.com, p:+13039316251, and t:danklynn connected by edges]

Pick a pivot vertex (here, p:+13039316251)

MapReduce over outbound edges

Emit vertexes and edge data grouped by the pivot

Reduce key: p:+13039316251
"Out" vertex: e:dan@fullcontact.com
"In" vertex: t:danklynn

Reducer emits a higher-order edge between e:dan@fullcontact.com and t:danklynn

Iteration 0

Iteration 1

Iteration 2: reuse edges created during previous iterations

Iteration 3: reuse edges created during previous iterations
Distributed Traversal / Indexing

Because each iteration reuses edges created during previous iterations, reach doubles every pass: 2ⁿ hops requires only n iterations
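A rough sketch of one traversal iteration as an HBase MapReduce job. Only the overall pattern (group edges by pivot vertex, emit higher-order edges in the reducer) comes from the slides; the class names, column family, and key types are my own assumptions.

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

public class TraversalIteration {

    static final byte[] EDGES = Bytes.toBytes("edges");   // assumed family name

    // Mapper: for every outbound edge (vertex -> neighbor), emit the source
    // vertex keyed by the neighbor, so all vertexes touching a pivot meet
    // in the same reduce call.
    public static class EdgeMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable row, Result result, Context ctx)
                throws IOException, InterruptedException {
            String vertex = Bytes.toString(row.get());
            NavigableMap<byte[], byte[]> columns = result.getFamilyMap(EDGES);
            if (columns == null) return;
            for (byte[] qualifier : columns.keySet()) {
                ctx.write(new Text(Bytes.toString(qualifier)), new Text(vertex));
            }
        }
    }

    // Reducer: every pair of the pivot's neighbors is two hops apart, so
    // emit a higher-order edge between them (in both directions).
    public static class HigherOrderEdgeReducer
            extends TableReducer<Text, Text, NullWritable> {
        @Override
        protected void reduce(Text pivot, Iterable<Text> vertexes, Context ctx)
                throws IOException, InterruptedException {
            List<String> endpoints = new ArrayList<>();
            for (Text v : vertexes) endpoints.add(v.toString());
            for (String out : endpoints) {
                for (String in : endpoints) {
                    if (out.equals(in)) continue;
                    Put put = new Put(Bytes.toBytes(out));
                    put.add(EDGES, Bytes.toBytes(in), Bytes.toBytes("..."));   // placeholder value
                    ctx.write(NullWritable.get(), put);
                }
            }
        }
    }
}
                                          java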
Tips / Gotchas
Do implement your own comparator
public static class Comparator
               extends WritableComparator {

    public Comparator() {
        super(VertexKeyWritable.class);
    }

    @Override
    public int compare(
        byte[] b1, int s1, int l1,
        byte[] b2, int s2, int l2) {

        // compare the serialized bytes directly; no deserialization needed
        // .....
    }
}
                                              java


static {
    WritableComparator.define(VertexKeyWritable.class,
         new VertexKeyWritable.Comparator());
}



                                                   java
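If the serialized key sorts correctly as raw unsigned bytes (an assumption about VertexKeyWritable's format), the body of compare() can simply delegate to Hadoop's built-in lexicographic helper:

// Sketch: raw-byte comparison, assuming the serialized form already sorts
// in the desired order.
@Override
public int compare(byte[] b1, int s1, int l1,
                   byte[] b2, int s2, int l2) {
    return WritableComparator.compareBytes(b1, s1, l1, b2, s2, l2);
}
                                          java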
MultiScanTableInputFormat

MultiScanTableInputFormat.setTable(conf, "graph");

MultiScanTableInputFormat.addScan(conf, new Scan());

job.setInputFormatClass(MultiScanTableInputFormat.class);


                                          java
TableMapReduceUtil



TableMapReduceUtil.initTableReducerJob(
    "graph", MyReducer.class, job);

                                      java
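Putting the pieces together, a minimal job setup might look like the following sketch. EdgeMapper and HigherOrderEdgeReducer are the hypothetical classes from the traversal sketch above; initTableMapperJob and initTableReducerJob are the standard HBase helpers.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

// Sketch: read the "graph" table, run one traversal iteration, write Puts back.
Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "graph-traversal-iteration");    // hypothetical job name
job.setJarByClass(TraversalIteration.class);

Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("edges"));                  // assumed family name

TableMapReduceUtil.initTableMapperJob(
    "graph", scan, TraversalIteration.EdgeMapper.class,
    Text.class, Text.class, job);

TableMapReduceUtil.initTableReducerJob(
    "graph", TraversalIteration.HigherOrderEdgeReducer.class, job);

job.waitForCompletion(true);
                                          java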
Elastic MapReduce

HFiles → copy to S3 as SequenceFiles

SequenceFiles → Elastic MapReduce → SequenceFiles

HFileOutputFormat.configureIncrementalLoad(job, outputTable) → HFiles

HFiles → HBase:

$ hadoop jar hbase-VERSION.jar completebulkload
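A sketch of the bulk-load step: configure the job to write HFiles sorted and partitioned for the target table, then hand the output directory to completebulkload. The output path and table name are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: have the job emit HFiles ready for bulk loading into "graph".
Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "graph-bulk-load");              // hypothetical job name

HTable outputTable = new HTable(conf, "graph");
HFileOutputFormat.configureIncrementalLoad(job, outputTable);
FileOutputFormat.setOutputPath(job, new Path("/tmp/graph-hfiles"));   // placeholder path

job.waitForCompletion(true);

// then load the HFiles into the running cluster:
//   $ hadoop jar hbase-VERSION.jar completebulkload /tmp/graph-hfiles graph
                                          java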
Additional Resources

Google Pregel: BSP-based graph processing system

Apache Giraph: Implementation of Pregel for Hadoop

MultiScanTableInputFormat example

Apache Mahout: Distributed machine learning on Hadoop
Thanks!
dan@fullcontact.com
