HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Google’s original use case for BigTable was the storage and processing of web graph information, represented as sparse matrices. However, many organizations tend to treat HBase as merely a “web scale” RDBMS. This session will cover several use cases for storing graph data in HBase, including social networks and web link graphs; MapReduce processes like cached traversal, PageRank, and clustering; and, lastly, some lower-level modeling details like row key and column qualifier design, using FullContact’s graph processing systems as a real-world use case.

Published in: Technology, Education

  1. Storing and Manipulating Graphs in HBase. Dan Lynn, dan@fullcontact.com, @danklynn
  2. Keeps Contact Information Current and Complete. Based in Denver, Colorado. CTO & Co-Founder
  3. Turn Partial Contacts Into Full Contacts
  4. Refresher: Graph Theory
  5. Refresher: Graph Theory
  6. Refresher: Graph Theory: Vertex
  7. Refresher: Graph Theory: Edge
  8. Social Networks
  9. Tweets (diagram): vertices @danklynn, “#HBase rocks”, @xorlev; edge labels "retweeted", "follows", "author"
  10. Web Links (diagram): http://fullcontact.com/blog/, <a href="...">TechStars</a>, http://techstars.com/
  11. Why should you care?
      Vertex Influence
      - PageRank
      - Social Influence
      - Network bottlenecks
      Identifying Communities
  12. Storage Options
  13. neo4j
  14. neo4j: Very expressive querying (e.g. Gremlin)
  15. neo4j: Transactional
  16. neo4j: Data must fit on a single machine :-(
  17. FlockDB
  18. FlockDB: Scales horizontally
  19. FlockDB: Very fast
  20. FlockDB: No multi-hop query support :-(
  21. RDBMS (e.g. MySQL, Postgres, et al.)
  22. RDBMS: Transactional
  23. RDBMS: Huge amounts of JOINing :-(
  24. HBase: Massively scalable
  25. HBase: Data model well-suited
  26. HBase: Multi-hop querying?
  27. Modeling Techniques
  28. Adjacency Matrix (diagram): vertices 1, 2, 3
  29. Adjacency Matrix:
             1  2  3
          1  0  1  1
          2  1  0  1
          3  1  1  0
  30. Adjacency Matrix: Can use vectorized libraries
  31. Adjacency Matrix: Requires O(n^2) memory, where n = number of vertices
  32. Adjacency Matrix: Hard(er) to distribute
  33. Adjacency List (diagram): vertices 1, 2, 3
  34. Adjacency List:
          1: 2, 3
          2: 1, 3
          3: 1, 2
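
The two representations on slides 29 and 34 can be compared with a small in-memory sketch. It is only an illustration: the class and method names are hypothetical, and the edge set (1-2, 1-3, 2-3) is read off the matrix on slide 29.

```java
import java.util.ArrayList;
import java.util.List;

final class GraphRepresentations {
    // Undirected edges of the three-vertex example, using 1-based vertex ids.
    static final int[][] EDGES = {{1, 2}, {1, 3}, {2, 3}};

    // Builds an n x n matrix; it always holds n * n cells, however few edges exist.
    static int[][] adjacencyMatrix(int n, int[][] edges) {
        int[][] matrix = new int[n][n];
        for (int[] edge : edges) {
            matrix[edge[0] - 1][edge[1] - 1] = 1;
            matrix[edge[1] - 1][edge[0] - 1] = 1; // undirected: mirror the entry
        }
        return matrix;
    }

    // Builds one neighbor list per vertex; storage grows with vertices plus edges.
    static List<List<Integer>> adjacencyList(int n, int[][] edges) {
        List<List<Integer>> lists = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            lists.add(new ArrayList<>());
        }
        for (int[] edge : edges) {
            lists.get(edge[0] - 1).add(edge[1]);
            lists.get(edge[1] - 1).add(edge[0]);
        }
        return lists;
    }

    public static void main(String[] args) {
        for (int[] row : adjacencyMatrix(3, EDGES)) {
            System.out.println(java.util.Arrays.toString(row));
        }
        System.out.println(adjacencyList(3, EDGES)); // [[2, 3], [1, 3], [1, 2]]
    }
}
```

For a sparse graph the list form stores far fewer cells than the matrix, which is one reason it maps more naturally onto a row-per-vertex table.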
  35. Adjacency List Design in HBase (diagram): vertices e:dan@fullcontact.com, p:+13039316251, t:danklynn
  36. Adjacency List Design in HBase:
          row key                  "edges" column family
          e:dan@fullcontact.com    p:+13039316251= ...           t:danklynn= ...
          p:+13039316251           e:dan@fullcontact.com= ...    t:danklynn= ...
          t:danklynn               e:dan@fullcontact.com= ...    p:+13039316251= ...
  37. Adjacency List Design in HBase: same table as slide 36, with the callout "What to store?"
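
The row layout on slide 36 can be mimicked in memory to see how a single row read returns all of a vertex's neighbors. This is a rough sketch that does not touch HBase: the class name, the nested `TreeMap` structure, and the placeholder cell value `"weight=1.0"` are all assumptions made for illustration (the slides leave the cell contents as "...").

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeMap;

final class AdjacencyRowsSketch {
    // Row key -> (column qualifier -> cell value). TreeMap keeps keys sorted,
    // loosely mirroring how HBase keeps row keys and qualifiers in sorted order.
    private final TreeMap<String, TreeMap<String, String>> rows = new TreeMap<>();

    // Stores the edge under both endpoints, as in the table on slide 36.
    void addEdge(String a, String b, String value) {
        rows.computeIfAbsent(a, k -> new TreeMap<>()).put(b, value);
        rows.computeIfAbsent(b, k -> new TreeMap<>()).put(a, value);
    }

    // Reading one row yields every neighbor of that vertex.
    Set<String> neighbors(String vertex) {
        TreeMap<String, String> row = rows.get(vertex);
        return row == null ? Collections.<String>emptySet() : row.keySet();
    }

    // Builds the three-vertex example; the cell value is a hypothetical placeholder.
    static AdjacencyRowsSketch example() {
        AdjacencyRowsSketch graph = new AdjacencyRowsSketch();
        graph.addEdge("e:dan@fullcontact.com", "p:+13039316251", "weight=1.0");
        graph.addEdge("e:dan@fullcontact.com", "t:danklynn", "weight=1.0");
        graph.addEdge("p:+13039316251", "t:danklynn", "weight=1.0");
        return graph;
    }

    public static void main(String[] args) {
        System.out.println(example().neighbors("t:danklynn"));
    }
}
```

Using the destination vertex as the column qualifier means adding an edge is a single cell write and never requires rewriting the rest of the row.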
  38. Custom Writables (Java):
          package org.apache.hadoop.io;

          public interface Writable {
              void write(java.io.DataOutput dataOutput);
              void readFields(java.io.DataInput dataInput);
          }
  39. Custom Writables (Groovy):
          class EdgeValueWritable implements Writable {
              EdgeValue edgeValue

              void write(DataOutput dataOutput) {
                  dataOutput.writeDouble edgeValue.weight
              }

              void readFields(DataInput dataInput) {
                  Double weight = dataInput.readDouble()
                  edgeValue = new EdgeValue(weight)
              }
              // ...
          }
  40. Don’t get fancy with byte[] (Groovy):
          class EdgeValueWritable implements Writable {
              EdgeValue edgeValue

              byte[] toBytes() {
                  // use strings if you can help it
              }

              static EdgeValueWritable fromBytes(byte[] bytes) {
                  // use strings if you can help it
              }
          }
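
Slides 38 to 40 can be tried without a Hadoop install using a stand-in. The sketch below is an assumption-laden illustration, not the speaker's code: the nested `Writable` interface is a local copy of the shape shown on slide 38 (with `IOException` declared), the edge value is reduced to a single `weight` field, and the string-based `toBytes`/`fromBytes` is one possible reading of the "use strings" advice.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class EdgeValueRoundTrip {
    // Local stand-in for org.apache.hadoop.io.Writable so this compiles without Hadoop.
    interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical edge payload holding only a weight.
    static final class EdgeValueWritable implements Writable {
        double weight;

        EdgeValueWritable() {}
        EdgeValueWritable(double weight) { this.weight = weight; }

        @Override public void write(DataOutput out) throws IOException { out.writeDouble(weight); }
        @Override public void readFields(DataInput in) throws IOException { weight = in.readDouble(); }

        // String-based cell encoding: easy to inspect from a shell or a log line.
        byte[] toBytes() {
            return Double.toString(weight).getBytes(StandardCharsets.UTF_8);
        }

        static EdgeValueWritable fromBytes(byte[] bytes) {
            return new EdgeValueWritable(Double.parseDouble(new String(bytes, StandardCharsets.UTF_8)));
        }
    }

    // Serializes with write(...) and restores with readFields(...), returning the restored weight.
    static double roundTrip(double weight) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            new EdgeValueWritable(weight).write(new DataOutputStream(buffer));
            EdgeValueWritable restored = new EdgeValueWritable();
            restored.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return restored.weight;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(0.75));
        System.out.println(new String(new EdgeValueWritable(0.75).toBytes(), StandardCharsets.UTF_8));
    }
}
```

The binary form is what moves between map and reduce tasks; the string form is what would sit in the table cell, where readability tends to matter more than a few saved bytes.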
  41. Querying by vertex:
          def get = new Get(vertexKeyBytes)
          get.addFamily(edgesFamilyBytes)
          Result result = table.get(get);
          result.noVersionMap.each { family, data ->
              // construct edge objects as needed
              // data is a Map<byte[],byte[]>
          }
  42. Adding edges to a vertex:
          def put = new Put(vertexKeyBytes)
          put.add(
              edgesFamilyBytes,
              destinationVertexBytes,
              edgeValue.toBytes() // your own implementation here
          )
          // if writing directly
          table.put(put)
          // if using TableReducer
          context.write(NullWritable.get(), put)
  43. Distributed Traversal / Indexing (diagram): e:dan@fullcontact.com, p:+13039316251, t:danklynn
  44. Distributed Traversal / Indexing (diagram): e:dan@fullcontact.com, p:+13039316251, t:danklynn
  45. Distributed Traversal / Indexing (diagram): e:dan@fullcontact.com, p:+13039316251 (pivot vertex), t:danklynn
  46. Distributed Traversal / Indexing: MapReduce over outbound edges
  47. Distributed Traversal / Indexing: Emit vertexes and edge data grouped by the pivot
  48. Distributed Traversal / Indexing: Reduce key: p:+13039316251. “Out” vertex: e:dan@fullcontact.com. “In” vertex: t:danklynn
  49. Distributed Traversal / Indexing (diagram): e:dan@fullcontact.com, t:danklynn. Reducer emits higher-order edge
  50. Distributed Traversal / Indexing: Iteration 0
  51. Distributed Traversal / Indexing: Iteration 1
  52. Distributed Traversal / Indexing: Iteration 2
  53. Distributed Traversal / Indexing: Iteration 2. Reuse edges created during previous iterations
  54. Distributed Traversal / Indexing: Iteration 3
  55. Distributed Traversal / Indexing: Iteration 3. Reuse edges created during previous iterations
  56. Distributed Traversal / Indexing: ... hops requires only ... iterations
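
The pivot step on slides 45 to 55 can be simulated in memory to get a feel for how reusing earlier edges speeds up later iterations. This is a toy model under several assumptions: edges are treated as directed, vertices are plain integers on a chain, and the class and method names are hypothetical. It is not the MapReduce job itself; each `iterate` call stands in for one full pass.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class PivotTraversalSketch {
    // Builds a directed chain 0 -> 1 -> ... -> (n - 1).
    static Map<Integer, Set<Integer>> chain(int n) {
        Map<Integer, Set<Integer>> out = new HashMap<>();
        for (int i = 0; i < n - 1; i++) {
            out.computeIfAbsent(i, k -> new HashSet<>()).add(i + 1);
        }
        return out;
    }

    // One pass: every vertex acts as a pivot and links each inbound neighbor
    // to each outbound neighbor, producing a higher-order edge.
    static Map<Integer, Set<Integer>> iterate(Map<Integer, Set<Integer>> out) {
        // Invert the outbound edges to find each pivot's inbound neighbors.
        Map<Integer, Set<Integer>> in = new HashMap<>();
        for (Map.Entry<Integer, Set<Integer>> entry : out.entrySet()) {
            for (int target : entry.getValue()) {
                in.computeIfAbsent(target, k -> new HashSet<>()).add(entry.getKey());
            }
        }
        // Copy existing edges so that later iterations can reuse them.
        Map<Integer, Set<Integer>> next = new HashMap<>();
        for (Map.Entry<Integer, Set<Integer>> entry : out.entrySet()) {
            next.put(entry.getKey(), new HashSet<>(entry.getValue()));
        }
        for (Map.Entry<Integer, Set<Integer>> entry : out.entrySet()) {
            int pivot = entry.getKey();
            for (int source : in.getOrDefault(pivot, Collections.<Integer>emptySet())) {
                for (int target : entry.getValue()) {
                    if (source != target) {
                        next.computeIfAbsent(source, k -> new HashSet<>()).add(target);
                    }
                }
            }
        }
        return next;
    }

    // Counts how many passes are needed before a direct edge from -> to exists.
    static int iterationsUntilEdge(Map<Integer, Set<Integer>> out, int from, int to, int maxIterations) {
        int count = 0;
        while (!out.getOrDefault(from, Collections.<Integer>emptySet()).contains(to)) {
            if (count == maxIterations) {
                throw new IllegalStateException("edge not reached");
            }
            out = iterate(out);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(iterationsUntilEdge(chain(3), 0, 2, 10)); // three vertices, like e, p, t
        System.out.println(iterationsUntilEdge(chain(9), 0, 8, 10)); // eight hops apart
    }
}
```

In this toy chain, two vertices eight hops apart become directly linked after three passes, because each pass can join two previously computed edges through a pivot.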
  57. Tips / Gotchas
  58. Do implement your own comparator (Java):
          public static class Comparator extends WritableComparator {
              public int compare(
                      byte[] b1, int s1, int l1,
                      byte[] b2, int s2, int l2) {
                  // .....
              }
          }
  59. Do implement your own comparator (Java):
          static {
              // register the raw comparator for this key class
              WritableComparator.define(VertexKeyWritable.class,
                  new VertexKeyWritable.Comparator());
          }
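
The body elided on slide 58 could take many forms; a common choice for raw comparators is to compare the serialized bytes directly, so keys never need to be deserialized during the sort. The following is a minimal sketch of that idea only, assuming keys are serialized as plain UTF-8 strings. The class name is hypothetical and nothing here extends Hadoop's `WritableComparator`.

```java
import java.nio.charset.StandardCharsets;

final class RawBytesComparison {
    // Same parameter layout as compare(...) on slide 58: buffer, start offset, length.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int shared = Math.min(l1, l2);
        for (int i = 0; i < shared; i++) {
            int left = b1[s1 + i] & 0xff;  // treat bytes as unsigned
            int right = b2[s2 + i] & 0xff;
            if (left != right) {
                return left - right;
            }
        }
        return l1 - l2; // a key that is a prefix of the other sorts first
    }

    // Convenience wrapper for whole strings, used only for demonstration.
    static int compare(String a, String b) {
        byte[] left = a.getBytes(StandardCharsets.UTF_8);
        byte[] right = b.getBytes(StandardCharsets.UTF_8);
        return compare(left, 0, left.length, right, 0, right.length);
    }

    public static void main(String[] args) {
        System.out.println(compare("e:dan@fullcontact.com", "p:+13039316251") < 0); // true
        System.out.println(compare("t:danklynn", "t:danklynn") == 0);               // true
    }
}
```

If the real key format carries length prefixes or multiple fields, the offsets would need to skip or parse those; that detail depends on how `VertexKeyWritable` serializes itself, which the slides do not show.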
  60. MultiScanTableInputFormat (Java):
          MultiScanTableInputFormat.setTable(conf, "graph");
          MultiScanTableInputFormat.addScan(conf, new Scan());
          job.setInputFormatClass(MultiScanTableInputFormat.class);
  61. TableMapReduceUtil (Java):
          TableMapReduceUtil.initTableReducerJob("graph", MyReducer.class, job);
  62. Elastic MapReduce
  63. Elastic MapReduce (diagram labels): HFiles
  64. Elastic MapReduce (diagram labels): HFiles; Copy to S3; SequenceFiles
  65. Elastic MapReduce (diagram labels): HFiles; Copy to S3; Elastic MapReduce; SequenceFiles; SequenceFiles
  66. Elastic MapReduce (diagram labels): HFiles; Copy to S3; Elastic MapReduce; SequenceFiles; SequenceFiles
  67. Elastic MapReduce (diagram labels): HFiles; Copy to S3; Elastic MapReduce; SequenceFiles; SequenceFiles; HFileOutputFormat.configureIncrementalLoad(job, outputTable); HFiles
  68. Elastic MapReduce (diagram labels): HFiles; Copy to S3; Elastic MapReduce; SequenceFiles; SequenceFiles; HFileOutputFormat.configureIncrementalLoad(job, outputTable); HFiles; HBase
          $ hadoop jar hbase-VERSION.jar completebulkload
  69. Additional Resources
      - Google Pregel: BSP-based graph processing system
      - Apache Giraph: Implementation of Pregel for Hadoop
      - MultiScanTableInputFormat: (code to appear on GitHub)
      - Apache Mahout: Distributed machine learning on Hadoop
  70. Thanks! dan@fullcontact.com