HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Google’s original use case for BigTable was the storage and processing of web graph information, represented as sparse matrices. However, many organizations tend to treat HBase as merely a “web scale” RDBMS. This session covers several use cases for storing graph data in HBase, including social networks and web link graphs; MapReduce processes such as cached traversal, PageRank, and clustering; and, lastly, lower-level modeling details such as row key and column qualifier design, using FullContact’s graph processing systems as a real-world use case.

  1. 1. Storing and Manipulating Graphs in HBase Dan Lynn dan@fullcontact.com @danklynn
  2. 2. Keeps Contact Information Current and Complete. Based in Denver, Colorado. CTO & Co-Founder
  3. 3. Turn Partial Contacts Into Full Contacts
  4. 4. Refresher: Graph Theory
  5. 5. Refresher: Graph Theory
  6. 6. Refresher: Graph Theory (Vertex)
  7. 7. Refresher: Graph Theory (Edge)
  8. 8. Social Networks
  9. 9. Tweets: @danklynn retweeted “#HBase rocks” (author: @xorlev); @danklynn follows @xorlev
  10. 10. Web Links: http://fullcontact.com/blog/ links to http://techstars.com/ via <a href=”...”>TechStars</a>
  11. 11. Why should you care? Vertex Influence (PageRank, Social Influence, Network bottlenecks); Identifying Communities
  12. 12. Storage Options
  13. 13. neo4j
  14. 14. neo4j Very expressive querying (e.g. Gremlin)
  15. 15. neo4j Transactional
  16. 16. neo4j Data must fit on a single machine :-(
  17. 17. FlockDB
  18. 18. FlockDB Scales horizontally
  19. 19. FlockDB Very fast
  20. 20. FlockDB No multi-hop query support :-(
  21. 21. RDBMS (e.g. MySQL, Postgres, et al.)
  22. 22. RDBMS Transactional
  23. 23. RDBMS Huge amounts of JOINing :-(
  24. 24. HBase Massively scalable
  25. 25. HBase Data model well-suited
  26. 26. HBase Multi-hop querying?
  27. 27. Modeling Techniques
  28. 28. Adjacency Matrix 1 3 2
  29. 29. Adjacency Matrix (rows and columns indexed by vertices 1–3; entry = 1 if an edge connects them): row 1 = [0 1 1], row 2 = [1 0 1], row 3 = [1 1 0]
  30. 30. Adjacency Matrix Can use vectorized libraries
  31. 31. Adjacency Matrix Requires O(n²) memory (n = number of vertices)
  32. 32. Adjacency Matrix Hard(er) to distribute
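
As an aside (not from the deck), here is a minimal Java sketch of the three-vertex triangle above as a dense adjacency matrix; the n×n allocation is exactly what drives the O(n²) memory cost, regardless of how sparse the graph is:

```java
// Minimal sketch: the triangle graph (vertices 1, 2, 3) as a dense adjacency matrix.
// Memory is n*n cells no matter how few edges actually exist.
public class AdjacencyMatrixExample {
    public static void main(String[] args) {
        int n = 3;
        boolean[][] adj = new boolean[n][n];        // O(n^2) storage
        int[][] edges = {{0, 1}, {0, 2}, {1, 2}};   // edges 1-2, 1-3, 2-3 (0-indexed)
        for (int[] e : edges) {
            adj[e[0]][e[1]] = true;
            adj[e[1]][e[0]] = true;                 // symmetric: the graph is undirected
        }
        System.out.println(adj[0][1]);              // true
    }
}
```
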
  33. 33. Adjacency List 1 3 2
  34. 34. Adjacency List: 1 → 2,3; 2 → 1,3; 3 → 1,2
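
The same triangle as an adjacency list needs storage only proportional to the edges that actually exist; a minimal in-memory sketch (my own, for illustration):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch: the triangle graph as an adjacency list keyed by vertex id.
public class AdjacencyListExample {
    public static void main(String[] args) {
        Map<Integer, Set<Integer>> adj = new HashMap<Integer, Set<Integer>>();
        adj.put(1, new HashSet<Integer>(Arrays.asList(2, 3)));
        adj.put(2, new HashSet<Integer>(Arrays.asList(1, 3)));
        adj.put(3, new HashSet<Integer>(Arrays.asList(1, 2)));
        System.out.println(adj.get(1));   // [2, 3]
    }
}
```
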
  35. 35. Adjacency List Design in HBase e:dan@fullcontact.com p:+13039316251 t:danklynn
  36. 36. Adjacency List Design in HBase: row key → “edges” column family. e:dan@fullcontact.com → { p:+13039316251 = ..., t:danklynn = ... }; p:+13039316251 → { e:dan@fullcontact.com = ..., t:danklynn = ... }; t:danklynn → { e:dan@fullcontact.com = ..., p:+13039316251 = ... }
  37. 37. Adjacency List Design in HBase (What to store in the cell values?): row key → “edges” column family. e:dan@fullcontact.com → { p:+13039316251 = ..., t:danklynn = ... }; p:+13039316251 → { e:dan@fullcontact.com = ..., t:danklynn = ... }; t:danklynn → { e:dan@fullcontact.com = ..., p:+13039316251 = ... }
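
One way to set up a table matching this design (my sketch using the 2012-era client API; the table name "graph" is taken from the later TableMapReduceUtil slide, the "edges" family from the slides above):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Sketch: one row per vertex key (e:..., p:..., t:...), a single "edges" column
// family, and one column qualifier per neighbor vertex.
public class CreateGraphTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor table = new HTableDescriptor("graph");
        table.addFamily(new HColumnDescriptor("edges"));
        admin.createTable(table);
        admin.close();
    }
}
```
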
  38. 38. Custom Writables package org.apache.hadoop.io; public interface Writable { void write(java.io.DataOutput dataOutput) throws java.io.IOException; void readFields(java.io.DataInput dataInput) throws java.io.IOException; } java
  39. 39. Custom Writables class EdgeValueWritable implements Writable { EdgeValue edgeValue void write(DataOutput dataOutput) { dataOutput.writeDouble edgeValue.weight } void readFields(DataInput dataInput) { Double weight = dataInput.readDouble() edgeValue = new EdgeValue(weight) } // ... } groovy
  40. 40. Don’t get fancy with byte[] class EdgeValueWritable implements Writable { EdgeValue edgeValue byte[] toBytes() { // use strings if you can help it } static EdgeValueWritable fromBytes(byte[] bytes) { // use strings if you can help it } } groovy
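
The point of “use strings if you can help it” is that a readable string value survives schema drift and debugging far better than a hand-packed binary layout. A hedged sketch (EdgeValue with a single weight field is assumed from the earlier slide):

```java
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of string-based (de)serialization for the cell value: the weight is stored
// as its decimal string form rather than a raw 8-byte double.
public class EdgeValueCodec {
    static byte[] toBytes(double weight) {
        return Bytes.toBytes(Double.toString(weight));      // e.g. "0.75"
    }

    static double fromBytes(byte[] bytes) {
        return Double.parseDouble(Bytes.toString(bytes));
    }
}
```
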
  41. 41. Querying by vertex def get = new Get(vertexKeyBytes) get.addFamily(edgesFamilyBytes) Result result = table.get(get); result.noVersionMap.each {family, data -> // construct edge objects as needed // data is a Map<byte[],byte[]> }
  42. 42. Adding edges to a vertex def put = new Put(vertexKeyBytes) put.add( edgesFamilyBytes, destinationVertexBytes, edgeValue.toBytes() // your own implementation here ) // if writing directly table.put(put) // if using TableReducer context.write(NullWritable.get(), put)
  43. 43. Distributed Traversal / Indexing e:dan@fullcontact.com p:+13039316251 t:danklynn
  44. 44. Distributed Traversal / Indexing e:dan@fullcontact.com p:+13039316251 t:danklynn
  45. 45. Distributed Traversal / Indexing e:dan@fullcontact.com p:+13039316251 (pivot vertex) t:danklynn
  46. 46. Distributed Traversal / Indexing e:dan@fullcontact.com p:+13039316251 t:danklynn (MapReduce over outbound edges)
  47. 47. Distributed Traversal / Indexing e:dan@fullcontact.com p:+13039316251 t:danklynn (emit vertexes and edge data grouped by the pivot)
  48. 48. Distributed Traversal / Indexing Reduce key: p:+13039316251; “Out” vertex: e:dan@fullcontact.com; “In” vertex: t:danklynn
  49. 49. Distributed Traversal / Indexing e:dan@fullcontact.com t:danklynn (reducer emits higher-order edge)
  50. 50. Distributed Traversal / Indexing Iteration 0
  51. 51. Distributed Traversal / Indexing Iteration 1
  52. 52. Distributed Traversal / Indexing Iteration 2
  53. 53. Distributed Traversal / Indexing Reuse edges created during previous iterations (Iteration 2)
  54. 54. Distributed Traversal / Indexing Iteration 3
  55. 55. Distributed Traversal / Indexing Reuse edges created during previous iterations (Iteration 3)
  56. 56. Distributed Traversal / Indexing 2^n hops requires only n iterations
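
To make one iteration concrete, here is a hedged sketch (class names, Text keys, and the empty edge values are my simplifications, not the talk's code): the mapper emits each vertex's outbound neighbors keyed by that vertex (the pivot), and the reducer writes a direct edge between every pair of vertices that meet at the pivot, using the TableReducer pattern from slide 42:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

// Sketch of one traversal iteration over the adjacency-list table.
public class TraversalIteration {
    static final byte[] EDGES = Bytes.toBytes("edges");

    // Emits (pivot vertex, neighbor vertex) for every outbound edge of the scanned row.
    public static class PivotMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            Text pivot = new Text(Bytes.toString(row.get(), row.getOffset(), row.getLength()));
            for (byte[] neighbor : value.getFamilyMap(EDGES).keySet()) {
                context.write(pivot, new Text(Bytes.toString(neighbor)));
            }
        }
    }

    // Every pair of vertices that meet at the pivot gains a direct (higher-order) edge.
    public static class PivotReducer extends TableReducer<Text, Text, NullWritable> {
        @Override
        protected void reduce(Text pivot, Iterable<Text> vertices, Context context)
                throws IOException, InterruptedException {
            List<String> seen = new ArrayList<String>();
            for (Text v : vertices) seen.add(v.toString());
            for (String out : seen) {
                for (String in : seen) {
                    if (out.equals(in)) continue;
                    Put put = new Put(Bytes.toBytes(out));
                    put.add(EDGES, Bytes.toBytes(in), Bytes.toBytes("")); // edge value elided here
                    context.write(NullWritable.get(), put);
                }
            }
        }
    }
}
```

Because each new edge written here is itself scanned on the next pass, the reachable hop count can double every iteration, which is the reuse-previous-edges idea illustrated on slides 53 and 55.
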
  57. 57. Tips / Gotchas
  58. 58. Do implement your own comparator public static class Comparator extends WritableComparator { public int compare( byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) { // ..... } } java
  59. 59. Do implement your own comparator static { WritableComparator.define(VertexKeyWritable, new VertexKeyWritable.Comparator()) } java
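
The compare body is elided on the slides; as a rough sketch, if lexicographic order over VertexKeyWritable's serialized bytes happens to match the sort order you want (which the talk doesn't state), the nested raw comparator can simply delegate to WritableComparator.compareBytes and avoid deserializing keys during the shuffle:

```java
// Sketch of the nested comparator from the previous slide. Assumes VertexKeyWritable
// (whose definition is not shown in the deck) serializes so that raw byte order
// matches the intended key order.
public static class Comparator extends WritableComparator {
    public Comparator() {
        super(VertexKeyWritable.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return compareBytes(b1, s1, l1, b2, s2, l2);   // lexicographic, no deserialization
    }
}
```
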
  60. 60. MultiScanTableInputFormat MultiScanTableInputFormat.setTable(conf, "graph"); MultiScanTableInputFormat.addScan(conf, new Scan()); job.setInputFormatClass( MultiScanTableInputFormat.class); java
  61. 61. TableMapReduceUtil TableMapReduceUtil.initTableReducerJob( "graph", MyReducer.class, job); java
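
Putting the pieces together, a sketch with assumed names: the single-Scan input here is the stock alternative to the talk's MultiScanTableInputFormat, and PivotMapper/PivotReducer refer to the traversal sketch above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

// Sketch: scan the "graph" table, run one traversal iteration, write Puts back to it.
public class GraphIterationJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "graph-traversal-iteration");
        job.setJarByClass(GraphIterationJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);         // bigger scanner batches for MapReduce
        scan.setCacheBlocks(false);   // don't churn the block cache from a full scan

        TableMapReduceUtil.initTableMapperJob(
                "graph", scan, TraversalIteration.PivotMapper.class, Text.class, Text.class, job);
        TableMapReduceUtil.initTableReducerJob(
                "graph", TraversalIteration.PivotReducer.class, job);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
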
  62. 62. Elastic MapReduce
  63. 63. Elastic MapReduce HFiles
  64. 64. Elastic MapReduce HFiles → Copy to S3 → SequenceFiles
  65. 65. Elastic MapReduce HFiles → Copy to S3 → SequenceFiles → Elastic MapReduce → SequenceFiles
  66. 66. Elastic MapReduce HFiles → Copy to S3 → SequenceFiles → Elastic MapReduce → SequenceFiles
  67. 67. Elastic MapReduce HFiles → Copy to S3 → SequenceFiles → Elastic MapReduce → SequenceFiles → HFileOutputFormat.configureIncrementalLoad(job, outputTable) → HFiles
  68. 68. Elastic MapReduce HFiles → Copy to S3 → SequenceFiles → Elastic MapReduce → SequenceFiles → HFileOutputFormat.configureIncrementalLoad(job, outputTable) → HFiles → HBase: $ hadoop jar hbase-VERSION.jar completebulkload
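
A hedged sketch of the HFile-writing end of this pipeline (input handling is elided and the output path is illustrative): configureIncrementalLoad wires up the output format, a partitioner aligned to the table's regions, and a sorting reducer, after which the generated HFiles are handed to completebulkload as on the slide.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: produce HFiles for the "graph" table from a MapReduce job, then bulk load.
public class GraphBulkLoadJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "graph-bulkload");
        job.setJarByClass(GraphBulkLoadJob.class);
        // ... input format and mapper for the SequenceFiles copied back from EMR ...

        HTable outputTable = new HTable(conf, "graph");
        HFileOutputFormat.configureIncrementalLoad(job, outputTable);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/graph-hfiles")); // illustrative path

        job.waitForCompletion(true);
        // then load the HFiles into HBase:
        //   hadoop jar hbase-VERSION.jar completebulkload /tmp/graph-hfiles graph
    }
}
```
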
  69. 69. Additional Resources: Google Pregel (BSP-based graph processing system); Apache Giraph (implementation of Pregel for Hadoop); MultiScanTableInputFormat (code to appear on GitHub); Apache Mahout (distributed machine learning on Hadoop)
  70. 70. Thanks! dan@fullcontact.com
