Storing and manipulating graphs in HBase


  1. Storing and Manipulating Graphs in HBase. Dan Lynn, dan@fullcontact.com, @danklynn
  2. Keeps Contact Information Current and Complete. Based in Denver, Colorado. CTO & Co-Founder
  3. Turn Partial Contacts Into Full Contacts
  4. Refresher: Graph Theory
  5. Refresher: Graph Theory
  6. Refresher: Graph Theory. Vertex
  7. Refresher: Graph Theory. Edge
  8. Social Networks
  9. Tweets. Vertices: @danklynn, “#HBase rocks”, @xorlev. Edge labels: retweeted, follows, author
  10. Web Links. http://fullcontact.com/blog/ links to http://techstars.com/ via <a href=”...”>TechStars</a>
  11. Why should you care?
      Vertex Influence
      - PageRank
      - Social Influence
      - Network bottlenecks
      Identifying Communities
  12. Storage Options
  13. neo4j
  14. neo4j: Very expressive querying (e.g. Gremlin)
  15. neo4j: Transactional
  16. neo4j: Data must fit on a single machine :-(
  17. FlockDB
  18. FlockDB: Scales horizontally
  19. FlockDB: Very fast
  20. FlockDB: No multi-hop query support :-(
  21. RDBMS (e.g. MySQL, Postgres, et al.)
  22. RDBMS: Transactional
  23. RDBMS: Huge amounts of JOINing :-(
  24. HBase: Massively scalable
  25. HBase: Data model well-suited
  26. HBase: Multi-hop querying?
  27. Modeling Techniques
  28. Adjacency Matrix. Graph diagram with vertices 1, 2, 3
  29. Adjacency Matrix
         1  2  3
      1  0  1  1
      2  1  0  1
      3  1  1  0
  30. Adjacency Matrix: Can use vectorized libraries
  31. Adjacency Matrix: Requires O(n^2) memory (n = number of vertices)
  32. Adjacency Matrix: Hard(er) to distribute
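
The matrix from slide 29 can be reproduced in a few lines. This is a minimal sketch using plain Java arrays; the class name `AdjacencyMatrixSketch` and the `build` helper are made up for illustration and do not come from the talk.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not from the slides.
class AdjacencyMatrixSketch {
    // Builds a symmetric n x n matrix for an undirected graph.
    // Vertices are numbered 1..n as on the slide; each edge is a {u, v} pair.
    static int[][] build(int n, int[][] edges) {
        int[][] m = new int[n][n];      // always n * n cells, hence O(n^2) memory
        for (int[] e : edges) {
            m[e[0] - 1][e[1] - 1] = 1;
            m[e[1] - 1][e[0] - 1] = 1;  // undirected: mirror the entry
        }
        return m;
    }

    public static void main(String[] args) {
        // The three-vertex graph from the slide: every vertex linked to the other two.
        int[][] m = build(3, new int[][] {{1, 2}, {1, 3}, {2, 3}});
        for (int[] row : m) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The array is allocated at full size before any edge is added, which is where the O(n^2) memory cost comes from regardless of how sparse the graph is.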
  33. Adjacency List. Graph diagram with vertices 1, 2, 3
  34. Adjacency List
      1: 2, 3
      2: 1, 3
      3: 1, 2
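
The same three-vertex graph as an adjacency list, for comparison. This is a sketch only; `AdjacencyListSketch` and `build` are illustrative names, and sorted collections are used purely so the printed output is stable.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper for illustration; not from the slides.
class AdjacencyListSketch {
    // Maps each vertex to the sorted set of its neighbors.
    static Map<Integer, SortedSet<Integer>> build(int[][] edges) {
        Map<Integer, SortedSet<Integer>> adj = new TreeMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new TreeSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new TreeSet<>()).add(e[0]);
        }
        return adj;
    }

    public static void main(String[] args) {
        // Prints {1=[2, 3], 2=[1, 3], 3=[1, 2]}
        System.out.println(build(new int[][] {{1, 2}, {1, 3}, {2, 3}}));
    }
}
```

Storage here grows with the number of edges rather than with the square of the number of vertices.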
  35. Adjacency List Design in HBase. Vertices: e:dan@fullcontact.com, p:+13039316251, t:danklynn
  36. Adjacency List Design in HBase
      row key                  “edges” column family
      e:dan@fullcontact.com    p:+13039316251= ...          t:danklynn= ...
      p:+13039316251           e:dan@fullcontact.com= ...   t:danklynn= ...
      t:danklynn               e:dan@fullcontact.com= ...   p:+13039316251= ...
  37. Adjacency List Design in HBase. Same table as slide 36, with the annotation: What to store?
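
The row layout on slides 36 and 37 can be modeled in memory with nested sorted maps. This is a rough stand-in for an HBase table, not HBase code: `EdgeTableSketch`, `putEdge`, and `edgesOf` are hypothetical names, and the cell value `"1.0"` is an assumed placeholder for whatever the application decides to store.

```java
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for the "edges" column family; names are hypothetical.
class EdgeTableSketch {
    // row key -> (column qualifier in the "edges" family -> cell value)
    private final TreeMap<String, TreeMap<String, String>> rows = new TreeMap<>();

    // Stores the edge under both row keys, mirroring the symmetric rows on the slide.
    void putEdge(String a, String b, String value) {
        rows.computeIfAbsent(a, k -> new TreeMap<>()).put(b, value);
        rows.computeIfAbsent(b, k -> new TreeMap<>()).put(a, value);
    }

    // Reading one row yields every edge of that vertex, like a single-row lookup.
    Map<String, String> edgesOf(String vertexKey) {
        return rows.getOrDefault(vertexKey, new TreeMap<>());
    }

    public static void main(String[] args) {
        EdgeTableSketch table = new EdgeTableSketch();
        table.putEdge("e:dan@fullcontact.com", "p:+13039316251", "1.0"); // "1.0" is a placeholder value
        table.putEdge("e:dan@fullcontact.com", "t:danklynn", "1.0");
        table.putEdge("p:+13039316251", "t:danklynn", "1.0");
        System.out.println(table.edgesOf("t:danklynn").keySet());
    }
}
```

Because the destination vertex is the column qualifier, all neighbors of a vertex live in one row and come back from one keyed read.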
  38. Custom Writables (Java)
      package org.apache.hadoop.io;

      public interface Writable {
          void write(java.io.DataOutput dataOutput) throws java.io.IOException;
          void readFields(java.io.DataInput dataInput) throws java.io.IOException;
      }
  39. Custom Writables (Groovy)
      class EdgeValueWritable implements Writable {
          EdgeValue edgeValue

          void write(DataOutput dataOutput) {
              dataOutput.writeDouble edgeValue.weight
          }

          void readFields(DataInput dataInput) {
              Double weight = dataInput.readDouble()
              edgeValue = new EdgeValue(weight)
          }
          // ...
      }
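
The write/readFields contract from slides 38 and 39 can be tried without Hadoop on the classpath. In the sketch below, `SimpleWritable` is a local stand-in that copies the two method signatures of `org.apache.hadoop.io.Writable`; it, `WritableRoundTrip`, and `roundTrip` are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WritableRoundTrip {
    // Local stand-in with the same two methods as Hadoop's Writable interface.
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    static class EdgeValueWritable implements SimpleWritable {
        double weight;

        EdgeValueWritable() {}                       // no-arg constructor for deserialization
        EdgeValueWritable(double weight) { this.weight = weight; }

        @Override public void write(DataOutput out) throws IOException {
            out.writeDouble(weight);
        }

        @Override public void readFields(DataInput in) throws IOException {
            weight = in.readDouble();
        }
    }

    // Serializes the value to a byte array and reads it back into a fresh object.
    static EdgeValueWritable roundTrip(EdgeValueWritable value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            value.write(new DataOutputStream(buffer));
            EdgeValueWritable copy = new EdgeValueWritable();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new EdgeValueWritable(0.75)).weight);
    }
}
```

The fields must be read in exactly the order they were written, since the stream carries no field names.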
  40. Don’t get fancy with byte[] (Groovy)
      class EdgeValueWritable implements Writable {
          EdgeValue edgeValue

          byte[] toBytes() {
              // use strings if you can help it
          }

          static EdgeValueWritable fromBytes(byte[] bytes) {
              // use strings if you can help it
          }
      }
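
One way to follow the "use strings" advice on slide 40 is to store the value as UTF-8 text. This is a sketch under that assumption; `EdgeValueBytes` is a hypothetical class holding only a weight, and the talk does not show the actual implementation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical edge value that serializes itself as a UTF-8 string.
class EdgeValueBytes {
    final double weight;

    EdgeValueBytes(double weight) {
        this.weight = weight;
    }

    // Encodes the weight as human-readable text rather than a packed binary layout.
    byte[] toBytes() {
        return Double.toString(weight).getBytes(StandardCharsets.UTF_8);
    }

    // Decodes a value previously produced by toBytes().
    static EdgeValueBytes fromBytes(byte[] bytes) {
        return new EdgeValueBytes(Double.parseDouble(new String(bytes, StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) {
        byte[] cell = new EdgeValueBytes(0.75).toBytes();
        System.out.println(new String(cell, StandardCharsets.UTF_8)); // prints 0.75
        System.out.println(fromBytes(cell).weight);                   // prints 0.75
    }
}
```

A text encoding costs a few extra bytes per cell, but the stored value stays legible when inspected with generic tools.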
  41. Querying by vertex
      def get = new Get(vertexKeyBytes)
      get.addFamily(edgesFamilyBytes)
      Result result = table.get(get);
      result.noVersionMap.each { family, data ->
          // construct edge objects as needed
          // data is a Map<byte[],byte[]>
      }
  42. Adding edges to a vertex
      def put = new Put(vertexKeyBytes)
      put.add(
          edgesFamilyBytes,
          destinationVertexBytes,
          edgeValue.toBytes() // your own implementation here
      )
      // if writing directly
      table.put(put)
      // if using TableReducer
      context.write(NullWritable.get(), put)
  43. Distributed Traversal / Indexing. Vertices: e:dan@fullcontact.com, p:+13039316251, t:danklynn
  44. Distributed Traversal / Indexing. Vertices: e:dan@fullcontact.com, p:+13039316251, t:danklynn
  45. Distributed Traversal / Indexing. Vertices: e:dan@fullcontact.com, p:+13039316251 (labeled “Pivot vertex”), t:danklynn
  46. Distributed Traversal / Indexing. Vertices: e:dan@fullcontact.com, p:+13039316251, t:danklynn. MapReduce over outbound edges
  47. Distributed Traversal / Indexing. Vertices: e:dan@fullcontact.com, p:+13039316251, t:danklynn. Emit vertexes and edge data grouped by the pivot
  48. Distributed Traversal / Indexing. Reduce key: p:+13039316251. “Out” vertex: e:dan@fullcontact.com. “In” vertex: t:danklynn
  49. Distributed Traversal / Indexing. Vertices: e:dan@fullcontact.com, t:danklynn. Reducer emits higher-order edge
  50. Distributed Traversal / Indexing. Iteration 0
  51. Distributed Traversal / Indexing. Iteration 1
  52. Distributed Traversal / Indexing. Iteration 2
  53. Distributed Traversal / Indexing. Reuse edges created during previous iterations. Iteration 2
  54. Distributed Traversal / Indexing. Iteration 3
  55. Distributed Traversal / Indexing. Reuse edges created during previous iterations. Iteration 3
  56. Distributed Traversal / Indexing. … hops requires only … iterations
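
The pivot-and-join idea from slides 45 to 55 can be simulated on a single machine. The sketch below is not the MapReduce job from the talk: it is a plain in-memory loop, with `PivotTraversalSketch`, `iterate`, and `chain` as hypothetical names and integer vertex IDs assumed for brevity. Each pass treats every vertex as a pivot and links each pair of its neighbors, keeping all edges found so far.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Single-machine simulation of the pivot join; names are hypothetical.
class PivotTraversalSketch {
    static void addEdge(Map<Integer, Set<Integer>> g, int a, int b) {
        g.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        g.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    // One pass: every vertex acts as a pivot, and each pair of its neighbors
    // is joined by a new higher-order edge. Existing edges are carried over.
    static Map<Integer, Set<Integer>> iterate(Map<Integer, Set<Integer>> g) {
        Map<Integer, Set<Integer>> next = new HashMap<>();
        for (Map.Entry<Integer, Set<Integer>> row : g.entrySet()) {
            int pivot = row.getKey();
            for (int in : row.getValue()) {
                addEdge(next, pivot, in);              // keep the existing edge
                for (int out : row.getValue()) {
                    if (in != out) {
                        addEdge(next, in, out);        // emit the edge in -- out
                    }
                }
            }
        }
        return next;
    }

    // Builds the path graph 0 -- 1 -- ... -- (vertices - 1).
    static Map<Integer, Set<Integer>> chain(int vertices) {
        Map<Integer, Set<Integer>> g = new HashMap<>();
        for (int v = 0; v + 1 < vertices; v++) {
            addEdge(g, v, v + 1);
        }
        return g;
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> g = chain(9);       // vertex 8 is 8 hops from vertex 0
        for (int pass = 1; pass <= 3; pass++) {
            g = iterate(g);
            System.out.println("pass " + pass + ": 0 -- 8 linked = " + g.get(0).contains(8));
        }
    }
}
```

In this sketch, reusing the edges from earlier passes lets the span of a derived edge double each time, so vertices 8 hops apart are linked after the third pass rather than the eighth.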
  57. Tips / Gotchas
  58. Do implement your own comparator (Java)
      public static class Comparator extends WritableComparator {
          public int compare(
                  byte[] b1, int s1, int l1,
                  byte[] b2, int s2, int l2) {
              // .....
          }
      }
  59. Do implement your own comparator (Java)
      static {
          WritableComparator.define(VertexKeyWritable.class,
                  new VertexKeyWritable.Comparator());
      }
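
The body of the raw `compare` method is elided on slide 58. One plausible shape, shown here without any Hadoop dependency, is an unsigned lexicographic comparison over the serialized bytes. `RawBytesComparatorSketch` is a hypothetical class and this is an assumption about what such a comparator might do, not the author's code; Hadoop's `WritableComparator.compareBytes` offers a similar helper.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone comparator; not the implementation from the talk.
class RawBytesComparatorSketch {
    // Same parameter list as the raw compare method on the slide:
    // (buffer, start offset, length) for each of the two serialized keys.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int x = b1[s1 + i] & 0xff;  // treat bytes as unsigned
            int y = b2[s2 + i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return l1 - l2;                 // on a shared prefix, the shorter key sorts first
    }

    public static void main(String[] args) {
        byte[] a = "e:dan@fullcontact.com".getBytes(StandardCharsets.UTF_8);
        byte[] b = "t:danklynn".getBytes(StandardCharsets.UTF_8);
        System.out.println(compare(a, 0, a.length, b, 0, b.length) < 0); // prints true
    }
}
```

Comparing the serialized form directly avoids deserializing both keys on every comparison during the sort.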
  60. MultiScanTableInputFormat (Java)
      MultiScanTableInputFormat.setTable(conf, "graph");
      MultiScanTableInputFormat.addScan(conf, new Scan());
      job.setInputFormatClass(MultiScanTableInputFormat.class);
  61. TableMapReduceUtil (Java)
      TableMapReduceUtil.initTableReducerJob("graph", MyReducer.class, job);
  62. Elastic MapReduce
  63. Elastic MapReduce: HFiles
  64. Elastic MapReduce: HFiles; Copy to S3; SequenceFiles
  65. Elastic MapReduce: HFiles; Copy to S3; Elastic MapReduce; SequenceFiles; SequenceFiles
  66. Elastic MapReduce: HFiles; Copy to S3; Elastic MapReduce; SequenceFiles; SequenceFiles
  67. Elastic MapReduce: HFiles; Copy to S3; Elastic MapReduce; SequenceFiles; SequenceFiles; HFileOutputFormat.configureIncrementalLoad(job, outputTable); HFiles
  68. Elastic MapReduce: HFiles; Copy to S3; Elastic MapReduce; SequenceFiles; SequenceFiles; HFileOutputFormat.configureIncrementalLoad(job, outputTable); HFiles; HBase; $ hadoop jar hbase-VERSION.jar completebulkload
  69. Additional Resources
      - Google Pregel: BSP-based graph processing system
      - Apache Giraph: Implementation of Pregel for Hadoop
      - MultiScanTableInputFormat example
      - Apache Mahout: Distributed machine learning on Hadoop
  70. Thanks! dan@fullcontact.com
