HBaseCon 2012 | Storing and Manipulating Graphs in HBase
 


Google’s original use case for BigTable was the storage and processing of web graph information, represented as sparse matrices. However, many organizations tend to treat HBase as merely a “web-scale” RDBMS. This session covers several use cases for storing graph data in HBase, including social networks and web link graphs; MapReduce processes such as cached traversal, PageRank, and clustering; and, lastly, lower-level modeling details such as row key and column qualifier design, using FullContact’s graph processing systems as a real-world case study.



Presentation Transcript

  • Storing and Manipulating Graphs in HBase Dan Lynn dan@fullcontact.com @danklynn
  • Keeps Contact Information Current and Complete Based in Denver, Colorado CTO & Co-Founder
  • Turn Partial Contacts Into Full Contacts
  • Refresher: Graph Theory
  • Refresher: Graph Theory
  • Refresher: Graph Theory (diagram label: Vertex)
  • Refresher: Graph Theory (diagram label: Edge)
  • Social Networks
  • Tweets: @danklynn follows @xorlev; @xorlev authored “#HBase rocks”; @danklynn retweeted it
  • Web Links: http://fullcontact.com/blog/ links to http://techstars.com/ via <a href=”...”>TechStars</a>
  • Why should you care? Vertex influence (PageRank, social influence, network bottlenecks); identifying communities
  • Storage Options
  • neo4j
  • neo4j: Very expressive querying (e.g. Gremlin)
  • neo4j: Transactional
  • neo4j: Data must fit on a single machine :-(
  • FlockDB
  • FlockDB: Scales horizontally
  • FlockDB: Very fast
  • FlockDB: No multi-hop query support :-(
  • RDBMS(e.g. MySQL, Postgres, et al.)
  • RDBMS: Transactional
  • RDBMS: Huge amounts of JOINing :-(
  • HBase: Massively scalable
  • HBase: Data model well-suited
  • HBase: Multi-hop querying?
  • Modeling Techniques
  • Adjacency Matrix (diagram: graph on vertices 1, 2, 3)
  • Adjacency Matrix:
        1  2  3
    1   0  1  1
    2   1  0  1
    3   1  1  0
  • Adjacency Matrix: Can use vectorized libraries
  • Adjacency Matrix: Requires O(n²) memory (n = number of vertices)
  • Adjacency Matrix: Hard(er) to distribute
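The matrix slide above can be sketched in plain Java (an illustrative sketch, not from the deck); the dense `int[][]` makes the O(n²) memory cost concrete:

```java
// Sketch (not from the slides): the 3x3 adjacency matrix shown above,
// stored as a dense int[][]. Memory grows as O(n^2) in the vertex count,
// which is why this layout is hard to scale and to distribute.
class AdjacencyMatrixSketch {
    // MATRIX[i][j] == 1 means an edge between vertex i+1 and vertex j+1
    static final int[][] MATRIX = {
        {0, 1, 1},  // vertex 1 connects to 2 and 3
        {1, 0, 1},  // vertex 2 connects to 1 and 3
        {1, 1, 0},  // vertex 3 connects to 1 and 2
    };

    // Degree of a vertex is the sum of its row.
    static int degree(int vertex) {
        int sum = 0;
        for (int v : MATRIX[vertex - 1]) sum += v;
        return sum;
    }
}
```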
  • Adjacency List (diagram: graph on vertices 1, 2, 3)
  • Adjacency List:
    1: 2, 3
    2: 1, 3
    3: 1, 2
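The same graph as an adjacency list can be sketched in plain Java (illustrative only); storage is proportional to the edges rather than n², which is the shape the talk maps onto HBase rows and column qualifiers:

```java
import java.util.List;
import java.util.Map;

// Sketch (not from the slides): the adjacency list above as a map from
// each vertex to its neighbor list. One map entry corresponds to one
// HBase row in the design that follows.
class AdjacencyListSketch {
    static final Map<Integer, List<Integer>> EDGES = Map.of(
        1, List.of(2, 3),
        2, List.of(1, 3),
        3, List.of(1, 2)
    );

    // Neighbors of a vertex; empty list for an unknown vertex.
    static List<Integer> neighbors(int vertex) {
        return EDGES.getOrDefault(vertex, List.of());
    }
}
```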
  • Adjacency List Design in HBase (diagram: vertices e:dan@fullcontact.com, p:+13039316251, t:danklynn)
  • Adjacency List Design in HBase:
    row key                  “edges” column family
    e:dan@fullcontact.com    p:+13039316251= ..., t:danklynn= ...
    p:+13039316251           e:dan@fullcontact.com= ..., t:danklynn= ...
    t:danklynn               e:dan@fullcontact.com= ..., p:+13039316251= ...
  • Adjacency List Design in HBase: same table, annotated “What to store?” for the edge values
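The row keys above prefix each identifier with its type: "e:" for email, "p:" for phone, "t:" for Twitter handle. A hypothetical helper (not FullContact's actual code) for building such keys:

```java
// Sketch (hypothetical helper, not FullContact's actual code): building
// typed row keys like "e:dan@fullcontact.com" or "t:danklynn". The type
// prefix keeps identifier namespaces from colliding and clusters rows of
// the same type together in HBase's sorted key space.
class RowKeySketch {
    static String emailKey(String email)     { return "e:" + email; }
    static String phoneKey(String phone)     { return "p:" + phone; }
    static String twitterKey(String handle)  { return "t:" + handle; }
}
```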
  • Custom Writables (java):
    package org.apache.hadoop.io;

    public interface Writable {
      void write(java.io.DataOutput dataOutput) throws java.io.IOException;
      void readFields(java.io.DataInput dataInput) throws java.io.IOException;
    }
  • Custom Writables (groovy):
    class EdgeValueWritable implements Writable {
      EdgeValue edgeValue

      void write(DataOutput dataOutput) {
        dataOutput.writeDouble edgeValue.weight
      }

      void readFields(DataInput dataInput) {
        Double weight = dataInput.readDouble()
        edgeValue = new EdgeValue(weight)
      }
      // ...
    }
  • Don’t get fancy with byte[] (groovy):
    class EdgeValueWritable implements Writable {
      EdgeValue edgeValue

      byte[] toBytes() {
        // use strings if you can help it
      }

      static EdgeValueWritable fromBytes(byte[] bytes) {
        // use strings if you can help it
      }
    }
  • Querying by vertex (groovy):
    def get = new Get(vertexKeyBytes)
    get.addFamily(edgesFamilyBytes)
    Result result = table.get(get)
    result.noVersionMap.each { family, data ->
      // construct edge objects as needed
      // data is a Map<byte[], byte[]>
    }
  • Adding edges to a vertex (groovy):
    def put = new Put(vertexKeyBytes)
    put.add(
      edgesFamilyBytes,
      destinationVertexBytes,
      edgeValue.toBytes() // your own implementation here
    )
    // if writing directly
    table.put(put)
    // if using TableReducer
    context.write(NullWritable.get(), put)
  • Distributed Traversal / Indexing (diagram: e:dan@fullcontact.com, p:+13039316251, t:danklynn)
  • Distributed Traversal / Indexing: p:+13039316251 is the pivot vertex
  • Distributed Traversal / Indexing: MapReduce over outbound edges
  • Distributed Traversal / Indexing: emit vertexes and edge data grouped by the pivot
  • Distributed Traversal / Indexing: reduce key p:+13039316251, with “out” vertex e:dan@fullcontact.com and “in” vertex t:danklynn
  • Distributed Traversal / Indexing: reducer emits higher-order edge e:dan@fullcontact.com to t:danklynn
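The reduce step above can be sketched without Hadoop (a minimal in-memory sketch, not the talk's actual MapReduce code): for a pivot vertex, pair up its neighbors and emit a higher-order edge between each pair:

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Sketch (in-memory, no Hadoop): the traversal's reduce step. Given the
// neighbors grouped under one pivot vertex, emit an edge between every
// pair of them, so vertices one hop apart via the pivot become directly
// connected in the next iteration's graph.
class PivotReduceSketch {
    // neighbors: all vertices adjacent to the pivot (the reduce group's values)
    static Set<String> higherOrderEdges(List<String> neighbors) {
        Set<String> edges = new TreeSet<>();
        for (int i = 0; i < neighbors.size(); i++) {
            for (int j = i + 1; j < neighbors.size(); j++) {
                edges.add(neighbors.get(i) + "->" + neighbors.get(j));
            }
        }
        return edges;
    }
}
```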
  • Distributed Traversal / Indexing: Iteration 0
  • Distributed Traversal / Indexing: Iteration 1
  • Distributed Traversal / Indexing: Iteration 2
  • Distributed Traversal / Indexing: Iteration 2, reusing edges created during previous iterations
  • Distributed Traversal / Indexing: Iteration 3
  • Distributed Traversal / Indexing: Iteration 3, reusing edges created during previous iterations
  • Distributed Traversal / Indexing: 2ⁿ hops requires only n iterations
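The doubling effect can be sketched by composing the edge relation with itself (a minimal sketch on a hypothetical chain graph, not the talk's code): each iteration adds an edge u to w wherever u to v and v to w already exist, so the path length covered goes 1, 2, 4, 8, ...

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch (not the talk's code): why reusing edges from previous
// iterations reaches 2^n hops in n iterations.
class DoublingSketch {
    // One iteration: keep existing edges and add u->w for every u->v, v->w.
    static Map<Integer, Set<Integer>> iterate(Map<Integer, Set<Integer>> edges) {
        Map<Integer, Set<Integer>> next = new HashMap<>();
        for (Map.Entry<Integer, Set<Integer>> e : edges.entrySet()) {
            next.put(e.getKey(), new HashSet<>(e.getValue()));
        }
        for (Map.Entry<Integer, Set<Integer>> e : edges.entrySet()) {
            for (int mid : e.getValue()) {
                next.get(e.getKey()).addAll(edges.getOrDefault(mid, Set.of()));
            }
        }
        return next;
    }

    // Vertices reachable from vertex 1 on the chain 1->2->3->4->5
    // (hypothetical example graph) after the given number of iterations.
    static Set<Integer> reachAfter(int iterations) {
        Map<Integer, Set<Integer>> g = new HashMap<>();
        for (int v = 1; v < 5; v++) g.put(v, new HashSet<>(Set.of(v + 1)));
        for (int i = 0; i < iterations; i++) g = iterate(g);
        return g.get(1);
    }
}
```

After two iterations, vertex 1 reaches everything up to four hops away, matching the 2ⁿ claim on the slide.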
  • Tips / Gotchas
  • Do implement your own comparator (java):
    public static class Comparator extends WritableComparator {
      public int compare(
          byte[] b1, int s1, int l1,
          byte[] b2, int s2, int l2) {
        // .....
      }
    }
  • Do implement your own comparator (java):
    static {
      WritableComparator.define(VertexKeyWritable.class,
          new VertexKeyWritable.Comparator());
    }
  • MultiScanTableInputFormat (java):
    MultiScanTableInputFormat.setTable(conf, "graph");
    MultiScanTableInputFormat.addScan(conf, new Scan());
    job.setInputFormatClass(MultiScanTableInputFormat.class);
  • TableMapReduceUtil (java):
    TableMapReduceUtil.initTableReducerJob(
        "graph", MyReducer.class, job);
  • Elastic MapReduce
  • Elastic MapReduce: HFiles
  • Elastic MapReduce: HFiles, copied to S3 as SequenceFiles
  • Elastic MapReduce: Elastic MapReduce job reads SequenceFiles and writes SequenceFiles
  • Elastic MapReduce: HFileOutputFormat.configureIncrementalLoad(job, outputTable) writes HFiles
  • Elastic MapReduce: HFiles loaded into HBase with $ hadoop jar hbase-VERSION.jar completebulkload
  • Additional Resources:
    Google Pregel: BSP-based graph processing system
    Apache Giraph: implementation of Pregel for Hadoop
    MultiScanTableInputFormat: code to appear on GitHub
    Apache Mahout: distributed machine learning on Hadoop
  • Thanks! dan@fullcontact.com