HBaseCon 2013: Rebuilding for Scale on Apache HBase


Presented by: Robert Roland, Simply Measured

Published in: Technology

  1. Rebuilding from MongoDB for Scale on HBase
     Robert Roland (@robdaemon), Lead Software Engineer
  2. Who Are We
     © 2013 Simply Measured, Inc
     • Social Media Analytics
     • Serving 25% of the Interbrand Top 100 Global Brands
     • Collecting data from Twitter, Facebook, Instagram, YouTube, Google Analytics and more
     • Delivered in a marketer’s favorite format: Excel!
     • GeekWire’s 2013 Startup of the Year
  3. Twitter Account Report, jetBlue Airlines
  4. Complete Social Media Snapshot, Red Bull
  5. Our Setup
     • Cloudera CDH 4.2.1 on Ubuntu 12.04
     • 3 “control” nodes (HDFS name node, job tracker, HBase master)
     • 11 “data” nodes (HDFS data nodes, task trackers, HBase region servers)
     • Bare-metal, using managed colo hosting
     • 14 TB of data across 50 HBase tables (one table per data source, e.g. Twitter, Facebook, plus secondary indexes and operations)
     • Our customers have tracked 1.5 billion Tweets so far, and growing
     • More than 5 million rows across 3,000 reports generated daily
  6. Why did we start with MongoDB?
     • Ease of development
     • Lack of schema is a good and a bad thing
     • Easy to use libraries in Ruby (mongomatic)
     • Lower initial investment
     • You can run MongoDB on one server in production (but you shouldn’t)
     • Master/slave was easier to start with than the current sharding model
     • Active, engaged community
     • Really, really effective marketing masks MongoDB's shortcomings…
  7. Why leave MongoDB?
     • 10 TB of data was too much for it
     • Instability
     • No one wants to restart mongos every two days
     • Bugs
     • Very public, very high profile failures
     • Silent failure modes
     • What do you mean, the index creation failed and you just attempted to load an entire collection into RAM?
  8. Why HBase, or, How is this better than Mongo?
     • More linear scaling
     • Add new nodes, run a major compaction
     • Our data can easily be modeled as a sparse column store
     • We value consistency
     • Our users will tell us if we’re missing one Tweet out of 1 million
     • Monitoring
     • Metrics
     • Stability
     • Excellent vendor support
  9. How did we migrate?
     • Implemented the Strangler pattern
     • Dual writes to Mongo and HBase
     • Migrated older data a few customers at a time, a few data sources at a time
     • Dual report generation platform: enabled us to compare reports from our MongoDB platform against those from our HBase platform
     • Migrated existing Ruby 1.8 code to JRuby
     • Direct access to the HBase cluster
     • I’m a better Java developer than Ruby developer
     • Great profiler tools
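The dual-write and report-comparison steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the code Simply Measured ran: `DictStore`, `DualWriter`, and `diff_reports` are hypothetical names, and plain in-memory dictionaries stand in for the MongoDB and HBase clients.

```python
class DictStore:
    """In-memory stand-in for a datastore client (hypothetical)."""

    def __init__(self):
        self.rows = {}

    def put(self, key, value):
        self.rows[key] = value

    def get(self, key):
        return self.rows.get(key)


class DualWriter:
    """Send every write to both stores; keep reads on the legacy store until cutover."""

    def __init__(self, legacy, replacement):
        self.legacy = legacy
        self.replacement = replacement
        self.read_from_replacement = False  # flip once the new platform is trusted

    def put(self, key, value):
        self.legacy.put(key, value)
        self.replacement.put(key, value)

    def get(self, key):
        store = self.replacement if self.read_from_replacement else self.legacy
        return store.get(key)


def diff_reports(legacy, replacement, keys):
    """Return the keys whose values differ between the two stores."""
    return [k for k in keys if legacy.get(k) != replacement.get(k)]


mongo, hbase = DictStore(), DictStore()
writer = DualWriter(mongo, hbase)
mongo.put("tweet:0", {"text": "old"})      # written before dual writes began
writer.put("tweet:1", {"text": "hello"})   # written to both stores
print(diff_reports(mongo, hbase, ["tweet:0", "tweet:1"]))  # ['tweet:0']
```

The diff lists exactly the rows that still need a backfill, which is one way the "few customers at a time" migration could be tracked.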
  10. Challenges with HBase
     • Out of the box configuration is not good enough
     • GC Tuning
     • Default file size of 256 MB is too small
     • Compactions will eat you alive
     • Be sure to enable the HDFS trashcan! (fs.trash.interval)
     • Configurations can be difficult to manage
     • Chef / Puppet if you want to roll your own
     • Cloudera Manager
     • Schema
     • Lack of types means the HBase shell is harder to read
     • No native secondary indexes
  11. What’s been awesome
     • Great user community on the mailing lists
     • Source code is easy to follow, and patches are easy to submit
     • Stable, stable, stable
     • Unless you configure something wrong, like ulimits or xcievers!
  12. Google Analytics schema
     • Row key
     • salt|dataSourceId|dimension|…|date
     • Columns
     • CF: metadata
     • dataSourceId: hash, identifies a data source mapping to a customer
     • timezone: Google Analytics time zone for this data
     • dimensions: delimited list of Google Analytics dimensions
     • CF: data
     • GA data key: name of metric as defined by Google Analytics, Protocol Buffer representing value
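A row key of the shape `salt|dataSourceId|dimension|…|date` could be assembled along these lines. The slide does not say how the salt is computed or how the date is encoded, so everything specific here is an assumption: the salt is taken from an MD5 hash of the data source ID, there are 16 buckets, and the date is written as `YYYYMMDD`. The function names are hypothetical.

```python
import hashlib
from datetime import date

NUM_BUCKETS = 16  # assumption: the slide does not state a bucket count


def salt_for(data_source_id, buckets=NUM_BUCKETS):
    """Derive a stable two-digit salt from the data source ID (assumed scheme)."""
    digest = hashlib.md5(data_source_id.encode("utf-8")).digest()
    return f"{digest[0] % buckets:02d}"


def make_row_key(data_source_id, dimensions, day):
    """Join salt, data source ID, dimension values, and date with '|'."""
    parts = [salt_for(data_source_id), data_source_id, *dimensions,
             day.strftime("%Y%m%d")]  # assumed date encoding
    return "|".join(parts).encode("utf-8")


key = make_row_key("abc123", ["browser", "country"], date(2013, 6, 13))
print(key)
```

Hashing only the data source ID keeps all rows for one source in the same bucket, so a per-source date range stays a single contiguous scan. A real key would also need to escape any `|` that appears inside a dimension value.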
  13. Schema evolution
     • Enrichment via Klout, geolocation data, etc. was stored in a separate table, and “joined” at query time
     • Enrichment is now a column family within each data source
     • Started with Protocol Buffers, now writing as qualifiers and columns
     • Ability to use server-side filters, ease of use within HBase shell
     • Protocol Buffers still exist in some cases (example coming up)
     • Rekeyed several tables over time
     • Dual write during migration
     • Map/reduce jobs to migrate data
     • More column families per table than were necessary
     • Lots and lots of memstore pain.
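The move from one serialized Protocol Buffer per row to "qualifiers and columns" amounts to writing one cell per field. The sketch below shows that flattening in isolation; `to_cells` and the field names (`klout_score`, `country`, `city`) are hypothetical, and values are stored as UTF-8 strings purely for readability.

```python
def to_cells(family, record):
    """Turn a flat dict into one (family, qualifier, value) cell per field."""
    cells = []
    for name, value in sorted(record.items()):
        if value is None:
            continue  # sparse storage: a missing field produces no cell at all
        cells.append((family.encode("utf-8"),
                      name.encode("utf-8"),
                      str(value).encode("utf-8")))
    return cells


enrichment = {"klout_score": 54, "country": "US", "city": None}
for cell in to_cells("enrichment", enrichment):
    print(cell)
# (b'enrichment', b'country', b'US')
# (b'enrichment', b'klout_score', b'54')
```

With each field in its own qualifier, a server-side filter can compare a single column's value without deserializing a blob, and the HBase shell shows each field by name.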
  14. Protocol Buffers
     • Union-like Protocol Buffer for arbitrary key/value pairs
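The message definition shown on this slide did not survive in the transcript. As a rough Python analogue of the general "union-like" pattern (one optional field per supported type, with exactly one populated), consider the sketch below. The class name and field names are assumptions for illustration, not the original schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Value:
    """Union-like value: exactly one of the typed fields may be set."""
    string_value: Optional[str] = None
    int_value: Optional[int] = None
    double_value: Optional[float] = None

    def _fields(self):
        return (self.string_value, self.int_value, self.double_value)

    def __post_init__(self):
        populated = [f for f in self._fields() if f is not None]
        if len(populated) != 1:
            raise ValueError("exactly one field must be set")

    def get(self):
        """Return whichever field is populated."""
        return next(f for f in self._fields() if f is not None)


pairs = {"followers": Value(int_value=1200), "lang": Value(string_value="en")}
print({k: v.get() for k, v in pairs.items()})  # {'followers': 1200, 'lang': 'en'}
```

A dictionary of such values gives arbitrary key/value pairs while still recording each value's type, which the untyped bytes in an HBase cell do not.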
  15. Tips and Tricks
     • Don’t expect to get your row keys correct on the first try
     • Look for hot spotting
     • Does ordering matter during queries?
     • Consider your backup strategy
     • S3 seems to work for us
     • Replication is also an option
     • This is not an RDBMS. Don’t JOIN, denormalize!
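The hot-spotting and ordering questions pull against each other: a salt prefix spreads writes across regions, but a range query then has to visit every salt bucket and merge the results. A minimal sketch of that fan-out, assuming a two-digit salt prefix and string keys for readability (HBase keys are bytes); `scan_ranges` and `merge_unsalted` are hypothetical helpers, not HBase APIs.

```python
import heapq


def scan_ranges(prefix, start, stop, buckets):
    """One (start_row, stop_row) pair per salt bucket; stop_row is exclusive."""
    return [(f"{b:02d}|{prefix}{start}", f"{b:02d}|{prefix}{stop}")
            for b in range(buckets)]


def merge_unsalted(per_bucket_results):
    """Merge per-bucket results (each already sorted) into one ordered list,
    dropping the salt prefix from every key."""
    stripped = ([(key.split("|", 1)[1], value) for key, value in rows]
                for rows in per_bucket_results)
    return list(heapq.merge(*stripped, key=lambda kv: kv[0]))


print(scan_ranges("src1|", "20130601", "20130701", buckets=2))
bucket0 = [("00|src1|20130601", 1), ("00|src1|20130603", 3)]
bucket1 = [("01|src1|20130602", 2)]
print(merge_unsalted([bucket0, bucket1]))
# [('src1|20130601', 1), ('src1|20130602', 2), ('src1|20130603', 3)]
```

If queries never need global ordering, the merge step can be dropped and the buckets read in parallel, which is part of why the ordering question is worth settling before fixing a key design.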
  16. What’s coming up
     • Hive
     • Easier querying
     • Entirely standardized representation of our data
     • HCatalog
     • Expose our schema to other internal tooling
     • A much larger cluster
     • Even more data sources
  17. Thank You
     • Robert Roland, @robdaemon
     • We’re hiring!
     • Our Tech Blog:
     • Our Open Source:
     • More sample reports: