• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
HBaseCon 2013: Rebuilding for Scale on Apache HBase
 

HBaseCon 2013: Rebuilding for Scale on Apache HBase

on

  • 1,041 views

Presented by: Robert Roland, Simply Measured

Presented by: Robert Roland, Simply Measured

Statistics

Views

Total Views
1,041
Views on SlideShare
1,040
Embed Views
1

Actions

Likes
1
Downloads
0
Comments
0

1 Embed 1

https://twitter.com 1

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment
  • Discuss our “enrichment” idea here with the denormalize part
  • Mention Impala or other options

HBaseCon 2013: Rebuilding for Scale on Apache HBase HBaseCon 2013: Rebuilding for Scale on Apache HBase Presentation Transcript

  • Rebuilding from MongoDB for Scale on HBase Robert Roland (@robdaemon) Lead Software Engineer rob@simplymeasured.com http://www.simplymeasured.com
  • © 2013 Simply Measured, Inc Who Are We Social Media Analytics Serving 25% of the Interbrand Top 100 Global Brands Collecting data from Twitter, Facebook, Instagram, YouTube, Google Analytics and more Delivered in a marketer’s favorite format – Excel! GeekWire’s 2013 Startup of the Year 2
  • © 2013 Simply Measured, Inc Twitter Account Report, jetBlue Airlines 3
  • © 2013 Simply Measured, Inc Complete Social Media Snapshot, Red Bull 4
  • © 2013 Simply Measured, Inc Our Setup Cloudera CDH 4.2.1 on Ubuntu 12.04 3 “control” nodes (HDFS name node, job tracker, HBase master) 11 “data” nodes (HDFS data nodes, task trackers, HBase region servers) Bare-metal, using managed colo hosting 14 TB of data across 50 HBase tables (One table per data source -Twitter, Facebook, plus secondary indexes and operations) Our customers have tracked 1.5 billion Tweets so far, and growing More than 5 million rows across 3,000 reports generated daily 5
  • © 2013 Simply Measured, Inc Why did we start with MongoDB? • Ease of development • Lack of schema is a good and a bad thing • Easy to use libraries in Ruby (mongomatic) • Lower initial investment • You can run MongoDB on one server in production (but you shouldn’t) • Master/slave was easier to start with than the current sharding model • Active, engaged community • Really, really effective marketing masks MongoDB's shortcomings… 6
  • © 2013 Simply Measured, Inc Why leave MongoDB? • 10 TB of data was too much for it • Instability • No one wants to restart mongos every two days • Bugs • Very public, very high profile failures • Silent failure modes • What do you mean, the index creation failed and you just attempted to load an entire collection into RAM? 7
  • © 2013 Simply Measured, Inc Why HBase, or, How is this better than Mongo? • More linear scaling • Add new nodes, run a major compaction • Our data can easily be modeled as a sparse column store • We value consistency • Our users will tell us if we’re missing one Tweet out of 1 million • Monitoring • Metrics • Stability • Excellent vendor support 8
  • © 2013 Simply Measured, Inc How did we migrate? • Implemented the Strangler pattern • Dual writes to Mongo and HBase • Migrate older data a few customers at a time, a few data sources at a time • Dual report generation platform – enabled us to compare reports off our MongoDB platform and our HBase platform • Migrated existing Ruby 1.8 code to JRuby • Direct access to the HBase cluster • I’m a better Java developer than Ruby developer • Great profiler tools 9
  • © 2013 Simply Measured, Inc Challenges with HBase • Out of the box configuration is not good enough • GC Tuning • Default file size of 256 mb is too small • Compactions will eat you alive • Be sure to enable the HDFS trashcan! (fs.trash.interval) • Configurations can be difficult to manage • Chef / Puppet if you want to roll your own • Cloudera Manager • Schema • Lack of types means the HBase shell is harder to read • No native secondary indexes 10
  • © 2013 Simply Measured, Inc What’s been awesome • Great user community on the mailing lists • Source code is easy to follow and submit patches • Stable, stable, stable • Unless you configure something wrong, like ulimits or xcievers! 11
  • © 2013 Simply Measured, Inc Google Analytics schema • Row key • salt|dataSourceId|dimension|…|date • Columns • CF: metadata • dataSourceId – hash, identifies a data source mapping to a customer • timezone – Google Analytics time zone for this data • dimensions – delimited list of Google Analytics dimensions • CF: data • GA data key – Name of metric as defined by Google Analytics, Protocol Buffer representing value 12
  • © 2013 Simply Measured, Inc Schema evolution • Enrichment via Klout, geolocation data, Bit.ly, etc. was stored in a separate table, and “joined” at query time • Enrichment is now a column family within each data source • Started with Protocol Buffers, now writing as qualifiers and columns • Ability to use server-side filters, ease of use within HBase shell • Protocol Buffers still exist in some cases (example coming up) • Rekeyed several tables over time • Dual write during migration • Map/reduce jobs to migrate data • More column families per table than were necessary • Lots and lots of memstore pain. 13
  • © 2013 Simply Measured, Inc Protocol Buffers • Union-like Protocol Buffer for arbitrary key/value pairs 14
  • © 2013 Simply Measured, Inc Tips and Tricks • Don’t expect to get your row keys correct on the first try • Look for hot spotting • Does ordering matter during queries? • Consider your backup strategy • S3 seems to work for us • Replication is also an option • This is not an RDBMS. Don’t JOIN, denormalize! 15
  • © 2013 Simply Measured, Inc What’s coming up • Hive • Easier querying • Entirely standardized representation of our data • HCatalog • Expose our schema to other internal tooling • A much larger cluster • Even more data sources 16
  • Thank You Robert Roland @robdaemon rob@simplymeasured.com We’re hiring! http://simplymeasured.com/about/careers/ Our Tech Blog: http://engineering.simplymeasured.com/ Our Open Source: http://simplymeasured.github.io/ More sample reports: http://simplymeasured.com/tour/sample-reports/