Rebuilding from MongoDB for Scale on HBase


Simply Measured's migration from MongoDB to HBase

Published in: Technology

Speaker notes:
  • Introduce myself
  • Twitter and Klout metrics, expansion, geolocation
  • Facebook, YouTube, Google+, Twitter, Instagram
  • Mention recent upgrade to CDH4 from CDH3
  • Mention initial rollout didn’t use Cloudera Manager; talk about Chef
  • Mention clock skew
  • Could have flattened into columns, used a Hungarian-style notation, etc.
  • Discuss our “enrichment” idea here with the denormalize part
  • Mention Impala or other options

    1. Rebuilding from MongoDB for Scale on HBase
       Robert Roland (@robdaemon), Lead Software Engineer
       rob@simplymeasured.com
    2. Who Are We
       © 2013 Simply Measured, Inc
       • Social Media Analytics
       • Serving 25% of the Interbrand Top 100 Global Brands
       • Collecting data from Twitter, Facebook, Instagram, YouTube, Google Analytics and more
       • Delivered in a marketer’s favorite format – Excel!
       • GeekWire’s 2013 Startup of the Year
    3. Twitter Account Report, jetBlue Airlines
    4. Complete Social Media Snapshot, Red Bull
    5. Our Setup
       • Cloudera CDH 4.2.1 on Ubuntu 12.04
       • 3 “control” nodes (HDFS name node, job tracker, HBase master)
       • 11 “data” nodes (HDFS data nodes, task trackers, HBase region servers)
       • Bare-metal, using managed colo hosting
       • 14 TB of data across 50 HBase tables (one table per data source: Twitter, Facebook, plus secondary indexes and operations)
       • Our customers have tracked 1.5 billion Tweets so far, and growing
       • More than 5 million rows across 3,000 reports generated daily
    6. Why did we start with MongoDB?
       • Ease of development
         • Lack of schema is a good and a bad thing
         • Easy to use libraries in Ruby (mongomatic)
       • Lower initial investment
         • You can run MongoDB on one server in production (but you shouldn’t)
         • Master/slave was easier to start with than the current sharding model
       • Active, engaged community
       • Really, really effective marketing masks MongoDB’s shortcomings…
    7. Why leave MongoDB?
       • 10 TB of data was too much for it
       • Instability
         • No one wants to restart mongos every two days
       • Bugs
         • Very public, very high profile failures
       • Silent failure modes
         • What do you mean, the index creation failed and you just attempted to load an entire collection into RAM?
    8. Why HBase, or, How is this better than Mongo?
       • More linear scaling
         • Add new nodes, run a major compaction
       • Our data can easily be modeled as a sparse column store
       • We value consistency
         • Our users will tell us if we’re missing one Tweet out of 1 million
       • Monitoring
       • Metrics
       • Stability
       • Excellent vendor support
    9. How did we migrate?
       • Implemented the Strangler pattern
         • Dual writes to Mongo and HBase
         • Migrate older data a few customers at a time, a few data sources at a time
         • Dual report generation platform – enabled us to compare reports off our MongoDB platform and our HBase platform
       • Migrated existing Ruby 1.8 code to JRuby
         • Direct access to the HBase cluster
         • I’m a better Java developer than Ruby developer
         • Great profiler tools
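The dual-write and per-customer migration steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the code from the talk: `InMemoryStore`, `DualWriter`, `backfill` and `mismatches` are hypothetical names, and plain dictionaries stand in for the MongoDB and HBase clients.

```python
class InMemoryStore:
    """Stand-in for a real datastore client (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def put(self, key, value):
        self.rows[key] = value

    def get(self, key):
        return self.rows.get(key)


class DualWriter:
    """Sends every write to both stores; reads switch over per customer."""

    def __init__(self, legacy, new):
        self.legacy = legacy
        self.new = new
        self.migrated_customers = set()

    def put(self, customer_id, key, value):
        # New data lands in both systems from day one.
        self.legacy.put((customer_id, key), value)
        self.new.put((customer_id, key), value)

    def get(self, customer_id, key):
        # Reads stay on the legacy store until the customer is migrated.
        store = self.new if customer_id in self.migrated_customers else self.legacy
        return store.get((customer_id, key))

    def backfill(self, customer_id):
        """Copy one customer's older rows to the new store, then switch reads."""
        for (cid, key), value in list(self.legacy.rows.items()):
            if cid == customer_id and (cid, key) not in self.new.rows:
                self.new.put((cid, key), value)
        self.migrated_customers.add(customer_id)

    def mismatches(self, customer_id):
        """Keys whose values differ between the stores (a crude report diff)."""
        return sorted(
            key
            for (cid, key), value in self.legacy.rows.items()
            if cid == customer_id and self.new.rows.get((cid, key)) != value
        )
```

Calling `mismatches` before and after `backfill` plays the role of the dual report comparison: a customer is only worth switching over once the list comes back empty.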
    10. Challenges with HBase
        • Out of the box configuration is not good enough
          • GC tuning
          • Default file size of 256 MB is too small
          • Compactions will eat you alive
          • Be sure to enable the HDFS trashcan! (fs.trash.interval)
        • Configurations can be difficult to manage
          • Chef / Puppet if you want to roll your own
          • Cloudera Manager
        • Schema
          • Lack of types means the HBase shell is harder to read
          • No native secondary indexes
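Because there are no native secondary indexes, an index is usually kept by hand in a second table. The sketch below shows one common way to do that, assuming an index row key of the form `author|tweet_id`; the table class, column names and functions are hypothetical and are not taken from the talk.

```python
import bisect


class SortedTable:
    """Tiny stand-in for an HBase table: rows kept sorted by row key."""

    def __init__(self):
        self._keys = []
        self._rows = {}

    def put(self, row_key, columns):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key].update(columns)

    def get(self, row_key):
        return self._rows.get(row_key)

    def scan_prefix(self, prefix):
        """Yield (row_key, columns) for keys starting with prefix, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            yield key, self._rows[key]


def put_tweet(data_table, index_table, tweet_id, author, text):
    """Write the data row, then an index row keyed by author plus tweet id."""
    data_table.put(tweet_id, {"author": author, "text": text})
    index_table.put(f"{author}|{tweet_id}", {"ref": tweet_id})


def tweets_by_author(data_table, index_table, author):
    """Prefix-scan the index table, then fetch each referenced data row."""
    return [
        data_table.get(columns["ref"])
        for _, columns in index_table.scan_prefix(f"{author}|")
    ]
```

The two writes in `put_tweet` are not atomic, so a real implementation would need some way to repair or tolerate index rows that are missing or stale.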
    11. What’s been awesome
        • Great user community on the mailing lists
        • Source code is easy to follow, and patches are easy to submit
        • Stable, stable, stable
          • Unless you configure something wrong, like ulimits or xcievers!
    12. Google Analytics schema
        • Row key
          • salt|dataSourceId|dimension|…|date
        • Columns
          • CF: metadata
            • dataSourceId – hash, identifies a data source mapping to a customer
            • timezone – Google Analytics time zone for this data
            • dimensions – delimited list of Google Analytics dimensions
          • CF: data
            • GA data key – name of the metric as defined by Google Analytics; the value is a Protocol Buffer
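A row key of this shape can be assembled along the following lines. This is only a sketch under stated assumptions: the bucket count, the choice to derive the salt from a hash of dataSourceId, the two-digit salt format and the `YYYYMMDD` date format are all illustrative, and the elided middle of the key is not reconstructed; the function simply accepts a list of dimension strings.

```python
import hashlib

NUM_SALT_BUCKETS = 16  # assumed bucket count, not from the talk
SEP = "|"


def salt_for(data_source_id, buckets=NUM_SALT_BUCKETS):
    """Derive a stable bucket number from the data source id (an assumption)."""
    digest = hashlib.sha256(data_source_id.encode("utf-8")).digest()
    return digest[0] % buckets


def make_row_key(data_source_id, dimensions, date):
    """Build salt|dataSourceId|dimension...|date.

    dimensions is a list of strings; date is a 'YYYYMMDD' string so that
    keys for the same data source and dimensions sort chronologically.
    """
    salt = f"{salt_for(data_source_id):02d}"
    return SEP.join([salt, data_source_id, *dimensions, date])


def scan_prefix(data_source_id, dimensions=()):
    """Prefix that matches every date for one data source and dimension list."""
    salt = f"{salt_for(data_source_id):02d}"
    return SEP.join([salt, data_source_id, *dimensions]) + SEP
```

Hashing the data source id, rather than using a random salt, keeps all rows for one data source in one bucket, so a single prefix scan can read a date range while different data sources still spread across buckets.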
    13. Schema evolution
        • Enrichment via Klout, geolocation data, etc. was stored in a separate table, and “joined” at query time
          • Enrichment is now a column family within each data source
        • Started with Protocol Buffers, now writing as qualifiers and columns
          • Ability to use server-side filters, ease of use within HBase shell
          • Protocol Buffers still exist in some cases (example coming up)
        • Rekeyed several tables over time
          • Dual write during migration
          • Map/reduce jobs to migrate data
        • More column families per table than were necessary
          • Lots and lots of memstore pain.
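The move from a separate enrichment table to an enrichment column family can be pictured with plain dictionaries. This is a rough sketch only: the `data:` and `enrichment:` prefixes mimic family:qualifier naming, and the function names and sample fields are hypothetical.

```python
def read_with_join(data_table, enrichment_table, row_key):
    """Old layout: two lookups, merged by the client at query time."""
    row = {f"data:{k}": v for k, v in data_table[row_key].items()}
    extra = enrichment_table.get(row_key, {})
    row.update({f"enrichment:{k}": v for k, v in extra.items()})
    return row


def write_denormalized(table, row_key, data=None, enrichment=None):
    """New layout: one row, with enrichment under its own column family."""
    row = table.setdefault(row_key, {})
    row.update({f"data:{k}": v for k, v in (data or {}).items()})
    row.update({f"enrichment:{k}": v for k, v in (enrichment or {}).items()})


def read_denormalized(table, row_key):
    """New layout: a single lookup returns data and enrichment together."""
    return dict(table[row_key])
```

In the second layout the enrichment can be written later than the original data, since both writes target the same row, and the reader no longer needs to know that a second table exists.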
    14. Protocol Buffers
        • Union-like Protocol Buffer for arbitrary key/value pairs
    15. Tips and Tricks
        • Don’t expect to get your row keys correct on the first try
          • Look for hot spotting
          • Does ordering matter during queries?
        • Consider your backup strategy
          • S3 seems to work for us
          • Replication is also an option
        • This is not an RDBMS. Don’t JOIN, denormalize!
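One way to look for hot spotting before loading real data is to bucket a sample of candidate row keys by region boundaries and see how uneven the result is. The sketch below assumes regions are defined by sorted split points, with each split point being the first key of the next region; the function names and sample keys are made up for illustration.

```python
import bisect


def region_counts(row_keys, split_points):
    """Count keys per region, given sorted split points.

    Region 0 holds keys below split_points[0]; region i holds keys from
    split_points[i - 1] (inclusive) up to split_points[i] (exclusive).
    """
    counts = [0] * (len(split_points) + 1)
    for key in row_keys:
        counts[bisect.bisect_right(split_points, key)] += 1
    return counts


def hottest_share(counts):
    """Fraction of all keys that fall in the busiest region."""
    total = sum(counts)
    return max(counts) / total if total else 0.0
```

With date-leading keys every write in a given period tends to fall in one region, so the share approaches 1.0; a salted prefix pulls it down toward an even split across regions.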
    16. What’s coming up
        • Hive
          • Easier querying
          • Entirely standardized representation of our data
        • HCatalog
          • Expose our schema to other internal tooling
        • A much larger cluster
        • Even more data sources
    17. Thank You
        Robert Roland
        @robdaemon
        rob@simplymeasured.com
        We’re hiring! Tech Blog: Open Source: sample reports: