HBaseCon 2013: Rebuilding for Scale on Apache HBase

Cloudera, Inc.
Rebuilding from MongoDB for Scale on HBase
Robert Roland (@robdaemon)
Lead Software Engineer
rob@simplymeasured.com
http://www.simplymeasured.com
© 2013 Simply Measured, Inc
Who Are We
Social Media Analytics
Serving 25% of the Interbrand Top 100 Global Brands
Collecting data from Twitter, Facebook, Instagram, YouTube, Google Analytics and more
Delivered in a marketer’s favorite format – Excel!
GeekWire’s 2013 Startup of the Year
Twitter Account Report, jetBlue Airlines
Complete Social Media Snapshot, Red Bull
Our Setup
Cloudera CDH 4.2.1 on Ubuntu 12.04
3 “control” nodes (HDFS name node, job tracker, HBase master)
11 “data” nodes (HDFS data nodes, task trackers, HBase region servers)
Bare-metal, using managed colo hosting
14 TB of data across 50 HBase tables (one table per data source - Twitter, Facebook, plus secondary indexes and operations)
Our customers have tracked 1.5 billion Tweets so far, and growing
More than 5 million rows across 3,000 reports generated daily
Why did we start with MongoDB?
• Ease of development
• Lack of schema is a good and a bad thing
• Easy to use libraries in Ruby (mongomatic)
• Lower initial investment
• You can run MongoDB on one server in production (but you shouldn’t)
• Master/slave was easier to start with than the current sharding model
• Active, engaged community
• Really, really effective marketing masks MongoDB's shortcomings…
Why leave MongoDB?
• 10 TB of data was too much for it
• Instability
• No one wants to restart mongos every two days
• Bugs
• Very public, very high profile failures
• Silent failure modes
• What do you mean, the index creation failed and you just attempted to load an entire collection into RAM?
Why HBase, or, How is this better than Mongo?
• More linear scaling
• Add new nodes, run a major compaction
• Our data can easily be modeled as a sparse column store
• We value consistency
• Our users will tell us if we’re missing one Tweet out of 1 million
• Monitoring
• Metrics
• Stability
• Excellent vendor support
How did we migrate?
• Implemented the Strangler pattern
• Dual writes to Mongo and HBase
• Migrate older data a few customers at a time, a few data sources at a time
• Dual report generation platform – enabled us to compare reports off our MongoDB platform and our HBase platform
• Migrated existing Ruby 1.8 code to JRuby
• Direct access to the HBase cluster
• I’m a better Java developer than Ruby developer
• Great profiler tools
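The dual-write and dual-report steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code Simply Measured ran: `InMemoryStore`, `DualWriter`, and `diff_reports` are hypothetical names, and plain dictionaries stand in for the MongoDB and HBase clients.

```python
class InMemoryStore:
    """Hypothetical stand-in for a MongoDB or HBase client."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def put(self, key, value):
        self.rows[key] = value

    def get(self, key):
        return self.rows.get(key)


class DualWriter:
    """Send every write to both the legacy store and its replacement."""

    def __init__(self, legacy, replacement):
        self.legacy = legacy
        self.replacement = replacement

    def put(self, key, value):
        # Write to the legacy store first, then to the replacement.
        self.legacy.put(key, value)
        self.replacement.put(key, value)


def diff_reports(legacy, replacement, keys):
    """Return the keys whose values differ between the two stores."""
    return [k for k in keys if legacy.get(k) != replacement.get(k)]


mongo = InMemoryStore("mongo")
hbase = InMemoryStore("hbase")
writer = DualWriter(mongo, hbase)

# New data goes through the dual writer and lands in both stores.
writer.put("tweet:1", {"text": "hello"})
# Older data exists only in the legacy store until it is migrated.
mongo.put("tweet:2", {"text": "not yet migrated"})

mismatches = diff_reports(mongo, hbase, ["tweet:1", "tweet:2"])
print(mismatches)
```

Comparing the two stores key by key is one way to decide when a customer or data source is safe to cut over; anything still listed in `mismatches` has not been migrated yet.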
Challenges with HBase
• Out of the box configuration is not good enough
• GC Tuning
• Default file size of 256 MB is too small
• Compactions will eat you alive
• Be sure to enable the HDFS trashcan! (fs.trash.interval)
• Configurations can be difficult to manage
• Chef / Puppet if you want to roll your own
• Cloudera Manager
• Schema
• Lack of types means the HBase shell is harder to read
• No native secondary indexes
What’s been awesome
• Great user community on the mailing lists
• Source code is easy to follow, and it is easy to submit patches
• Stable, stable, stable
• Unless you configure something wrong, like ulimits or xcievers!
Google Analytics schema
• Row key
• salt|dataSourceId|dimension|…|date
• Columns
• CF: metadata
• dataSourceId – hash, identifies a data source mapping to a customer
• timezone – Google Analytics time zone for this data
• dimensions – delimited list of Google Analytics dimensions
• CF: data
• GA data key – Name of metric as defined by Google Analytics, Protocol Buffer representing value
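A salted row key of the shape `salt|dataSourceId|dimension|…|date` could be assembled along these lines. This is only a sketch: the slide does not say how the salt is derived, so the hash-of-`dataSourceId` scheme, the bucket count, and the sample values below are all assumptions, and the elided middle of the key is represented by whatever list of dimension values the caller passes in.

```python
import hashlib

SALT_BUCKETS = 16  # assumption: the slide does not give a bucket count
DELIMITER = "|"


def salt_for(data_source_id, buckets=SALT_BUCKETS):
    """Derive a stable salt bucket from the data source id (assumed scheme)."""
    digest = hashlib.md5(data_source_id.encode("utf-8")).digest()
    return digest[0] % buckets


def build_row_key(data_source_id, dimensions, date):
    """Join salt, data source id, dimension values, and date into one key."""
    salt = "{:02d}".format(salt_for(data_source_id))
    parts = [salt, data_source_id] + list(dimensions) + [date]
    return DELIMITER.join(parts)


# Hypothetical values, for illustration only.
key = build_row_key("a1b2c3", ["Chrome", "US"], "20130613")
print(key)
```

Deriving the salt from the data source id keeps all rows for one data source under a single prefix, so they can still be read with one range scan, while different data sources spread across buckets.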
Schema evolution
• Enrichment via Klout, geolocation data, Bit.ly, etc. was stored in a separate table, and “joined” at query time
• Enrichment is now a column family within each data source
• Started with Protocol Buffers, now writing as qualifiers and columns
• Ability to use server-side filters, ease of use within HBase shell
• Protocol Buffers still exist in some cases (example coming up)
• Rekeyed several tables over time
• Dual write during migration
• Map/reduce jobs to migrate data
• More column families per table than were necessary
• Lots and lots of memstore pain.
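The move from a separate enrichment table to an enrichment column family can be pictured with nested dictionaries of the form `{row key: {column family: {qualifier: value}}}`. The table contents and qualifier names below are hypothetical; the sketch only contrasts the two read paths.

```python
# Old layout: enrichment lives in its own table and is merged on every read.
tweets = {"row1": {"data": {"text": "hello"}}}
enrichment = {"row1": {"klout": "55", "geo": "Seattle"}}


def read_with_join(row_key):
    """Two lookups per read: the data table, then the enrichment table."""
    row = {family: dict(cols) for family, cols in tweets[row_key].items()}
    row["enrichment"] = dict(enrichment.get(row_key, {}))
    return row


# New layout: enrichment is written once as a column family of the same row.
denormalized = {
    "row1": {
        "data": {"text": "hello"},
        "enrichment": {"klout": "55", "geo": "Seattle"},
    }
}


def read_denormalized(row_key):
    """One lookup per read."""
    return denormalized[row_key]


# Both layouts return the same logical row.
assert read_with_join("row1") == read_denormalized("row1")
```

The denormalized layout pays the cost once at write time instead of on every query.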
Protocol Buffers
• Union-like Protocol Buffer for arbitrary key/value pairs
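The message definition shown on this slide is not part of this transcript. As a rough illustration of the union-like idea only, the Python sketch below models a value that carries exactly one of several optional typed fields; the class name, field names, and sample keys are hypothetical and are not taken from the original definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UnionValue:
    """Union-like value: exactly one of the optional fields is set."""

    string_value: Optional[str] = None
    int_value: Optional[int] = None
    double_value: Optional[float] = None

    def _fields(self):
        return (self.string_value, self.int_value, self.double_value)

    def __post_init__(self):
        populated = [v for v in self._fields() if v is not None]
        if len(populated) != 1:
            raise ValueError("exactly one field must be set")

    def unwrap(self):
        """Return whichever field is populated."""
        for value in self._fields():
            if value is not None:
                return value


# Arbitrary key/value pairs, each value tagged by the field it uses.
pairs = {
    "visits": UnionValue(int_value=1200),
    "bounceRate": UnionValue(double_value=0.42),
    "source": UnionValue(string_value="twitter.com"),
}
print({name: value.unwrap() for name, value in pairs.items()})
```

A reader of such a value checks which field is populated rather than relying on a schema per key, which is what makes the pattern suit arbitrary key/value data.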
Tips and Tricks
• Don’t expect to get your row keys correct on the first try
• Look for hot spotting
• Does ordering matter during queries?
• Consider your backup strategy
• S3 seems to work for us
• Replication is also an option
• This is not an RDBMS. Don’t JOIN, denormalize!
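One simple way to look for hot spotting is to count how many row keys fall under each leading prefix. The sketch below assumes keys that begin with a salt followed by a `|` delimiter; the function names and sample keys are hypothetical.

```python
from collections import Counter


def keys_per_prefix(row_keys, delimiter="|"):
    """Count row keys by their leading prefix (for example, a salt)."""
    return Counter(key.split(delimiter, 1)[0] for key in row_keys)


def hottest_share(row_keys, delimiter="|"):
    """Fraction of keys that fall under the most common prefix."""
    if not row_keys:
        return 0.0
    counts = keys_per_prefix(row_keys, delimiter)
    return max(counts.values()) / len(row_keys)


# Hypothetical sample: three of four keys share the prefix "00".
sample = [
    "00|a|20130601",
    "00|a|20130602",
    "00|b|20130601",
    "01|c|20130601",
]
print(keys_per_prefix(sample))
print(hottest_share(sample))
```

A share close to `1 / number_of_prefixes` suggests an even spread; a share near 1.0 suggests most reads and writes are landing on one range of keys.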
What’s coming up
• Hive
• Easier querying
• Entirely standardized representation of our data
• HCatalog
• Expose our schema to other internal tooling
• A much larger cluster
• Even more data sources
Thank You
Robert Roland
@robdaemon
rob@simplymeasured.com
We’re hiring!
http://simplymeasured.com/about/careers/
Our Tech Blog: http://engineering.simplymeasured.com/
Our Open Source: http://simplymeasured.github.io/
More sample reports: http://simplymeasured.com/tour/sample-reports/

Editor's Notes

  1. Discuss our “enrichment” idea here with the denormalize part
  2. Mention Impala or other options