HBase @ Twitter

A presentation given at an HBase meetup hosted at Twitter on 7/16/2013. Authors: Gary Helmling and Chris Trezzo.

Published in: Technology, Business

  1. HBase @Twitter
     @gario @ctrezzo
     HBase Meetup 7/16
  2. Agenda
     ● Infrastructure overview
     ● Example use cases
     ● hRaven
  3. Infrastructure Overview
     ● HBase/Hadoop versions
       ○ HBase 0.94.x
       ○ Hadoop 2.0
     ● PROC, DW, TST, EXP
     ● Puppet
       ○ Config management
       ○ Packaging/Deployment (RPMs)
       ○ Rolling Upgrades
     ● Using replication for data movement between PROC and DW
  4. Infrastructure Overview
  5. Major Use Cases
     ● Mutable data store for batch processing
     ● Operational Intelligence
     ● Monitoring/Metrics
  6. Mutable data store for batch processing
     ● Tables copied from MySQL
       ○ Allowing for incremental loads
     ● MapReduce jobs over data in HBase
     ● Snapshot of data copied into HDFS for processing
       ○ HBASE-8369 will optimize this
  7. Operational Intelligence
     ● DCEvents - Audit log for changes in production
     ● TCC big users of Python!
       ○ HappyBase
       ○ Thrift Gateway
  8. Monitoring/Metrics
     https://github.com/twitter/hRaven
  9. hRaven: Why?
     ● Stores stats, configuration and timing for every MapReduce job on every cluster
     ● Structured around the full DAG of jobs from a Pig or Scalding application
     ● Easily queryable for historical trending
     ● Allows for Pig reducer optimization based on historical run stats
     ● Keeps data online forever (12.6M jobs, 4.5B tasks + attempts)
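The slide does not say how historical run stats feed into Pig reducer optimization. As a rough illustration only, the sketch below shows one hypothetical heuristic: take the median reduce-input size from earlier runs of the same application and divide it by a target number of bytes per reducer. The function name, its parameters, and the default values are assumptions made for this example, not hRaven's or Pig's actual logic.

```python
import math
from statistics import median


def estimate_reducers(past_reduce_input_bytes,
                      bytes_per_reducer=1_000_000_000,
                      max_reducers=999):
    """Suggest a reducer count from earlier runs of the same application.

    past_reduce_input_bytes: reduce-input sizes (bytes) from previous flows.
    bytes_per_reducer / max_reducers: illustrative tuning knobs (assumed).
    Returns None when there is no history to learn from.
    """
    if not past_reduce_input_bytes:
        return None
    # The median is less sensitive than the mean to one unusually large run.
    typical = median(past_reduce_input_bytes)
    wanted = math.ceil(typical / bytes_per_reducer)
    # Always use at least one reducer and never exceed the cap.
    return max(1, min(max_reducers, wanted))


# Three earlier runs that read roughly 2.5 to 3.5 GB in the reduce phase.
history = [2_500_000_000, 3_000_000_000, 3_500_000_000]
suggested = estimate_reducers(history)
```

A real implementation would likely weigh recent runs more heavily and look at per-job rather than per-flow history; the point here is only that the stored history makes such an estimate possible before the job starts.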
  10. hRaven: Key Concepts
      ● cluster - each cluster has a unique name mapping to the JobTracker
      ● user - MapReduce jobs are run as a given user
      ● application - a Pig or Scalding script (or plain MapReduce job)
      ● flow - the combined DAG of jobs executed from a single run of an application
      ● version - changes impacting the DAG are recorded as a new version of the same application
  11. hRaven: Application Flows
  12. hRaven: Application Flows
  13. hRaven: Flow Storage
      ● All jobs in a flow are ordered together
  14. hRaven: Flow Storage
      ● Most recent flow is ordered first
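Both ordering properties on the Flow Storage slides can be obtained from row key design alone, because HBase sorts rows by the raw bytes of the key. The sketch below is a minimal illustration under stated assumptions: the `!` separator, the field order, and the 8-byte big-endian inverted timestamp are choices made for this example, since the slides do not spell out hRaven's exact key layout. The helper names are hypothetical.

```python
import struct

LONG_MAX = 2**63 - 1  # largest signed 64-bit value
SEP = b"!"            # assumed field separator


def flow_prefix(cluster, user, app, run_ts_ms):
    """Key prefix shared by every job in one run (flow) of an application."""
    # Subtracting from LONG_MAX makes newer runs sort before older ones.
    inverted = LONG_MAX - run_ts_ms
    head = SEP.join(s.encode("utf-8") for s in (cluster, user, app))
    # Fixed-width big-endian encoding keeps byte order equal to numeric order.
    return head + SEP + struct.pack(">q", inverted)


def job_row_key(cluster, user, app, run_ts_ms, job_id):
    """Full row key: the flow prefix followed by the job id."""
    return flow_prefix(cluster, user, app, run_ts_ms) + SEP + job_id.encode("utf-8")


# Two runs of the same application, inserted in arbitrary order.
keys = sorted([
    job_row_key("c1", "alice", "wordcount", 1000, "job_0002"),
    job_row_key("c1", "alice", "wordcount", 2000, "job_0007"),
    job_row_key("c1", "alice", "wordcount", 1000, "job_0001"),
    job_row_key("c1", "alice", "wordcount", 2000, "job_0008"),
])

# Sorting the raw bytes (as HBase does) lists the newer run first and
# keeps each run's jobs next to each other.
for key in keys:
    print(key.rsplit(SEP, 1)[-1].decode("utf-8"))
# job_0007, job_0008, job_0001, job_0002
```

With keys shaped like this, reading one flow is a prefix scan, and reading an application's latest flows is a scan that starts at the application prefix and stops after the first few runs.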
  15. hRaven: Key Features
      ● All jobs in a flow are ordered together
      ● Per-job metrics stored
        ○ Total map and reduce tasks
        ○ HDFS bytes read / written
        ○ File bytes read / written
        ○ Total map and reduce slot milliseconds
      ● Easy to aggregate stats for an entire flow
      ● Easy to scan the timeseries of each application’s flows
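Because a flow's jobs sit next to each other, aggregating stats for a whole flow amounts to summing the per-job metrics returned by one scan. The sketch below shows that step in isolation. The `JobMetrics` class and its field names are hypothetical stand-ins for the metrics listed on the slide, not hRaven's actual data model.

```python
from dataclasses import dataclass, fields


@dataclass
class JobMetrics:
    """Per-job counters named after the metrics on the slide (assumed names)."""
    map_tasks: int = 0
    reduce_tasks: int = 0
    hdfs_bytes_read: int = 0
    hdfs_bytes_written: int = 0
    file_bytes_read: int = 0
    file_bytes_written: int = 0
    map_slot_millis: int = 0
    reduce_slot_millis: int = 0


def aggregate_flow(jobs):
    """Sum every counter across the jobs of a single flow."""
    total = JobMetrics()
    for job in jobs:
        for f in fields(JobMetrics):
            setattr(total, f.name, getattr(total, f.name) + getattr(job, f.name))
    return total


# Two jobs from the same flow, e.g. the rows returned by one prefix scan.
flow_jobs = [
    JobMetrics(map_tasks=40, reduce_tasks=4, hdfs_bytes_read=5_000, map_slot_millis=90_000),
    JobMetrics(map_tasks=10, reduce_tasks=2, hdfs_bytes_read=1_500, map_slot_millis=20_000),
]
flow_total = aggregate_flow(flow_jobs)
```

The same function applied to successive flows of one application yields the per-run totals needed for trending over time.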
  16. hRaven: Current Uses
      ● Pig reducer optimizations
      ● Cluster utilization / capacity planning
      ● Application performance trending over time
      ● Identifying common job anti-patterns
      ● Ad-hoc analysis for troubleshooting cluster problems
  17. hRaven: Current Uses
  18. hRaven: Current Uses
  19. Future Work
      ● HBase 0.96 on Hadoop 2.0
      ● Flow-centric hRaven UI
      ● Improvements to HBase replication
  20. Questions?
  21. We are Hiring!
      http://twitter.com/jobs
      @JoinTheFlock
