HBase @ Twitter

@gario @ctrezzo
HBase Meetup 7/16

Agenda
● Infrastructure overview
● Example use cases
● hRaven

Infrastructure Overview
● HBase/Hadoop versions
○ HBase 0.94.x
○ Hadoop 2.0
● PROC, DW, TST, EXP
● Puppet
○ Config management
○ Packaging/Deployment (RPMs)
○ Rolling Upgrades
● Using replication for data movement between PROC and DW

Major Use Cases
● Mutable data store for batch processing
● Operational Intelligence
● Monitoring/Metrics

Mutable data store for batch processing
● Tables copied from MySQL
○ Allowing for incremental loads
● MapReduce jobs over data in HBase
● Snapshot of data copied into HDFS for processing
○ HBASE-8369 will optimize this

Operational Intelligence
● DCEvents - Audit log for changes in production
● TCC are big users of Python!
○ HappyBase
○ Thrift Gateway
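
A minimal sketch of how a Python client might read rows through the Thrift gateway with HappyBase. The host name, the table name `dc_events`, and the row-key prefix are hypothetical; the slides do not describe the DCEvents schema. `Connection`, `table`, and `scan(row_prefix=..., limit=...)` are HappyBase calls. The helper takes the table as an argument so it can be exercised without a live cluster.

```python
def recent_events(table, prefix, limit=10):
    """Return up to `limit` (row_key, columns) pairs whose key starts with `prefix`.

    `table` is expected to behave like a happybase.Table.
    """
    return list(table.scan(row_prefix=prefix, limit=limit))


def main():
    # Needs the third-party `happybase` package and a reachable HBase Thrift gateway.
    import happybase

    connection = happybase.Connection("thrift-gateway.example.com")  # hypothetical host
    try:
        table = connection.table("dc_events")  # hypothetical table name
        for row_key, columns in recent_events(table, b"deploy!", limit=5):
            print(row_key, columns)
    finally:
        connection.close()
```

Calling `main()` would print the first five rows under the assumed `deploy!` prefix. Because the Thrift gateway speaks a language-neutral protocol, no JVM is needed on the client side.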

Monitoring/Metrics
https://github.com/twitter/hRaven

hRaven: Why?
● Stores stats, configuration and timing for every MapReduce job on every cluster
● Structured around the full DAG of jobs from a Pig or Scalding application
● Easily queryable for historical trending
● Allows for Pig reducer optimization based on historical run stats
● Keeps data online forever (12.6M jobs, 4.5B tasks + attempts)

hRaven: Key Concepts
● cluster - each cluster has a unique name mapping to the JobTracker
● user - MapReduce jobs are run as a given user
● application - a Pig or Scalding script (or plain MapReduce job)
● flow - the combined DAG of jobs executed from a single run of an application
● version - changes impacting the DAG are recorded as a new version of the same application

hRaven: Flow Storage
● All jobs in a flow are ordered together
● Most recent flow is ordered first
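
One common way to get both orderings from HBase's lexicographic row sorting is to put an inverted run timestamp in the row key, ahead of the job ID. The sketch below illustrates that idea only; the `!` separator, the field layout, and the function name are assumptions for illustration, not hRaven's actual key encoding.

```python
import struct

LONG_MAX = 2**63 - 1  # largest signed 64-bit value
SEP = b"!"            # hypothetical field separator


def flow_job_key(cluster, user, app, run_ts_ms, job_id):
    """Build a byte row key: newest flow sorts first, jobs of one flow sort together."""
    # Big-endian packing of a non-negative number sorts bytewise in numeric order,
    # so subtracting from LONG_MAX makes later runs sort earlier.
    inverted = struct.pack(">q", LONG_MAX - run_ts_ms)
    prefix = SEP.join(s.encode("utf-8") for s in (cluster, user, app))
    return prefix + SEP + inverted + SEP + job_id.encode("utf-8")


keys = [
    flow_job_key("cluster1", "alice", "wordcount", 1000, "job_0001"),
    flow_job_key("cluster1", "alice", "wordcount", 2000, "job_0003"),
    flow_job_key("cluster1", "alice", "wordcount", 1000, "job_0002"),
    flow_job_key("cluster1", "alice", "wordcount", 2000, "job_0004"),
]

# HBase stores rows in lexicographic byte order; sorted() on bytes mimics that.
ordered = [k.rsplit(SEP, 1)[1].decode() for k in sorted(keys)]
print(ordered)  # ['job_0003', 'job_0004', 'job_0001', 'job_0002']
```

The two jobs from the later run (timestamp 2000) come first, and each flow's jobs stay adjacent, so a single prefix scan returns the latest flow without reading older ones.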

hRaven: Key Features
● All jobs in a flow are ordered together
● Per-job metrics stored
○ Total map and reduce tasks
○ HDFS bytes read / written
○ File bytes read / written
○ Total map and reduce slot milliseconds
● Easy to aggregate stats for an entire flow
● Easy to scan the time series of each application’s flows
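
Because the jobs of a flow sit next to each other, flow-level totals reduce to a sum over the per-job metrics listed above. The following is a rough sketch of that aggregation; the `JobStats` class and its field names are hypothetical and do not mirror hRaven's API.

```python
from dataclasses import dataclass, fields


@dataclass
class JobStats:
    total_maps: int
    total_reduces: int
    hdfs_bytes_read: int
    hdfs_bytes_written: int
    file_bytes_read: int
    file_bytes_written: int
    map_slot_millis: int
    reduce_slot_millis: int


def aggregate_flow(jobs):
    """Sum every per-job metric across the jobs of one flow."""
    totals = JobStats(0, 0, 0, 0, 0, 0, 0, 0)
    for job in jobs:
        for f in fields(JobStats):
            setattr(totals, f.name, getattr(totals, f.name) + getattr(job, f.name))
    return totals


# Made-up numbers for a two-job flow.
flow = [
    JobStats(10, 2, 1000, 200, 50, 60, 30000, 8000),
    JobStats(4, 1, 200, 100, 10, 20, 9000, 2000),
]
flow_totals = aggregate_flow(flow)
print(flow_totals.total_maps, flow_totals.hdfs_bytes_read)  # 14 1200
```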

hRaven: Current Uses
● Pig reducer optimizations
● Cluster utilization / capacity planning
● Application performance trending over time
● Identifying common job anti-patterns
● Ad-hoc analysis troubleshooting cluster problems

Future Work
● HBase 0.96 on Hadoop 2.0
● Flow centric hRaven UI
● Improvements to HBase replication

We are Hiring!
http://twitter.com/jobs
@JoinTheFlock
