HBase @Twitter
@gario @ctrezzo
HBase Meetup 7/16
Agenda
● Infrastructure overview
● Example use cases
● hRaven
Infrastructure Overview
● HBase/Hadoop versions
○ HBase 0.94.x
○ Hadoop 2.0
● PROC, DW, TST, EXP
● Puppet
○ Config management
○ Packaging/Deployment (RPMs)
○ Rolling Upgrades
● Using replication for data movement between PROC and DW
Major Use Cases
● Mutable data store for batch processing
● Operational Intelligence
● Monitoring/Metrics
Mutable data store for batch processing
● Tables copied from MySQL
○ Allowing for incremental loads
● MapReduce jobs over data in HBase
● Snapshot of data copied into HDFS for processing
○ HBASE-8369 will optimize this
Operational Intelligence
● DCEvents - Audit log for changes in production
● TCC are big users of Python!
○ HappyBase
○ Thrift Gateway
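
A minimal sketch of how a Python client might write an audit event through the Thrift gateway with HappyBase. The row key layout, the column family "e", the table name and the host are all hypothetical and are not taken from the talk. The helper only relies on a table object exposing put(row, data), which is the shape of HappyBase's Table.put.

```python
import json
import time


def event_row_key(source, timestamp_ms):
    """Build a row key that groups events by source and sorts them by time."""
    # Zero-padding to 13 digits keeps lexicographic order equal to numeric order.
    return "{}!{:013d}".format(source, timestamp_ms).encode("utf-8")


def record_event(table, source, user, change, timestamp_ms=None):
    """Write one audit event via any object exposing put(row, data)."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    row = event_row_key(source, timestamp_ms)
    table.put(row, {
        # Hypothetical column family "e" with two qualifiers.
        b"e:user": user.encode("utf-8"),
        b"e:change": json.dumps(change, sort_keys=True).encode("utf-8"),
    })
    return row


# Usage against a real cluster (needs the happybase package and a running
# Thrift gateway; host and table name are placeholders):
#
#   import happybase
#   connection = happybase.Connection("thrift-gateway.example.com")
#   record_event(connection.table("dc_events"), "puppet", "alice",
#                {"package": "hbase"})
```

Passing the table in as an argument keeps the helper easy to exercise with a stand-in object when no cluster is available.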
Monitoring/Metrics
https://github.com/twitter/hRaven
hRaven: Why?
● Stores stats, configuration and timing for every MapReduce job on every cluster
● Structured around the full DAG of jobs from a Pig or Scalding application
● Easily queryable for historical trending
● Allows for Pig reducer optimization based on historical run stats
● Keeps data online forever (12.6M jobs, 4.5B tasks + attempts)
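
To illustrate the idea of choosing a reducer count from historical run stats, here is a rough sketch. It is not hRaven's or Pig's actual algorithm: the use of the median, the function name and the inputs are assumptions, and the default values are modeled on Pig's reducer estimator settings (about 1 GB of input per reducer, capped at 999 reducers).

```python
import math
from statistics import median


def estimate_reducers(past_reduce_input_bytes,
                      bytes_per_reducer=1_000_000_000,
                      max_reducers=999):
    """Suggest a reducer count from reduce-input sizes seen in earlier runs."""
    if not past_reduce_input_bytes:
        # No history for this job yet: let the caller fall back to a default.
        return None
    # The median is less sensitive to one unusually large or small run.
    typical_bytes = median(past_reduce_input_bytes)
    wanted = math.ceil(typical_bytes / bytes_per_reducer)
    # Clamp to at least one reducer and at most the configured ceiling.
    return max(1, min(max_reducers, wanted))
```

The point of using history is that the reduce-side input of a job is often a poor function of its raw input size, so sizes observed in earlier runs of the same job can be a better guide.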
hRaven: Key Concepts
● cluster - each cluster has a unique name mapping to the JobTracker
● user - MapReduce jobs are run as a given user
● application - a Pig or Scalding script (or plain MapReduce job)
● flow - the combined DAG of jobs executed from a single run of an application
● version - changes impacting the DAG are recorded as a new version of the same application
hRaven: Application Flows
hRaven: Flow Storage
● All jobs in a flow are ordered together
● Most recent flow is ordered first
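
One common way to get both orderings in HBase is a composite row key with an inverted run timestamp. The sketch below assumes that approach; the "!" separator, the field order and the function names are illustrative assumptions, not hRaven's confirmed schema.

```python
import struct

LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE
SEP = b"!"


def flow_prefix(cluster, user, app, run_id):
    """Key prefix shared by every job in one flow (one run of an application)."""
    # Inverting the run id makes newer runs sort before older ones, because
    # HBase orders rows by ascending byte value.
    inverted = struct.pack(">q", LONG_MAX - run_id)
    return SEP.join([cluster.encode("utf-8"), user.encode("utf-8"),
                     app.encode("utf-8"), inverted])


def job_row_key(cluster, user, app, run_id, job_seq):
    """Full row key: the flow prefix followed by a fixed-width job sequence."""
    return flow_prefix(cluster, user, app, run_id) + SEP + struct.pack(">q", job_seq)


# Two runs of the same application, with run ids in epoch milliseconds.
keys = sorted([
    job_row_key("cluster1", "alice", "wordcount", 1373900000000, 2),
    job_row_key("cluster1", "alice", "wordcount", 1373990000000, 1),
    job_row_key("cluster1", "alice", "wordcount", 1373900000000, 1),
    job_row_key("cluster1", "alice", "wordcount", 1373990000000, 2),
])
```

After sorting, the two jobs of the newer run come first and stay adjacent, so a prefix scan on flow_prefix(...) would return one flow's jobs together, and a scan on the application prefix would return flows newest first.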
hRaven: Key Features
● All jobs in a flow are ordered together
● Per-job metrics stored
○ Total map and reduce tasks
○ HDFS bytes read / written
○ File bytes read / written
○ Total map and reduce slot milliseconds
● Easy to aggregate stats for an entire flow
● Easy to scan the timeseries of each application’s flows
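
Because the jobs of a flow sit next to each other, flow-level stats can be produced by summing the per-job metrics from a single scan. A minimal sketch follows; the metric names are hypothetical labels for the metrics listed above, and each job is represented as a plain dict rather than an HBase row.

```python
# Hypothetical names for the per-job metrics listed on the slide.
METRICS = (
    "total_maps", "total_reduces",
    "hdfs_bytes_read", "hdfs_bytes_written",
    "file_bytes_read", "file_bytes_written",
    "map_slot_millis", "reduce_slot_millis",
)


def aggregate_flow(jobs):
    """Sum the per-job metrics of every job in one flow."""
    totals = dict.fromkeys(METRICS, 0)
    job_count = 0
    for job in jobs:
        job_count += 1
        for name in METRICS:
            # A metric missing from a job record counts as zero.
            totals[name] += job.get(name, 0)
    totals["job_count"] = job_count
    return totals
```

The function accepts any iterable of job records, so it could be fed directly from a scanner over one flow's key range.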
hRaven: Current Uses
● Pig reducer optimizations
● Cluster utilization / capacity planning
● Application performance trending over time
● Identifying common job anti-patterns
● Ad-hoc analysis for troubleshooting cluster problems
Future Work
● HBase 0.96 on Hadoop 2.0
● Flow-centric hRaven UI
● Improvements to HBase replication
Questions?
We are Hiring!
http://twitter.com/jobs
@JoinTheFlock
