How Klout migrated from CDH3 to CDH4 …and survived to tell about it

A short talk on Klout's journey from cdh3 to cdh4

Published in: Technology

  1. How Klout migrated from CDH3 to CDH4 …and survived to tell about it
     Large Scale Production Engineering Meetup
     September 19, 2013
     Ian Kallen, Lead Engineer, Klout
     © 2013 Klout
  2. About Klout
     ● recognizing & rewarding online influence
     ● major social network activity signals
     ● Facebook, Twitter, Google+, LinkedIn, 4sq
     ● billions of data points consumed & processed
     ● pipelines update scores & topics
     ● hive & oozie driven jobs & workflow
  3. By The Numbers
     ● 2 TB data intake, 200 TB processed daily
     ● jobs clusters x 2 (dev/staging + production)
     ● hbase x 6 (dev/staging + production x 5)
     ● hbase: 350M req/day, 17K req/sec peak
     ● jobs, hbase & zookeeper total =~ 350 hosts
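The figures on the "By The Numbers" slide can be put in proportion with a little arithmetic. The sketch below uses only the numbers quoted on the slide; the variable names are illustrative, and the average assumes the 350M requests are spread over a full 24-hour day.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted on the slide.
requests_per_day = 350_000_000
peak_requests_per_sec = 17_000
intake_tb_per_day = 2
processed_tb_per_day = 200

# Average HBase request rate, assuming traffic is counted over 24 hours.
average_requests_per_sec = requests_per_day / SECONDS_PER_DAY

# How far the peak sits above the daily average.
peak_to_average = peak_requests_per_sec / average_requests_per_sec

# How much data the pipelines touch per TB of new intake.
processing_amplification = processed_tb_per_day / intake_tb_per_day

print(f"average HBase load: {average_requests_per_sec:,.0f} req/sec")
print(f"peak / average:     {peak_to_average:.1f}x")
print(f"processed / intake: {processing_amplification:.0f}x")
```

Under that assumption the average works out to roughly 4,051 req/sec, so the quoted peak is about 4.2 times the average, and the pipelines process about 100 TB for every TB taken in.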
  4. Motivations
     ● pipelines unstable, slow on cdh3 (v0.20.2)
     ● HBase performance predictability
     ● old hive version limited pipeline developers
     ● cdh3 EOL’d 6/2013
     ● cdh4 (v2.0.x) supports NN H/A, impala
     ● more shiny things
  5. The Environment
     ● data center hosted
     ● I/O subsystems are under our control
     ● network latencies are under our control
     ● FAQ: Why not AWS?
     ● saved millions of dollars last year
     ● that's a lot of beer money.
     ● elasticity need is low, but...
  6. Cloud Envy
     ● this is super easy on AWS
     ● bring up a replacement cluster
     ● double-write or migrate data to replacement
     ● tear down old cluster
     ● have a celebratory drink
     ● if you have any beer money left
  7. Ops Infra
     ● nagios, pager duty for monitoring
     ● monit for process watchdogging
     ● jmx+, graphite, gdash+ for metrics
     ● ubuntu boot images for provisioning
     ● puppet for configuration management
     ● … no Cloudera Manager
  8. Making Plans
     ● no replacement infra to migrate to
       ○ so upgrades must be done in place
     ● Cloudera prefers Cloudera Manager
       ○ so we were on our own to devise a plan
     ● Cloudera helped vet our plan (thanks!)
     ● confidence building on dev/staging clusters
     ● lots of rehearsals on VMs, bug reports
  9. Execution
     ● detailed checklists, kanban board
     ● small test clusters, the dev clusters
     ● planned SLA miss for prod cluster upgrade
     ● lined up phone consult availability w/Cloudera
       ○ we needed it about 10 hours into the prod jobs cluster upgrade
     ● nobody died
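The "detailed checklists" idea can be sketched as an ordered runner that stops at the first failed step, so nobody proceeds past a broken state during an in-place upgrade. This is a minimal illustration, not Klout's tooling: the `Step` class, `run_checklist`, and every step name below are hypothetical, and the actions are stand-in lambdas where a real checklist would call out to cluster tooling.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], bool]  # returns True when the step succeeded


def run_checklist(steps: List[Step]) -> List[str]:
    """Run steps in order; stop at the first failure and return the completed step names."""
    completed = []
    for step in steps:
        if not step.action():
            print(f"STOP: '{step.name}' failed; escalate before continuing")
            break
        completed.append(step.name)
        print(f"done: {step.name}")
    return completed


# Hypothetical step names with stand-in actions (assumptions, not from the talk).
checklist = [
    Step("pause job scheduling", lambda: True),
    Step("back up NameNode metadata", lambda: True),
    Step("upgrade packages on one host", lambda: False),  # simulated failure
    Step("restart services", lambda: True),
]

completed = run_checklist(checklist)
```

Rehearsing the same list on small test clusters first, as the slide describes, shows which steps fail before the production run.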
  10. Aftermath
      ● jobs run faster (speculative execution?)
      ● pipelines are faster
      ● metrics exposed are improved
      ● HBase clusters lose block locality in transit
        ○ fixable
      ● no animals were harmed in this production
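Block locality here means the share of a region server's store file data that has an HDFS replica on the same host. When regions come back up on different hosts after a restart, that share drops and reads go over the network. The slide does not say how Klout fixed it. In general, compactions rewrite store files from the serving host, which puts a replica back on the local node. The sketch below only shows how the ratio is computed; the host names and block sizes are made up for illustration.

```python
from typing import List, Set, Tuple

Block = Tuple[int, Set[str]]  # (size in MB, hosts holding a replica)


def locality_index(blocks: List[Block], region_server: str) -> float:
    """Fraction of block bytes that have a replica on the given region server."""
    total = sum(size for size, _ in blocks)
    if total == 0:
        return 1.0  # nothing to read, so treat as fully local
    local = sum(size for size, hosts in blocks if region_server in hosts)
    return local / total


# Hypothetical hosts and block sizes (assumptions for illustration only).
blocks = [
    (128, {"rs1", "rs2", "rs3"}),
    (128, {"rs1", "rs4", "rs5"}),
    (64, {"rs2", "rs4", "rs6"}),
]

before_move = locality_index(blocks, "rs1")  # region served where most replicas live
after_move = locality_index(blocks, "rs6")   # same region reassigned to another host

print(f"locality before move: {before_move:.2f}")
print(f"locality after move:  {after_move:.2f}")
```

In this made-up layout the index falls from 0.80 to 0.20 when the region moves from `rs1` to `rs6`.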
  11. Retrospect
      ● we had many post-mortems along the way
      ● lots of engineering time & attention
      ● sweating the details paid off
      ● mostly because we’re “power users” of hive
      ● lessons learned:
        ○ re-align clusters
        ○ improve use of vendor tools where possible
          ■ e.g. Cloudera Manager
  12. Onward
      ● dev/staging + prod clusters x 2
      ● better use of HDFS paths & job scheduling
      ● consolidating zookeeper ensembles
      ● implementing NameNode H/A
      ● evaluating Cloudera Manager
      ● evaluating Impala (maybe)
  13. Gratuitous Recruiting Slide
      Klout is hiring awesome people passionate about optimizing for innovation & stability, crunching big data & robust systems
      If you are a great Hadoop DevOps Engineer
      Join Us! ian@klout.com
      Thanks!
