Hado "OPS" or Had "oops"

Slide notes
  • Most people have probably used IMDb, but they probably won’t use it if they have to pay.
  • What we do: Display, video, mobile, social
  • The loop from step 1 to 6 is 200 ms; steps 3-4, Rocket Fuel’s part, are 100 ms.
    We do this 45B times a day
  • Real Time Auction
    Selecting the right ad for each auction
  • Automatically learning from every response & getting better
    Nobody else is doing this as fast, as precisely, or as consistently for our customers
  • The loop from step 1 to 6 is 200 ms; steps 3-4, Rocket Fuel’s part, are 100 ms.
    We do this 45B times a day
  • We now operate 8 data centers around the world
  • We have optimized the design of our data centers as well.
    We custom design our racks and get servers assembled, racked, and tested in a California facility, then ship them to the data center.
    This is what we do not just for US data centers but also for data centers in Europe and Asia. Each rack can be 1500 lb or more, and many racks are sent by air for the initial install.

    Now, let’s look at the two kinds of racks shown above:
    Hadoop servers (the full racks): the data (Hadoop) servers are bigger, as they have 12x3TB drives, and 20 servers fill the whole rack.
    Bidders: the bidders have a lot of cores but take less space because they have only two 2.5” drives each. 40 servers fill up half the rack, but we run out of switch ports.

    And, this is 5% of Rocket Fuel
  • Just say “We have amazing scale” – let the numbers speak for themselves.
  • Managing a Hadoop cluster is not easy
    Start early
    We are heavy users of Puppet
    Infradb is similar to Puppet Hiera, but infradb was written in house 4 years ago.
    -> Puppet and infradb are tightly integrated
    We use Puppet and infradb to make maintenance easy
    Infradb helps us populate Hadoop property values based on the hardware config we have.
    For example: our fair-share slot distribution is handled automatically by infradb whenever we add new nodes.
  • -> Here is an example: we define the formula that decides the number of MR slots per server based on memory, CPU, and disks (see the sketch after these notes).
    -> We always want homogeneous hardware for easy maintenance and planning, but that is impossible since needs change with time.
    -> Automation like this lets you not worry about having heterogeneous servers.
    -> Not just configuration: we use infradb to define alerts once, and all newly added hosts and clusters are automatically monitored by our Nagios.
  • A typical Hadoop problem: you start with a small cluster and want to grow.
    Hadoop’s default properties work well on small clusters; the problems start when your cluster grows.
    The problem is bigger when it happens on a large cluster.
    Aren’t we supposed to get better performance after adding nodes?
  • -> Too many properties to change
    -> Be careful when tuning any changes
    -> Have metrics to compare before and after changes.
    -> MAPREDUCE-2026: JobTracker.getJobCounters() locks the JobTracker and calls JobInProgress.getCounters(), which can be very expensive because it aggregates all the task counters. We found from the JobTracker jstacks that this method is one of the bottlenecks of JobTracker performance.
  • -> A few JIRAs talk about a memory leak in the JobTracker.
    -> None of them really fixed the issue, even though the bugs are marked as resolved.
    -> MAPREDUCE-5351 is fixed but introduced MAPREDUCE-5508. 5508 was later fixed, and there is a workaround: set keep.failed.task.files=true.
    -> None of them really resolved the JT OOM issue (a back-of-the-envelope sketch of the OOM follows these notes).
  • -> Any HDFS-heavy job can impact your HDFS performance; you will not realize it unless you monitor the metrics.
    -> Don’t let any single engineer impact your cluster.
  • -> Monitoring your applications is not enough when you are running at scale in multiple data centers across the world.
    -> We should monitor the network mesh as well.
  • -> Find bad queries immediately and kill them before they impact the cluster (a minimal watchdog sketch follows these notes).
    -> Don’t lose your capacity to mass TaskTracker blacklisting by a single job.
    -> Long-running jobs should be killed; there is no point in letting them run.
  • -> Understand the workload on your cluster to better tune the scheduler properties.
    -> Whenever you add more nodes, the new slots should be automatically distributed across the queues.
    -> No scheduler is perfect unless you understand and tune it.
    -> Have ACLs in place; don’t let any one engineer impact your MR workload.
    -> Have proper accounting for the teams that use more MR capacity.
  • How we operated for the initial few years
    Recently added Data BCP (Business Continuity Plan)
    Latency-critical and important data goes to both places
    Other data follows after processing
    Make use of the BCP cluster to do meaningful work until a disaster happens
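
The notes above say infradb derives the MR slot counts for each server from its hardware facts. The deck does not show the actual infradb formula, so the following is only a minimal Python sketch of that kind of rule; the thresholds and headroom figures are illustrative assumptions, not Rocket Fuel's values.

    # Minimal sketch (not the real infradb formula): derive MR1 map/reduce slot
    # counts for one TaskTracker from its hardware facts. Thresholds are assumptions.
    def mr_slots(mem_gb, cores, data_disks):
        usable_gb = max(mem_gb - 8, 0)      # leave ~8 GB for the DN, TT and OS (assumed)
        by_memory = usable_gb // 2          # assume ~2 GB of heap per task slot
        by_cpu = cores * 2                  # allow ~2 slots per core
        by_disk = data_disks * 2            # avoid oversubscribing spindles
        total = int(min(by_memory, by_cpu, by_disk))
        map_slots = max(total * 2 // 3, 1)  # rough MR1 rule of thumb: ~2/3 map slots
        reduce_slots = max(total - map_slots, 1)
        return map_slots, reduce_slots

    # Example: the datanode profile from the "Hadoop Setup" slide
    # (12x3TB disks, 2x6 cores, 64 GB RAM).
    print(mr_slots(mem_gb=64, cores=12, data_disks=12))   # -> (16, 8)

The computed values would then feed mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum on each node, which is how heterogeneous hardware can stay hands-off.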
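
Slide 24 of the transcript below gives the memory relationship behind the JT OOM: the number of retained JobInProgress instances is the number of users who submitted jobs times mapred.jobtracker.completeuserjobs.maximum. Here is a back-of-the-envelope sketch of why that blows up; the per-job footprint is an assumed example figure, not a measured one.

    # Back-of-the-envelope for the JobTracker OOM (see slide 24).
    # retained JobInProgress instances = users_with_jobs * completeuserjobs.maximum
    users_with_jobs = 50              # example: engineers/teams submitting jobs
    completeuserjobs_maximum = 100    # MR1 default for mapred.jobtracker.completeuserjobs.maximum
    mb_per_retained_job = 10          # assumed footprint; big jobs retain counters for every task

    retained_jobs = users_with_jobs * completeuserjobs_maximum
    heap_gb = retained_jobs * mb_per_retained_job / 1024
    print(f"{retained_jobs} retained jobs ~ {heap_gb:.1f} GB of JT heap")
    # Lowering completeuserjobs.maximum (and tuning retirejob.interval and
    # retiredjobs.cache.size) is what keeps this bounded.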
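
The deck's MR workload alerting (slide 32) is built on an in-house tool that uses the "houdah" Ruby gem; the sketch below is not that tool, just a minimal Python stand-in for the "kill long-running jobs" rule. It shells out to the stock MR1 CLI (hadoop job -list / hadoop job -kill) and assumes the usual whitespace-separated listing with the job id in column 1 and the start time (epoch millis) in column 3.

    # Minimal stand-in for a rogue-job watchdog (not Rocket Fuel's actual tool).
    import subprocess
    import time

    MAX_RUNTIME_HOURS = 12  # example threshold; tune for your workload

    def running_jobs():
        # `hadoop job -list` prints one running job per line; we assume the job id
        # is in the first column and the start time (epoch millis) in the third.
        out = subprocess.run(["hadoop", "job", "-list"],
                             capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            fields = line.split()
            if fields and fields[0].startswith("job_"):
                yield fields[0], int(fields[2])

    def kill_long_running():
        now_ms = time.time() * 1000
        for job_id, start_ms in running_jobs():
            hours = (now_ms - start_ms) / 3_600_000
            if hours > MAX_RUNTIME_HOURS:
                print(f"killing {job_id}, running for {hours:.1f}h")
                subprocess.run(["hadoop", "job", "-kill", job_id], check=True)

    if __name__ == "__main__":
        kill_long_running()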
Transcript: Hado "OPS" or Had "oops"

    1. 1. Proprietary & Confidential. Copyright © 2014. Hado’ops’ or Had’oops’ 1 We’re Hiring rocketfuel.com/careers Kishore Kumar Yellamraju Abhijit Pol
    2. 2. Proprietary & Confidential. Copyright © 2014. The Web Is Monetized By Advertising
    3. 3. Proprietary & Confidential. Copyright © 2014. Delivery Methods »Display »Video »Mobile »Social
    4. 4. Proprietary & Confidential. Copyright © 2014. Overview [diagram]: 1. Page Request, 2. Ad Request, 3. Bid Request, 4. Bid & Ad, 5. Rocketfuel Winning Ad, 6. Ad Served. The flow spans the Browser, Publishers, the Ad Exchange (some exchange partners), and the Rocket Fuel Platform (Real-time Bidder, Automated Decisions, Models, Data Store), with User Segments, User Engagements, Data Partners, Advertisers, Ads & Budget, Model Scores, Events, and a refresh-learning/optimize loop.
    5. 5. Proprietary & Confidential. Copyright © 2014. Real Time Bidding and Serving [diagram]: always buying the best impressions & serving the best ad, scoring each impression on User, Brand Affinity, Time of Day, Geo/Weather, and Site/Page (example bid values shown on the slide).
    6. 6. Proprietary & Confidential. Copyright © 2014. Real Time Bidding and Serving [diagram]: impression scorecards for different goals (leads & sales, coupon downloads, brand awareness), scoring Demo, Brand Affinity, Time of Day, Geo/Weather, Site/Page, Ad Position, In-Market Behavior, and Response (e.g. +9.7%, +0.7%, +1.4%).
    7. 7. Proprietary & Confidential. Copyright © 2014. Overview [diagram, repeated from slide 4]: 1. Page Request, 2. Ad Request, 3. Bid Request, 4. Bid & Ad, 5. Rocketfuel Winning Ad, 6. Ad Served, across the Browser, Publishers, Ad Exchange, and the Rocket Fuel Platform.
    8. 8. Proprietary & Confidential. Copyright © 2014. Throughput (requests per day): Facebook likes 5 B, searches on Google 6 B, bid requests considered by Rocketfuel 45 B.
    9. 9. Proprietary & Confidential. Copyright © 2014. Latency (ms): blink of an eye 400, SF to Tokyo network round trip 100, one beat of a hummingbird's wing 20, look up in Blackbird 2.
    10. 10. Proprietary & Confidential. Copyright © 2014. Architecture and Scale »Datacenters »Scale »Growth »Architecture
    11. 11. Proprietary & Confidential. Copyright © 2014. Data Center Expansion
    12. 12. Proprietary & Confidential. Copyright © 2014. Data Center Design • Racks custom built at Rocket Fuel • Leased space/bandwidth in colocation facilities Hadoop Server 20 2U servers (8.5kW) Bidders 40 2-U Twin 2 servers (17kW)
    13. 13. Proprietary & Confidential. Copyright © 2014. Rocket Fuel Scale »34,474 CPU processor cores –2655 servers –187.4 Teraflops of computing »188 Terabytes of memory –13X the memory of IBM's Watson computer that played Jeopardy »42 Petabytes of storage –106X the data volume of the entire Library of Congress
    14. 14. Proprietary & Confidential. Copyright © 2014. Hadoop at Rocket Fuel »1400 servers »15K Disks »15K Cores »90 TB »30K MR slots »12K daily MR jobs
    15. 15. Proprietary & Confidential. Copyright © 2014. 200 Servers 1400 Servers 5 PB 41 PB 8x Growth
    16. 16. Proprietary & Confidential. Copyright © 2014. Data Architecture 3.0
    17. 17. Proprietary & Confidential. Copyright © 2014. Hadoop Setup [diagram]: QJM with a ZK quorum; Active NN and Standby NN, each with a ZKFC; JT; slave nodes running DN + TT. Master nodes: 6x2TB disks, 2x6 cores, 196 GB RAM, 2x1G NIC. Datanodes: 12x3TB disks, 2x6 cores, 64 GB RAM, 10G NIC. JN/ZK nodes: same as the DNs, with a dedicated disk for ZK or JN.
    18. 18. Proprietary & Confidential. Copyright © 2014. Operations » Maintenance » Performance Tuning » Monitoring » BCP » YARN
    19. 19. Proprietary & Confidential. Copyright © 2014. Puppet + Infradb Automation is key Maintenance is Not Easy
    20. 20. Proprietary & Confidential. Copyright © 2014. Puppet and Infradb » Automate as much as you can » Adding a slave node to Hadoop cluster < 120 seconds » Bringing up a new Hadoop cluster < 500 seconds » MR slots are automatically determined based on hardware config Isn’t it cool ? Just define once
    21. 21. Proprietary & Confidential. Copyright © 2014. No issues when the cluster is small. Problems start when it grows. Performance Tuning
    22. 22. Proprietary & Confidential. Copyright © 2014. Performance Tuning. HDFS: dfs.namenode.handler.count, dfs.image.transfer.timeout, ha.*-timeout.ms. MR: mapred.job.tracker.handler.count, mapred.reduce.parallel.copies / mapreduce.reduce.shuffle.parallelcopies, io.sort.mb, io.sort.factor. ZK: maxClientCnxns. JVM: -XX:+UseConcMarkSweepGC -XX:CMSFullGCsBeforeCompaction=1 -XX:CMSInitiatingOccupancyFraction=60. IMP: MAPREDUCE-2026.
    23. 23. Proprietary & Confidential. Copyright © 2014. MAPREDUCE-5351 MAPREDUCE-5508 "keep.failed.task.files=true" We Have an Issue!
    24. 24. Proprietary & Confidential. Copyright © 2014. JT OOM: #instances of the "JobInProgress" class = no. of users who submitted jobs x mapred.jobtracker.completeuserjobs.maximum. Related properties: mapred.jobtracker.completeuserjobs.maximum, mapred.jobtracker.retirejob.interval, mapred.jobtracker.retiredjobs.cache.size.
    25. 25. Proprietary & Confidential. Copyright © 2014. Operations » Maintenance » Performance Tuning » Monitoring » BCP » YARN
    26. 26. Proprietary & Confidential. Copyright © 2014. Monitoring Wall of Ops
    27. 27. Proprietary & Confidential. Copyright © 2014. Monitoring hadoop.namenode.CallQueueLength hadoop.jobtracker.jvm.memheapusedm Don’t fly blind, you will crash!
    28. 28. Proprietary & Confidential. Copyright © 2014. MR Workload Monitoring
    29. 29. Proprietary & Confidential. Copyright © 2014. Network Monitoring. Don't blame the network, monitor it instead. The network mesh can be a mess.
    30. 30. Proprietary & Confidential. Copyright © 2014. Alerting Monitoring is not enough, need better Alerting
    31. 31. Proprietary & Confidential. Copyright © 2014. Alerts >> Checking whether the NN and JT are up is a no-brainer >> Reduce alert noise by having summary/aggregate alerts >> We heavily rely on custom scripts that query /jmx on the NN and JT (a rough sketch follows this transcript). http://hostname:port/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo : NameDirStatuses, DeadNodes, NumberOfMissingBlocks. qry=Hadoop:service=NameNode,name=FSNamesystemState : FSState, CapacityRemaining, NumDeadDataNodes, UnderReplicatedBlocks. qry=hadoop:service=JobTracker,name=JobTrackerInfo : blacklisted TTs, #jobs, #slots_used, ThreadCount. qry=java.lang:type=Memory : used JVM heap, free JVM heap, etc.
    32. 32. Proprietary & Confidential. Copyright © 2014. MR Workload Alerting » Monitor the MR workload and alert – an in-house tool that uses the "houdah" Ruby gem monitors long-running jobs, jobs with too many map tasks, blacklisted TTs with high failure counts, etc. » Collect details and auto-restart blacklisted TTs » Parse the JT logfile for rogue jobs » Parse the JT log and collect all job-related info » White Elephant or hRaven could help » Parse the scheduler HTML page or use the metrics page: http://<JT-hostname>:50030/scheduler?advanced http://<JT-hostname>:50030/metrics
    33. 33. Proprietary & Confidential. Copyright © 2014. Multi Tenancy: Modeling, OPS, ETL, Ad-hoc
    34. 34. Proprietary & Confidential. Copyright © 2014. No Scheduler is perfect unless you understand and tune it properly Scheduling
    35. 35. Proprietary & Confidential. Copyright © 2014. Operations » Maintenance » Performance Tuning » Monitoring » BCP » YARN
    36. 36. Proprietary & Confidential. Copyright © 2014. BCP » BCP = Business Continuity Plan » Near real time reporting over 15+ TB of daily data » Freshness of models trained over petabytes of data
    37. 37. Proprietary & Confidential. Copyright © 2014. Data BCP [diagram]: INW Data Cluster and LSV Data Cluster, fed from the US/EU/HK Serving Clusters; workloads include Modeling, Reporting, User Queries, Research Ad-hoc Queries, and Processed Data, plus an Amazon backup.
    38. 38. Proprietary & Confidential. Copyright © 2014. YARN » Resource Manager - Global resource scheduler - Hierarchical queues - Application management » Node Manager - Per-machine agent - Manages life cycle of container - Container resource monitoring » Application Master - Per-application - Manages application scheduling and task execution
    39. 39. Proprietary & Confidential. Copyright © 2014. YARN at Rocket Fuel » YARN is in production » 700+ nodes » 31 TB RAM, 8500 disks, 8500 cores » Primary use case: MapReduce » No more static slots » Tez, Spark, Storm are in the race YAY !!!
    40. 40. Proprietary & Confidential. Copyright © 2014. Obligatory “we are hiring” slide! http://rocketfuel.com/careers
    41. 41. Proprietary & Confidential. Copyright © 2014. THANKS kishore@rocketfuel.com apol@rocketfuel.com
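
Slide 31 lists the /jmx queries and the attributes the custom alert scripts look at. As a rough illustration (not the actual Rocket Fuel scripts), a check against the NameNode's /jmx servlet using those query strings and attribute names could look like the Python below; the host, port, and thresholds are placeholders.

    # Rough illustration of an alert built on the NameNode /jmx servlet,
    # using the beans and attributes listed on slide 31. Host/port are placeholders.
    import json
    import urllib.request

    NN = "http://namenode.example.com:50070"

    def jmx_bean(base_url, qry):
        with urllib.request.urlopen(f"{base_url}/jmx?qry={qry}") as resp:
            return json.load(resp)["beans"][0]

    def check_namenode():
        info = jmx_bean(NN, "Hadoop:service=NameNode,name=NameNodeInfo")
        state = jmx_bean(NN, "Hadoop:service=NameNode,name=FSNamesystemState")
        problems = []
        if info["NumberOfMissingBlocks"] > 0:
            problems.append(f"missing blocks: {info['NumberOfMissingBlocks']}")
        if json.loads(info["DeadNodes"]):          # DeadNodes is a JSON-encoded map
            problems.append("dead datanodes present")
        if state["FSState"] != "Operational":
            problems.append(f"FSState is {state['FSState']}")
        if state["UnderReplicatedBlocks"] > 10000: # example aggregate threshold
            problems.append(f"under-replicated blocks: {state['UnderReplicatedBlocks']}")
        return problems

    if __name__ == "__main__":
        for p in check_namenode():
            print("ALERT:", p)

The JobTracker side follows the same pattern against qry=hadoop:service=JobTracker,name=JobTrackerInfo (blacklisted TTs, job and slot counts, ThreadCount) and qry=java.lang:type=Memory for heap usage, as listed on the slide.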
