Next-Generation Hadoop Operations
  What’s ahead in the next 12 months for Hadoop cluster administration
  [...]w Ryan
  [...]ps Engineer
  [...] 6, 2011

Agenda
  1 Hadoop operations @Facebook: an overview
  2 Existing operational best practices
  3 The challenges ahead: new directions in Hadoop
  4 Emerging operational best practices
  5 Conclusions and next steps

Hadoop Operations @Facebook
  Lean staffing, fast moving, highly leveraged
  Basic oncall structure:
    Level 1: 24x7 sysadmin team (“SRO”) for whole site
    Level 2: 2 people (“AppOps”) trading 1-week oncall shifts
    Level 3: 4 different Hadoop dev subteams with 1-week rotations
  Plus oncalls from other adjunct teams:
    SiteOps for machine repair
    NetEng for network, etc.
  Every engineer @FB is issued a cell phone and expected to be available in emergencies and/or if they make a change to a production system or code.

Operational gaps in Hadoop
  Our best practices address all these gaps
    Hardware selection, preparation, and configuration
    Installation/packaging
    Upgrades
    Autostart/start/stop/restart/status as correct UNIX user
    Node level application and system monitoring
    Cluster-level and job-level monitoring
    Integrated log viewing/tailing/grepping
    Fast, reliable, centrally logged cluster-level shell ( != slaves.sh)

Existing operational best practices (1)
  Sysadmin
    The stuff you would do for a large distributed system, but especially:
    Failed/failing hardware is your biggest enemy. FIND IT AND FIX IT, GET IT OUT OF YOUR CLUSTERS! (the ‘excludes’ file is your friend)
    Regularly run every possible diagnostic to safely scan for bad hardware
    Identify and remove “repeat offender” hardware
    Fail fast, recover quickly, small things add up in big clusters:
      RHEL/Centos kickstart steals your disk space (1.5%-3%+ per disk)
      No swap + vm.panic_on_oom=1 + kernel.kdb=0 for “fast auto reboot on OOM”

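The kernel settings named on this slide can be audited per node with a small script. The following is a minimal sketch, not the tooling used at Facebook: the function names and the `DESIRED` table are illustrative assumptions, and `kernel.kdb` is copied from the slide as written, so it may not exist on every kernel.

```python
from pathlib import Path

# Desired values as listed on the slide. kernel.kdb is copied verbatim
# from the slide and may be absent on some kernels (reported as None).
DESIRED = {"vm.panic_on_oom": "1", "kernel.kdb": "0"}


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file under /proc/sys."""
    return Path(root).joinpath(*key.split("."))


def check_sysctls(desired, read=None):
    """Return {key: (wanted, actual)} for every key that does not match.

    `read` is a callable key -> value-or-None; by default it reads the
    live values from /proc/sys. `actual` is None if the key is unreadable.
    """
    if read is None:
        def read(key):
            try:
                return sysctl_path(key).read_text().strip()
            except OSError:
                return None
    mismatches = {}
    for key, wanted in desired.items():
        actual = read(key)
        if actual != wanted:
            mismatches[key] = (wanted, actual)
    return mismatches
```

Calling `check_sysctls(DESIRED)` on a node returns an empty dict when every setting matches; anything else is a candidate for remediation or for the excludes file.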
Sysadmin example
  Identifying your “America’s Most Wanted” pays off

Existing operational best practices (2)
  Tooling
    Maintain a central registry of clusters, nodes, and each node’s role in the cluster, integrated with your service/asset management platform
    Build centrally maintained tools to:
      Start/stop/restart/autostart daemons on hosts (hadoopctl)
      View/grep/tail daemon logs on hosts (hadooplog)
      Start/stop, or execute commands on entire clusters (clusterctl)
      Manage excludes files based on repair status (excluderator)
      Deploy any arbitrary version of software to clusters
      Monitor daemon health and collect statistics

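To make the "excludes file from repair status" idea concrete, here is a minimal sketch of how a central registry could drive an excludes file. It is not the excluderator tool itself, whose internals the slides do not describe: the registry record fields (`host`, `cluster`, `repair_status`) and the status names are hypothetical. The output format, one hostname per line, is what Hadoop expects in an excludes file.

```python
def build_excludes(nodes, cluster, bad_statuses=("in_repair", "failed")):
    """Return excludes-file text (one hostname per line) for one cluster.

    `nodes` is an iterable of registry records; the field names used here
    are assumptions made for this example.
    """
    hosts = sorted(
        n["host"]
        for n in nodes
        if n["cluster"] == cluster and n["repair_status"] in bad_statuses
    )
    return "".join(h + "\n" for h in hosts)


# Hypothetical registry contents, for illustration only.
REGISTRY = [
    {"host": "dn003", "cluster": "DFS1", "repair_status": "in_repair"},
    {"host": "dn001", "cluster": "DFS1", "repair_status": "ok"},
    {"host": "dn002", "cluster": "DFS1", "repair_status": "failed"},
    {"host": "dn900", "cluster": "SILVER", "repair_status": "failed"},
]

if __name__ == "__main__":
    print(build_excludes(REGISTRY, "DFS1"), end="")
```

Generating the file from the registry, rather than editing it by hand, keeps the excludes list consistent with what the repair workflow believes about each machine.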
Tooling example
  Deploy & upgrade clusters
    Deploy an HDFS/MapReduce cluster pair: 2 to 4000 nodes via torrent
      deploy-hadoop-release.py --clusterdeploy=DFS1,SILVER branch@
      clusterctl restart DFS1 SILVER
    “Refresh deploy” on 10 clusters, and then restart just the datanodes
      deploy-hadoop-release.py --poddeploy=DFSSCRIBE-ALL redeploy
      clusterctl restart DFSSCRIBE-ALL:datanode

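The clusterctl commands above take targets written either as a bare name (DFS1) or as a name with a daemon role (DFSSCRIBE-ALL:datanode). As an illustration of that addressing scheme only, the sketch below parses such targets. The real clusterctl is an internal tool whose parsing rules are not shown in the slides, so the function names and the error handling are assumptions.

```python
def parse_target(target):
    """Split 'NAME' or 'NAME:role' into (name, role); role is None if absent."""
    name, sep, role = target.partition(":")
    if not name or (sep and not role):
        raise ValueError(f"bad target: {target!r}")
    return name, (role or None)


def parse_targets(args):
    """Parse a list of command-line targets, preserving their order."""
    return [parse_target(a) for a in args]


if __name__ == "__main__":
    print(parse_targets(["DFS1", "SILVER"]))
    print(parse_target("DFSSCRIBE-ALL:datanode"))
```

A role of None would mean "every daemon in the cluster or pod", while a named role restricts the action to that daemon type.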
Existing operational best practices (3)
  Process
    Document everything
    Segregate different classes of users on different clusters, with appropriate service levels and capacities
    Graph user-visible metrics like HDFS and job latency
    Build “least destructive” procedures for getting hardware back in service
    Developers and Ops should use the same procedures and tools

Process example
  Graphing our users’ experience on the cluster

Hadoop cluster admin’s worst enemies
  “X-Files”: machines which fail in strange ways, undetected by your monitoring systems
    Get your basics under control, then you’ll have more time for these
  “America’s Most Wanted”: machines which keep failing, again and again
    Our data: 1% of our machines accounted for 30% of our repair tickets

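The "1% of machines, 30% of tickets" observation is the kind of figure that can be computed from a repair-ticket export. Below is a minimal sketch, assuming the input is simply a list with one hostname per ticket; the function name and the sample data are made up for illustration and are not Facebook's data.

```python
import math
from collections import Counter


def repeat_offender_share(ticket_hosts, fleet_size, fraction=0.01):
    """Return (worst_hosts, share_of_tickets) for the top `fraction` of the fleet.

    ticket_hosts: one hostname per repair ticket (repeats allowed).
    fleet_size:   total number of machines, including those with no tickets.
    """
    counts = Counter(ticket_hosts)
    k = max(1, math.ceil(fleet_size * fraction))
    worst = counts.most_common(k)
    total = sum(counts.values())
    share = sum(c for _, c in worst) / total if total else 0.0
    return [host for host, _ in worst], share


if __name__ == "__main__":
    # Made-up example: 100 machines, 10 tickets, 3 of them on one host.
    tickets = ["h42"] * 3 + ["h1", "h2", "h3", "h4", "h5", "h6", "h7"]
    print(repeat_offender_share(tickets, fleet_size=100))
```

Hosts returned by this function are natural candidates for permanent removal rather than another round of repair.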
New directions for Hadoop
  HBase (Facebook Messages, real-time click logs)
  Zero-downtime upgrades (AvatarNode, rolling upgrades)
  “Megadatanodes” and Hadoop RAID
  HDFS as an “appliance”
  See also: ://www.facebook.com/notes/facebook-engineering/looking-at-the e-behind-our-three-uses-of-apache-hadoop/468211193919

HBase and Hadoop
  Very new technology with emerging operational characteristics
  Applications using HBase are also new, with their own usage quirks
  Aiming for large number of small clusters (~100 nodes)
  Slow/dead nodes are a big problem: these are real-time, user-facing
    Region failover slow; no speculative execution
    [...]-downtime restarts must be avoided
  See the Messages tech talk here: http://fb.me/95OQ8YaD2rkb3r

Zero-downtime upgrades
  HDFS upgrades are 1-2 hours of downtime
  Jobtracker upgrades are quick (5 min), but kill all currently running jobs
  Rolling upgrades work today, but are too slow for large clusters
  Must be able to be strict and lenient about multiple versions of client/server software installed and running in the cluster

“Megadatanodes” and Hadoop RAID
  Storage requirements continue to increase rapidly, as does CPU/RAM
    Increase in datanode density from 2009-2011 (4TB->36TB)
  Hadoop RAID with XOR and Reed-Solomon bring tremendous cost savings along with management challenges:
    Losing one node is a big deal (200k-600k blocks/node?). A rack? Ouch!
    Tools and admin capabilities are not ready yet
  Will HDFS administration in 2012 be “like administering a cluster of 4[...]apps”?
  Host/rack level network will be a bottleneck

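For readers unfamiliar with why XOR parity saves space yet makes node loss expensive, the sketch below shows the basic idea: one parity block protects a stripe of data blocks, and any single lost block is rebuilt by reading all the survivors. This is a generic illustration of XOR parity, not the Hadoop RAID implementation; the function names are invented for the example and all blocks are assumed to be the same length.

```python
def xor_parity(blocks):
    """XOR a list of byte blocks together, byte by byte."""
    size = max(len(b) for b in blocks)
    parity = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(surviving_blocks, parity):
    """Rebuild the single missing block of a stripe from survivors + parity."""
    return xor_parity(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    stripe = [b"abcd", b"efgh", b"ijkl"]
    parity = xor_parity(stripe)
    # Pretend the middle block was on a dead node.
    rebuilt = recover_missing([stripe[0], stripe[2]], parity)
    print(rebuilt == stripe[1])
```

Note that recovery has to read every surviving block in the stripe, which is one reason losing a dense node generates so much network traffic. XOR parity tolerates only one lost block per stripe; Reed-Solomon codes can tolerate several, at higher computational cost, and are not shown here.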
HDFS as an “appliance”
  Use HDFS cluster instead of commercial storage appliance
    Requires commercial-grade support & features
    Must be price-competitive vs. [...]

Emerging operational best practices
  More careful selection of hardware and network designs to accommodate new uses of Hadoop
  Find and deal with slowness at a node/rack/segment level
  Auto-healing at granularity better than “reboot” or “restart”
  Node-level version detection and installation
  Rolling, zero-downtime upgrades (AvatarNode + new JobTracker)
  And do all this without making Hadoop any harder to set up and run

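One simple way to "find slowness at a node level" is to compare each node's latency against the cluster median and flag outliers. The sketch below is an illustrative assumption, not a method described in the talk: the function name, the threshold factor of 3, and the sample numbers are all made up.

```python
from statistics import median


def slow_nodes(latency_by_node, factor=3.0):
    """Return nodes whose latency exceeds `factor` times the cluster median.

    latency_by_node: dict of hostname -> latency in any consistent unit.
    """
    if not latency_by_node:
        return []
    mid = median(latency_by_node.values())
    return sorted(
        node for node, value in latency_by_node.items() if value > factor * mid
    )


if __name__ == "__main__":
    sample = {"n1": 10, "n2": 11, "n3": 12, "n4": 95}
    print(slow_nodes(sample))
```

Using the median rather than the mean keeps a few very slow nodes from dragging the baseline up and hiding themselves. The same comparison can be run per rack or per network segment by grouping nodes first.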
Next steps
  Are we trying to do too much?
    Facebook needs an enormous data warehouse
    Facebook needs a large distributed filesystem
    Facebook needs a database alternative to MySQL
    Facebook always looking to spend less money
    and all that other stuff too
  Failure is not an option
  Never a dull moment!

(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0