Andrew Ryan describes how Facebook operates Hadoop as a shared resource between groups.
More information and video at:
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/hug-feb-2011-recap/
3. Agenda
1 Hadoop operations @Facebook: an overview
2 Existing operational best practices
3 The challenges ahead: new directions in Hadoop
4 Emerging operational best practices
5 Conclusions and next steps
5. Hadoop Operations @Facebook
▪ Lean staffing, fast moving, highly leveraged
▪ Basic oncall structure:
▪ Level 1: 24x7 sysadmin team (“SRO”) for whole site
▪ Level 2: 2 people (“AppOps”) trading 1-week oncall shifts
▪ Level 3: 4 different Hadoop dev subteams with 1-week rotations
▪ Plus oncalls from other adjunct teams: SiteOps for machine repairs, NetEng for network, etc.
▪ Every engineer @FB is issued a cell phone and is expected to be available in emergencies or after making a change to a production system or code.
6. Operational gaps in Hadoop
Our best practices address all these gaps
▪ Hardware selection, preparation, and configuration
▪ Installation/packaging
▪ Upgrades
▪ Autostart/start/stop/restart/status as correct UNIX user
▪ Node level application and system monitoring
▪ Cluster-level and job-level monitoring
▪ Integrated log viewing/tailing/grepping
▪ Fast, reliable, centrally logged cluster-level shell ( != slaves.sh)
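The last gap above, a fast and centrally logged cluster-level shell, can be sketched in a few lines. The ssh invocation, timeout, and degree of parallelism below are illustrative assumptions, not Facebook's actual tool; the point is that every host runs in parallel with a timeout, so one hung node cannot stall the sweep the way `slaves.sh` can.

```python
# Sketch of a parallel cluster shell (assumed design, not the real tool).
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_on_host(host, command, timeout=30, use_ssh=True):
    """Run one command on one host; returns (host, exit_code, stdout)."""
    # use_ssh=False runs the command locally, which keeps the sketch testable.
    argv = ["ssh", host, command] if use_ssh else ["sh", "-c", command]
    try:
        proc = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout)
        return host, proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        return host, -1, ""  # a hung node must not stall the whole sweep

def cluster_shell(hosts, command, parallelism=64, **kwargs):
    """Run a command on every host in parallel; results keyed by host."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        results = pool.map(lambda h: run_on_host(h, command, **kwargs), hosts)
        return {host: (rc, out) for host, rc, out in results}
```

Collecting results centrally, rather than streaming interleaved output, is what makes the run greppable and loggable afterwards.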
7. Existing operational best practices (1)
Sysadmin
▪ All the stuff you would do for a large distributed system but especially…
▪ Failed/failing hardware is your biggest enemy. FIND IT AND FIX IT, OR
GET IT OUT OF YOUR CLUSTERS! (the ‘excludes’ file is your friend)
▪ Regularly run every possible diagnostic to safely scan for bad hardware
▪ Identify and remove “repeat offender” hardware
▪ Fail fast, recover quickly, small things add up in big clusters:
▪ RHEL/CentOS kickstart steals your disk space (1.5%-3%+ per disk)
▪ No swap + vm.panic_on_oom=1 + kernel.kdb=0 for “fast auto reboot on OOM”
▪ Never fsck ext3 data drives unless Hadoop says you have to
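The "excludes file is your friend" advice can be made concrete: keep the file that `dfs.hosts.exclude` points at in sync with repair status, then trigger decommissioning. The file path and helper below are illustrative; only the `hadoop dfsadmin -refreshNodes` step is real Hadoop.

```python
# Sketch: drive the HDFS excludes file from repair status (path assumed).
def update_excludes(path, hosts_in_repair):
    """Rewrite the excludes file: one bad host per line, sorted and
    de-duplicated. Returns the list that was written."""
    excluded = sorted(set(hosts_in_repair))
    with open(path, "w") as f:
        for host in excluded:
            f.write(host + "\n")
    # An operator (or an excluderator-style tool) would then run
    # `hadoop dfsadmin -refreshNodes` so the namenode starts
    # decommissioning the listed datanodes.
    return excluded
```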
9. Existing operational best practices (2)
Tooling
▪ Maintain a central registry of clusters, nodes, and each node’s role in
the cluster, integrated with your service/asset management platform
▪ Build centrally maintained tools to:
▪ Start/stop/restart/autostart daemons on hosts (hadoopctl)
▪ View/grep/tail daemon logs on hosts (hadooplog)
▪ Start/stop, or execute commands on entire clusters (clusterctl)
▪ Manage excludes files based on repair status (excluderator)
▪ Deploy any arbitrary version of software to clusters
▪ Monitor daemon health and collect statistics
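The central registry that the tools above build on can be as simple as a mapping from cluster to role to hosts. The schema and names below are assumptions for illustration (not Facebook's actual registry); they show how a tool like clusterctl might expand `CLUSTER` or `CLUSTER:role` arguments.

```python
# Sketch of a central cluster/node/role registry (shape assumed).
REGISTRY = {
    "DFS1": {
        "namenode": ["nn001.example.com"],
        "datanode": ["dn001.example.com", "dn002.example.com"],
    },
    "SILVER": {
        "jobtracker": ["jt001.example.com"],
        "tasktracker": ["tt001.example.com", "tt002.example.com"],
    },
}

def nodes_for(cluster, role=None, registry=REGISTRY):
    """Resolve a cluster (or cluster:role) to its host list."""
    roles = registry[cluster]
    if role is not None:
        return list(roles[role])
    # No role given: every node in the cluster, across all roles.
    return [h for hosts in roles.values() for h in hosts]
```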
10. Tooling example
Deploy & upgrade clusters
# Deploy an HDFS/MapReduce cluster pair: 2 to 4000 nodes via torrent
$ deploy-hadoop-release.py --clusterdeploy=DFS1,SILVER branch@rev
$ clusterctl restart DFS1 SILVER
# “Refresh deploy” on 10 clusters, and then restart just the datanodes
$ deploy-hadoop-release.py --poddeploy=DFSSCRIBE-ALL redeploy
$ clusterctl restart DFSSCRIBE-ALL:datanode
11. Existing operational best practices (3)
Process
▪ Document everything
▪ Segregate different classes of users on different clusters, with
appropriate service levels and capacities
▪ Graph user-visible metrics like HDFS and job latency
▪ Build “least destructive” procedures for getting hardware back in service
▪ Developers and Ops should use the same procedures and tools
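Graphing user-visible metrics like HDFS and job latency usually means sampling a canary operation and plotting percentiles rather than means. A minimal sketch of the percentile step (nearest-rank method; the choice of method and units is an assumption):

```python
# Sketch: percentile of canary latency samples, for graphing.
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (e.g. ms)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100.0 * len(ordered))  # nearest-rank method
    return ordered[max(0, rank - 1)]
```

Graphing p50 alongside p99 is what surfaces the slow-node tail that averages hide.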
13. A Hadoop cluster admin’s worst enemies
▪ The “X-Files”: machines which fail in strange ways, undetected by your monitoring systems
▪ Get your basics under control, then you’ll have more time for these
▪ “America’s Most Wanted”: machines which keep failing, again and again
▪ Our data: 1% of our machines accounted for 30% of our repair tickets
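Finding those "America's Most Wanted" machines is a counting exercise over the repair-ticket log. The threshold below is an illustrative choice, not Facebook's; tune it against your own ticket data.

```python
# Sketch: flag "repeat offender" machines from repair tickets.
from collections import Counter

def repeat_offenders(tickets, min_tickets=3):
    """tickets: iterable of hostnames, one entry per repair ticket.
    Returns (host, ticket_count) pairs at or above the threshold,
    worst offenders first."""
    counts = Counter(tickets)
    return [(h, n) for h, n in counts.most_common() if n >= min_tickets]
```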
14. New directions for Hadoop
▪ HBase (Facebook Messages, real-time click logs)
▪ Zero-downtime upgrades (AvatarNode, rolling upgrades)
▪ “Megadatanodes” and Hadoop RAID
▪ HDFS as an “appliance”
See also:
http://www.facebook.com/notes/facebook-engineering/looking-at-the-code-behind-our-three-uses-of-apache-hadoop/468211193919
15. HBase and Hadoop
▪ Very new technology with emerging operational characteristics
▪ Applications using HBase are also new, with their own usage quirks
▪ Aiming for a large number of small clusters (~100 nodes)
▪ Slow/dead nodes are a big problem: these are real-time, user-facing systems
▪ Region failover is slow; there is no speculative execution
▪ Full-downtime restarts must be avoided
View the Messages tech talk here: http://fb.me/95OQ8YaD2rkb3r
16. Zero-downtime upgrades
▪ HDFS upgrades mean 1-2 hours of downtime
▪ JobTracker upgrades are quick (5 min), but kill all currently running jobs
▪ Rolling upgrades work today, but are too slow for large clusters
▪ Must be able to be both strict and lenient about multiple versions of client and server software installed and running in the cluster
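A rolling upgrade's planning step can be sketched simply: restart datanodes in batches small enough that only a bounded fraction of the cluster is down at once. The batch-size fraction is an assumed knob; a real tool would also wait for each batch's datanodes to re-register with the namenode before moving on, which is exactly why rolling upgrades are slow on large clusters today.

```python
# Sketch: plan a rolling datanode restart in bounded-size batches.
def rolling_batches(datanodes, max_down_fraction=0.02):
    """Split datanodes into restart batches of at most max_down_fraction
    of the cluster (minimum 1 node per batch), preserving order."""
    batch_size = max(1, int(len(datanodes) * max_down_fraction))
    return [datanodes[i:i + batch_size]
            for i in range(0, len(datanodes), batch_size)]
```

With 4000 nodes at 2% per batch, that is 50 batches; if each batch needs minutes of health-checking, the total runtime is the slowness the slide complains about.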
17. “Megadatanodes” and Hadoop RAID
▪ Storage requirements continue to increase rapidly, as does CPU/RAM
▪ 9X increase in datanode density from 2009-2011 (4TB->36TB)
▪ Hadoop RAID with XOR and Reed-Solomon brings tremendous cost savings along with management challenges:
▪ Losing one node is a big deal (200k-600k blocks/node?). A rack? Ouch!
▪ Tools and admin capabilities are not ready yet
▪ Will HDFS administration in 2012 be “like administering a cluster of 4000 Netapps”?
▪ Host/rack level network will be a bottleneck
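The XOR half of Hadoop RAID is easy to see in miniature: one parity block per stripe lets you reconstruct any single lost block, which is what allows replication to be reduced in exchange for parity storage. The toy byte blocks below are illustrative; real stripes are HDFS blocks.

```python
# Sketch: XOR parity and single-block reconstruction, Hadoop RAID style.
def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def reconstruct(surviving_blocks, parity):
    """Recover one lost data block: XOR of the survivors and the parity
    cancels every surviving block, leaving the missing one."""
    return xor_blocks(surviving_blocks + [parity])
```

Reed-Solomon generalizes this to tolerate multiple lost blocks per stripe, which is why losing a whole rack is survivable but operationally painful.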
18. HDFS as an “appliance”
▪ Use HDFS cluster instead of commercial storage appliance
▪ Requires commercial-grade support & features
▪ Must be price-competitive
vs.
19. Emerging operational best practices
▪ More careful selection of hardware and network designs to accommodate new uses of Hadoop
▪ Find and deal with slowness at a node/rack/segment level
▪ Auto-healing at granularity better than “reboot” or “restart”
▪ Node-level version detection and installation
▪ Rolling, zero-downtime upgrades (AvatarNode + new JobTracker)
…and do all this without making Hadoop any harder to set up and run
20. Next steps
▪ Are we trying to do too much?
▪ Facebook needs an enormous data warehouse
▪ Facebook needs a large distributed filesystem
▪ Facebook needs a database alternative to MySQL
▪ Facebook is always looking to spend less money
▪ …and all that other stuff too
▪ Failure is not an option
▪ Never a dull moment!
21. (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0