Apache Hadoop at Yahoo is a massive platform with 36 different clusters spread across YARN, Apache HBase, and Apache Storm deployments, totaling 60,000 servers made up of hundreds of different hardware configurations accumulated over generations, presenting unique operational challenges and a variety of unforeseen corner cases. In this talk, we will share methods, tips, and tricks for performing large-scale kernel upgrades on heterogeneous platforms within tight timeframes, with 100% uptime and no service or data loss, through the Dirty COW use case (a privilege-escalation vulnerability found in the Linux kernel in late 2016).
We will dive deep into the three-phased approach that led to the eventual success of the program: pre-work, the kernel upgrade itself, and post-work/cleanup. We will share details on the automation tools, UIs, and reporting tools developed and used to achieve the stated objective of 800+ server upgrades per hour, track upgrade progress, validate and report data blocks, and recover quickly from bad blocks encountered. Throughout the talk, we will highlight the importance of process management, of communicating with hundreds of customer teams to ensure they are on board and aware, and of successful coordination tactics with SREs and Site Operations. We will also touch upon some of the unique challenges we faced along the way, such as BIOS updates needed on over 20,000 hosts, and explain the system rolling-upgrade support we added to HBase and Storm to avoid service disruption to low-latency customers during these upgrades.
WHAT IS DIRTY COW?
Dirty COW (Copy-On-Write) is a security vulnerability in the Linux kernel that affects all
Linux-based operating systems, including Android.
It allows a malicious actor to tamper with read-only, root-owned executable files.
It had been around for about a decade, but surfaced and was actively exploited in early Q4 2016.
The Linux kernel needed to be patched, followed by a full reboot.
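Because the fix only takes effect after booting into the patched kernel, a quick hedged check (a sketch, not part of the original tooling) is to compare the running kernel against the newest installed kernel package:

running=$(uname -r)
# 'rpm -q --last kernel' lists the most recently installed kernel first
latest=$(rpm -q --last kernel | head -1 | awk '{print $1}' | sed 's/^kernel-//')
echo "running=$running installed=$latest"
[ "$running" = "$latest" ] || echo "reboot (or kexec) needed to pick up the patched kernel"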
CHALLENGE
Yahoo Grid comprises 38 clusters:
19 Hadoop clusters
9 HBase clusters
10 Storm clusters
47,000+ hosts of diverse makes and models
CHALLENGE
End of quarter deadline
Cannot afford data loss
Need minimal to no downtime to avoid inconveniencing customers using the clusters.
Coordinating the whole effort between the different tiers of operations, site ops technicians, and the users.
Rigorous end-to-end automation.
PLANNING AND PREPARATION
Numerous discussions between prod ops and dev teams.
Leverage the existing framework to roll out the new kernel.
Many of these hosts had not been rebooted in ages, so their behavior was uncertain.
Thorough testing of the new kernel on different kinds of hardware.
Encountered a variety of issues while testing.
o Used this as an opportunity to fix hosts with hardware issues (see the sketch below).
Resulted in a BIOS + BMC + CPLD upgrade across one particular type of system.
Use kexec on systems at higher reboot risk and under time constraints.
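A hedged sketch of the kind of pre-flight hardware check used to flag bad hosts, using common stock tools (the actual validation script, validateNodeHealth.py, appears in the tool config later):

# Flag failing disks, machine-check/EDAC errors, and missing cores so the
# host can be decommissioned or repaired before the upgrade wave hits it.
smartctl -H /dev/sda | grep -q PASSED || echo "disk: SMART health FAILED"
dmesg | grep -iqE 'mce|edac' && echo "memory/cpu: machine-check or EDAC errors logged"
nproc   # compare against the expected core count for this hardware config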
EXECUTION
Pre-upgrade work:
Scan all hosts for hardware issues: memory, disks, and CPU.
Decommission the bad hosts before the upgrade.
We kept the namenodes up at all times and used them to help with the upgrade
by reporting the missing blocks.
Namenode HA setup with IP aliasing: nn1-ha1, nn1-ha2, and the nn1 alias.
Clients talk to nn1.
Components were upgraded while nn1 was down.
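A minimal sketch of what such an IP-alias flip can look like, assuming a hypothetical address and interface (the exact failover mechanism is not detailed in the deck):

NN1_IP=10.0.0.1   # hypothetical service IP behind the nn1 name
# Release the alias on the old active, claim it on the peer, and send
# gratuitous ARP so switches and clients learn the new location quickly.
ssh nn1-ha1 "ip addr del ${NN1_IP}/24 dev eth0"
ssh nn1-ha2 "ip addr add ${NN1_IP}/24 dev eth0 && arping -c 3 -U -I eth0 ${NN1_IP}"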
EXECUTION
Before the start of the upgrade:
Increase the namenode heartbeat recheck interval (see the sketch below)
o dfs.namenode.heartbeat.recheck-interval
Upgrade the namenodes
Build a block map of hosts to the blocks stored on them
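Raising dfs.namenode.heartbeat.recheck-interval delays the namenode declaring rebooting datanodes dead, which would otherwise trigger mass re-replication mid-batch. A hedged way to verify the effective value (the deck does not state the value Yahoo used):

# Value is in milliseconds; with HDFS defaults the dead-node timeout is
# 2 * recheck-interval + 10 * heartbeat-interval, so raising the recheck
# interval buys rebooting datanodes more time before re-replication starts.
hdfs getconf -confKey dfs.namenode.heartbeat.recheck-interval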
The upgrade:
Bring down nn1, bring down the component services
Try a rack and a stripe first, and increase the batch size as needed
Troubleshoot hosts failing to come back up
For Storm and HBase, the rolling-upgrade script was updated to do a system upgrade, since those services could sustain a rolling upgrade.
HADOOP SUBSYSTEMS
This included various sub-components such as LDAP, Kerberos, syslog servers, monitoring nodes, proxy nodes, gateways, and admin servers, to name a few.
These servers could be failed over and so were not a single point of failure.
The upgrade was done in a rolling fashion with no downtime to the service.
This failover support was built into the kernel-upgrade tooling.
COORDINATION
Comprehensive UI:
Displays all the clusters, with kernel and BIOS versions of all hosts
Displays host upgrade progress and host health status
Displays stats on the number of hosts upgraded, being upgraded, and not yet upgraded
Second-tier Ops scan the UI for hosts with hardware issues that need to be looked into by site ops.
Site ops technicians on standby to immediately troubleshoot hosts with hardware issues.
Kernel Upgrade Flow
(Flowchart summary)
Initialize the workflow / anchor function and find the active and non-active nodes in HDFS.
For each host, determine whether a kernel upgrade is required, the kernel is already current, or the host is unreachable.
Push a new temporary yum repo; because /boot cannot hold multiple kernels, move the old kernel aside, then push the new kernel RPMs.
Validate the nodes (validation involved disk, CPU, and memory consistency checks). Failed nodes have their errors/failures registered and are dropped from the run; passed nodes become the set of nodes to work.
Select a batch to work, shut down the processes, and reboot (reboot failures are flagged).
Start the services (service failures are flagged) and check HDFS status; the run halts if thresholds are crossed for under-replicated blocks, failed nodes, or missing blocks.
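A hedged sketch of the "check HDFS status" gate, assuming the stock dfsadmin report format; the threshold mirrors missing_blocks_threshold in the tool config shown later:

THRESHOLD=1000   # mirrors missing_blocks_threshold in the tool config
report=$(hdfs dfsadmin -report)
missing=$(awk -F': ' '/^Missing blocks/ {print $2; exit}' <<<"$report")
under=$(awk -F': ' '/^Under replicated blocks/ {print $2; exit}' <<<"$report")
echo "missing=$missing under-replicated=$under"
# Halt the rollout before the next batch when the cluster is degraded.
[ "${missing:-0}" -le "$THRESHOLD" ] || { echo "over threshold, halting"; exit 1; }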
Block Map Tool
(Flowchart summary)
Initialize the block map tool, then find all blocks on the datanodes and record their paths: a find across each datanode uploads all blocks and their paths both to HDFS and locally, and Pig is used to find block locations. After this step we are ready to do the kernel upgrade.
During the upgrade: monitor the namenode for missing blocks, trigger a metasave, find the failed nodes, find the nodes having all of the missing blocks, and escalate to site ops.
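A minimal sketch of the per-datanode block scan, assuming a typical dfs.datanode.data.dir layout (the real paths and internal tooling are not shown in the deck):

# Record every block ID with its on-disk path, then upload the map to HDFS
# so it survives a host that never comes back after the reboot.
out=/tmp/blockmap.$(hostname -s).tsv
find /grid/*/hadoop/var/hdfs/data -name 'blk_*' ! -name '*.meta' \
  -printf "$(hostname -f)\t%p\n" > "$out"
hdfs dfs -put -f "$out" /blockmaps/
# On the namenode, 'hdfs dfsadmin -metasave blocks.meta' dumps block state
# (including missing replicas) into the namenode log directory for recovery.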
TOOL CONFIG
default:
  database_type: 'mysql'
  host_netswitch_map: /home/y/conf/ygrid_kernel_upgrade/netswitch_mapping.yaml
  hbase_client_config: /home/y/conf/cluster_upgrade/ygrid_package_version.yaml
  repo_file: 'http://xxxxxx.yyyyyyyy.yahoo.com:xxxxx/yum/properties/ylinux/ylinux/dirtycow/ylinux6-kernel-upgrade.yum'
  # Host selection logic based on the batch specified:
  #   [0-9]*s   - stripe: select a stripe (or n stripes) in the cluster
  #   r         - rack: select the biggest rack of the cluster
  #   [0-9]+    - group of n hosts
  #   stop,halt - stop further execution
  # Example: r,s,50,100,stop - upgrade a rack, then a stripe, then batches of 50
  # and 100 respectively, then stop regardless of whether hosts remain.
  batch: r,s,4s,7s
  reboot_wait: 1500
  missing_blocks_threshold: 1000
  namenode_safemode_timeout: 3600
addNodes:
  datanode: command_add_datanode
  storm: command_add_storm
removeNodes:
  datanode: command_remove_datanode
  storm: command_remove_storm
moveKernel: "mv /boot/initramfs-2.6.32-*.img /boot/initrd-2.6.32-*.img /grid/0/tmp/"
installKernel: "yum -y shell /tmp/ylinux6-kernel-upgrade.yum"
validateKernelHost: "/usr/local/libexec/validateNodeHealth.py"
reboot: "SUDO_USER=kernelupgrade /etc/init.d/systemupgrade.py"
# kexec-based reboot variant (both were shown; only one reboot key is active):
reboot: >-
  kernel=`grubby --default-kernel`;
  initrd=`grubby --info=${kernel} | grep '^initrd' | cut -d'=' -f2`;
  kexec -l $kernel --initrd=$initrd --command-line="$(cat /proc/cmdline)";
  sleep 5; reboot
command:
  command_add_datanode: "/home/y/bin/addNodes -input_data [cluster]_[colo]:HDFS:[hosts]"
  command_add_storm: "/home/y/bin/quarantineDebugNodes -input_data [cluster]_[colo]:STORM:[hosts]"
  command_remove_datanode: "/home/y/bin/shutdownNodes -input_data [cluster]_[colo]:HDFS:[hosts]"
  command_remove_storm: "/home/y/bin/shutdownNodes -input_data [cluster]_[colo]:STORM:[hosts]"
HBase Upgrade
(Flowchart summary: a Jenkins-driven CI/CD process, with release info in Git and package + conf versions on a repo server, running inside an HDFS rolling-upgrade process)
1. Put the NN in rolling-upgrade (RU) mode and upgrade the NN, then the SNN.
2. Master upgrade.
3. Regionserver upgrade process: iterate over each group, and over each server in a group. For each DN/RS: stop the regionserver, stop the DN, run the system upgrade, reboot the host, then validate and start the DN and RS.
4. Stargate upgrade.
5. Gateway upgrade.
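A hedged sketch of the HDFS rolling-upgrade bracket around step 3, using stock Hadoop and HBase commands (the deck drove this through Jenkins; graceful_stop.sh's path varies by install):

hdfs dfsadmin -rollingUpgrade prepare    # create the rollback fsimage
hdfs dfsadmin -rollingUpgrade query      # repeat until it says proceed
# Per regionserver/datanode, before the reboot:
#   graceful_stop.sh moves regions off, then stops the regionserver.
$HBASE_HOME/bin/graceful_stop.sh "$host"
# ... stop DN, system upgrade, reboot, validate, start DN and RS ...
hdfs dfsadmin -rollingUpgrade finalize   # once every host is done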
Storm Kernel Upgrade CI/CD Process
(Flowchart summary: driven by RE Jenkins and the SD process, with release info in Git and state files in Artifactory. RE Jenkins generates a statefile for each component and updates Git with the release info; the statefiles are published to Artifactory and downloaded during the upgrade.)
System-upgrade Pacemaker, then Nimbus, then DRPC.
For the supervisors: kill the workers and stop the supervisor, reboot the host(s), start the supervisor services, and verify the services. The upgrade fails if more than X supervisors fail to upgrade.
Finally, run a test/validation topology and audit all components.
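A minimal per-supervisor sketch under assumed service names (the deck's automation and init scripts are Yahoo-internal):

# Stop the supervisor so Nimbus reschedules its workers on other hosts,
# and kill any leftover worker JVMs before rebooting into the new kernel.
service storm-supervisor stop              # assumed init script name
pkill -TERM -f backtype.storm.daemon.worker || true
reboot
# After the host returns: restart and verify before moving to the next batch.
service storm-supervisor start
storm list                                 # confirm topologies are healthy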
TEST RESULTS: MODEL VS RHEL VERSIONS
We used different configs spanning multiple CPU architectures: Westmere, Sandybridge, Ivybridge, Haswell, Broadwell.
Each of the configurations was installed with different OS versions and kernel versions:

OS version               Kernel minor version
RHEL 6.4                 2.6.32-358
RHEL 6.6 and RHEL 6.7    2.6.32-432 to 2.6.32-512
RHEL 6.8                 2.6.32-632
MODEL VS RHEL AND KERNEL
Issues observed:
Slower reboots
Boot failures due to iDRAC/IPMI
Slowness on some systems
Hardware issues
KEXEC
The primary difference between a standard system boot and a kexec boot is that the hardware initialization or POST normally performed by the BIOS is not performed during a kexec boot. This has the effect of reducing the time required for a reboot.
We had approximately 3,000 nodes that had the potential to cause issues if we chose a standard system boot. These nodes belonged to a specific config and had a bad history when it came to rebooting.
We did do a full system reboot, in rolling fashion, after the Dirty COW kernel upgrade project was complete.
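The kexec variant from the tool config, expanded as a commented sketch (same commands; the surrounding error handling is omitted):

# Load the default GRUB kernel and initrd directly into memory, preserving
# the current kernel command line, then reboot into them without BIOS POST.
kernel=$(grubby --default-kernel)
initrd=$(grubby --info="$kernel" | grep '^initrd' | cut -d'=' -f2)
kexec -l "$kernel" --initrd="$initrd" --command-line="$(cat /proc/cmdline)"
sleep 5
reboot   # on RHEL 6 the shutdown scripts boot into the loaded kexec image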
SUCCESS METRICS
Zero data loss.
47,000+ nodes upgraded at an extremely fast pace.
Minimal customer downtime.
S0 security bug resolved.
Minimal impact to low-latency services.
Uncovered multiple system issues: got an opportunity to upgrade BIOS and BMC firmware and fix an EDAC issue that was causing system slowness, resulting in improved system reliability.
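For the EDAC issue mentioned above, a hedged example of how such errors surface on a Linux host (how Yahoo detected them is not specified in the deck):

# Per-memory-controller corrected/uncorrected error counters from sysfs;
# persistently climbing counts point at failing DIMMs or an EDAC-related bug.
grep -H . /sys/devices/system/edac/mc/mc*/ce_count \
          /sys/devices/system/edac/mc/mc*/ue_count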