DIRTY COW
STORY AT
YAHOO GRID
Sameer Gawande
Savitha Ravikrishnan
June 13, 2017
Agenda
Topic | Speaker(s)
Overview, Planning, Execution & Coordination | Savitha Ravikrishnan
Automation details & Final outcome | Sameer Gawande
Q&A | All Presenters
WHAT IS DIRTY COW?
 Dirty COW (Copy-On-Write, CVE-2016-5195) is a security vulnerability in the Linux kernel that affects all
Linux-based operating systems, including Android.
 It allows a malicious actor to tamper with read-only, root-owned executable files.
 The bug had been in the kernel for about a decade, but it surfaced and was actively exploited in early Q4
2016.
 The Linux kernel needed to be patched, followed by a full reboot.
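A minimal sketch of how hosts can be flagged as still vulnerable by comparing the running kernel release against a patched build. The cutoff version used here is purely illustrative; the real fixed version depends on your distro's CVE-2016-5195 advisory.

```python
import re

def parse_release(release):
    """Turn '2.6.32-642.6.2.el6.x86_64' into comparable integer fields,
    dropping the distro/arch suffix."""
    core = release.split(".el")[0]
    return tuple(int(n) for n in re.findall(r"\d+", core))

def needs_patch(running, patched="2.6.32-642.6.2"):
    # The patched version is a hypothetical cutoff for illustration;
    # check your vendor's CVE-2016-5195 errata for the real one.
    return parse_release(running) < parse_release(patched)

print(needs_patch("2.6.32-573.el6.x86_64"))      # True: predates the cutoff
print(needs_patch("2.6.32-642.6.2.el6.x86_64"))  # False: at the cutoff
```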
CHALLENGE
 Yahoo Grid comprises 38 clusters:
 19 Hadoop clusters
 9 HBase clusters
 10 Storm clusters
 47,000+ hosts of diverse makes and models
GRID STACK
 Backend support: Zookeeper
 Support shop: monitoring, Starling for logging
 Hadoop storage: HDFS, HBase as a NoSQL store, HCatalog for the metadata registry
 Hadoop compute: YARN (MapReduce) and Tez for batch processing, Storm for stream processing, Spark for iterative programming
 Hadoop services: Pig for ETL, Hive for SQL, Oozie for workflows, proxy services, GDM for data management, CaffeOnSpark for ML
CHALLENGE
 End-of-quarter deadline.
 Could not afford data loss.
 Needed minimal to no downtime, to avoid inconveniencing customers using the
clusters.
 Coordinating the whole effort across different tiers of operations, site ops
technicians, and users.
 Rigorous end-to-end automation.
PLANNING AND PREPARATION
 Numerous discussions between prod ops and dev teams.
 Leveraged the existing framework to roll out the new kernel.
 Many of these hosts had not been rebooted in ages, so behavior was uncertain.
 Thorough testing of the new kernel on different kinds of hardware.
 Encountered a variety of issues while testing.
o Used this as an opportunity to fix hosts with hardware issues.
 Resulted in a BIOS + BMC + CPLD upgrade across a particular type of system.
 Used kexec on systems at higher risk and under time constraints.
EXECUTION
 Pre-upgrade work:
 Scanned all hosts for hardware issues: memory, disks, and CPU.
 Decommissioned them before the upgrade.
 Kept the namenodes up at all times and used them to help with the upgrade
by reporting missing blocks.
 Namenode HA setup: IP aliasing, nn1-ha1, nn1-ha2, and nn1.
 Clients talk to nn1.
 Components were upgraded while nn1 was down.
EXECUTION
 Before the start of the upgrade:
o Increase the namenode heartbeat recheck interval
(dfs.namenode.heartbeat.recheck-interval).
o Upgrade the namenodes.
o Build a block map of hosts to the blocks stored on them.
 The upgrade:
o Bring down nn1, then bring down the component services.
o Try a rack and a stripe first, and increase the batch count as needed.
o Troubleshoot hosts failing to come back up.
 For Storm and HBase, the rolling upgrade script was updated to do a system upgrade, as they could
sustain a rolling upgrade.
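The reason for raising dfs.namenode.heartbeat.recheck-interval first: the namenode declares a datanode dead, and starts re-replicating its blocks, only after a timeout derived from that interval. A quick calculation with stock HDFS defaults:

```python
# HDFS marks a datanode dead after:
#   timeout = 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval
# Raising the recheck interval keeps hosts that are mid-reboot from being
# declared dead, which would trigger a re-replication storm.
def dead_node_timeout_ms(recheck_interval_ms=300_000, heartbeat_interval_s=3):
    return 2 * recheck_interval_ms + 10 * heartbeat_interval_s * 1000

print(dead_node_timeout_ms() / 60_000)           # stock defaults: 10.5 minutes
print(dead_node_timeout_ms(1_800_000) / 60_000)  # 6x recheck interval: 60.5 minutes
```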
Subsystem Upgrade
HADOOP SUBSYSTEMS
 This included various sub-components such as LDAP, Kerberos, syslog
servers, monitoring nodes, proxy nodes, gateways, and admin servers, to name
a few.
 These servers could be failed over and were not considered single points
of failure.
 The upgrade was done in a rolling fashion with no downtime to the service.
 Support for these hosts is built into the kernel upgrade automation.
COORDINATION
 Comprehensive UI:
 Displays all the clusters, with kernel and BIOS versions of all hosts.
 Displays host upgrade progress and host health status.
 Displays stats on the numbers of hosts upgraded, being upgraded, and not yet
upgraded.
 2nd-tier ops scan the UI for hosts with hardware issues that need to be looked
into by site ops.
 Site ops technicians on standby to immediately troubleshoot hosts with hardware
issues.
UI SNAPSHOT
Kernel Upgrade Flow
 Initialize the workflow (anchor function) and find active vs. non-active nodes. Nodes whose kernel is already current, or that are unreachable, are skipped; the rest require the upgrade.
 Push the repo and the new kernel: a new temporary yum repo is pushed, and since /boot can't hold multiple kernels, the old kernel is moved out before the new kernel RPMs are pushed.
 Validate nodes (disk, CPU, and memory consistency checks). For failed nodes, register the errors/failures and terminate.
 For passed nodes (nodes to work): select a batch to work, shut down processes, and reboot (reboot failures are registered).
 Start services (service failures are registered) and check HDFS status: find the active nodes in HDFS. If thresholds are crossed (under-replicated/failed nodes, missing blocks), stop and troubleshoot.
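The flow above can be sketched as a driver loop. Every function name here is hypothetical, a sketch of the workflow's shape rather than the actual tool:

```python
def upgrade_cluster(nodes, make_batches, ops, missing_blocks_threshold=1000):
    """ops maps step names to callables: 'validate', 'shutdown', 'reboot',
    'start', 'missing_blocks'. Returns (upgraded batches, failed nodes)."""
    healthy = [n for n in nodes if ops["validate"](n)]   # disk/CPU/mem checks
    failed = [n for n in nodes if n not in healthy]      # register errors, skip
    upgraded = []
    for batch in make_batches(healthy):                  # rack, stripe, 4s, ...
        ops["shutdown"](batch)
        ops["reboot"](batch)
        ops["start"](batch)
        if ops["missing_blocks"]() > missing_blocks_threshold:
            break                                        # stop and troubleshoot
        upgraded.append(batch)
    return upgraded, failed

# With stub callables this runs end to end; in production each callable would
# wrap the repo push, RPM install, reboot, and service scripts.
ops = {
    "validate": lambda n: n != "bad-host",
    "shutdown": lambda b: None,
    "reboot": lambda b: None,
    "start": lambda b: None,
    "missing_blocks": lambda: 0,
}
print(upgrade_cluster(["dn1", "dn2", "bad-host"], lambda hs: [hs], ops))
```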
Block Map Tool
 Initialize the block map tool: find all blocks on the datanodes and record their paths. A find job uploads all block paths to HDFS and locally; Pig is used to find the block locations.
 Monitor the namenode for missing blocks; trigger a metasave.
 From the metasave, find the failed nodes and the nodes holding all the blocks; escalate to site ops.
 After this step we are ready to do the kernel upgrade.
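What the block map buys you, in a nutshell: a reboot batch is only safe if no block has all of its replicas on hosts inside that batch. A toy check, where the block_map format is illustrative rather than the tool's actual output:

```python
def unsafe_blocks(block_map, batch):
    """Return blocks whose every replica lives on a host in the reboot batch;
    rebooting the whole batch at once would make these blocks go missing."""
    batch = set(batch)
    return [blk for blk, hosts in block_map.items() if hosts <= batch]

block_map = {
    "blk_1": {"dn1", "dn2", "dn3"},  # one replica survives outside the batch
    "blk_2": {"dn1", "dn2"},         # all replicas inside the batch
}
print(unsafe_blocks(block_map, ["dn1", "dn2"]))  # ['blk_2']
```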
TOOL CONFIG
default:
  database_type: 'mysql'
  host_netswitch_map: /home/y/conf/ygrid_kernel_upgrade/netswitch_mapping.yaml
  hbase_client_config: /home/y/conf/cluster_upgrade/ygrid_package_version.yaml
  repo_file: 'http://xxxxxx.yyyyyyyy.yahoo.com:xxxxx/yum/properties/ylinux/ylinux/dirtycow/ylinux6-kernel-upgrade.yum'
  # Host selection logic based on the batch specified:
  #   [0-9]+s   - stripe; select a stripe in the cluster
  #   r         - rack; select the biggest rack of the cluster
  #   [0-9]+    - group of n hosts
  #   stop,halt - stop further execution
  # Example: r,s,50,100,stop - upgrade a rack, then a stripe, then batches of 50
  # and 100 respectively, then stop regardless of whether hosts remain.
  batch: r,s,4s,7s
  reboot_wait: 1500
  missing_blocks_threshold: 1000
  namenode_safemode_timeout: 3600
  addNodes:
    datanode: command_add_datanode
    storm: command_add_storm
  removeNodes:
    datanode: command_remove_datanode
    storm: command_remove_storm
  moveKernel: "mv /boot/initramfs-2.6.32-*.img /boot/initrd-2.6.32-*.img /grid/0/tmp/"
  installKernel: "yum -y shell /tmp/ylinux6-kernel-upgrade.yum"
  validateKernelHost: "/usr/local/libexec/validateNodeHealth.py"
  reboot: "SUDO_USER=kernelupgrade /etc/init.d/systemupgrade.py"
  # kexec variant of reboot:
  reboot: "kernel=`grubby --default-kernel`; initrd=`grubby --info=${kernel} | grep '^initrd' | cut -d'=' -f2`; kexec -l $kernel --initrd=$initrd --command-line=\"$(cat /proc/cmdline)\"; sleep 5; reboot"
command:
  command_add_datanode: "/home/y/bin/addNodes -input_data [cluster]_[colo]:HDFS:[hosts]"
  command_add_storm: "/home/y/bin/quarantineDebugNodes -input_data [cluster]_[colo]:STORM:[hosts]"
  command_remove_datanode: "/home/y/bin/shutdownNodes -input_data [cluster]_[colo]:HDFS:[hosts]"
  command_remove_storm: "/home/y/bin/shutdownNodes -input_data [cluster]_[colo]:STORM:[hosts]"
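The `batch` selection syntax documented in the config comments can be sketched as a small parser. Treating `Ns` as N stripes is our reading of the comment, not a documented contract:

```python
import re

def parse_batch_spec(spec):
    """Turn a spec like 'r,s,4s,7s' or 'r,s,50,100,stop' into (kind, count) steps."""
    steps = []
    for tok in (t.strip() for t in spec.split(",")):
        if tok == "r":
            steps.append(("rack", 1))          # biggest rack of the cluster
        elif re.fullmatch(r"\d*s", tok):
            steps.append(("stripe", int(tok[:-1] or 1)))
        elif tok.isdigit():
            steps.append(("hosts", int(tok)))  # group of n hosts
        elif tok in ("stop", "halt"):
            steps.append(("stop", 0))          # stop further execution
            break
        else:
            raise ValueError(f"unknown batch token: {tok!r}")
    return steps

print(parse_batch_spec("r,s,4s,7s"))
# [('rack', 1), ('stripe', 1), ('stripe', 4), ('stripe', 7)]
```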
Rolling Upgrade for Low-Latency Services
HBase Upgrade
 CI/CD process: Jenkins starts the upgrade using release info from Git; packages and conf versions come from the repo server.
 Put the NN in rolling-upgrade (RU) mode and upgrade the NN and SNN (HDFS rolling upgrade process).
 Upgrade the HBase master, then the Stargate and gateway components.
 Regionserver upgrade process, iterating over each group and over each server in a group:
o Stop the regionserver.
o Stop the DN.
o Reboot the host.
o Validate, then start the DN and RS.
Storm Kernel Upgrade
 CI/CD process: RE Jenkins generates a statefile for each component and updates Git with the release info; statefiles are published to Artifactory (state files & release info) and downloaded during the upgrade.
 System-upgrade Pacemaker, Nimbus, and DRPC.
 For supervisors: kill the workers and stop the supervisor, reboot the host(s), start the supervisor services, and verify the services.
 Run a test/validation topology and audit all components.
 The upgrade fails if more than X supervisors fail to upgrade.
Impact and Statistics
TEST RESULTS MODEL VS RHEL VERSIONS
 We used different configs spanning multiple architectures such as Westmere,
Sandy Bridge, Ivy Bridge, Haswell, and Broadwell.
 Each configuration was installed with different OS versions and kernel
versions:

OS version            | Kernel minor version
RHEL 6.4              | 2.6.32-358
RHEL 6.6 and RHEL 6.7 | 2.6.32-432 to 2.6.32-512
RHEL 6.8              | 2.6.32-632
MODEL VS RHEL AND KERNEL
 Issues encountered:
o Slower reboots
o Boot failures due to iDRAC/IPMI
o Slowness on some systems
o Hardware issues
KEXEC
 The primary difference between a standard system boot and a kexec boot is that
the hardware initialization (POST) normally performed by the BIOS is skipped
during a kexec boot, which reduces the time required for a reboot.
 Approximately 3,000 nodes had the potential to cause issues if we chose a
standard system boot. These nodes belonged to a specific config and had a bad
history when it came to rebooting.
 We did do a full system reboot in a rolling fashion after the Dirty COW
kernel upgrade project was done.
SUCCESS METRICS
 Zero data loss.
 47,000+ nodes upgraded at an extremely fast pace.
 Minimal customer downtime.
 S0 security bug resolved.
 Minimal impact to low-latency services.
 Uncovered multiple system issues: got an opportunity to upgrade BIOS and BMC
and fix an EDAC issue that was causing system slowness, resulting in improved
system reliability.
Handling Kernel Upgrades at Scale - The Dirty Cow Story
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 

Recently uploaded (20)

DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 

Handling Kernel Upgrades at Scale - The Dirty Cow Story

  • 8. EXECUTION
     Pre-upgrade work:
      o Scan all hosts for hardware issues: memory, disks, and CPU.
      o Decommission affected hosts before the upgrade.
     We kept the namenodes up at all times and used them to help with the upgrade by reporting missing blocks.
     Namenode HA setup with IP aliasing: nn1-ha1, nn1-ha2, and nn1.
      o Clients talk to nn1.
      o Components are upgraded while nn1 is down.
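The missing-block count the team watched during the upgrade is exposed through the NameNode's standard JMX servlet (`/jmx`, bean `Hadoop:service=NameNode,name=FSNamesystem`). A minimal polling sketch, assuming a hypothetical namenode HTTP address:

```python
import json
import urllib.request

def missing_blocks(jmx_json: str) -> int:
    """Extract the MissingBlocks gauge from NameNode JMX output."""
    for bean in json.loads(jmx_json)["beans"]:
        if bean.get("name") == "Hadoop:service=NameNode,name=FSNamesystem":
            return bean["MissingBlocks"]
    raise KeyError("FSNamesystem bean not found")

def poll_namenode(nn_http="http://nn1.example.com:50070"):
    # nn1.example.com is a placeholder for the HA alias nn1
    url = nn_http + "/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"
    with urllib.request.urlopen(url) as resp:
        return missing_blocks(resp.read().decode())

if __name__ == "__main__":
    # Offline demonstration with a canned JMX response.
    sample = json.dumps({"beans": [
        {"name": "Hadoop:service=NameNode,name=FSNamesystem",
         "MissingBlocks": 0, "UnderReplicatedBlocks": 12}]})
    print(missing_blocks(sample))  # → 0
```

The automation can halt the batch as soon as this gauge crosses the configured threshold.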
  • 9. EXECUTION
     Before the start of the upgrade:
      o Increase the namenode heartbeat recheck interval (dfs.namenode.heartbeat.recheck-interval).
      o Upgrade the namenodes.
      o Build a map of hosts to the blocks stored on them.
     The upgrade:
      o Bring down nn1, then bring down the component services.
      o Try a rack and a stripe first, and increase that count as needed.
      o Troubleshoot hosts that fail to come back up.
     For Storm and HBase, the rolling upgrade script was updated to do a system upgrade, since those services can sustain a rolling upgrade.
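Raising the recheck interval buys time because the namenode declares a datanode dead only after 2 × dfs.namenode.heartbeat.recheck-interval plus 10 × dfs.heartbeat.interval. A small sketch of that arithmetic (the 30-minute value is illustrative, not the deck's actual setting):

```python
def datanode_dead_interval_ms(recheck_interval_ms=300_000, heartbeat_interval_s=3):
    """Time before HDFS declares a datanode dead:
    2 * dfs.namenode.heartbeat.recheck-interval (ms)
    + 10 * dfs.heartbeat.interval (s)."""
    return 2 * recheck_interval_ms + 10 * heartbeat_interval_s * 1000

# Defaults: 10.5 minutes before a rebooting host is declared dead
# and its blocks start re-replicating.
print(datanode_dead_interval_ms() / 60_000)  # → 10.5

# Raising the recheck interval to 30 minutes gives ~60.5 minutes of
# headroom for a slow reboot without triggering re-replication storms.
print(datanode_dead_interval_ms(recheck_interval_ms=1_800_000) / 60_000)  # → 60.5
```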
  • 11. HADOOP SUBSYSTEMS
     This included various sub-components such as LDAP, Kerberos, syslog servers, monitoring nodes, proxy nodes, gateways, and admin servers, to name a few.
     These servers could be failed over and were not a single point of failure.
     The upgrade was done in rolling fashion with no downtime to the service.
     Support for this was built into the kernel upgrade tooling.
  • 12. COORDINATION
     Comprehensive UI:
      o Displays all the clusters, with kernel and BIOS versions of all hosts.
      o Displays host upgrade progress and host health status.
      o Displays stats on the numbers of hosts upgraded, being upgraded, and not yet upgraded.
     The 2nd tier of Ops scans the UI for hosts with hardware issues that site ops need to look into.
     Site ops technicians are on standby to immediately troubleshoot hosts with hardware issues.
  • 16. KERNEL UPGRADE FLOW (flow diagram)
     Initialize the workflow (anchor function); find active vs. non-active nodes and whether a kernel upgrade is required, skipping nodes whose kernel is current or that are unreachable.
     Push a new temporary repo; because /boot can't hold multiple kernels, move the old kernel aside, then push the new kernel RPMs.
     Validate nodes (disk, CPU, and memory consistency checks); for failed nodes, register the errors/failures, terminate, and exit.
     For passed nodes, select a batch to work on: shut down processes, reboot (tracking reboot failures), and start services (tracking service failures).
     Check HDFS status: find the active nodes in HDFS and watch for crossed thresholds, under-replicated blocks from failed nodes, and missing blocks.
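The flow above can be sketched as a driver loop. Every helper here is a hypothetical stub standing in for the real automation hooks; only the control flow mirrors the diagram:

```python
# Hypothetical stubs; suffixes on host names simulate node states.
def kernel_is_current(h): return h.endswith("-done")
def reachable(h): return not h.endswith("-down")
def push_repo(h): pass
def move_old_kernel(h): pass      # /boot can't hold two kernels
def push_kernel_rpms(h): pass
def validate(h): return not h.endswith("-badhw")   # disk/CPU/mem checks
def shutdown_processes(h): pass
def reboot_and_wait(h, timeout): return True
def start_services(h): pass
def missing_blocks(): return 0

def upgrade_batch(hosts, missing_blocks_threshold=1000):
    """One pass of the kernel-upgrade flow for a batch of hosts."""
    failed = []
    for host in hosts:
        if kernel_is_current(host) or not reachable(host):
            continue  # kernel current / unreachable: nothing to do
        push_repo(host)
        move_old_kernel(host)
        push_kernel_rpms(host)
        if not validate(host):
            failed.append(host)   # register error and move on
            continue
        shutdown_processes(host)
        if not reboot_and_wait(host, 1500):
            failed.append(host)
            continue
        start_services(host)
    if missing_blocks() > missing_blocks_threshold:
        raise RuntimeError("missing-block threshold crossed; halting")
    return failed

print(upgrade_batch(["a", "b-badhw", "c-done"]))  # → ['b-badhw']
```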
  • 17. BLOCK MAP TOOL (flow diagram)
     Initialize the block map tool: find all blocks on the datanodes and record their paths (a find uploads all blocks and paths to HDFS and locally; Pig is used to find block locations). After this step we are ready to do the kernel upgrade.
     Monitor the namenode for missing blocks: trigger a metasave, find the failed nodes and the nodes having all the blocks, and escalate to site ops.
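A simplified sketch of what such a block map enables (the real tool used find plus a Pig job; the (block, datanode) input shape here is an assumption): once you know which hosts hold which blocks, you can check that no block has all of its replicas inside the batch about to reboot.

```python
from collections import defaultdict

def build_block_map(records):
    """records: iterable of (block_id, datanode) pairs."""
    by_host = defaultdict(set)
    by_block = defaultdict(set)
    for block, host in records:
        by_host[host].add(block)
        by_block[block].add(host)
    return by_host, by_block

def safe_to_reboot(hosts, by_host, by_block):
    """A batch is safe if no block has *all* of its replicas inside it."""
    batch = set(hosts)
    at_risk = {b for h in batch for b in by_host[h] if by_block[b] <= batch}
    return not at_risk, at_risk

by_host, by_block = build_block_map([
    ("blk_1", "dn1"), ("blk_1", "dn2"), ("blk_1", "dn3"),
    ("blk_2", "dn1"), ("blk_2", "dn2"),
])
print(safe_to_reboot(["dn1"], by_host, by_block))         # → (True, set())
print(safe_to_reboot(["dn1", "dn2"], by_host, by_block))  # → (False, {'blk_2'})
```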
  • 18. TOOL CONFIG

    default:
      database_type: 'mysql'
      host_netswitch_map: /home/y/conf/ygrid_kernel_upgrade/netswitch_mapping.yaml
      hbase_client_config: /home/y/conf/cluster_upgrade/ygrid_package_version.yaml
      repo_file: 'http://xxxxxx.yyyyyyyy.yahoo.com:xxxxx/yum/properties/ylinux/ylinux/dirtycow/ylinux6-kernel-upgrade.yum'
      # host selection logic based on the batch specified:
      # [0-9]+s    - stripe, select a stripe in the cluster
      # r          - rack, select the biggest rack of the cluster
      # [0-9]+     - group of n hosts
      # stop,halt  - stop further execution
      # example:
      # r,s,50,100,stop - upgrade a rack, then a stripe, then batches of 50 and 100
      #                   respectively, then stop irrespective of hosts still available
      batch: r,s,4s,7s
      reboot_wait: 1500
      missing_blocks_threshold: 1000
      namenode_safemode_timeout: 3600
      addNodes:
        datanode: command_add_datanode
        storm: command_add_storm
      removeNodes:
        datanode: command_remove_datanode
        storm: command_remove_storm
      moveKernel: "mv /boot/initramfs-2.6.32-*.img /boot/initrd-2.6.32-*.img /grid/0/tmp/"
      installKernel: "yum -y shell /tmp/ylinux6-kernel-upgrade.yum"
      validateKernelHost: "/usr/local/libexec/validateNodeHealth.py"
      reboot: "SUDO_USER=kernelupgrade /etc/init.d/systemupgrade.py"
      # kexec variant:
      reboot: "kernel=`grubby --default-kernel`; initrd=`grubby --info=${kernel} | grep '^initrd' | cut -d'=' -f2`; kexec -l $kernel --initrd=$initrd --command-line=\"$(cat /proc/cmdline)\"; sleep 5; reboot"
    command:
      command_add_datanode: "/home/y/bin/addNodes -input_data [cluster]_[colo]:HDFS:[hosts]"
      command_add_storm: "/home/y/bin/quarantineDebugNodes -input_data [cluster]_[colo]:STORM:[hosts]"
      command_remove_datanode: "/home/y/bin/shutdownNodes -input_data [cluster]_[colo]:HDFS:[hosts]"
      command_remove_storm: "/home/y/bin/shutdownNodes -input_data [cluster]_[colo]:STORM:[hosts]"
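The batch directive above (r,s,4s,7s) can be interpreted with a small parser; this sketch follows the token grammar described in the config comments, not the tool's actual implementation:

```python
import re

def parse_batch(spec: str):
    """Parse the host-selection spec from the tool config:
    r = biggest rack, [n]s = n stripes (bare 's' = one stripe),
    n = group of n hosts, stop/halt = stop further execution."""
    out = []
    for tok in spec.split(","):
        tok = tok.strip()
        if tok in ("stop", "halt"):
            out.append(("stop", None))
        elif tok == "r":
            out.append(("rack", None))
        elif re.fullmatch(r"\d*s", tok):
            out.append(("stripe", int(tok[:-1] or 1)))
        elif tok.isdigit():
            out.append(("group", int(tok)))
        else:
            raise ValueError(f"bad batch token: {tok!r}")
    return out

print(parse_batch("r,s,4s,7s"))
# → [('rack', None), ('stripe', 1), ('stripe', 4), ('stripe', 7)]
```

Starting with a rack, then progressively larger stripe counts, limits the blast radius of a bad kernel while still ramping up throughput.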
  • 20. HBASE KERNEL UPGRADE (flow diagram)
     CI/CD process: Git (release info) feeds Jenkins, which drives the upgrade; packages and conf versions come from the repo server.
     HDFS rolling upgrade: put the NN in rolling-upgrade mode and upgrade the NN and SNN.
     HBase upgrade: master upgrade, regionserver upgrade, Stargate upgrade, and gateway upgrade.
     System upgrade of each DN/RS, iterating over each group and over each server in a group: stop the regionserver, stop the DN, reboot the host, then validate and start the DN and RS.
  • 21. STORM KERNEL UPGRADE (flow diagram)
     CI/CD process: Git (release info) feeds Jenkins; the RE Jenkins and SD process generates state files for each component and updates git with release info. State files are published in Artifactory and downloaded during the upgrade.
     System upgrades of the Pacemaker, Nimbus, and DRPC nodes.
     Per supervisor: kill the workers and stop the supervisor, reboot the host(s), start the supervisor services, and verify services.
     Run a test/validation topology and audit all components.
     The upgrade fails if more than X supervisors fail to upgrade.
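The "fail if more than X supervisors fail" guard is essentially a running tally around the per-supervisor upgrade. A hedged sketch with hypothetical names (upgrade_one stands in for the kill-workers/stop/reboot/restart sequence):

```python
def rolling_upgrade(supervisors, upgrade_one, max_failures=5):
    """Upgrade supervisors one at a time; abort once failures exceed the cap."""
    failures = []
    for sup in supervisors:
        try:
            upgrade_one(sup)  # kill workers, stop supervisor, reboot, restart
        except Exception:
            failures.append(sup)
            if len(failures) > max_failures:
                raise RuntimeError(f"aborting: {len(failures)} supervisors failed")
    return failures

def flaky(sup):
    """Demo stub: hosts named bad* simulate a failed reboot."""
    if sup.startswith("bad"):
        raise OSError("reboot failed")

print(rolling_upgrade(["s1", "bad1", "s2"], flaky, max_failures=1))  # → ['bad1']
```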
  • 23. TEST RESULTS: MODEL VS RHEL VERSIONS
     We use different configs spanning multiple architectures such as Westmere, Sandy Bridge, Ivy Bridge, Haswell, and Broadwell.
     Each of the configurations was installed with different OS versions and kernel versions:

       OS version               Kernel minor version
       RHEL 6.4                 2.6.32-358
       RHEL 6.6 and RHEL 6.7    2.6.32-432 to 2.6.32-512
       RHEL 6.8                 2.6.32-632
  • 24. MODEL VS RHEL AND KERNEL
     Issues encountered:
      o Slower reboots
      o Boot failures due to iDRAC/IPMI
      o Slowness on some systems
      o Hardware issues
  • 25. KEXEC
     The primary difference between a standard system boot and a kexec boot is that the hardware initialization (POST) normally performed by the BIOS is skipped during a kexec boot, which reduces the time required for a reboot.
     We had approximately 3,000 nodes with the potential to cause issues if we chose a standard system boot; these nodes belonged to a specific config and had a bad history when it came to rebooting.
     We did do a full system reboot, in rolling fashion, after the dirty COW kernel upgrade project was done.
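The kexec path amounts to loading the default kernel image and initrd and jumping straight into them. A sketch that merely composes the load command (the paths are illustrative; the real invocation, driven by grubby and /proc/cmdline, appears in the tool config on slide 18):

```python
import shlex

def kexec_command(kernel: str, initrd: str, cmdline: str) -> str:
    """Compose the kexec load command for a fast (no-POST) reboot.
    kernel/initrd would come from `grubby --default-kernel` and
    `grubby --info`; cmdline from /proc/cmdline."""
    return (f"kexec -l {shlex.quote(kernel)} "
            f"--initrd={shlex.quote(initrd)} "
            f"--command-line={shlex.quote(cmdline)}")

# Illustrative RHEL 6 paths, not the actual fleet's kernel version.
cmd = kexec_command("/boot/vmlinuz-2.6.32-642.el6.x86_64",
                    "/boot/initramfs-2.6.32-642.el6.x86_64.img",
                    "ro root=/dev/sda1 quiet")
print(cmd)
```

After `kexec -l`, a plain `reboot` (with kexec hooks enabled) or `kexec -e` jumps into the staged kernel without re-running POST.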
  • 26. SUCCESS METRICS
     Zero data loss.
     47,000+ nodes upgraded at an extremely fast pace.
     Minimal customer downtime.
     S0 security bug resolved.
     Minimal impact to low-latency services.
     Uncovered multiple system issues: we got an opportunity to upgrade the BIOS and BMC and fix an EDAC issue that was causing system slowness, which resulted in improved system reliability.