Hadoop 0.20.2 to 2.0
Jabir Ahmed
https://twitter.com/jabirahmed
https://www.linkedin.com/in/jabirahmed
Why Hadoop 2.0?
• New features
  ‣ HA Namenode
  ‣ YARN
• Bug fixes & performance improvements
• Keeping pace with the community and being ready to adopt technologies that are being built rapidly on top of Hadoop
Hadoop Usage @ InMobi
• Analytics & reporting
• Data streaming via HDFS
• Ad-hoc querying / modeling
• Real-time data processing
Hadoop Eco System
• HDFS & MRv1
• Falcon
  ‣ ActiveMQ
• HBase
• Conduit
  ‣ Scribe
  ‣ Conduit Worker
  ‣ Pintail
• Zookeeper
• Oozie
• WebHDFS
• Pig
• Hive
• HCatalog & Metastore
Cluster Topology
• 5 production-quality clusters spread across 5 co-locations
• 30 to 150 node clusters
• Largest is over 1 petabyte; average is 500 TB
• 200,000+ jobs per day
• 6 TB of data generated every day
• 10,000,000,000 events generated per day (10 billion)
Clusters in InMobi
[Diagram: a centralized cluster, with co-located clusters in each region feeding into it]
Upgraded Components

Component     Old version   New version   Other changes
HDFS          0.20.2        2.0
Job-tracker   0.20.2        2.0
Oozie         3.3.2         3.3.2         Recompiled
HBase         0.90.6        0.94.6
WebHDFS       -NA-          0.0.3         Recompiled internally
Falcon        0.2.3         0.4.5
Pig           0.8.1         0.11.0
Zookeeper     3.3.4         3.4.5
Conduit                                   Recompiled
Challenges
1. Configuration management
   1. Heterogeneous clusters
   2. Host-level configurations were really hard to manage
2. Data movement had to continue between clusters which could/would run different versions of Hadoop
3. All applications had to be seamlessly migrated with the least downtime & NO failures
4. Capacity challenges
   1. Network challenges
   2. Hardware limitations
   3. Storage & computation limitations
5. Expected uptime for each of the clusters is over 99%, which meant we couldn't keep a cluster down for a long upgrade
6. Roll back was not possible
How we overcame the challenges
1. Configuration Management
Problem
‣ We had configurations in Debian packages like
  ‣ Cluster_Conf_version_1.deb
  ‣ Cluster_conf_version_2.deb and so on
‣ For 5 clusters and 10 components we would manage a lot of debs, each with 15-20 confs
‣ Changing a property value across the cluster was time consuming
Packages & configurations
‣ Since host-specific configurations were really hard to manage, we deprecated the Debian packages
‣ Moved the entire package & configuration management to Puppet
Advantages
‣ Under 10 files to manage
‣ Everything was managed via templates, and only host/component/cluster-specific variables had to be set appropriately
‣ Verification & confidence were higher with Puppet while deploying changes in production
1.1 Sample Puppet configuration

Template:

<% mapred_queues.each_pair do |key,value| -%>
  <!-- setting for queue <%= key %> -->
  <property>
    <name>mapred.capacity-scheduler.queue.<%= key %>.capacity</name>
    <value><%= value %></value>
  </property>
<% end -%>

Actual values:

$mapred_queues = {
  "reporting" => 25,
  "analytics" => 12,
  "default"   => 21,
  "Hourly"    => 14,
  "daily"     => 13,
  .......
}

Apply template:

file { "${conf_location}/capacity-scheduler.xml":
  ensure  => present,
  owner   => root,
  group   => root,
  mode    => 644,
  content => template('grid/hadoop2.0/hadoopCoreConfig/capacity-scheduler.xml');
}
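Given the sample queue values above, the template would render a capacity-scheduler.xml containing entries like the following (a sketch; only the first two queues are shown):

```xml
<!-- setting for queue reporting -->
<property>
  <name>mapred.capacity-scheduler.queue.reporting.capacity</name>
  <value>25</value>
</property>
<!-- setting for queue analytics -->
<property>
  <name>mapred.capacity-scheduler.queue.analytics.capacity</name>
  <value>12</value>
</property>
```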
2. Data Movement

• distcp across clusters was not possible with the standard hdfs & hftp protocols, so we had to use WebHDFS
  ‣ Code was patched to allow only reads
• All applications had to change to pull data from other clusters
• All applications & Falcon feed/data replications had to be tested & migrated to WebHDFS
• Since WebHDFS was a SPOF, it had to be made scalable & highly available
• All clients reading from HDFS also had to upgrade libraries
  ‣ Ensured all stacks were compatible to read from the upgraded HDFS
• Some applications like Falcon & Conduit had to be enhanced to use the WebHDFS protocol as a pre-requisite
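A minimal sketch of reading a file through the WebHDFS REST API (`/webhdfs/v1/<path>?op=OPEN`), the protocol the applications were migrated to. The host, port, path, and user below are illustrative placeholders, not InMobi's actual cluster names:

```python
# Sketch: reading a file over WebHDFS. The namenode answers OPEN with a
# redirect to a datanode, which urllib follows automatically.
from urllib.parse import urlencode
import urllib.request


def webhdfs_open_url(host: str, port: int, path: str, user: str) -> str:
    """Build the URL for the WebHDFS OPEN (read) operation."""
    query = urlencode({"op": "OPEN", "user.name": user})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"


def read_file(host: str, port: int, path: str, user: str) -> bytes:
    with urllib.request.urlopen(webhdfs_open_url(host, port, path, user)) as r:
        return r.read()


# Example URL for a hypothetical feed file:
# webhdfs_open_url("nn.cluster1", 50070, "/data/feed/part-00000", "falcon")
```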
3. Application Challenges
• 2 versions of applications had to be maintained
  ‣ One for 0.20 and the other for 2.0
  ‣ To avoid disruption to current business & respective development
• The staging cluster had to be rebuilt to run 2 versions of Hadoop for pre-prod testing, validation and sign-off
• A lot of applications had to be made compatible, since some functions & classes were deprecated in 2.0
• A few class-path changes were identified in pre-prod testing
4. Capacity Challenges
• Capacity was a limitation: our headroom in other co-locations was only 30%, but we were flipping 100% of traffic from one region to another
• Network & infra challenges
  ‣ N/W bandwidth and latency, to avoid delays in data movement
  ‣ Other stacks also had to check for capacity while we did a failover for the upgrade
• Ensuring we had enough capacity in the other clusters to process data while meeting SLAs
  ‣ Added physical nodes to existing clusters & dependent stacks where required
  ‣ Added more Conduit/Scribe agents to handle the increase in traffic during the upgrade
5. Deployment & Upgrade
• Rolling upgrade of clusters
  ‣ The GSLB was changed to redirect traffic to the closest region
  ‣ Had to ensure latencies were met as per the business requirement
• Maintenance was scheduled at a time when the impact was least
  ‣ The time chosen was when the # of requests was lowest for the specific region, to ensure we didn't impact performance and also didn't require 100% capacity in the failed-over region
  ‣ Data was processed on another cluster for business continuity
• Since datanode upgrade time depends on the number of blocks on the datanode, we cleaned up HDFS to reduce the blocks on the datanodes, which eventually helped expedite the upgrade
• Upgraded components in parallel where there was no dependency
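Upgrading independent components in parallel amounts to batching a dependency graph by levels. A small sketch with Python's standard `graphlib`; the dependency edges below are illustrative, not the exact ones used during this upgrade:

```python
# Sketch: derive which components can be upgraded in parallel from a
# dependency graph (hypothetical edges). Each printed batch is safe to
# upgrade concurrently once the previous batch is done.
from graphlib import TopologicalSorter

deps = {
    "jobtracker": {"hdfs"},            # MR layer needs HDFS up first
    "hbase": {"hdfs", "zookeeper"},
    "oozie": {"jobtracker"},
    "webhdfs": {"hdfs"},
    "falcon": {"oozie", "webhdfs"},
    "conduit": {"hdfs"},
}

ts = TopologicalSorter(deps)
ts.prepare()
while ts.is_active():
    batch = sorted(ts.get_ready())     # all members are mutually independent
    print(batch)
    ts.done(*batch)
```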
5.1 Deployment Sequence

Component                                   Time taken
HDFS & Datanodes                            4 hours
JT & Tasktrackers                           45 minutes
Zookeeper, HBase Master & Region Servers    45 minutes
Oozie                                       15 minutes
WebHDFS                                     < 15 minutes
Falcon                                      < 10 minutes
Conduit                                     30 minutes
6. Monitoring & Metrics
• Most of the Nagios checks and metrics collected in Ganglia remained the same
New monitoring
• Monitoring for new services like WebHDFS had to be added
Monitoring changes
• Nagios had minor changes to monitor the edit logs, since they had changed from
  ‣ edits & edits.new to
  ‣ edits_0000000000357071448-0000000000357619117
  ‣ edits_inprogress_0000000000358739241
Ganglia metrics
• Ganglia was overwhelmed by the new RegionServer metrics, so we had to patch it to skip sending metrics that weren't required
  ‣ A custom filter was written to drop events that were not used
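The custom metric filter boils down to dropping metric names that match a deny-list before they are emitted. A minimal sketch; the patterns below are hypothetical, not the ones actually patched into Ganglia:

```python
# Sketch: filtering out unneeded per-region HBase metrics before emitting
# them to Ganglia. DROP_PATTERNS is illustrative only.
import re

DROP_PATTERNS = [
    re.compile(r"^hbase\.regionserver\.region\."),  # per-region metric spam
    re.compile(r"\.percentile_9[59]$"),             # percentiles nobody charts
]


def keep_metric(name: str) -> bool:
    """Return True if the metric should still be sent to Ganglia."""
    return not any(p.search(name) for p in DROP_PATTERNS)


def filter_metrics(names):
    return [n for n in names if keep_metric(n)]
```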
Issues / Bugs encountered
• The JobTracker had a memory leak and had to be restarted once every 3-4 days
  ‣ https://issues.apache.org/jira/browse/MAPREDUCE-5508
• HBase started emitting 1000s of metrics per table, bringing Ganglia down, and we had to patch it internally to fix it
Learning & Best Practices
• One step at a time
  ‣ We didn't want to do a lot of things in one go, so we took small steps and at the end achieved the goal
• Team work works!
  ‣ It's really hard to do this as a "one-man show"; we noticed an immense sense of trust and responsibility in the team during the entire process
• Every mistake was a learning
  ‣ A mistake made in the early stages was not a reason to blame each other; we went ahead and fixed it, ensuring it didn't happen again
• Finally
  ‣ The upgrades were smooth!
Must Know Postgres Extension for DBA and Developer during Migration
 
"What does it really mean for your system to be available, or how to define w...
"What does it really mean for your system to be available, or how to define w..."What does it really mean for your system to be available, or how to define w...
"What does it really mean for your system to be available, or how to define w...
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
 
Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 

Hadoop Migration from 0.20.2 to 2.0

  • 1. Hadoop 0.20.2 to 2.0 Jabir Ahmed https://twitter.com/jabirahmed https://www.linkedin.com/in/jabirahmed
  • 2. • New features • HA Namenode • YARN • Bug fixes & performance improvements • Keeping pace with the community and staying ready to adopt technologies being built rapidly on top of Hadoop Why Hadoop 2.0?
  • 4. • HDFS & MRV1 • Falcon ‣ Active MQ • HBase • Conduit ‣ Scribe ‣ Conduit Worker ‣ Pintail Hadoop Eco System • Zoo-keeper • Oozie • WebHDFS • Pig • Hive • Hcatalog & Metastore
  • 6. • 5 production-quality clusters spread across 5 co-locations • 30 to 150 node clusters • Largest is over 1 petabyte • Average 500 TB • 200,000+ jobs per day • 6 TB of data generated every day • 10,000,000,000 events generated a day (10 billion) Clusters in InMobi
  • 7. Cluster topology Centralized Cluster Co-located Clusters Co-located Clusters
  • 8. Upgraded Components
Component | Old version | New version | Other changes
HDFS | 0.20.2 | 2.0 |
Job-tracker | 0.20.2 | 2.0 |
Oozie | 3.3.2 | 3.3.2 | Recompiled
Hbase | 0.90.6 | 0.94.6 |
Webhdfs | -NA- | 0.0.3 | Re-compiled internally
Falcon | 0.2.3 | 0.4.5 |
Pig | 0.8.1 | 0.11.0 |
Zookeeper | 3.3.4 | 3.4.5 |
Conduit | | | Recompiled
  • 9. Challenges
1. Configuration management: heterogeneous clusters; host-level configurations were really hard to manage
2. Data movement had to continue between clusters that could/would run different versions of Hadoop
3. All applications had to be seamlessly migrated with least downtime & NO failures
4. Capacity challenges: network challenges; hardware limitations; storage & computation limitations
5. Expected uptime for each of the clusters is over 99%, which meant we couldn't keep the cluster down for upgrade for a long time
6. Roll back was not possible
  • 10. How we overcame the challenges
  • 11. 1. Configuration Management
Problem
‣ We had configurations in debians like Cluster_Conf_version_1.deb, Cluster_conf_version_2.deb and so on
‣ For 5 clusters and 10 components we would manage a lot of debs, each with 15-20 confs
‣ Changing a property value across the cluster was time consuming
Packages & configurations
‣ Since host-specific configurations were really hard to manage, we deprecated debians
‣ Moved the entire package & configuration management to puppet
Advantages
‣ Under 10 files to manage
‣ Everything was managed via templates, and only host/component/cluster-specific variables had to be set appropriately
‣ Verification & confidence were higher with puppet while deploying changes in production
  • 12. 1.1 Sample puppet configuration
Template:
<% mapred_queues.each_pair do |key,value| -%>
<!-- setting for queue <%= key %> -->
<property>
  <name>mapred.capacity-scheduler.queue.<%= key %>.capacity</name>
  <value><%= value %></value>
</property>
<% end -%>
Actual values:
$mapred_queues = {
  "reporting" => 25,
  "analytics" => 12,
  "default"   => 21,
  "Hourly"    => 14,
  "daily"     => 13,
  .......
}
Apply template:
file { "${conf_location}/capacity-scheduler.xml":
  ensure  => present,
  owner   => root,
  group   => root,
  mode    => 644,
  content => template('grid/hadoop2.0/hadoopCoreConfig/capacity-scheduler.xml');
}
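To make the slide's intent concrete, here is a minimal Python sketch of what the ERB template above expands to: one capacity-scheduler property block per queue. The queue names and values mirror the slide; the function name is our own illustration, not part of the deck's tooling.

```python
def render_queue_capacities(queues):
    """Render one capacity-scheduler <property> block per queue,
    mimicking the ERB template's each_pair loop."""
    blocks = []
    for name, capacity in sorted(queues.items()):
        blocks.append(
            "<property>\n"
            "  <name>mapred.capacity-scheduler.queue.%s.capacity</name>\n"
            "  <value>%d</value>\n"
            "</property>" % (name, capacity)
        )
    return "\n".join(blocks)

print(render_queue_capacities({"reporting": 25, "analytics": 12}))
```

Rendering from a single hash like this is exactly why the puppet approach shrank "15-20 confs per deb" to under 10 template files.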
  • 13. 2. Data Movement
• All applications had to change to pull the data from other clusters
• distcp across clusters was not possible with the standard hdfs & hftp protocols, so we had to use Web-HDFS
• Code was patched to allow only reads
• All applications & Falcon feeds/data replications had to be tested & migrated to web-hdfs
• Since web-hdfs was a SPOF, it had to be made scalable & highly available
• All clients reading from HDFS also had to upgrade libraries
• Ensured all stacks were compatible to read from the upgraded HDFS
• Some applications like Falcon & Conduit had to be enhanced to use the webhdfs protocol as a pre-requisite
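The reason WebHDFS works across Hadoop versions is that it exposes HDFS over plain HTTP (op=OPEN for reads), so clients don't need matching RPC libraries. A small sketch of building such a read URL; the hostname, port, and user below are hypothetical, while the /webhdfs/v1 path and op=OPEN parameter are the standard WebHDFS REST form.

```python
def webhdfs_open_url(host, port, path, user):
    """Build a WebHDFS v1 read URL (op=OPEN). Because this is plain
    HTTP, a 0.20-era client can read from an upgraded 2.0 cluster."""
    return "http://%s:%d/webhdfs/v1%s?op=OPEN&user.name=%s" % (
        host, port, path, user)

print(webhdfs_open_url("nn.example.com", 50070, "/data/events/part-0", "etl"))
```

In the same spirit, cross-cluster copies can use the webhdfs:// scheme with distcp instead of the version-locked hdfs/hftp protocols.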
  • 14. 3. Application Challenges • 2 versions of applications had to be maintained ‣ One for 0.20 and the other for 2.0 ‣ To avoid disruption in current business & respective development • Staging cluster had to be rebuilt to run 2 versions of Hadoop for pre-prod testing, validation and sign-off • A lot of applications had to be made compatible since some functions & classes were deprecated in 2.0 • Few classpath changes were identified in pre-prod testing
  • 15. 4. Capacity Challenges Capacity was a limitation: our headroom in other co-locations was only 30%, but we were flipping 100% of traffic from one region to another • Network & infra challenges ‣ Network bandwidth & latency, to avoid delays in data movement ‣ Other stacks also had to check for capacity while we did a failover for the upgrade • Ensuring we had enough capacity in the other cluster to process data while meeting SLAs ‣ Added physical nodes to existing clusters & dependent stacks where required ‣ Added more conduit/scribe agents to handle the increase in traffic during the upgrade
  • 16. 5. Deployment & Upgrade • Rolling upgrade of clusters • The GSLB was changed to redirect traffic to the closest region • Had to ensure latencies were met as per the business requirement • Maintenance was scheduled at a time when the impact was least • The time chosen was when the number of requests was the least for the specific region, to ensure we didn't impact performance and didn't require 100% capacity in the failed-over region • Data was processed on another cluster to have business continuity • Since datanode upgrade time depends on the number of blocks on the datanode, we cleaned up HDFS to reduce the blocks on the datanodes, which eventually helped expedite the upgrade process • Upgraded components in parallel where there was no dependency
  • 17. 5.1 Deployment Sequence
HDFS Datanodes: 4 hours
JT Tasktrackers: 45 minutes
Zookeeper, Hbase Master, Region Servers: 45 minutes
Oozie: 15 minutes
WebHDFS: < 15 minutes
Falcon: < 10 minutes
Conduit: 30 minutes
  • 18. 6. Monitoring & Metrics Most of the nagios checks and metrics collected in ganglia remained the same New monitoring • Monitoring for new services like web-hdfs had to be added Monitoring changes • Nagios had minor changes to monitor the edit logs, since their naming had changed from ‣ edits & edits.new to ‣ edits_0000000000357071448-0000000000357619117 and edits_inprogress_0000000000358739241 Ganglia metrics • Ganglia was overwhelmed with the new RS metrics, so we had to patch it to skip sending metrics that weren't required ‣ A custom filter was written to filter events that were not used
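The edit-log rename above means a check can no longer look for a fixed edits.new file; it has to find the current edits_inprogress_* segment. A minimal sketch of that lookup, assuming a 2.0-style name directory layout (the directory path is whatever the check is pointed at; a real nagios check would also alert if the newest segment is stale):

```python
import glob
import os

def current_edits_segment(current_dir):
    """Return the path of the newest edits_inprogress_* file in a
    namenode's current/ directory, or None if no segment is open."""
    candidates = glob.glob(os.path.join(current_dir, "edits_inprogress_*"))
    if not candidates:
        return None
    # The in-progress segment is the most recently modified file.
    return max(candidates, key=os.path.getmtime)
```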
  • 19. Issues / Bugs encountered • The job tracker had a memory leak and had to be restarted once every 3-4 days • https://issues.apache.org/jira/browse/MAPREDUCE-5508 • Hbase started emitting 1000's of metrics per table, bringing ganglia down, & we had to patch it internally to fix it
  • 20. Learning & Best Practices • One step at a time • We didn't want to do a lot of things in one go, so we took small steps and in the end achieved the goal • Team work works! • It's really hard to do this as a "one man show"; we noticed an immense sense of trust and responsibility in the team during the entire process • Every mistake was a learning • A mistake made in the initial stages was not a reason to blame each other; we went ahead and fixed it, ensuring it didn't happen again • Finally • The upgrades were smooth!

Editor's Notes

  1. We broke the upgrade into 2 phases: 1. HDFS upgrade & HA 2. YARN
  2. ----- Meeting Notes (06/05/14 00:08) ----- Every procurement had some slight variations in terms of specs: some had more RAM, some had 6 disks and some 12 disks, CPU cores were different. Roll back was not possible. Failures in production were a strict no-no. QA test cases were re-verified.
  3. Hardware specs kept changing with every new procurement. Some had more RAM, some had more disks, etc. Since we grew significantly in terms of number of servers, debians were a tech debt, and we took this as an opportunity to fix it. ----- Meeting Notes (06/05/14 00:23) ----- Verification of configs post installation was simpler.
  4. We built a separate module for the configuration that could be used across all clusters going forward. All the properties were verified; deprecated properties were retained along with the new properties to avoid any failures, just in case they were being used anywhere. This significantly reduced the time we took post installation to validate that the configurations were correct, since it was all centralized.
  5. 5-10 pipelines, each with multiple jobs. QA effort: test cases, data validation. Every bug had to be fixed & merged in two places. Deployment challenges.