SlideShare a Scribd company logo
Hadoop Administration
issues and fixes
InMobi cluster
HDP Distribution - 2.2
450 nodes
5PB data storage
60-75K applications runs everyday
9000 Cores
14 TB of Total RAM
Our Hadoop ecosystem
falcon
oozie
scribe/databus
kafka
httpfs services
zookeeper
aerospike
HardWare Configurations
NameNode
64GB of RAM / 24 core cpu
DataNode
64GB of RAM /32 core/ 6*2TB disks
128GB of RAM/40 core / 10*2TB disks
OS
Ubuntu 12.04
Configuration management
Puppet
system package pushes
system configuration management
hadoop package pushes
hadoop configuration management using templates
Hadoop configurations
1.5GB of RAM and 1 core for a container
Max container limit is 20G - to enable spark jobs
Max vcore allocation is 8 cpu
Using Capacity Scheduler
Node Manager recovery is enabled
CGroups for CPU isolation
jvm configuration
NameNode - 30GB
NodeManager - 2GB
DataNode - 3GB
gc - concurrent mark sweep collector ( CMS)
Alers and Trends
Using Nagios for Alerting.
monitor only master node services eg:
namenode,resourcemanager,historyserver
datanodes/nodemanagers are monitored based on availability ( say 90% of
the nodes are up)
alerts are configured to send to PagerDuty
Alers and Trends [contd]
Using Graphite for Metrics
jmx metrics
use jmxtrans and custom scripts to send metrics to graphite
clusterwide alerts on top on graphite data
Issues And challenges - hadoop
Tasks hogging up whole allocated CPU vcores.
Problem
We have couple of jobs which are very CPU intensive. While these jobs are running, other tasks are starving for
resources, leads to SLA misses.
Reason
with the DefaultContainerExecutor class, there is no limit in using CPU by a single task. Other tasks running on the
node node starve for CPU resource and can lead to long running jobs.
Issues And challenges - hadoop [cntd]
solution
We mitigated this problem by implementing strict CPU limit with cgroups and LinuxContainerExecutor class.
install libcgroup1 package
Issues And challenges - hadoop [cntd]
Job priority
Problem
yarn does not honor mapred job priority ( mapreduce.job.priority VERY_LOW, LOW,HIGH,VERY_HIGH)
Reason
Priorities across application within the same queue is not implemented yet.
Issues And challenges - hadoop [cntd]
Solution
Implemented hierarchical queues within a queue with different thresholds
allocate different proportions to the sub-queues say 30% to report, 20% to report_low and 30%
report_high queue.
Issues And challenges - hadoop [cntd]
Namenode HA failover with ssh fencing
Problem
Active Namenode(Namenode1) server went down physically. We expected the standby to become active. But that
didn’t happen.
Reason
We are using Namenode HA with ssh fencing. zkfc(zookeeper failover controller) does ssh to the active
namenode and make sure the namenode process is killed to avoid any possible chance of split-brain. Since the active
namenode was physically down, the ssh-fencing always returned non-zero status
Issues And challenges - hadoop [cntd]
Solution
1. We added a dummy host in place of the Namenode1 in the hdfs-site.xml. The zkfc running in standby
successfully
2. We can use shell fencing method to return exit status ‘0’
3. You can also use custom scripts with shell fencing method to handle this situation
Issues And challenges - hadoop [cntd]
tasks running slow
problem
Jobs are running extremely slow.
Reason
There can be system issues as well as application issues that can cause this problem.
1. misconfigured nodemanager
The actual cpu available in a nodemanager is 12, but you set
yarn.nodemanager.resource.cpu-vcores as 24.
2. Nodes were configured with 100Mb/s network speed
3. Bad disks
Solution
a. Enabling Speculative execution
cons
Enabling speculative execution can lead to wastage of resources. We
have patched speculative execution, It spawn duplicate tasks only in case absolute necessity.
Issues And challenges - hadoop [cntd]
Issues And challenges - hadoop [cntd]
b. Excluding bad disks from the cluster
We exclude bad disks from a node using puppet automatically.
c. Exclude bad nodes from the cluster
We are using custom healthcheck to blacklist a node if a certain number of applications fails
on this node in a certain amount of time.
i. separate the NodeManager Audit log to a different file using log4j
Issues And challenges - hadoop [cntd]
Then use a custom script to get the number of failures in nodemanager and use that to
blacklist a node.
Issues And challenges - hadoop [cntd]
d. machines with bad hardware configurations
Solution
Use script to validate each nodes
Issues And challenges - hadoop [cntd]
History Server log size.
have seen AM logs with size 1G
controlled AM log size with yarn.app.mapreduce.am.container.log.limit.kb
log aggregation is enabled and logs are kept in HDFS for 3 days
Q&AWe are hiring for rockstars like you. If interested please drop a note at
sivakumar@inmobi.com/srikanth.sundarrajan@inmobi.com

More Related Content

What's hot

Setting High Availability in Hadoop Cluster
Setting High Availability in Hadoop ClusterSetting High Availability in Hadoop Cluster
Setting High Availability in Hadoop Cluster
Edureka!
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configurationprabakaranbrick
 
Learn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node ClusterLearn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node Cluster
Edureka!
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapa
kapa rohit
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)mundlapudi
 
Hadoop installation by santosh nage
Hadoop installation by santosh nageHadoop installation by santosh nage
Hadoop installation by santosh nage
Santosh Nage
 
Introduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache HadoopIntroduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache Hadoop
Cloudera, Inc.
 
Hadoop admin training
Hadoop admin trainingHadoop admin training
Hadoop admin training
Arun Kumar
 
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter SlidesJuly 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
ryancox
 
Upgrading hadoop
Upgrading hadoopUpgrading hadoop
Upgrading hadoop
Shashwat Shriparv
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoop
Shashwat Shriparv
 
Hadoop Tutorial
Hadoop TutorialHadoop Tutorial
Hadoop Tutorial
awesomesos
 
Hadoop single node installation on ubuntu 14
Hadoop single node installation on ubuntu 14Hadoop single node installation on ubuntu 14
Hadoop single node installation on ubuntu 14
jijukjoseph
 
Hadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityHadoop Cluster With High Availability
Hadoop Cluster With High Availability
Edureka!
 
Apache kafka configuration-guide
Apache kafka configuration-guideApache kafka configuration-guide
Apache kafka configuration-guide
Chetan Khatri
 
Hadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programHadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce program
Praveen Kumar Donta
 
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability | Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability | Edureka!
 
Configure h base hadoop and hbase client
Configure h base hadoop and hbase clientConfigure h base hadoop and hbase client
Configure h base hadoop and hbase client
Shashwat Shriparv
 

What's hot (20)

Setting High Availability in Hadoop Cluster
Setting High Availability in Hadoop ClusterSetting High Availability in Hadoop Cluster
Setting High Availability in Hadoop Cluster
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configuration
 
Learn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node ClusterLearn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node Cluster
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapa
 
Hadoop admin
Hadoop adminHadoop admin
Hadoop admin
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)
 
Hadoop installation by santosh nage
Hadoop installation by santosh nageHadoop installation by santosh nage
Hadoop installation by santosh nage
 
Introduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache HadoopIntroduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache Hadoop
 
Hadoop admin training
Hadoop admin trainingHadoop admin training
Hadoop admin training
 
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter SlidesJuly 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
 
Upgrading hadoop
Upgrading hadoopUpgrading hadoop
Upgrading hadoop
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoop
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop Tutorial
Hadoop TutorialHadoop Tutorial
Hadoop Tutorial
 
Hadoop single node installation on ubuntu 14
Hadoop single node installation on ubuntu 14Hadoop single node installation on ubuntu 14
Hadoop single node installation on ubuntu 14
 
Hadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityHadoop Cluster With High Availability
Hadoop Cluster With High Availability
 
Apache kafka configuration-guide
Apache kafka configuration-guideApache kafka configuration-guide
Apache kafka configuration-guide
 
Hadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programHadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce program
 
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability | Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
 
Configure h base hadoop and hbase client
Configure h base hadoop and hbase clientConfigure h base hadoop and hbase client
Configure h base hadoop and hbase client
 

Viewers also liked

Hadoop Administration pdf
Hadoop Administration pdfHadoop Administration pdf
Hadoop Administration pdfEdureka!
 
Bn1028 demo hadoop administration and development
Bn1028 demo  hadoop administration and developmentBn1028 demo  hadoop administration and development
Bn1028 demo hadoop administration and development
conline training
 
Amazon Elastic Computing 2
Amazon Elastic Computing 2Amazon Elastic Computing 2
Amazon Elastic Computing 2
Athanasios Anastasiou
 
Taller hadoop
Taller hadoopTaller hadoop
Taller hadoop
Christian Ariza Porras
 
Big Data and Hadoop - An Introduction
Big Data and Hadoop - An IntroductionBig Data and Hadoop - An Introduction
Big Data and Hadoop - An Introduction
Nagarjuna Kanamarlapudi
 
Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2
Rohit Agrawal
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
Ramesh Pabba - seeking new projects
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
rebeccatho
 
Hadoop Trends
Hadoop TrendsHadoop Trends
Hadoop Trends
Hortonworks
 
Hadoop fault-tolerance
Hadoop fault-toleranceHadoop fault-tolerance
Hadoop fault-tolerance
Ravindra Bandara
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
Mahabubur Rahaman
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduce
fvanvollenhoven
 
Hadoop as data refinery
Hadoop as data refineryHadoop as data refinery
Hadoop as data refinery
Steve Loughran
 
Advanced Security In Hadoop Cluster
Advanced Security In Hadoop ClusterAdvanced Security In Hadoop Cluster
Advanced Security In Hadoop Cluster
Edureka!
 
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop MeetupIntegrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
gethue
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Cloudera, Inc.
 
Introduction To Hadoop Administration - SpringPeople
Introduction To Hadoop Administration - SpringPeopleIntroduction To Hadoop Administration - SpringPeople
Introduction To Hadoop Administration - SpringPeople
SpringPeople
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Jonathan Seidman
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop AdministrationEdureka!
 
Simplified Data Management And Process Scheduling in Hadoop
Simplified Data Management And Process Scheduling in HadoopSimplified Data Management And Process Scheduling in Hadoop
Simplified Data Management And Process Scheduling in Hadoop
GetInData
 

Viewers also liked (20)

Hadoop Administration pdf
Hadoop Administration pdfHadoop Administration pdf
Hadoop Administration pdf
 
Bn1028 demo hadoop administration and development
Bn1028 demo  hadoop administration and developmentBn1028 demo  hadoop administration and development
Bn1028 demo hadoop administration and development
 
Amazon Elastic Computing 2
Amazon Elastic Computing 2Amazon Elastic Computing 2
Amazon Elastic Computing 2
 
Taller hadoop
Taller hadoopTaller hadoop
Taller hadoop
 
Big Data and Hadoop - An Introduction
Big Data and Hadoop - An IntroductionBig Data and Hadoop - An Introduction
Big Data and Hadoop - An Introduction
 
Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Hadoop Trends
Hadoop TrendsHadoop Trends
Hadoop Trends
 
Hadoop fault-tolerance
Hadoop fault-toleranceHadoop fault-tolerance
Hadoop fault-tolerance
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduce
 
Hadoop as data refinery
Hadoop as data refineryHadoop as data refinery
Hadoop as data refinery
 
Advanced Security In Hadoop Cluster
Advanced Security In Hadoop ClusterAdvanced Security In Hadoop Cluster
Advanced Security In Hadoop Cluster
 
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop MeetupIntegrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
 
Introduction To Hadoop Administration - SpringPeople
Introduction To Hadoop Administration - SpringPeopleIntroduction To Hadoop Administration - SpringPeople
Introduction To Hadoop Administration - SpringPeople
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Simplified Data Management And Process Scheduling in Hadoop
Simplified Data Management And Process Scheduling in HadoopSimplified Data Management And Process Scheduling in Hadoop
Simplified Data Management And Process Scheduling in Hadoop
 

Similar to Hadoop administration

Enabling ceph-mgr to control Ceph services via Kubernetes
Enabling ceph-mgr to control Ceph services via KubernetesEnabling ceph-mgr to control Ceph services via Kubernetes
Enabling ceph-mgr to control Ceph services via Kubernetes
mountpoint.io
 
Q2.12: Existing Linux Mechanisms to Support big.LITTLE
Q2.12: Existing Linux Mechanisms to Support big.LITTLEQ2.12: Existing Linux Mechanisms to Support big.LITTLE
Q2.12: Existing Linux Mechanisms to Support big.LITTLE
Linaro
 
Hw09 Production Deep Dive With High Availability
Hw09   Production Deep Dive With High AvailabilityHw09   Production Deep Dive With High Availability
Hw09 Production Deep Dive With High AvailabilityCloudera, Inc.
 
Ceph Day Beijing: CeTune: A Framework of Profile and Tune Ceph Performance
Ceph Day Beijing: CeTune: A Framework of Profile and Tune Ceph Performance Ceph Day Beijing: CeTune: A Framework of Profile and Tune Ceph Performance
Ceph Day Beijing: CeTune: A Framework of Profile and Tune Ceph Performance
Ceph Community
 
Power Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudPower Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS Cloud
Edureka!
 
Hadoop installation with an example
Hadoop installation with an exampleHadoop installation with an example
Hadoop installation with an example
Nikita Kesharwani
 
A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...
A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...
A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...
DataStax
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.
J On The Beach
 
Zoo keeper in the wild
Zoo keeper in the wildZoo keeper in the wild
Zoo keeper in the wild
datamantra
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Impetus Technologies
 
the NML project
the NML projectthe NML project
the NML projectLei Yang
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
Amazon Web Services
 
Linux Crash Dump Capture and Analysis
Linux Crash Dump Capture and AnalysisLinux Crash Dump Capture and Analysis
Linux Crash Dump Capture and Analysis
Paul V. Novarese
 
Advanced Namespaces and cgroups
Advanced Namespaces and cgroupsAdvanced Namespaces and cgroups
Advanced Namespaces and cgroups
Kernel TLV
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
Amazon Web Services
 
Customize and Secure the Runtime and Dependencies of Your Procedural Language...
Customize and Secure the Runtime and Dependencies of Your Procedural Language...Customize and Secure the Runtime and Dependencies of Your Procedural Language...
Customize and Secure the Runtime and Dependencies of Your Procedural Language...
VMware Tanzu
 
Yarns About Yarn
Yarns About YarnYarns About Yarn
Yarns About Yarn
Cloudera, Inc.
 
Hands on Virtualization with Ganeti
Hands on Virtualization with GanetiHands on Virtualization with Ganeti
Hands on Virtualization with Ganeti
OSCON Byrum
 
State of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigDataState of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigData
inside-BigData.com
 
Handling Kernel Upgrades at Scale - The Dirty Cow Story
Handling Kernel Upgrades at Scale - The Dirty Cow StoryHandling Kernel Upgrades at Scale - The Dirty Cow Story
Handling Kernel Upgrades at Scale - The Dirty Cow Story
DataWorks Summit
 

Similar to Hadoop administration (20)

Enabling ceph-mgr to control Ceph services via Kubernetes
Enabling ceph-mgr to control Ceph services via KubernetesEnabling ceph-mgr to control Ceph services via Kubernetes
Enabling ceph-mgr to control Ceph services via Kubernetes
 
Q2.12: Existing Linux Mechanisms to Support big.LITTLE
Q2.12: Existing Linux Mechanisms to Support big.LITTLEQ2.12: Existing Linux Mechanisms to Support big.LITTLE
Q2.12: Existing Linux Mechanisms to Support big.LITTLE
 
Hw09 Production Deep Dive With High Availability
Hw09   Production Deep Dive With High AvailabilityHw09   Production Deep Dive With High Availability
Hw09 Production Deep Dive With High Availability
 
Ceph Day Beijing: CeTune: A Framework of Profile and Tune Ceph Performance
Ceph Day Beijing: CeTune: A Framework of Profile and Tune Ceph Performance Ceph Day Beijing: CeTune: A Framework of Profile and Tune Ceph Performance
Ceph Day Beijing: CeTune: A Framework of Profile and Tune Ceph Performance
 
Power Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudPower Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS Cloud
 
Hadoop installation with an example
Hadoop installation with an exampleHadoop installation with an example
Hadoop installation with an example
 
A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...
A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...
A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.
 
Zoo keeper in the wild
Zoo keeper in the wildZoo keeper in the wild
Zoo keeper in the wild
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
 
the NML project
the NML projectthe NML project
the NML project
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
 
Linux Crash Dump Capture and Analysis
Linux Crash Dump Capture and AnalysisLinux Crash Dump Capture and Analysis
Linux Crash Dump Capture and Analysis
 
Advanced Namespaces and cgroups
Advanced Namespaces and cgroupsAdvanced Namespaces and cgroups
Advanced Namespaces and cgroups
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
 
Customize and Secure the Runtime and Dependencies of Your Procedural Language...
Customize and Secure the Runtime and Dependencies of Your Procedural Language...Customize and Secure the Runtime and Dependencies of Your Procedural Language...
Customize and Secure the Runtime and Dependencies of Your Procedural Language...
 
Yarns About Yarn
Yarns About YarnYarns About Yarn
Yarns About Yarn
 
Hands on Virtualization with Ganeti
Hands on Virtualization with GanetiHands on Virtualization with Ganeti
Hands on Virtualization with Ganeti
 
State of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigDataState of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigData
 
Handling Kernel Upgrades at Scale - The Dirty Cow Story
Handling Kernel Upgrades at Scale - The Dirty Cow StoryHandling Kernel Upgrades at Scale - The Dirty Cow Story
Handling Kernel Upgrades at Scale - The Dirty Cow Story
 

Recently uploaded

一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 

Recently uploaded (20)

一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 

Hadoop administration

  • 2. InMobi cluster HDP Distribution - 2.2 450 nodes 5PB data storage 60-75K applications runs everyday 9000 Cores 14 TB of Total RAM
  • 4. HardWare Configurations NameNode 64GB of RAM / 24 core cpu DataNode 64GB of RAM /32 core/ 6*2TB disks 128GB of RAM/40 core / 10*2TB disks OS Ubuntu 12.04
  • 5. Configuration management Puppet system package pushes system configuration management hadoop package pushes hadoop configuration management using templates
  • 6. Hadoop configurations 1.5GB of RAM and 1 core for a container Max container limit is 20G - to enable spark jobs Max vcore allocation is 8 cpu Using Capacity Scheduler Node Manager recovery is enabled CGroups for CPU isolation
  • 7. jvm configuration NameNode - 30GB NodeManager - 2GB DataNode - 3GB gc - concurrent mark sweep collector ( CMS)
  • 8. Alers and Trends Using Nagios for Alerting. monitor only master node services eg: namenode,resourcemanager,historyserver datanodes/nodemanagers are monitored based on availability ( say 90% of the nodes are up) alerts are configured to send to PagerDuty
  • 9. Alers and Trends [contd] Using Graphite for Metrics jmx metrics use jmxtrans and custom scripts to send metrics to graphite clusterwide alerts on top on graphite data
  • 10. Issues And challenges - hadoop Tasks hogging up whole allocated CPU vcores. Problem We have couple of jobs which are very CPU intensive. While these jobs are running, other tasks are starving for resources, leads to SLA misses. Reason with the DefaultContainerExecutor class, there is no limit in using CPU by a single task. Other tasks running on the node node starve for CPU resource and can lead to long running jobs.
  • 11. Issues And challenges - hadoop [cntd] solution We mitigated this problem by implementing strict CPU limit with cgroups and LinuxContainerExecutor class. install libcgroup1 package
  • 12. Issues And challenges - hadoop [cntd] Job priority Problem yarn does not honor mapred job priority ( mapreduce.job.priority VERY_LOW, LOW,HIGH,VERY_HIGH) Reason Priorities across application within the same queue is not implemented yet.
  • 13. Issues And challenges - hadoop [cntd] Solution Implemented hierarchical queues within a queue with different thresholds allocate different proportions to the sub-queues say 30% to report, 20% to report_low and 30% report_high queue.
  • 14. Issues And challenges - hadoop [cntd] Namenode HA failover with ssh fencing Problem Active Namenode(Namenode1) server went down physically. We expected the standby to become active. But that didn’t happen. Reason We are using Namenode HA with ssh fencing. zkfc(zookeeper failover controller) does ssh to the active namenode and make sure the namenode process is killed to avoid any possible chance of split-brain. Since the active namenode was physically down, the ssh-fencing always returned non-zero status
  • 15. Issues And challenges - hadoop [cntd] Solution 1. We added a dummy host in place of the Namenode1 in the hdfs-site.xml. The zkfc running in standby successfully 2. We can use shell fencing method to return exit status ‘0’ 3. You can also use custom scripts with shell fencing method to handle this situation
  • 16. Issues And challenges - hadoop [cntd] tasks running slow problem Jobs are running extremely slow. Reason There can be system issues as well as application issues that can cause this problem. 1. misconfigured nodemanager The actual cpu available in a nodemanager is 12, but you set yarn.nodemanager.resource.cpu-vcores as 24.
  • 17. 2. Nodes were configured with 100Mb/s network speed 3. Bad disks Solution a. Enabling Speculative execution cons Enabling speculative execution can lead to wastage of resources. We have patched speculative execution, It spawn duplicate tasks only in case absolute necessity. Issues And challenges - hadoop [cntd]
  • 18. Issues And challenges - hadoop [cntd] b. Excluding bad disks from the cluster We exclude bad disks from a node using puppet automatically. c. Exclude bad nodes from the cluster We are using custom healthcheck to blacklist a node if a certain number of applications fails on this node in a certain amount of time. i. separate the NodeManager Audit log to a different file using log4j
  • 19. Issues And challenges - hadoop [cntd] Then use a custom script to get the number of failures in nodemanager and use that to blacklist a node.
  • 20. Issues And challenges - hadoop [cntd] d. machines with bad hardware configurations Solution Use script to validate each nodes
  • 21. Issues And challenges - hadoop [cntd] History Server log size. have seen AM logs with size 1G controlled AM log size with yarn.app.mapreduce.am.container.log.limit.kb log aggregation is enabled and logs are kept in HDFS for 3 days
  • 22. Q&AWe are hiring for rockstars like you. If interested please drop a note at sivakumar@inmobi.com/srikanth.sundarrajan@inmobi.com