This document discusses the Hadoop cluster configuration at InMobi. It includes details about the cluster hardware: 450 nodes and 5 PB of storage. It also describes the software stack, including Hadoop, Falcon, Oozie, Kafka, and monitoring tools such as Nagios and Graphite. The document then outlines common issues faced, such as tasks hogging CPU resources, and the solutions implemented, such as cgroup resource limits. It provides examples of NameNode HA failover challenges and approaches to addressing slow-running jobs.
6. Hadoop configurations
1.5 GB of RAM and 1 core for a container
Max container limit is 20 GB, to enable Spark jobs
Max vcore allocation is 8 CPUs
Using the Capacity Scheduler
NodeManager recovery is enabled
CGroups for CPU isolation
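The numbers above can be sketched as a yarn-site.xml fragment. This is a hedged sketch: the property names are the standard YARN ones and the values are taken from the slide; the deck does not show InMobi's actual configuration files.

```xml
<!-- yarn-site.xml (sketch; values from the slide, not the actual cluster files) -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1536</value>   <!-- 1.5 GB RAM per container -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>20480</value>  <!-- 20 GB max container, to enable Spark jobs -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>8</value>      <!-- max vcore allocation -->
</property>
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>   <!-- NodeManager recovery -->
</property>
```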
8. Alerts and Trends
Using Nagios for alerting
Monitor only master-node services, e.g. NameNode, ResourceManager, HistoryServer
DataNodes/NodeManagers are monitored based on availability (say, 90% of the nodes are up)
Alerts are configured to be sent to PagerDuty
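A master-node check of the kind described might look like the following Nagios object definition. This is a sketch: the host name, port, and contact group are hypothetical, not taken from the deck.

```
define service {
    use                   generic-service
    host_name             namenode1           ; hypothetical master host
    service_description   NameNode RPC port
    check_command         check_tcp!8020      ; default NameNode RPC port
    contact_groups        pagerduty           ; routed to PagerDuty
}
```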
9. Alerts and Trends [contd]
Using Graphite for metrics
JMX metrics
Use jmxtrans and custom scripts to send metrics to Graphite
Cluster-wide alerts on top of Graphite data
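A jmxtrans configuration of the kind the slide describes could look like this. A sketch only: the hosts, JMX port, and chosen MBean/attributes are assumptions; the GraphiteWriter class name is jmxtrans's standard output writer.

```json
{
  "servers": [{
    "host": "namenode1",
    "port": "8004",
    "queries": [{
      "obj": "Hadoop:service=NameNode,name=FSNamesystemState",
      "attr": ["CapacityUsed", "UnderReplicatedBlocks", "NumLiveDataNodes"],
      "outputWriters": [{
        "@class": "com.googlecode.jmxtrans.model.output.GraphiteWriter",
        "settings": { "host": "graphite.example.com", "port": 2003 }
      }]
    }]
  }]
}
```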
10. Issues And challenges - hadoop
Tasks hogging all allocated CPU vcores
Problem
We have a couple of jobs that are very CPU intensive. While these jobs are running, other tasks starve for resources, which leads to SLA misses.
Reason
With the DefaultContainerExecutor class, there is no limit on CPU usage by a single task. Other tasks running on the same node starve for CPU, which can lead to long-running jobs.
11. Issues And challenges - hadoop [contd]
Solution
We mitigated this problem by enforcing strict CPU limits with cgroups and the LinuxContainerExecutor class.
Install the libcgroup1 package
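The cgroups plus LinuxContainerExecutor setup maps to yarn-site.xml properties like these. The property names are the standard YARN ones; the hierarchy path and flag values shown are illustrative, not InMobi's actual settings.

```xml
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
  <value>/hadoop-yarn</value>
</property>
<property>
  <!-- hard-caps each container at its allocated vcores -->
  <name>yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage</name>
  <value>true</value>
</property>
```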
12. Issues And challenges - hadoop [contd]
Job priority
Problem
YARN does not honor MapReduce job priority (mapreduce.job.priority: VERY_LOW, LOW, HIGH, VERY_HIGH)
Reason
Priorities across applications within the same queue are not implemented yet.
13. Issues And challenges - hadoop [contd]
Solution
Implemented hierarchical queues within a queue with different thresholds
Allocate different proportions to the sub-queues: say, 30% to the report queue, 20% to report_low, and 30% to report_high.
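In capacity-scheduler.xml, sub-queues like these could express that split. A sketch: the parent queue name and the remainder queue are hypothetical, and note that the Capacity Scheduler requires sibling capacities to sum to 100, so the slide's 30/20/30 needs either a fourth queue (shown below) or rescaling.

```xml
<property>
  <name>yarn.scheduler.capacity.root.reports.queues</name>
  <value>report,report_low,report_high,report_default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.reports.report.capacity</name>
  <value>30</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.reports.report_low.capacity</name>
  <value>20</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.reports.report_high.capacity</name>
  <value>30</value>
</property>
<property>
  <!-- hypothetical remainder queue so sibling capacities total 100 -->
  <name>yarn.scheduler.capacity.root.reports.report_default.capacity</name>
  <value>20</value>
</property>
```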
14. Issues And challenges - hadoop [contd]
Namenode HA failover with ssh fencing
Problem
The active NameNode (Namenode1) server went down physically. We expected the standby to become active, but that didn't happen.
Reason
We are using NameNode HA with ssh fencing. ZKFC (ZooKeeper Failover Controller) connects to the active NameNode over SSH and makes sure the NameNode process is killed, to avoid any possible chance of split-brain. Since the active NameNode was physically down, ssh fencing always returned a non-zero status, so the failover did not proceed.
15. Issues And challenges - hadoop [contd]
Solution
1. We added a dummy host in place of Namenode1 in hdfs-site.xml; the ZKFC running on the standby then fenced successfully
2. We can use the shell fencing method to return exit status '0'
3. You can also use custom scripts with the shell fencing method to handle this situation
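Options 2 and 3 map to the standard dfs.ha.fencing.methods setting in hdfs-site.xml. A common pattern is to list sshfence first and fall back to a shell method that always succeeds when the old active host is unreachable:

```xml
<!-- hdfs-site.xml: try sshfence first; if the host is dead and SSH fails,
     fall back to a shell method that always returns 0 -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/bin/true)</value>
</property>
```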
16. Issues And challenges - hadoop [contd]
Tasks running slow
Problem
Jobs are running extremely slow.
Reason
There can be system issues as well as application issues that cause this problem.
1. Misconfigured NodeManager
The actual CPU count available on a NodeManager is 12, but yarn.nodemanager.resource.cpu-vcores is set to 24.
17. Issues And challenges - hadoop [contd]
2. Nodes were configured with 100 Mb/s network speed
3. Bad disks
Solution
a. Enabling speculative execution
Cons
Enabling speculative execution can lead to wasted resources. We have patched speculative execution so that it spawns duplicate tasks only in cases of absolute necessity.
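Stock speculative execution (before any custom patching like the above) is toggled with the standard mapred-site.xml properties:

```xml
<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
</property>
```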
18. Issues And challenges - hadoop [contd]
b. Excluding bad disks from the cluster
We exclude bad disks from a node automatically using Puppet.
c. Excluding bad nodes from the cluster
We use a custom health check to blacklist a node if a certain number of applications fail on the node within a certain amount of time.
i. Separate the NodeManager audit log into a different file using log4j
19. Issues And challenges - hadoop [contd]
Then use a custom script to get the number of failures from the NodeManager audit log and use that to blacklist the node.
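A health-check script of this kind could be sketched in Python as follows. This is a hypothetical sketch, not InMobi's actual script: the audit-log path, the "FAILED" marker, and the threshold are assumptions. The one YARN convention it relies on is that a health-check script (wired up via yarn.nodemanager.health-checker.script.path) marks the node unhealthy when its output starts with "ERROR".

```python
#!/usr/bin/env python3
"""Hypothetical NodeManager health-check sketch.

Assumes the NodeManager audit log has been split to its own file via
log4j and that failed-container lines contain the word "FAILED".
"""
import sys

FAILURE_THRESHOLD = 5    # blacklist after this many recent failures (illustrative)
RECENT_LINES = 1000      # only inspect the tail of the audit log

def count_recent_failures(lines, marker="FAILED"):
    """Count failure markers in the most recent audit-log lines."""
    return sum(1 for line in lines[-RECENT_LINES:] if marker in line)

def main(audit_log_path):
    try:
        with open(audit_log_path) as f:
            lines = f.readlines()
    except FileNotFoundError:
        lines = []
    failures = count_recent_failures(lines)
    if failures >= FAILURE_THRESHOLD:
        # YARN treats health-check output beginning with "ERROR" as unhealthy
        print(f"ERROR: {failures} recent container failures on this node")
    else:
        print("OK")

if __name__ == "__main__":
    # hypothetical default log path
    main(sys.argv[1] if len(sys.argv) > 1 else
         "/var/log/hadoop-yarn/nodemanager-audit.log")
```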
20. Issues And challenges - hadoop [contd]
d. Machines with bad hardware configurations
Solution
Use a script to validate each node
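Such a validation script might check, per node, the mismatches called out on the previous slides (vcore over-subscription, slow NICs). A hypothetical sketch: the thresholds and the idea of passing in the measured link speed are assumptions, not the deck's actual script.

```python
#!/usr/bin/env python3
"""Hypothetical per-node hardware validation sketch."""
import os

def validate_node(configured_vcores, link_speed_mbps, min_link_mbps=1000):
    """Return a list of problems found on this node (empty list = healthy)."""
    problems = []
    actual_cpus = os.cpu_count() or 0
    # flag NodeManagers whose vcore setting exceeds the physical CPU count
    if configured_vcores > actual_cpus:
        problems.append(
            f"yarn.nodemanager.resource.cpu-vcores={configured_vcores} "
            f"but only {actual_cpus} CPUs present")
    # flag nodes still negotiating a slow link (e.g. 100 Mb/s)
    if link_speed_mbps < min_link_mbps:
        problems.append(
            f"link speed {link_speed_mbps} Mb/s below {min_link_mbps} Mb/s")
    return problems

if __name__ == "__main__":
    # illustrative invocation with made-up measurements
    for issue in validate_node(configured_vcores=24, link_speed_mbps=100):
        print("WARN:", issue)
```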
21. Issues And challenges - hadoop [contd]
History Server log size
We have seen AM logs with a size of 1 GB
Controlled AM log size with yarn.app.mapreduce.am.container.log.limit.kb
Log aggregation is enabled and logs are kept in HDFS for 3 days
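These knobs correspond to standard Hadoop properties. The log-limit value below is illustrative (the deck gives no number); 259200 seconds is the 3-day retention mentioned.

```xml
<property>
  <name>yarn.app.mapreduce.am.container.log.limit.kb</name>
  <value>102400</value>  <!-- cap AM logs at ~100 MB (illustrative value) -->
</property>
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>259200</value>  <!-- 3 days -->
</property>
```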
22. Q&A
We are hiring for rockstars like you. If interested, please drop a note at sivakumar@inmobi.com / srikanth.sundarrajan@inmobi.com