Hadoop admin interview questions and answers
Which operating system(s) are supported for production Hadoop deployment?
The main supported operating system is Linux. However, with some additional software Hadoop can be
deployed on Windows.
What is the role of the namenode?
The namenode is the "brain" of the Hadoop cluster and is responsible for managing the distribution of blocks across the system based on the replication policy. The namenode also supplies the specific block locations to clients in response to their requests.
What happens on the namenode when a client tries to read a data file?
The namenode will look up the information about the file in the edit log and then retrieve the remaining information from the in-memory filesystem snapshot. Since the namenode needs to support a large number of clients, it only sends back the locations of the data. The datanode itself is responsible for the retrieval.
What are the hardware requirements for a Hadoop cluster (primary and secondary namenodes and datanodes)?
There are no special requirements for datanodes. However, the namenodes require a specified amount of RAM to store the filesystem image in memory. Based on the design of the primary namenode and secondary namenode, the entire filesystem information will be stored in memory. Therefore, both namenodes need enough memory to contain the entire filesystem image.
What mode(s) can Hadoop code be run in?
Hadoop can be deployed in standalone mode, pseudo-distributed mode or fully-distributed mode. Hadoop was specifically designed to be deployed on a multi-node cluster. However, it can also be deployed on a single machine, as a single process, for testing purposes.
How would a Hadoop administrator deploy the various components of Hadoop in production?
Deploy the namenode and jobtracker on the master node, and deploy datanodes and tasktrackers on multiple slave nodes. There is a need for only one namenode and jobtracker on the system. The number of datanodes depends on the available hardware.
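As an illustration, in a Hadoop 1.x installation the host assignment is driven by two plain-text files in the conf directory that the start scripts read; the hostnames below are hypothetical:

  # conf/masters - host(s) where the secondary namenode is started
  snn.example.com

  # conf/slaves - hosts where a datanode and tasktracker are started
  slave01.example.com
  slave02.example.com
  slave03.example.com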
What is the best practice to deploy the secondary namenode?
Deploy the secondary namenode on a separate standalone machine. The secondary namenode needs to be deployed on a separate machine so that it does not interfere with primary namenode operations. The secondary namenode has the same memory requirements as the primary namenode.
Is there a standard procedure to deploy Hadoop?
No, there are some differences between the various distributions. However, they all require that the Hadoop jars be installed on the machine. There are some common requirements for all Hadoop distributions, but the specific procedures will differ between vendors since they all include some degree of proprietary software.
What is the role of the secondary namenode?
The secondary namenode performs the CPU-intensive operation of combining the edit log with the current filesystem snapshot. It was separated out into its own process because of this CPU-intensive work and the additional requirement for a metadata backup.
What are the side effects of not running a secondary namenode?
The cluster performance will degrade over time since the edit log will grow bigger and bigger. If the secondary namenode is not running at all, the edit log will grow significantly and it will slow the system down. Also, the system will go into safemode for an extended time at startup, since the namenode needs to combine the edit log and the current filesystem checkpoint image.
What happens if a datanode loses network connection for a few minutes?
The namenode will detect that a datanode is not responsive and will start replication of the data from the remaining replicas. When the datanode comes back online, the extra replicas will be deleted. The replication factor is actively maintained by the namenode. The namenode monitors the status of all datanodes and keeps track of which blocks are located on each node. The moment a datanode becomes unavailable, the namenode triggers replication of the data from the existing replicas. However, if the datanode comes back up, the over-replicated data will be deleted. Note: the data might be deleted from the original datanode.
What happens if one of the datanodes has a much slower CPU?
The task execution will be as fast as the slowest worker. However, if speculative execution is enabled, the slowest worker will not have such a big impact. Hadoop was specifically designed to work with commodity hardware, and speculative execution helps to offset the slow workers. Multiple instances of the same task will be created, the jobtracker will take the first result into consideration, and the second instance of the task will be killed.
What is speculative execution?
If speculative execution is enabled, the jobtracker will issue multiple instances of the same task on multiple nodes and it will take the result of the task that finishes first. The other instances of the task will be killed.
Speculative execution is used to offset the impact of slow workers in the cluster. The jobtracker creates multiple instances of the same task and takes the result of the first successful task. The rest of the tasks will be discarded.
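As an illustration, in Hadoop 1.x speculative execution is controlled per job type by the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution properties (both default to true). A sketch of overriding them for a single job, assuming the job uses ToolRunner; the jar, class and paths are hypothetical:

  hadoop jar myjob.jar com.example.MyJob \
    -Dmapred.map.tasks.speculative.execution=false \
    -Dmapred.reduce.tasks.speculative.execution=false \
    /input /output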
How many racks do you need to create a Hadoop cluster in order to make sure that the cluster operates reliably?
In order to ensure reliable operation it is recommended to have at least 2 racks with rack placement configured. Hadoop has a built-in rack awareness mechanism that allows data distribution between different racks based on the configuration.
Are there any special requirements for the namenode?
Yes, the namenode holds information about all files in the system and needs to be extra reliable. The namenode is a single point of failure. It needs to be extra reliable and the metadata needs to be replicated in multiple places. Note that the community is working on solving the single point of failure issue with the namenode.
If you have a file of 128MB and the replication factor is set to 3, how many blocks can you find on the cluster that correspond to that file (assuming the default Apache and Cloudera configuration)?
6
Based on the configuration settings the file will be divided into multiple blocks according to the default block size of 64MB: 128MB / 64MB = 2. Each block will be replicated according to the replication factor settings (default 3): 2 * 3 = 6.
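To verify this on a live cluster, the HDFS fsck utility lists the blocks and replica locations for a file; the path below is hypothetical:

  hadoop fsck /user/example/file.dat -files -blocks -locations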
What is distributed copy (distcp)?
Distcp is a Hadoop utility for launching MapReduce jobs to copy data. The primary usage is for copying a large amount of data. One of the major challenges in the Hadoop environment is copying data across multiple clusters, and distcp allows multiple datanodes to be leveraged for parallel copying of the data.
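A minimal sketch of an inter-cluster copy; the namenode hostnames, port and paths are hypothetical:

  hadoop distcp hdfs://nn1.example.com:8020/data/logs hdfs://nn2.example.com:8020/data/logs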
What is replication factor?
Replication factor controls how many copies of each individual block are kept in the cluster. Data is replicated in the Hadoop cluster based on the replication factor. A high replication factor guarantees data availability in the event of failure.
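For example, the cluster-wide default is set with the dfs.replication property in hdfs-site.xml, and the replication of an existing path can be changed from the command line (the path below is hypothetical):

  hadoop fs -setrep -w 3 /user/example/data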
What daemons run on master nodes?
NameNode, Secondary NameNode and JobTracker.
Hadoop is comprised of five separate daemons and each of these daemons runs in its own JVM. NameNode, Secondary NameNode and JobTracker run on the master nodes. DataNode and TaskTracker run on each slave node.
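One quick way to confirm which daemons are running on a node is the JDK's jps tool; on a master node the output would look roughly like this (the process ids are illustrative):

  $ jps
  2917 NameNode
  3045 SecondaryNameNode
  3201 JobTracker
  3422 Jps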
What is rack awareness?
Rack awareness is the way in which the namenode decides how to place blocks based on the rack definitions. Hadoop will try to minimize the network traffic between datanodes within the same rack and will only contact remote racks if it has to. The namenode is able to control this due to rack awareness.
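The rack definitions are typically supplied by a topology script referenced from the topology.script.file.name property (Hadoop 1.x). A minimal, hypothetical script that maps datanode IP addresses to rack ids could look like this:

  #!/bin/bash
  # Print a rack id for each address the namenode passes as an argument.
  for node in "$@"; do
    case $node in
      10.1.1.*) echo -n "/rack1 " ;;
      10.1.2.*) echo -n "/rack2 " ;;
      *)        echo -n "/default-rack " ;;
    esac
  done
  echo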
What is the role of the jobtracker in a Hadoop cluster?
The jobtracker is responsible for scheduling tasks on slave nodes, collecting results, and retrying failed tasks.
The jobtracker is the main component of MapReduce execution. It controls the division of the job into smaller tasks, submits tasks to individual tasktrackers, tracks the progress of the jobs and reports results back to the calling code.
How does the Hadoop cluster tolerate datanode failures?
Since Hadoop is designed to run on commodity hardware, datanode failures are expected. The namenode keeps track of all available datanodes and actively maintains the replication factor on all data.
The namenode actively tracks the status of all datanodes and acts immediately if a datanode becomes non-responsive. The namenode is the central "brain" of HDFS and starts replication of the data the moment a disconnect is detected.
What is the procedure for namenode recovery?
A namenode can be recovered in two ways: starting a new namenode from backup metadata, or promoting the secondary namenode to primary namenode.
The namenode recovery procedure is very important to ensure the reliability of the data. It can be accomplished by starting a new namenode using backup data or by promoting the secondary namenode to primary.
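As an illustration of the first option, a Hadoop 1.x namenode can be started from the latest checkpoint produced by the secondary namenode; this sketch assumes fs.checkpoint.dir on the new machine points at that checkpoint:

  hadoop namenode -importCheckpoint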
The Web UI shows that half of the datanodes are in decommissioning mode. What does that mean? Is it safe to remove those nodes from the network?
This means that the namenode is moving the replicas held by those datanodes onto the remaining datanodes. There is a possibility that data can be lost if the administrator removes those datanodes before decommissioning has finished.
Due to the replication strategy it is possible to lose some data if datanodes are removed en masse prior to completing the decommissioning process. Decommissioning refers to the namenode draining those datanodes by moving their replicas to the remaining datanodes.
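For reference, datanodes are typically decommissioned by adding their hostnames to the file referenced by the dfs.hosts.exclude property and asking the namenode to re-read it; the exclude file path is site-specific:

  # add the hostnames to the exclude file, e.g. conf/dfs.exclude, then run:
  hadoop dfsadmin -refreshNodes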
What does the Hadoop administrator have to do after adding new datanodes to the Hadoop cluster?
Since the new nodes will not have any data on them, the administrator needs to start the balancer to redistribute data evenly between all nodes.
The Hadoop cluster will detect new datanodes automatically. However, in order to optimize cluster performance it is recommended to start the balancer to redistribute the data between datanodes evenly.
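A sketch of running the balancer, here with a hypothetical utilization threshold of 5 percent:

  hadoop balancer -threshold 5
  # or, using the helper script shipped with Hadoop:
  start-balancer.sh -threshold 5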
If the Hadoop administrator needs to make a change, which configuration file does he need to change?
Each node in the Hadoop cluster has its own configuration files, and the changes need to be made in every file. One of the reasons for this is that the configuration can be different for every node.
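A minimal sketch of pushing an updated configuration directory to every node, assuming a hypothetical slaves.txt host list, a conf directory under /etc/hadoop, and passwordless SSH:

  for host in $(cat slaves.txt); do
    rsync -av /etc/hadoop/conf/ $host:/etc/hadoop/conf/
  done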
MapReduce jobs are failing on a cluster that was just restarted. They worked before the restart. What could be wrong?
The cluster is in safe mode. The administrator needs to wait for the namenode to exit safe mode before restarting the jobs.
This is a very common situation when there is no secondary namenode on the cluster and the cluster has not been restarted in a long time. The namenode will go into safemode and combine the edit log with the current filesystem checkpoint.
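The safe mode status can be checked, or waited on, from the command line:

  hadoop dfsadmin -safemode get
  hadoop dfsadmin -safemode wait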
MapReduce jobs take too long. What can be done to improve the performance of the cluster?
One of the most common reasons for performance problems on a Hadoop cluster is uneven distribution of the tasks. The number of tasks has to match the number of available slots on the cluster.
Hadoop is not a hardware-aware system. It is the responsibility of the developers and the administrators to make sure that the resource supply and demand match.
How often do you need to reformat the namenode?
Never. The namenode needs to be formatted only once, at the beginning. Reformatting the namenode will lead to loss of the data on the entire cluster.
The namenode is the only system that needs to be formatted, and only once. Formatting creates the directory structure for the filesystem metadata and creates the namespaceID for the entire filesystem.
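For completeness, the one-time format is run on the namenode host before HDFS is started for the first time, never on a cluster that already holds data:

  hadoop namenode -format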
After increasing the replication level, I still see that data is under-replicated. What could be wrong?
Data replication takes time due to the large quantities of data. The Hadoop administrator should allow sufficient time for data replication.
Depending on the data size, the replication will take some time. The Hadoop cluster still needs to copy data around, and if the data size is big enough it is not uncommon for replication to take from a few minutes to a few hours.
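Progress can be monitored with fsck, whose summary reports the number of under-replicated blocks across the filesystem; the exact wording of the summary line may vary between versions:

  hadoop fsck / | grep -i under-replicated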
