Vemula Ravi (Hadoop Developer and Admin)
Hadoop Interview Questions:
HDFS (Hadoop Distributed File System)
HDFS is still a relatively new name in the market, and many professionals are trying to step into this technology. To help candidates preparing for a career on the Hadoop Distributed File System (HDFS), here are some common questions and answers that are often asked in interviews:
· What are the essential features of HDFS? 
Hadoop Distributed File System (HDFS) is a highly fault-tolerant system architected to run on low-cost hardware. It provides high throughput for accessing application data and is suited for applications with large data sets. The processing logic runs in close proximity to the data, and HDFS is portable across heterogeneous operating systems and hardware. HDFS is highly scalable and well suited to processing large chunks of data.
· What is called streaming access in HDFS? 
Streaming access is one of the most important aspects of HDFS, as it relies on the principle of 'Write Once, Read Many'. HDFS focuses less on how the data is stored and more on finding the best possible means of retrieving records as quickly as possible.
· What is Namenode? 
The NameNode is the master node of HDFS (the JobTracker typically runs on the same master machine). It holds the file system metadata and manages the blocks present on the DataNodes.
· What is datanode? 
DataNodes, also referred to as slaves, are installed on the worker machines and provide the actual data storage. They also serve read/write requests from clients.
· What is heartbeat?
A heartbeat is a signal indicating that a node is alive: a DataNode sends heartbeats to the NameNode, and similarly a TaskTracker sends heartbeats to the JobTracker. If at any point the NameNode or JobTracker fails to receive a heartbeat, it indicates that there is some problem with the DataNode, or that the TaskTracker is unable to perform its tasks.
· Do the namenode and jobtracker belong to the same host?
It is possible to run all daemons on a single machine, but in a production environment the namenode and jobtracker run on different hosts.
· Explain the term ‘block’ in HDFS. 
It refers to the minimum amount of data that can be read or written. The default block size in Hadoop is 64 MB, compared with 8192 bytes in Unix/Linux. In HDFS, files are broken into block-sized chunks that are stored as independent units.
· If a file is of size 20 MB and the block size is 64 MB, will HDFS consume 64 MB like other file systems?
No. 64 MB is just the unit in which the 20 MB of data is stored. In this case only 20 MB is consumed; the remaining 44 MB is free and can be used to store something else.
· How to identify if the datanode is full? 
When data gets stored in a datanode, the metadata of the stored data is updated in the namenode. It is the namenode that identifies when a datanode is full.
· What are the different methods to access HDFS? 
There are various ways to access HDFS. For native access, HDFS offers a Java API for applications, and a C language wrapper is also available. In addition, an HTTP browser can be used to access and browse the files of an HDFS instance.
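As an illustration, a minimal Java sketch (the file path is hypothetical, and the configuration is assumed to come from the usual core-site.xml/hdfs-site.xml) that reads a file through the HDFS Java API might look like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);                // connects to the configured HDFS
    Path file = new Path("/user/ravi/sample.txt");       // hypothetical HDFS path
    BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(file)));
    String line;
    while ((line = reader.readLine()) != null) {         // stream the file line by line
      System.out.println(line);
    }
    reader.close();
    fs.close();
  }
}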
· Is there any way to recover a deleted file in HDFS? 
HDFS does not remove a file immediately after deletion. The file is first renamed and moved to the /trash directory, and it can be recovered from there as long as it remains in that folder. Once the trash interval for the deleted file expires, the NameNode removes the file from the HDFS namespace.
· What is the command to create a directory in HDFS via FS shell? 
bin/hadoop dfs -mkdir /<directory_name> 
· What is a snapshot and how it helps when NameNode fails? 
A snapshot stores a copy of the data at a particular point in time. If the NameNode machine fails, the snapshot can be rolled back to recover the last known good state.
MapReduce (Processing Huge Amounts of Data):
A good understanding of Hadoop architecture is required to understand and leverage the power of Hadoop. Here are a few important practical questions that can be asked of a senior, experienced Hadoop developer in an interview. This list primarily includes questions related to Hadoop architecture, MapReduce, the Hadoop API and the Hadoop Distributed File System (HDFS).
What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop 
Cluster? 
JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. There is only one JobTracker process per Hadoop cluster, and it runs in its own JVM process; in a typical production cluster it runs on a separate machine. Each slave node is configured with the JobTracker node location. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted. The JobTracker performs the following actions (from the Hadoop wiki):
· Client applications submit jobs to the Job tracker. 
· The JobTracker talks to the NameNode to determine the location of the data 
· The JobTracker locates TaskTracker nodes with available slots at or near the data 
· The JobTracker submits the work to the chosen TaskTracker nodes. 
· The TaskTracker nodes are monitored. If they do not submit heartbeat signals often 
enough, they are deemed to have failed and the work is scheduled on a different TaskTracker. 
· A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
· When the work is completed, the JobTracker updates its status. 
· Client applications can poll the JobTracker for information. 
How JobTracker schedules a task? 
The TaskTrackers send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data; if none is available, it looks for an empty slot on a machine in the same rack.
What is a Task Tracker in Hadoop? How many instances of TaskTracker run on a Hadoop 
Cluster 
A TaskTracker is a slave node daemon in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker. There is only one TaskTracker process per Hadoop slave node, and it runs in its own JVM process. Every TaskTracker is configured with a set of slots, which indicate the number of tasks that it can accept. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this is to ensure that a process failure does not take down the TaskTracker. The TaskTracker monitors these task instances, capturing the output and exit codes. When the task instances finish, successfully or not, the TaskTracker notifies the JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
What is a Task instance in Hadoop? Where does it run? 
Task instances are the actual MapReduce tasks that run on each slave node. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this is to ensure that a process failure does not take down the TaskTracker. Each Task Instance runs in its own JVM process, and there can be multiple Task Instance processes running on a slave node, based on the number of slots configured on the TaskTracker. By default a new Task Instance JVM process is spawned for each task.
How many Daemon processes run on a Hadoop system? 
Hadoop comprises five separate daemons, each of which runs in its own JVM. The following 3 daemons run on master nodes:
NameNode - This daemon stores and maintains the metadata for HDFS.
Secondary NameNode - Performs housekeeping functions for the NameNode.
JobTracker - Manages MapReduce jobs and distributes individual tasks to the machines running the TaskTracker.
The following 2 daemons run on each slave node:
DataNode - Stores the actual HDFS data blocks.
TaskTracker - Responsible for instantiating and monitoring individual Map and Reduce tasks.
What is configuration of a typical slave node on Hadoop cluster? How many JVMs run on 
a slave node? 
· A single TaskTracker instance runs on each slave node, as a separate JVM process.
· A single DataNode daemon runs on each slave node, as a separate JVM process.
· One or more Task Instances run on each slave node, each as a separate JVM process. The number of Task Instances can be controlled by configuration; typically a high-end machine is configured to run more task instances.
What is the difference between HDFS and NAS ? 
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences from other distributed file systems are significant. The following are differences between HDFS and NAS:
· In HDFS, data blocks are distributed across the local drives of all machines in a cluster, whereas in NAS data is stored on dedicated hardware.
· HDFS is designed to work with the MapReduce system, since computation is moved to the data. NAS is not suitable for MapReduce, since data is stored separately from the computation.
· HDFS runs on a cluster of machines and provides redundancy using a replication protocol, whereas NAS is provided by a single machine and therefore does not provide data redundancy.
How NameNode Handles data node failures? 
The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly, and a Blockreport contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a DataNode for a certain amount of time, the DataNode is marked as dead. Since its blocks are then under-replicated, the system begins replicating the blocks that were stored on the dead DataNode. The NameNode orchestrates the replication of data blocks from one DataNode to another; the replication data transfer happens directly between DataNodes, and the data never passes through the NameNode.
Does MapReduce programming model provide a way for reducers to communicate with 
each other? In a MapReduce job can a reducer communicate with another reducer? 
No, the MapReduce programming model does not allow reducers to communicate with each other. Reducers run in isolation.
Can I set the number of reducers to zero? 
Yes, setting the number of reducers to zero is a valid configuration in Hadoop. When you set the reducers to zero, no reducers are executed and the output of each mapper is stored in a separate file on HDFS. [This is different from the case when reducers are set to a number greater than zero, where the mappers' output (intermediate data) is written to the local file system (NOT HDFS) of each mapper's slave node.]
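With the new MapReduce API this is a single job setting; a minimal sketch, assuming a Job object has already been created:

job.setNumReduceTasks(0);   // map output is written straight to HDFS; no shuffle or reduce phase runs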
Where is the Mapper output (intermediate key-value data) stored?
The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node. This is typically a temporary directory location which can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.
What are combiners? When should I use a combiner in my MapReduce Job? 
Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally, on the individual mapper nodes, and can help reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of the combiner is not guaranteed: Hadoop may or may not execute a combiner, and if required it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's execution.
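For example, if the reduce logic just sums counts (commutative and associative), the reducer class itself can be registered as the combiner. A one-line sketch, where WordCountReducer is a hypothetical reducer class:

job.setCombinerClass(WordCountReducer.class);   // run the reducer logic locally on each mapper's output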
What is Writable & WritableComparable interface? 
· org.apache.hadoop.io.Writable is a Java interface. Any key or value type in the Hadoop 
Map-Reduce framework implements this interface. Implementations typically implement a static
read(DataInput) method which constructs a new instance, calls readFields(DataInput) and returns 
the instance. 
· org.apache.hadoop.io.WritableComparable is a Java interface. Any type which is to be 
used as a key in the Hadoop Map-Reduce framework should implement this interface. 
WritableComparable objects can be compared to each other using Comparators. 
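A minimal sketch of a custom key type implementing WritableComparable (the class and field names are illustrative):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class UserIdKey implements WritableComparable<UserIdKey> {
  private long userId;                         // example field

  public void write(DataOutput out) throws IOException {
    out.writeLong(userId);                     // serialize the field(s)
  }

  public void readFields(DataInput in) throws IOException {
    userId = in.readLong();                    // deserialize in the same order
  }

  public int compareTo(UserIdKey other) {
    return Long.compare(userId, other.userId); // ordering used during the sort phase
  }
}

In practice hashCode() should also be overridden so that the default HashPartitioner distributes such keys sensibly.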
What is the Hadoop MapReduce API contract for a key and value Class? 
· The Key must implement the org.apache.hadoop.io.WritableComparable interface. 
· The value must implement the org.apache.hadoop.io.Writable interface. 
What is an IdentityMapper and IdentityReducer in MapReduce?
· org.apache.hadoop.mapred.lib.IdentityMapper implements the identity function, mapping inputs directly to outputs. If the MapReduce programmer does not set the Mapper class using JobConf.setMapperClass, then IdentityMapper.class is used as the default value.
· org.apache.hadoop.mapred.lib.IdentityReducer performs no reduction, writing all input values directly to the output. If the MapReduce programmer does not set the Reducer class using JobConf.setReducerClass, then IdentityReducer.class is used as the default value.
What is the meaning of speculative execution in Hadoop? Why is it important? 
Speculative execution is a way of coping with uneven machine performance. In large clusters, where hundreds or thousands of machines are involved, some machines may not perform as fast as the others, and a whole job can be delayed because of a single poorly performing machine. To avoid this, speculative execution in Hadoop can run multiple copies of the same map or reduce task on different slave nodes; the results from the first node to finish are used.
When are the reducers started in a MapReduce job?
In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available, but the programmer-defined reduce method is called only after all the mappers have finished.
If reducers do not start before all mappers finish, then why does the progress on a MapReduce job show something like Map(50%) Reduce(10%)? Why is reducer progress displayed when the mappers are not finished yet?
Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The progress calculation also takes into account this data transfer, which is done by the reduce process, so reduce progress starts showing up as soon as any intermediate key-value pair from a mapper is available to be transferred to a reducer. Although the reducer progress is updated, the programmer-defined reduce method is called only after all the mappers have finished.
What is HDFS? How is it different from traditional file systems?
HDFS, the Hadoop Distributed File System, is responsible for storing huge amounts of data on the cluster.
This is a distributed file system designed to run on commodity hardware. It has many similarities 
with existing distributed file systems. However, the differences from other distributed file 
systems are significant. 
· HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. 
· HDFS provides high throughput access to application data and is suitable for applications 
that have large data sets. 
· HDFS is designed to support very large files. Applications that are compatible with 
HDFS are those that deal with large data sets. These applications write their data only once but 
they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS 
supports write-once-read-many semantics on files. 
What is HDFS Block size? How is it different from traditional file system block size? 
In HDFS, data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64 MB or 128 MB in size, and each block is replicated multiple times (the default is to replicate each block three times). Replicas are stored on different nodes. HDFS utilizes the local file system to store each HDFS block as a separate file. The HDFS block size cannot be compared with the traditional file system block size.
What is a NameNode? How many instances of NameNode run on a Hadoop Cluster? 
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files 
in the file system, and tracks where across the cluster the file data is kept. It does not store the 
data of these files itself. There is only one NameNode process per Hadoop cluster, and it runs in its own JVM process; in a typical production cluster it runs on a separate machine. The NameNode is a single point of failure for the HDFS cluster: when the NameNode goes down, the file system goes offline. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data lives.
What is a DataNode? How many instances of DataNode run on a Hadoop Cluster?
A DataNode stores data in the Hadoop Distributed File System (HDFS). There is only one DataNode process per Hadoop slave node, and it runs in its own JVM process. On startup, a DataNode connects to the NameNode. DataNode instances can talk to each other, mostly while replicating data.
How the Client communicates with HDFS? 
Clients communicate with HDFS using the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file on HDFS. The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data lives. Client applications can then talk directly to a DataNode, once the NameNode has provided the location of the data.
How the HDFS Blocks are replicated? 
HDFS is designed to reliably store very large files across machines in a large cluster. It stores 
each file as a sequence of blocks; all blocks in a file except the last block are the same size. The 
blocks of a file are replicated for fault tolerance. The block size and replication factor are 
configurable per file. An application can specify the number of replicas of a file. The replication 
factor can be specified at file creation time and can be changed later. Files in HDFS are write-once 
and have strictly one writer at any time. The NameNode makes all decisions regarding 
replication of blocks. HDFS uses a rack-aware replica placement policy. In the default configuration there are three copies of each data block: two copies are stored on datanodes in the same rack and the third copy on a datanode in a different rack.
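The replication factor of an existing file can also be changed from the FS shell, for example (the path is illustrative):

bin/hadoop fs -setrep -w 3 /user/ravi/data.txt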
Q1. Name the most common Input Formats defined in Hadoop? Which one is the default?
The most common Input Formats defined in Hadoop are:
- TextInputFormat
- KeyValueInputFormat
- SequenceFileInputFormat
TextInputFormat is the Hadoop default.
Q2. What is the difference between TextInputFormat and KeyValueInputFormat class? 
TextInputFormat: It reads lines of text files and provides the offset of the line as key to the 
Mapper and actual line as Value to the mapper. 
KeyValueInputFormat: Reads text files and parses lines into key, value pairs. Everything up to the first tab character is sent as the key to the Mapper and the remainder of the line is sent as the value to the Mapper.
Q3. What is InputSplit in Hadoop? 
When a Hadoop job is run, it splits input files into chunks and assigns each split to a mapper to process. Each such chunk is called an InputSplit.
Q4. How is the splitting of file invoked in Hadoop framework? 
It is invoked by the Hadoop framework by running the getSplits() method of the InputFormat class (such as FileInputFormat) configured for the job.
Q5. Consider this scenario: in a MapReduce system, the HDFS block size is 64 MB,
the input format is FileInputFormat,
and we have 3 files of size 64 KB, 65 MB and 127 MB.
How many input splits will be made by the Hadoop framework?
Hadoop will make 5 splits, as follows:
- 1 split for the 64 KB file
- 2 splits for the 65 MB file
- 2 splits for the 127 MB file
Q6. What is the purpose of RecordReader in Hadoop?
The InputSplit has defined a slice of work, but does not describe how to access it. The 
RecordReader class actually loads the data from its source and converts it into (key, value) pairs 
suitable for reading by the Mapper. The RecordReader instance is defined by the Input Format. 
Q7. After the Map phase finishes, the Hadoop framework does “Partitioning, Shuffle and 
sort”. Explain what happens in this phase? 
Partitioning: It is the process of determining which reducer instance will receive which 
intermediate keys and values. Each mapper must determine for all of its output (key, value) pairs 
which reducer will receive them. It is necessary that for any key, regardless of which mapper 
instance generated it, the destination partition is the same. 
Shuffle: After the first map tasks have completed, the nodes may still be performing several 
more map tasks each. But they also begin exchanging the intermediate outputs from the map 
tasks to where they are required by the reducers. This process of moving map outputs to the 
reducers is known as shuffling. 
Sort: Each reduce task is responsible for reducing the values associated with several 
intermediate keys. The set of intermediate keys on a single node is automatically sorted by 
Hadoop before they are presented to the Reducer. 
Q8. If no custom partitioner is defined in Hadoop then how is data partitioned before it is 
sent to the reducer? 
The default partitioner computes a hash value for the key and assigns the partition based on this 
result. 
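The default HashPartitioner essentially does the following (a simplified sketch of its getPartition logic):

public int getPartition(K key, V value, int numReduceTasks) {
  // mask off the sign bit so the result is non-negative, then take the modulus
  return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}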
Q9. What is a Combiner? 
The Combiner is a ‘mini-reduce’ process which operates only on data generated by a mapper. 
The Combiner will receive as input all data emitted by the Mapper instances on a given node. 
The output from the Combiner is then sent to the Reducers, instead of the output from the 
Mappers. 
Q10. What is JobTracker? 
JobTracker is the service within Hadoop that runs MapReduce jobs on the cluster.
Q11. What are some typical functions of Job Tracker? 
The following are some typical tasks of JobTracker:- 
- Accepts jobs from clients 
- It talks to the NameNode to determine the location of the data. 
- It locates TaskTracker nodes with available slots at or near the data. 
- It submits the work to the chosen TaskTracker nodes and monitors progress of each task by 
receiving heartbeat signals from Task tracker. 
Q12. What is TaskTracker? 
TaskTracker is a slave node daemon in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.
Q13. What is the relationship between Jobs and Tasks in Hadoop? 
One job is broken down into one or many tasks in Hadoop. 
Q14. Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?
It will restart the task on some other TaskTracker, and only if the task fails more than four times (the default setting, which can be changed) will it kill the job.
Q15. Hadoop achieves parallelism by dividing the tasks across many nodes, but it is possible for a few slow nodes to rate-limit the rest of the program and slow it down. What mechanism does Hadoop provide to combat this?
Speculative Execution. 
Q16. How does speculative execution work in Hadoop? 
The JobTracker makes different TaskTrackers process the same input. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy; if other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon those tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully first.
Q17. Using command line in Linux, how will you 
- See all jobs running in the Hadoop cluster 
- Kill a job? 
hadoop job -list
hadoop job -kill <job-id>
Q18. What is Hadoop Streaming? 
Streaming is a generic API that allows programs written in virtually any language to be used as 
Hadoop Mapper and Reducer implementations. 
Q19. What is the characteristic of the streaming API that makes it flexible enough to run MapReduce jobs in languages like Perl, Ruby, Awk etc.?
Hadoop Streaming allows the use of arbitrary programs for the Mapper and Reducer phases of a MapReduce job: both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.
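A typical streaming invocation looks like the sketch below; the input/output paths are illustrative and the exact location of the streaming jar varies by Hadoop version and installation:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input /user/ravi/input \
  -output /user/ravi/output \
  -mapper /bin/cat \
  -reducer /usr/bin/wc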
Q20. What is Distributed Cache in Hadoop? 
Distributed Cache is a facility provided by the MapReduce framework to cache files (text, 
archives, jars and so on) needed by applications during execution of the job. The framework will 
copy the necessary files to the slave node before any tasks for the job are executed on that node. 
Q21. What is the benefit of Distributed cache? Why can we just have the file in HDFS and 
have the application read it? 
This is because the distributed cache is much faster: the file is copied to every TaskTracker node once, at the start of the job, so if a TaskTracker runs 10 or 100 mappers or reducers they all use the same local copy. On the other hand, if the job reads the file from HDFS inside the MapReduce code, every mapper accesses it from HDFS, so a TaskTracker running 100 map tasks would read the file from HDFS 100 times. HDFS is also not very efficient when used in this way.
Q22. What mechanism does the Hadoop framework provide to synchronise changes made to the Distributed Cache during the runtime of the application?
This is a tricky question. There is no such mechanism. Distributed Cache by design is read only 
during the time of Job execution. 
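A sketch of using the old-API DistributedCache from a job driver and a mapper; the lookup file path is illustrative and must already exist in HDFS when the job is submitted:

// in the job driver, before submitting the job
JobConf jobConf = new JobConf();
DistributedCache.addCacheFile(new Path("/user/ravi/lookup.txt").toUri(), jobConf);

// inside the mapper, typically in configure(JobConf conf)
Path[] cachedFiles = DistributedCache.getLocalCacheFiles(conf);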
Q23. Have you ever used Counters in Hadoop? Give us an example scenario.
Anybody who claims to have worked on a Hadoop project is expected to have used counters.
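A common scenario is counting malformed records so they can be checked after the job completes. A minimal sketch inside a mapper's map() method (the enum and the isMalformed() helper are hypothetical):

// a counter group defined, for example, in the job class
enum RecordQuality { MALFORMED, WELL_FORMED }

// inside map(): bump the counter instead of failing on a bad record
if (isMalformed(line)) {
  context.getCounter(RecordQuality.MALFORMED).increment(1);
  return;
}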
Q24. Is it possible to provide multiple inputs to Hadoop? If yes, then how can you give multiple directories as input to the Hadoop job?
Yes, the input format classes provide methods to add multiple directories as input to a Hadoop job.
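For example, with the new API FileInputFormat.addInputPath can be called more than once, or MultipleInputs can bind different mappers to different paths; the paths and mapper classes below are illustrative:

// same mapper for both directories
FileInputFormat.addInputPath(job, new Path("/data/2013"));
FileInputFormat.addInputPath(job, new Path("/data/2014"));

// or a different mapper per directory
MultipleInputs.addInputPath(job, new Path("/data/logs"), TextInputFormat.class, LogMapper.class);
MultipleInputs.addInputPath(job, new Path("/data/users"), TextInputFormat.class, UserMapper.class);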
Q25. Is it possible to have Hadoop job output in multiple directories? If yes, how? 
Yes, by using the MultipleOutputs class.
Q26. What will a Hadoop job do if you try to run it with an output directory that is already 
present? Will it 
- Overwrite it 
- Warn you and continue 
- Throw an exception and exit 
The Hadoop job will throw an exception and exit. 
Q27. How can you set an arbitrary number of mappers to be created for a job in Hadoop? 
You cannot set it directly; the number of map tasks is determined by the number of input splits.
Q28. How can you set an arbitrary number of Reducers to be created for a job in Hadoop?
You can either do it programmatically by using the setNumReduceTasks method of the JobConf class, or set it as a configuration setting.
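A brief sketch of both approaches (the numbers and jar/driver names are illustrative; the -D form assumes the driver uses ToolRunner so generic options are parsed):

jobConf.setNumReduceTasks(10);                                    // programmatic, old mapred API

hadoop jar myjob.jar MyDriver -D mapred.reduce.tasks=10 in out    // as a configuration setting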
Q29. How will you write a custom partitioner for a Hadoop job? 
To have Hadoop use a custom partitioner you will have to do, at a minimum, the following three things:
- Create a new class that extends the Partitioner class
- Override the getPartition method
- In the wrapper that runs the MapReduce job, either add the custom partitioner to the job programmatically using the setPartitionerClass method, or add it to the job as a configuration setting (if your wrapper reads from a config file or Oozie).
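A minimal sketch of such a partitioner with the new API (the key/value types and the partitioning rule are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys to reducers based on their first character (an illustrative rule).
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    String k = key.toString();
    char first = k.isEmpty() ? ' ' : k.charAt(0);
    return first % numReduceTasks;             // a char promotes to a non-negative int
  }
}

It is wired into the job with job.setPartitionerClass(FirstLetterPartitioner.class).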
Q30. How did you debug your Hadoop code? 
There can be several ways of doing this, but the most common ones are:
- By using counters.
- By using the web interface provided by the Hadoop framework.
Q31. Did you ever build a production process in Hadoop? If yes, what was the process when your Hadoop job failed for any reason?
This is an open-ended question, but most candidates who have written a production job should talk about some type of alert mechanism, such as an email being sent or their monitoring system raising an alert. Since Hadoop works on unstructured data, it is very important to have a good alerting system for errors, because unexpected data can very easily break the job.
32Q: What is the Hadoop framework?
Ans: Hadoop is an open source framework written in Java by the Apache Software Foundation. The framework is used to write software applications that process vast amounts of data (it handles terabytes of data). It works in parallel on large clusters, which can have thousands of computers (nodes), and it processes data in a reliable and fault-tolerant manner.
33Q: On what concept does the Hadoop framework work?
Ans: It works on MapReduce, which was devised by Google.
34Q: What is MapReduce?
Ans: MapReduce is an algorithm or programming model for processing huge amounts of data in a faster way. As the name suggests, it is divided into Map and Reduce.
A MapReduce job usually splits the input data set into independent chunks (big data sets into multiple small data sets).
Map task: processes these chunks in a completely parallel manner (one node can process one or more chunks).
The framework sorts the outputs of the maps.
Reduce task: the sorted map output becomes the input for the reduce tasks, which produce the final results.
Your business logic is written in the map task and the reduce task.
Typically both the input and the output of the job are stored in the file system (not a database).
The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks.
35Q: What are compute and storage nodes?
Ans:
Compute node: the computer or machine where your actual business logic is executed.
Storage node: the computer or machine where the file system resides to store the data being processed.
In most cases the compute node and the storage node are the same machine.
36Q: How does the master-slave architecture work in Hadoop?
Ans: The MapReduce framework consists of a single master JobTracker and multiple slaves; each cluster node has one TaskTracker.
The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing failed tasks. The slaves execute the tasks as directed by the master.
37 Q: How does a Hadoop application look, or what are its basic components?
Ans: Minimally, a Hadoop application has the following components:
- Input location of the data
- Output location of the processed data
- A map task
- A reduce task
- Job configuration
The Hadoop job client then submits the job (jar/executable etc.) and the configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks, monitoring them, and providing status and diagnostic information to the job client.
38 Q: The MapReduce framework operates exclusively on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types. The flow is roughly:
(input) <k1, v1> -> map -> <k2, v2> -> combine/sort -> <k2, list(v2)> -> reduce -> <k3, v3> (output)
39 Q: What are the restrictions on the key and value classes?
Ans: The key and value classes have to be serialized by the framework. To make them serializable, Hadoop provides the Writable interface. As you know from Java itself, the key of a map should be comparable, hence the key has to implement one more interface, WritableComparable.
40 Q: Explain the WordCount implementation via the Hadoop framework?
Ans:
We will count the words in all the input files. The flow is as below:
Input
Assume there are two files, each containing one sentence:
Hello World Hello World (in file 1)
Hello World Hello World (in file 2)
Mapper: there is one mapper for each file.
For the given sample input, the first map outputs:
<hello,1> 
<world,1> 
<hello,1> 
<world,1> 
The second map output: 
<hello,1> 
<world,1> 
<hello,1> 
<world,1>
Combiner/Sorting (this is done for each individual map)
So the output looks like this:
The output of the first map:
<hello,2>
<world,2>
The output of the second map:
<hello,2>
<world,2>
Reducer:
It sums up the above outputs and generates the final output as below:
<hello,4>
<world,4>
Output
The final output would look like:
Hello 4 times
World 4 times
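A minimal sketch of the corresponding Mapper and Reducer with the new API (class names are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken().toLowerCase());   // emit <word, 1> for every token
        context.write(word, ONE);
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();                           // add up all the 1s for this word
      }
      context.write(word, new IntWritable(sum));      // emit <word, total>
    }
  }
}

The same SumReducer could also be registered as the combiner to produce the per-map <hello,2>, <world,2> aggregation shown above.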
41Q: Which interface needs to be implemented to create a Mapper and Reducer for Hadoop?
Ans:
org.apache.hadoop.mapreduce.Mapper
org.apache.hadoop.mapreduce.Reducer
42Q: What does the Mapper do?
Ans:
Maps are the individual tasks that transform input records into intermediate records. A given input pair may map to zero or many output pairs.
43 Q: What is an InputSplit in MapReduce?
An InputSplit is a logical representation of a unit (a chunk) of input work for a map task, e.g. a filename and a byte range within that file to process, or a row set in a text file.
44Q: What is the InputFormat?
Ans: The InputFormat is responsible for enumerating (itemizing) the InputSplits and producing a RecordReader which turns those logical work units into physical input records.
45 Q: Where do you specify the Mapper implementation?
Ans: Generally the Mapper implementation is specified in the job itself.
46 Q: How is the Mapper instantiated in a running job?
Ans: The Mapper itself is instantiated in the running job and is passed a Context object which it can use to configure itself.
47 Q: Which are the methods in the Mapper interface?
Ans: The Mapper contains the run() method, which calls its own setup() method once, then calls the map() method for each input key/value pair, and finally calls the cleanup() method. All of the above methods can be overridden in your code.
48 Q: What happens if you don't override any of these methods and keep them as they are?
Ans: If you don't override any methods (leaving even map() as-is), the Mapper acts as the identity function, emitting each input record as a separate output.
49 Q: What is the use of the Context object?
Ans: The Context object allows the Mapper to interact with the rest of the Hadoop system. It includes configuration data for the job, as well as interfaces which allow it to emit output.
50 Q: How can you add arbitrary key-value pairs in your Mapper?
Ans: You can set arbitrary (key, value) pairs of configuration data in your job, e.g. with Job.getConfiguration().set("myKey", "myVal"), and then retrieve this data in your Mapper with context.getConfiguration().get("myKey"). This kind of functionality is typically done in the Mapper's setup() method.
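A short sketch of this pattern ("myKey" and the value are illustrative):

// in the driver
job.getConfiguration().set("myKey", "myVal");

// in the Mapper
private String myVal;

protected void setup(Context context) {
  myVal = context.getConfiguration().get("myKey");   // read the value once per task
}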
51 Q: How does the Mapper's run() method work?
Ans: The Mapper.run() method calls map(KeyInType, ValInType, Context) for each key/value pair in the InputSplit for that task.
52 Q: Which object can be used to get the progress of a particular job?
Ans: Context.
53 Q: What is the next step for the Mapper or map task?
Ans:
The output of the Mapper is sorted and partitions are created for it. The number of partitions depends on the number of reducers.
54 Q: How can we control which keys go to a specific reducer?
Ans: Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.
55 Q: What is the use of the Combiner?
Ans: It is an optional component or class, and can be specified via Job.setCombinerClass(ClassName), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.
56Q: How to change the file split size in Hadoop?
Ans:
From Hadoop: The Definitive Guide, page 203: "The maximum split size defaults to the maximum value that can be represented by a Java long type. It has an effect only when it is less than the block size, forcing splits to be smaller than a block." The split size is calculated by the formula:
max(minimumSize, min(maximumSize, blockSize))
By default:
minimumSize < blockSize < maximumSize
so the split size is blockSize.
For example:
Minimum split size: 1
Maximum split size: 32 MB
Block size: 64 MB
Split size: 32 MB
Hadoop works better with a small number of large files than with a large number of small files. One reason for this is that FileInputFormat generates splits in such a way that each split is all or part of a single file. If the files are very small ("small" means significantly smaller than an HDFS block) and there are a lot of them, then each map task processes very little input and there will be a lot of map tasks (one per file), each of which imposes extra bookkeeping overhead. Compare a 1 GB file broken into sixteen 64 MB blocks with 10,000 or so 100 KB files: the 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent job with a single input file and 16 map tasks.
FileInputFormat is the base class for all file-based implementations of InputFormat. It provides two things: a place to define which files are included as the input to a job, and an implementation for generating splits for the input files.
How are input files split?
FileInputFormat splits only those files which are larger than an HDFS block. The split size can be controlled by various Hadoop properties; the relevant split-size properties are given in the table below:
Property name           Type   Default value                          Description
mapred.min.split.size   int    1                                      Smallest valid size in bytes for a file split
mapred.max.split.size   long   Long.MAX_VALUE (9223372036854775807)   Largest valid size in bytes for a file split
dfs.block.size          long   64 MB                                  The block size in HDFS
The minimum split size is usually 1 byte, although some formats have a lower bound on the split size. You may impose a minimum split size: by setting this to a value larger than the block size you can force splits to be larger than a block. This is not a good idea when using HDFS, because doing so increases the number of blocks that are not local to a map task. The maximum split size defaults to the maximum value that can be represented by a Java long type. It has an effect only when it is less than the block size, forcing splits to be smaller than a block. The split size is calculated by the formula (see the computeSplitSize() method in FileInputFormat):
max(minimumSize, min(maximumSize, blockSize))
and by default:
minimumSize < blockSize < maximumSize
so the split size is blockSize.
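For example, to influence the split size from a driver (the values are illustrative; the -D form uses the old-style property name):

-D mapred.min.split.size=134217728                                  // 128 MB minimum split size on the command line

FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);      // new-API helpers in the driver
FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);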
57Q: How to change the block size in Hadoop?
Ans: Suppose your map tasks are inefficient when parsing one particular set of files (2 TB in total) and you would like to change the block size of those files in HDFS (from 64 MB to 128 MB) without changing it for the whole cluster. You can set the block size when you upload the files (i.e. when copying from local to HDFS):
hadoop fs -D dfs.block.size=134217728 -put local_name remote_location
Note that the relevant setting is dfs.block.size, not fs.local.block.size.
You can also programmatically specify the block size when you create a file with the Hadoop API. You cannot otherwise do this per file on the command line with the hadoop fs -put command: to do it yourself, open a FileInputStream for the local file, create the remote OutputStream with FileSystem.create (which accepts a block size argument), and then use something like IOUtils.copy to copy between the two streams.
You can also modify the block size in your programs like this:
Configuration conf = new Configuration();
conf.setLong("dfs.block.size", 128 * 1024 * 1024);   // 128 MB for files created with this configuration
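A sketch of the programmatic route described above, copying a local file to HDFS with an explicit block size (the paths are illustrative, and Hadoop's own IOUtils is used for the copy):

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CopyWithBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    InputStream in = new FileInputStream("/tmp/local_name");          // local source file
    // create(path, overwrite, bufferSize, replication, blockSize)
    OutputStream out = fs.create(new Path("/user/ravi/remote_name"),
        true, 4096, (short) 3, 128L * 1024 * 1024);                   // 128 MB blocks, 3 replicas
    IOUtils.copyBytes(in, out, 4096, true);                           // true closes both streams
  }
}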
Vemula ravi (Hadoop Developer and admin)

More Related Content

What's hot

Hadoop Interview Question and Answers
Hadoop  Interview Question and AnswersHadoop  Interview Question and Answers
Hadoop Interview Question and Answerstechieguy85
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop DeveloperEdureka!
 
Overview of Hadoop and HDFS
Overview of Hadoop and HDFSOverview of Hadoop and HDFS
Overview of Hadoop and HDFSBrendan Tierney
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeAdam Kawa
 
Introduction to Hadoop part 2
Introduction to Hadoop part 2Introduction to Hadoop part 2
Introduction to Hadoop part 2Giovanna Roda
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoopShashwat Shriparv
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce FrameworkEdureka!
 
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability | Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability | Edureka!
 
Setting High Availability in Hadoop Cluster
Setting High Availability in Hadoop ClusterSetting High Availability in Hadoop Cluster
Setting High Availability in Hadoop ClusterEdureka!
 
Basics of big data analytics hadoop
Basics of big data analytics hadoopBasics of big data analytics hadoop
Basics of big data analytics hadoopAmbuj Kumar
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFSEdureka!
 
Hadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programHadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programPraveen Kumar Donta
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & HadoopEdureka!
 

What's hot (20)

Hadoop Interview Question and Answers
Hadoop  Interview Question and AnswersHadoop  Interview Question and Answers
Hadoop Interview Question and Answers
 
Hadoop architecture by ajay
Hadoop architecture by ajayHadoop architecture by ajay
Hadoop architecture by ajay
 
Next generation technology
Next generation technologyNext generation technology
Next generation technology
 
Upgrading hadoop
Upgrading hadoopUpgrading hadoop
Upgrading hadoop
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop Developer
 
Overview of Hadoop and HDFS
Overview of Hadoop and HDFSOverview of Hadoop and HDFS
Overview of Hadoop and HDFS
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And Practice
 
Introduction to Hadoop part 2
Introduction to Hadoop part 2Introduction to Hadoop part 2
Introduction to Hadoop part 2
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoop
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
 
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability | Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
 
Setting High Availability in Hadoop Cluster
Setting High Availability in Hadoop ClusterSetting High Availability in Hadoop Cluster
Setting High Availability in Hadoop Cluster
 
Basics of big data analytics hadoop
Basics of big data analytics hadoopBasics of big data analytics hadoop
Basics of big data analytics hadoop
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFS
 
Unit 1
Unit 1Unit 1
Unit 1
 
Hadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programHadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce program
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
Hadoop2.2
Hadoop2.2Hadoop2.2
Hadoop2.2
 

Similar to Hadoop interview quations1

Topic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptxTopic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptxDanishMahmood23
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoopveeracynixit
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoopveeracynixit
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
Hdfs, Map Reduce & hadoop 1.0 vs 2.0 overview
Hdfs, Map Reduce & hadoop 1.0 vs 2.0 overviewHdfs, Map Reduce & hadoop 1.0 vs 2.0 overview
Hdfs, Map Reduce & hadoop 1.0 vs 2.0 overviewNitesh Ghosh
 
Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersMindsMapped Consulting
 
Big data with HDFS and Mapreduce
Big data  with HDFS and MapreduceBig data  with HDFS and Mapreduce
Big data with HDFS and Mapreducesenthil0809
 
Apache hadoop overview
Apache hadoop overviewApache hadoop overview
Apache hadoop overviewDevi kala
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopMr. Ankit
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khanKamranKhan587
 
Design and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on RaspberryDesign and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on RaspberryIJRESJOURNAL
 
Introduction to hadoop ecosystem
Introduction to hadoop ecosystem Introduction to hadoop ecosystem
Introduction to hadoop ecosystem Rupak Roy
 
Top Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherTop Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherJanBask Training
 

Similar to Hadoop interview quations1 (20)

Hdfs questions answers
Hdfs questions answersHdfs questions answers
Hdfs questions answers
 
Hadoop
HadoopHadoop
Hadoop
 
Topic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptxTopic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptx
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Hdfs, Map Reduce & hadoop 1.0 vs 2.0 overview
Hdfs, Map Reduce & hadoop 1.0 vs 2.0 overviewHdfs, Map Reduce & hadoop 1.0 vs 2.0 overview
Hdfs, Map Reduce & hadoop 1.0 vs 2.0 overview
 
Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and Answers
 
Hadoop overview.pdf
Hadoop overview.pdfHadoop overview.pdf
Hadoop overview.pdf
 
Big data with HDFS and Mapreduce
Big data  with HDFS and MapreduceBig data  with HDFS and Mapreduce
Big data with HDFS and Mapreduce
 
Apache hadoop overview
Apache hadoop overviewApache hadoop overview
Apache hadoop overview
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khan
 
Design and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on RaspberryDesign and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on Raspberry
 
Unit 5
Unit  5Unit  5
Unit 5
 
Introduction to hadoop ecosystem
Introduction to hadoop ecosystem Introduction to hadoop ecosystem
Introduction to hadoop ecosystem
 
Top Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherTop Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for Fresher
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 
Cppt
CpptCppt
Cppt
 

Hadoop interview quations1

  • 1. Vemula ravi (Hadoop Developer and admin) Hadoop Interview Quations: HDFS(Hadoop Distributed File System) HDFS is relatively a new name in the market and many professionals are trying to step into the realm of this future technology. However, to facilitate candidates preparing for a career on Hadoop file system (HDFS), here are some common questions and solutions that are often asked in the interview board: · What are the essential features of HDFS? Hadoop Distributed File System (HDFS) is one high fault-tolerant system and is architected to run onlow-cost hardware. It has high throughput to access application data and is suited for applications with large data sets. The processing logic of HDFS lies in close proximity to the data and is portable across heterogeneous operating systems and hardware. HDFS is highly scalable and is one perfect solution to process large chunks of data. · What is called streaming access in HDFS? Streaming access is one of the most important parts HDFS, as it relies on the principle of ‘Write Once, Read Many’. HDFS does not focus much on the storage of data. Rather, it focuses more to find the best possible means to retrieve records in a faster way. · What is Namenode? It refers to the master node where the job trackers operate. The namenode contains metadata and it helps to manage the blocks present on the datanode. · What is datanode? Datanode is also referred as slaves and are installed on respective workstations to enable individual data storage. It also serves read/write request for the clients. · What is heartbeat?
  • 2. Vemula ravi (Hadoop Developer and admin) A hearbeat is referred to a signal to indicate the file-system is live. A datanode sends heartbeat to the namenode followed by task tracker to the job tracker. If at any point of time the namenode fails to receive the heartbeat, it will indicate there is some problem in the datanode or the tasktracker fails to perform the task. · Does namenode and jobtracker belong to the same host? It is possible to run all daemons on a single machine but in production environment namenode and jobtracker run on different hosts. · Explain the term ‘block’ in HDFS. It refers to the minimum data that can either be read or written. The default block size of Hadoop is 64 Mb as compared to 8192 Bytes in Unix/Linux. In HDFS, the files are broken into chunks of the size of a block and are stored in independent units. · If a file is of size 20 Mb and block size is 64 Mb. Will HDFS consume 64 Mb like other file systems? No. 64 Mb is referred to the unit where the 20 Mb data will be stored. Here in this case, when upon storing 20 Mb, 44 MB will be spared, which can be used to store anything else. · How to identify if the datanode is full? When the data gets stored in datanode, the metadata of the stored data gets updated in the namenode. It is the namenode which identifies if the datanode is full. · What are the different methods to access HDFS? There are various ways to access HDFS. In order to access natively, HDFS offers Java API for the application. There is also availability of C language wrapper. Moreover, HTTP browser can also be used to access and browse the files of HDFS instance. · Is there any way to recover a deleted file in HDFS? HDFS do not remove a file fully after deletion. The file is first renamed by HDFS and then stored in /trash directory. The file can be recovered from the said directory as long as it remains in the said folder. However, after the expiry of life of the deleted file, the NameNode deletes the file from HDFS namespace. · What is the command to create a directory in HDFS via FS shell? bin/hadoop dfs -mkdir /<directory_name> · What is a snapshot and how it helps when NameNode fails? A snapshot helps to store a data copy at one particular time. If in case, the NameNode machine fails, the snapshot can be rolled back to derive the last known good configuration.
  • 3. Vemula ravi (Hadoop Developer and admin) MapReduce(Processing Hug Amount data):
  • 4. Vemula ravi (Hadoop Developer and admin) A good understanding of Hadoop Architecture is required to understand and leverage the power of Hadoop. Here are few important practical questions which can be asked to a Senior Experienced Hadoop Developer in an interview. This list primarily includes questions related to Hadoop Architecture, MapReduce, Hadoop API and Hadoop Distributed File System (HDFS). What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop Cluster? JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. There is only One Job Tracker process run on any hadoop cluster. Job Tracker runs on its own JVM process. In a typical production cluster its run on a separate machine. Each slave node is configured with job tracker node location. The JobTracker is single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted. JobTracker in Hadoop performs following actions(from Hadoop Wiki:) · Client applications submit jobs to the Job tracker. · The JobTracker talks to the NameNode to determine the location of the data · The JobTracker locates TaskTracker nodes with available slots at or near the data · The JobTracker submits the work to the chosen TaskTracker nodes. · The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker. · A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may may even blacklist the TaskTracker as unreliable. · When the work is completed, the JobTracker updates its status. · Client applications can poll the JobTracker for information. How JobTracker schedules a task? The TaskTrackers send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that it is still alive. These message also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the
MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data and, if there is none, it looks for an empty slot on a machine in the same rack.

What is a TaskTracker in Hadoop? How many instances of TaskTracker run on a Hadoop cluster?
A TaskTracker is a slave-node daemon in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker. There is only one TaskTracker process running on any Hadoop slave node. The TaskTracker runs in its own JVM process. Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. The TaskTracker starts separate JVM processes to do the actual work (called Task Instances); this is to ensure that a process failure does not take down the TaskTracker. The TaskTracker monitors these task instances, capturing the output and exit codes. When the task instances finish, successfully or not, the TaskTracker notifies the JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.

What is a Task Instance in Hadoop? Where does it run?
Task instances are the actual MapReduce tasks which are run on each slave node. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this is to ensure that a process failure does not take down the TaskTracker. Each task instance runs in its own JVM process. There can be multiple task instance processes running on a slave node, based on the number of slots configured on the TaskTracker. By default, a new task instance JVM process is spawned for each task.

How many daemon processes run on a Hadoop system?
Hadoop is comprised of five separate daemons, each of which runs in its own JVM. The following 3 daemons run on master nodes:
NameNode - stores and maintains the metadata for HDFS.
Secondary NameNode - performs housekeeping functions for the NameNode.
JobTracker - manages MapReduce jobs and distributes individual tasks to machines running the TaskTracker.
The following 2 daemons run on each slave node:
DataNode - stores actual HDFS data blocks.
TaskTracker - responsible for instantiating and monitoring individual Map and Reduce tasks.

What is the configuration of a typical slave node on a Hadoop cluster? How many JVMs run on a slave node?
· A single instance of a TaskTracker runs on each slave node. The TaskTracker is run as a separate JVM process.
· A single instance of a DataNode daemon runs on each slave node. The DataNode daemon is run as a separate JVM process.
· One or multiple Task Instances run on each slave node. Each task instance is run as a separate JVM process. The number of task instances can be controlled by configuration; typically a high-end machine is configured to run more task instances.

What is the difference between HDFS and NAS?
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant. The following are differences between HDFS and NAS:
· In HDFS, data blocks are distributed across the local drives of all machines in a cluster, whereas in NAS data is stored on dedicated hardware.
· HDFS is designed to work with the MapReduce system, since computation is moved to the data. NAS is not suitable for MapReduce since data is stored separately from the computations.
· HDFS runs on a cluster of machines and provides redundancy using a replication protocol, whereas NAS is provided by a single machine and therefore does not provide data redundancy.

How does the NameNode handle DataNode failures?
The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a DataNode after a certain amount of time, the DataNode is marked as dead. Since its blocks will then be under-replicated, the system begins replicating the blocks that were stored on the dead DataNode. The NameNode orchestrates the replication of data
blocks from one DataNode to another. The replication data transfer happens directly between DataNodes and the data never passes through the NameNode.

Does the MapReduce programming model provide a way for reducers to communicate with each other? In a MapReduce job, can a reducer communicate with another reducer?
No, the MapReduce programming model does not allow reducers to communicate with each other. Reducers run in isolation.

Can I set the number of reducers to zero?
Yes, setting the number of reducers to zero is a valid configuration in Hadoop. When you set the number of reducers to zero, no reducers are executed and the output of each mapper is stored in a separate file on HDFS. [This is different from the case when the number of reducers is set to a number greater than zero, where the mappers' output (intermediate data) is written to the local file system (NOT HDFS) of each mapper slave node.] A map-only configuration is sketched below, after the combiner question.

Where is the mapper output (intermediate key-value data) stored?
The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node. This is typically a temporary directory location which can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.

What are combiners? When should I use a combiner in my MapReduce job?
Combiners are used to increase the efficiency of a MapReduce program. They are used to aggregate intermediate map output locally on individual mapper nodes. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of a combiner is not guaranteed: Hadoop may or may not execute a combiner, and if required it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's execution.
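Returning to the zero-reducer question above, here is a minimal configuration sketch of a map-only job using the old mapred API. IdentityMapper is used only as a stand-in for a real mapper, and the input/output paths are taken from the command line; this is an illustration, not a complete application.

// Sketch: a map-only job; with zero reducers the map output is written straight to HDFS.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapOnlyJob.class);
    conf.setJobName("map-only-example");
    conf.setMapperClass(IdentityMapper.class);   // stand-in for your real mapper
    conf.setNumReduceTasks(0);                   // zero reducers: no shuffle, no reduce phase
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}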
What are the Writable and WritableComparable interfaces?
· org.apache.hadoop.io.Writable is a Java interface. Any key or value type in the Hadoop Map-Reduce framework implements this interface. Implementations typically implement a static read(DataInput) method which constructs a new instance, calls readFields(DataInput) and returns the instance.
· org.apache.hadoop.io.WritableComparable is a Java interface. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface. WritableComparable objects can be compared to each other using Comparators.

What is the Hadoop MapReduce API contract for a key and value class?
· The key must implement the org.apache.hadoop.io.WritableComparable interface.
· The value must implement the org.apache.hadoop.io.Writable interface.
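A minimal sketch of a custom key type that satisfies the contract above; the class and field names are invented for illustration.

// Sketch: a custom key type usable in MapReduce because it implements WritableComparable.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearTempKey implements WritableComparable<YearTempKey> {
  private int year;
  private int temperature;

  public YearTempKey() { }                                   // no-arg constructor needed by the framework

  public void write(DataOutput out) throws IOException {     // serialization used when the key is shuffled
    out.writeInt(year);
    out.writeInt(temperature);
  }

  public void readFields(DataInput in) throws IOException {  // deserialization on the receiving side
    year = in.readInt();
    temperature = in.readInt();
  }

  public int compareTo(YearTempKey other) {                  // keys must be comparable for the sort phase
    int cmp = Integer.compare(year, other.year);
    return cmp != 0 ? cmp : Integer.compare(temperature, other.temperature);
  }
}

In practice such a key type would also override hashCode() and equals() so that the default hash partitioner distributes it sensibly.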
What are IdentityMapper and IdentityReducer in MapReduce?
· org.apache.hadoop.mapred.lib.IdentityMapper implements the identity function, mapping inputs directly to outputs. If the MapReduce programmer does not set the mapper class using JobConf.setMapperClass, then IdentityMapper.class is used as the default value.
· org.apache.hadoop.mapred.lib.IdentityReducer performs no reduction, writing all input values directly to the output. If the MapReduce programmer does not set the reducer class using JobConf.setReducerClass, then IdentityReducer.class is used as the default value.

What is the meaning of speculative execution in Hadoop? Why is it important?
Speculative execution is a way of coping with individual machine performance. In large clusters, where hundreds or thousands of machines are involved, there may be machines which are not performing as fast as others. This may result in delays in a full job due to only one machine not performing well. To avoid this, speculative execution in Hadoop can run multiple copies of the same map or reduce task on different slave nodes. The results from the first node to finish are used.

When are the reducers started in a MapReduce job?
In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The programmer-defined reduce method is called only after all the mappers have finished.

If reducers do not start before all mappers finish, then why does the progress on a MapReduce job show something like Map(50%) Reduce(10%)? Why is the reducers' progress percentage displayed when the mappers are not finished yet?
Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The progress calculation also takes into account the data transfer done by the reduce process, so the reduce progress starts showing up as soon as any intermediate key-value pair for a mapper is available to be transferred to a reducer. Though the reducer progress is updated, the programmer-defined reduce method is called only after all the mappers have finished.

What is HDFS? How is it different from traditional file systems?
HDFS, the Hadoop Distributed File System, is responsible for storing huge data on the cluster. It is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant.
· HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
· HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
· HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but read it one or more times, and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files.

What is the HDFS block size? How is it different from the traditional file system block size?
In HDFS, data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64 MB or 128 MB in size. Each block is replicated multiple times; the default is to replicate each block three times, with replicas stored on different nodes. HDFS utilizes the local file system to store each HDFS block as a separate file. The HDFS block size cannot be compared with the traditional file system block size.

What is a NameNode? How many instances of NameNode run on a Hadoop cluster?
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept; it does not store the data of these files itself. There is only one NameNode process running on any Hadoop cluster. The NameNode runs in its own JVM process and, in a typical production cluster, on a separate machine. The NameNode is a single point of failure for the HDFS cluster: when the NameNode goes down, the file system goes offline. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.

What is a DataNode? How many instances of DataNode run on a Hadoop cluster?
A DataNode stores data in the Hadoop File System (HDFS). There is only one DataNode process running on any Hadoop slave node. The DataNode runs in its own JVM process. On startup, a DataNode connects to the NameNode. DataNode instances can talk to each other, mostly while replicating data.

How does the client communicate with HDFS?
Client communication with HDFS happens through the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file on HDFS. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives. Client applications can talk directly to a DataNode once the NameNode has provided the location of the data.
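As a small illustration of the HDFS client API described in the answer above, the sketch below reads a file from HDFS; the path is hypothetical and the default Configuration is assumed to point at the cluster.

// Sketch: reading a file from HDFS. The NameNode is only consulted for metadata
// (inside fs.open); the actual bytes are streamed from the DataNodes.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream in = fs.open(new Path("/user/ravi/input/sample.txt")); // hypothetical path
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
    fs.close();
  }
}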
How are the HDFS blocks replicated?
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file; the replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding replication of blocks. HDFS uses a rack-aware replica placement policy. In the default configuration there are a total of 3 copies of a data block on HDFS: 2 copies are stored on DataNodes on one rack and the 3rd copy on a different rack.

Q1. Name the most common InputFormats defined in Hadoop. Which one is the default?
The most common InputFormats defined in Hadoop are:
- TextInputFormat
- KeyValueInputFormat
- SequenceFileInputFormat
TextInputFormat is the Hadoop default.

Q2. What is the difference between TextInputFormat and KeyValueInputFormat?
TextInputFormat: reads lines of text files and provides the offset of the line as the key to the Mapper and the actual line as the value to the Mapper.
KeyValueInputFormat: reads text files and parses lines into key, value pairs. Everything up to the first tab character is sent as the key to the Mapper and the remainder of the line is sent as the value to the Mapper.

Q3. What is an InputSplit in Hadoop?
When a Hadoop job runs, it splits input files into chunks and assigns each split to a mapper to process. Each such chunk is called an InputSplit.

Q4. How is the splitting of files invoked in the Hadoop framework?
It is invoked by the Hadoop framework by calling the getSplits() method of the InputFormat class (such as FileInputFormat) configured by the user.

Q5. Consider this case scenario: in an M/R system,
- the HDFS block size is 64 MB
- the input format is FileInputFormat
- we have 3 files of size 64 KB, 65 MB and 127 MB
How many input splits will be made by the Hadoop framework?
Hadoop will make 5 splits, as follows:
- 1 split for the 64 KB file
- 2 splits for the 65 MB file
- 2 splits for the 127 MB file

Q6. What is the purpose of RecordReader in Hadoop?
The InputSplit defines a slice of work but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.

Q7. After the map phase finishes, the Hadoop framework does “Partitioning, Shuffle and Sort”. Explain what happens in this phase.
Partitioning: the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine, for all of its output (key, value) pairs, which reducer will receive them. It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same.
Shuffle: after the first map tasks have completed, the nodes may still be performing several more map tasks each, but they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.
Sort: each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.

Q8. If no custom partitioner is defined in Hadoop, then how is data partitioned before it is sent to the reducer?
The default partitioner computes a hash value for the key and assigns the partition based on this result.

Q9. What is a Combiner?
The Combiner is a ‘mini-reduce’ process which operates only on data generated by a mapper. The Combiner receives as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.

Q10. What is JobTracker?
JobTracker is the service within Hadoop that runs MapReduce jobs on the cluster.
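Returning to Q8: Hadoop ships a HashPartitioner that implements this behaviour. The sketch below is an illustrative re-implementation of the same logic (not the shipped source), written against the old mapred API.

// Illustrative re-implementation of hash partitioning: the same key always lands on the same reducer.
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class MyHashPartitioner<K, V> implements Partitioner<K, V> {
  public void configure(JobConf job) { }   // nothing to configure

  public int getPartition(K key, V value, int numReduceTasks) {
    // Mask off the sign bit so the result is non-negative, then take the modulo of the
    // number of reducers; every mapper computes the same partition for a given key.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

Such a class would be registered on a job with conf.setPartitionerClass(MyHashPartitioner.class), which is also how the custom partitioner of Q29 below is wired in.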
Q11. What are some typical functions of the JobTracker?
The following are some typical tasks of the JobTracker:
- It accepts jobs from clients.
- It talks to the NameNode to determine the location of the data.
- It locates TaskTracker nodes with available slots at or near the data.
- It submits the work to the chosen TaskTracker nodes and monitors the progress of each task by receiving heartbeat signals from the TaskTrackers.

Q12. What is TaskTracker?
TaskTracker is a node in the cluster that accepts tasks - Map, Reduce and Shuffle operations - from a JobTracker.

Q13. What is the relationship between jobs and tasks in Hadoop?
One job is broken down into one or many tasks in Hadoop.

Q14. Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?
It will restart the task on some other TaskTracker, and only if the task fails more than four times (the default setting, which can be changed) will it kill the job.

Q15. Hadoop achieves parallelism by dividing the tasks across many nodes; it is possible for a few slow nodes to rate-limit the rest of the program and slow down the program. What mechanism does Hadoop provide to combat this?
Speculative execution.

Q16. How does speculative execution work in Hadoop?
The JobTracker makes different TaskTrackers process the same input. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to
abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully first.

Q17. Using the command line in Linux, how will you
- see all jobs running in the Hadoop cluster
- kill a job?
hadoop job -list
hadoop job -kill <jobID>

Q18. What is Hadoop Streaming?
Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations.

Q19. What is the characteristic of the Streaming API that makes it flexible enough to run MapReduce jobs in languages like Perl, Ruby, Awk, etc.?
Hadoop Streaming allows arbitrary programs to be used for the Mapper and Reducer phases of a MapReduce job by having both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.

Q20. What is the Distributed Cache in Hadoop?
The Distributed Cache is a facility provided by the MapReduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.

Q21. What is the benefit of the Distributed Cache? Why can't we just have the file in HDFS and have the application read it?
The Distributed Cache is much faster. It copies the file to all TaskTrackers at the start of the job. If a TaskTracker then runs 10 or 100 Mappers or Reducers, they all use the same local copy from the Distributed Cache. On the other hand, if you put code in a Mapper to read the file from HDFS in the MR job, then every Mapper will try to access it from HDFS; if a TaskTracker runs 100 map tasks, it will try to read this file 100 times from HDFS. Also, HDFS is not very efficient when used like this.
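A minimal sketch of using the Distributed Cache; the cached file path and symlink name are hypothetical, and only the driver-side registration is shown.

// Sketch: registering a file in the Distributed Cache from the driver.
// The framework copies it once to every slave node before the job's tasks run there.
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetup {
  public static void addLookupFile(JobConf conf) throws Exception {
    DistributedCache.addCacheFile(new URI("/user/ravi/lookup.txt#lookup"), conf); // file must already be in HDFS
    DistributedCache.createSymlink(conf);  // exposes it as ./lookup in each task's working directory
  }
}

Inside a task, the local copies can then be located with DistributedCache.getLocalCacheFiles(conf) or simply opened through the symlink.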
Q22. What mechanism does the Hadoop framework provide to synchronise changes made in the Distributed Cache during runtime of the application?
This is a tricky question. There is no such mechanism. The Distributed Cache is, by design, read-only during the time of job execution.

Q23. Have you ever used Counters in Hadoop? Give us an example scenario.
Anybody who claims to have worked on a Hadoop project is expected to have used counters (a typical scenario is sketched below).
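Since Q23 asks for an example scenario, here is a small sketch of one: counting malformed input lines with a custom counter. The counter group and name, the tab-separated record layout and the class name are all made up for illustration (old mapred API).

// Sketch: a mapper that counts malformed records with a job-wide counter.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ParseMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    String[] fields = value.toString().split("\t");
    if (fields.length < 2) {
      // Counter totals are aggregated across all tasks and shown in the job client output and web UI.
      reporter.incrCounter("ParseErrors", "MalformedLine", 1);
      return;  // skip the bad record
    }
    output.collect(new Text(fields[0]), new Text(fields[1]));
  }
}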
Q24. Is it possible to provide multiple inputs to Hadoop? If yes, then how can you give multiple directories as input to a Hadoop job?
Yes, the input format classes provide methods to add multiple directories as input to a Hadoop job (see the MultipleInputs sketch at the end of this set of answers).

Q25. Is it possible to have Hadoop job output in multiple directories? If yes, how?
Yes, by using the MultipleOutputs class.

Q26. What will a Hadoop job do if you try to run it with an output directory that is already present? Will it
- overwrite it
- warn you and continue
- throw an exception and exit?
The Hadoop job will throw an exception and exit.

Q27. How can you set an arbitrary number of mappers to be created for a job in Hadoop?
You cannot set it directly; the number of map tasks is determined by the number of input splits.

Q28. How can you set an arbitrary number of reducers to be created for a job in Hadoop?
You can either do it programmatically by using the setNumReduceTasks method of the JobConf class, or set it up as a configuration setting.

Q29. How will you write a custom partitioner for a Hadoop job?
To have Hadoop use a custom partitioner you will have to do, at minimum, the following three things:
- Create a new class that extends the Partitioner class
- Override the getPartition method
- In the wrapper that runs the MapReduce job, either
  - add the custom partitioner to the job programmatically using the setPartitionerClass method, or
  - add the custom partitioner to the job as a config file (if your wrapper reads from a config file or Oozie)
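Returning to Q24: a minimal sketch of wiring two input directories, each with its own input format and mapper, into one job using the old mapred API. The paths are hypothetical and IdentityMapper is used only as a stand-in for real mapper classes.

// Sketch: two input directories, each with its own InputFormat and Mapper, feeding the same job.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class MultiInputJob {
  public static void configureInputs(JobConf conf) {
    MultipleInputs.addInputPath(conf, new Path("/data/logs"),    // hypothetical directory
        TextInputFormat.class, IdentityMapper.class);
    MultipleInputs.addInputPath(conf, new Path("/data/lookup"),  // hypothetical directory
        KeyValueTextInputFormat.class, IdentityMapper.class);
  }
}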
Q30. How did you debug your Hadoop code?
There can be several ways of doing this, but the most common ways are:
- by using counters
- via the web interface provided by the Hadoop framework.

Q31. Did you ever build a production process in Hadoop? If yes, what was the process when your Hadoop job failed for any reason?
This is an open-ended question, but most candidates, if they have written a production job, should talk about some type of alert mechanism, e.g. an email is sent or their monitoring system raises an alert. Since Hadoop works on unstructured data, it is very important to have a good alerting system for errors, since unexpected data can very easily break the job.

32Q: What is the Hadoop framework?
Ans: Hadoop is an open-source framework written in Java by the Apache Software Foundation. The framework is used to write software applications that need to process vast amounts of data (it handles terabytes of data). It works in parallel on large clusters, which could have thousands of computers (nodes), and it processes data very reliably and in a fault-tolerant manner.

33Q: On what concept does the Hadoop framework work?
Ans: It works on MapReduce, which was devised by Google.

34Q: What is MapReduce?
Ans: MapReduce is an algorithm or concept for processing huge amounts of data in a faster way. As its name suggests, it can be divided into Map and Reduce. The main MapReduce job usually splits the input data set into independent chunks (big data sets into multiple small data sets).
Map task: processes these chunks in a completely parallel manner (one node can process one or more chunks). The framework sorts the outputs of the maps.
Reduce task: the above output becomes the input for the reduce tasks, which produce the final result. Your business logic is written in the map task and the reduce task. Typically both the input and the output of the job are stored in a file system (not a database). The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks.

35Q: What are compute and storage nodes?
Ans: Compute node: the computer or machine where your actual business logic is executed.
Storage node: the computer or machine where your file system resides to store the data being processed.
In most cases the compute node and the storage node are the same machine.

36Q: How does the master-slave architecture work in Hadoop?
Ans: The MapReduce framework consists of a single master JobTracker and multiple slaves; each cluster node has one TaskTracker. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.

37Q: What does a Hadoop application look like, or what are its basic components?
Ans: Minimally, a Hadoop application would have the following components (a minimal driver along these lines is sketched below, after 38Q):
 Input location of the data
 Output location of the processed data
 A map task
 A reduce task
 Job configuration
The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks, monitoring them, and providing status and diagnostic information to the job client.

38Q: The MapReduce framework operates exclusively on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types. See the flow mentioned below:
(input) <k1, v1> -> map -> <k2, v2> -> combine/sorting -> <k2, v2> -> reduce -> <k3, v3> (output)
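Tying together the components listed in 37Q, a minimal driver could look like the sketch below. The identity mapper/reducer and the command-line paths are placeholders for whatever a real application defines; this illustrates the wiring, not a production job.

// Sketch: a minimal driver wiring input/output locations, a map task, a reduce task
// and the job configuration, then submitting the job to the JobTracker.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class MinimalDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MinimalDriver.class);
    conf.setJobName("minimal-example");
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input location of the data
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output location of the processed data
    conf.setMapperClass(IdentityMapper.class);                // the map task (identity, for illustration)
    conf.setReducerClass(IdentityReducer.class);              // the reduce task (identity, for illustration)
    conf.setOutputKeyClass(LongWritable.class);               // key type must be WritableComparable (see 39Q)
    conf.setOutputValueClass(Text.class);                     // value type must be Writable (see 39Q)
    JobClient.runJob(conf);                                   // submits the job and polls until completion
  }
}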
39Q: What are the restrictions on the key and value classes?
Ans: The key and value classes have to be serialized by the framework. To make them serializable, Hadoop provides the Writable interface. As you know from Java itself, the key of the map should be comparable, hence the key has to implement one more interface, WritableComparable.

40Q: Explain the WordCount implementation via the Hadoop framework.
Ans: We will count the words in all the input files. The flow is as below:
 Input: assume there are two files, each containing one sentence:
Hello World Hello World (in file 1)
Hello World Hello World (in file 2)
 Mapper: there is one mapper for each file. For the given sample input, the first map outputs:
<hello,1> <world,1> <hello,1> <world,1>
The second map outputs:
<hello,1> <world,1> <hello,1> <world,1>
 Combiner/Sorting (this is done for each individual map), so the output looks like this:
The output of the first map:
<hello,2> <world,2>
The output of the second map:
<hello,2> <world,2>
 Reducer: it sums up the above output and generates the final output as below:
<hello,4> <world,4>
 Output: the final output would look like:
Hello 4 times
World 4 times

41Q: Which interface needs to be implemented to create a mapper and reducer for Hadoop?
Ans: org.apache.hadoop.mapreduce.Mapper and org.apache.hadoop.mapreduce.Reducer (in the new mapreduce API these are base classes to extend; the old org.apache.hadoop.mapred API defines Mapper and Reducer as interfaces).

42Q: What does a mapper do?
Ans: Maps are the individual tasks that transform input records into intermediate records. A given input pair may map to zero or many output pairs.
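A compact sketch of the WordCount mapper and reducer from the walkthrough above, written against the org.apache.hadoop.mapreduce classes named in 41Q; the class names are illustrative.

// Sketch: WordCount map and reduce tasks (new mapreduce API).
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountExample {

  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);            // emits <word, 1>, as in the walkthrough
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();                      // e.g. hello: 2 + 2 = 4
      }
      context.write(key, new IntWritable(sum));
    }
  }
}

The same SumReducer could also be set as the combiner, since summing is commutative and associative.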
43Q: What is an InputSplit in the MapReduce software?
Ans: An InputSplit is a logical representation of a unit (a chunk) of input work for a map task; e.g. a filename and a byte range within that file to process, or a row set in a text file.

44Q: What is the InputFormat?
Ans: The InputFormat is responsible for enumerating (itemizing) the InputSplits and producing a RecordReader which will turn those logical work units into physical input records.

45Q: Where do you specify the mapper implementation?
Ans: Generally the mapper implementation is specified in the job itself.

46Q: How is the mapper instantiated in a running job?
Ans: The Mapper itself is instantiated in the running job and is passed a MapContext object which it can use to configure itself.

47Q: Which are the methods in the Mapper interface?
Ans: The Mapper contains the run() method, which calls its setup() method once, then calls the map() method for each input record, and finally calls its cleanup() method. You can override all of these methods in your code.

48Q: What happens if you don't override any of these methods and keep them as they are?
Ans: If you don't override any methods (leaving even map() as-is), the mapper will act as the identity function, emitting each input record as a separate output.
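To make the life cycle in 47Q concrete, the sketch below paraphrases what the inherited run() method already does (it mirrors the base class logic and is shown only for illustration; overriding run() is rarely necessary).

// Sketch: the setup -> map-per-record -> cleanup call order of a map task.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LifecycleMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);                          // called once, before any records
    while (context.nextKeyValue()) {         // one iteration per record in this task's InputSplit
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);                        // called once, after the last record
  }
}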
49Q: What is the use of the Context object?
Ans: The Context object allows the mapper to interact with the rest of the Hadoop system. It includes configuration data for the job, as well as interfaces which allow it to emit output.

50Q: How can you add arbitrary key-value pairs in your mapper?
Ans: You can set arbitrary (key, value) pairs of configuration data in your job, e.g. with Job.getConfiguration().set("myKey", "myVal"), and then retrieve this data in your mapper with context.getConfiguration().get("myKey"). This kind of functionality is typically done in the Mapper's setup() method (see the sketch below).

51Q: How does the Mapper's run() method work?
Ans: The Mapper.run() method calls map(KeyInType, ValInType, Context) for each key/value pair in the InputSplit for that task.

52Q: Which object can be used to get the progress of a particular job?
Ans: The Context object.

53Q: What is the next step for the Mapper or map task?
Ans: The outputs of the Mapper are sorted, and partitions are created for the output. The number of partitions depends on the number of reducers.
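A small sketch of the pattern from 50Q: the driver stores a value in the job configuration and the mapper reads it back in setup(). The key, value and class names are made up for illustration.

// Sketch: passing an arbitrary (key, value) configuration pair from the driver to every mapper.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ConfigPassing {

  public static void driverSide(Job job) {
    job.getConfiguration().set("myKey", "myVal");         // stored in the job configuration
  }

  public static class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    private String myVal;

    @Override
    protected void setup(Context context) {
      myVal = context.getConfiguration().get("myKey");    // read back once per task
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(key, new Text(value + " " + myVal));  // hypothetical use of the configured value
    }
  }
}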
54Q: How can we control which keys go to which reducer?
Ans: Users can control which keys (and hence records) go to which reducer by implementing a custom Partitioner.

55Q: What is the use of a Combiner?
Ans: It is an optional component or class, and can be specified via Job.setCombinerClass(ClassName), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.

56Q: How do you change the file split size in Hadoop?
Ans: According to "Hadoop: The Definitive Guide" (page 203): "The maximum split size defaults to the maximum value that can be represented by a Java long type. It has an effect only when it is less than the block size, forcing splits to be smaller than a block. The split size is calculated by the formula:
max(minimumSize, min(maximumSize, blockSize))
By default minimumSize < blockSize < maximumSize, so the split size is blockSize."
For example: minimum split size 1, maximum split size 32 MB, block size 64 MB gives a split size of 32 MB.
Hadoop works better with a small number of large files than with a large number of small files. One reason for this is that FileInputFormat generates splits in such a way that each split is all or part of a
single file. If the files are very small ("small" means significantly smaller than an HDFS block) and there are a lot of them, then each map task will process very little input and there will be a lot of map tasks (one per file), each of which imposes extra bookkeeping overhead. Compare a 1 GB file broken into sixteen 64 MB blocks with 10,000 or so 100 KB files: the 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file and 16 map tasks.
FileInputFormat is the base class for the file-based implementations of InputFormat. It provides two things: a place to define which files are included as the input to a job, and an implementation for generating splits for the input files.
How are input files split? FileInputFormat splits only those files which are larger than an HDFS block. The split size can be controlled by various Hadoop properties, given in the table below.

Property name          Type   Default value                          Description
mapred.min.split.size  int    1                                      Smallest valid size in bytes for a file split
mapred.max.split.size  long   Long.MAX_VALUE (9223372036854775807)   Largest valid size in bytes for a file split
dfs.block.size         long   64 MB                                  The block size in HDFS

The minimum split size is usually 1 byte, although some formats have a lower bound on the split size. Applications may impose a minimum split size: by setting this to a value larger than the block size, they can force splits to be larger than a block. This is not a good idea when using HDFS, because doing so increases the number of blocks that are not local to a map task. The maximum split size defaults to the maximum value that can be represented by a Java long type and has an effect only when it is less than the block size, forcing splits to be smaller than a block. The split size is calculated by the formula (see the computeSplitSize() method in FileInputFormat):
max(minimumSize, min(maximumSize, blockSize))
and by default minimumSize < blockSize < maximumSize, so the split size is blockSize.
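To make the properties in the table concrete, here is a hedged sketch of capping the split size for a job; the 32 MB value is only an example and mirrors the worked example above.

// Sketch: forcing splits smaller than a 64 MB block by lowering the maximum split size.
import org.apache.hadoop.mapred.JobConf;

public class SplitSizeConfig {
  public static void configure(JobConf conf) {
    // split size = max(minimumSize, min(maximumSize, blockSize))
    conf.setLong("mapred.min.split.size", 1L);                  // default lower bound
    conf.setLong("mapred.max.split.size", 32L * 1024 * 1024);   // 32 MB cap -> split size becomes 32 MB
  }
}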
57Q: How do you change the block size in Hadoop?
Ans: Suppose your map tasks are inefficient when parsing one particular set of files (2 TB in total) and you would like to change the block size of those files in HDFS (from 64 MB to 128 MB) without changing it for the entire cluster. The block size can be overridden per upload on the command line when you copy from local to HDFS (note that the relevant property is dfs.block.size, not fs.local.block.size):
hadoop fs -D dfs.block.size=134217728 -put local_name remote_location
You can also programmatically specify the block size when you create a file with the Hadoop API. To fully control file creation you can write your own code to copy the local file to a remote location: open a FileInputStream for the local file, create the remote OutputStream with FileSystem.create, and then use something like IOUtils.copy from Apache Commons IO to copy between the two streams.
You can also modify the block size in your programs like this:
Configuration conf = new Configuration();
conf.setLong("dfs.block.size", 128 * 1024 * 1024); // 134217728 bytes = 128 MB
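As a sketch of the programmatic route described above (the paths, buffer size and replication factor are illustrative, and Hadoop's own IOUtils is used here instead of Commons IO):

// Sketch: copy a local file into HDFS with a per-file 128 MB block size.
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class PutWithBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    long blockSize = 128L * 1024 * 1024;                        // 128 MB for this file only
    InputStream in = new FileInputStream("local_name");         // local source path from the example
    OutputStream out = fs.create(new Path("remote_location"),   // HDFS destination path from the example
        true,        // overwrite if it already exists
        4096,        // I/O buffer size
        (short) 3,   // replication factor
        blockSize);  // per-file block size
    IOUtils.copyBytes(in, out, 4096, true);                     // copies and closes both streams
  }
}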