These are some of the common questions asked during a Big Data Hadoop interview. Be prepared with answers for the questions below, and have an example ready that shows how you have worked on each topic. Hadoop developers are expected to draw on and explain their past experience. All the best for a successful career as a Hadoop developer!
HDFS
1. Without touching the block size or input split, can we have a say on the number of mappers?
Ans: Create a custom InputFormat and override 'isSplitable()' to return false.
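As a rough illustration, here is a minimal sketch (new MapReduce API; the class name is made up for the example) that extends the default TextInputFormat so every file becomes exactly one split, and therefore exactly one mapper:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Custom input format: never split a file, so one mapper handles the whole file.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

A job would then select it with job.setInputFormatClass(WholeFileTextInputFormat.class).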
2. What is the difference between block size & input split?
Ans: A block is a physical division of the data, whereas an input split is a logical division.
3. To process one hundred files, each of size 100 MB, on HDFS with a default block size of 64 MB, how many mappers would be invoked?
Ans: Each file occupies 2 blocks (block 1 - 64 MB & block 2 - 36 MB), and hence 100 files produce 200 input splits, so 200 mappers would be invoked.
4. What is data locality optimization?
Ans: In Hadoop, execution is done as close to the data as possible. This can happen in 3 ways, of which the first is always preferred by the Jobtracker when scheduling tasks.
Same-node execution: the task is launched on the Tasktracker running on the Datanode where the block of data is stored.
Off-node execution: if no Tasktracker slot is available on the Datanode where the data block is located, the task runs on another node in the same rack and the block is read over the network from the node that stores it.
Off-rack execution: if no slot is free to run the task in the entire rack where the block of data is present, the block of data is moved across to a node in a different rack and executed there.
5. What is speculative execution?
Ans: If one of the tasks of a MapReduce job runs slowly, it pulls down the overall performance of the job. The Jobtracker therefore continuously monitors each task for progress (via heartbeat signals). If a task does not make enough progress within the given time interval, the Jobtracker speculates that the task is struggling and launches a duplicate task on a different replica of the same block. This concept is called speculative execution.
An important thing to note here is that the slow-running task is not killed immediately. Both tasks run simultaneously, and only when one of them completes is the remaining task killed.
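As a rough sketch (assuming the classic Hadoop 1.x JobConf API, to match the Jobtracker/Tasktracker terminology used here), speculative execution is enabled by default and can be toggled per job:

import org.apache.hadoop.mapred.JobConf;

public class SpeculationSettings {
    public static JobConf configure() {
        JobConf conf = new JobConf();
        // Speculative execution is on by default; turn it off for jobs whose
        // duplicate attempts would be wasteful (e.g. tasks with side effects).
        conf.setMapSpeculativeExecution(false);
        conf.setReduceSpeculativeExecution(false);
        return conf;
    }
}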
6. What are the different types of file permissions in HDFS?
Ans:
drwxrwxrwx user1 prog 10 Aug 16 15:02 myfolder
-rwxrwxrwx user1 prog 10 Aug 01 07:02 myfile.sas
Position 1: 'd' means folder, '-' means file
Positions 2-4: owner's permissions on the file/folder
Positions 5-7: group permissions on the file/folder
Positions 8-10: permissions for all other users on the file/folder
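For illustration, a minimal sketch of reading and changing these permission bits through the Java FileSystem API (the path is a made-up placeholder); the shell equivalent is hadoop fs -chmod 755 /user/demo/myfolder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class PermissionsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path folder = new Path("/user/demo/myfolder");

        // Print the current rwx string, e.g. rwxr-xr-x
        System.out.println(fs.getFileStatus(folder).getPermission());

        // Set owner = rwx, group = r-x, others = r-x
        fs.setPermission(folder, new FsPermission((short) 0755));
    }
}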
7. What is rack-awareness?
Ans: In HDFS, not all replicas of a single block are stored on the same rack. This concept is called rack-awareness: if all the replicas lived in one rack and that entire rack went down, there would be no way of recovering that block of data.
8. What are the different modes in which Hadoop can run? Where do we configure these modes?
Ans: Hadoop can be configured to run in one of the following modes:
a. Standalone or local mode (the default)
b. Pseudo-distributed mode
c. Fully distributed mode
These configuration settings are made in core-site.xml, mapred-site.xml and hdfs-site.xml.
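As a hedged sketch (assuming Hadoop 1.x property names), the key properties behind these modes live in the XML files named above, but they can also be inspected or set programmatically; with the defaults (fs.default.name = file:///, mapred.job.tracker = local) Hadoop runs in standalone mode, while the values below describe a pseudo-distributed setup:

import org.apache.hadoop.conf.Configuration;

public class PseudoDistributedSettings {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000");  // normally in core-site.xml
        conf.set("dfs.replication", "1");                      // normally in hdfs-site.xml
        conf.set("mapred.job.tracker", "localhost:9001");      // normally in mapred-site.xml
        return conf;
    }
}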
9. What are the available data types in Hadoop?
Ans: To support serialization/deserialization and to allow keys to be compared with one another, Hadoop provides its own data types.
The following types implement WritableComparable:
Primitives: BooleanWritable, ByteWritable, ShortWritable, IntWritable, VIntWritable, FloatWritable, LongWritable, VLongWritable, DoubleWritable.
Others: NullWritable, Text, BytesWritable, MD5Hash.
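To illustrate what implementing WritableComparable involves, here is a minimal sketch of a custom key type (the class name and fields are made up for the example); write()/readFields() provide the serialization and compareTo() provides the ordering that the built-in types above already supply:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class YearStationKey implements WritableComparable<YearStationKey> {
    private final IntWritable year = new IntWritable();
    private final Text station = new Text();

    public void set(int y, String s) { year.set(y); station.set(s); }

    @Override public void write(DataOutput out) throws IOException {
        year.write(out);        // serialize both fields in a fixed order
        station.write(out);
    }
    @Override public void readFields(DataInput in) throws IOException {
        year.readFields(in);    // deserialize in the same order
        station.readFields(in);
    }
    @Override public int compareTo(YearStationKey other) {
        int cmp = Integer.compare(year.get(), other.year.get());
        return cmp != 0 ? cmp : station.compareTo(other.station);
    }
    @Override public int hashCode() { return year.get() * 163 + station.hashCode(); }
    @Override public boolean equals(Object o) {
        return o instanceof YearStationKey && compareTo((YearStationKey) o) == 0;
    }
}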
10. Explain the command '-getmerge'.
Ans: hadoop fs -getmerge <source directory> <local destination file>
This option takes all the files in the source directory and merges them into a single file on the local filesystem.
11. Explain the anatomy of a file read in HDFS.
Ans:
1. The client opens the file (calls open() on DistributedFileSystem).
2. DFS calls the namenode to get the block locations.
3. DFS creates an FSDataInputStream and the client invokes read() on this object.
4. Using DFSDataInputStream (a subclass of FSDataInputStream), the read operation is performed on the datanodes where the file blocks are present, and the blocks are read in order. Once all the blocks have been read, the client calls close() on the FSDataInputStream.
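A minimal sketch of the client side of this read path (the HDFS URI is a made-up placeholder):

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode:9000/user/demo/input.txt";
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));                    // steps 1-3: open() returns an FSDataInputStream
            IOUtils.copyBytes(in, System.out, 4096, false); // step 4: blocks are read in order
        } finally {
            IOUtils.closeStream(in);                        // close() on the stream
        }
    }
}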
12. Explain the anatomy of a file write in HDFS.
Ans:
1. The client creates a file (calls create() on DFS).
2. DFS calls the namenode (NN) to create the file. The NN checks the client's access permissions and whether the file already exists; if it does, an IOException is thrown.
3. DFS returns an FSDataOutputStream to write data into. FSDataOutputStream has a subclass, DFSDataOutputStream, which handles communication with the NN and the datanodes (DN).
4. DFSDataOutputStream writes data in the form of packets (small units of data), and these packets are written to various DNs to form blocks of data. A pipeline is formed consisting of the list of DNs that a single block has to be replicated to.
5. When a block of data has been written to all DNs in the pipeline, acknowledgements come back from the DNs in the pipeline in reverse order.
6. When the client has finished writing the data, it calls close() on the stream.
7. The client waits for the acknowledgements and then contacts the namenode to signal that the file is complete.
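A minimal sketch of the client side of this write path (the HDFS URI is a made-up placeholder):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode:9000/user/demo/output.txt";
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataOutputStream out = fs.create(new Path(uri)); // steps 1-3: create() via the NN
        try {
            out.writeUTF("hello hdfs");                     // step 4: data is packetised and pipelined to the DNs
        } finally {
            out.close();                                    // steps 6-7: flush remaining packets, signal completion
        }
    }
}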
MAPREDUCE
1. What is Distributed Cache?
Ans: Distributed Cache is a mechanism by which 'side data' (extra read-only data needed by a MapReduce program) is distributed to every node in the cluster, so that tasks can read it locally when they run.
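As a rough sketch (classic DistributedCache API; the file path is a made-up placeholder), the side data is registered at job-submission time and each task node receives a local copy before its tasks start:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class SideDataExample {
    // Called at job-submission time: register the read-only side-data file.
    public static void register(Configuration conf) throws Exception {
        DistributedCache.addCacheFile(new URI("/user/demo/lookup.txt"), conf);
    }

    // Called inside a mapper/reducer: locate the locally cached copies.
    public static Path[] localCopies(Configuration conf) throws Exception {
        return DistributedCache.getLocalCacheFiles(conf);
    }
}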
2. What is the 'SequenceFile' format? Where do we use it?
Ans: A SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as an input/output format. It is also worth noting that, internally, the temporary outputs of maps are stored using SequenceFile.
SequenceFile provides Writer, Reader and Sorter classes for writing, reading and sorting respectively.
There are 3 different SequenceFile formats:
a. Uncompressed key/value records
b. Record-compressed key/value records - only the 'values' are compressed here.
c. Block-compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. The size of a 'block' is configurable.
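A minimal sketch of writing and then reading back a SequenceFile of IntWritable/Text pairs (the path is a made-up placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/numbers.seq");

        // Write a few binary key/value records.
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, IntWritable.class, Text.class);
        try {
            for (int i = 1; i <= 3; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        } finally {
            writer.close();
        }

        // Read the records back in order.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            IntWritable key = new IntWritable();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            reader.close();
        }
    }
}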
3. What are the different file input formats in MapReduce?
Ans: FileInputFormat is the base class for all implementations of InputFormat that use files as their data source. The subclasses of FileInputFormat are: CombineFileInputFormat, TextInputFormat (the default), KeyValueTextInputFormat, NLineInputFormat and SequenceFileInputFormat.
SequenceFileInputFormat has a few subclasses of its own: SequenceFileAsBinaryInputFormat, SequenceFileAsTextInputFormat and SequenceFileInputFilter.
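For illustration, a minimal sketch (assuming a Hadoop 2.x-style new API where KeyValueTextInputFormat lives under mapreduce.lib.input; the job name and path are made up) showing how a non-default input format is selected for a job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "kv-demo");
        job.setInputFormatClass(KeyValueTextInputFormat.class); // default would be TextInputFormat
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        // ... set mapper/reducer classes and the output path, then job.waitForCompletion(true)
    }
}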
4. What is the 'shuffle & sort' phase in MapReduce?
Ans: This phase occurs between the map and reduce phases. During this phase, all the keys emitted by the various mappers are collected, sorted, grouped and copied to the reducers.
5. How many instances of a 'jobtracker' run in a cluster?
Ans: Only one instance of the Jobtracker runs in a cluster.
6. Can two different mappers communicate with each other?
Ans: No, mappers and reducers run independently of each other.
7. How do you make sure that only one mapper processes your entire file?
Ans: Create a custom InputFormat and override 'isSplitable()' to return false, or (a cruder approach) set the block size to be greater than the size of the input file.
8. When will the reducer phase start in an MR program?
Ans: The reduce phase starts only after all mappers finish execution (the copying of map outputs to the reducers can begin earlier, but reduce() is not called until every map task has completed).
9. Explain the various phases of a MapReduce program.
Ans:
Mapper phase: each input split is processed by a map task, which converts the input records into intermediate key/value pairs.
Sort & shuffle phase: the intermediate pairs are partitioned, sorted by key and copied to the reducers, so that each reducer receives all the values for its keys grouped together.
Reducer phase: each reducer processes the grouped values for its keys and writes the final output to HDFS.
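A minimal word-count sketch (new MapReduce API) tying the three phases together: the mapper emits (word, 1) pairs, the framework shuffles and sorts them by key, and the reducer sums the counts for each word:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // mapper phase: emit (word, 1)
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();                 // reducer phase: aggregate per key
            }
            context.write(key, new IntWritable(sum));
        }
    }
}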
10. What is a 'task instance'?
Ans: A task instance is the child JVM process launched by the Tasktracker to run a task. Running tasks in a separate process ensures that a task failure does not take down the Tasktracker itself.
HBASE
1. What is HBase?
Ans: HBase is a column-oriented, open-source, multidimensional, distributed database. It runs on top of HDFS.
2. Why do we use HBase?
Ans: HBase provides random read and write access, and supports thousands of operations per second on large data sets.
3. List the main components of HBase.
Ans:
Zookeeper
Catalog Tables
Master
RegionServer
Region
4. How many operational commands are there in HBase?
Ans: There are five main operational commands in HBase:
1. Get
2. Put
3. Delete
4. Scan
5. Increment
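For illustration, a minimal sketch of Put, Get and Scan using the classic HTable client API (the 'users' table and 'info' column family are made-up names and are assumed to already exist); it pairs with the connection example in the next question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCommandsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");
        try {
            // Put: write a cell
            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Get: read the row back
            Result row = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    row.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

            // Scan: iterate over a range of rows
            ResultScanner scanner = table.getScanner(new Scan());
            for (Result r : scanner) {
                System.out.println(Bytes.toString(r.getRow()));
            }
            scanner.close();
        } finally {
            table.close();
        }
    }
}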
5. How do you open a connection in HBase?
Ans: If you are opening a connection with the help of the Java API, the following code provides the connection:
// Classic HBase client API (org.apache.hadoop.hbase.client)
Configuration myConf = HBaseConfiguration.create();
HTableInterface usersTable = new HTable(myConf, "users");
6. When should I use HBase?
Ans: HBase isn't suitable for every problem.
First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand or a few million rows, a traditional RDBMS might be a better choice, because all of your data might wind up on a single node (or two) while the rest of the cluster sits idle.
Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.). An application built against an RDBMS cannot be ported to HBase by simply changing a JDBC driver, for example. Consider a move from an RDBMS to HBase a complete redesign rather than a port.
Third, make sure you have enough hardware. Even HDFS doesn't do well with anything less than 5 datanodes (due to things such as HDFS block replication, which defaults to 3), plus a namenode.
7. How does HBase achieve random read/write?
Ans: HBase stores data in HFiles that are indexed (sorted) by their key.
Given a random key, the client can determine which region server to ask
for the row from. The region server can determine which region to retrieve
the row from, and then do a binary search through the region to access the
correct row. This is accomplished by having sufficient statistics to know the
number of blocks, block size, start key, and end key.
For example: A table may contain 10 TB of data. But, the table is
broken up into regions of size 4GB. Each region has a start/end key. The
client can get the list of regions for a table and determine which region has
the key it is looking for. Regions are broken up into blocks, so that the
region server can do a binary search through its blocks. Blocks are
essentially long lists of key, attribute, value and version. If you know the starting key for each block, you can determine which file to access and at what byte offset (block) to start reading in the binary search.