Big Data Hadoop Interview Questions
Big Data Hadoop Training

HDFS
1. Without touching block size & input split, can we have a say on the number
of mappers?
Ans: Yes. Create a custom InputFormat and override isSplitable() to return
false, so that each file produces exactly one split (and hence one mapper).
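A minimal sketch of such an InputFormat, assuming the new
(org.apache.hadoop.mapreduce) API; the class name is illustrative:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Each input file becomes exactly one split, and hence one mapper,
// no matter how many blocks the file spans.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}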
2. What is the difference between Block size & input split?
Ans: A block is a physical division of the data, whereas an input split is a
logical division.
3. To process one hundred files, each of size 100 MB, on HDFS whose
default block size is 64 MB, how many mappers would be invoked?
Ans: Each file occupies 2 blocks (block 1 - 64 MB, block 2 - 36 MB), so the 100
files span 200 blocks. With one input split per block by default, 200 mappers
would be invoked.
4. What is data locality optimization?
Ans: In Hadoop, computation is moved to the data rather than the other way
around. This can happen in 3 possible ways, of which the first is always
preferred by the Jobtracker:
Same-node execution: the task is launched on the Tasktracker of the Datanode where
the block of data is stored.
Off-node execution: if no Tasktracker slot is available on the Datanode where the
data block is located, the block is copied to the nearest Datanode in the same rack
and processed there.
Off-rack execution: if no slot is free on any Tasktracker in the entire rack where the
block of data resides, the block is copied across to a different rack and
processed there.
5. What is Speculative execution?
Ans: If one task of a MapReduce job runs slowly, it pulls down the overall
performance of the job. The Jobtracker therefore continuously monitors each task's
progress (via heartbeat signals). If a task does not make progress within the expected
time interval, the Jobtracker speculates that the task is struggling and launches a
duplicate task on another node holding a replica of the same block. This concept is
called Speculative execution.
The important thing to note here is that the slow-running task is not killed. Both
tasks run simultaneously; only when one of them completes is the remaining
task killed.
6. What are the different types of file permissions in HDFS?
Ans:
drwxrwxrwx user1 prog 10 Aug 16 15:02 myfolder
-rwxrwxrwx user1 prog 10 Aug 01 07:02 myfile.sas
Position 1: ‘d’ means folder, ‘-’ means file
Positions 2-4: owner's permissions on the file/folder
Positions 5-7: group permissions on the file/folder
Positions 8-10: permissions for all other users on the file/folder
7. What is Rack-awareness?
Ans: In HDFS, not all replicas of a block are stored in the same rack; this
placement policy is called Rack-awareness. If all the replicas lived in one
rack and that entire rack went down, there would be no way of recovering that
block of data.
8. What are the different modes in which Hadoop can run? Where do we
configure these modes?
Ans: Hadoop can be configured to run in one of the following modes:
a. Standalone or local mode (the default)
b. Pseudo-distributed mode
c. Fully distributed mode
These settings are made in core-site.xml, mapred-site.xml and hdfs-site.xml;
a sample configuration is shown below.
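For example, a minimal sketch of core-site.xml and mapred-site.xml for
pseudo-distributed mode on a Hadoop 1 cluster (host and port values are
illustrative):

<!-- core-site.xml: point the default filesystem at a local HDFS daemon -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- mapred-site.xml: point job submission at a local Jobtracker -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>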
9. What are the available data types in Hadoop?
Ans: To support serialization/deserialization, and to allow values to be
compared with one another, Hadoop defines its own datatypes.
The following types implement WritableComparable:
Primitives: BooleanWritable, ByteWritable, ShortWritable, IntWritable,
VIntWritable, FloatWritable, LongWritable, VLongWritable, DoubleWritable
Others: NullWritable, Text, BytesWritable, MD5Hash
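A small sketch of how these box types behave (the values are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        // Writables are mutable, serializable wrappers around Java types.
        IntWritable a = new IntWritable(10);
        IntWritable b = new IntWritable(42);
        System.out.println(a.compareTo(b)); // negative, because 10 < 42

        Text t = new Text("hadoop");
        t.set("hdfs"); // reusable: set() replaces the contents in place
        System.out.println(t); // prints "hdfs"
    }
}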
10. Explain the command '-getmerge'
Ans: hadoop fs -getmerge <source directory> <local destination file>
This option takes all the files in the given HDFS directory and merges them
into a single file on the local filesystem.
11. Explain the anatomy of a file read in HDFS
Ans:
1. The client opens the file (calls open() on DistributedFileSystem).
2. DFS calls the namenode to get the block locations.
3. DFS creates an FSDataInputStream and the client invokes read() on this object.
4. Using DFSDataInputStream (a subclass of FSDataInputStream), the read
operation is carried out against the datanodes holding the file's blocks. Blocks
are read in order. Once all the blocks have been read, the client calls close() on
the FSDataInputStream.
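The same flow from the client's point of view, as a minimal sketch (the path is
illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Resolves to DistributedFileSystem when the default FS is HDFS
        FileSystem fs = FileSystem.get(conf);
        // open() asks the namenode for the block locations
        FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"));
        try {
            // read() streams the blocks, in order, from the datanodes
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            in.close();
        }
    }
}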
12. Explain the anatomy of a file write in HDFS
Ans:
1. The client creates a file (calls create() on DFS).
2. DFS calls the namenode (NN) to create the file. The NN checks the client's
access permissions and whether the file already exists; if it does, an
IOException is thrown.
3. DFS returns an FSDataOutputStream to write data into.
FSDataOutputStream has a subclass, DFSDataOutputStream, which handles
communication with the NN and the datanodes (DNs).
4. DFSDataOutputStream writes the data as packets (small units of data),
which are sent to the various DNs to form blocks. A pipeline is formed
consisting of the list of DNs that each block has to be replicated to.
5. When a block has been written to all DNs in the pipeline, acknowledgements
come back from the DNs in the pipeline in reverse order.
6. When the client has finished writing the data, it calls close() on the stream.
7. The client waits for the acknowledgements before contacting the namenode to
signal that the file is complete.
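And the corresponding client-side sketch of the write path (path and contents
are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // overwrite=false: the namenode throws an IOException if the file exists
        FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt"), false);
        try {
            // data is buffered into packets and pushed down the datanode pipeline
            out.writeUTF("hello hdfs");
        } finally {
            out.close(); // flushes remaining packets and waits for the pipeline acks
        }
    }
}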
MAPREDUCE
1. What is Distributed Cache?
Ans: Distributed Cache is the mechanism by which 'side data' (extra read-only
data needed by a MapReduce program) is distributed to the nodes running the
job's tasks.
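A hedged sketch of the usual Hadoop 1 API for this (the file path is
illustrative):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

public class CacheSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ship an HDFS file to the local disk of every task node
        DistributedCache.addCacheFile(new URI("/user/demo/lookup.txt"), conf);
        // Inside a task, DistributedCache.getLocalCacheFiles(conf)
        // returns the local paths of the cached copies.
    }
}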
2. What is the 'SequenceFile' format? Where do we use it?
Ans: A SequenceFile is a flat file consisting of binary key/value pairs. It is
extensively used in MapReduce as an input/output format. It is also worth
noting that, internally, the temporary outputs of maps are stored in the
SequenceFile format.
SequenceFile provides Writer, Reader and Sorter classes for writing,
reading and sorting respectively.
There are 3 different SequenceFile formats:
a. Uncompressed key/value records
b. Record-compressed key/value records - only the values are compressed
here.
c. Block-compressed key/value records - both keys and values are
collected in 'blocks' separately and compressed. The size of the 'block' is
configurable.
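A minimal sketch of writing a SequenceFile with the Hadoop 1 API (path and
records are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/data.seq");
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, path, IntWritable.class, Text.class);
        try {
            writer.append(new IntWritable(1), new Text("first record"));
            writer.append(new IntWritable(2), new Text("second record"));
        } finally {
            writer.close();
        }
    }
}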
3. What are the different file input formats in MapReduce?
Ans: FileInputFormat is the base class for all implementations
of InputFormat that use files as their data source. The subclasses of
FileInputFormat include: CombineFileInputFormat, TextInputFormat (the default),
KeyValueTextInputFormat, NLineInputFormat and SequenceFileInputFormat.
SequenceFileInputFormat in turn has a few subclasses:
SequenceFileAsBinaryInputFormat, SequenceFileAsTextInputFormat and
SequenceFileInputFilter. A driver fragment showing how to select one is
sketched below.
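A hedged driver fragment (assuming a release that ships
KeyValueTextInputFormat in the new API; the path is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class DriverSketch {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "kv-demo");
        // Replace the default TextInputFormat: each line is split into
        // a key and a value at the first tab character.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        // ... set mapper/reducer classes and the output path, then
        // job.waitForCompletion(true);
    }
}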
4. What is the ‘shuffle & sort’ phase in MapReduce?
Ans: This phase occurs between the map and reduce phases. During this phase,
all the keys emitted by the various mappers are collected, sorted, grouped and
copied to the reducers.
5. How many instances of the 'jobtracker' run in a cluster?
Ans: Only one instance of the Jobtracker runs in a cluster.
6. Can two different Mappers communicate with each other?
Ans: No, Mappers/Reducers run independently of each other.
7. How do you make sure that only one mapper processes your entire
file?
Ans: Create a custom 'InputFormat' and override
isSplitable() to return false, as in the sketch under HDFS question 1. (Or,
a rather crude way: set the block size greater than the size of the input file.)
8. When will the reducer phase start in a MapReduce program?
Ans: The reducer phase starts only after all mappers finish execution.
9. Explain the various phases of a MapReduce program.
Ans:
Mapper phase: each input split is processed by a mapper, which emits
intermediate key/value pairs.
Sort & shuffle phase: the intermediate keys are sorted, grouped and copied
to the reducers.
Reducer phase: each reducer processes the grouped values for its keys and
writes the final output.
10. What is a 'Task instance'?
Ans: A Task instance is the child JVM process that the Tasktracker
launches to run a task. This ensures that a task failure does not
take down the Tasktracker itself.
HBASE
1. What is HBase?
Ans: HBase is a column-oriented, open-source, multidimensional,
distributed database. It runs on top of HDFS.
2. Why do we use HBase?
Ans: HBase provides random reads and writes, for when you need to do
thousands of operations per second on large data sets.
3. List the main components of HBase.
Ans:
Zookeeper
Catalog Tables
Master
RegionServer
Region
4. How many operational commands are there in HBase?
Ans: There are five main commands in HBase:
1. Get
2. Put
3. Delete
4. Scan
5. Increment
5. How do you open a connection in HBase?
Ans: If you open the connection through the Java API, the following code
provides the connection:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.HTableInterface;

Configuration myConf = HBaseConfiguration.create();
HTableInterface usersTable = new HTable(myConf, "users");
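Continuing that sketch, a short, hedged write against the table (column family,
qualifier and values are illustrative and must exist in the table's schema):

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

Put p = new Put(Bytes.toBytes("row1"));
// add(family, qualifier, value) in the pre-1.0 client API
p.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
usersTable.put(p);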
6. When should I use HBase?
Ans: HBase isn't suitable for every problem.
First, make sure you have enough data. If you have hundreds of
millions or billions of rows, then HBase is a good candidate. If you only
have a few thousand or a few million rows, a traditional RDBMS might
be a better choice, because all of your data might wind up on a
single node (or two) while the rest of the cluster sits idle.
Second, make sure you can live without all the extra features that
an RDBMS provides (e.g., typed columns, secondary indexes, transactions,
advanced query languages, etc.). An application built against an RDBMS
cannot be ported to HBase by simply changing a JDBC driver, for example.
Consider moving from an RDBMS to HBase a complete redesign, as
opposed to a port.
Third, make sure you have enough hardware. Even HDFS doesn't do
well with anything less than 5 DataNodes (due to things such as HDFS block
replication, which has a default of 3), plus a NameNode.
7. How does HBase achieve random read/write?
Ans: HBase stores data in HFiles that are indexed (sorted) by their key.
Given a random key, the client can determine which region server to ask
for the row. The region server can determine which region holds the
row, and then do a binary search through the region to reach the
correct row. This is possible because it keeps sufficient statistics: the
number of blocks, the block size, and each region's start and end keys.
For example: a table may contain 10 TB of data, but it is
broken up into regions of, say, 4 GB. Each region has a start/end key. The
client can get the list of regions for the table and determine which region
holds the key it is looking for. Regions are in turn broken up into blocks, so
the region server can do a binary search through its blocks. Blocks are
essentially long lists of key, attribute, value and version. Knowing the
starting key of each block, you can determine which file to access and at
which byte offset (block) to start reading to continue the binary search.
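Continuing the earlier 'users' table sketch, a random read looks like this (row
key, family and qualifier are illustrative):

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

Get g = new Get(Bytes.toBytes("row1"));
Result r = usersTable.get(g);
// Pull a single cell's latest value out of the result
byte[] name = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
System.out.println(Bytes.toString(name));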