A Study of Hadoop in Map-Reduce
Poumita Das
Shubharthi Dasgupta
Priyanka Das
What is Big Data??
Big data is an evolving term that describes any voluminous amount of
structured, semi-structured and unstructured data that has the
potential to be mined for information.
The 3 V’s
• Volume
• Velocity
• Variety
Why DFS?
Data at this scale cannot be stored or processed reliably on a single machine, so a
Distributed File System (DFS) splits files into blocks and spreads replicated copies
across a cluster of commodity machines.
An introduction to Map-Reduce
Map-Reduce programs are designed to process large volumes of data in a
parallel fashion. There are three steps (a minimal sketch in Java follows the list):
• Map
• Shuffle
• Reduce
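As a hedged illustration of the three steps (our own minimal word-count example, not the deck's later sample program), the Map and Reduce steps can be written as the two classes below; the Shuffle step is performed by the framework between them, grouping every value emitted under the same key:

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Map step: emit (word, 1) for every word in the input line.
  class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              context.write(word, ONE);   // handed to the Shuffle step by the framework
          }
      }
  }

  // Reduce step: the Shuffle has already grouped all counts for one word together.
  class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
              sum += v.get();
          }
          context.write(key, new IntWritable(sum));
      }
  }

A driver that wires such classes into a Job and submits it to the cluster is sketched with the sample program later in the deck.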
Map-Reduce continued
(Diagram: input records flow through Map, then Shuffle, then Reduce.)
What is Hadoop??
Apache Hadoop is a framework that
allows for the distributed processing
of large data sets across clusters of
commodity computers using a
simple programming model.
Hadoop core components
• NameNode
• DataNode
• Client
• User
• JobTracker
• TaskTracker
Namenode
The NameNode maintains the namespace tree and the mapping of blocks to
DataNodes. A cluster may contain hundreds or even thousands of DataNodes.
The Secondary NameNode periodically reads this metadata from the NameNode’s
RAM and writes a checkpoint to secondary storage. However, it is NOT a
substitute for the NameNode.
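As a hedged sketch of that block-to-DataNode mapping seen from a client (the class name and file path below are our own illustration), the HDFS Java API can ask the NameNode which DataNodes host each block of a file:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ListBlocks {
      public static void main(String[] args) throws Exception {
          // Connects to the NameNode named in the cluster configuration (core-site.xml).
          FileSystem fs = FileSystem.get(new Configuration());
          FileStatus status = fs.getFileStatus(new Path("/user/example/input.txt")); // example path
          // The NameNode answers with the block locations for this file.
          BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
          for (BlockLocation b : blocks) {
              System.out.println("offset " + b.getOffset() + " hosted on "
                      + String.join(", ", b.getHosts()));
          }
          fs.close();
      }
  }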
Datanode
On startup, a DataNode connects to the NameNode; spinning until that
service comes up. It then responds to requests from the NameNode for
filesystem operations.
Client applications can talk directly to a DataNode, once the NameNode has
provided the location of the data.
HDFS client
User applications access the filesystem using the HDFS client. The client mainly
performs three operations:
• Creating a new file
• File read
• File write
Creating a new file
To create a file, the client asks the NameNode to add it to the namespace tree;
no data blocks are allocated until the first write.
File read
HDFS implements a single-writer, multiple-reader model; that is, reading is a
parallel operation in Hadoop.
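A minimal sketch of a read through the HDFS client, assuming an example path of our own: open() consults the NameNode for block locations, and the returned stream then pulls the data straight from the DataNodes, so many clients can read the same file in parallel.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ReadFile {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          // open() asks the NameNode where the blocks live; the data itself
          // is streamed from the DataNodes that hold the replicas.
          try (FSDataInputStream in = fs.open(new Path("/user/example/input.txt"));
               BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
              String line;
              while ((line = reader.readLine()) != null) {
                  System.out.println(line);
              }
          }
          fs.close();
      }
  }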
File write
An HDFS file consists of blocks. When a new block is needed, the NameNode
allocates a block with a unique block ID and determines a list of DataNodes to
host replicas of the block.
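A minimal sketch of create-and-write through the same client API (the path and contents are our own illustration): create() asks the NameNode to add the file to the namespace, and as data is written the NameNode allocates block IDs and picks the DataNodes that will hold each block's replicas.

  import java.nio.charset.StandardCharsets;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class WriteFile {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          // Creating a new file: the NameNode records it in the namespace tree;
          // blocks are allocated only once data is written.
          try (FSDataOutputStream out = fs.create(new Path("/user/example/output.txt"))) {
              // File write: the bytes are pipelined to the DataNodes chosen by the
              // NameNode, one replica per DataNode, block by block.
              out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
          }
          fs.close();
      }
  }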
Job tracker and task tracker
The JobTracker accepts Map-Reduce jobs and schedules their map and reduce tasks
across the cluster; a TaskTracker on each worker node runs the tasks assigned to
it and reports progress back to the JobTracker.
Hadoop ecosystem
• Pig: a data-flow scripting language whose scripts compile into Map-Reduce jobs
• Hive: SQL-like querying over data stored in HDFS
• Mahout: a library of scalable machine-learning algorithms
A Sample Program
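The slide showed the program as an image, which is not reproduced here. What follows is a hedged reconstruction of a typical anagram-grouping job (class names, key choice, and paths are our own, not necessarily the authors'): the mapper keys each word by its alphabetically sorted letters so that anagrams meet at the same reducer, and the driver submits the job to the cluster.

  import java.io.IOException;
  import java.util.Arrays;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class AnagramJob {

      // Map: key each word by its letters sorted alphabetically,
      // so "listen" and "silent" both map to "eilnst".
      public static class AnagramMapper extends Mapper<LongWritable, Text, Text, Text> {
          private final Text sortedKey = new Text();
          private final Text word = new Text();

          @Override
          protected void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                  String w = itr.nextToken().toLowerCase();
                  char[] letters = w.toCharArray();
                  Arrays.sort(letters);
                  sortedKey.set(new String(letters));
                  word.set(w);
                  context.write(sortedKey, word);
              }
          }
      }

      // Reduce: after the shuffle, all words with the same sorted letters
      // arrive together; join them into one line of anagrams.
      public static class AnagramReducer extends Reducer<Text, Text, Text, Text> {
          @Override
          protected void reduce(Text key, Iterable<Text> values, Context context)
                  throws IOException, InterruptedException {
              StringBuilder group = new StringBuilder();
              for (Text v : values) {
                  if (group.length() > 0) group.append(", ");
                  group.append(v.toString());
              }
              context.write(key, new Text(group.toString()));
          }
      }

      // Driver: submitted to the cluster, where the JobTracker schedules
      // the map and reduce tasks on the TaskTrackers.
      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "anagram finder");
          job.setJarByClass(AnagramJob.class);
          job.setMapperClass(AnagramMapper.class);
          job.setReducerClass(AnagramReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(Text.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }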
The Output
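The actual output shown on this slide was an image. Purely as an illustration of the format such a job produces (not the authors' run), each output line pairs the sorted-letter key with the words that are anagrams of one another, for example:

  eilnst    enlist, listen, silent
  opst      opts, post, pots, spot, stop, tops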
Why Anagrams?
• Started out as a simple relaxation game, finding anagrams in
sentences
• Games and Puzzles like Scrabble
• Ciphers, like permutation and transposition ciphers
Future scope
Keeping in mind the vast range of applications of Hadoop, we have certain
graph-searching techniques in mind that would be much easier to solve with the
help of the Map-Reduce engine.
References
• Introduction to Hadoop: Welcome to Apache
  https://hadoop.apache.org/
• Cloudera Documentation: Usage
  http://www.cloudera.com/content/cloudera/en/documentation/hadoop-tutorial/CDH5/Hadoop-Tutorial/ht_usage.html
• Edureka: Anatomy of a Map-Reduce Job
  http://www.edureka.co/blog/anatomy-of-a-mapreduce-job-in-apache-hadoop/
• Stackoverflow: Explain Map-Reduce Simply
  http://stackoverflow.com/questions/28982/please-explain-mapreduce-simply
Thank you
