A Study of Hadoop in Map-Reduce
Poumita Das
Shubharthi Dasgupta
Priyanka Das
What is Big Data??
Big data is an evolving term that describes any voluminous amount of
structured, semi-structured and unstructured data that has the
potential to be mined for information.
The 3 V’s
• Volume
• Velocity
• Variety
Why DFS?
Data at this scale cannot be stored or processed reliably on a single machine, so a
Distributed File System (DFS) splits files into blocks and spreads replicated copies
across a cluster of commodity machines.
An introduction to Map-Reduce
Map-Reduce programs are designed to process large volumes of data in a
parallel fashion. There are three steps (a minimal sketch in Java follows the list):
• Map
• Shuffle
• Reduce
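As a hedged illustration of the three steps (our own minimal word-count example, not the deck's later sample program), the Map and Reduce steps can be written as the two classes below; the Shuffle step is performed by the framework between them, grouping every value emitted under the same key:

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Map step: emit (word, 1) for every word in the input line.
  class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              context.write(word, ONE);   // handed to the Shuffle step by the framework
          }
      }
  }

  // Reduce step: the Shuffle has already grouped all counts for one word together.
  class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
              sum += v.get();
          }
          context.write(key, new IntWritable(sum));
      }
  }

A driver that wires such classes into a Job and submits it to the cluster is sketched with the sample program later in the deck.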
Map-Reduce continued
(Diagram: input records flow through Map, then Shuffle, then Reduce.)
What is Hadoop??
Apache Hadoop is a framework that
allows for the distributed processing
of large data sets across clusters of
commodity computers using a
simple programming model.
Hadoop core components
• NameNode
• DataNode
• Client
• User
• JobTracker
• TaskTracker
Namenode
The NameNode maintains the namespace tree and the mapping of blocks to
DataNodes. A cluster may contain hundreds or even thousands of DataNodes.
The Secondary NameNode periodically reads this metadata from the NameNode’s
RAM and writes a checkpoint to secondary storage. However, it is NOT a
substitute for the NameNode.
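As a hedged sketch of that block-to-DataNode mapping seen from a client (the class name and file path below are our own illustration), the HDFS Java API can ask the NameNode which DataNodes host each block of a file:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ListBlocks {
      public static void main(String[] args) throws Exception {
          // Connects to the NameNode named in the cluster configuration (core-site.xml).
          FileSystem fs = FileSystem.get(new Configuration());
          FileStatus status = fs.getFileStatus(new Path("/user/example/input.txt")); // example path
          // The NameNode answers with the block locations for this file.
          BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
          for (BlockLocation b : blocks) {
              System.out.println("offset " + b.getOffset() + " hosted on "
                      + String.join(", ", b.getHosts()));
          }
          fs.close();
      }
  }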
Datanode
On startup, a DataNode connects to the NameNode; spinning until that
service comes up. It then responds to requests from the NameNode for
filesystem operations.
Client applications can talk directly to a DataNode, once the NameNode has
provided the location of the data.
HDFS client
User applications access the filesystem using the HDFS client. The client mainly
performs three operations:
• Creating a new file
• File read
• File write
Creating a new file
To create a file, the client asks the NameNode to add it to the namespace tree;
no data blocks are allocated until the first write.
File read
HDFS implements a single-writer, multiple-reader model; that is, reading is a
parallel operation in Hadoop.
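A minimal sketch of a read through the HDFS client, assuming an example path of our own: open() consults the NameNode for block locations, and the returned stream then pulls the data straight from the DataNodes, so many clients can read the same file in parallel.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ReadFile {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          // open() asks the NameNode where the blocks live; the data itself
          // is streamed from the DataNodes that hold the replicas.
          try (FSDataInputStream in = fs.open(new Path("/user/example/input.txt"));
               BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
              String line;
              while ((line = reader.readLine()) != null) {
                  System.out.println(line);
              }
          }
          fs.close();
      }
  }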
File write
An HDFS file consists of blocks. When a new block is needed, the NameNode
allocates a block with a unique block ID and determines a list of DataNodes to
host replicas of the block.
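A minimal sketch of create-and-write through the same client API (the path and contents are our own illustration): create() asks the NameNode to add the file to the namespace, and as data is written the NameNode allocates block IDs and picks the DataNodes that will hold each block's replicas.

  import java.nio.charset.StandardCharsets;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class WriteFile {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          // Creating a new file: the NameNode records it in the namespace tree;
          // blocks are allocated only once data is written.
          try (FSDataOutputStream out = fs.create(new Path("/user/example/output.txt"))) {
              // File write: the bytes are pipelined to the DataNodes chosen by the
              // NameNode, one replica per DataNode, block by block.
              out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
          }
          fs.close();
      }
  }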
Job tracker and task tracker
The JobTracker accepts Map-Reduce jobs and schedules their map and reduce tasks
across the cluster; a TaskTracker on each worker node runs the tasks assigned to
it and reports progress back to the JobTracker.
Hadoop ecosystem
• Pig: a data-flow scripting language whose scripts compile into Map-Reduce jobs
• Hive: SQL-like querying over data stored in HDFS
• Mahout: a library of scalable machine-learning algorithms
A Sample Program
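The slide showed the program as an image, which is not reproduced here. What follows is a hedged reconstruction of a typical anagram-grouping job (class names, key choice, and paths are our own, not necessarily the authors'): the mapper keys each word by its alphabetically sorted letters so that anagrams meet at the same reducer, and the driver submits the job to the cluster.

  import java.io.IOException;
  import java.util.Arrays;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class AnagramJob {

      // Map: key each word by its letters sorted alphabetically,
      // so "listen" and "silent" both map to "eilnst".
      public static class AnagramMapper extends Mapper<LongWritable, Text, Text, Text> {
          private final Text sortedKey = new Text();
          private final Text word = new Text();

          @Override
          protected void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                  String w = itr.nextToken().toLowerCase();
                  char[] letters = w.toCharArray();
                  Arrays.sort(letters);
                  sortedKey.set(new String(letters));
                  word.set(w);
                  context.write(sortedKey, word);
              }
          }
      }

      // Reduce: after the shuffle, all words with the same sorted letters
      // arrive together; join them into one line of anagrams.
      public static class AnagramReducer extends Reducer<Text, Text, Text, Text> {
          @Override
          protected void reduce(Text key, Iterable<Text> values, Context context)
                  throws IOException, InterruptedException {
              StringBuilder group = new StringBuilder();
              for (Text v : values) {
                  if (group.length() > 0) group.append(", ");
                  group.append(v.toString());
              }
              context.write(key, new Text(group.toString()));
          }
      }

      // Driver: submitted to the cluster, where the JobTracker schedules
      // the map and reduce tasks on the TaskTrackers.
      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "anagram finder");
          job.setJarByClass(AnagramJob.class);
          job.setMapperClass(AnagramMapper.class);
          job.setReducerClass(AnagramReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(Text.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }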
The Output
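The actual output shown on this slide was an image. Purely as an illustration of the format such a job produces (not the authors' run), each output line pairs the sorted-letter key with the words that are anagrams of one another, for example:

  eilnst    enlist, listen, silent
  opst      opts, post, pots, spot, stop, tops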
Why Anagrams?
• Started out as a simple relaxation game, finding anagrams in
sentences
• Games and Puzzles like Scrabble
• Ciphers, like permutation and transposition ciphers
Future scope
Keeping in mind the vast range of applications of Hadoop, we have certain
graph-searching techniques in mind that would be much easier to solve with the
help of the Map-Reduce engine.
References
• Introduction to Hadoop: Welcome to Apache
  https://hadoop.apache.org/
• Cloudera Documentation: Usage
  http://www.cloudera.com/content/cloudera/en/documentation/hadoop-tutorial/CDH5/Hadoop-Tutorial/ht_usage.html
• Edureka: Anatomy of a Map-Reduce Job
  http://www.edureka.co/blog/anatomy-of-a-mapreduce-job-in-apache-hadoop/
• Stackoverflow: Explain Map-Reduce Simply
  http://stackoverflow.com/questions/28982/please-explain-mapreduce-simply
Thank you
