BY ANKIT PRASAD
CSE 3RD YEAR
NSEC
What is Big Data?
Big Data is a collection of large datasets that
cannot be processed using traditional
computing techniques.
Big Data involves huge volume, high velocity,
and an extensible variety of data.
Classification of Big Data
Big Data is generally classified into three types:
Structured data: Relational data.
Semi Structured data: XML data.
Unstructured data: Word, PDF, Text, Media
Logs.
Big Data Challenges
The major challenges associated with big data:
Capturing data
Storage
Searching
Sharing
Transfer
Analysis
Presentation
Google's Solution: MapReduce
It is a parallel programming model for writing
distributed applications.
It can efficiently process multi-terabyte datasets.
It runs on large clusters of commodity
hardware in a reliable, fault-tolerant manner.
Introduction to Hadoop
Hadoop was developed by Doug Cutting.
Hadoop is an Apache open-source
framework written in Java.
Hadoop allows distributed storage and
processing of large datasets across clusters of
computers.
Hadoop Architecture
Hadoop has two major layers:
Processing/Computation layer (MapReduce)
Storage layer (Hadoop Distributed File System)
Other modules of the Hadoop framework include:
Hadoop Common
Hadoop YARN (Yet Another Resource Negotiator)
What is MapReduce?
The MapReduce algorithm contains two
important tasks, namely Map and Reduce.
Map takes a set of data and breaks
individual elements into tuples (key/value
pairs).
Reduce takes the Map output as input and
combines those data tuples into a smaller
set of tuples.
Under the MapReduce model, the data
processing primitives are called mappers and
reducers.
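As an illustration (the classic word-count example, not taken from these slides; the class names are made up for this sketch), a mapper and a reducer written against Hadoop's Java MapReduce API might look like this:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: breaks each input line into (word, 1) key/value pairs.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit one tuple per word occurrence
        }
    }
}

// Reducer: combines all tuples sharing a key into a smaller set (one count per word).
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```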
MapReduce Algorithm
Hadoop initiates the Map stage by issuing
mapping tasks to the appropriate servers in the
cluster.
Map stage:
The input file or directory, stored in HDFS, is
passed to the mapper function line by line.
The mapper processes the data and creates
several small chunks of data (key/value pairs).
Hadoop monitors for task completion and then
initiates the shuffle stage.
Shuffle stage:
The framework groups the data from all mappers
by key and distributes the groups to the
appropriate servers for the reduce stage.
Reduce stage:
The Reducer processes the data coming from
the mappers, producing a new set of output
that is stored in HDFS (a driver sketch wiring
these stages together is shown below).
The framework manages all the details of
data-passing and copying between the
nodes in the cluster.
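The driver sketch below is illustrative (class and path names are assumptions, reusing the mapper and reducer from the earlier sketch); it shows how the stages are wired together while the framework handles the shuffle and the data passing:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Map stage: the mapper reads the HDFS input line by line.
        job.setMapperClass(TokenizerMapper.class);
        // Reduce stage: the reducer receives shuffled (key, values) groups.
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in HDFS, passed on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // The framework manages the shuffle and data copying between nodes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```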
Hadoop Distributed File System (HDFS)
HDFS is based on the Google File System.
It is highly fault-tolerant and is designed to be
deployed on low-cost hardware.
It is suitable for applications having large
datasets.
Files are stored in a redundant fashion to
protect the system from possible data loss in
case of failure.
HDFS Architecture
Namenode:
It acts as a master server that manages the
file system namespace.
It also regulates clients' access to files.
Datanode:
These nodes manage the data storage of
their system.
They perform read/write and block
operations as regulated by the namenode.
Block:
It is the minimum amount of data that HDFS
can read or write.
The files are divided into one or more blocks.
Blocks are stored in individual data nodes.
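As a minimal sketch, assuming a reachable cluster and using an illustrative path, Hadoop's Java FileSystem API can write a file and then ask the namenode which blocks exist and which datanodes hold them:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlocksExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // the client talks to the namenode

        Path file = new Path("/user/demo/sample.txt"); // illustrative path
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("hello hdfs\n");            // the data itself goes to datanodes
        }

        // The namenode knows how the file is split into blocks and where the replicas live.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```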
Hadoop Common
It provides essential services and basic
processes such as abstraction of the
underlying operating system and its file
system.
It assumes that hardware failures are
common and should be automatically
handled by the Framework.
It also contains the necessary Java Archive
(JAR) files and scripts required to start
Hadoop.
Hadoop YARN
ResourceManager:
It is the cluster-level master that manages and
allocates resources to applications and
schedules their tasks.
ApplicationMaster:
It is responsible for negotiating resources with
the ResourceManager and for working
with the NodeManagers to execute and
monitor the tasks.
NodeManager:
It takes instructions from the ResourceManager
and manages resources on its own node.
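A small hedged sketch (not from the slides) of how a client can ask the ResourceManager about the NodeManagers it coordinates, using the YarnClient API:

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnNodesExample {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration()); // reads yarn-site.xml for the ResourceManager address
        yarnClient.start();

        // The ResourceManager reports every running NodeManager and its resources.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + " containers=" + node.getNumContainers()
                    + " capability=" + node.getCapability()); // memory and vcores on that node
        }
        yarnClient.stop();
    }
}
```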
How Does Hadoop Work?
Data is initially divided into directories and
files. Files are divided into uniformly sized blocks
of 128 MB or 64 MB (preferably 128 MB); a
configuration sketch follows this list.
These files are then distributed across various
cluster nodes for further processing,
supervised by HDFS.
Blocks are replicated to handle hardware
failure.
Hadoop also takes care of:
Checking that the code was executed
successfully.
Performing the sort that takes place between
the map and reduce stages.
Sending the sorted data to a certain
computer.
Writing the debugging logs for each job.
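The configuration sketch below is illustrative (the property names are the standard HDFS ones; the path and values are examples) and shows how block size and replication can be set cluster-wide or per file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettingsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide defaults (normally set in hdfs-site.xml): 128 MB blocks, 3 replicas.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);

        // Per-file override: buffer size, replication factor and block size for this file only.
        Path file = new Path("/user/demo/big-input.dat");   // illustrative path
        try (FSDataOutputStream out =
                 fs.create(file, true, 4096, (short) 3, 64L * 1024 * 1024)) {
            out.writeBytes("example data\n");
        }
        fs.close();
    }
}
```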
Applications of Hadoop
Black Box Data
Social Media Data
Stock Exchange Data
Transport Data
Search Engine Data
Prominent Users of Hadoop
The Yahoo! Search Webmap is a Hadoop
application that runs on a large Linux cluster.
In 2010, Facebook claimed that they had the
largest Hadoop cluster in the world.
The New York Times used 100 Amazon EC2
instances and a Hadoop application to
process 4 TB of data into 11 million PDFs in a day,
at a computation cost of about $240.
Advantages of Hadoop
Hadoop is open source and compatible with
all platforms since it is Java-based.
Hadoop does not rely on hardware to
provide fault tolerance and high availability;
the Hadoop library itself detects and handles
failures at the application layer.
Servers can be added to or removed from the
cluster dynamically without interruption.
Hadoop efficiently utilizes the underlying
parallelism of the CPU cores in distributed
systems.
References:
www.tutorialspoint.com/hadoop/
https://en.wikipedia.org/wiki/Apache_Hadoop
https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/YARN.html
https://hortonworks.com/blog/apache-hadoop-yarn-resourcemanager/
http://saphanatutorial.com/how-yarn-overcomes-mapreduce-limitations-in-hadoop-2-0/