Krishnendu P
CONTENTS:
 Data and Big Data
 Problems with Big Data
 Hadoop
 Small History of Hadoop
 What problems can Hadoop solve?
 Components of Hadoop - HDFS, MapReduce
 Hadoop Cluster
 High Level Architecture of Hadoop
 Hadoop Core Components
 Features of Hadoop
 Limitations of Hadoop
 Users of Hadoop
 Conclusion
 References
Data:
➔ Any real-world symbol (character, numeric, special character) or a group of them is said to be data.
➔ It may be visual, audio, textual, etc.
Big Data
Big data means exactly what it says: a collection
of large datasets that cannot be processed using
on-hand database management tools or
traditional computing techniques.
Big Data
Big Data includes huge volume, high velocity,
and an extensible variety of data. The data in it will
be of three types:
Structured data : relational data.
Semi-structured data : XML data.
Unstructured data : Word, PDF, plain text.
Problems with Big Data:
➔About 0.5 petabytes of updates are made to
Facebook daily, including 40 million photos.
➔Every day, YouTube is loaded with enough video
to be watched continuously for a year.
➔Limitations due to large data sets are encountered
in many areas, including genomics, complex
physics simulations, and biological and
environmental research.
Cont...
➔These limitations also affect Internet search,
finance, and business informatics.
➔The challenges include capture, retrieval,
storage, search, sharing, analysis, and
visualization.
What could be the solution for Big Data?
Hadoop
What is Hadoop?
➔Hadoop is an open-source, Java-based
programming framework developed by Doug
Cutting and Mike Cafarella in 2005.
➔It is part of the Apache project sponsored by the
Apache Software Foundation.
➔It is designed to scale up from single servers to
thousands of machines, each offering local
computation and storage.
Cont...
➔It is used for distributed storage and distributed
processing of very large data sets on computer
clusters built from commodity hardware.
Small History
➔Hadoop was inspired by Google's MapReduce, a
software framework in which an application is
broken down into numerous small parts.
➔Any of these parts (also called fragments or blocks)
can be run on any node in the cluster.
➔Doug Cutting, Hadoop's creator, named the
framework after his child's stuffed toy elephant.
Small History
➔Started with building a Web Search Engine
- Nutch, in 2002
- The aim was to index billions of pages.
- The architecture couldn't support billions of pages.
➔Google's GFS (2003) solved the storage problem.
- The Nutch Distributed File System followed in 2004.
➔Google published MapReduce in 2004.
- MapReduce was implemented in Nutch in 2005.
[Photos: Doug Cutting with Hadoop, the stuffed toy elephant; Mike Cafarella]
2005: Doug Cutting and Mike Cafarella developed Hadoop
to support distribution for the Nutch search engine project.
The project was funded by Yahoo.
2006: Yahoo gave the project to the Apache
Software Foundation.
Now Apache Hadoop is a registered trademark of the
Apache Software Foundation.
What problems can Hadoop solve?
The Hadoop platform was designed to solve problems
where you have a lot of data, perhaps a mixture of
complex and structured data, that doesn't fit well
into tables.
Components Of Hadoop
Hadoop consists of MapReduce, the Hadoop
Distributed File System (HDFS), and a number of
related projects such as Apache Hive, HBase, and
ZooKeeper.
[Diagram: HADOOP → HDFS + MapReduce]
HDFS (Hadoop Distributed File System)
➔The Hadoop Distributed File System (HDFS) is a
distributed file system designed to run on
commodity hardware.
➔ It is a sub-project of the Apache Hadoop project.
➔ HDFS is highly fault-tolerant and is designed to
be deployed on low-cost hardware.
➔HDFS provides high throughput access to
application data and is suitable for applications
that have large data sets.
Cont...
➔HDFS takes care of storing and managing the
data within the Hadoop cluster.
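As a hedged sketch of how an application talks to HDFS, the snippet below writes a small file through Hadoop's Java FileSystem API. The NameNode address hdfs://namenode:9000 and the path /user/demo/hello.txt are placeholders, not values from this deck:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; use your cluster's actual host and port.
        conf.set("fs.default.name", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Create a file in HDFS: the NameNode records the metadata,
        // while the actual blocks land on DataNodes.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("hello, HDFS\n");
        }
        System.out.println("File exists: " + fs.exists(file));
    }
}
```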
MapReduce
➔ MapReduce is a programming model used for
processing large data sets.
➔Programs written in this functional style are
automatically parallelized and executed on a large
cluster of commodity machines.
➔MapReduce is an associated implementation for
processing and generating large data sets.
MapReduce
A MapReduce program executes in two stages: the
map stage and the reduce stage.
Map stage :
The map or mapper's job is to process the
input data. Generally, the input data is a file or
directory stored in the Hadoop Distributed File
System (HDFS). The input file is passed to the
mapper function line by line. The mapper processes
the data and creates several small chunks of data.
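A minimal word-count mapper sketch using Hadoop's org.apache.hadoop.mapreduce API illustrates this line-by-line processing. The class and field names (TokenMapper, ONE) are illustrative, not from this deck:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Each call receives one line of the input file; emit a (word, 1)
        // pair for every token, producing the "small chunks" described above.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```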
MapReduce
Reduce stage :
The Reducer's job is to process the data that
comes from the mapper. After processing, it
produces a new set of outputs, which are stored in
HDFS.
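A matching reducer sketch for the word-count example, again using the org.apache.hadoop.mapreduce API; the class name SumReducer is illustrative:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // All values emitted by the mappers for one key arrive together;
        // sum them into a single total for that word.
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```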
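To tie the two stages together, a driver class configures and submits the job. A minimal sketch, assuming the TokenMapper and SumReducer classes above and HDFS input/output paths passed on the command line:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hadoop 1.x style constructor; newer releases prefer Job.getInstance.
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, it can be launched with, for example, `hadoop jar wordcount.jar WordCount /input /output` (the jar name and paths are illustrative).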
Hadoop Core components
MASTER NODE: Name node (storage) + Job tracker (compute)
SLAVE NODE: Data node (storage) + Task tracker (compute)
Cont...
Node :
A technical term used to describe a
machine or computer that is present in a
cluster.
Daemon :
A technical term used to describe a
background process running on a Linux
machine.
Cont...
➔ The Master node is responsible for running the
Name node and Job tracker daemons.
➔ The Slave nodes are responsible for running the
Data node and Task tracker daemons.
Cont...
➔The Name node and Data node are responsible
for storing and managing the data, and are
commonly referred to as Storage Nodes.
➔The Job Tracker and Task Tracker are
responsible for processing and computing the
data, and are commonly referred to as
Compute Nodes.
Cont...
➔Usually the Name node and Job tracker are
configured on a single machine.
➔The Data node and Task tracker are
configured on multiple machines, and can
have instances running on more than one
machine at the same time.
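As a hedged illustration of that split, classic Hadoop 1.x pins the Name node and Job tracker addresses in two configuration files present on every machine. The hostname `master` and the ports below are placeholders, not values from this deck:

```xml
<!-- core-site.xml: where the Name node (master, storage side) listens. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

<!-- mapred-site.xml: where the Job tracker (master, compute side) listens. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>
```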
Hadoop Cluster
➔ Normally, any set of loosely or tightly connected
computers that work together as a single system is
called a cluster.
➔ In simple words, a computer cluster used for Hadoop
is called a Hadoop Cluster.
Hadoop Cluster
A Hadoop cluster is a special type of computational
cluster designed for storing and analyzing vast
amounts of unstructured data in a distributed
computing environment. These clusters run on
low-cost commodity computers.
Hadoop Cluster
➔Hadoop clusters are often referred to as "shared
nothing" systems because the only thing shared
between nodes is the network that connects
them.
➔Clustering improves the system's availability to
users.
Hadoop Cluster
A Real-World Example:
Yahoo's Hadoop cluster has more than 10,000
machines running Hadoop and nearly 1 petabyte
of user data.
Features of Hadoop
● Scalability :
Scalability refers to the ability to add or
remove nodes without bringing down or affecting
cluster operation; a sketch follows below.
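For example, on a classic Hadoop 1.x cluster a new slave node can join a running cluster without a restart. A sketch, assuming HADOOP_HOME is set and `newnode` is the new machine's hostname (both placeholders):

```sh
# On the master: record the new slave for future cluster-wide scripts.
echo "newnode" >> $HADOOP_HOME/conf/slaves

# On the new node: start the slave daemons; they register with the
# running NameNode and JobTracker while the cluster keeps serving.
$HADOOP_HOME/bin/hadoop-daemon.sh start datanode
$HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker
```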
Features of Hadoop
● Cost effective :
Hadoop does not require any expensive,
specialized hardware; it can be implemented
on simple hardware. These hardware components
are technically called commodity hardware.
Features of Hadoop
● Large Cluster of Nodes:
A Hadoop cluster can be made up
of hundreds or thousands of nodes. One of the
main advantages of a large cluster is that it
offers more computing power and a huge
storage system to clients.
Features of Hadoop
● Parallel Processing of Data:
Data can be processed
simultaneously across all the nodes
within the cluster, saving a lot
of time.
Features of Hadoop
● Automatic Failover Management:
If any of the nodes
within the cluster fails, the Hadoop framework
automatically shifts its work to another
machine.
Features of Hadoop
● Flexible :
Hadoop is schema-less and can
absorb any type of data, structured or not,
from any number of sources.
● Fault-tolerant :
When you lose a node, the system
redirects work to another location of the
data and continues processing without
missing a beat.
Limitations of Hadoop
● Security concerns
● Vulnerable by nature
● Not fit for Small data
● Potential stability issues
What is Hadoop used for?
● Search
- Yahoo, Amazon, Zvents
● Log processing
- Facebook, Yahoo, ContextWeb, Joost, Last.fm
● Recommendation Systems
- Facebook
● Data Warehouse
- Facebook, AOL (America Online)
● Video and Image Analysis
- New York Times, Eyealike
Conclusion
➔Hadoop has been very effective for companies
dealing with data in petabytes.
➔It has solved many industry problems related
to huge data management and distributed
systems.
➔As it is open source, it has been widely
adopted by companies.
References
● www.dezyre.com/Big-Data-and-Hadoop
● www.cloudera.com/content/www/...hadoop/hdfs-mapreduce-yarn.html
● www.ufaber.com/hadoop/bigbata/free
● www.psgtech.edu/yrgcc/attach/haoop_architecture.ppt