Krishnendu P
CONTENTS:
➔ Data and Big Data
➔ Problems with Big Data
➔ Hadoop
➔ Small History of Hadoop
➔ What problems can Hadoop solve?
➔ Components of Hadoop - HDFS, MapReduce
➔ Hadoop Cluster
➔ High Level Architecture of Hadoop
➔ Hadoop Core Components
➔ Features of Hadoop
➔ Limitations of Hadoop
➔ Users of Hadoop
➔ Conclusion
➔ References
Data:
➔ Any real-world symbol (letter, digit, or special
character), or any group of such symbols, is data.
➔ It may be visual, audio, textual, etc.
Big Data
Big data is a collection of datasets so large that
they cannot be processed using on-hand database
management tools or traditional computing techniques.
Big data is characterized by huge volume, high velocity,
and a wide variety of data. The data is of three types:
Structured data : Relational data.
Semi-structured data : XML data.
Unstructured data : Word, PDF, Text
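The three types above can be illustrated with a minimal sketch. The sample records (names, cities, the sentence) are invented for illustration only; the sketch uses Python's standard `csv` and `xml.etree.ElementTree` modules.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: rows with a fixed schema, as in a relational table.
# (Sample values are hypothetical.)
structured = io.StringIO("id,name,age\n1,Asha,31\n2,Ravi,27\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing tags, but no rigid table schema.
semi_structured = "<user id='1'><name>Asha</name><city>Kochi</city></user>"
user = ET.fromstring(semi_structured)

# Unstructured: free text with no predefined fields.
unstructured = "Asha visited Kochi last week and uploaded forty photos."

print(rows[0]["name"])            # a column looked up by name
print(user.find("city").text)     # a field located by tag
print(len(unstructured.split()))  # only generic operations apply, e.g. word count
```

The further down the list, the less the data itself tells a program how to read it, which is why unstructured data does not fit well into tables.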
Problems with Big Data:
➔Every day, about 0.5 petabytes of updates are made
to Facebook, including 40 million photos.
➔Every day, YouTube receives enough new video to
watch continuously for one year.
➔Large data sets impose limitations in many areas,
including genomics, complex physics simulations,
and biological and environmental research.
➔They also affect Internet search, finance, and
business informatics.
➔The challenges include capture, retrieval,
storage, search, sharing, analysis, and
visualization.
What could be the solution for Big Data?
Hadoop
What is Hadoop?
➔Hadoop is an open-source, Java-based
programming framework developed by Doug
Cutting and Mike Cafarella in 2005.
➔It is part of the Apache project sponsored by the
Apache Software Foundation.
➔It is designed to scale up from single servers to
thousands of machines, each offering local computation
and storage.
➔It is used for distributed storage and distributed
processing of very large data sets on computer
clusters built from commodity hardware.
Small History
➔Hadoop was inspired by Google's MapReduce, a
software framework in which an application is
broken down into numerous small parts.
➔Any of these parts (also called fragments or blocks)
can be run on any node in the cluster.
➔Doug Cutting, Hadoop's creator, named the
framework after his child's stuffed toy elephant.
➔Started with building a web search engine
- Nutch in 2002
- The aim was to index billions of pages.
- The architecture could not support billions of pages.
➔Google's GFS in 2003 solved the storage problem.
- Nutch Distributed File System in 2004.
➔Google's MapReduce in 2004
- MapReduce implemented in 2005.
[Photos: Doug Cutting with Hadoop; Mike Cafarella]
2005: Doug Cutting and Mike Cafarella developed Hadoop
to support distribution for the Nutch search engine project.
The project was funded by Yahoo.
2006: Yahoo gave the project to the Apache
Software Foundation.
Apache Hadoop is now a registered trademark of the
Apache Software Foundation.
What problems can Hadoop solve?
The Hadoop platform was designed to solve problems
where you have a lot of data (perhaps a mixture of
complex and structured data) that doesn't fit well
into tables.
Components Of Hadoop
Hadoop consists of MapReduce, the Hadoop
distributed file system (HDFS) and a number of
related projects such as Apache Hive, HBase and
Zookeeper.
[Diagram: Hadoop, with its two components HDFS and MapReduce]
HDFS (Hadoop Distributed File System)
➔The Hadoop Distributed File System (HDFS) is a
distributed file system designed to run on
commodity hardware.
➔ It is a sub-project of the Apache Hadoop project.
➔ HDFS is highly fault-tolerant and is designed to
be deployed on low-cost hardware.
➔HDFS provides high throughput access to
application data and is suitable for applications
that have large data sets.
➔HDFS takes care of storing and managing the
data within the Hadoop cluster.
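How HDFS can store a large file fault-tolerantly on commodity machines is easier to see with a small sketch. HDFS splits files into blocks and keeps several copies of each block on different data nodes. The 128 MB block size and replication factor of 3 below are assumed values for illustration, and the round-robin placement is a deliberate simplification, not the placement policy HDFS actually uses. The function and node names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128   # assumed block size, for illustration
REPLICATION = 3       # assumed number of copies per block

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)

def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct data nodes (simple round-robin)."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            data_nodes[(block + i) % len(data_nodes)] for i in range(replication)
        ]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]       # hypothetical data node names
blocks = split_into_blocks(500)            # a 500 MB file needs 4 blocks
layout = place_replicas(blocks, nodes)
for block, replicas in layout.items():
    print(block, replicas)
```

Because every block lives on three different machines in this sketch, losing any single machine still leaves two copies of each block it held.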
MapReduce
➔MapReduce is a programming model, with an
associated implementation, for processing and
generating large data sets.
➔Programs written in this functional style are
automatically parallelized and executed on a large
cluster of commodity machines.
A MapReduce program executes in two stages: the
map stage and the reduce stage.
Map stage :
The mapper's job is to process the input data.
Generally the input is a file or directory stored in
the Hadoop Distributed File System (HDFS). The input
file is passed to the mapper function line by line. The
mapper processes the data and emits many small chunks
of intermediate data (key-value pairs).
Reduce stage :
The reducer's job is to process the data that
comes from the mappers. After processing, it
produces a new set of output, which is stored in
HDFS.
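The two stages can be sketched with the classic word-count example. This is a single-process illustration of the programming model in plain Python, not Hadoop's Java API; the `shuffle` step (grouping intermediate values by key between map and reduce) is done by the framework in a real job, and all function names here are chosen for illustration.

```python
from collections import defaultdict

def mapper(line):
    """Map stage: receive one input line, emit (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Reduce stage: combine all values for one key into a single result."""
    return key, sum(values)

def run_job(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    intermediate = [pair for line in lines for pair in mapper(line)]
    grouped = shuffle(intermediate)
    return dict(reducer(key, values) for key, values in grouped.items())

counts = run_job(["big data needs hadoop", "hadoop stores big data"])
print(counts)
```

Since each line is mapped independently and each key is reduced independently, both stages can be spread over many machines without changing the program.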
Hadoop Core components
              Storage node    Compute node
MASTER NODE   Name node       Job tracker
SLAVE NODE    Data node       Task tracker
Node :
A technical term for a machine or computer
that is present in a cluster.
Daemon :
A technical term for a background process
running on a Linux machine.
➔The Master node is responsible for running the
Name node and Job tracker daemons.
➔The Slave node is responsible for running the
Data node and Task tracker daemons.
➔The Name node and Data node are responsible
for storing and managing the data, and they
are commonly referred to as Storage Nodes.
➔The Job tracker and Task tracker are
responsible for processing and computing the
data, and they are commonly referred to as
Compute Nodes.
➔Usually the Name node and Job tracker are
configured on a single machine.
➔The Data node and Task tracker are
configured on multiple machines, so their
instances can run on more than one machine
at the same time.
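The layout described above can be written down as a small data structure. This is only a model of which daemon runs where, assuming one master machine and a configurable number of slaves; the host names and helper functions are hypothetical.

```python
# Which daemons run on which kind of machine, as described above.
DAEMONS = {
    "master": ["NameNode", "JobTracker"],
    "slave": ["DataNode", "TaskTracker"],
}

def build_cluster(num_slaves):
    """Return a mapping of host name -> daemons, with one master and N slaves."""
    cluster = {"master-1": list(DAEMONS["master"])}
    for i in range(1, num_slaves + 1):
        cluster[f"slave-{i}"] = list(DAEMONS["slave"])
    return cluster

def nodes_running(cluster, daemon):
    """List the hosts on which the given daemon is running."""
    return sorted(host for host, daemons in cluster.items() if daemon in daemons)

cluster = build_cluster(3)
print(nodes_running(cluster, "NameNode"))   # the single master
print(nodes_running(cluster, "DataNode"))   # every slave
```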
Hadoop Cluster
➔ Any set of loosely or tightly connected computers
that work together as a single system is called
a cluster.
➔ In simple words, a computer cluster used for Hadoop
is called a Hadoop cluster.
A Hadoop cluster is a special type of computational
cluster designed for storing and analyzing vast
amounts of unstructured data in a distributed
computing environment. These clusters run on
low-cost commodity computers.
➔Hadoop clusters are often referred to as "shared
nothing" systems because the only thing that is
shared between nodes is the network that connects
them.
➔Clustering improves the system's availability to
users.
A Real-World Example:
Yahoo's Hadoop cluster has more than 10,000
machines running Hadoop and nearly 1 petabyte
of user data.
Features of Hadoop
● Scalability :
Scalability refers to the ability to add or
remove nodes without bringing down or
affecting cluster operation.
● Cost effective :
Hadoop does not require expensive,
specialized hardware. It can be implemented
on simple hardware, technically called
commodity hardware.
● Large Cluster of Nodes:
A Hadoop cluster can be made up of
hundreds or thousands of nodes. One of the
main advantages of a large cluster is that it
offers more computing power and a huge
storage system to clients.
● Parallel Processing of Data:
The data can be processed
simultaneously across all the nodes
within the cluster, saving a lot
of time.
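The split, process, and combine pattern behind parallel processing can be sketched on one machine. Worker threads stand in for cluster nodes here, so the sketch shows the structure of the computation rather than a real speed-up; the function names and the sum-of-squares workload are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, parts):
    """Cut a list into at most `parts` chunks of nearly equal size."""
    if not data:
        return []
    size = -(-len(data) // parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def process_chunk(chunk):
    """The work each 'node' does on its own share of the data."""
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=4):
    """Process every chunk concurrently, then combine the partial results."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)

total = parallel_sum_of_squares(list(range(1, 101)))
print(total)
```

Each chunk is processed without reference to any other chunk, which is the property that lets a cluster hand the chunks to different machines.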
● Automatic Failover Management:
If any node within the cluster fails,
the Hadoop framework replaces that
machine with another machine.
● Flexible :
Hadoop is schema-less, and can
absorb any type of data, structured or not,
from any number of sources.
● Fault-tolerant :
When you lose a node, the system
redirects work to another location of the
data and continues processing without
missing a beat.
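Redirecting work "to another location of the data" can be sketched as follows. The sketch assumes each block of data is stored on several nodes and simply picks the first copy that is still alive; the block layout, node names, and functions are hypothetical, and a real scheduler weighs more factors than this.

```python
def pick_node(block_replicas, failed_nodes):
    """Return the first node holding the block that has not failed, or None."""
    for node in block_replicas:
        if node not in failed_nodes:
            return node
    return None

def schedule(layout, failed_nodes=frozenset()):
    """Choose one live node per block; fail only if every copy of a block is lost."""
    plan = {}
    for block, replicas in layout.items():
        node = pick_node(replicas, failed_nodes)
        if node is None:
            raise RuntimeError(f"all replicas of block {block} are lost")
        plan[block] = node
    return plan

# Hypothetical layout: block id -> nodes that hold a copy of it.
layout = {
    0: ["dn1", "dn2", "dn3"],
    1: ["dn2", "dn3", "dn4"],
    2: ["dn3", "dn4", "dn1"],
}

before = schedule(layout)
after = schedule(layout, failed_nodes={"dn2"})
print(before)
print(after)
```

With `dn2` down, only the work for block 1 moves, and it moves to a node that already holds that block, so no data has to be copied before processing continues.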
Limitations of Hadoop
● Security concerns
● Vulnerable by nature
● Not fit for Small data
● Potential stability issues
What is Hadoop used for?
● Search
– Yahoo, Amazon, Zvents
● Log processing
– Facebook, Yahoo, ContextWeb, Joost, Last.fm
● Recommendation Systems
– Facebook
● Data Warehouse
– Facebook, AOL (America Online)
● Video and Image Analysis
– New York Times, Eyealike
Conclusion
➔Hadoop has been very effective for companies
dealing with petabytes of data.
➔It has solved many industry problems
related to huge data management and
distributed systems.
➔Because it is open source, it has been widely
adopted by companies.
References
● www.dezyre.com/Big-Data-and-Hadoop
● www.cloudera.com/content/www/...hadoop/hdfs-mapreduce-yarn.html
● www.ufaber.com/hadoop/bigbata/free
● www.psgtech.edu/yrgcc/attach/haoop_architecture.ppt
Hadoop seminar