HADOOP OVERVIEW &
ARCHITECTURE
BY
CHANDINI SANS
CONTENTS
1. Why Hadoop?
2. Importance of Hadoop
3. What’s in Hadoop?
4. Apache Hadoop ecosystem
5. Hadoop architecture
6. Hadoop Map Reduce
7. HDFS
8. Advantages of Hadoop
COST PER GIGA BYTE
STORAGE TRENDS
ISSUES WITH LARGE DATA
• Map Parallelism: Chunking input data
• Reduce Parallelism: Grouping related
data
• Dealing with failures & load imbalance
• Doug Cutting and Mike Cafarella developed an
open source project called HADOOP in 2005,
and Doug named it after his son's toy elephant.
• Hadoop has become one of the most talked about
technologies.
• Why? One of the top reasons is its ability to handle
huge amounts of data – any kind of data – quickly.
With volumes and varieties of data growing each
day, especially from social media and automated
sensors, that’s a key consideration for most
organizations. 
• Hadoop is an open-source software framework
for storing and processing big data in a
distributed fashion on large clusters of
commodity hardware.
• Essentially, it accomplishes two tasks:
- massive data storage
- faster processing
• Hadoop is an Apache open source framework
written in Java that allows distributed
processing of large datasets across clusters of
computers using simple programming models.
Hadoop is designed to scale up from a single
server to thousands of machines, each offering
local computation and storage.
WHO USES HADOOP?
WHY IS HADOOP IMPORTANT?
• Low cost : The open-source framework is free and uses
commodity hardware to store large quantities of data.
• Computing power : Its distributed computing model
can quickly process very large volumes of data.
• Scalability : You can easily grow your system simply by
adding more nodes
• Storage flexibility : You can store as much data as you
want and decide how to use it later.
• Inherent data protection and self-healing
capabilities : Data and application processing are protected
against hardware failure.
WHAT’S IN HADOOP?
• HDFS – the Java-based distributed file system that can
store all kinds of data without prior organization.
• MapReduce – a software programming model for
processing large sets of data in parallel.
• YARN – a resource management framework for
scheduling and handling resource requests from distributed
applications.
COMPONENTS THAT HAVE ACHIEVED TOP-
LEVEL APACHE PROJECT STATUS
• Pig – a platform for manipulating data stored in HDFS. It
consists of a compiler for Map Reduce programs and a
high-level language called Pig Latin.
• Hive – a data warehouse with a SQL-like query language
that presents data in the form of tables. Hive programming
is similar to database programming. (It was initially
developed by Facebook.)
• HBase – a non-relational, distributed database that runs
on top of Hadoop. HBase tables can serve as input and
output for Map Reduce jobs.
• Zookeeper – an application that coordinates distributed
processes.
• Ambari – a web interface for managing, configuring
and testing Hadoop services and components.
• Flume – software that collects, aggregates and moves
large amounts of streaming data into HDFS.
• Sqoop – a connection and transfer mechanism that
moves data between Hadoop and relational databases.
• Oozie – a Hadoop job scheduler.
HADOOP ARCHITECTURE
• The Hadoop framework includes the following four modules:
• Hadoop Common : These are Java libraries and
utilities required by other Hadoop modules. These
libraries provide filesystem and OS level abstractions
and contain the necessary Java files and scripts
required to start Hadoop.
• Hadoop YARN : This is a framework for job
scheduling and cluster resource management.
• Hadoop Distributed File System (HDFS) : A
distributed file system that provides high-throughput
access to application data.
• Hadoop MapReduce : This is a YARN-based system
for parallel processing of large data sets.
COMPONENTS OF HADOOP
FRAMEWORK:
HADOOP MAP REDUCE
WHAT IS MAP REDUCE?
• Hadoop runs applications using the Map
Reduce programming model, where the data is
processed in parallel on different nodes of the cluster.
• A Map Reduce program executes in three stages,
namely map stage, shuffle stage, and reduce
stage.
STAGES OF MAP REDUCE
• Map stage : The mapper’s job is to process the input data.
The input, a file or directory stored in the Hadoop file
system (HDFS), is passed to the mapper function line by
line. The mapper processes the data and emits intermediate
key-value pairs.
• Reduce stage : This stage is the combination of
the Shuffle stage and the Reduce stage. The shuffle groups
the mapper output by key, and the Reducer’s job is to
process that grouped data. After processing, it produces a
new set of output, which is stored in HDFS.
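The three stages can be illustrated with a small in-memory word count. This is only a sketch of the data flow in plain Python, not Hadoop's own API: the function names (`map_stage`, `shuffle_stage`, `reduce_stage`, `run_job`) are hypothetical, and a real job would run the mappers and reducers on different cluster nodes.

```python
from collections import defaultdict


def map_stage(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_stage(pairs):
    """Shuffle: group all values that share a key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_stage(key, values):
    """Reducer: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on a list of input lines."""
    mapped = [pair for line in lines for pair in map_stage(line)]
    return [reduce_stage(key, values) for key, values in shuffle_stage(mapped)]


if __name__ == "__main__":
    print(run_job(["big data big clusters", "data nodes"]))
```

Each mapper call sees only one line and each reducer call sees only one key, which is what lets the framework spread the work across machines.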
MAP REDUCE
MAP REDUCE
ARCHITECTURE
THINK MAP REDUCE
• Record = (Key, Value)
• Key : Comparable, Serializable
• Value : Serializable
• Input, Map, Shuffle, Reduce, Output
MAP
• Input: (Key1, Value1)
• Output: List(Key2, Value2)
• Projections, Filtering, Transformation
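A map function of this shape might look like the sketch below. The record layout (a CSV line of `user,country,bytes`), the `min_bytes` threshold and the function name `map_record` are all assumptions made for illustration. The input key is taken to be the byte offset of the line, and the output is a list that may be empty (filtering) or carry a new key and a converted value (projection and transformation).

```python
def map_record(offset, line, min_bytes=1000):
    """Map one (Key1, Value1) record to a list of (Key2, Value2) records.

    Key1 is the byte offset of the line and Value1 is the line text.
    The CSV layout and min_bytes are hypothetical.
    """
    user, country, size = line.strip().split(",")
    size = int(size)              # transformation: text to integer
    if size < min_bytes:          # filtering: drop small transfers
        return []
    return [(country, size)]      # projection: keep only two fields


records = [(0, "ann,IN,2500"), (12, "bob,US,300"), (23, "cy,IN,1200")]
output = [kv for offset, line in records for kv in map_record(offset, line)]
```

Here `output` holds only the two records that pass the filter, keyed by country and ready for the shuffle.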
• Data is organized into files and
directories
• Files are divided into uniform-sized
blocks (default 128 MB) and distributed
across cluster nodes
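As a rough worked example of the block arithmetic, the sketch below splits a file size into block-sized ranges. The function name `split_into_blocks` is hypothetical. It only computes offsets and lengths and does not touch a real file system. Note that the final block of a file is usually shorter than the block size.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default stated above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the (offset, length) of each block for a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB)  # two full 128 MB blocks plus one of 44 MB
```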
HDFS
• Blocks are replicated to handle hardware
failure
• Replication for performance and fault
tolerance (Rack-Aware placement)
• HDFS keeps checksums of data for
corruption detection and recovery
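The idea behind checksum-based corruption detection can be sketched as follows. This is an illustration only: it uses Python's `zlib.crc32` as a stand-in checksum, the 512-byte chunk size is an assumption for this sketch, and the helper names are hypothetical. Once a mismatch is found, recovery would mean reading the same data from another replica.

```python
import zlib

CHUNK = 512  # bytes covered by each checksum; an assumption for this sketch


def checksums(data, chunk=CHUNK):
    """Compute one CRC32 value per fixed-size chunk of the data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def corrupt_chunks(data, expected, chunk=CHUNK):
    """Return the indexes of chunks whose checksum no longer matches."""
    actual = checksums(data, chunk)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


block = bytes(range(256)) * 8      # 2048 bytes, so 4 chunks
stored = checksums(block)          # checksums kept alongside the data

damaged = bytearray(block)
damaged[700] ^= 0xFF               # flip the bits of one byte in chunk 1
bad = corrupt_chunks(bytes(damaged), stored)
```

Keeping one checksum per small chunk, rather than one per block, narrows down where the damage is without rereading the whole block.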
FEATURES OF HDFS
• HDFS is suitable for distributed storage and
processing.
• Hadoop provides a command interface to
interact with HDFS.
• The built-in servers of the name node and data
node help users to easily check the status of the
cluster.
• HDFS provides streaming access to file system data.
• HDFS provides file permissions and
authentication.
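The command interface mentioned above is the `hdfs dfs` file system shell. The sketch below builds a few common invocations from Python. The paths and the local file name are hypothetical, and the commands are only executed when an `hdfs` client is found on the machine.

```python
import shutil
import subprocess


def hdfs_dfs(*args):
    """Build the argument list for an `hdfs dfs` file system command."""
    return ["hdfs", "dfs", *args]


# Hypothetical paths, for illustration only.
commands = [
    hdfs_dfs("-mkdir", "-p", "/user/demo/input"),        # create a directory
    hdfs_dfs("-put", "local.txt", "/user/demo/input/"),  # upload a local file
    hdfs_dfs("-ls", "/user/demo/input"),                 # list the directory
    hdfs_dfs("-cat", "/user/demo/input/local.txt"),      # print file contents
]

if __name__ == "__main__" and shutil.which("hdfs"):
    # Runs only where a Hadoop client is installed and configured.
    for cmd in commands:
        subprocess.run(cmd, check=False)
```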
HDFS ARCHITECTURE
• Namenode : software that can run on commodity
hardware. The system running the namenode acts as the
master server and performs the following tasks:
- Manages the file system namespace.
- Regulates clients’ access to files.
- Executes file system operations such as renaming,
closing, and opening files and directories.
• Datanode : the datanodes manage the data storage of the system. They:
- perform read-write operations on the file system, as per
client request.
- perform operations such as block creation, deletion, and
replication.
• Block : user data is stored in HDFS files. Each file is
divided into one or more segments, which are stored on
individual datanodes. These segments are called blocks.
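The division of labour between namenode and datanodes can be modelled with a toy class. Everything here is an assumption for illustration: the class name `ToyNameNode`, the round-robin replica placement (the real placement policy is rack-aware), and the replication factor of 3. The point is that the namenode holds only metadata, while the datanodes would hold the block contents.

```python
from itertools import count


class ToyNameNode:
    """Keeps only metadata: which blocks make up a file and where replicas live."""

    def __init__(self, datanodes, replication=3, block_size=128 * 1024 * 1024):
        self.datanodes = list(datanodes)
        self.replication = min(replication, len(self.datanodes))
        self.block_size = block_size
        self.namespace = {}   # path -> list of block ids
        self.locations = {}   # block id -> list of datanode names
        self._ids = count()

    def create(self, path, size):
        """Register a file of `size` bytes and assign replicas for each block."""
        n_blocks = -(-size // self.block_size)  # ceiling division
        block_ids = []
        for _ in range(n_blocks):
            block_id = next(self._ids)
            start = block_id % len(self.datanodes)
            # Round-robin placement: a simplification, not the real policy.
            self.locations[block_id] = [
                self.datanodes[(start + r) % len(self.datanodes)]
                for r in range(self.replication)
            ]
            block_ids.append(block_id)
        self.namespace[path] = block_ids
        return block_ids

    def rename(self, old, new):
        """A namespace operation: only metadata changes, no block moves."""
        self.namespace[new] = self.namespace.pop(old)

    def block_locations(self, path):
        """Return (block id, datanodes) pairs for every block of a file."""
        return [(b, self.locations[b]) for b in self.namespace[path]]


nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
nn.create("/logs/a.log", 300 * 1024 * 1024)
nn.rename("/logs/a.log", "/logs/b.log")
```

A client would ask the namenode for `block_locations` and then read the block data directly from the listed datanodes.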
MASTER-SLAVE
ARCHITECTURE
GOALS OF HDFS
• Fault detection and recovery :
Since HDFS runs on a large number of commodity
hardware components, component failure is frequent. Therefore
HDFS should have mechanisms for quick and automatic
fault detection and recovery.
• Huge datasets :
HDFS should have hundreds of nodes per cluster to
manage applications with huge datasets.
• Hardware at data :
A requested task can be done efficiently when the
computation takes place near the data. Especially where huge
datasets are involved, this reduces network traffic and
increases throughput.
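The "hardware at data" goal amounts to a scheduling preference, which the sketch below expresses in a few lines. The function name `pick_node` and the node names are hypothetical, and a real scheduler weighs more factors. The sketch only shows the preference for a node that already stores the block over one that would have to fetch it across the network.

```python
def pick_node(block_replicas, free_nodes):
    """Prefer a free node that already stores the block.

    Returns (node, is_local). Falls back to the first free node when no
    replica holder is free, and to (None, False) when no node is free.
    """
    for node in free_nodes:
        if node in block_replicas:
            return node, True
    if free_nodes:
        return free_nodes[0], False
    return None, False


local_choice = pick_node(["dn2", "dn3"], ["dn1", "dn3"])  # dn3 holds a replica
remote_choice = pick_node(["dn2"], ["dn1"])               # no local option
```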
ADVANTAGES OF HADOOP
• The Hadoop framework allows the user to quickly write and
test distributed systems.
• The Hadoop library itself detects and handles failures at the
application layer.
• Servers can be added to or removed from the cluster
dynamically, and Hadoop continues to operate without
interruption.
• Apart from being open source, it is compatible with all
platforms since it is Java based.
Thank
You…!!!
