All of Hadoop
Yasser Hassan
Master Student
14 March 2017
Agenda
 What is Hadoop ?
 Who Uses Hadoop ?
 Uses of Hadoop
 Sample Application
 Core Hadoop Concepts
 Hadoop Goals
 Hadoop Challenges
 Hadoop Architecture
◦ HDFS (Hadoop Distributed File System)
 How are Files Stored
 Replication Strategy
 HDFS Architecture
 Understand HDFS Architecture
 HDFS Folder & Internals Files
Agenda
◦ YARN (Yet Another Resource Negotiator)
 Two important elements (Resource Manager, Node Manager)
 The application startup process
◦ MapReduce (MR)
 Importance of MapReduce
 MapReduce program execution
 Flow of MapReduce
 Inputs and Outputs
 EXAMPLE 1
 EXAMPLE 2
◦ Other Tools
 Why do these tools exist?
 Main Differences between Hadoop 1 & Hadoop 2
 Advantages of Hadoop
 Disadvantages of Hadoop
 Previous Generation
 Question
What is Hadoop ?
 An open-source software framework designed
for the storage and processing of large-scale
data on clusters of commodity hardware.
 Created by Doug Cutting and Mike
Cafarella in 2005.
Who Uses Hadoop ?
Uses of Hadoop
 Data-intensive text processing.
 Assembly of large genomes.
 Graph mining.
 Machine learning and data mining.
 Large scale social network analysis.
Sample Application
 Data analysis is the inner loop of Web
2.0
◦ Data ⇒ Information ⇒ Value
 Log processing: reporting.
 Search index
 Machine learning: Spam filters
 Competitive intelligence
Core Hadoop Concepts
 Applications are written in a high-level programming
language.
◦ No network programming.
 Nodes should communicate as little as possible.
 Data is spread among the machines in advance.
◦ Perform computation where the data is already
stored as often as possible.
Hadoop Goals
• Abstract and facilitate the storage and
processing
of large and/or rapidly growing data sets
• Structured and unstructured data
• Simple programming models
• High scalability and availability.
• Use commodity (cheap!) hardware.
• Fault-tolerance.
• Move computation rather than data.
Hadoop Challenges
 Hadoop is a cutting-edge technology
◦ Hadoop is a new technology, and as with adopting any new
technology, finding people who know it is difficult.
 Hadoop in the Enterprise Ecosystem
◦ Hadoop is designed to solve the Big Data problems encountered
by Web and social companies, and does not cover every
enterprise or government requirement. For example, HDFS does
not offer native support for security and authentication.
 Hadoop is still rough around the edges (under
development)
◦ The development and admin tools for Hadoop are still pretty
new. Companies like Cloudera, Hortonworks, and MapR are
working to fill this gap.
Hadoop Challenges
 Hadoop is NOT cheap
◦ Hardware Cost
 Hadoop runs on 'commodity' hardware, but these are not
cheap machines; they are server-grade hardware.
 So standing up a reasonably large Hadoop cluster, say
100 nodes, will cost a significant amount of money. For
example, if a Hadoop node costs $5,000, a 100-node
cluster would cost $500,000 in hardware.
◦ IT and Operations costs
 A large Hadoop cluster will require support from various
teams such as network admins, IT, security admins, and
system admins. One also needs to consider
operational costs such as data center expenses: cooling,
electricity, etc.
Hadoop Challenges
 MapReduce is a different programming
paradigm
◦ Solving problems with MapReduce requires a
different kind of thinking. Engineering teams
generally need additional training to take
advantage of Hadoop.
 Hadoop and High Availability
◦ Hadoop version 1 had a single point of failure
because of the NameNode. There was only
one NameNode for the cluster, and if it went
down, the whole Hadoop cluster would be
unavailable. This prevented the use of Hadoop
for applications that require continuous availability.
Hadoop Architecture
• Hadoop Common: libraries and other
modules
• HDFS: Hadoop Distributed File System
• Hadoop YARN: Yet Another Resource
Negotiator
• Hadoop MapReduce: a programming model
for large-scale data processing
Hadoop Architecture
 HDFS Concept
◦ HDFS is a file system written in Java, based
on Google's GFS.
◦ It follows a distributed file system
design and runs on commodity hardware.
Unlike many other distributed systems, HDFS is
highly fault-tolerant despite being designed for
low-cost hardware.
◦ Responsible for storing data on the cluster.
◦ Optimized for streaming reads of large files
rather than random reads.
Hadoop Architecture
 HDFS Concept
◦ How are Files Stored
 Data is organized into
files and directories.
 Files are divided into
uniform-sized
blocks (default 64 MB)
and distributed across
cluster nodes.
 HDFS exposes block
placement so that
computation can be
migrated to data.
Hadoop Architecture
 HDFS Concept
◦ How are Files Stored
 Blocks are replicated (default 3 copies) to handle
hardware failure.
 Replication provides performance and fault
tolerance (rack-aware placement).
 HDFS keeps checksums of data for
corruption detection and recovery.
Hadoop Architecture
 HDFS Concept
◦ Replication Strategy
 One replica on local node.
 Second replica on a remote rack.
 Third replica on same remote rack.
 Additional replicas are randomly placed.
◦ Clients read from nearest replica.
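The placement policy above can be sketched as a short function; this is an illustration of the rule (local node, remote rack, same remote rack), not Hadoop's actual BlockPlacementPolicy code, and the node/rack names are made up.

```python
# Illustrative sketch of HDFS's default rack-aware replica placement:
# replica 1 on the writer's node, replica 2 on a node in a different
# rack, replica 3 on another node in that same remote rack.
import random

def place_replicas(local_node, racks):
    """racks maps rack name -> list of node names."""
    local_rack = next(r for r, nodes in racks.items() if local_node in nodes)
    replicas = [local_node]                         # replica 1: writer's node
    remote_rack = random.choice([r for r in racks if r != local_rack])
    candidates = [n for n in racks[remote_rack] if n not in replicas]
    replicas.append(random.choice(candidates))      # replica 2: remote rack
    candidates = [n for n in racks[remote_rack] if n not in replicas]
    replicas.append(random.choice(candidates))      # replica 3: same remote rack
    return replicas

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4", "n5"]}
print(place_replicas("n1", racks))
```

With a client on `n1` (rack1), the first replica stays on `n1` and the other two land on distinct nodes of rack2, which is why a whole-rack failure still leaves a copy reachable.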
Hadoop Architecture
 HDFS Architecture
◦ HDFS follows master-slave architecture and it
has the following elements.
 Namenode
 Datanode
 Block
Hadoop Architecture
 HDFS Architecture
◦ Namenode: commodity hardware that runs the
GNU/Linux operating system and the
namenode software. The system
hosting the namenode acts as the master server
and does the following tasks:
 Manages the file system namespace.
 Regulates clients' access to files.
 Executes file system operations such as
renaming, closing, and opening files and
directories.
Hadoop Architecture
 HDFS Architecture
◦ Datanode: commodity hardware running the
GNU/Linux operating system and the datanode
software. For every node (commodity
hardware/system) in a cluster, there is a
datanode. These nodes manage the data
storage of their system and do the following
tasks:
 Datanodes perform read-write operations on the file
system.
 They also perform operations such as block creation,
deletion, and replication according to the instructions
of the namenode.
Hadoop Architecture
 HDFS Architecture
◦ Block: Generally, user data is stored in the files of
HDFS. A file in the file system is divided into one or
more segments, which are stored on individual data nodes.
These file segments are called blocks. In other
words, the minimum amount of data that HDFS can read
or write is called a block. The default block size is 64 MB,
but it can be changed as needed in the HDFS
configuration.
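As a sketch of the idea, the helper below (hypothetical, not part of HDFS) shows how a file divides into fixed-size blocks, using the 64 MB default from the slide:

```python
# Split a file of a given size into fixed-size blocks: every block is
# full-size except possibly the last one.
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default mentioned above

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    n = math.ceil(file_size / block_size)
    return [min(block_size, file_size - i * block_size) for i in range(n)]

blocks = split_into_blocks(200 * 1024 * 1024)   # a 200 MB file
print(len(blocks), blocks[-1] // (1024 * 1024))  # 4 blocks; last is 8 MB
```

A 200 MB file therefore occupies four blocks (three of 64 MB plus one of 8 MB), and each of those blocks is what gets replicated and placed across the cluster.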
Hadoop Architecture
 Understand HDFS Architecture
◦ The NameNode is the centerpiece of an HDFS file system. It
keeps the directory tree of all files in the file system, and
tracks where across the cluster the file data is kept. It does not
store the data of these files itself.
◦ Client applications talk to the NameNode whenever they wish
to locate a file, or when they want to add/copy/move/delete a
file.
Hadoop Architecture
 Understand HDFS Architecture
◦ The NameNode responds to successful requests by
returning a list of relevant DataNode servers where the data
lives.
Hadoop Architecture
 Understand HDFS Architecture
◦ Datanodes perform read-write operations on the file
systems.
Hadoop Architecture
 Understand HDFS Architecture
◦ The NameNode is a Single Point of Failure for the HDFS
Cluster.
◦ HDFS (in Hadoop 1) is not a high-availability system. When the
NameNode goes down, the file system goes offline.
◦ There is an optional SecondaryNameNode that can be hosted
on a separate machine. It only creates checkpoints of the
namespace by merging the edits file into the fsimage file and
does not provide any real redundancy. Hadoop 0.21+ has a
BackupNameNode.
Hadoop Architecture
 Understand HDFS Architecture
◦ How are checkpoints of the namespace created?
◦ How does the NameNode respond to client requests?
Hadoop Architecture
 Understand HDFS Architecture
◦ Rack
Hadoop Architecture
 Understand HDFS Architecture
◦ General Structure of HDFS
Hadoop Architecture
 HDFS Folder
Hadoop Architecture
 HDFS Folder
Hadoop Architecture
 YARN (Yet Another Resource
Negotiator)
◦ YARN is the framework responsible for providing the
computational resources (CPUs, memory, etc.)
needed for application execution.
◦ The YARN infrastructure and HDFS are completely
decoupled and independent: the former provides resources
for running an application while the latter provides
storage.
Hadoop Architecture
 YARN (Yet Another Resource Negotiator)
◦ Two important elements are:
 Resource Manager: (one per cluster) the master. It knows
where the slaves are located (rack) and how many
resources (containers) they have. It runs several services; the
most important is the Resource Scheduler, which decides how
to assign resources.
Hadoop Architecture
 YARN (Yet Another Resource Negotiator)
◦ Two important elements are:
 Node Manager: (many per cluster) the slave. When it
starts, it announces itself to the RM. Periodically, it sends
a heartbeat to the RM. Each Node Manager offers some
resources to the cluster; its resource capacity is the amount
of memory and the number of vcores. At run time, the
Resource Scheduler decides how to use this capacity:
a Container is a fraction of the NM capacity, and it is used by
the client for running a program.
Hadoop Architecture
 YARN (Yet Another Resource Negotiator)
◦ In YARN, there are at least three actors:
 the Job Submitter (the client)
 the Resource Manager (the master)
 the Node Manager (the slave)
Hadoop Architecture
 YARN (Yet Another Resource Negotiator)
◦ The application startup process is the
following:
 A client submits an application to the Resource
Manager.
 The Resource Manager allocates a container (via the
Resource Scheduler).
 The Resource Manager contacts the related Node
Manager.
 The Node Manager launches the container.
 The container executes the Application Master.
Hadoop Architecture
 Hadoop MapReduce
o MapReduce: a processing technique and a
programming model for distributed computing based
on Java.
o The MapReduce algorithm contains two
important tasks:
 Map: takes a set of data and converts it into
another set of data, where individual elements
are broken down into tuples (key/value pairs).
 Reduce: takes the output from a map as
input and combines those data tuples into a
smaller set of tuples.
Hadoop Architecture
 Hadoop MapReduce
 Importance of MapReduce
Hadoop Architecture
 Hadoop MapReduce
◦ A MapReduce program executes in stages:
 Map stage: The map or mapper's job is to process the
input data. Generally the input data is in the form of a file or
directory and is stored in the Hadoop file system (HDFS).
The input file is passed to the mapper function line by line.
The mapper processes the data and creates several small
chunks of data.
 Reduce stage: This stage is the combination
of the Shuffle stage and the Reduce stage. The reducer's job is to
process the data that comes from the mapper. After
processing, it produces a new set of output, which is
stored in HDFS.
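The Map, Shuffle, and Reduce stages can be illustrated with a minimal in-memory word-count simulation; this is a sketch of the programming model only, not Hadoop's actual execution engine.

```python
# Minimal simulation of Map -> Shuffle -> Reduce on the classic
# word-count problem.
from collections import defaultdict

def map_fn(line):
    # Map: break each input line into (key, value) tuples
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Reduce: combine the grouped values into a smaller set of tuples
    return (key, sum(values))

lines = ["hadoop stores data", "hadoop processes data"]
mapped = [kv for line in lines for kv in map_fn(line)]
result = dict(reduce_fn(k, vs) for k, vs in shuffle(mapped).items())
print(result)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In real Hadoop, map tasks run in parallel on the nodes holding the input blocks, and the framework performs the shuffle across the network before the reduce tasks run.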
Hadoop Architecture
 Hadoop
MapReduce
◦ Flow of MapReduce :
 Define Inputs.
 Define Map function.
 Define Reduce
function.
 Define Output.
Hadoop Architecture
 Hadoop MapReduce
◦ Flow of MapReduce :
 During a MapReduce job, Hadoop sends the Map
and Reduce tasks to the appropriate servers in the
cluster.
 The framework manages all the details of data
passing, such as issuing tasks and verifying task
completion.
 Most of the computing takes place on nodes with
data on local disks, which reduces network traffic.
 After completion of the given tasks, the cluster
collects and reduces the data to form the
result, and sends it back to the Hadoop server.
Hadoop Architecture
 Hadoop MapReduce
◦ Inputs and Outputs
 The MapReduce framework operates on <key, value> pairs.
 The key and value classes must be serializable
by the framework and hence need to implement the Writable
interface.
 The key classes also have to implement the WritableComparable
interface to facilitate sorting by the framework.
 Input and Output types of a MapReduce job:
(Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output)

Stage     Input              Output
Map       <k1, v1>           list(<k2, v2>)
Reduce    <k2, list(v2)>     list(<k3, v3>)
Hadoop Architecture
 Hadoop MapReduce
◦ EXAMPLE 1
Hadoop Architecture
 Hadoop MapReduce
◦ EXAMPLE 2
Hadoop Architecture
 Hadoop MapReduce
Hadoop Architecture
 Hadoop MapReduce
Hadoop Architecture
 Other Tools
◦ Hive
 Hadoop processing with SQL.
◦ Pig
 Hadoop processing with scripting.
◦ Cascading
 Pipe and Filter processing model.
◦ HBase
 Database model built on top of Hadoop.
◦ Flume
 Designed for large scale data movement.
Hadoop Architecture
 Why do these tools exist?
◦ MapReduce is very powerful, and these tools
allow programmers who are familiar with
other programming styles to take
advantage of that power.
Hadoop Architecture
◦ Main Differences between Hadoop 1 & Hadoop 2

Hadoop 1                       Hadoop 2
1- No YARN                     1- YARN exists
2- Only a single NameNode      2- NameNode plus a standby
                                  NameNode for recovery
Advantages of Hadoop
 Scalable:
◦ Hadoop is a highly scalable storage platform, because it can store and
distribute very large data sets across hundreds of inexpensive
servers that operate in parallel, unlike traditional relational
database management systems (RDBMS).
 Cost effective:
◦ Hadoop offers a cost-effective storage solution for businesses' exploding
data sets. The problem with a traditional RDBMS is the high
cost of scaling to the degree required to process such massive
volumes of data.
 Flexible:
◦ Hadoop enables businesses to easily access new data sources and tap
into different types of data (both structured and unstructured) to
generate value from that data.
Advantages of Hadoop
 Fast:
◦ Hadoop’s unique storage method is based on a
distributed file system that basically ‘maps’ data
wherever it is located on a cluster. The tools for data
processing are often on the same servers where the
data is located, resulting in much faster data
processing. If you’re dealing with large volumes of
unstructured data, Hadoop is able to efficiently process
terabytes of data in just minutes, and petabytes in
hours.
 Resilient to failure:
◦ A key advantage of using Hadoop is its fault tolerance.
When data is sent to an individual node, that data is also
replicated to other nodes in the cluster, so that in
the event of a failure there is another copy available for use.
Disadvantages of Hadoop
 Security Concerns
◦ Just managing a complex application such as Hadoop can be
challenging. A simple example can be seen in the Hadoop security model,
which is disabled by default due to sheer complexity. Hadoop also
lacks encryption at the storage and network levels, which is a major
concern for government agencies and others that prefer to keep their
data private.
 Vulnerable By Nature
◦ The very makeup of Hadoop makes running it a risky proposition. The framework is
written almost entirely in Java, one of the most widely used yet controversial
programming languages in existence. Java has been heavily exploited by
cybercriminals and, as a result, implicated in numerous security breaches.
 Potential Stability Issues
◦ Like all open source software, Hadoop has had its fair share of stability
issues. To avoid these issues, organizations are strongly recommended to
make sure they are running the latest stable version, or run it under a
third-party vendor equipped to handle such problems.
 Not Fit for Small Data
◦ While big data is not exclusively made for big businesses, not all big data
platforms are suited to small data needs; HDFS, with its large block size,
is inefficient when handling many small files.
Question
Q/ What are differences
between
MapReduce
&
YARN
?
Any Questions?
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 

Hadoop

Hadoop Goals
• Abstract and facilitate the storage and processing of large and/or rapidly growing data sets.
• Handle structured and unstructured data.
• Offer simple programming models.
• High scalability and availability.
• Use commodity (cheap!) hardware.
• Fault tolerance.
• Move computation to the data rather than moving data to the computation.
Hadoop Challenges
 Hadoop is a cutting-edge technology
◦ Hadoop is a new technology, and as with adopting any new technology, finding people who know it is difficult.
 Hadoop in the enterprise ecosystem
◦ Hadoop is designed to solve the Big Data problems encountered by Web and social companies, and it does not always meet government requirements. For example, HDFS does not offer native support for security and authentication.
 Hadoop is still rough around the edges (under development)
◦ The development and admin tools for Hadoop are still fairly new. Companies like Cloudera, Hortonworks, and MapR are working to improve this.
Hadoop Challenges
 Hadoop is NOT cheap
◦ Hardware cost
 Hadoop runs on "commodity" hardware, but these are not bargain machines; they are server-grade hardware.
 Standing up a reasonably large Hadoop cluster, say 100 nodes, costs a significant amount of money. For example, if a Hadoop node is $5,000, a 100-node cluster would be $500,000 for hardware alone.
◦ IT and operations costs
 A large Hadoop cluster requires support from various teams: network admins, IT, security admins, and system admins. One also needs to consider operational costs such as data-center expenses: cooling, electricity, etc.
Hadoop Challenges
 MapReduce is a different programming paradigm
◦ Solving problems with MapReduce requires a different kind of thinking. Engineering teams generally need additional training to take advantage of Hadoop.
 Hadoop and high availability
◦ Hadoop version 1 had a single point of failure: the NameNode. There was only one NameNode per cluster, and if it went down, the whole Hadoop cluster became unavailable. This has limited the use of Hadoop for mission-critical applications.
Hadoop Architecture
 Hadoop Common: libraries and other shared modules.
 HDFS: the Hadoop Distributed File System.
 Hadoop YARN: Yet Another Resource Negotiator.
 Hadoop MapReduce: a programming model for large-scale data processing.
Hadoop Architecture
 HDFS Concept
◦ HDFS is a file system written in Java, based on Google's GFS.
◦ It follows a distributed file system design and runs on commodity hardware. Unlike many other distributed systems, HDFS is highly fault tolerant while using low-cost hardware.
◦ Responsible for storing data on the cluster.
◦ Optimized for streaming reads of large files rather than random reads.
Hadoop Architecture
 HDFS Concept
◦ How are files stored
 Data is organized into files and directories.
 Files are divided into uniformly sized blocks (default 64 MB) and distributed across cluster nodes.
 HDFS exposes block placement so that computation can be migrated to the data.
 Blocks are replicated (default 3 copies) to handle hardware failure; replication serves both performance and fault tolerance (rack-aware placement).
 HDFS keeps checksums of data for corruption detection and recovery.
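The block-splitting idea above can be sketched in a few lines of plain Python (this is an illustration of the concept, not Hadoop's actual implementation; the function name is made up for this example):

```python
# Sketch of HDFS-style block splitting: a file becomes uniform
# fixed-size blocks, with only the last block allowed to be smaller.

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS default

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Divide a byte string into uniform blocks; the last may be smaller."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# A 150 MB "file" becomes three blocks: 64 MB, 64 MB, and 22 MB.
fake_file = b"x" * (150 * 1024 * 1024)
blocks = split_into_blocks(fake_file)
print([len(b) // (1024 * 1024) for b in blocks])  # [64, 64, 22]
```

Each of these blocks would then be placed on a different DataNode, which is what lets computation move to wherever a block lives.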
Hadoop Architecture
 HDFS Concept
◦ Replication strategy
 First replica on the local node.
 Second replica on a remote rack.
 Third replica on the same remote rack (on a different node).
 Additional replicas are placed randomly.
◦ Clients read from the nearest replica.
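The rack-aware placement rule above can be modeled in a short Python sketch (a toy model for illustration, not Hadoop's placement code; the cluster layout and function name are invented for this example):

```python
import random

# Toy model of the default HDFS replica placement: first replica on the
# writer's local node, second on a node in a different rack, third on
# another node in that same remote rack.

def place_replicas(local, cluster):
    """cluster: dict mapping rack name -> list of node names."""
    local_rack = next(r for r, nodes in cluster.items() if local in nodes)
    remote_rack = random.choice([r for r in cluster if r != local_rack])
    second = random.choice(cluster[remote_rack])
    third = random.choice([n for n in cluster[remote_rack] if n != second])
    return [local, second, third]

cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
print(place_replicas("n1", cluster))  # e.g. ['n1', 'n3', 'n4']
```

Keeping two replicas on one remote rack limits cross-rack write traffic while still surviving the loss of an entire rack.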
Hadoop Architecture
 HDFS Architecture
◦ HDFS follows a master-slave architecture with the following elements:
 NameNode
 DataNode
 Block
Hadoop Architecture
 HDFS Architecture
◦ NameNode: commodity hardware running the GNU/Linux operating system and the NameNode software. The node running the NameNode acts as the master server and performs the following tasks:
 Manages the file system namespace.
 Regulates clients' access to files.
 Executes file system operations such as renaming, closing, and opening files and directories.
Hadoop Architecture
 HDFS Architecture
◦ DataNode: commodity hardware running the GNU/Linux operating system and the DataNode software. Every node in the cluster runs a DataNode, which manages the data storage of its node and performs the following tasks:
 Read-write operations on the file system.
 Block creation, deletion, and replication according to the instructions of the NameNode.
Hadoop Architecture
 HDFS Architecture
◦ Block: user data is stored in HDFS files. A file is divided into one or more segments that are stored on individual DataNodes. These segments are called blocks: a block is the minimum amount of data that HDFS can read or write. The default block size is 64 MB, but it can be changed in the HDFS configuration.
Hadoop Architecture
 Understand HDFS Architecture
◦ The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where the file data is kept across the cluster. It does not store the data of these files itself.
◦ Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file.
Hadoop Architecture
 Understand HDFS Architecture
◦ The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data lives.
Hadoop Architecture
 Understand HDFS Architecture
◦ DataNodes perform the read-write operations on the file system.
Hadoop Architecture
 Understand HDFS Architecture
◦ The NameNode is a single point of failure for the HDFS cluster.
◦ HDFS is not currently a high-availability system: when the NameNode goes down, the file system goes offline.
◦ There is an optional SecondaryNameNode that can be hosted on a separate machine. It only creates checkpoints of the namespace by merging the edits file into the fsimage file; it does not provide any real redundancy. Hadoop 0.21+ adds a BackupNameNode.
Hadoop Architecture
 Understand HDFS Architecture
◦ How are checkpoints of the namespace created?
◦ How does the NameNode respond to client requests?
Hadoop Architecture
 Understand HDFS Architecture
◦ Rack
Hadoop Architecture
 Understand HDFS Architecture
◦ General structure of HDFS
Hadoop Architecture
 YARN (Yet Another Resource Negotiator)
◦ YARN is the framework responsible for providing the computational resources (CPUs, memory, etc.) needed for application execution.
◦ The YARN infrastructure and HDFS are completely decoupled and independent: the first provides resources for running an application, while the second provides storage.
Hadoop Architecture
 YARN (Yet Another Resource Negotiator)
◦ Two important elements are:
 Resource Manager (one per cluster): the master. It knows where the slaves are located (rack awareness) and how many resources (containers) they have. It runs several services; the most important is the Resource Scheduler, which decides how to assign resources.
Hadoop Architecture
 YARN (Yet Another Resource Negotiator)
◦ Two important elements are:
 Node Manager (many per cluster): the slave. When it starts, it announces itself to the Resource Manager and then periodically sends it a heartbeat. Each Node Manager offers some resources to the cluster; its capacity is an amount of memory and a number of vcores. At run time, the Resource Scheduler decides how to use this capacity: a container is a fraction of the Node Manager's capacity, used by a client for running a program.
Hadoop Architecture
 YARN (Yet Another Resource Negotiator)
◦ In YARN, there are at least three actors:
 the Job Submitter (the client)
 the Resource Manager (the master)
 the Node Manager (the slave)
Hadoop Architecture
 YARN (Yet Another Resource Negotiator)
◦ The application startup process is the following:
 A client submits an application to the Resource Manager.
 The Resource Manager allocates a container (via the Resource Scheduler).
 The Resource Manager contacts the related Node Manager.
 The Node Manager launches the container.
 The container executes the Application Master.
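The steps above can be sketched as a toy Python model (purely illustrative; the class and method names are invented here and are not the YARN API):

```python
# Toy model of the YARN startup handshake: submit -> schedule ->
# contact a Node Manager -> launch a container -> run the Application Master.

class NodeManager:
    def __init__(self, name, vcores):
        self.name, self.vcores = name, vcores

    def launch_container(self, app):
        # Steps 4-5: the NM launches the container, which runs the app master.
        return f"{app}-AppMaster running on {self.name}"

class ResourceManager:
    def __init__(self, node_managers):
        # Node Managers announce themselves to the RM on startup.
        self.node_managers = list(node_managers)

    def submit(self, app):
        # Steps 1-3: a trivial "scheduler" picks the NM with the most free
        # vcores and hands it a one-vcore container.
        nm = max(self.node_managers, key=lambda n: n.vcores)
        nm.vcores -= 1
        return nm.launch_container(app)

rm = ResourceManager([NodeManager("nm1", vcores=4), NodeManager("nm2", vcores=8)])
print(rm.submit("wordcount"))  # wordcount-AppMaster running on nm2
```

The real Resource Scheduler is far more sophisticated (queues, fairness, locality), but the division of labor is the same: the RM decides where, the NM actually launches.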
Hadoop Architecture
 Hadoop MapReduce
◦ MapReduce is a processing technique and programming model for distributed computing, based on Java.
◦ The MapReduce algorithm contains two important tasks:
 Map: takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
 Reduce: takes the output from a map as input and combines those data tuples into a smaller set of tuples.
Hadoop Architecture
 Hadoop MapReduce
◦ Importance of MapReduce
Hadoop Architecture
 Hadoop MapReduce
◦ A MapReduce program executes in three stages: map, shuffle, and reduce.
 Map stage: the mapper's job is to process the input data. Generally the input data is a file or directory stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line; the mapper processes the data and creates several small chunks of data.
 Reduce stage: the combination of the shuffle stage and the reduce stage. The reducer's job is to process the data that comes from the mapper; after processing, it produces a new set of output, which is stored in HDFS.
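The three stages can be shown with the classic word-count example in pure Python (a single-process sketch of the idea, not Hadoop's Java API; the function names are ours):

```python
from collections import defaultdict

# Word count as map -> shuffle -> reduce: the mapper emits (word, 1)
# pairs line by line, the shuffle groups pairs by key, and the reducer
# sums the values for each key.

def mapper(line):
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [kv for line in lines for kv in mapper(line)]             # map stage
result = dict(reducer(k, v) for k, v in shuffle(pairs).items())   # reduce
print(result["the"], result["fox"])  # 3 2
```

In a real cluster each mapper runs on the node holding its input split, and the shuffle moves intermediate pairs across the network to the reducers.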
Hadoop Architecture
 Hadoop MapReduce
◦ Flow of MapReduce:
 Define inputs.
 Define the Map function.
 Define the Reduce function.
 Define outputs.
Hadoop Architecture
 Hadoop MapReduce
◦ Flow of MapReduce:
 During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
 The framework manages all the details of data passing, such as issuing tasks and verifying task completion.
 Most of the computing takes place on nodes with data on local disks, which reduces network traffic.
 After the tasks complete, the cluster collects and reduces the data to form the result and sends it back to the Hadoop server.
Hadoop Architecture
 Hadoop MapReduce
◦ Inputs and outputs
 The MapReduce framework operates on <key, value> pairs.
 The key and value classes must be serializable by the framework and hence need to implement the Writable interface.
 The key classes additionally have to implement the WritableComparable interface to facilitate sorting by the framework.
 Input and output types of a MapReduce job:
(Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output)
Map: <k1, v1> -> list(<k2, v2>)
Reduce: <k2, list(v2)> -> list(<k3, v3>)
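The type flow above can be expressed as a small generic driver in Python (illustrative only; `run_job` is our name, not a Hadoop API, and the sort stands in for what WritableComparable enables in the framework):

```python
from itertools import groupby

# Generic <k1,v1> -> map -> <k2,v2> -> reduce -> <k3,v3> pipeline:
# map_fn has type <k1,v1> -> list(<k2,v2>), and
# reduce_fn has type <k2, list(v2)> -> list(<k3,v3>).

def run_job(inputs, map_fn, reduce_fn):
    intermediate = [kv for k1, v1 in inputs for kv in map_fn(k1, v1)]
    intermediate.sort(key=lambda kv: kv[0])  # keys must be sortable
    output = []
    for k2, group in groupby(intermediate, key=lambda kv: kv[0]):
        output.extend(reduce_fn(k2, [v for _, v in group]))
    return output

# Example: total character counts per file name.
inputs = [("a.txt", "hello"), ("a.txt", "hi"), ("b.txt", "hey")]
out = run_job(inputs,
              map_fn=lambda k, v: [(k, len(v))],
              reduce_fn=lambda k, vs: [(k, sum(vs))])
print(out)  # [('a.txt', 7), ('b.txt', 3)]
```

Note that the intermediate key type k2 need not match the input key type k1; only k2 must support sorting and grouping.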
Hadoop Architecture
 Hadoop MapReduce
◦ EXAMPLE 1
Hadoop Architecture
 Hadoop MapReduce
◦ EXAMPLE 2
Hadoop Architecture
 Other Tools
◦ Hive: Hadoop processing with SQL.
◦ Pig: Hadoop processing with scripting.
◦ Cascading: a pipe-and-filter processing model.
◦ HBase: a database model built on top of Hadoop.
◦ Flume: designed for large-scale data movement.
Hadoop Architecture
 Why do these tools exist?
◦ MapReduce is very powerful, but these tools allow programmers who are familiar with other programming styles to take advantage of its power.
Hadoop Architecture
◦ Main differences between Hadoop 1 & Hadoop 2
 Hadoop 1:
1- YARN does not exist.
2- Has only a single NameNode.
 Hadoop 2:
1- YARN exists.
2- Supports an active NameNode with a standby NameNode for recovery.
Advantages of Hadoop
 Scalable:
◦ Hadoop is a highly scalable storage platform because it can store and distribute very large data sets across hundreds of inexpensive servers operating in parallel, unlike traditional relational database management systems (RDBMS).
 Cost effective:
◦ Hadoop offers a cost-effective storage solution for businesses' exploding data sets. The problem with a traditional RDBMS is the high cost of scaling to the degree needed to process such massive volumes of data.
 Flexible:
◦ Hadoop enables businesses to easily access new data sources and tap into different types of data (both structured and unstructured) to generate value from that data.
Advantages of Hadoop
 Fast:
◦ Hadoop's storage method is based on a distributed file system that 'maps' data wherever it is located on a cluster. The tools for data processing are often on the same servers where the data is located, resulting in much faster processing. For large volumes of unstructured data, Hadoop can efficiently process terabytes of data in minutes and petabytes in hours.
 Resilient to failure:
◦ A key advantage of Hadoop is its fault tolerance. When data is sent to an individual node, it is also replicated to other nodes in the cluster, so in the event of a failure there is another copy available for use.
Disadvantages of Hadoop
 Security concerns
◦ Just managing a complex application such as Hadoop can be challenging. A simple example is the Hadoop security model, which is disabled by default due to its sheer complexity. Hadoop also lacks encryption at the storage and network levels, which matters to government agencies and others that need to keep their data private.
 Vulnerable by nature
◦ The very makeup of Hadoop makes running it a risky proposition. The framework is written almost entirely in Java, one of the most widely used yet controversial programming languages in existence. Java has been heavily exploited by cybercriminals and, as a result, implicated in numerous security breaches.
 Potential stability issues
◦ Like all open-source software, Hadoop has had its fair share of stability issues. To avoid them, organizations are strongly advised to run the latest stable version, or to run it under a third-party vendor equipped to handle such problems.
 Not fit for small data
◦ While big data is not exclusively made for big businesses, not all big data platforms are suited to small data needs; because of its high-capacity design, HDFS cannot efficiently support the random reading of small files.
Question
Q/ What are the differences between MapReduce & YARN?