Hadoop
Presented by
Rajesh Piryani
South Asian University
(Visually…)
1. HDFS
2. Map/Reduce
Hadoop
• An open-source software framework
• Licensed under the Apache License, Version 2.0
• Created by Doug Cutting and Mike Cafarella in 2005
• Doug, who was working at Yahoo at the time, named it after his son's toy elephant
• Derived from Google's MapReduce and Google File System (GFS)
• Written in the Java programming language
Why Hadoop?
[Figure: bar chart, "Amount of Stored Data By Sector (in Petabytes, 2009)"; y-axis in petabytes from 0 to 1000; bar values 966, 848, 715, 619, 434, 364, 269, 227; sector labels not recoverable.]
Sources: "Big Data: The Next Frontier for Innovation, Competition and Productivity." US Bureau of Labor Statistics | McKinsey Global Institute Analysis
1 zettabyte?
= 1 million petabytes
= 1 billion terabytes
= 1 trillion gigabytes
If you like analogies…
35 ZB = enough data to fill a stack of DVDs reaching halfway to Mars
Why Hadoop?
• Need to process 100 TB datasets
• On 1 node:
– scanning @ 50 MB/s = 23 days
• On a 1000-node cluster:
– scanning @ 50 MB/s per node = 33 min
• Need an efficient, reliable and usable framework
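The scan times on this slide follow from simple division. A minimal sketch of the arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a perfectly even split of the work across nodes; the function name is illustrative only:

```python
# Back-of-the-envelope scan time, using decimal units (1 TB = 10**12 bytes).
def scan_time_seconds(dataset_bytes, nodes, mb_per_s_per_node=50):
    """Time to scan a dataset once if the work is split evenly across nodes."""
    aggregate_bytes_per_s = nodes * mb_per_s_per_node * 10**6
    return dataset_bytes / aggregate_bytes_per_s

DATASET_BYTES = 100 * 10**12  # 100 TB

single_node = scan_time_seconds(DATASET_BYTES, nodes=1)
cluster = scan_time_seconds(DATASET_BYTES, nodes=1000)

print(f"1 node:     {single_node / 86400:.1f} days")    # 23.1 days
print(f"1000 nodes: {cluster / 60:.1f} minutes")         # 33.3 minutes
```

The model ignores coordination overhead and stragglers, so real clusters would come in somewhat slower than the ideal figure.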
Distributed File System (DFS)
• The classical model of a file system, distributed across multiple machines
• Allows access to files located on a remote host as though they were on the local computer
• Lets multiple users on multiple machines share files and storage resources
Distributed File System (DFS)
• One or more central servers store files that can be accessed, with proper authorization rights, by any number of remote clients in the network
• Provides facilities for transparent replication and fault tolerance
• DFS operations should be fast, to increase the performance of the system
– Operations: open, close, read and write a file; send and receive a file/object
Hadoop Distributed File System (HDFS)
• A type of distributed file system
• Originally built as infrastructure for the Apache Nutch web search engine project
• Significant differences from other distributed file systems:
– High fault tolerance
– High throughput
– Easy deployment on low-cost hardware
• Suitable for applications that process massive data sets
HDFS Architecture
• Master-slave architecture
• HDFS master: “NameNode”
– Manages all file system metadata (hostname of DataNode, block node)
– Logs metadata transactions and merges them at startup
– Controls read/write access to files
– Maintains the mapping of blocks to DataNodes
– Manages block replication
• HDFS slaves: “DataNodes”
– Notify the NameNode about the block IDs they hold
– Serve read/write requests from clients
– Perform block creation and replication tasks upon instruction by the NameNode
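The division of labour above can be pictured with a toy model. This is only a sketch under simplifying assumptions: `ToyNameNode`, its method names, and the block IDs are hypothetical and do not correspond to the real HDFS API. It shows the two mappings a NameNode-like master keeps (file to blocks, block to DataNodes) and how DataNode block reports fill in the second one.

```python
from collections import defaultdict

class ToyNameNode:
    """Illustrative stand-in for NameNode metadata; not the real HDFS API."""

    def __init__(self):
        self.file_blocks = {}                     # path -> ordered list of block IDs
        self.block_locations = defaultdict(set)   # block ID -> DataNode hostnames

    def add_file(self, path, block_ids):
        """Record which blocks make up a file (namespace metadata)."""
        self.file_blocks[path] = list(block_ids)

    def block_report(self, datanode, block_ids):
        """A DataNode notifies the master about the block IDs it holds."""
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path):
        """Return [(block ID, sorted DataNode hostnames)] for a file."""
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[path]]

nn = ToyNameNode()
nn.add_file("/logs/day1", ["blk_1", "blk_2"])
nn.block_report("dn1", ["blk_1", "blk_2"])
nn.block_report("dn2", ["blk_1"])
print(nn.locate("/logs/day1"))
# [('blk_1', ['dn1', 'dn2']), ('blk_2', ['dn1'])]
```

Note that the master holds only metadata here; file contents never pass through it.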
HDFS Architecture
[Figure: a NameNode and a BackupNode (namespace backups) sit above five DataNodes. The NameNode exchanges heartbeat, balancing, replication, etc. traffic with the DataNodes; nodes write to local disk.]
Getting Files From HDFS
[Figure: an HDFS client requests a giant file. The NameNode returns the locations of the blocks of the file, and the client then streams the blocks from the DataNodes. A BackupNode sits beside the NameNode.]
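The read path in this slide can be sketched in a few lines. This is a toy model, not the HDFS client API: the dictionaries standing in for the NameNode index and the DataNodes, and the function `read_file`, are assumptions made for illustration. The client first obtains block locations, then pulls each block from the first replica that can serve it.

```python
def read_file(path, namenode_index, datanodes):
    """Fetch a file block by block, trying each listed replica in turn.

    namenode_index: path -> ordered list of (block ID, [DataNode hostnames])
    datanodes:      hostname -> {block ID: bytes} for DataNodes that are up
    """
    data = bytearray()
    for block_id, hosts in namenode_index[path]:       # step 1: locations from the master
        for host in hosts:                              # step 2: stream from a DataNode
            block = datanodes.get(host, {}).get(block_id)
            if block is not None:
                data += block
                break
        else:
            raise IOError(f"no live replica for block {block_id}")
    return bytes(data)

namenode_index = {"/giant": [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn3"])]}
datanodes = {                      # dn1 is absent, i.e. assumed to be down
    "dn2": {"blk_1": b"hello "},
    "dn3": {"blk_2": b"world"},
}
print(read_file("/giant", namenode_index, datanodes))   # b'hello world'
```

Because blocks are fetched directly from DataNodes, the master is not a bottleneck for the data itself.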
Failures, Failures, Failures
• HDFS was designed with the expectation that failures (both hardware and software) would occur frequently
• Failure types:
– Disk errors and failures
– DataNode failures
– Switch/rack failures
– NameNode failures
– Datacenter failures
Fault Tolerance (DataNode Failure)
• NameNode detects DataNode loss
• Blocks are auto-replicated on the remaining nodes to satisfy the replication factor
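Re-replication after a DataNode loss can be sketched as follows. This is a simplified illustration, not HDFS's actual placement policy: the function name, the least-loaded-node heuristic, and the node names are all assumptions. It only shows the idea of topping each block back up to the replication factor (three, matching the triply replicated blocks mentioned elsewhere in the deck).

```python
def handle_datanode_loss(block_locations, live_nodes, lost_node, replication_factor=3):
    """Drop the lost node and plan new replicas for under-replicated blocks.

    block_locations: block ID -> set of DataNode hostnames (updated in place)
    Returns a plan: block ID -> list of nodes that should receive a new copy.
    """
    live = [n for n in live_nodes if n != lost_node]
    for holders in block_locations.values():
        holders.discard(lost_node)

    # Current number of blocks per live node, used to pick the emptiest targets.
    load = {n: sum(n in h for h in block_locations.values()) for n in live}

    plan = {}
    for block in sorted(block_locations):
        holders = block_locations[block]
        missing = max(replication_factor - len(holders), 0)
        candidates = sorted((n for n in live if n not in holders),
                            key=lambda n: (load[n], n))
        targets = candidates[:missing]
        for target in targets:
            holders.add(target)
            load[target] += 1
        if targets:
            plan[block] = targets
    return plan

blocks = {"b1": {"dn1", "dn2", "dn3"}, "b2": {"dn2", "dn3", "dn4"}}
plan = handle_datanode_loss(blocks, ["dn1", "dn2", "dn4", "dn5"], lost_node="dn3")
print(plan)   # {'b1': ['dn5'], 'b2': ['dn1']}
```

A real system would also weigh rack placement and available disk space when choosing targets; those are left out to keep the sketch short.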
Fault Tolerance (NameNode Failure)
• Not an epic failure, because you have the BackupNode
• NameNode loss requires manual intervention
• Automatic failover is in the works
Live Horizontal Scaling and Rebalancing
• NameNode detects that a new DataNode has been added to the cluster
• Blocks are re-balanced and re-distributed
HDFS Summary
• Highly scalable
– 1000s of nodes and massive (100s of TB) files
– Large block sizes to maximize sequential I/O performance
• No use of mirroring or RAID. Why?
– Reduce cost
– Use one mechanism (triply replicated blocks) to deal with a wide variety of failure types rather than multiple different mechanisms
Hadoop MapReduce (MR)
• Programming framework (library and runtime) for analyzing data sets stored in HDFS
• MapReduce jobs are composed of two functions: map() → reduce()
– map(): sub-divide & conquer
– reduce(): combine & reduce cardinality
• The user only writes the map and reduce functions
Hadoop MapReduce (MR)
Essentially, it’s…
1. Take a large problem and divide it into sub-problems
2. Perform the same function on all sub-problems
3. Combine the output from all sub-problems
[Figure: in the MAP stage, DoWork() is applied to each sub-problem; in the REDUCE stage, the results are combined into a single Output.]
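The three steps can be sketched without any framework. This is a minimal single-process illustration, assuming a sum-of-squares workload chosen only for the example; `split`, `do_work`, and `run` are hypothetical names (`do_work` echoes the DoWork() boxes on the slide), and nothing here is distributed.

```python
from functools import reduce

def split(items, parts):
    """Step 1: divide the problem into roughly equal sub-problems."""
    size = max(1, -(-len(items) // parts))   # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]

def do_work(chunk):
    """Step 2: the same function applied to every sub-problem."""
    return sum(x * x for x in chunk)

def run(items, parts=3):
    """Step 3: combine the partial outputs into one result."""
    partials = [do_work(chunk) for chunk in split(items, parts)]
    return reduce(lambda a, b: a + b, partials, 0)

print(run(list(range(10))))   # 285
```

In Hadoop the sub-problems would be file blocks processed on different machines; here a list comprehension stands in for that parallelism.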
MapReduce Components
[Figure: a MapReduce layer drawn above an HDFS layer. On hadoop-namenode, the JobTracker sits above the NameNode; on hadoop-datanode1 through hadoop-datanode4, a TaskTracker sits above a DataNode. The JobTracker controls and heartbeats the TaskTracker nodes; TaskTrackers store temp data on the local file system.]
Master: JobTracker
- Coordinates all M/R tasks & events
- Manages job queues and scheduling
- Maintains and controls TaskTrackers
- Moves/restarts map/reduce tasks if needed
Slaves: TaskTrackers
- Execute individual map and reduce tasks as assigned by the JobTracker (in a separate JVM)
Job Submission
1. The MR client submits jobs to the JobTracker
2. Jobs get queued
3. map()s are assigned to TaskTrackers (HDFS DataNode locality aware)
4. Mappers are spawned in a separate JVM and execute
5. Mappers store temp results (temporary data is stored to the local file system)
6. The reduce phase begins and Reducers run
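Locality-aware assignment means preferring a TaskTracker on a node that already holds the input block. The sketch below is a greedy toy scheduler under stated assumptions, not the JobTracker's real algorithm: `assign_map_tasks`, the slot counts, and the host names are hypothetical.

```python
def assign_map_tasks(block_locations, free_slots):
    """Assign one map task per block, preferring hosts that store the block.

    block_locations: block ID -> list of hostnames holding a replica
    free_slots:      hostname -> number of free map slots
    Returns block ID -> chosen hostname; blocks with no free slot stay queued.
    """
    slots = dict(free_slots)
    assignment = {}
    for block in sorted(block_locations):
        local = [h for h in block_locations[block] if slots.get(h, 0) > 0]
        if local:
            chosen = local[0]                    # data-local: no network copy needed
        else:
            remote = sorted(h for h, s in slots.items() if s > 0)
            if not remote:
                break                            # nothing free; remaining tasks wait
            chosen = remote[0]                   # fall back to any free host
        slots[chosen] -= 1
        assignment[block] = chosen
    return assignment

locations = {"b1": ["dn1", "dn2"], "b2": ["dn1"], "b3": ["dn3"]}
slots = {"dn1": 1, "dn2": 1, "dn3": 0, "dn4": 1}
print(assign_map_tasks(locations, slots))
# {'b1': 'dn1', 'b2': 'dn2', 'b3': 'dn4'}
```

In this run only b1 ends up data-local: the greedy order spends dn1's single slot on b1, so b2 falls back to a remote host. A smarter scheduler could place b1 on dn2 and b2 on dn1.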
MapReduce Visualized: Map Phase
Task: get the sum of sales grouped by zipCode.
Input: blocks of the Sales file in HDFS, held on DataNode1, DataNode2 and DataNode3. Each record is (custId, zipCode, amount):
custId  zipCode  amount
5       53705    $15
6       44313    $10
5       53705    $65
0       54235    $22
9       02115    $15
6       44313    $25
3       10025    $95
8       44313    $55
2       53705    $30
1       02115    $15
4       54235    $75
7       10025    $60
Map tasks: each Mapper reads blocks of the file, groups its records by zipCode (Group By), and writes one output bucket per reduce task.
Reduce Phase
Shuffle: the Mappers' output buckets are sent to the reduce tasks, so that all records for a given zipCode arrive at the same Reducer.
Sort: each Reducer sorts its records by zipCode.
SUM: each Reducer sums the amounts for each zipCode.
Reducer  zipCode  amounts        sum
1        53705    $65, $30, $15  $110
2        44313    $10, $25, $55  $90
2        10025    $95, $60       $155
3        54235    $75, $22       $97
3        02115    $15, $15       $30
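The sales example can be reproduced end to end in a single process. This is a sketch of the map, shuffle, sort, and sum stages, not Hadoop code: the function names and the CRC-based partitioner are assumptions, so which reducer receives which zip code may differ from the slide, while the totals do not.

```python
import zlib
from collections import defaultdict

# (custId, zipCode, amount) records from the Sales file on the slide.
SALES = [
    (5, "53705", 15), (6, "44313", 10), (5, "53705", 65), (0, "54235", 22),
    (9, "02115", 15), (6, "44313", 25), (3, "10025", 95), (8, "44313", 55),
    (2, "53705", 30), (1, "02115", 15), (4, "54235", 75), (7, "10025", 60),
]

def map_phase(records):
    """Map: emit (zipCode, amount) and drop the fields the query ignores."""
    for _cust_id, zip_code, amount in records:
        yield zip_code, amount

def shuffle(pairs, num_reducers):
    """Shuffle: route every key to exactly one reducer bucket."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        index = zlib.crc32(key.encode()) % num_reducers   # stable partitioner
        buckets[index][key].append(value)
    return buckets

def reduce_phase(bucket):
    """Reduce: sort the keys, then sum the amounts for each zipCode."""
    return {key: sum(bucket[key]) for key in sorted(bucket)}

totals = {}
for bucket in shuffle(map_phase(SALES), num_reducers=3):
    totals.update(reduce_phase(bucket))

for zip_code in sorted(totals):
    print(zip_code, f"${totals[zip_code]}")
```

The totals match the slide: 10025 $155, 44313 $90, 53705 $110, 54235 $97, and 02115 $30.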
Dealing With Failures
• Like HDFS, the MapReduce framework is designed to be highly fault tolerant
• Worker (map or reduce) failures
– Detected by periodic pings from the master
– Map or reduce tasks that fail are reset and then given to a different node
– If a node fails after its map task has completed, the map task is redone and all reduce tasks are notified
• Master failure
– If the master fails for any reason, the entire computation is redone
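The worker-failure rules can be sketched as a small rescheduling routine. This is an illustration under assumptions, not the JobTracker's implementation: the task dictionaries, state names, and round-robin placement are hypothetical, and the choice to leave completed reduce tasks alone is an assumption that the slide does not state.

```python
def reschedule_after_failure(tasks, failed_worker, live_workers):
    """Reset tasks lost with a failed worker and hand them to other nodes.

    tasks: list of dicts with keys "id", "kind" ("map" or "reduce"),
           "worker", and "state" ("running" or "done"); updated in place.
    Returns the IDs of the tasks that were reset.
    """
    reset = []
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        # Running tasks are lost; completed map tasks are redone as well.
        # Assumption: completed reduce tasks are not re-run.
        lost = task["state"] == "running" or task["kind"] == "map"
        if lost:
            task["worker"] = live_workers[len(reset) % len(live_workers)]
            task["state"] = "pending"
            reset.append(task["id"])
    return reset

tasks = [
    {"id": "m1", "kind": "map", "worker": "w1", "state": "done"},
    {"id": "m2", "kind": "map", "worker": "w2", "state": "done"},
    {"id": "r1", "kind": "reduce", "worker": "w1", "state": "running"},
    {"id": "r2", "kind": "reduce", "worker": "w1", "state": "done"},
]
print(reschedule_after_failure(tasks, "w1", ["w2", "w3"]))   # ['m1', 'r1']
```

Completed map tasks are redone because their temporary output was stored on the failed node's local file system, as the Job Submission slide notes.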
Applications
• Hadoop is used in a wide range of applications. Some examples:
– Search (Yahoo!, Amazon, Zvents)
– Log processing (Facebook, Yahoo!, ContextWeb, Joost, Last.fm)
– Recommendation systems (Facebook)
– Data warehousing (Facebook, AOL)
– Video and image analysis (New York Times, Eyealike)
Thank You for Your Patience
