Apache Hadoop
Sheetal Sharma
Intern At IBM Innovation Centre
Why Data?
Get insights to offer
a better product
“More data usually
beats better
algorithms”
Get insights to
make better decisions
Avoid “guesstimates”
What Is Challenging?
Store data reliably
Analyze data quickly
Cost-effective way
Use an expressive,
high-level language
Fundamental Ideas
A big system of
machines, not a big
machine
Failures will happen
Move computation to
data, not data to
computation
Write complex code
only once, but right
Apache Hadoop
Open-source software
written in Java
Storing and processing
of very large data sets
A cluster of
commodity machines
A simple programming
model
Apache Hadoop
Two main components:
HDFS - a distributed file
system
MapReduce – a
distributed processing
layer
HDFS
The Purpose Of HDFS
●
Store large datasets
in a distributed,
scalable and fault-
tolerant way
●
High throughput
●
Very large files
●
Streaming reads and writes (no edits)
HDFS Mis-Usage
Do NOT use, if you have
Low-latency
requests
Random
reads and writes
Lots of
small files
Then it is better to consider
an RDBMS,
Splitting Files And
Replicating Blocks
Split a very large file into
smaller (but still large)
blocks
Store them redundantly on
a set of machines
Splitting Files Into Blocks
●
The default block size
is 64 MB (Hadoop 1.x;
128 MB since Hadoop 2)
●
Minimize the overhead
of a disk seek
operation (less than
1%)
●
A file is just “sliced”
into chunks after each
64MB (or so)
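The slicing described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not HDFS code: the function name `split_into_blocks` and the 200 MB example file are assumptions made for this sketch, and the 64 MB default is taken from the slide.

```python
def split_into_blocks(file_size, block_size=64 * 1024 * 1024):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


MB = 1024 * 1024
blocks = split_into_blocks(200 * MB)  # hypothetical 200 MB file
for index, (offset, length) in enumerate(blocks):
    print(f"block {index}: offset={offset // MB} MB, length={length // MB} MB")
```

A 200 MB file ends up as three full 64 MB blocks plus one 8 MB block; the last block only occupies as much space as it actually needs.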
Replicating Blocks
The default
replication factor
is 3
●
It can be changed
per file or per
directory
●
It can be
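As a rough sketch of what a replication factor of 3 means, the snippet below assigns each block to three distinct machines and computes the raw storage used. The round-robin rule in `place_replicas` and the node names are assumptions for illustration only; real HDFS uses a rack-aware placement policy that is not reproduced here.

```python
def place_replicas(block_id, datanodes, replication=3):
    """Pick `replication` distinct DataNodes for a block (simplified round-robin)."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    start = block_id % len(datanodes)
    return [datanodes[(start + i) % len(datanodes)] for i in range(replication)]


datanodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]  # hypothetical cluster
placement = {block_id: place_replicas(block_id, datanodes) for block_id in range(4)}

# 4 blocks of 64 MB, each stored 3 times.
raw_storage_mb = 4 * 64 * 3

for block_id, nodes in placement.items():
    print(f"block {block_id} -> {nodes}")
print(f"raw storage used: {raw_storage_mb} MB")
```

The point of the sketch is the cost side of fault tolerance: 256 MB of user data occupies 768 MB of disk across the cluster.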
Master And Slaves
The Master node keeps and
manages all metadata
information
The Slave nodes store blocks
of data and serve them to
the client
Master node (called
NameNode)
Slave nodes (called DataNodes)
Classical* HDFS Cluster
*no NameNode HA, no HDFS
Replication
Manages metadata
Does some
“house-keeping”
operations for
NameNode
Stores and retrieves
blocks of data
HDFS NameNode
Performs all the metadata-
related operations
Keeps information in RAM (for
fast look up)
The file system tree
Metadata for all
files/directories (e.g.
ownership, permissions)
Names and locations of
blocks
HDFS DataNode
Stores and retrieves blocks of
data
Data is stored as regular files on a local filesystem (e.g. ext4)
e.g. blk_-992391354910561645 (+ checksums in a separate file)
A block itself does not know which file it belongs to!
Sends a heartbeat message to
the NN to say that it is still
alive
Sends a block report to the NN
periodically
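The heartbeat and block-report traffic can be pictured with a toy model. Everything here is an assumption made for the sketch: the class `ToyNameNode`, the 30-second timeout, and the way time is passed in as a plain number. It only shows how the NameNode could learn which DataNodes are alive and where blocks live.

```python
class ToyNameNode:
    def __init__(self, heartbeat_timeout=30.0):
        self.heartbeat_timeout = heartbeat_timeout  # hypothetical value, in seconds
        self.last_heartbeat = {}   # DataNode id -> time of last heartbeat
        self.block_locations = {}  # block id -> set of DataNode ids

    def heartbeat(self, datanode, now):
        self.last_heartbeat[datanode] = now

    def block_report(self, datanode, block_ids):
        # Forget what this DataNode reported earlier, then record the new list.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def live_datanodes(self, now):
        return {dn for dn, seen in self.last_heartbeat.items()
                if now - seen <= self.heartbeat_timeout}


nn = ToyNameNode()
nn.heartbeat("dn1", now=0.0)
nn.heartbeat("dn2", now=0.0)
nn.block_report("dn1", ["blk_1", "blk_2"])
nn.block_report("dn2", ["blk_2"])
nn.heartbeat("dn1", now=25.0)  # dn2 stays silent from here on
print(nn.live_datanodes(now=40.0))
```

Note that block locations come only from the reports: the DataNode knows which blocks it holds, and the NameNode rebuilds the block-to-node map from what it is told.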
HDFS Secondary NameNode
NOT a failover NameNode
Periodically merges a prior
snapshot (fsimage) and editlog(s)
(edits)
Fetches current fsimage and
edits files from the NameNode
Applies edits to fsimage to
create the up-to-date fsimage
Then sends the up-to-date
fsimage back to the NameNode
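The checkpoint step (apply edits to fsimage) can be sketched as replaying a log over a snapshot. The dictionary layout, the operation names and the paths below are hypothetical; the real fsimage and edits files are binary formats with many more operation types.

```python
def apply_edits(fsimage, edits):
    """Return a new namespace snapshot with the logged operations applied in order."""
    merged = dict(fsimage)
    for op, path, *args in edits:
        if op == "create":
            merged[path] = {"replication": args[0]}
        elif op == "delete":
            merged.pop(path, None)
        elif op == "set_replication":
            merged[path] = {**merged[path], "replication": args[0]}
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


# Prior snapshot plus the operations logged since it was taken.
fsimage = {"/logs/a.txt": {"replication": 3}}
edits = [
    ("create", "/logs/b.txt", 3),
    ("set_replication", "/logs/a.txt", 2),
    ("delete", "/logs/b.txt"),
    ("create", "/logs/c.txt", 3),
]
new_fsimage = apply_edits(fsimage, edits)
print(new_fsimage)
```

Once the merged snapshot exists, the replayed log entries are no longer needed, which keeps the log short and makes a NameNode restart faster.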
Reading A File From HDFS
Block data is never sent through the
NameNode
The NameNode redirects a client to an
appropriate DataNode
The NameNode chooses a DataNode that
is as “close” as possible
Lots of data
comes
from DataNodes
to a client
Block locations
$ hadoop fs -cat /toplist/2013-05-15/poland.txt
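One way to picture "as close as possible" is to rank replicas by a simple network distance. The sketch below assumes locations are `(rack, host)` pairs and uses distances 0, 2 and 4 for same machine, same rack and different rack; these names and values are illustrative assumptions, not the NameNode's actual code.

```python
def distance(client, datanode):
    """Toy network distance between two (rack, host) locations."""
    if client == datanode:
        return 0  # same machine
    if client[0] == datanode[0]:
        return 2  # same rack
    return 4      # different rack


def order_replicas(client, replicas):
    """Sort replica locations so that the closest DataNode comes first."""
    return sorted(replicas, key=lambda dn: distance(client, dn))


client = ("rack1", "host3")
replicas = [("rack2", "host7"), ("rack1", "host1"), ("rack1", "host3")]
ordered = order_replicas(client, replicas)
print(ordered)
```

The client then reads the block bytes directly from the first DataNode in the list, so the NameNode only ever serves the small location lookup.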
HDFS And Local File System
●
Runs on top
of a native file
system (e.g. ext3,
ext4, xfs)
●
HDFS is simply a
Java application
that uses a native
HDFS Data Integrity
HDFS detects corrupted
blocks
● When writing
Client computes the
checksums for each block
Client sends checksums to
a DN together with data
● When reading
Client verifies the
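The checksum idea can be illustrated with the standard-library `zlib.crc32`. The 512-byte chunk size and the helper names are assumptions for this sketch; it only shows how checksums computed at write time let a reader detect a corrupted piece of a block.

```python
import zlib

CHUNK_SIZE = 512  # assumed number of bytes covered by one checksum


def compute_checksums(data, chunk_size=CHUNK_SIZE):
    """Checksum each fixed-size chunk of the data (done when writing)."""
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def find_corrupt_chunks(data, checksums, chunk_size=CHUNK_SIZE):
    """Recompute checksums and report chunk indexes that differ (done when reading)."""
    actual = compute_checksums(data, chunk_size)
    return [i for i, (a, e) in enumerate(zip(actual, checksums)) if a != e]


block = bytes(range(256)) * 8   # 2048 bytes, i.e. 4 chunks
stored = compute_checksums(block)

damaged = bytearray(block)
damaged[700] ^= 0xFF            # corrupt one byte inside the second chunk
corrupt = find_corrupt_chunks(bytes(damaged), stored)
print(corrupt)
```

When a mismatch like this is found, a reader can fall back to another replica of the same block.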
HDFS NameNode Scalability
Stats based on Yahoo!
Clusters
●
An average file ≈ 1.5
blocks (block size = 128
MB)
●
An average file ≈ 600
bytes in RAM (1 file object
and 2 block objects)
●
100M files ≈ 60 GB of
metadata
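The 60 GB figure follows directly from the per-file estimate. A small worked calculation, using decimal gigabytes and the 600 bytes per file quoted above (the function name is just for this sketch):

```python
BYTES_PER_FILE = 600  # ~1 file object + 2 block objects, as quoted on the slide


def namenode_heap_gb(num_files, bytes_per_file=BYTES_PER_FILE):
    """Estimate NameNode metadata size in decimal gigabytes."""
    return num_files * bytes_per_file / 10**9


estimate = namenode_heap_gb(100_000_000)
print(f"100M files need about {estimate:.0f} GB of NameNode memory")
```

Because the cost is per file and not per byte, many small files exhaust NameNode memory much sooner than a few very large ones.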
HDFS NameNode
Performance
Read/write operations
throughput limited by one
machine
●
~120K read ops/sec
●
~6K write ops/sec
●
MapReduce tasks are also
HDFS clients
Internal load increases as
the cluster grows
●
HDFS Main Limitations
Single NameNode
●
Keeps all
metadata in RAM
●
Performs all
metadata
operations
●
Becomes a single
MapReduce
MapReduce Model
Programming model
inspired by functional
programming
map() and reduce()
functions processing
<key, value> pairs
Useful for processing
Map And Reduce Functions
● Map and Reduce
Map And Reduce Functions -
Counting Words
MapReduce Job
Input data is divided into splits
and converted into <key, value> pairs
Invokes map() function
multiple times
Keys are sorted, values not
(but could be)
Invokes reduce()
function multiple times
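The phases of a job can be imitated in a single process. The driver `run_job` below is a hypothetical helper written for this sketch, not a Hadoop API: it calls the map function once per input pair, groups the intermediate values by key, and calls the reduce function once per key in sorted key order.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    # Map phase: map_fn is called once per input <key, value> pair.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle: group all values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: keys are visited in sorted order, values are left unsorted.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


def word_count_map(offset, line):
    for word in line.split():
        yield word, 1


def word_count_reduce(word, counts):
    yield word, sum(counts)


lines = [(0, "to be or not"), (13, "to be")]  # (byte offset, line) pairs
result = run_job(lines, word_count_map, word_count_reduce)
print(result)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the sequence of steps is the same.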
MapReduce Example: ArtistCount
Artist, Song, Timestamp, User
Key is the offset of the line
from the beginning
of the file
We could specify which artist
goes to which reducer
(HashPartitioner is the default one)
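A hash-based partitioner boils down to "hash the key, take it modulo the number of reducers". Hadoop's default does this with the key's Java `hashCode()`; the sketch below substitutes `zlib.crc32` as a stand-in hash so the result is stable across Python runs, and the artist names and reducer count are made up for illustration.

```python
import zlib


def partition(key, num_reducers):
    """Map a key to a reducer index in the range [0, num_reducers)."""
    # Stand-in hash: crc32 is deterministic, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


artists = ["Coldplay", "Metallica", "Coldplay", "Rihanna"]
assignments = [(artist, partition(artist, 3)) for artist in artists]
for artist, reducer in assignments:
    print(f"{artist} -> reducer {reducer}")
```

The property that matters is that equal keys always land on the same reducer; a custom partitioner keeps that guarantee while changing which reducer a key goes to.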
MapReduce Example:
ArtistCount
map(Integer key, EndSong value, Context context):
    context.write(value.artist, 1)

reduce(String key, Iterator<Integer> values, Context context):
    int count = 0
    for each v in values:
        count += v
    context.write(key, count)
Pseudo-code in
non-existing
language ;)
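For readers who want to run the idea, here is a loose Python rendering of the ArtistCount pseudo-code. The log lines, the offsets and the in-process sort-and-group that stands in for the shuffle are all assumptions of this sketch; only the map and reduce logic mirrors the slide.

```python
from itertools import groupby
from operator import itemgetter


def map_end_song(offset, line):
    # Input record layout from the slide: Artist, Song, Timestamp, User
    artist, song, timestamp, user = [field.strip() for field in line.split(",")]
    yield artist, 1


def reduce_artist(artist, counts):
    yield artist, sum(counts)


# Hypothetical input; keys are illustrative byte offsets of each line.
log = [
    (0, "Coldplay, Yellow, 2013-05-15T10:00, alice"),
    (42, "Metallica, One, 2013-05-15T10:01, bob"),
    (80, "Coldplay, Clocks, 2013-05-15T10:02, carol"),
]

# Sorting the map output by key stands in for the shuffle.
mapped = sorted(pair for offset, line in log for pair in map_end_song(offset, line))
artist_counts = [
    result
    for artist, group in groupby(mapped, key=itemgetter(0))
    for result in reduce_artist(artist, (count for _, count in group))
]
print(artist_counts)
```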
MapReduce Combiner
Make sure that the Combiner
is fast and reduces the data enough
(otherwise it only adds
overhead)
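A combiner pre-aggregates map output on the map side, before anything is sent to the reducers. The sketch below assumes a counting job where summing is safe to apply early; the function names and the sample lines are hypothetical.

```python
from collections import Counter


def map_artist(line):
    yield line.split(",")[0].strip(), 1


def combine(pairs):
    """Sum counts per key locally, on the map side."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())


# One input split processed by a single map task (made-up data).
split = ["Coldplay, Yellow", "Coldplay, Clocks", "Metallica, One", "Coldplay, Fix You"]
map_output = [pair for line in split for pair in map_artist(line)]
combined = combine(map_output)

print(f"pairs without combiner: {len(map_output)}")
print(f"pairs with combiner:    {len(combined)}")
```

Here four pairs shrink to two. If nearly every key in a split were unique, the combiner would do extra work without sending noticeably less data, which is the overhead the slide warns about.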
MapReduce Implementation
●
Batch processing system
●
Automatic parallelization
and distribution of
computation
●
Fault-tolerance
●
Deals with all messy
details related to
distributed processing
●
Relatively easy to use
for programmers
JobTracker Responsibilities
●
Manages the
computational
resources
Available
TaskTrackers, map and
reduce slots
●
Schedules all user
jobs
Schedules all
TaskTracker Responsibilities
●
Runs map and reduce
tasks
●
Reports to JobTracker
Heartbeats saying
that it is still alive
Number of free
map and reduce slots
Task progress,
Apache Hadoop Cluster
●
It can consist of 1, 5,
100 or 4000 nodes
Thank You!
