1
Introduction to HDFS
By: Siddharth Mathur
Instructor: Dr. Shiyong Lu
2
Big Data
Wikipedia Definition:
In information technology, big data is a loosely-defined
term used to describe data sets so large and complex
that they become awkward to work with using on-hand
database management tools.
3
How Big is Big Data?
2008: Google processed 20 PB a day
2009: Facebook had 2.5 PB user data + 15 TB/day
2009: eBay had 6.5 PB user data + 50 TB/day
2011: Yahoo! had 180-200 PB of data
2012: Facebook ingests 500 TB/day
4
HOW TO ANALYZE THIS DATA?
5
Divide and Conquer
Partition
Combine
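The partition/combine idea can be sketched in a few lines. This is a minimal single-process illustration, assuming a hypothetical word-count task; the names `partition`, `count_words` and `combine` are illustrative and are not part of Hadoop.

```python
from collections import Counter

def partition(items, n_parts):
    """Split a list into n_parts chunks (round-robin)."""
    return [items[i::n_parts] for i in range(n_parts)]

def count_words(lines):
    """Work done independently on one partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def combine(partials):
    """Merge the per-partition results into one answer."""
    total = Counter()
    for partial in partials:
        total += partial
    return total

# Hypothetical input data for illustration only.
lines = ["big data is big", "data is everywhere", "big clusters"]
partials = [count_words(chunk) for chunk in partition(lines, 2)]
result = combine(partials)
```

Each partition could be handed to a different worker; only the small partial results need to be brought together in the combine step.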
6
But Parallel Processing is complicated
How do we assign tasks to workers?
What if we have more tasks than slots?
What happens when tasks fail?
How do you handle distributed synchronization?
7
The Solution!
Google File System
MapReduce
BigTable
8
GFS to HDFS
It started when Google researchers published a
paper on a distributed file system designed to address
the storage and analysis problems of Big Data.
The file system they proposed, the Google File System
(GFS), in turn gave birth to the
Hadoop Distributed File System (HDFS).
The paper on MapReduce led to the
MapReduce programming model in Hadoop.
The paper on BigTable led to
HBase, a column-oriented data store over HDFS.
9
HADOOP DISTRIBUTED FILE SYSTEM
10
Key Features
Accessible
Hadoop runs on large clusters of commodity machines or on
cloud computing services such as Amazon's Elastic Compute
Cloud (EC2).
Robust
As Hadoop is intended to run on commodity hardware, it is
architected with the assumption of frequent hardware
malfunctions. It can gracefully handle most such failures.
Scalable
Hadoop scales linearly to handle larger data by adding more
nodes to the cluster.
Simple
Hadoop allows users to quickly write efficient parallel code.
11
HDFS Scaling Out
Performs a task in 45 minutes
Performs a task in ~45/4 minutes
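The arithmetic behind the two figures can be written out as a small sketch. It assumes the slide compares one node with four nodes (suggested by "45/4") and that scaling is perfectly linear with no coordination overhead; `ideal_runtime` is a hypothetical helper name.

```python
def ideal_runtime(single_node_minutes, n_nodes):
    """Ideal runtime under perfectly linear scaling (assumes zero overhead)."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    return single_node_minutes / n_nodes

# 45 minutes of work spread over 4 nodes.
runtime_4 = ideal_runtime(45, 4)
```

In practice the real figure would be somewhat higher than the ideal value, which is why the slide writes "~".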
12
Basic Hadoop Stack
Hadoop Distributed File System
MapReduce
Hbase
Higher Level Languages
13
Hadoop Platforms
Platforms: Unix and Windows.
Linux: the only supported production platform.
Other variants of Unix, like Mac OS X: can run Hadoop for
development.
Windows + Cygwin: development platform (openssh)
Java 6
Java 1.6.x (aka 6.0.x aka 6) is recommended for
running Hadoop.
14
Hadoop Modes
• Standalone (or local) mode
– There are no daemons running and everything runs in
a single JVM. Standalone mode is suitable for running
MapReduce programs during development, since it is
easy to test and debug them.
• Pseudo-distributed mode
– The Hadoop daemons run on the local machine, thus
simulating a cluster on a small scale.
• Fully distributed mode
– The Hadoop daemons run on a cluster of machines.
15
Master-Slave Architecture
NameNode
JobTracker
DataNode
TaskTracker
Secondary NameNode
16
Master-Slave Architecture
HDFS has a master-slave architecture.
The master node, which runs the NameNode, governs the cluster.
It takes care of task and resource allocation.
It stores all the metadata related to file splitting, block
storage, block replication and task execution status.
The slave nodes, which run the DataNodes, store
all the data blocks and execute tasks.
The TaskTracker is the program that runs on each individual
DataNode machine and monitors task execution on that
node.
The JobTracker runs on the NameNode machine and monitors the
execution of the complete job.
17
HDFS File Distribution
File metadata
FILE-A -> 1,2,3 (split into 3 blocks)
FILE-B -> 4,5 (split into 2 blocks)
Replication factor = 3, set by "dfs.replication" in hdfs-site.xml
[Figure: each of blocks 1-5 stored as three replicas spread across the data nodes]
18
HDFS File Distribution
The NameNode stores metadata related to:
File split
Block allocation
Task allocation
Each file is split into data blocks. The default block size is
64 MB.
Each data block is replicated on different
DataNodes. The replication factor is configurable.
Default value is 3
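As a worked example of the splitting and replication described here, the following sketch computes how many blocks a file occupies and how much raw storage its replicas consume. The 64 MB block size and replication factor of 3 come from the slide; the function names and the 200 MB file are hypothetical.

```python
import math

BLOCK_SIZE_MB = 64   # default block size stated on the slide
REPLICATION = 3      # default replication factor stated on the slide

def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks a file is split into (the last block may be partial)."""
    return math.ceil(file_size_mb / block_size_mb)

def raw_storage_mb(file_size_mb, replication=REPLICATION):
    """Total cluster storage used once every block is replicated."""
    return file_size_mb * replication

# Hypothetical 200 MB file: three full 64 MB blocks plus one 8 MB block.
blocks = block_count(200)
storage = raw_storage_mb(200)
```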
19
Block Placement
Current Strategy
-- One replica on local node
-- Second replica on a remote rack
-- Third replica on same remote rack
-- Additional replicas are randomly placed
Clients read from nearest replica
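The placement strategy listed above can be sketched as follows. This is a simplified illustration, not HDFS's actual placement code: it assumes the writer runs on a DataNode, that at least one other rack exists with two or more nodes, and it picks nodes at random. `place_replicas` and the `RACKS` layout are hypothetical names.

```python
import random

# Hypothetical cluster layout: rack name -> DataNodes on that rack.
RACKS = {
    "Rack 1": ["DN1", "DN2", "DN3", "DN4"],
    "Rack 2": ["DN5", "DN6", "DN7", "DN8"],
    "Rack 3": ["DN9", "DN10", "DN11", "DN12"],
}

def place_replicas(writer, racks, replication=3, rng=random):
    """Choose DataNodes for one block following the slide's strategy."""
    node_rack = {n: r for r, nodes in racks.items() for n in nodes}
    chosen = [writer]  # first replica: the local node
    remote_racks = [r for r in racks if r != node_rack[writer]]
    remote = rng.choice(remote_racks)
    # second and third replicas: two different nodes on one remote rack
    chosen += rng.sample(racks[remote], 2)
    # additional replicas: random nodes not used yet
    remaining = [n for n in node_rack if n not in chosen]
    chosen += rng.sample(remaining, max(0, replication - 3))
    return chosen[:replication]

placement = place_replicas("DN1", RACKS)
```

With this layout, losing either the local rack or the remote rack still leaves at least one replica of the block.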
20
Rack awareness
[Figure: twelve DataNodes (DN 1-12) in three racks, each rack behind its own switch; FILE X is split into data block A and data block B]
Rack 1 = DN: 1, 2, 3, 4
Rack 2 = DN: 5, 6, 7, 8
Rack 3 = DN: 9, 10, 11, 12
NameNode: File X = Blk A in DN: 1, 5, 6; Blk B in DN: 7, 10, 11
21
Rack awareness
HDFS is aware of which rack each DataNode
is placed on.
To prevent data loss due to a complete rack
failure, Hadoop replicates each data
block onto other racks as well.
This helps HDFS recover the data even if a
complete rack of DataNodes shuts down.
This rack information is stored in the NameNode.
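The benefit of spreading replicas across racks can be checked with a short sketch. It uses the rack layout and block locations from the rack awareness figure (Blk A on DN 1, 5, 6 and Blk B on DN 7, 10, 11); the function names are hypothetical.

```python
# Rack layout and block locations taken from the rack awareness figure.
RACKS = {
    "Rack 1": {1, 2, 3, 4},
    "Rack 2": {5, 6, 7, 8},
    "Rack 3": {9, 10, 11, 12},
}
BLOCKS = {"A": {1, 5, 6}, "B": {7, 10, 11}}

def surviving_replicas(blocks, racks, failed_rack):
    """Replicas of each block that remain if one whole rack goes down."""
    lost = racks[failed_rack]
    return {blk: nodes - lost for blk, nodes in blocks.items()}

def survives_any_single_rack_failure(blocks, racks):
    """True if every block keeps at least one replica whichever rack fails."""
    return all(
        all(surviving_replicas(blocks, racks, rack).values())
        for rack in racks
    )

after_rack2 = surviving_replicas(BLOCKS, RACKS, "Rack 2")
```

A layout that kept all three replicas of a block on one rack would fail this check, which is the situation rack awareness is meant to avoid.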
22
File Write in Hadoop
[Figure: a client writes File.txt to a cluster of DataNodes DN 1-12 in Rack 1, Rack 2 and Rack 3, each rack behind a switch]
Client: File.txt [A, B, C], broken down using the Hadoop client API
NameNode: File.txt = Blk A in DN: 1, 5, 6; Blk B in DN: 7, 10, 11; Blk C in…..
Intelligent storage of data: first block in one rack, next blocks in a different rack
Arrows: Heartbeat, Request, Response, Metadata Creation, Block A Write
23
File Write in Hadoop
The HDFS client asks the NameNode to
write a file to HDFS.
It also provides the file size and other metadata
to the NameNode.
Meanwhile, each slave node sends a heartbeat
signal to the NameNode reporting its status.
24
File Write in Hadoop
The NameNode tells the client where to
store the data blocks.
It also tells the DataNodes to get ready for the data
write.
After the data write is complete, the
DataNode sends a success message to both the
client and the NameNode.
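The write steps described on these slides can be mimicked with a toy in-memory model. This is only a sketch under loud assumptions: the classes, the round-robin target selection and the `write_file` helper are all hypothetical, and the real HDFS write path involves more steps than shown here.

```python
class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> data

    def write(self, block_id, data):
        self.blocks[block_id] = data
        return True  # stands in for the success message


class NameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.metadata = {}  # filename -> [(block id, [DataNode names])]

    def allocate(self, filename, n_blocks):
        """Tell the client where to store each block (round-robin: an assumption)."""
        count = len(self.datanodes)
        return [
            (f"{filename}#{i}",
             [self.datanodes[(i + k) % count] for k in range(self.replication)])
            for i in range(n_blocks)
        ]

    def commit(self, filename, plan):
        """Record block locations once the writes have been acknowledged."""
        self.metadata[filename] = [
            (block_id, [dn.name for dn in targets]) for block_id, targets in plan
        ]


def write_file(blocks, filename, namenode):
    plan = namenode.allocate(filename, len(blocks))       # request / response
    for data, (block_id, targets) in zip(blocks, plan):
        acks = [dn.write(block_id, data) for dn in targets]  # block write
        if not all(acks):
            raise IOError(f"write of {block_id} was not acknowledged")
    namenode.commit(filename, plan)                        # metadata creation
    return plan


nodes = [DataNode(f"DN{i}") for i in range(1, 5)]
namenode = NameNode(nodes)
write_file(["A", "B", "C"], "File.txt", namenode)
```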
25
File Read in Hadoop
[Figure: a client reads File.txt from a cluster of DataNodes DN 1-12 in Rack 1, Rack 2 and Rack 3, each rack behind a switch]
NameNode: File.txt = Blk A in DN: 1, 5, 6; Blk B in DN: 7, 10, 11; Blk C in…..
Response to the client: an ordered list of nodes.
Arrows: Heartbeat, Request, Response
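One way to picture the "ordered list of nodes" is to sort a block's replicas by how close they are to the reader. The sketch below assumes the rack layout from the rack awareness figure and uses illustrative distance values (0 for the same node, 2 for the same rack, 4 for another rack); the function names are hypothetical.

```python
# Rack layout from the rack awareness figure.
RACKS = {
    "Rack 1": [1, 2, 3, 4],
    "Rack 2": [5, 6, 7, 8],
    "Rack 3": [9, 10, 11, 12],
}

def rack_of(node):
    for rack, nodes in RACKS.items():
        if node in nodes:
            return rack
    raise KeyError(node)

def distance(client_node, replica_node):
    """Illustrative distance: same node < same rack < different rack."""
    if client_node == replica_node:
        return 0
    if rack_of(client_node) == rack_of(replica_node):
        return 2
    return 4

def ordered_replicas(client_node, replica_nodes):
    """Replica locations sorted from nearest to farthest."""
    return sorted(replica_nodes, key=lambda n: distance(client_node, n))

# Block A lives on DN 1, 5, 6; a reader on DN 7 (Rack 2) prefers DN 5 or 6.
order_a = ordered_replicas(7, [1, 5, 6])
```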
26
Re-replicating missing replicas
27
Re-replication
Missing heartbeats signify lost nodes
The NameNode consults its metadata and finds the affected
data
The NameNode consults the rack awareness script
The NameNode tells the DataNodes to re-replicate
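The first two steps, spotting lost nodes from missing heartbeats and finding the affected blocks, can be sketched as below. The timeout value, the timestamps and the function names are hypothetical; the block locations reuse the example from the rack awareness figure.

```python
def find_dead_nodes(last_heartbeat, now, timeout):
    """Nodes whose most recent heartbeat is older than the timeout."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}

def under_replicated(block_map, dead, replication=3):
    """Map each affected block to (live replicas, copies still needed)."""
    result = {}
    for block, nodes in block_map.items():
        live = set(nodes) - dead
        if len(live) < replication:
            result[block] = (live, replication - len(live))
    return result

# Hypothetical heartbeat timestamps in seconds; DN 6 has gone quiet.
last_heartbeat = {1: 100, 5: 100, 6: 40, 7: 100, 10: 100, 11: 100}
dead = find_dead_nodes(last_heartbeat, now=105, timeout=30)
todo = under_replicated({"A": {1, 5, 6}, "B": {7, 10, 11}}, dead)
```

Each entry in `todo` names the surviving replicas that could serve as the source for a new copy and how many copies are missing.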
28
3 main configuration files
core-site.xml
Contains configuration information that overrides the
default core Hadoop properties
mapred-site.xml
Contains configuration information that overrides the
default MapReduce properties
Also defines the host and port that the MapReduce
JobTracker runs at
hdfs-site.xml
Mainly used to set the block replication factor
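Each of these files holds a list of name/value properties inside a `configuration` element. As a minimal sketch, the snippet below generates the body of an hdfs-site.xml that sets `dfs.replication` to 3 (the property and value mentioned in this deck); the helper name `build_site_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET

def build_site_xml(properties):
    """Render a dict of properties in the Hadoop *-site.xml layout."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")

# Override the default block replication factor.
hdfs_site = build_site_xml({"dfs.replication": 3})
```

The same helper could produce core-site.xml or mapred-site.xml by passing the properties that belong in those files.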
29
Anatomy of a Job Launch
30
Job Status updates
31
Limitations of Hadoop -1
Scalability
Maximum cluster size: 4,000 nodes for best
performance
Maximum concurrent tasks: 40,000
NameNode as a single point of failure
Failure kills all running and queued jobs
Jobs need to be re-submitted by the user
Restartability
Restart is very tricky due to complex state
32
Who has the biggest cluster setups
Facebook 400
Microsoft 400
LinkedIn 4100
Yahoo 42,000
33
References
http://hadoop.apache.org/
http://research.google.com/archive/mapreduce.html
http://research.google.com/archive/gfs.html
http://research.google.com/archive/bigtable.html
http://hbase.apache.org/
http://wiki.apache.org/hadoop/FAQ
http://matt-wand.utsacademics.info/webUTSdiscns/HadoopNotes.pdf
34
THANK YOU
