A brief introduction to the Hadoop Distributed File System (HDFS): how a file is broken into blocks, written, and replicated on HDFS; how missing replicas are taken care of; how a job is launched and its status is checked; and some advantages and disadvantages of HDFS 1.x.
Introduction of Mesos persistent storage (Zhou Weitao)
1. How to run stateful service against current Mesos-0.22
2. Disk isolation and monitoring
3. Persistent Volumes
4. Dynamic Reservations
5. What we can contribute for Mesos persistent storage
Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis (Sameer Tiwari)
There is a plethora of storage solutions for big data, each with its own pros and cons. The objective of this talk is to delve deeper into specific classes of storage types, such as distributed file systems, in-memory key-value stores, and Big Table stores, and to provide insights on how to choose the right storage solution for a specific class of problems, for instance running large analytic workloads, iterative machine learning algorithms, and real-time analytics.
The talk will cover HDFS and HBase, with a brief introduction to Redis.
The Google Chubby lock service for loosely-coupled distributed systems (Romain Jacotin)
The Google Chubby lock service presented in 2006 is the inspiration for Apache ZooKeeper: let's take a deep dive into Chubby to better understand ZooKeeper and distributed consensus.
Adobe has packaged HBase in Docker containers and uses Marathon and Mesos to schedule them—allowing them to decouple the HBase RegionServer from the host, express resource requirements declaratively, and open the door for unassisted real-time deployments, elastic (up and down) real-time scalability, and more.
PostgreSQL connections at scale was the presentation by our external speaker at our 8th open-source database meetup. The presentation helps you understand database connections and their cost, gauge the need for a connection pooler, and get an overview of PgBouncer with its features, monitoring, and deployment best practices.
Replication, Durability, and Disaster Recovery (Steven Francia)
This session introduces the basic components of high availability before going into a deep dive on MongoDB replication. We'll explore some of the advanced capabilities with MongoDB replication and best practices to ensure data durability and redundancy. We'll also look at various deployment scenarios and disaster recovery configurations.
Many people and employers want high availability in their applications, but keeping an environment always available is not an easy task. In the open source world there are tools that make it possible. This presentation is a module of the UTAH NETWORXS High Availability and Performance course. Utah Networxs is a business school in Sao Paulo, Brazil, that has worked with Linux systems for more than 17 years. Made by Fabio Pires, director of Utah Networxs and a Linux specialist focusing on clusters and HA services.
Redis is an open-source in-memory database which is easy to use. In this introductory presentation, several features will be discussed, including use cases. The data types will be elaborated, and publish/subscribe features and persistence will be discussed, including client implementations in Node and Spring Boot. After this presentation, you will have a basic understanding of what Redis is and enough knowledge to get started with your first implementation!
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ... (Simplilearn)
This video on Hadoop interview questions part 1 will take you through the general Hadoop questions and questions on HDFS, MapReduce and YARN, which are very likely to be asked in any Hadoop interview. It covers all the topics on the major components of Hadoop. This Hadoop tutorial will give you an idea about the different scenario-based questions you could face and some multiple-choice questions as well. Now, let us dive into this Hadoop interview questions video and gear up for your next Hadoop interview.
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL: creating, transforming, and querying DataFrames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
In this session you will learn:
History of Hadoop
Hadoop Ecosystem
Hadoop Animal Planet
What is Hadoop?
Distinctions of Hadoop
Hadoop Components
The Hadoop Distributed Filesystem
Design of HDFS
When Not to use Hadoop?
HDFS Concepts
Anatomy of a File Read
Anatomy of a File Write
Replication & Rack awareness
Mapreduce Components
Typical Mapreduce Job
To know more, click here: https://www.mindsmapped.com/courses/big-data-hadoop/big-data-and-hadoop-training-for-beginners/
Enroll in a free live demo of Hadoop online training and big data analytics courses online and become a certified data analyst / Hadoop developer. Get online Hadoop training & certification.
This presentation about Hadoop architecture will help you understand the architecture of Apache Hadoop in detail. In this video, you will learn what Hadoop is, the components of Hadoop, what HDFS is, HDFS architecture, Hadoop MapReduce, a Hadoop MapReduce example, Hadoop YARN and, finally, a demo on MapReduce. Apache Hadoop offers a versatile, adaptable and reliable distributed computing framework for big data, running on a group of systems with storage capacity and local computing power. After watching this video, you will also understand the Hadoop Distributed File System and its features, along with the practical implementation.
Below are the topics covered in this Hadoop Architecture presentation:
1. What is Hadoop?
2. Components of Hadoop
3. What is HDFS?
4. HDFS Architecture
5. Hadoop MapReduce
6. Hadoop MapReduce Example
7. Hadoop YARN
8. Demo on MapReduce
Who should take up this Big Data and Hadoop Certification Training Course?
Big Data career opportunities are on the rise, and Hadoop is quickly becoming a must-know technology for the following professionals:
1. Software Developers and Architects
2. Analytics Professionals
3. Senior IT professionals
4. Testing and Mainframe professionals
5. Data Management Professionals
6. Business Intelligence Professionals
7. Project Managers
8. Aspiring Data Scientists
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract: Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with the precondition that the input graph contain no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Opendatabay - Open Data Marketplace.pptx (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. The Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
It is the first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence, and it leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools, so you can effortlessly explore, discover, and access the data you need and focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
2. Big Data
Wikipedia definition: In information technology, big data is a loosely-defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools.
3. How Big is Big Data?
2008: Google processed 20 PB a day
2009: Facebook had 2.5 PB user data + 15 TB/day
2009: eBay had 6.5 PB user data + 50 TB/day
2011: Yahoo! had 180-200 PB of data
2012: Facebook ingests 500 TB/day
6. But Parallel Processing is complicated
How do we assign tasks to workers?
What if we have more tasks than slots?
What happens when tasks fail?
How do you handle distributed synchronization?
8. GFS to HDFS
It started when Google researchers wrote a paper on a distributed file system to resolve the storage and analysis issues of Big Data. The researchers proposed a file system named the Google File System, which in turn gave birth to the Hadoop Distributed File System (HDFS). The paper on MapReduce resulted in the MapReduce programming structure, and the paper on BigTable produced Hadoop HBase, a data warehouse schema over HDFS.
10. Key Features
Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon's Elastic Compute Cloud (EC2).
Robust: As Hadoop is intended to run on commodity hardware, it is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
Simple: Hadoop allows users to quickly write efficient parallel code.
13. Hadoop Platforms
Platforms: Unix and Windows.
Linux: the only supported production platform.
Other variants of Unix, like Mac OS X: run Hadoop for development.
Windows + Cygwin: development platform (openssh).
Java 6: Java 1.6.x (aka 6.0.x, aka 6) is recommended for running Hadoop.
14. Hadoop Modes
• Standalone (or local) mode: there are no daemons running and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.
• Pseudo-distributed mode: the Hadoop daemons run on the local machine, thus simulating a cluster on a small scale (see the configuration sketch after this list).
• Fully distributed mode: the Hadoop daemons run on a cluster of machines.
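The slides stop at the descriptions, so as a hedged illustration only: a pseudo-distributed Hadoop 1.x setup is usually wired up through the three XML files covered later in this deck. The property names below (fs.default.name, dfs.replication, mapred.job.tracker) are the stock Hadoop 1.x ones; the host and ports are example values.

```xml
<!-- conf/core-site.xml: point the default filesystem at a local HDFS daemon -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: a single node can only hold one replica of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: run the JobTracker locally -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```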
16. Master-Slave Architecture
HDFS has a master-slave architecture. The master node, or name node, governs the cluster: it takes care of task and resource allocation, and it stores all the metadata related to file breakage, block storage, block replication, and task execution status. The slave nodes, or data nodes, are the ones that store all the data blocks and perform task executions. The TaskTracker is the program that runs on each individual data node and monitors task execution on that node. The JobTracker runs on the name node and monitors the complete job execution.
18. HDFS File Distribution
The name node stores metadata related to: file split, block allocation, and task allocation. Each file is split into data blocks; the default size is 64 MB. Each data block is replicated on a different data node. The replication factor is configurable; the default value is 3 (see the configuration sketch below).
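A hedged sketch of how these two defaults might be overridden, assuming the Hadoop 1.x property names dfs.block.size (in bytes) and dfs.replication:

```xml
<!-- conf/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value> <!-- 128 MB instead of the 64 MB default -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- replicas per block; applied to newly written files -->
  </property>
</configuration>
```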
19. Block Placement
Current strategy:
-- One replica on the local node
-- Second replica on a remote rack
-- Third replica on the same remote rack
-- Additional replicas are randomly placed
Clients read from the nearest replica.
20. Rack awareness
[Diagram: a NameNode and twelve data nodes split across three racks (Rack 1 = DN 1-4, Rack 2 = DN 5-8, Rack 3 = DN 9-12), each rack behind its own switch. The NameNode's metadata records File X as Block A on DN 1, 5, 6 and Block B on DN 7, 10, 11, i.e. each block's replicas are spread across racks.]
21. Rack awareness
HDFS is aware of the placement of each data node on the racks. To prevent data loss due to a complete rack failure, Hadoop intelligently replicates each data block onto other racks as well. This helps HDFS recover the data even if a complete rack of data nodes shuts down. This information is stored in the name node (see the configuration sketch below).
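The deck does not show where the rack information comes from. As a hedged sketch: in Hadoop 1.x it is typically supplied by pointing core-site.xml at an admin-written topology script; the property name topology.script.file.name is the stock 1.x one, while the script path is a hypothetical example.

```xml
<!-- conf/core-site.xml -->
<configuration>
  <property>
    <name>topology.script.file.name</name>
    <value>/etc/hadoop/topology.sh</value>
    <!-- the script receives data node IPs/hostnames as arguments and
         prints one rack path per argument, e.g. /dc1/rack2 -->
  </property>
</configuration>
```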
22. File Write in Hadoop
[Diagram: a client breaks File.txt into blocks A, B, and C using the Hadoop client API. After the request/response exchange and metadata creation, the NameNode records File.txt as Blk A on DN 1, 5, 6; Blk B on DN 7, 10, 11; Blk C elsewhere, and block A is written. The first block goes to one rack and the next blocks to a different rack (intelligent storage of data); data nodes send heartbeats throughout.]
23. File Write in Hadoop
The HDFS client requests that the name node write a file onto HDFS, and also provides the file size and other metadata information to the name node. Meanwhile, each slave node sends a heartbeat signal to the name node, telling it about its status.
24. File Write in Hadoop
The name node tells the client system where to store the data blocks, and also tells the data nodes to get ready for the data write. After the data write procedure is complete, the data node sends a success message to both the client and the name node. (A client API sketch follows.)
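The deck mentions the Hadoop client API without showing it, so here is a minimal, hedged Java sketch of a write followed by a read through org.apache.hadoop.fs.FileSystem; the file path is a made-up example, and the cluster settings are assumed to come from the XML files described later.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/File.txt"); // hypothetical path

        // Write: the client streams bytes; the name node decides block
        // placement and the data nodes replicate each block
        FSDataOutputStream out = fs.create(file);
        out.writeUTF("hello hdfs");
        out.close();

        // Read: the name node returns an ordered list of block locations,
        // and the client reads each block from the nearest replica
        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();

        fs.close();
    }
}
```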
25. File Read in Hadoop
[Diagram: a client asks the NameNode for File.txt; the NameNode consults its metadata (Blk A on DN 1, 5, 6; Blk B on DN 7, 10, 11; Blk C elsewhere) and, after the request/response exchange, returns an ordered list of nodes for each block. The client then reads the blocks from those data nodes, which send heartbeats throughout.]
27. Re-replication
Missing heartbeats signify lost nodes.
The name node consults the metadata and finds the affected data.
The name node consults the rack awareness script.
The name node tells the data nodes to re-replicate.
(A command-line sketch for inspecting replication state follows.)
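Not from the slides, but a hedged aside: on a Hadoop 1.x cluster you can observe this machinery with the stock fsck tool, whose summary reports under-replicated and missing blocks; the flags below are standard, and "/" is simply the filesystem root.

```
hadoop fsck / -files -blocks -locations
```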
28. 3 main configuration files
core-site.xml: contains configuration information that overrides the default core Hadoop properties.
mapred-site.xml: contains configuration information that overrides the default core MapReduce properties; it also defines the host and port that the MapReduce job tracker runs at.
hdfs-site.xml: mainly used to set the block replication factor.
(A sketch of the job tracker setting follows.)
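As a hedged example of the job tracker setting that mapred-site.xml carries, a Hadoop 1.x file might look like this; mapred.job.tracker is the stock 1.x property name, while the host name is a made-up example.

```xml
<!-- conf/mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:9001</value> <!-- host:port the JobTracker listens on -->
  </property>
</configuration>
```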
31. Limitations of Hadoop-1
Scalability
-- Maximum cluster size: 4,000 nodes for best performance
-- Maximum concurrent tasks: 40,000
Name node as a single point of failure
-- Failure kills all running and queued jobs
-- Jobs need to be re-submitted by the user
Restartability
-- Restart is very tricky due to complex state
32. Who has the biggest cluster setups
Facebook: 400
Microsoft: 400
LinkedIn: 4,100
Yahoo!: 42,000