
Introduction to HDFS


A brief introduction to the Hadoop Distributed File System (HDFS): how a file is broken into blocks, written and replicated on HDFS; how missing replicas are handled; how a job is launched and its status checked; and some advantages and disadvantages of HDFS 1.x.

Published in: Data & Analytics, Technology

  1. Introduction to HDFS. By: Siddharth Mathur. Instructor: Dr. Shiyong Lu.
  2. Big Data. Wikipedia definition: In information technology, big data is a loosely defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools.
  3. How Big is Big Data?
      • 2008: Google processed 20 PB a day
      • 2009: Facebook had 2.5 PB user data + 15 TB/day
      • 2009: eBay had 6.5 PB user data + 50 TB/day
      • 2011: Yahoo! had 180-200 PB of data
      • 2012: Facebook ingests 500 TB/day
  4. HOW TO ANALYZE THIS DATA?
  5. Divide and Conquer: partition the work, process the parts in parallel, then combine the results.
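The partition and combine steps can be sketched on a single machine. This is only an illustration of the idea, not Hadoop code: the function names (`partition`, `count_words`, `combine`, `word_count`) and the word-count task are assumptions chosen for the example, and a thread pool stands in for the cluster's workers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_parts):
    """Split items into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def count_words(lines):
    """Work done independently on one partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def combine(partials):
    """Merge the per-partition results into one answer."""
    total = Counter()
    for partial in partials:
        total += partial
    return total


def word_count(lines, n_workers=4):
    """Partition the input, process the chunks in parallel, combine."""
    chunks = partition(lines, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_words, chunks))
    return combine(partials)
```

Each chunk is processed without reference to the others, which is what lets the work be spread over many workers; only the final combine step needs to see every partial result.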
  6. But parallel processing is complicated. How do we assign tasks to workers? What if we have more tasks than slots? What happens when tasks fail? How do you handle distributed synchronization?
  7. The Solution! Google File System, MapReduce, BigTable.
  8. GFS to HDFS. It started when Google researchers wrote a paper on a distributed file system to resolve the storage and analysis issues of Big Data. The researchers proposed a file system named Google File System (GFS), which in turn gave birth to the Hadoop Distributed File System (HDFS). The paper on MapReduce resulted in the Hadoop MapReduce programming model. The paper on BigTable led to Apache HBase, a column-oriented data store that runs on top of HDFS.
  9. HADOOP DISTRIBUTED FILE SYSTEM
  10. Key Features
      • Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon's Elastic Compute Cloud (EC2).
      • Robust: Because Hadoop is intended to run on commodity hardware, it is architected with the assumption of frequent hardware malfunctions and can gracefully handle most such failures.
      • Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
      • Simple: Hadoop allows users to quickly write efficient parallel code.
  11. HDFS Scaling Out. [Diagram: one machine performs a task in 45 minutes; four machines perform the same task in ~45/4 minutes.]
  12. Basic Hadoop Stack: Hadoop Distributed File System, MapReduce, HBase, higher-level languages.
  13. Hadoop Platforms. Hadoop runs on Unix and on Windows. Linux is the only supported production platform. Other variants of Unix, like Mac OS X, can run Hadoop for development. Windows + Cygwin (with openssh) is a development platform. Java 1.6.x (aka 6.0.x aka 6) is recommended for running Hadoop.
  14. Hadoop Modes
      • Standalone (or local) mode – There are no daemons running and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.
      • Pseudo-distributed mode – The Hadoop daemons run on the local machine, thus simulating a cluster on a small scale.
      • Fully distributed mode – The Hadoop daemons run on a cluster of machines.
  15. Master-Slave Architecture. [Diagram labels: Namenode, Jobtracker, Datanode, Tasktracker, Secondary Namenode.]
  16. Master-Slave Architecture. HDFS has a master-slave architecture. The master node, which hosts the name node, governs the cluster. It takes care of task and resource allocation and stores all the metadata related to file splitting, block storage, block replication and task execution status. The slave nodes, or data nodes, store all the data blocks and perform task execution. The Tasktracker is the program that runs on each individual data node and monitors task execution on that node. The Jobtracker runs on the master node and monitors the complete job execution.
  17. HDFS File Distribution. File metadata: FILE-A -> 1,2,3 (split into 3 blocks); FILE-B -> 4,5 (split into 2 blocks). Replication factor = 3, set by "dfs.replication" in hdfs-site.xml. [Diagram: copies of blocks 1 to 5 spread across the data nodes.]
  18. HDFS File Distribution. The name node stores metadata related to file split, block allocation and task allocation. Each file is split into data blocks; the default block size is 64 MB. Each data block is replicated on different data nodes. The replication factor is configurable; the default value is 3.
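A small worked example of the block arithmetic, using the 64 MB block size and replication factor of 3 stated on the slide. The helper names (`block_sizes_mb`, `replica_count`, `raw_storage_mb`) are made up for this sketch. It assumes that the final, partial block occupies only its actual length on disk rather than a full 64 MB.

```python
import math

BLOCK_SIZE_MB = 64   # default block size from the slide
REPLICATION = 3      # default replication factor from the slide


def block_sizes_mb(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of every block a file of the given size splits into."""
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    sizes = [block_size_mb] * n_blocks
    if n_blocks:
        # The last block holds whatever remains after the full blocks.
        sizes[-1] = file_size_mb - block_size_mb * (n_blocks - 1)
    return sizes


def replica_count(file_size_mb, replication=REPLICATION):
    """Total number of block replicas stored across the cluster."""
    return len(block_sizes_mb(file_size_mb)) * replication


def raw_storage_mb(file_size_mb, replication=REPLICATION):
    """Raw disk space consumed once every block is replicated."""
    return sum(block_sizes_mb(file_size_mb)) * replication
```

For a 200 MB file this gives four blocks (64, 64, 64 and 8 MB), twelve replicas in total, and 600 MB of raw storage.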
  19. Block Placement: Current Strategy
      • One replica on the local node
      • Second replica on a remote rack
      • Third replica on the same remote rack
      • Additional replicas are randomly placed
      Clients read from the nearest replica.
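The placement strategy above can be simulated in a few lines. This is a toy model, not the HDFS implementation: the rack layout is borrowed from the rack-awareness diagram in this deck (Rack 1 = DN 1-4, Rack 2 = DN 5-8, Rack 3 = DN 9-12), and the names `RACKS`, `rack_of` and `place_replicas` are hypothetical.

```python
import random

# Rack layout taken from the rack-awareness diagram in the deck.
RACKS = {
    "rack1": ["DN1", "DN2", "DN3", "DN4"],
    "rack2": ["DN5", "DN6", "DN7", "DN8"],
    "rack3": ["DN9", "DN10", "DN11", "DN12"],
}
ALL_NODES = [node for nodes in RACKS.values() for node in nodes]


def rack_of(node):
    """Look up which rack a data node belongs to."""
    for rack, nodes in RACKS.items():
        if node in nodes:
            return rack
    raise ValueError(f"unknown data node: {node}")


def place_replicas(local_node, replication=3, rng=random):
    """Pick data nodes for one block, following the slide's strategy."""
    if not 1 <= replication <= len(ALL_NODES):
        raise ValueError("replication must be between 1 and the node count")
    local_rack = rack_of(local_node)
    replicas = [local_node]                       # first replica: local node
    if replication >= 2:
        remote_rack = rng.choice([r for r in RACKS if r != local_rack])
        second = rng.choice(RACKS[remote_rack])   # second: a remote rack
        replicas.append(second)
    if replication >= 3:
        same_rack = [n for n in RACKS[remote_rack] if n != second]
        replicas.append(rng.choice(same_rack))    # third: same remote rack
    while len(replicas) < replication:            # extras: random unused nodes
        unused = [n for n in ALL_NODES if n not in replicas]
        replicas.append(rng.choice(unused))
    return replicas
```

Calling `place_replicas("DN1")` returns DN1 followed by two different nodes that share a rack other than rack1, which is the same shape as the "Blk:A in DN:1,5,6" entry in the deck's diagrams.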
  20. Rack awareness. [Diagram: FILE X is split into data block A and data block B. Rack 1 = DN 1,2,3,4; Rack 2 = DN 5,6,7,8; Rack 3 = DN 9,10,11,12; each rack sits behind its own switch. NameNode metadata: File X = Blk A in DN 1,5,6; Blk B in DN 7,10,11.]
  21. Rack awareness. HDFS is aware of the placement of each data node on the racks. To prevent data loss due to a complete rack failure, Hadoop replicates each data block onto other racks as well. This helps HDFS recover the data even if a complete rack of data nodes shuts down. This information is stored in the name node.
  22. File Write in Hadoop. [Diagram: the client breaks File.txt into blocks A, B, C using the Hadoop client API. NameNode metadata: File.txt = Blk A in DN 1,5,6; Blk B in DN 7,10,11; Blk C in….. Data nodes DN 1-12 sit in Rack 1, Rack 2 and Rack 3, each behind a switch. Labels: "First block in one rack, next blocks in different rack", "Intelligent storage of data", Heartbeat, Request, Response, MetaData Creation, Block A Write.]
  23. File Write in Hadoop. The HDFS client requests the name node to write a file onto HDFS. It also provides the file size and other metadata to the name node. Meanwhile, each slave node sends a heartbeat signal to the name node, reporting its status.
  24. File Write in Hadoop. The name node tells the client where to store the data blocks. It also tells the data nodes to get ready for the write. After the write is complete, the data node sends a success message to both the client and the name node.
  25. File Read in Hadoop. [Diagram: Client, NameNode, and data nodes DN 1-12 in Rack 1, Rack 2 and Rack 3, each behind a switch. NameNode metadata: File.txt = Blk A in DN 1,5,6; Blk B in DN 7,10,11; Blk C in….. Labels: Heartbeat, Request, Response, "An ordered list of nodes".]
  26. Re-replicating missing replicas
  27. Re-replication
      • Missing heartbeats signify lost nodes
      • Name node consults metadata, finds affected data
      • Name node consults rack awareness script
      • Name node tells the data nodes to re-replicate
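The re-replication steps can be sketched as a toy simulation. Everything here is illustrative rather than actual NameNode logic: the function names, the timeout parameter and the data layout are assumptions, and the rack-awareness step is deliberately left out (targets are simply the first live nodes that do not already hold the block).

```python
def find_lost_nodes(last_heartbeat, now, timeout):
    """Nodes whose last heartbeat is older than the timeout count as lost."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}


def plan_rereplication(block_map, lost_nodes, all_nodes, replication=3):
    """Return {block: [(source, target), ...]} for under-replicated blocks."""
    live = [n for n in all_nodes if n not in lost_nodes]
    plan = {}
    for block, holders in block_map.items():
        survivors = [n for n in holders if n not in lost_nodes]
        missing = replication - len(survivors)
        if missing <= 0 or not survivors:
            continue  # fully replicated, or no surviving copy to read from
        candidates = [n for n in live if n not in survivors]
        # Copy from the first surviving replica to each new target.
        plan[block] = [(survivors[0], target) for target in candidates[:missing]]
    return plan


# Block locations taken from the deck's diagrams.
all_nodes = [f"DN{i}" for i in range(1, 13)]
block_map = {"A": ["DN1", "DN5", "DN6"], "B": ["DN7", "DN10", "DN11"]}

# Hypothetical heartbeat times in seconds: DN5 has been silent for 100 s.
last_heartbeat = {node: 100.0 for node in all_nodes}
last_heartbeat["DN5"] = 0.0

lost = find_lost_nodes(last_heartbeat, now=100.0, timeout=30.0)
plan = plan_rereplication(block_map, lost, all_nodes)
```

With DN5 lost, only block A is affected, so the plan contains a single copy of block A from a surviving holder to a node that does not yet store it; block B keeps all three replicas and is left alone.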
  28. 3 main configuration files
      • core-site.xml: contains configuration information that overrides the default core Hadoop properties.
      • mapred-site.xml: contains configuration information that overrides the default MapReduce properties. Also defines the host and port that the MapReduce job tracker runs at.
      • hdfs-site.xml: mainly used to set the block replication factor.
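Hadoop's `*-site.xml` files share one layout: a `<configuration>` root holding `<property>` elements, each with a `<name>` and a `<value>`. As a minimal sketch, the snippet below builds the contents of an hdfs-site.xml that sets `dfs.replication` (the property named earlier in the deck) using Python's standard library. The helper name `build_site_xml` is hypothetical, and the value 3 is simply the default mentioned in the slides.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Serialize a dict of property names and values as a Hadoop site file."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


hdfs_site = build_site_xml({"dfs.replication": 3})
```

The resulting string is a single `<configuration>` element with one `<property>` entry; writing it to `hdfs-site.xml` in the cluster's configuration directory is left out of the sketch.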
  29. Anatomy of a Job Launch
  30. Job Status updates
  31. Limitations of Hadoop 1.x
      • Scalability: maximum cluster size of 4,000 nodes for best performance; maximum of 40,000 concurrent tasks.
      • Name node as a single point of failure: a failure kills all running and queued jobs, and jobs need to be re-submitted by the user.
      • Restartability: restart is very tricky due to complex state.
  32. Who has the biggest cluster setups?
      • Facebook: 400
      • Microsoft: 400
      • LinkedIn: 4100
      • Yahoo: 42,000
  33. References
      • http://hadoop.apache.org/
      • http://research.google.com/archive/mapreduce.html
      • http://research.google.com/archive/gfs.html
      • http://research.google.com/archive/bigtable.html
      • http://hbase.apache.org/
      • http://wiki.apache.org/hadoop/FAQ
      • http://matt-wand.utsacademics.info/webUTSdiscns/HadoopNotes.pdf
  34. THANK YOU