HDFS
Леонид Налчаджи
leonid.nalchadzhi@gmail.com
Intro
•
•
•

Hadoop ecosystem
History
Setups

2
HDFS
•
•
•
•
•
•
•

Features
Architecture
NameNode
DataNode
Data flow
Tools
Performance
3
Hadoop ecosystem

4
Doug Cutting

5
History
•
•
•

2002: Apache Nutch

•
•

2005: MapReduce

2003: GFS
2004: Nutch Distributed Filesystem
(NDFS)
2006: subproj...
History
•
•
•
•
•

2006:Yahoo! Research cluster: 300 nodes
2006: BigTable paper
2007: Initial HBase prototype
2007: First ...
Яндекс
•

2 MapReduce clusters MapReduce (50
nodes, 100 nodes)

•
•
•

2 HBase clusters
500 nodes cluster is on its way :)...
Facebook
•
•

2 major clusters (1100 and 300 nodes)

•

8B messages/day, 2Pb online data, growing
250 Tb/ months

Messages...
EBay
•
•

532 nodes cluster (8 * 532 cores, 5.3PB).

•

Search optimization and Research

Heavy usage of MapReduce, HBase,...
11
HDFS

12
Features

13
Features
•
•
•
•
•
•

Distributed file system
Replicated, highly fault-tolerant
Commodity hardware
Aimed for Map Reduce
Pur...
Features
•
•
•

Traditional hierarchical file organization
Permissions
No soft links

15
Features
•
•

Streaming data access

•

Strictly one writer

Write-once-read-many access model for
files

16
Hardware Failure
•

Hardware failure is the norm rather than
the exception

•

Detection of faults and quick, automatic
re...
Not good for
•
•
•
•

Multiple data centers
Low latency data access (<10 ms)
Lots of small files
Multiple writers

18
Architecture

19
Architecture
•
•
•
•

One NameNode, many DataNodes
File is a sequence of blocks (64 Mb)
Meta data in memory of namenode
Cl...
Architecture

21
Big blocks
•

Let’s say we want to achieve 1% seek
overhead

•
•

Seek time ~10ms, transfer rate ~100 Mb/s
So block size m...
NameNode

23
NameNode
•
•
•

Manages the FS namespace

•

Makes all decisions regarding replication of
blocks

•

SPOF

Keeps in Memory...
NameNode
•

NS stored in 2 files: namespace image, edit
log

•

Determines the mapping of blocks to
DataNodes

•

Receives ...
NameNode

26
NameNode
•
•
•
•

Store persistent state to several file system
Secondary MasterNode
HDFS High-Availability
HDFS Federation...
28
DataNode

29
DataNode
•
•
•

Usually 1 DataNode 1 machine
Serves read and write client requests
Performs block creation, deletion, and
...
DataNode
•

Each block represented as 2 files: data,
metadata

•

Metadata contains checksums, generation
stamp

•

Handsha...
DataNode
•
•
•
•
•

NameNode never calls DataNodes
Replicate block
Remove local replica
Re-register/shut down
Send an imme...
Data Flow

33
Distance
•
•
•

HDFS is rack aware
Network is presented as tree
d(n1, n2) = d(n1, A) + d (n2, A), A -- closest
common ance...
Distance

35
Read
•

Get list of blocks locations from
NameNode

•
•

Iterate over blocks and read
Verifies checksums

36
Read
•

On fail or when unable to connect DN:

•
•
•

try another replica
remember that
tell NameNode

37
Read

38
Read
final Configuration conf = new Configuration();
final FileSystem fs = FileSystem.get(conf);
InputStream in = null;
try {
...
Write
•
•
•
•

Ask NameNode to create file
NameNode performs some checks
Client ask for list of DataNodes
Forms a pipeline ...
Write
•
•

Send packets to the pipeline
In case of failure

•
•
•
•

close pipeline
remove bad DN, tell NN
retry sending t...
Write

42
Write
final Configuration conf = new Configuration();
final FileSystem fs = FileSystem.get(conf);
FSDataOutputStream out = nul...
Block Placement
•
•

Reliability/Bandwidth trade off
By default: same node, 2 random nodes
from another rack, other random...
Tools

45
Balancer
•

Compare utilization of node with
utilization of cluster

•

Guarantees that the decision does not
reduce eithe...
Block scanner
•

Periodically scans replica, verifies
checksums

•
•

Notifies NameNode if checksum fails
Replicate first, th...
Performance

48
Performance
•
•

Cluster ~3500 nodes

•
•

DFSIO Read: 66 MB /s per node

Total bandwidth is linear to number of
nodes
DFS...
Performance
•
•
•
•
•
•

Open file for read: 126100
Create file: 5600
Rename file: 8300
Delete file: 20700
DataNode heartbeat:...
Sources
•

The Hadoop Distributed File System (K.
Shvacko)

•
•

HDFS Architecture Guide
Hadoop: The Definitive Guide (O’Re...
HDFS
Леонид Налчаджи
leonid.nalchadzhi@gmail.com
Upcoming SlideShare
Loading in...5
×

2013 10 21_db_lecture_06_hdfs

1,074

Published on

Published in: Technology, Education
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,074
On Slideshare
0
From Embeds
0
Number of Embeds
10
Actions
Shares
0
Downloads
27
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

2013 10 21_db_lecture_06_hdfs

  1. 1. HDFS Леонид Налчаджи leonid.nalchadzhi@gmail.com
  2. 2. Intro • • • Hadoop ecosystem History Setups 2
  3. 3. HDFS • • • • • • • Features Architecture NameNode DataNode Data flow Tools Performance 3
  4. 4. Hadoop ecosystem 4
  5. 5. Doug Cutting 5
  6. 6. History • • • 2002: Apache Nutch • • 2005: MapReduce 2003: GFS 2004: Nutch Distributed Filesystem (NDFS) 2006: subproject of Lucene => Hadoop 6
  7. 7. History • • • • • 2006:Yahoo! Research cluster: 300 nodes 2006: BigTable paper 2007: Initial HBase prototype 2007: First usable HBase (Hadoop 0.15.0) 2008: 10,000-core Hadoop cluster 7
  8. 8. Яндекс • 2 MapReduce clusters MapReduce (50 nodes, 100 nodes) • • • 2 HBase clusters 500 nodes cluster is on its way :) Fact extraction, data mining, statistics, reporting, analytics 8
  9. 9. Facebook • • 2 major clusters (1100 and 300 nodes) • 8B messages/day, 2Pb online data, growing 250 Tb/ months Messages; machine learning, log storage, reporting, analytics 9
  10. 10. EBay • • 532 nodes cluster (8 * 532 cores, 5.3PB). • Search optimization and Research Heavy usage of MapReduce, HBase, Pig, Hive 10
  11. 11. 11
  12. 12. HDFS 12
  13. 13. Features 13
  14. 14. Features • • • • • • Distributed file system Replicated, highly fault-tolerant Commodity hardware Aimed for Map Reduce Pure java Large files, huge datasets 14
  15. 15. Features • • • Traditional hierarchical file organization Permissions No soft links 15
  16. 16. Features • • Streaming data access • Strictly one writer Write-once-read-many access model for files 16
  17. 17. Hardware Failure • Hardware failure is the norm rather than the exception • Detection of faults and quick, automatic recovery 17
  18. 18. Not good for • • • • Multiple data centers Low latency data access (<10 ms) Lots of small files Multiple writers 18
  19. 19. Architecture 19
  20. 20. Architecture • • • • One NameNode, many DataNodes File is a sequence of blocks (64 Mb) Meta data in memory of namenode Client interacts with datanodes directly 20
  21. 21. Architecture 21
  22. 22. Big blocks • Let’s say we want to achieve 1% seek overhead • • Seek time ~10ms, transfer rate ~100 Mb/s So block size must be ~100Mb 22
  23. 23. NameNode 23
  24. 24. NameNode • • • Manages the FS namespace • Makes all decisions regarding replication of blocks • SPOF Keeps in Memory all files and blocks Executes FS operations like opening, closing, and renaming 24
  25. 25. NameNode • NS stored in 2 files: namespace image, edit log • Determines the mapping of blocks to DataNodes • Receives a Heartbeat and a Blockreport from each of the DataNodes 25
  26. 26. NameNode 26
  27. 27. NameNode • • • • Store persistent state to several file system Secondary MasterNode HDFS High-Availability HDFS Federation 27
  28. 28. 28
  29. 29. DataNode 29
  30. 30. DataNode • • • Usually 1 DataNode 1 machine Serves read and write client requests Performs block creation, deletion, and replication 30
  31. 31. DataNode • Each block represented as 2 files: data, metadata • Metadata contains checksums, generation stamp • Handshake while starting the cluster to verify ID of namespace and SW version 31
  32. 32. DataNode • • • • • NameNode never calls DataNodes Replicate block Remove local replica Re-register/shut down Send an immediate block report 32
  33. 33. Data Flow 33
  34. 34. Distance • • • HDFS is rack aware Network is presented as tree d(n1, n2) = d(n1, A) + d (n2, A), A -- closest common ancestor 34
  35. 35. Distance 35
  36. 36. Read • Get list of blocks locations from NameNode • • Iterate over blocks and read Verifies checksums 36
  37. 37. Read • On fail or when unable to connect DN: • • • try another replica remember that tell NameNode 37
  38. 38. Read 38
  39. 39. Read final Configuration conf = new Configuration(); final FileSystem fs = FileSystem.get(conf); InputStream in = null; try { in = fs.open(new Path("/user/test-hdfs/somefile")); //do whatever with inpustream } finally { IOUtils.closeStream(in); } 39
  40. 40. Write • • • • Ask NameNode to create file NameNode performs some checks Client ask for list of DataNodes Forms a pipeline and ack queue 40
  41. 41. Write • • Send packets to the pipeline In case of failure • • • • close pipeline remove bad DN, tell NN retry sending to good DNs block will be replicated asynchronously 41
  42. 42. Write 42
  43. 43. Write final Configuration conf = new Configuration(); final FileSystem fs = FileSystem.get(conf); FSDataOutputStream out = null; try { out = fs.create(new Path("/user/test-hdfs/newfile")); //write to out } finally { IOUtils.closeStream(in); } 43
  44. 44. Block Placement • • Reliability/Bandwidth trade off By default: same node, 2 random nodes from another rack, other random nodes • No Datanode contains more than one replica of any block • No rack contains more than two replicas of the same block 44
  45. 45. Tools 45
  46. 46. Balancer • Compare utilization of node with utilization of cluster • Guarantees that the decision does not reduce either the number of replicas or the number of racks • Minimizes the inter-rack data copying 46
  47. 47. Block scanner • Periodically scans replica, verifies checksums • • Notifies NameNode if checksum fails Replicate first, then delete 47
  48. 48. Performance 48
  49. 49. Performance • • Cluster ~3500 nodes • • DFSIO Read: 66 MB /s per node Total bandwidth is linear to number of nodes DFSIO Write: 40 MB /s per node 49
  50. 50. Performance • • • • • • Open file for read: 126100 Create file: 5600 Rename file: 8300 Delete file: 20700 DataNode heartbeat: 300000 Block reports (block/s): 639700 50
  51. 51. Sources • The Hadoop Distributed File System (K. Shvacko) • • HDFS Architecture Guide Hadoop: The Definitive Guide (O’Reilly) 51
  52. 52. HDFS Леонид Налчаджи leonid.nalchadzhi@gmail.com
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×