HDFS: What is New in Hadoop 2
Sze Tsz-Wo Nicholas
施子和
December 6, 2013

© Hortonworks Inc. 2013


Published on

BDTC 2013 Beijing China


Transcript of "Nicholas:hdfs what is new in hadoop 2"

  1. HDFS: What is New in Hadoop 2
     Sze Tsz-Wo Nicholas (施子和)
     December 6, 2013
     © Hortonworks Inc. 2013
  2. About Me
     • 施子和 Sze Tsz-Wo Nicholas, Ph.D.
       – Software Engineer at Hortonworks
       – PMC Member at Apache Hadoop
       – One of the most active contributors/committers of HDFS
         • Started in 2007
       – Used Hadoop to compute the two-quadrillionth (2×10^15th) bit of π
         • It is the current world record. π = 3.141592654…
       – Received Ph.D. from the University of Maryland, College Park
         • Discovered a novel square root algorithm over finite fields.
     Architecting the Future of Big Data
  3. Agenda
     • New HDFS features in Hadoop 2
       – New appendable write pipeline
       – Multiple Namenode Federation
       – Namenode HA
       – File System Snapshots
  4. We have been hard at work…
     • Progress is being made in many areas
       – Scalability
       – Performance
       – Enterprise features
       – Ongoing operability improvements
       – Enhancements for other projects in the ecosystem
       – Expanding the Hadoop ecosystem to more platforms and use cases
     • 2192 commits in Hadoop in the last year
       – Almost a million lines of changes
       – ~150 contributors
       – Many new contributors: ~80 with fewer than 3 patches
     • 350K lines of changes in HDFS and Common
  5. Building on a Rock-solid Foundation
     • Original design choices: simple and robust
       – Single Namenode metadata server, with all state in memory
       – Fault tolerance: multiple replicas, active monitoring
       – Storage: relies on the OS's file system, not raw disk
     • Reliability
       – Over seven 9's of data reliability, less than 0.38 failures across 25 clusters
     • Operability
       – Small teams can manage large clusters
         • One operator per 3K-node cluster
       – Fast time to repair on node or disk failure
         • Minutes to an hour, vs. RAID array repairs taking many long hours
     • Scalable: proven by large-scale deployments, not bits
       – > 100 PB storage, > 400 million files, > 4500 nodes in a single cluster
       – ~100K nodes of HDFS in deployment and use
  6. New Appendable Write-Pipeline
  7. HDFS Write Pipeline
     • The write pipeline has been improved dramatically
       – Better durability
       – Better visibility
       – Consistency guarantees
       – Appendable data
     [Diagram: Writer → DN1 → DN2 → DN3; data flows down the pipeline and each datanode returns an ack]
  8. New Features in the Write Pipeline
     • Earlier versions of HDFS
       – Files were immutable
       – Write-once-read-many model
     • New features in Hadoop 2
       – Files can be reopened for append
       – New primitives: hflush and hsync
       – Read consistency
       – Replace datanode on failure
  9. HDFS hflush and hsync
     • Java flush (or C++ fflush)
       – Forces any buffered output bytes to be written out.
     • HDFS hflush
       – Flushes data to all the datanodes in the write pipeline
       – Guarantees the data is visible for reading
       – The data may still be in the datanodes' memory
     • HDFS hsync
       – hflush plus a local file system sync on the datanodes
       – May also update the file length in the Namenode
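The flush-versus-sync distinction has a local-file analogue that may help make it concrete. The sketch below uses Python's `flush()` and `os.fsync()` on a temporary file. It is only an analogy for hflush and hsync, not the HDFS client API; the helper name `write_flush_sync` and the file name are made up for illustration.

```python
import os
import tempfile

def write_flush_sync(path, payload):
    """Write payload, flush user-space buffers, then ask the OS to persist it."""
    with open(path, "wb") as f:
        f.write(payload)       # bytes may still sit in the process's buffer
        f.flush()              # analogous to flush/hflush: push buffered bytes onward
        os.fsync(f.fileno())   # analogous to hsync: request that the bytes reach the device
    return os.path.getsize(path)

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.log")   # hypothetical file name
    size = write_flush_sync(demo_path, b"edit-1\n")
    with open(demo_path, "rb") as f:
        content = f.read()
```

In this analogy, skipping `os.fsync` corresponds to hflush alone: the data is handed on and readable, but it is not guaranteed to be on disk yet.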
  10. Read Consistency
     • A reader may read data while it is being written
       – It can read from any datanode in the pipeline
       – and then fail over to any other datanode to read the same data
     [Diagram: Writer → DN1 → DN2 → DN3 with data and acks; a Reader reads from datanodes in the pipeline]
  11. In the past …
     • When a datanode fails, the pipeline is reconstructed with the remaining datanodes
     [Diagram: Writer, DN1, DN2, DN3 with data and acks; one datanode has failed]
     • When another datanode fails, only one datanode remains!
     [Diagram: Writer, DN1, DN2, DN3 with data and ack; two datanodes have failed]
  12. Replace Datanode on Failure
     • Add new datanodes to the pipeline
     [Diagram: Writer, DN1, DN2, DN3 with data and acks; a new datanode DN4 joins the pipeline]
     • User clients may choose the replacement policy
       – Performance vs. data reliability
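The difference between the old behaviour and replace-on-failure can be sketched with a toy model. The function `recover_pipeline` and the policy names `"NEVER"` and `"ALWAYS"` are illustrative assumptions, not the HDFS client's implementation or its configuration values.

```python
def recover_pipeline(pipeline, failed, spares, policy="ALWAYS"):
    """Return the write pipeline after one datanode failure (toy model).

    policy "NEVER":  continue with the remaining datanodes only.
    policy "ALWAYS": also add a replacement datanode if a spare exists.
    """
    remaining = [dn for dn in pipeline if dn != failed]
    if policy == "ALWAYS" and spares:
        remaining.append(spares[0])   # the replacement joins the pipeline
    return remaining

old_style = recover_pipeline(["DN1", "DN2", "DN3"], "DN2", ["DN4"], policy="NEVER")
new_style = recover_pipeline(["DN1", "DN2", "DN3"], "DN2", ["DN4"], policy="ALWAYS")
```

Adding a replacement keeps the replica count up (reliability) at the cost of bringing a new node up to date mid-write (performance), which is the trade-off the slide leaves to the client.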
  13. Multiple Namenode Federation
  14. HDFS Architecture
     [Diagram: Namespace layer: the Namenode holds the persistent namespace metadata and journal, the hierarchical namespace (File Name → BlockIDs) as namespace state, and the block map (Block ID → Block Locations). Block storage layer: datanodes (Block ID → Data) on JBOD disks hold replicated blocks b1, b2, b3, b5 and send heartbeats and block reports to the Namenode. IO and storage scale horizontally.]
  15. Single Namenode Limitations
     • Namespace size is limited by the namenode memory size
       – 64 GB of memory can support ~100M files and blocks
       – Solution: Federation
     • Single point of failure (SPOF)
       – The service is down when the namenode is down
       – Solution: HA
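The memory figure above implies a rough per-object cost, which the following back-of-the-envelope calculation works out. It assumes memory scales linearly with the number of files and blocks, and the 400M-object target is a hypothetical value chosen for illustration.

```python
heap_bytes = 64 * 1024**3     # 64 GB, from the slide
objects = 100_000_000         # ~100M files and blocks, from the slide

# Approximate namenode memory consumed per file or block object.
bytes_per_object = heap_bytes / objects

# Assumption: linear scaling. Heap needed for a hypothetical 400M objects, in GB.
heap_gb_for_400m = 400_000_000 * bytes_per_object / 1024**3
```

This gives roughly 687 bytes per object, so a single namenode would need about 256 GB of heap for 400M objects, which is the scaling pressure that Federation addresses.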
  16. Federation Cluster
     • Multiple namenodes and namespace volumes in a cluster
       – The namenodes/namespaces are independent
       – Scalability by adding more namenodes/namespaces
       – Isolation: separating applications into their own namespaces
       – Client-side mount tables/ViewFS for integrated views
     • Block storage as a generic storage service
       – Datanodes store blocks in block pools for all namespaces
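A client-side mount table can be pictured as a map from path prefixes to namenodes. The sketch below is a toy resolver, not ViewFS itself or its configuration syntax; the mount points and the `nn*.example.com` hosts are hypothetical.

```python
# Hypothetical mount table: client-visible prefix -> namenode that owns it.
MOUNT_TABLE = {
    "/user": "hdfs://nn1.example.com",
    "/data": "hdfs://nn2.example.com",
    "/tmp": "hdfs://nn3.example.com",
}

def resolve(path, table=MOUNT_TABLE):
    """Map a client-visible path to a full URI on the namenode owning its prefix."""
    for mount, namenode in table.items():
        if path == mount or path.startswith(mount + "/"):
            return namenode + path
    raise KeyError(f"no mount point covers {path}")

uri = resolve("/data/logs/2013/12/06")
```

Because the lookup happens in the client, the namenodes stay independent of one another while applications still see a single integrated namespace.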
  17. Multiple Namenode Federation
     [Diagram: Namespace layer: namenodes NN-1 … NN-k … NN-n serve namespaces NS1 … NS k … foreign NS n. Block storage layer: block pools Pool 1 … Pool k … Pool n. Datanodes DN 1, DN 2, … DN m provide common storage for all block pools.]
  18. Namenode HA
  19. High Availability – No SPOF
     • Support standby namenode and failover
       – Planned downtime
       – Unplanned downtime
     • Release 1.1
       – Cold standby
         • Requires reconstructing in-memory data structures during failover
       – Uses NFS as shared storage
       – Standard HA frameworks as failover controller
         • Linux HA and VMware vSphere
       – Suitable for small clusters of up to 500 nodes
  20. Hadoop Full Stack HA
     [Diagram: slave nodes of the Hadoop cluster running jobs, plus apps running outside the cluster; an HA cluster for master daemons hosts the NN and JT servers; on NN failover the JT goes into safemode]
  21. High Availability – Release 2.0
     • Support for hot standby
       – The standby namenode maintains in-memory data structures
     • Supports manual and automatic failover
     • Automatic failover with Failover Controller
       – Active NN election and failure detection using ZooKeeper
       – Periodic NN health check
       – Failover on NN failure
     • Removed shared storage dependency
       – Quorum Journal Manager
         • 3 to 5 Journal Nodes for storing the editlog
         • Each edit must be written to a quorum of Journal Nodes
     • Replay cache for correctness and transparent failovers
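The quorum rule for edits can be illustrated with a small simulation. This is a toy model under stated assumptions, not the Quorum Journal Manager's code: the `JournalNode` class and `quorum_write` function are hypothetical, and the model does not clean up edits left on a minority of nodes after a failed write.

```python
class JournalNode:
    """Toy journal node that stores edits while it is up."""
    def __init__(self, up=True):
        self.up = up
        self.edits = []

    def write(self, txid, edit):
        if not self.up:
            return False          # an unreachable node cannot acknowledge
        self.edits.append((txid, edit))
        return True

def quorum_write(journal_nodes, txid, edit):
    """Treat an edit as committed only if a strict majority acknowledges it."""
    acks = sum(1 for jn in journal_nodes if jn.write(txid, edit))
    return acks >= len(journal_nodes) // 2 + 1

jns = [JournalNode(), JournalNode(), JournalNode(up=False)]
committed_with_one_down = quorum_write(jns, 1, "mkdir /a")   # 2 of 3 acks
jns[1].up = False
committed_with_two_down = quorum_write(jns, 2, "mkdir /b")   # 1 of 3 acks
```

With three journal nodes the log tolerates one failure, and with five it tolerates two, which is why the slide suggests 3 to 5 nodes.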
  22. Namenode HA in Hadoop 2
     [Diagram: three ZooKeeper (ZK) nodes receive heartbeats from two FailoverControllers, one active and one standby. Each FailoverController monitors the health of its NN, OS and HW and sends commands to its NN. The active and standby NNs share state through a quorum of JournalNodes (JN). Datanodes send block reports to both the active and the standby NN. DN fencing: datanodes only obey commands from the active NN.]
     • Namenode HA has no external dependency
  23. File System Snapshots
  24. Before Snapshots…
     • Deleted files cannot be restored
       – Trash is buggy and not well understood
       – Trash works only for CLI-based deletion
     • No point-in-time recovery
     • No periodic snapshots to restore from
       – No admin/user managed snapshots
  25. HDFS Snapshot
     • Point-in-time image of the file system
     • Read-only
     • Copy-on-write
  26. Use Cases
     • Protection against user errors
     • Backup
     • Experimental/test setups
  27. Example: Periodic Snapshots for Backup
     • A typical snapshot policy:
         Take a snapshot      Keep it for
         every 15 mins        24 hrs
         every 1 hr           2 days
         every 1 day          14 days
         every 1 week         3 months
         every 1 month        1 year
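The policy above determines how many snapshots exist at any one time, which the following sketch counts per tier. The calendar conversions (3 months = 90 days, 1 month = 30 days, 1 year = 365 days) are assumptions made for the arithmetic, and the tier labels simply repeat the slide.

```python
MIN = 1
HOUR = 60 * MIN
DAY = 24 * HOUR

# (label, snapshot interval, retention), all in minutes.
# Assumptions: 3 months = 90 days, 1 month = 30 days, 1 year = 365 days.
POLICY = [
    ("every 15 mins", 15 * MIN, 24 * HOUR),
    ("every 1 hr", HOUR, 2 * DAY),
    ("every 1 day", DAY, 14 * DAY),
    ("every 1 week", 7 * DAY, 90 * DAY),
    ("every 1 month", 30 * DAY, 365 * DAY),
]

def retained(interval, keep_for):
    """Number of snapshots of one tier alive at steady state."""
    return keep_for // interval

counts = {label: retained(interval, keep) for label, interval, keep in POLICY}
total = sum(counts.values())
```

Under these assumptions the policy keeps 182 snapshots alive per directory, far below the 65,536 simultaneous snapshots the deck says are supported.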
  28. Design Goal: Efficiency
     • Storage efficiency
       – No block data copying
       – No metadata copying for unmodified files
     • Processing efficiency
       – No additional costs for processing current data
     • Cheap snapshot creation
       – Must be fast and lightweight
       – Must support a very large number of snapshots
  29. Design Goal: Features
     • Read-only
       – Files and directories in a snapshot are immutable
       – Nothing can be added to or removed from directories
     • Hierarchical snapshots
       – Snapshots of the entire namespace
       – Snapshots of subtrees
     • User operation
       – Users can take snapshots of their data
       – Admins manage where users can take snapshots
  30. HDFS-2802: Snapshot Development
     • Available in the Hadoop 2 GA release (v2.2.0)
     • Community-driven
       – Special thanks to those who provided valuable discussion and feedback on the feature requirements and the open questions
     • 136 subtask JIRAs
       – Mainly contributed by Hortonworks
     • The merge patch has about 28k lines
     • ~8 months of development
  31. Namenode-Only Operation
     • No complicated distributed mechanism
     • Snapshot metadata is stored in the Namenode
     • Datanodes have no knowledge of snapshots
     • The block management layer does not know about snapshots either
  32. Fast Snapshot Creation
     • Snapshot creation: O(1)
       – It just adds a record to an inode
     [Diagram: namespace tree with root /, directories d1 and d2, files f1, f2 and f3, and snapshot S1]
  33. Low Memory Overhead
     • NameNode memory usage: O(M)
       – M is the number of modified files/directories
       – Additional memory is used only when modifications are made relative to a snapshot
     [Diagram: namespace tree with root /, directories d1 and d2, files f1, f2, f3 and f4, and snapshot S1. Modifications: 1. rm f3; 2. add f4]
  34. File Blocks Sharing
     • Blocks in datanodes are not copied
       – The snapshot files record the block list and the file size
       – No data copying
     [Diagram: directory d under / with snapshots S1 and S2; the snapshot files f' and f'' and the current file f share blocks blk0, blk1, blk2 and blk3]
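Block sharing can be pictured as each snapshot remembering only a block list and a length. The sketch below is a toy model, not the Namenode's data structures: the `File` class, the block sizes, and which blocks each snapshot sees are assumptions chosen for illustration.

```python
class File:
    """Toy file whose snapshots record metadata only, never block data."""
    def __init__(self):
        self.blocks = []       # block IDs; the data itself lives on datanodes
        self.length = 0
        self.snapshots = {}    # snapshot name -> (block IDs, file length)

    def append(self, block_id, nbytes):
        self.blocks.append(block_id)
        self.length += nbytes

    def snapshot(self, name):
        # Record the block list and file size at this moment; no data is copied.
        self.snapshots[name] = (list(self.blocks), self.length)

f = File()
f.append("blk0", 128)
f.append("blk1", 128)
f.snapshot("S1")
f.append("blk2", 128)
f.snapshot("S2")
f.append("blk3", 64)
```

Every block ID in a snapshot also appears in the current file, so appends after a snapshot cost only the new blocks.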
  35. Persistent Data Structures
     • A well-known data structure for "time travel"
       – Supports querying previous versions of the data
     • Access slowdown
       – The additional time required by the data structure
     • In traditional persistent data structures
       – There is a slowdown when accessing both current data and snapshot data
     • In our implementation
       – No slowdown when accessing current data
       – Slowdown happens only when accessing snapshot data
  36. No Slowdown on Accessing Current Data
     • The current data can be accessed directly
       – Modifications are recorded in reverse chronological order
     • Snapshot data = Current data – Modifications
     [Diagram: current tree with root /, directories d1 and d2, files f1, f2 and f4, and snapshot S1. Modifications: 1. rm f3; 2. add f4. Undoing the modifications yields the snapshot view of d2 containing f2 and f3]
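The rule "snapshot data = current data minus modifications" can be sketched for a single directory. This is a simplified illustration, not the HDFS-2802 implementation: the `SnapshottableDir` class and its diff layout are assumptions, and children are tracked by name only.

```python
class SnapshottableDir:
    """Toy directory: current children plus one diff record per snapshot."""
    def __init__(self, children):
        self.children = set(children)   # current data, read directly
        self.diffs = []                 # (snapshot name, diff), oldest first

    def create_snapshot(self, name):
        # O(1): append an empty diff record; nothing is copied.
        self.diffs.append((name, {"created": set(), "deleted": set()}))

    def add(self, child):
        self.children.add(child)
        if self.diffs:
            self.diffs[-1][1]["created"].add(child)

    def remove(self, child):
        if child not in self.children:
            return
        self.children.discard(child)
        if self.diffs:
            diff = self.diffs[-1][1]
            if child in diff["created"]:
                diff["created"].discard(child)   # never existed in the snapshot
            else:
                diff["deleted"].add(child)       # the snapshot must still see it

    def view(self, name):
        """Rebuild a snapshot by undoing modifications, newest first."""
        state = set(self.children)
        for snap, diff in reversed(self.diffs):
            state = (state - diff["created"]) | diff["deleted"]
            if snap == name:
                return state
        raise KeyError(name)

d2 = SnapshottableDir({"f2", "f3"})
d2.create_snapshot("S1")
d2.remove("f3")   # modification 1: rm f3
d2.add("f4")      # modification 2: add f4
```

Reading `d2.children` touches no diff at all, while `d2.view("S1")` pays for replaying the recorded modifications, which matches the slide's claim that only snapshot access slows down.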
  37. Easy Management
     • Snapshots can be taken on any directory
       – Set the directory to be snapshottable
     • Supports 65,536 simultaneous snapshots
     • No limit on the number of snapshottable directories
       – Nested snapshottable directories are currently NOT allowed
  38. Admin Ops
     • Allow snapshots on a directory
       – hdfs dfsadmin -allowSnapshot <path>
     • Reset a snapshottable directory
       – hdfs dfsadmin -disallowSnapshot <path>
     • Example
  39. User Ops
     • Create/delete/rename snapshots
       – hdfs dfs -createSnapshot <path> [<snapshotName>]
       – hdfs dfs -deleteSnapshot <path> <snapshotName>
       – hdfs dfs -renameSnapshot <path> <oldName> <newName>
     • Get snapshottable directory listing
       – hdfs lsSnapshottableDir
     • Get snapshots difference report
       – hdfs snapshotDiff <path> <from> <to>
  40. Use Snapshot Paths in CLI
     • All regular commands and APIs can be used against snapshot paths
       – /<snapshottableDir>/.snapshot/<snapshotName>/foo/bar
     • List all the files in a snapshot
       – ls /test/.snapshot/s4
     • List all the snapshots under that path
       – ls <path>/.snapshot
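The snapshot path layout above is easy to build programmatically. The helper below is a hypothetical convenience function, not part of any Hadoop API; it only assembles the `/<snapshottableDir>/.snapshot/<snapshotName>/...` string shown on the slide.

```python
import posixpath

def snapshot_path(snapshottable_dir, snapshot_name, relative=""):
    """Build /<snapshottableDir>/.snapshot/<snapshotName>/<relative>."""
    joined = posixpath.join(
        snapshottable_dir, ".snapshot", snapshot_name, relative.lstrip("/")
    )
    return joined.rstrip("/")   # drop the trailing slash left by an empty relative part

root_of_s4 = snapshot_path("/test", "s4")
file_in_s4 = snapshot_path("/test", "s4", "foo/bar")
```

The resulting string can be passed to any regular command, for example `hdfs dfs -ls /test/.snapshot/s4`.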
  41. Testing Snapshot Functionality
     • ~100 unit tests
     • ~1.4 million generated system tests
       – Covering most combinations of (snapshot + rename) operations
     • Automated long-running tests for months
  42. NFS Support and Other Features
  43. NFS Support
     • NFS Gateway provides NFS access to HDFS
       – File browsing, data download/upload, data streaming
       – No client-side library
       – Better alternative to Hadoop + FUSE based solutions
         • Better consistency guarantees
     • Supports NFSv3
     • Stateless Gateway
       – Simpler design, easy to handle failures
     • Future work
       – High Availability for NFS Gateway
       – NFSv4 support?
  44. Other Features
     • Protobuf, wire compatibility
       – Post 2.0 GA: stronger wire compatibility
     • Rolling upgrades
       – With relaxed version checks
     • Improvements for other projects
       – Stale node detection to improve HBase MTTR
     • Block placement enhancements
       – Better support for other topologies such as VMs and Cloud
     • On-the-wire encryption
       – Both data and RPC
     • Expanding ecosystem, platforms and applicability
       – Native support for Windows
  45. Enterprise Readiness
     • Storage fault-tolerance: built into HDFS
       – 100% data reliability
     • High Availability
     • Standard interfaces
       – WebHDFS (REST), FUSE, NFS, HttpFs, libwebhdfs and libhdfs
     • Wire protocol compatibility
       – Protocol buffers
     • Rolling upgrades
     • Snapshots
     • Disaster recovery
       – Distcp for parallel and incremental copies across clusters
       – Apache Ambari and HDP for automated management
  46. Work in Progress
     • HDFS-2832: Heterogeneous storages
       – Datanode abstraction from a single storage to a collection of storages
       – Support different storage types: Disk and SSD
     • HDFS-5535: Zero-downtime rolling upgrade
       – Namenodes and Datanodes can be upgraded independently
       – No upgrade downtime
     • HDFS-4685: ACLs
       – More flexible than user-group-permission
  47. Future Work
     • HDFS-5477: Block manager as a service
       – Move block management out of the Namenode
       – Support different name services, e.g. key-value store
     • HDFS-3154: Immutable files
       – Write-once and then read-only
     • HDFS-4704: Transient files
       – Tmp files will not be recorded in snapshots
  48. Q&A
     • Myths and misinformation of HDFS
       – Not reliable (was never true)
       – Namenode dies, all state is lost (was never true)
       – Does not support disaster recovery (distcp in Hadoop 0.15)
       – Hard to operate for newcomers
       – Performance improvements (always ongoing)
     • Major improvements in 1.2 and 2.x
       – Namenode is a single point of failure
       – Needs shared NFS storage for HA
       – Does not have point-in-time recovery
     Thank You!