HDFS Truncate:
Evolving Beyond Write-Once Semantics
Konstantin V. Shvachko
Plamen Jeliazkov
June 10, 2015
San Jose, CA
Agenda
Describing the file truncate feature implemented in HDFS-3107 and HDFS-7056
 Evolution of Data Mutation Semantics in HDFS
 Use cases for truncate
 Overview of HDFS architecture
• Focus on Data-flow: write, append, leases, block recovery
 The truncate design principles
• Comparison of different implementation approaches
 Truncate with snapshots and upgrades: “copy-on-truncate”
 Truncate testing and benchmark tools: DFSIO, Slive, NNThroughput
 Possibilities for further improvements
Data Mutation Semantics
Data Mutation Operations
Traditional file systems support random writes by concurrent clients into a single file
 Sequential writes: write to the end of a file (Tape model)
 Random writes: write to a file at a given offset (Disk model)
 Record append: records are atomically appended to a file (Object storage)
• Many clients write records concurrently. The offset is chosen by the system
 Snapshots: copy-on-write techniques
• Duplicate metadata referencing the same data
• Data is duplicated when modified
 Concurrent writers: multiple clients can write to the same file
Pre HDFS History
Google File System implements a variety of data mutation operations
 2003 GFS paper:
• Optimized for sequential IOs
• Support for random writes with concurrent writers
• Record appends
• Snapshots
 2005 Y! Juggernaut project
• Random writes with concurrent writers
HDFS Write Semantics History
Initially simplified to the bare bones to avoid complexity
 2006 HDFS
• Single writers, multiple readers
• Write-once semantics, no updates allowed
• Readers can see data only after the file is closed
 2008 First attempt to implement append (HADOOP-1700). Hadoop 1
• Solved the problem of reading data before the file is closed
 2009 Append revisited (HDFS-265). Hadoop 2
• Introduced lease / block recovery
 2013 Snapshots (HDFS-2802)
 2015 Truncate (HDFS-3107)
Operation Truncate
Truncate Operation
In traditional file systems, truncate is the operation that changes a file’s length
 The truncate file routine is supported by all traditional file systems
• Allows readjusting a file’s length to a new value in either direction
 POSIX truncate() and Java RandomAccessFile.setLength() allow
• Shrinking a file to a shorter length by detaching and deleting the file’s tail
• Expanding a file to a larger length by padding zeros at the end of the file (see the sketch below)
[Figure: original file F with blocks B1–B4 at the old length; shorter file F’ truncated to a new length; longer file F’’ extended with block B5 to a new length]
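For illustration, a minimal local-file sketch of the traditional setLength() semantics, using an arbitrary path and lengths:

import java.io.IOException;
import java.io.RandomAccessFile;

public class LocalSetLengthSketch {
    public static void main(String[] args) throws IOException {
        // /tmp/example.dat and the lengths below are illustrative only
        try (RandomAccessFile f = new RandomAccessFile("/tmp/example.dat", "rw")) {
            f.setLength(4096);  // expand: the file grows (new bytes are unspecified, typically zeros)
            f.setLength(1024);  // shrink: bytes beyond offset 1024 are discarded
        }
    }
}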
HDFS Truncate
Allows reducing the file length only
 HDFS truncate detaches and deletes the tail of a file,
thus shrinking it to the new length
[Figure: file F with blocks B1–B4 truncated from the old length to a new length, discarding the tail]
Motivation for Truncate
“Having an append without a truncate is a serious deficiency”
 Undo data updates added by mistake, as in failed append
• Combination of truncates and appends models random writes
 Data copy fails with a large file partly copied (DistCp)
• Current solution – restart copying from the beginning – can be optimized
• Truncate files to the last known common length or a block boundary and
restart copying the remaining part
 Transaction support: HDFS file used as a reliable journal store (Hawk)
• Begin-transaction remembers the offset in the journal file
• Abort-transaction rolls back the journal by truncating to that offset (see the sketch after this list)
 Improved support for Fuse and NFSv3 connectors to HDFS
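Below is a minimal sketch of the abort-transaction pattern, using the FileSystem truncate API described later in this deck; the journal path, record payload, and class name are illustrative assumptions, not part of any actual journal implementation:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class JournalAbortSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path journal = new Path("/journal/edits.log");   // hypothetical journal file

        // Begin-transaction: remember the current end of the journal
        long txStart = fs.getFileStatus(journal).getLen();

        // Write records that belong to the transaction
        try (FSDataOutputStream out = fs.append(journal)) {
            out.write("partial transaction records".getBytes(StandardCharsets.UTF_8));
        }

        // Abort-transaction: roll the journal back to the remembered offset
        boolean ready = fs.truncate(journal, txStart);
        // ready == false means last-block recovery is still running in the background
    }
}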
HDFS
State of the Art
HDFS Architecture
Reliable distributed file system for storing very large data sets
 HDFS metadata is decoupled from data
• Namespace is a hierarchy of files and directories represented by INodes
• INodes record attributes: permissions, quotas, timestamps, replication
 NameNode keeps its entire state in RAM
• Memory state: the namespace tree and the mapping of blocks to DataNodes
• Persistent state: recent checkpoint of the namespace and journal log
 File data is divided into blocks (default 128MB)
• Each block is independently replicated on multiple DataNodes (default 3)
• Block replicas stored on DataNodes as local files on local drives
HDFS Cluster
Standard HDFS Architecture
 Single active NameNode
fails over to Standby using
ZooKeeper and QJM
 Thousands of DataNodes
store data blocks
 Tens of thousands of HDFS
clients connected to active
NameNode
HDFS Operations Workflow
Namespace operations are decoupled from data transfers
 Active NameNode workflow
1. Receive request from a client,
2. Apply the update to its memory state,
3. Record the update as a journal transaction in persistent storage,
4. Return result to the client
 HDFS Client (read or write to a file)
• Send request to the NameNode, receive replica locations
• Read or write data from or to DataNodes
 DataNode
• Data transfer to / from clients and between DataNodes
• Report replica states to NameNode(s): new, deleted, corrupt
• Report its own state to NameNode(s): heartbeats, block reports
HDFS Write
1. To write a block of a file,
the client requests a list
of candidate DataNodes
from the NameNode
2. The client organizes a write pipeline and transfers the data
3. DataNodes report
received replicas to the
NameNode
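From the client’s point of view this workflow is hidden behind the standard FileSystem API; a minimal write sketch, with an illustrative path and payload:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // create() registers the file with the NameNode; as each block fills, the client
        // obtains candidate DataNodes and pushes data through the write pipeline
        try (FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
            out.writeBytes("hello, HDFS\n");
        } // close() completes the last block; DataNodes report the new replicas to the NameNode
    }
}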
Write Leases
Single writer semantics is enforced through file leases
 HDFS client that opens a file for writing is granted a lease
• Only one client can hold a lease on a single file
• Client periodically renews the lease by sending heartbeats to the NameNode
 Lease duration is bound by a soft limit and a hard limit
• Until the soft limit expires, the writer is certain of exclusive access to the file
• After soft limit (10 min): any client can reclaim the lease
• After hard limit (1 hour): the NameNode forcibly closes the file and revokes the lease
 Writer's lease does not prevent other clients from reading the file
Lease and Block Recovery
If a client fails to close a file, the replicas of its last block may be inconsistent with each other
 Client failure is determined by lease expiration
• Lease expiration triggers block recovery to synchronize last block replicas
 NameNode assigns new generation stamp to the block under recovery
• Designates primary DataNode for the recovery
• Sends request to recover replicas to the primary DataNode
 The primary coordinates the length adjustment of the replicas with the other DataNodes
• Then confirms the resulting block length to NameNode
• NameNode updates the block and closes the file upon such confirmation
 Failure to recover triggers another recovery with higher generation stamp
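For reference, a client can also trigger this recovery explicitly; a hedged sketch using DistributedFileSystem.recoverLease() and isFileClosed(), assuming the default file system is HDFS and using an illustrative path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class LeaseRecoverySketch {
    public static void main(String[] args) throws Exception {
        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(new Configuration());  // assumes fs.defaultFS is HDFS
        Path file = new Path("/data/abandoned.log");                      // a file whose writer died

        // Ask the NameNode to start lease recovery instead of waiting for the soft/hard limits
        boolean closed = dfs.recoverLease(file);
        while (!closed) {
            Thread.sleep(1000);               // block recovery proceeds in the background
            closed = dfs.isFileClosed(file);
        }
    }
}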
Truncate Implementation
Early Implementation Approaches
Involved the client as a coordinator of the truncate workflow
 Block boundary only truncate + append
• Truncate to the largest block boundary, append the remaining delta
 Truncate with concat - an HDFS operation that allows combining blocks of several files into one
• Identify the delta between the largest block boundary and the new length
• Copy the delta into a temporary file
• Concatenate the original file truncated to the largest block boundary with the delta file
[Figure: file F with blocks B1–B4; the delta between the largest block boundary and the new length]
The Design: In-place Truncate
Block replicas are truncated in place when there are no snapshots or upgrades in progress
 Truncate uses lease recovery workflow
1. NameNode receives truncate call from a client, ensures the file is closed
• Whole blocks are invalidated instantaneously. Return if nothing more to truncate
• If not on the block boundary, then NameNode sets file length to newLength, opens the file for
write, assigns a lease for it, persists truncate operation in editsLog, and returns to the client
2. NameNode designates a DataNode as primary for the last block recovery, and sends a
DatanodeCommand to it, which contains the new length of the replica
3. Primary DataNode synchronizes the new length between the replicas, and confirms the
result to the NameNode
4. NameNode completes truncate by releasing the lease and closing the file
Truncate Workflow
Atomic update of the namespace followed by the background truncate recovery
[Figure: the client sends truncate (1) to the active NameNode; the NameNode sends a recovery command (2) to the primary DataNode P; P synchronizes (3) the replicas across the DataNodes and confirms (4) to the NameNode]
Snapshots
A file snapshot duplicates file metadata, but does not copy data blocks
 File on the NameNode is represented by
• INode contains file attributes: file length, permissions, times, quotas, replication
• List of blocks belonging to the file, including replica locations for each block
 File snapshot is a state of the file at a particular point in time
 Snapshots are maintained on the NameNode by duplicating metadata
• Pre-truncate file snapshot is represented by an immutable copy of INode only
• List of blocks could only grow with appends – no need to duplicate the list
Snapshot support for truncate
Copy-on-truncate recovery is needed to create a snapshot copy of the last block
 Snapshots are maintained on the NameNode by duplicating metadata
• File snapshot is represented by an immutable copy of INode and
• Possibly a list of blocks, also immutable
o The block list is duplicated only when a truncate is performed
 DataNode duplicates the block being truncated
• The old replica belongs to the snapshot. The new replica is referenced by the file
 Copy-on-truncate recovery is also handled by lease recovery workflow
• Replica recovery command carries the new block identified by <blockID, genStamp, blockLen>
• DataNodes keep the old replica unchanged, and copy its part to the new replica
• Primary DataNode reports both replicas to the NameNode when recovery is done
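A hedged end-to-end sketch of this behavior, assuming the directory was made snapshottable by an administrator (hdfs dfsadmin -allowSnapshot) and that the paths, snapshot name, and length are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotTruncateSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir  = new Path("/data");            // snapshottable directory
        Path file = new Path(dir, "f.log");

        fs.createSnapshot(dir, "s0");              // snapshot the directory that contains the file
        fs.truncate(file, 1024);                   // copy-on-truncate: the old last block stays with s0

        // The pre-truncate bytes remain readable through the snapshot path
        long lenInSnapshot = fs.getFileStatus(new Path(dir, ".snapshot/s0/f.log")).getLen();
        System.out.println("length in snapshot s0: " + lenInSnapshot);
    }
}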
Synopsis
Truncate file src to the specified newLength
 Returns:
• true if the file was truncated to the desired newLength and is immediately available to be
reused for write operations such as append, or
• false if a background process of adjusting the length of the last block has been started, and
clients should wait for it to complete before they can proceed with further file updates
 Throws IOException in the following error cases:
• If src is a directory
• If src does not exist or current user does not have write permissions for the file
• If src is opened for write
• If newLength is greater than the current file length
public boolean truncate(Path src, long newLength)
throws IOException;
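A hedged usage sketch of this API; the path and new length are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TruncateCallSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path src = new Path("/data/f.log");

        boolean ready = fs.truncate(src, 1_000_000L);
        if (ready) {
            // newLength was on a block boundary: the file is immediately available for append
        } else {
            // background recovery of the last block has started; see the waiting sketch under Semantics
        }
    }
}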
Semantics
Truncate cuts the tail of the specified file to the specified new length
 After the truncate operation returns to the client, new readers can read data only up to newLength
• Reads beyond the new file length result in EOF
• Old readers can still see the old bytes until replica recovery is complete
 After the truncate operation returns, the file may or may not be immediately available for writes
• If the newLength is on the block boundary, then the file can be appended to
immediately
• Otherwise clients should wait for the truncate recovery to complete
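A hedged sketch of the recommended client behavior, waiting for truncate recovery before appending; it assumes the default file system is HDFS and uses illustrative names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class TruncateThenAppendSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path src = new Path("/data/f.log");

        if (!fs.truncate(src, 12345L)) {
            // Mid-block truncate: wait for the background recovery to finish before writing again
            DistributedFileSystem dfs = (DistributedFileSystem) fs;  // assumes fs.defaultFS is HDFS
            while (!dfs.isFileClosed(src)) {
                Thread.sleep(100);
            }
        }
        try (FSDataOutputStream out = fs.append(src)) {
            out.writeBytes("new tail\n");
        }
    }
}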
Testing and Benchmarking
Truncate testing tools
 The DFSIO benchmark
• Measures average throughput for read, write and append operations
• Now also includes truncate
 Slive is a MapReduce application
• Randomly generates HDFS operations
• Highly adjustable
• Includes truncate
Truncate Benchmarks
Using DFSIO, measure performance of the following operations
 Create large files
 Append data to a file
 Truncate files to a block boundary
 Truncate file in the middle of a block without waiting for recovery
 Truncate file in the middle of a block and waiting for recovery
 Truncate file with a snapshot to cause copy-on-write truncate, with recovery
NNThroughput Benchmark
Pure metadata operations without RPC overhead
[Chart: Performance (ops/sec) of Create, Mkdirs, Delete, and Truncate (10 to 5 blocks); 1 million operations with 1000 threads]
Further Improvements
 Currently files in HDFS are updated like tape devices
• Now, finally, with rewind
 Allow the same client to truncate a file it has not yet closed
• Optimization for transaction journal use case
 Implement random writes
• It is known that it can be done
 Concurrent writers: many clients writing records to the same file
• Atomic append: each record is guaranteed to be written, but at an unknown offset
Thanks to the Community
Many people contributed to the truncate project
 Konstantin Shvachko
 Plamen Jeliazkov
 Tsz Wo Nicholas Sze
 Jing Zhao
 Colin Patrick McCabe
 Yi Liu
 Milan Desai
 Byron Wong
 Konstantin Boudnik
 Lei Chang
 Milind Bhandarkar
 Dasha Boudnik
 Guo Ruijing
 Roman Shaposhnik
Thank You.
Questions?
Come visit WANdisco at Booth # P3