1
HDFS Tiered Storage
Virajith Jalaparti, Chris Douglas
Ewan Higgs, Thomas Demoor
• Tiered Storage [issues.apache.org]
– HDFS-9806
– HDFS-12090
Microsoft – Western Digital – Apache Community
2
Virajith Jalaparti
Chris Douglas


Ewan Higgs
Kasper Janssens
Thomas Demoor


• Hadoop Compatible FS [1]: s3a://, wasb://, adl://, …
• Direct IO between Hadoop apps and Object Store
• Disaggregated compute & storage
• HDFS NameNode functions taken up by Object Store
Hadoop already plays nicely with Object Stores
3
[Figure: apps in the Hadoop cluster read/write directly to the remote store]
[1]: https://s.apache.org/Hadoop3FSspec
• Pain points:
– Not really a FileSystem: rename, append, directories, ...
• Even with correct semantics, performance unlike HDFS
• HDFS features unavailable (e.g., hedged reads, snapshots, etc.)
– No locality
• Higher latency than attached storage
• Higher variance in both latency and throughput
– No HDFS integration
• Policies for users, permissions, quota, security, …
• Storage Plugins (e.g. Ranger, Sentry)
• External Storage Tier for HDFS
– HDFS Storage Policy: DISK, SSD, RAM, ARCHIVE, PROVIDED
• Share namespace, not only data!
– Keep 1-to-1 mapping: HDFS file ↔ external object
• No change to existing HDFS workflows
– Hadoop Apps interact with HDFS as before (fully transparent)
– Data Tiering happens async in background
– Native support for all HDFS features / admin tools
• Data Tiering controlled by admin
– On directory / file level
– Through Storage Policy (e.g. <HDD, HDD, HDD> → <PROVIDED>)
• HDFS NameNode scalability not a bottleneck
– HDFS manages the working set/compatibility
– Object store manages larger data lake, ingest, etc.
Goal: let HDFS play nicely with Object Stores
4
[Figure: apps read/write HDFS; HDFS loads data on demand from the remote store and writes back asynchronously]
“Mount” remote storage in HDFS
5
• Use HDFS to manage remote storage
– HDFS blocks correspond to fixed range of bytes in remote
– AliasMap (DWS17: youtu.be/kpNDZNp-Nlw)
– HDFS coordinates reads/writes to remote store
– Mount remote store as a PROVIDED tier in HDFS
– Set StoragePolicy to move data into HDFS
[Figure: remote namespace (d, e, f) mounted at /c in the HDFS namespace; the AliasMap maps each HDFS block to its remote location; apps read/write through HDFS, with write-through and on-demand load to/from the remote store]
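The AliasMap idea above can be sketched as a simple lookup table. This is a hypothetical, in-memory illustration (the type and method names below are invented for this sketch; the real AliasMap lives with the NameNode/DataNodes):

```python
# Hypothetical sketch of an AliasMap: each HDFS block ID maps to a
# fixed byte range of an object in the remote store.
from dataclasses import dataclass

@dataclass(frozen=True)
class RemoteRange:
    uri: str      # object in the remote store, e.g. "s3a://bucket/key"
    offset: int   # first byte of the range backing this block
    length: int   # number of bytes in the block

class AliasMap:
    def __init__(self):
        self._blocks = {}

    def put(self, block_id: int, entry: RemoteRange) -> None:
        self._blocks[block_id] = entry

    def resolve(self, block_id: int) -> RemoteRange:
        """Where a DataNode should read this PROVIDED block from."""
        return self._blocks[block_id]

# A 384 MB object split into three 128 MB HDFS blocks:
BLK = 128 * 1024 * 1024
amap = AliasMap()
for i, bid in enumerate([1001, 1002, 1003]):
    amap.put(bid, RemoteRange("s3a://bucket/data.bin", i * BLK, BLK))

assert amap.resolve(1002).offset == BLK
```

Because the mapping is from block to a fixed byte range, a DataNode serving a PROVIDED replica can translate any block read into a ranged GET against the remote object.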
PROVIDED storage on the READ path
6
[Figure: /foo/bar, /foo/baz, /foo/bazt, /foo/bazz served via PROVIDED storage; scenarios: IaaS, (De)Hydration, Delegation]
PROVIDED storage on the READ path
7
[Figure: setrep=2 on /foo hydrates bar, baz, bazt, bazz into local replicas; scenarios: IaaS, (De)Hydration, Delegation]
PROVIDED storage on the READ path
8
[Figure: Router-Based Federation mounts the remote namespace under /cloud; scenarios: IaaS, (De)Hydration, Delegation]
Apache Hadoop 3.1.0
• Generate FSImage from a FileSystem
– Start a NameNode serving remote data
– Serve from (a subset of) DataNodes in the cluster
• Backported and deployed in production at Microsoft
• Static: namespace changes are not reflected in HDFS NameNode
9
• Prototype code [2] with the PROVIDED abstraction
– Read-through caching of blocks (demand paging)
– Scheduled, metered prefetch for recurring pipelines with SLOs
– Write-through to remote (participant in the HDFS write pipeline)
– Wire FSImage to a running NameNode
• Per-application NameNodes; with isolation
• Bidirectional synchronization out of scope
[2]: https://github.com/Microsoft-CISL/hadoop/tree/tieredStore-sig16
Running Apache Hadoop in the cloud
10
• HDInsight/Elastic MapReduce (EMR)/etc.
• Disaggregation introduces not only latency, but also variance
• “Lift and shift” workloads
– Rely on HDFS plugins
– May need to use attached storage to meet SLOs
– Would otherwise require spending more for capacity to the remote store
[Chart: stddev/mean of access latency]
• HDFS can be used as a cache for Object Storage
• Similar to $my_favorite_caching_FS (CFS)?
– These are all caching systems that dispatch between storage systems horizontally
– We want to tier the storage systems vertically
• Support HDFS, not just Hadoop ecosystem around FileSystem
Notes on Caching
11
[Figure: left, compute nodes each fronted by $CFS dispatching between cloud storage and HDFS; right, the tiered design with compute over HDFS over cloud storage]
External Storage for HDFS
[Figure: Hadoop client ↔ HDFS cluster (NameNode, DataNode 1..N with PROVIDED volumes); fileA, fileB, dir under /reports map 1-to-1 to objects in an Object Store bucket]
12
• “DropBox for Hadoop”
– Hadoop cluster has complete namespace but only “data in working set” is stored locally
– Dynamically page in missing data from object store on read
– Asynchronously write back data to object store
• Storage Policies + Replication count offer rich placement options
– E.g.: hot data: <SSD, PROVIDED> / cold data: <PROVIDED>
• Dedicated object storage system more efficient ($$$)
– Similar goal as ARCHIVE storage policy
– Object storage features (erasure coding, multi-geo replication, …)
• Data sharing with non-Hadoop apps
– File-object mapping means objects can be accessed in remote store with REST API / SDKs
Use case: External Storage for HDFS
13
Community feedback at last year’s Summit
14
WD Activescale Object Storage
• Western Digital moving up the stack (Data Center Systems)
• Scale-out object storage system for Private & Public Cloud
• Key features:
– Compatible with Amazon S3 API
– Strong consistency (not eventual!)
– Erasure coding for efficient storage
• Scale:
– Petabytes per rack
– Billions of objects per rack
– Linear scalability in # of racks
• More info at http://www.hgst.com/products/systems
15
• AS AN Administrator
• I CAN configure HDFS with an object storage backend
hdfs storagepolicies -setStoragePolicy -policy PROVIDED -path /var/log
hdfs syncservice -create -backupOnly -name activescale /var/logs s3a://hadoop-logs/
• SO THAT when a user copies files to HDFS they are asynchronously copied to the synchronization endpoint
Demo time
16
Another example
• AS AN Administrator
• I CAN set the Storage Policy to be PROVIDED_ONLY
hdfs storagepolicies -setStoragePolicy -policy PROVIDED_ONLY -path /var/log
• SO THAT data is no longer in the Datanode but is transparently read through from the synchronization endpoint on access.
17
• Preserve file-object mapping
– AliasMap (last year’s talk – HDFS-9806): synchronize namespaces
– Datanodes collaborate to move blocks which together form an object in the destination system
• Minimize impact on frontend traffic / efficient data transfer
– Obvious: read all blocks into a single Datanode to reconstruct a file before transferring
– Efficient: transfer directly copies block-by-block out of the cluster using:
• S3: multipart upload
• WASB: append blobs
• HDFS: tmpdir + concat
• Flexible deployment: could run in NameNode OR as external service
– In NameNode is easy to deploy but adds resource pressure
– External service is more difficult to deploy for some sites but reduces resource pressure
– Ongoing community discussion; start with external, include internal option as required
Requirements
18
• MountManager manages all the local mount points
– Mount point can be configured to sync with external store
• Periodically create a diff by comparing snapshots of the mountpoint
– NEW SyncService (in/out NameNode)
• Generate a “phased plan” for ordering the operations in the diff
– Multiple ordered phases
• RENAMES_TO_TEMP, DELETES, RENAMES_TO_FINAL, CREATE_DIRS, CREATE_FILES
• e.g. dir creation before file creation
– Parallel operations within a phase
• Leverage multiple datanodes and connections to external store
• e.g. Upload multiple new files in parallel
• Execute plan and track work
– Namespace (metadata) operations originate from SyncService
– Data operations originate from DataNodes
– Tracking: admin can query mountpoint for progress
Deep Dive: Synchronization
19
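The phase ordering above can be sketched in a few lines. This is a hypothetical illustration (the op tuples and function names are invented for the sketch): diff operations are bucketed into the fixed phase order, and everything inside one phase is independent, so it can fan out across DataNodes.

```python
# Hypothetical sketch of the "phased plan": operations from a snapshot
# diff are bucketed into ordered phases; ops within one phase are
# independent and may run in parallel.
PHASES = ["RENAMES_TO_TEMP", "DELETES", "RENAMES_TO_FINAL",
          "CREATE_DIRS", "CREATE_FILES"]

def classify(op):
    kind = op[0]
    return {"rename_tmp": "RENAMES_TO_TEMP",
            "delete": "DELETES",
            "rename_final": "RENAMES_TO_FINAL",
            "mkdir": "CREATE_DIRS",
            "create": "CREATE_FILES"}[kind]

def phased_plan(diff_ops):
    """Group ops by phase, preserving the fixed phase order."""
    buckets = {p: [] for p in PHASES}
    for op in diff_ops:
        buckets[classify(op)].append(op)
    return [(p, buckets[p]) for p in PHASES if buckets[p]]

plan = phased_plan([
    ("create", "./f1.bin"),
    ("mkdir", "./a/b/c/d"),
    ("create", "./a/b/c/d/f1.bin"),
])
# Directory creation is ordered before file creation:
assert [p for p, _ in plan] == ["CREATE_DIRS", "CREATE_FILES"]
```

Note how the input order of the diff doesn't matter: the mkdir lands in an earlier phase than both creates, which matches the "dir creation before file creation" rule on the slide.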
• Snapshot diff:
– Reflects a 100% accurate point-in-time state of HDFS in the external store
– Snapshot ensures data remains referenceable: retains blocks of data
– Does not track create + delete in between consecutive snapshots (cf. file B in Fig.)
• EditLog post-processing:
– To parallelize
• Read batch from log and track lineage between overlapping operations
– HDFS operations might have altered reality: no point-in-time view
• Data not part of log: would require postponing block garbage collection
Tracking changes: Snapshot diff vs. EditLog
20
[Timeline: snapshots ss-5 and ss-6; file A persists across both, file B is created and deleted in between]
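The "does not track create + delete between snapshots" limitation can be seen with a toy set-based diff (illustrative only; the real SnapshotDiffReport compares INode state, not path sets):

```python
# Toy illustration: a snapshot diff compares two point-in-time states,
# so a file created AND deleted between them (file B) is invisible.
def snapshot_diff(before, after):
    return {"created": sorted(set(after) - set(before)),
            "deleted": sorted(set(before) - set(after))}

ss5 = {"/foo/A"}      # state at snapshot ss-5
# ... file /foo/B is created and then deleted here ...
ss6 = {"/foo/A"}      # state at snapshot ss-6

diff = snapshot_diff(ss5, ss6)
assert diff == {"created": [], "deleted": []}   # B never shows up
```

This is exactly the trade-off against EditLog post-processing, which does see every operation but gives up the clean point-in-time view.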
Example Diff – Simple Case
21
Simple Case - New dirs; new files
Commands
#given /basic-test
mkdir -p /basic-test/a/b/c
touch /basic-test/a/b/c/d/f1.bin
touch /basic-test/f1.bin
SnapshotDiffReport
M d .
+ d ./a
+ f ./f1.bin
PhasedPlan
+ d ./a/b/c/d/
+ f ./a/b/c/d/f1.bin
+ f ./f1.bin
Example Diff – Harder Case
22
Harder Case - Cycle
Commands
#given /swap-test/a.bin
#given /swap-test/b.bin
mv /swap-test/a.bin /swap-test/tmp
mv /swap-test/b.bin /swap-test/a.bin
mv /swap-test/tmp /swap-test/b.bin
SnapshotDiffReport
M d .
R f ./a.bin -> ./b.bin
R f ./b.bin -> ./a.bin
PhasedPlan
R f ./a.bin -> ./tmp/b.bin
R f ./b.bin -> ./tmp/a.bin
R f ./tmp/b.bin -> b.bin
R f ./tmp/a.bin -> a.bin
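The two-phase rename (RENAMES_TO_TEMP, then RENAMES_TO_FINAL) is what makes the a.bin/b.bin swap safe: applying either rename directly would clobber the other file. A hypothetical sketch over a dict-as-filesystem (names invented for the sketch):

```python
# Hypothetical sketch: breaking a rename cycle (a.bin <-> b.bin swap)
# by first renaming every source into a temp location, then renaming
# each temp file to its final destination.
def apply_renames(fs, renames):
    """fs: dict mapping path -> contents; mutated in place."""
    # Phase 1: RENAMES_TO_TEMP — move sources out of the way.
    for src, dst in renames:
        fs["./tmp" + dst[1:]] = fs.pop(src)
    # Phase 2: RENAMES_TO_FINAL — destinations are now free.
    for _, dst in renames:
        fs[dst] = fs.pop("./tmp" + dst[1:])

fs = {"./a.bin": "A", "./b.bin": "B"}
apply_renames(fs, [("./a.bin", "./b.bin"), ("./b.bin", "./a.bin")])
assert fs == {"./b.bin": "A", "./a.bin": "B"}   # contents swapped
```

A naive single-phase application of `./a.bin -> ./b.bin` would overwrite B's contents before the second rename could read them; the temp phase is what preserves both.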
• Tiered Storage HDFS-12090 [issues.apache.org]
– Design documentation
– List of subtasks, lots of linked tickets – take one!
– Discussion of scope, implementation, and feedback
• Bert Verslyppe, Hendrik Depauw, Íñigo Goiri, Rakesh Radhakrishnan, Uma Gangumalla, Daryn Sharp, Steve Loughran, Sanjay Radia, Anu Engineer, Jitendra Pandey, Andrew Wang, Zhe Zhang, Allen Wittenauer, and many others …
Thanks to the community for feedback & help!
23
Multipart Extra Slides
24
• Applications write to HDFS
– First to DISK, then SyncService asynchronously copies to synchronization endpoint
– When files have been copied, the extraneous disk replicas can be removed
Deep Dive: MultiPart Upload
25
[Figure: client writes File (Block1..Block3) to Datanodes; SyncService drives Multipart Init / PutPart / Complete against the External Store, with Datanodes uploading one part per block]
• Common concept in Object Storage
– Supported by S3, WASB
• Usage in Hadoop
– S3A uses it – see Steve Loughran’s talk
– New to HDFS – HDFS-13186
• Three phases
– UploadHandle initMultipart(Path filePath)
– PartHandle putPart(Path filePath, InputStream inputStream, int partNumber, UploadHandle uploadId, long lengthInBytes)
– void complete(Path filePath, List<Pair<Integer, PartHandle>> handles, UploadHandle multipartUploadId)
• Benefits:
– Object/File Isolation – you only see the results when it’s done
– Can be written in parallel across multiple nodes
MultipartUploader
26
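The three-phase contract and its two benefits (isolation, parallel parts) can be sketched with a toy in-memory uploader. This is a hypothetical illustration, not the HDFS-13186 API; the class and method names below are invented for the sketch:

```python
# Hypothetical in-memory sketch of the three-phase multipart contract:
# parts stay invisible until complete() stitches them together, and
# parts may be uploaded out of order (hence in parallel).
import uuid

class InMemoryMultipartUploader:
    def __init__(self):
        self.store = {}     # visible objects: path -> bytes
        self._pending = {}  # upload_id -> {part_number: bytes}

    def init_multipart(self, path):
        upload_id = str(uuid.uuid4())
        self._pending[upload_id] = {}
        return upload_id

    def put_part(self, path, data, part_number, upload_id):
        self._pending[upload_id][part_number] = data

    def complete(self, path, upload_id):
        parts = self._pending.pop(upload_id)
        # Assemble in part-number order, regardless of upload order.
        self.store[path] = b"".join(parts[n] for n in sorted(parts))

up = InMemoryMultipartUploader()
uid = up.init_multipart("/bucket/file")
up.put_part("/bucket/file", b"block2", 2, uid)   # out of order is fine
up.put_part("/bucket/file", b"block1", 1, uid)
assert "/bucket/file" not in up.store            # isolation until complete
up.complete("/bucket/file", uid)
assert up.store["/bucket/file"] == b"block1block2"
```

In the SyncService picture, each Datanode would call put_part for the block it holds, so a file's blocks never need to be gathered on one node before upload.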
MultipartUploader in SyncService
27
Sync Extra Slides
28
Create Directory
29
Delete Directory
30
Rename Directory
31
Create File
32
Delete File
33
Rename File
34
Modify File
35