1
HDFS Tiered Storage
Virajith Jalaparti, Chris Douglas
Ewan Higgs, Thomas Demoor
• Tiered Storage [issues.apache.org]
– HDFS-9806
– HDFS-12090
Microsoft – Western Digital – Apache Community
2
Virajith Jalaparti
Chris Douglas


Ewan Higgs
Kasper Janssens
Thomas Demoor


• Hadoop Compatible FS [1]: s3a://, wasb://, adl://, …
• Direct IO between Hadoop apps and Object Store
• Disaggregated compute & storage
• HDFS NameNode functions taken up by Object Store
Hadoop already plays nicely with Object Stores
3
[Figure: apps in the Hadoop cluster read/write directly to the remote store]
[1]: https://s.apache.org/Hadoop3FSspec
• Pain points:
– Not really a FileSystem: rename, append, directories, ...
• Even with correct semantics, performance unlike HDFS
• HDFS features unavailable (e.g., hedged reads, snapshots, etc.)
– No locality
• Higher latency than attached storage
• Higher variance in both latency and throughput
– No HDFS integration
• Policies for users, permissions, quota, security, …
• Storage Plugins (e.g. Ranger, Sentry)
• External Storage Tier for HDFS
– HDFS Storage Policy: DISK, SSD, RAM, ARCHIVE, PROVIDED
• Share namespace, not only data!
– Keep 1-to-1 mapping: HDFS file ↔ external object
• No change to existing HDFS workflows
– Hadoop Apps interact with HDFS as before (fully transparent)
– Data Tiering happens async in background
– Native support for all HDFS features / admin tools
• Data Tiering controlled by admin
– On directory / file level
– Through Storage Policy (e.g. <HDD, HDD, HDD> → <PROVIDED>)
• HDFS NameNode scalability not a bottleneck
– HDFS manages the working set/compatibility
– Object store manages larger data lake, ingest, etc.
Goal: let HDFS play nicely with Object Stores
4
[Figure: apps read/write HDFS; HDFS loads data on demand from the remote store and writes back asynchronously]
“Mount” remote storage in HDFS
5
• Use HDFS to manage remote storage
– HDFS blocks correspond to fixed range of bytes in remote
– AliasMap (DWS17: youtu.be/kpNDZNp-Nlw)
– HDFS coordinates reads/writes to remote store
– Mount remote store as a PROVIDED tier in HDFS
– Set StoragePolicy to move data into HDFS
[Figure: remote namespace (d, e, f) mounted at /c in the HDFS namespace; the AliasMap maps each HDFS block to its remote location; apps read/write through HDFS, with write-through and on-demand load to/from the remote store]
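The AliasMap idea above can be sketched as a simple lookup table. This is a hypothetical, in-memory illustration (the type and method names below are invented for this sketch; the real AliasMap lives with the NameNode/DataNodes):

```python
# Hypothetical sketch of an AliasMap: each HDFS block ID maps to a
# fixed byte range of an object in the remote store.
from dataclasses import dataclass

@dataclass(frozen=True)
class RemoteRange:
    uri: str      # object in the remote store, e.g. "s3a://bucket/key"
    offset: int   # first byte of the range backing this block
    length: int   # number of bytes in the block

class AliasMap:
    def __init__(self):
        self._blocks = {}

    def put(self, block_id: int, entry: RemoteRange) -> None:
        self._blocks[block_id] = entry

    def resolve(self, block_id: int) -> RemoteRange:
        """Where a DataNode should read this PROVIDED block from."""
        return self._blocks[block_id]

# A 384 MB object split into three 128 MB HDFS blocks:
BLK = 128 * 1024 * 1024
amap = AliasMap()
for i, bid in enumerate([1001, 1002, 1003]):
    amap.put(bid, RemoteRange("s3a://bucket/data.bin", i * BLK, BLK))

assert amap.resolve(1002).offset == BLK
```

Because the mapping is from block to a fixed byte range, a DataNode serving a PROVIDED replica can translate any block read into a ranged GET against the remote object.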
PROVIDED storage on the READ path
6
[Figure: /foo/bar, /foo/baz, /foo/bazt, /foo/bazz served via PROVIDED storage; scenarios: IaaS, (De)Hydration, Delegation]
PROVIDED storage on the READ path
7
[Figure: setrep=2 on /foo hydrates bar, baz, bazt, bazz into local replicas; scenarios: IaaS, (De)Hydration, Delegation]
PROVIDED storage on the READ path
8
[Figure: Router-Based Federation mounts the remote namespace under /cloud; scenarios: IaaS, (De)Hydration, Delegation]
Apache Hadoop 3.1.0
• Generate FSImage from a FileSystem
– Start a NameNode serving remote data
– Serve from (a subset of) DataNodes in the cluster
• Backported and deployed in production at Microsoft
• Static: namespace changes are not reflected in HDFS NameNode
9
• Prototype code [2] with the PROVIDED abstraction
– Read-through caching of blocks (demand paging)
– Scheduled, metered prefetch for recurring pipelines with SLOs
– Write-through to remote (participant in the HDFS write pipeline)
– Wire FSImage to a running NameNode
• Per-application NameNodes; with isolation
• Bidirectional synchronization out of scope
[2]: https://github.com/Microsoft-CISL/hadoop/tree/tieredStore-sig16
Running Apache Hadoop in the cloud
10
• HDInsight/Elastic MapReduce (EMR)/etc.
• Disaggregation introduces not only latency, but also variance
• “Lift and shift” workloads
– Rely on HDFS plugins
– May need to use attached storage to meet SLOs
– Would otherwise require spending more for capacity to the remote store
[Chart: stddev/mean of access latency]
• HDFS can be used as a cache for Object Storage
• Similar to $my_favorite_caching_FS (CFS)?
– These are all caching systems that dispatch between storage systems horizontally
– We want to tier the storage systems vertically
• Support HDFS, not just Hadoop ecosystem around FileSystem
Notes on Caching
11
[Figure: left, compute nodes each fronted by $CFS dispatching between cloud storage and HDFS; right, the tiered design with compute over HDFS over cloud storage]
External Storage for HDFS
[Figure: Hadoop client ↔ HDFS cluster (NameNode, DataNode 1..N with PROVIDED volumes); fileA, fileB, dir under /reports map 1-to-1 to objects in an Object Store bucket]
12
• “DropBox for Hadoop”
– Hadoop cluster has complete namespace but only “data in working set” is stored locally
– Dynamically page in missing data from object store on read
– Asynchronously write back data to object store
• Storage Policies + Replication count offer rich placement options
– E.g.: hot data: <SSD, PROVIDED> / cold data: <PROVIDED>
• Dedicated object storage system more efficient ($$$)
– Similar goal as ARCHIVE storage policy
– Object storage features (erasure coding, multi-geo replication, …)
• Data sharing with non-Hadoop apps
– File-object mapping means objects can be accessed in remote store with REST API / SDKs
Use case: External Storage for HDFS
13
Community feedback at last year’s Summit
14
WD Activescale Object Storage
• Western Digital moving up the stack (Data Center Systems)
• Scale-out object storage system for Private & Public Cloud
• Key features:
– Compatible with Amazon S3 API
– Strong consistency (not eventual!)
– Erasure coding for efficient storage
• Scale:
– Petabytes per rack
– Billions of objects per rack
– Linear scalability in # of racks
• More info at http://www.hgst.com/products/systems
15
• AS AN Administrator
• I CAN configure HDFS with an object storage backend
hdfs storagepolicies -setStoragePolicy -policy PROVIDED -path /var/log
hdfs syncservice -create -backupOnly -name activescale /var/logs s3a://hadoop-logs/
• SO THAT when a user copies files to HDFS they are asynchronously copied to the synchronization endpoint
Demo time
16
Another example
• AS AN Administrator
• I CAN set the Storage Policy to be PROVIDED_ONLY
hdfs storagepolicies -setStoragePolicy -policy PROVIDED_ONLY -path /var/log
• SO THAT data is no longer in the Datanode but is transparently read through from the synchronization endpoint on access.
17
• Preserve file-object mapping
– AliasMap (last year’s talk – HDFS-9806): synchronize namespaces
– Datanodes collaborate to move blocks which together form an object in the destination system
• Minimize impact on frontend traffic / efficient data transfer
– Obvious: read all blocks into a single Datanode to reconstruct a file before transferring
– Efficient: transfer directly copies block-by-block out of the cluster using:
• S3: multipart upload
• WASB: append blobs
• HDFS: tmpdir + concat
• Flexible deployment: could run in NameNode OR as external service
– In NameNode is easy to deploy but adds resource pressure
– External service is more difficult to deploy for some sites but reduces resource pressure
– Ongoing community discussion; start with external, include internal option as required
Requirements
18
• MountManager manages all the local mount points
– Mount point can be configured to sync with external store
• Periodically create a diff by comparing snapshots of the mountpoint
– NEW SyncService (in/out NameNode)
• Generate a “phased plan” for ordering the operations in the diff
– Multiple ordered phases
• RENAMES_TO_TEMP, DELETES, RENAMES_TO_FINAL, CREATE_DIRS, CREATE_FILES
• e.g. dir creation before file creation
– Parallel operations within a phase
• Leverage multiple datanodes and connections to external store
• e.g. Upload multiple new files in parallel
• Execute plan and track work
– Namespace (metadata) operations originate from SyncService
– Data operations originate from DataNodes
– Tracking: admin can query mountpoint for progress
Deep Dive: Synchronization
19
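The phase ordering above can be sketched in a few lines. This is a hypothetical illustration (the op tuples and function names are invented for the sketch): diff operations are bucketed into the fixed phase order, and everything inside one phase is independent, so it can fan out across DataNodes.

```python
# Hypothetical sketch of the "phased plan": operations from a snapshot
# diff are bucketed into ordered phases; ops within one phase are
# independent and may run in parallel.
PHASES = ["RENAMES_TO_TEMP", "DELETES", "RENAMES_TO_FINAL",
          "CREATE_DIRS", "CREATE_FILES"]

def classify(op):
    kind = op[0]
    return {"rename_tmp": "RENAMES_TO_TEMP",
            "delete": "DELETES",
            "rename_final": "RENAMES_TO_FINAL",
            "mkdir": "CREATE_DIRS",
            "create": "CREATE_FILES"}[kind]

def phased_plan(diff_ops):
    """Group ops by phase, preserving the fixed phase order."""
    buckets = {p: [] for p in PHASES}
    for op in diff_ops:
        buckets[classify(op)].append(op)
    return [(p, buckets[p]) for p in PHASES if buckets[p]]

plan = phased_plan([
    ("create", "./f1.bin"),
    ("mkdir", "./a/b/c/d"),
    ("create", "./a/b/c/d/f1.bin"),
])
# Directory creation is ordered before file creation:
assert [p for p, _ in plan] == ["CREATE_DIRS", "CREATE_FILES"]
```

Note how the input order of the diff doesn't matter: the mkdir lands in an earlier phase than both creates, which matches the "dir creation before file creation" rule on the slide.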
• Snapshot diff:
– Reflects a 100% accurate point-in-time state of HDFS in the external store
– Snapshot ensures data remains referenceable: retains blocks of data
– Does not track create + delete in between consecutive snapshots (cf. file B in Fig.)
• EditLog post-processing:
– To parallelize
• Read batch from log and track lineage between overlapping operations
– HDFS operations might have altered reality: no point-in-time view
• Data not part of log: would require postponing block garbage collection
Tracking changes: Snapshot diff vs. EditLog
20
[Timeline: snapshots ss-5 and ss-6; file A persists across both, file B is created and deleted in between]
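The "does not track create + delete between snapshots" limitation can be seen with a toy set-based diff (illustrative only; the real SnapshotDiffReport compares INode state, not path sets):

```python
# Toy illustration: a snapshot diff compares two point-in-time states,
# so a file created AND deleted between them (file B) is invisible.
def snapshot_diff(before, after):
    return {"created": sorted(set(after) - set(before)),
            "deleted": sorted(set(before) - set(after))}

ss5 = {"/foo/A"}      # state at snapshot ss-5
# ... file /foo/B is created and then deleted here ...
ss6 = {"/foo/A"}      # state at snapshot ss-6

diff = snapshot_diff(ss5, ss6)
assert diff == {"created": [], "deleted": []}   # B never shows up
```

This is exactly the trade-off against EditLog post-processing, which does see every operation but gives up the clean point-in-time view.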
Example Diff – Simple Case
21
Simple Case - New dirs; new files
Commands
#given /basic-test
mkdir -p /basic-test/a/b/c
touch /basic-test/a/b/c/d/f1.bin
touch /basic-test/f1.bin
SnapshotDiffReport
M d .
+ d ./a
+ f ./f1.bin
PhasedPlan
+ d ./a/b/c/d/
+ f ./a/b/c/d/f1.bin
+ f ./f1.bin
Example Diff – Harder Case
22
Harder Case - Cycle
Commands
#given /swap-test/a.bin
#given /swap-test/b.bin
mv /swap-test/a.bin /swap-test/tmp
mv /swap-test/b.bin /swap-test/a.bin
mv /swap-test/tmp /swap-test/b.bin
SnapshotDiffReport
M d .
R f ./a.bin -> ./b.bin
R f ./b.bin -> ./a.bin
PhasedPlan
R f ./a.bin -> ./tmp/b.bin
R f ./b.bin -> ./tmp/a.bin
R f ./tmp/b.bin -> b.bin
R f ./tmp/a.bin -> a.bin
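The two-phase rename (RENAMES_TO_TEMP, then RENAMES_TO_FINAL) is what makes the a.bin/b.bin swap safe: applying either rename directly would clobber the other file. A hypothetical sketch over a dict-as-filesystem (names invented for the sketch):

```python
# Hypothetical sketch: breaking a rename cycle (a.bin <-> b.bin swap)
# by first renaming every source into a temp location, then renaming
# each temp file to its final destination.
def apply_renames(fs, renames):
    """fs: dict mapping path -> contents; mutated in place."""
    # Phase 1: RENAMES_TO_TEMP — move sources out of the way.
    for src, dst in renames:
        fs["./tmp" + dst[1:]] = fs.pop(src)
    # Phase 2: RENAMES_TO_FINAL — destinations are now free.
    for _, dst in renames:
        fs[dst] = fs.pop("./tmp" + dst[1:])

fs = {"./a.bin": "A", "./b.bin": "B"}
apply_renames(fs, [("./a.bin", "./b.bin"), ("./b.bin", "./a.bin")])
assert fs == {"./b.bin": "A", "./a.bin": "B"}   # contents swapped
```

A naive single-phase application of `./a.bin -> ./b.bin` would overwrite B's contents before the second rename could read them; the temp phase is what preserves both.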
• Tiered Storage HDFS-12090 [issues.apache.org]
– Design documentation
– List of subtasks, lots of linked tickets – take one!
– Discussion of scope, implementation, and feedback
• Bert Verslyppe, Hendrik Depauw, Íñigo Goiri, Rakesh Radhakrishnan, Uma Gangumalla, Daryn Sharp, Steve Loughran, Sanjay Radia, Anu Engineer, Jitendra Pandey, Andrew Wang, Zhe Zhang, Allen Wittenauer, and many others …
Thanks to the community for feedback & help!
23
Multipart Extra Slides
24
• Applications write to HDFS
– First to DISK, then SyncService asynchronously copies to synchronization endpoint
– When files have been copied, the extraneous disk replicas can be removed
Deep Dive: MultiPart Upload
25
[Figure: client writes File (Block1..Block3) to Datanodes; SyncService drives Multipart Init / PutPart / Complete against the External Store, with Datanodes uploading one part per block]
• Common concept in Object Storage
– Supported by S3, WASB
• Usage in Hadoop
– S3A uses it – see Steve Loughran’s talk
– New to HDFS – HDFS-13186
• Three phases
– UploadHandle initMultipart(Path filePath)
– PartHandle putPart(Path filePath, InputStream inputStream, int partNumber, UploadHandle uploadId, long lengthInBytes)
– void complete(Path filePath, List<Pair<Integer, PartHandle>> handles, UploadHandle multipartUploadId)
• Benefits:
– Object/File Isolation – you only see the results when it’s done
– Can be written in parallel across multiple nodes
MultipartUploader
26
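The three-phase contract and its two benefits (isolation, parallel parts) can be sketched with a toy in-memory uploader. This is a hypothetical illustration, not the HDFS-13186 API; the class and method names below are invented for the sketch:

```python
# Hypothetical in-memory sketch of the three-phase multipart contract:
# parts stay invisible until complete() stitches them together, and
# parts may be uploaded out of order (hence in parallel).
import uuid

class InMemoryMultipartUploader:
    def __init__(self):
        self.store = {}     # visible objects: path -> bytes
        self._pending = {}  # upload_id -> {part_number: bytes}

    def init_multipart(self, path):
        upload_id = str(uuid.uuid4())
        self._pending[upload_id] = {}
        return upload_id

    def put_part(self, path, data, part_number, upload_id):
        self._pending[upload_id][part_number] = data

    def complete(self, path, upload_id):
        parts = self._pending.pop(upload_id)
        # Assemble in part-number order, regardless of upload order.
        self.store[path] = b"".join(parts[n] for n in sorted(parts))

up = InMemoryMultipartUploader()
uid = up.init_multipart("/bucket/file")
up.put_part("/bucket/file", b"block2", 2, uid)   # out of order is fine
up.put_part("/bucket/file", b"block1", 1, uid)
assert "/bucket/file" not in up.store            # isolation until complete
up.complete("/bucket/file", uid)
assert up.store["/bucket/file"] == b"block1block2"
```

In the SyncService picture, each Datanode would call put_part for the block it holds, so a file's blocks never need to be gathered on one node before upload.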
MultipartUploader in SyncService
27
Sync Extra Slides
28
Create Directory
29
Delete Directory
30
Rename Directory
31
Create File
32
Delete File
33
Rename File
34
Modify File
35