1.
HDFS Tiered Storage:
Mounting Object Stores
in HDFS
Ewan Higgs
Thomas Demoor
4/28/2018 ©2018 Western Digital Corporation or its affiliates. All rights reserved.
2. Thomas Demoor
• Product Owner, Object Storage Access
– S3-compatible features (versioning, lifecycle management, async. replication)
– Hadoop integration
• Apache Hadoop contributions:
– S3a filesystem improvements (Hadoop 2.6–2.8+)
– Object store optimizations: rename-free committer (HADOOP-13786)
– HDFS Tiering: Provided Storage (HDFS-9806)
• Previously: Queueing Theory PhD @ Ghent University, Belgium
• Tweets @thodemoor
3. Ewan Higgs
• Software Architect
– Hadoop integration
• Apache Hadoop contributions:
– HDFS Tiering: Provided Storage
• Previously:
– Embedded systems
– Managing market data in finance
– HPC Systems administrator
4. HDFS + Object Store
• External Tier for Hadoop: Object Store or HDFS or …
• Your trusted HDFS workflow
– Hadoop apps interact with HDFS as usual
– Data tiering runs asynchronously in the background
– Admins control data placement through the usual CLI
• HDFS Storage Policy
– DISK, SSD, RAM_DISK, ARCHIVE, PROVIDED
• Read path (HDFS-9806) merged in December 🎉
– Already being used in production at Microsoft
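A storage policy can also be applied programmatically; a minimal Java sketch, assuming a build that includes the PROVIDED tier (stock releases only ship policies such as HOT, COLD, ALL_SSD, …):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetProvidedPolicy {
  public static void main(String[] args) throws Exception {
    // Equivalent to: hdfs storagepolicies -setStoragePolicy -policy PROVIDED -path /var/log
    // PROVIDED assumes the HDFS-9806 tiering work is present in this build.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    fs.setStoragePolicy(new Path("/var/log"), "PROVIDED");
  }
}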
5. Demo
• AS AN Administrator,
• I CAN configure HDFS with an S3a backend with an appropriate Storage Policy
hdfs storagepolicies -setStoragePolicy -policy PROVIDED -path /var/log
hdfs syncservice -create -backupOnly -name activescale /var/logs s3a://hadoop-logs/
• SO THAT when a user copies files to HDFS they are asynchronously copied to the synchronization endpoint
6. WIP – Mock Demo
• AS AN Administrator
• I CAN set the Storage Policy to be PROVIDED_ONLY
hdfs storagepolicies -setStoragePolicy -policy PROVIDED_ONLY -path /var/log
• SO THAT data is no longer stored on the Datanodes but is transparently read through from the synchronization endpoint on access.
7. S3a: quick recap
• S3a = Hadoop-compatible FileSystem
– (cf. wasb://, adl://, gcs://, ceph://, …)
– Direct IO between Hadoop apps and object store: s3a://bucket/object
– Scalable & Resilient: NameNode functions taken up by object store
– Functional since Hadoop 2.7; improvements in 2.8 and ongoing (~2x faster writes)
• Our contributions:
– Non-AWS endpoints
– YARN/Hadoop2
– Streaming upload
– …
• S3a in action:
– Hortonworks Data Cloud: https://hortonworks.com/products/cloud/aws/
– Databricks Cloud: https://databricks.com/product/databricks
– Backup entire HDFS clusters
– Archive cold data from HDFS clusters
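To make the direct-IO path concrete, a minimal sketch of reading straight from an object store through the s3a connector; the bucket and key are illustrative:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3aDirectRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Direct IO: the app streams the object itself, no HDFS involved.
    FileSystem s3 = FileSystem.get(URI.create("s3a://hadoop-logs/"), conf);
    try (FSDataInputStream in = s3.open(new Path("/2018/04/28/app.log"))) {
      org.apache.hadoop.io.IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}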
[Diagram: the application in the Hadoop cluster reads and writes directly against the object store.]
8. Hortonworks evolution of Hadoop+Cloud
Evolution towards Cloud Storage as the Primary Data Lake
[Diagram, three stages: (1) HDFS is the primary store with backup/restore to cloud storage; (2) inputs are copied from cloud storage into HDFS and outputs copied back; (3) applications read/write cloud storage directly, with HDFS holding only tmp data.]
9. Hortonworks evolution of Hadoop+Cloud: Next
[Diagram, adding a fourth stage to the three above: applications keep talking to HDFS, which writes through to cloud storage and loads data on demand.]
Evolution towards Cloud Storage as the Primary Data Lake
10. WD Active Archive Object Storage
• Western Digital moving up the stack
• Scale-out object storage system for Private & Public Cloud
• Key features:
– Compatible with the Amazon S3 API
– Strong consistency (not eventual!)
– Erasure coding for efficient storage
• Scale:
– Petabytes per rack
– Billions of objects per rack
– Linear scalability in the number of racks
• More info at http://www.hgst.com/products/systems
11. S3a doesn't fit all use cases
• Hadoop-compatible does not mean identical to HDFS
– S3a is not a FileSystem
• no notion of directories
• no support for flush/append (breaks Apache HBase, a KV-store)
– No Data Locality:
• less performant for hot/real-time data
– Hadoop admin tools require HDFS:
• permissions/quota/security/…
• Goal: integrate better with HDFS
– Data-locality for hot data + object storage for cold data
– Offer familiar HDFS abstractions, admin tools, …
12. Solution: "Mount" remote storage in HDFS
• Use HDFS to manage remote storage
– HDFS coordinates reads/writes to remote store
– Mount remote store as a PROVIDED tier in HDFS
• Details later in the talk
– Set StoragePolicy to move data between the tiers
[Diagram: a remote namespace (d, e, f) is mounted at /c in the HDFS namespace; an Alias Map records the HDFS block → remote location mapping; the Hadoop cluster writes through to the remote store and loads data on demand.]
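A rough sketch of the Namenode-side configuration for a PROVIDED mount; the property names follow the upstream HDFS-9806 documentation for Hadoop 3.x, so treat the exact keys and the alias-map class as assumptions to verify against your release:
import org.apache.hadoop.conf.Configuration;

// Hedged sketch: enable PROVIDED storage on the Namenode.
Configuration conf = new Configuration();
conf.setBoolean("dfs.namenode.provided.enabled", true);
// One alias-map implementation (HDFS block -> remote location);
// a LevelDB-backed variant also exists upstream.
conf.set("dfs.provided.aliasmap.class",
    "org.apache.hadoop.hdfs.server.common.blockaliasmap.impl.TextFileRegionAliasMap");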
13. Best of Both Worlds
• HDFS + S3A/WASB/…
• HDFS for streaming append workloads and data locality
• Object Storage for dense resilient storage of arbitrarily large scale
• Can also do HDFS on HDFS
14. Caching
• This design means HDFS could be used as a cache for Object Storage
• Similar to Alluxio, Ignite, Rubix?
– These are all caching systems that dispatch between storage systems horizontally
– We want to tier the storage systems vertically
– We want HDFS to remain the source of truth for Hadoop
[Diagram: caching systems like Alluxio sit horizontally between each compute node and the storage systems (HDFS, cloud); in this design the cloud store sits vertically below HDFS, which remains the source of truth.]
15. Requirements
• Preserve file-object mapping
– AliasMap (last year’s talk): used to synchronize namespaces
– Multiple Datanodes collaborate to move the blocks which together form a file/object in the destination system
• Minimize impact on frontend traffic / Efficient data transfer
– Naive: read all blocks into a single Datanode to reconstruct the file before transferring
– Efficient: each Datanode transfers its blocks out of the cluster directly, block by block, using
• S3: multipart upload
• WASB: append blobs
• HDFS: tmpdir + concat (see the sketch after this list)
• Deployment model
– In the Namenode: easy to deploy but adds resource pressure
– As an external service: harder to deploy at some sites but reduces resource pressure
• See AliasMap from HDFS-9806 and SPS service from HDFS-10825
• Would like to re-use SPS protocols
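For an HDFS endpoint, the tmpdir + concat route can use the real DistributedFileSystem.concat() call; a minimal sketch with illustrative paths (note concat has restrictions, e.g. the target must already exist, and older releases required all files in the same directory with full intermediate blocks):
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Sketch: part files are written independently, then stitched onto the
// target. concat() moves blocks within the namespace; no data is copied.
void stitchParts(DistributedFileSystem dfs) throws IOException {
  Path target = new Path("/backup/file.bin");       // illustrative path
  Path[] parts = {
      new Path("/backup/file.bin.part1"),
      new Path("/backup/file.bin.part2")
  };
  dfs.concat(target, parts);
}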
17. Tech Details
• Applications write to HDFS
– First to DISK, then SyncService asynchronously copies to synchronization endpoint
– When files have been copied, the extraneous disk replicas can be removed
[Diagram: a client writes a file through the Namenode to blocks on several Datanodes; the Datanodes copy those blocks to the external store via multipart upload (InitMultipart, PutPart per block, Complete); the extraneous block replicas are then removed.]
18. Satisfy Synchronization: How it works
• MountManager
– Manages all the local mount points
• SyncServiceSatisfier
– Service in Namenode that periodically creates a diff by comparing Snapshots
• PhasedPlanFactory
– Generates the plan for ordering the operations in the diff
• E.g. Create, Modify, Rename, Delete of Files and Directories
• SyncTask Tracker
– Runs the work and tracks it
• Namespace operations (MetadataSyncTask) run from Namenode
• Data operations (BlockSyncTask) run in the Datanodes (dispatched via the heartbeat mechanism)
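A hypothetical sketch of that task split; the class shapes below are invented for illustration and are not the actual patch code:
import java.io.IOException;

// Hypothetical: names follow the slide, bodies are illustrative.
abstract class SyncTask {
  abstract void execute() throws IOException;
}

// Namespace operations, executed directly from the Namenode.
class MetadataSyncTask extends SyncTask {
  void execute() throws IOException {
    // create/rename/delete paths on the synchronization endpoint
  }
}

// Data operations, shipped to a Datanode in a heartbeat response; the
// Datanode uploads its local block as one part of a multipart upload.
class BlockSyncTask extends SyncTask {
  void execute() throws IOException {
    // putPart(block) against the synchronization endpoint
  }
}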
19. Tech Details – Namenode Components
[Diagram: the Namenode components – MountManager, SyncServiceSatisfier, SnapshotDiffReport, PhasedPlanFactory, and SyncTask Tracker.]
20. Example Diff – Simple Case
Simple Case - New dirs; new files
Commands
#given /basic-test
mkdir -p /basic-test/a/b/c
touch /basic-test/a/b/c/d/f1.bin
touch /basic-test/f1.bin
SnapshotDiffReport
M d .
+ d ./a
+ f ./f1.bin
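The report above is a standard HDFS SnapshotDiffReport; a minimal sketch of producing one programmatically (directory and snapshot names are illustrative):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.SnapshotDiffReport;

public class DiffExample {
  public static void main(String[] args) throws Exception {
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());
    Path dir = new Path("/basic-test");
    dfs.allowSnapshot(dir);
    dfs.createSnapshot(dir, "s0");
    // ... mkdir/touch as in the commands above ...
    dfs.createSnapshot(dir, "s1");
    // The SyncServiceSatisfier consumes reports like this one.
    SnapshotDiffReport report = dfs.getSnapshotDiffReport(dir, "s0", "s1");
    System.out.println(report);
  }
}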
21. Example Diff – Harder Case
Harder Case - Cycle
Commands
#given /swap-test/a.bin
#given /swap-test/b.bin
mv /swap-test/a.bin /swap-test/tmp
mv /swap-test/b.bin /swap-test/a.bin
mv /swap-test/tmp /swap-test/b.bin
SnapshotDiffReport
M d .
R f ./a.bin -> ./b.bin
R f ./b.bin -> ./a.bin
22. Use the solution from DistCpSync for PhasedPlan
• DistCpSync Algorithm:
1. Rewrite Target Destinations based on other Renames
2. Renames and Deletes to temp directory
3. Renames to final location (mkdirs if directory doesn’t exist yet)
4. Drop temp directory (and the deletes with them)
5. Creates and Modifies
• PhasedPlan Algorithm:
1. Rewrite Target Destinations based on other Renames
2. Renames to temp directory
3. Delete Files and Directories
4. Renames to final location (mkdirs if directory doesn’t exist yet)
5. Drop temp directory
6. Create Directories on end point
7. Create Files on end point
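A hypothetical sketch of driving that ordering; the enum values mirror the PhasedPlan steps above, while executePhase and operationsFor are invented helpers, not names from the patch:
// Hypothetical: operations within a phase are independent of one another;
// running the phases strictly in order is what makes rename cycles like
// the previous example safe.
enum Phase {
  REWRITE_RENAME_TARGETS, RENAME_TO_TEMP, DELETE,
  RENAME_TO_FINAL, DROP_TEMP, CREATE_DIRS, CREATE_FILES
}

void executePlan(PhasedPlan plan) {
  for (Phase phase : Phase.values()) {
    executePhase(plan.operationsFor(phase)); // invented helper methods
  }
}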
23. Example Diff – Harder Case
Harder Case - Cycle
Commands
#given /swap-test/a.bin
#given /swap-test/b.bin
mv /swap-test/a.bin /swap-test/tmp
mv /swap-test/b.bin /swap-test/a.bin
mv /swap-test/tmp /swap-test/b.bin
SnapshotDiffReport
M d .
R f ./a.bin -> ./b.bin
R f ./b.bin -> ./a.bin
PhasedPlan
M d .
R f ./a.bin -> ./tmp/b.bin
R f ./b.bin -> ./tmp/a.bin
R f ./tmp/b.bin -> ./b.bin
R f ./tmp/a.bin -> ./a.bin
24. Tech Details – Namenode Components
[Diagram repeated: the Namenode components – MountManager, SyncServiceSatisfier, SnapshotDiffReport, PhasedPlanFactory, and SyncTask Tracker.]
25. MultipartUploader
• Common concept in Object Storage
– Supported by S3, WASB
• Usage in Hadoop
– S3A uses it – see Steve Loughran’s talk
– New to HDFS – HDFS-13186
• Three phases
– UploadHandle initMultipart(Path filePath)
– PartHandle putPart(Path filePath, InputStream inputStream, int partNumber, UploadHandle uploadId, long lengthInBytes)
– void complete(Path filePath, List<Pair<Integer, PartHandle>> handles, UploadHandle multipartUploadId)
• Benefits:
– Object/File Isolation – you only see the results when it’s done
– Can be written in parallel across multiple nodes
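A minimal usage sketch assembled from the three signatures above; obtaining the uploader and the part streams is elided, Pair is assumed to be the commons-lang3 tuple, and the variable names are illustrative:
import java.io.InputStream;
import java.util.Arrays;
import org.apache.commons.lang3.tuple.Pair;
import org.apache.hadoop.fs.Path;

// Parts may be uploaded from different nodes; readers see nothing until
// complete() publishes the file atomically. MultipartUploader, UploadHandle
// and PartHandle are the HDFS-13186 types named above.
void uploadInTwoParts(MultipartUploader uploader, Path target,
                      InputStream s1, long len1,
                      InputStream s2, long len2) throws Exception {
  UploadHandle upload = uploader.initMultipart(target);
  PartHandle p1 = uploader.putPart(target, s1, 1, upload, len1);
  PartHandle p2 = uploader.putPart(target, s2, 2, upload, len2);
  uploader.complete(target, Arrays.asList(Pair.of(1, p1), Pair.of(2, p2)), upload);
}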
27. Namenode work reduction
• All this work requires a synchronization point
• Namenode is the obvious place
• But the Namenode is too busy in large systems
• Optionally push SyncService to external service
– Precedents: AliasMap, StoragePolicySatisfier
28. Resources
• Tiered Storage HDFS-12090 [issues.apache.org]
– Design documentation
– List of subtasks, lots of linked tickets – take one!
– Discussion of scope, implementation, and feedback
• Joint work between Microsoft and Western Digital
– {thomas.demoor, ewan.higgs}@wdc.com
– {cdoug,vijala}@microsoft.com
29. Thanks!
Virajith Jalaparthi, Chris Douglas, Ewan Higgs, Kasper Janssens, Bert Verslyppe, Hendrik Depauw, Rakesh R, Daryn Sharp, Steve Loughran, Thomas Demoor, Allen Wittenauer, and more!
Editor's Notes
StoragePolicy - cf. SSD, DISK, …
hdfs syncservice -create -backupOnly -name activescale /var/logs s3a://hadoop-logs/
hdfs copyFromLocal: subdirectory of logs per day (14 days)
apps interact directly with hdfs
use aws cli to show objects pop up in backend?
hdfs show storage policy
Show progress of backup, wait to finish
hdfs cat file which is in provided only (seamless, fast)
S3A Recap
Difference between HDFS and S3A
Use cases where S3A falls down
Best of both worlds
HDFS + S3A/WASB/…
Other mixed systems
HDFS + HDFS
Alluxio
Alluxio: a virtual central overlay across all your storage systems, i.e. the source of truth. We see HDFS as the Hadoop overlay/cache of your data lake / object store.
Such overlays are limited to the intersection of what all the storage systems can do and the mappings that can be made (limited S3 REST API, …). We get the best of both worlds.
Caveat: we are not experts on each of these projects. https://docs.microsoft.com/en-us/rest/api/storageservices/understanding-block-blobs--append-blobs--and-page-blobs
PhasedPlanFactory - could refactor DistCpSync to reuse code, but it's undergoing a lot of work for 3.1
While it might seem like you can simply run the commands in a particular order to get the correct results, it's not that simple! We can continue to make arbitrary rename rings, and detecting cycles is not nice.
Q: Does anyone see a problem here?
A: Renaming files in an object store is not pleasant; doing it twice is worse.
Major design consideration: code in the hadoop-hdfs project needs to perform a multipart upload to an object store (e.g. S3a, WASB), so the multipart upload ID or part information cannot resolve to actual data.
To solve this, we wrap a byte[] as UploadHandle and PartHandle, which contain the actual information required to pass to the underlying API.
GCS supports Composite Objects, which are similar to how HDFS will work: it’s a concatenation of existing objects/files.
Tough to see!! Same as create file because it must be idempotent.