HDFS Tiered Storage:
mounting object stores in HDFS
Thomas Demoor – Product Owner
Ewan Higgs – System Architect
Engineering, Data Center Systems,
Western Digital
• Engineering Owner for object storage datapath
– Amazon S3-compatible API
– Customer-facing datapath features
– Hadoop integration
• Apache Hadoop contributions:
– S3a filesystem improvements (Hadoop 2.6–2.8+)
– Object store optimizations: rename-free committer (HADOOP-13786)
– HDFS Tiering: Provided Storage (HDFS-9806)
• Ex: Queueing Theory PhD @ Ghent Uni, Belgium
• Tweets @thodemoor
Thomas Demoor
Ewan Higgs
• Software Architect at Western Digital
– Focused on Hadoop Integration
• HDFS Contributions
– Protocol level changes to the Block Token Identifier (HDFS-11026, HDFS-6708, HDFS-9807)
– Provided Storage (HDFS-9806)
• [F]OSS work
– Contributed to: HDFS, Hue, hanythingondemand, …
– My own work: Spark Terasort, spark-config-gen, csv-game
– Co-organized: FOSDEM HPC, Big Data, and Data Science Devroom (201{6,7})
Resources
• Tiered Storage HDFS-9806 [issues.apache.org]
– Design documentation
– List of subtasks, lots of linked tickets – take one!
– Discussion of scope, implementation, and feedback
• Joint work Microsoft – Western Digital
– {thomas.demoor, ewan.higgs}@wdc.com
– {cdoug,vijala}@microsoft.com
Data in Hadoop
• All data in one place
• Tools written against abstractions
– Compatible FileSystems (HDFS/Azure/S3/etc.)
• Multi-tenancy & management APIs
– Authorization
– Authentication
– Quotas
– Encryption
• Storage Tiering & Policies
– Tiers: Hot, Warm, Cold, …
– Media: RAM, SSD, HDD, Archive
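A minimal sketch (not from the slides) of what “tools written against abstractions” means in practice: the same Hadoop FileSystem code path works whether the URI points at HDFS, Azure, or S3. Class name, paths, and URIs below are illustrative.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemeAgnosticRead {
  // Reads the first line of a file; the scheme (hdfs://, s3a://, wasb://, file://)
  // selects the FileSystem implementation, the application code does not change.
  static String firstLine(String uri, Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(new Path(uri)), StandardCharsets.UTF_8))) {
      return in.readLine();
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative URIs only.
    System.out.println(firstLine("hdfs://nn:8020/data/events/part-00000", conf));
    System.out.println(firstLine("s3a://my-bucket/data/events/part-00000", conf));
  }
}
```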
Managing Multiple Clusters: Today
• Why multiple Hadoop clusters?
– NameNode as a scaling bottleneck
– Separate production from staging / testing
– Security & Compliance
• Copy by using the compute cluster
– Copy data (distcp) between clusters
– (+) Clients process local copies, no visible partial copies
– (-) Uses compute resources, requires capacity planning
• Resolve inside the application
– Directly access data in multiple clusters
– (+) Consistency managed at client
– (-) Auth to all data sources, consistency is hard, no opportunities for transparent caching
[Diagram: two clusters, hdfs://a/ and hdfs://b/. Left: data D is copied between the clusters and application A reads/writes its local copy. Right: application A reads/writes both clusters directly.]
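As a hedged illustration of the “resolve inside the application” option above (cluster addresses and paths are made up): the application opens both clusters itself, so it must hold credentials for each and reason about consistency on its own.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class TwoClusterCopy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The application talks to both namespaces directly (illustrative addresses).
    FileSystem a = FileSystem.get(URI.create("hdfs://a/"), conf);
    FileSystem b = FileSystem.get(URI.create("hdfs://b/"), conf);

    // Read from cluster a, write to cluster b. The partial copy is visible to
    // other readers of cluster b until the write completes; the application
    // has to manage that itself (e.g. write to a temp path, then rename).
    try (FSDataInputStream in = a.open(new Path("/data/input/part-00000"));
         FSDataOutputStream out = b.create(new Path("/staging/part-00000"))) {
      IOUtils.copyBytes(in, out, conf, false);
    }
  }
}
```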
Managing Multiple Clusters: Our Proposal
• Use the platform to mount external store as provided storage tier
– Synchronize storage by mounting remote namespace
– (+) Transparent to users, caching/prefetching, unified namespace
– (-) Conflicts may not be automatically mergeable
• Mount hadoop-compatible filesystem as provided store
– hdfs://
– file://
– object stores (s3a://, wasb://)
– …
• Goal: use HDFS to coordinate external storage
– No explicit data copying
– Present uniform namespace
– Multi-protocol access
– No capability or performance gap: storage types (RAM/SSD/DISK/PROVIDED), rebalancing, security, quotas, etc.
[Diagram: application A reads/writes a single namespace on hdfs://a/, which mounts the remote namespace hdfs://b/.]
Hadoop Cloud Storage Utilization Evolution
Evolution towards Cloud Storage as the Primary Data Lake
[Diagram, three stages of cloud storage use: (1) application input/output on HDFS, with backup/restore to cloud storage; (2) input read from cloud storage, output written to HDFS and copied back out; (3) input/output on cloud storage directly, with HDFS holding only temporary data.]
Hadoop Cloud Storage Utilization Evolution: Next
[Diagram: the same three stages, plus a fourth: application input/output on HDFS, which writes through to cloud storage and loads data on demand.]
Evolution towards Cloud Storage as the Primary Data Lake
Use-Case: Object Store as Archive Tier
[Diagram: application input/output on HDFS, which writes through to the object store and loads data on demand.]
• Western Digital moving up the stack
– Belgian Object Storage startup Amplidata (started 2008)
– Acquired by Western Digital in 2015
• Scale-out object storage system for Private & Public Cloud
• Key features:
– Compatible with Amazon S3 API
– Strong consistency (not eventual!)
– Erasure coding for efficient storage
– Linear scalability in # racks, objects, throughput
• Scale:
– Big system -- X100:
• 588 disks/rack = 5.8 PB raw = 4-4.5 PB usable
• 5B objects
– Small system -- P100:
• 72 disks/module = 720 TB raw = 508 TB usable
• 600M objects
• More info HERE
WD Active Archive Object Storage
• Target: Hadoop users with archival storage needs
• Archival storage → large footprint (1PB+ Hadoop clusters)
• Large footprint → already heavily invested in HDFS ecosystem
– Existing workflows, scripts, tools
– Migrating to cloud would require application changes
• Provided storage offers the best of both worlds:
– Familiar HDFS abstractions
– Data-locality for hot data
– Scale-out object storage for cold data
Use-Case: Object Storage as Archive Tier
[Diagram: application input/output on HDFS, which writes through to the object store and loads data on demand.]
Use-Case: Ephemeral Hadoop Clusters
[Diagram: multiple (ephemeral) Hadoop clusters, each with its own HDFS performing write-through and load-on-demand against cloud object storage.]
• Low utilization on clusters running latency-sensitive applications
– Seen at Microsoft (Zhang et al. [OSDI’16]) and Google (Lo et al. [ACM TOCS’16])
– Remedy: Co-locate analytics cluster on same hardware, as secondary tenant
• To handle scale (10,000s machines), run multiple HDFS Namenodes in federation
– Related: Router-based HDFS federation (HDFS-10467)
• Require Quality of Service for latency-sensitive applications
– Preempt machines running analytics
– E.g., updating the search index → kill 1000 nodes of the analytics cluster
– Rapid changes in load on Namenodes entail re-balancing between Namenodes
• During rebalancing, use tiering to “mount” the source sub-tree in the destination Namenode
– Metadata operation, much faster than moving data (Alt.: run a distcp job, as proposed in HDFS-10467)
– Can lazily copy data to destination NN
– Data available even before the copying is complete
Use-Case: Harvesting spare cycles in datacenters
Challenges
• Synchronize metadata without copying data
– Dynamically page in “blocks” on demand
– Define policies to prefetch and evict local replicas
• Mirror changes in remote namespace
– Handle out-of-band churn in remote storage
– Avoid dropping valid, cached data (e.g., rename)
• Handle writes consistently
– Writes committed to the backing store must “make sense”
• Dynamic mounting
– Efficient/clean mount-unmount behavior
– 1 Object Store mapping to multiple Namenodes
Big Picture: Read from External Store
[Diagram: the subtree /c/d of the external namespace ext://nn is mounted into the HDFS cluster (NN, DN1, DN2) as /d. A client issues read(/d/e); HDFS translates it into read(/c/d/e) against the external store and streams the file data back to the client.]
Review: Retrieving the contents of a file in HDFS
• Block locations stored in NN
– Resolved in getBlockLocation() to a single DN and the relevant storage type
• Replicas stored in the DN
[Diagram: DFSClient calls getBlockLocation(“/d/f/z1”, 0, L) on the NN (FSImage + BlockManager); the NN returns LocatedBlocks {{DN2, b_i}}; the client then reads the replica from DN2’s local storage (RAM_DISK / SSD / DISK).]
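For reference, a sketch of mine (not the slides’ code) showing the public API that surfaces this lookup, FileSystem#getFileBlockLocations; the path is illustrative, and recent Hadoop releases also expose the storage types on BlockLocation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus st = fs.getFileStatus(new Path("/d/f/z1"));   // illustrative path
    // The NameNode resolves each block to hosts and storage types.
    for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
      System.out.println(loc.getOffset() + "+" + loc.getLength()
          + " -> hosts=" + String.join(",", loc.getHosts())
          + " storageTypes=" + java.util.Arrays.toString(loc.getStorageTypes()));
    }
  }
}
```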
New! Provided Storage Type
• Peer to RAM, SSD, DISK in HDFS (HDFS-2832)
• Data in external store mapped to HDFS blocks
– Each block associated with an Alias = (REF, nonce)
• Used to map blocks to external data
• Nonce used to detect changes on backing store
• E.g.: REF = (file URI, offset, length); nonce = GUID
• E.g.: REF= (s3a://bucket/file, 0, 1024); nonce = <ETag>
– Mapping stored in a BlockMap
• KV store accessible by NN and all DNs
• KV can be external service or in the NN
• ProvidedVolume on Datanodes reads/writes data from/to external store
[Diagram: in the NN, the FSNamesystem maps files to blocks (/a/foo → b_i … b_j, /adl/bar → b_k … b_l) and the BlockManager maps blocks to storages (b_i → {s1, s2, s3}, b_k → {s_PROVIDED}); the BlockMap maps b_k → Alias_k. DataNode storage types: RAM_DISK, SSD, DISK, PROVIDED; PROVIDED volumes on DN1/DN2 front the external store.]
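A hedged sketch of the Alias = (REF, nonce) idea from this slide; the class and field names here are illustrative, not the names used in HDFS-9806.

```java
import java.net.URI;
import java.util.Objects;

/**
 * Illustrative only: one way to model Alias = (REF, nonce) for a PROVIDED block.
 * REF locates the bytes in the external store; the nonce (e.g. an S3 ETag, or a
 * (fileId, mtime) pair) detects out-of-band changes to the backing object.
 */
public final class BlockAlias {
  private final URI file;      // e.g. s3a://bucket/file
  private final long offset;   // byte offset of the block within the object
  private final long length;   // block length
  private final String nonce;  // e.g. ETag / GUID

  public BlockAlias(URI file, long offset, long length, String nonce) {
    this.file = file;
    this.offset = offset;
    this.length = length;
    this.nonce = nonce;
  }

  /** True if the nonce observed in the external store still matches. */
  public boolean stillValid(String observedNonce) {
    return Objects.equals(nonce, observedNonce);
  }

  public URI getFile() { return file; }
  public long getOffset() { return offset; }
  public long getLength() { return length; }
  public String getNonce() { return nonce; }
}
```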
Example: Using an immutable cloud store
• Create FSImage and BlockMap
– Block StoragePolicy can be set as required
– E.g. {rep=2, PROVIDED, DISK }
[Diagram: from the external namespace ext://nn, the FSImage records /d/e → {b1, b2, …} and /d/f/z1 → {b_i, b_i+1, …}, with b_i → {rep = 1, PROVIDED}; the BlockMap records b_i → {(ext://nn/c/d/f/z1, 0, L), inodeId1} and b_i+1 → {(ext://nn/c/d/f/z1, L, 2L), inodeId1}.]
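A toy sketch of how such FSImage/BlockMap entries could be generated: walk the remote namespace and cut each file into block-sized regions. Assumptions are mine: a 128 MB block size, s3a:// standing in for the slide’s ext://nn, printed output standing in for the real image/BlockMap tooling.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative only: cut each remote file into PROVIDED block entries. */
public class BuildToyBlockMap {
  static final long BLOCK_SIZE = 128L * 1024 * 1024;  // assumption: 128 MB HDFS blocks

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // s3a:// stands in for the slide's ext://nn external namespace.
    URI remote = URI.create("s3a://my-bucket/c/d/");
    FileSystem fs = FileSystem.get(remote, conf);

    long nextBlockId = 1;
    for (FileStatus st : fs.listStatus(new Path(remote))) {  // non-recursive, for brevity
      if (!st.isFile()) {
        continue;
      }
      List<Long> blockIds = new ArrayList<>();
      for (long off = 0; off < st.getLen(); off += BLOCK_SIZE) {
        long len = Math.min(BLOCK_SIZE, st.getLen() - off);
        long blockId = nextBlockId++;
        blockIds.add(blockId);
        // BlockMap entry: block -> (REF = uri/offset/length, nonce = mtime here).
        System.out.printf("b%d -> (%s, %d, %d, nonce=%d)%n",
            blockId, st.getPath(), off, len, st.getModificationTime());
      }
      // FSImage entry: file -> ordered list of block IDs, all PROVIDED.
      System.out.println(st.getPath() + " -> " + blockIds);
    }
  }
}
```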
Example: Using an immutable cloud store
• Start NN with the FSImage
– Replication > 1 → start copying to local media
• All blocks reachable from NN when a DN with PROVIDED storage heartbeats in
– In contrast to READ_ONLY_SHARED (HDFS-5318)
[Diagram: the NN (FSImage, BlockMap, BlockManager) mirrors the mounted subtree d/{e, f, g} of the external namespace; DN1 and DN2 report their PROVIDED storage.]
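The “Replication > 1 → start copying to local media” point can be driven through the standard replication API; a minimal sketch (path illustrative, and whether PROVIDED replicas are re-replicated onto local media this way is up to the tiering implementation, not stock HDFS).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrefetchByReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Raise the replication factor; the NameNode schedules the extra replicas,
    // which here would be copied from the PROVIDED tier onto local media.
    boolean accepted = fs.setReplication(new Path("/d/f/z1"), (short) 2);
    System.out.println("replication change accepted: " + accepted);
  }
}
```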
Example: Using an immutable cloud store
• Block locations stored as a composite DN
– Contains all DNs with the storage configured
– Resolved in getBlockLocation() to a single DN
• DN uses Alias to read from external store
– Data can be cached locally as it is read (read-through cache)
[Diagram: DFSClient calls getBlockLocation(“/d/f/z1”, 0, L); the NN resolves b_i via the BlockMap and returns LocatedBlocks {{DN2, b_i, “ext:///c/d/f/z1”}}; DN2 looks up the Alias (“ext:///c/d/f/z1”, 0, L, GUID1) and reads the range from the external store.]
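A hedged sketch of what the DataNode-side read-through does conceptually (my illustration, not the actual ProvidedVolume code): open the aliased byte range in the external store, check the nonce, and stream the bytes, optionally caching them locally.

```java
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative only: read one PROVIDED block's byte range from the external store. */
public class ProvidedBlockRead {
  static void readBlock(URI file, long offset, long length, String expectedNonce,
                        OutputStream sink, Configuration conf) throws Exception {
    FileSystem ext = FileSystem.get(file, conf);
    // Cheap nonce check: here the nonce is the file's modification time; an
    // object store ETag (checked via a conditional GET) would be stronger.
    FileStatus st = ext.getFileStatus(new Path(file));
    if (!Long.toString(st.getModificationTime()).equals(expectedNonce)) {
      throw new IllegalStateException("backing object changed, refusing stale read");
    }
    byte[] buf = new byte[64 * 1024];
    try (FSDataInputStream in = ext.open(new Path(file))) {
      in.seek(offset);
      long remaining = length;
      while (remaining > 0) {
        int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
        if (n < 0) {
          break;
        }
        sink.write(buf, 0, n);  // also the point where a local (read-through)
        remaining -= n;         // cache copy of the block could be written
      }
    }
  }
}
```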
Benefits of the PROVIDED design
• Use existing HDFS features to enforce quotas, limits on storage tiers
– Simpler implementation, no mismatch between HDFS invariants and framework
• Supports different types of back-end storages
– org.apache.hadoop.FileSystem, blob stores, etc.
• Credentials hidden from client
– Only NN and DNs require credentials of external store
– HDFS can be used to enforce access controls for remote store
• Enables several policies to improve performance
– Set replication in FSImage to pre-fetch
– Read-through cache
– Actively pre-fetch while cluster is running
• Set StoragePolicy for the file to prefetch
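The last bullet (“Set StoragePolicy for the file to prefetch”) maps onto the existing storage-policy API; a minimal sketch with an illustrative path and the stock “HOT” policy name — the policy that actually pairs PROVIDED with local media is defined by the tiering work, not by stock HDFS.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class PrefetchByPolicy {
  public static void main(String[] args) throws Exception {
    // Assumes fs.defaultFS points at the HDFS cluster.
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());
    // Switching to a hotter policy asks the NameNode/Mover to place replicas on
    // local media instead of (or in addition to) the PROVIDED tier.
    dfs.setStoragePolicy(new Path("/archive/2017/01/events"), "HOT");
  }
}
```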
Handling out-of-band changes
• Nonce for correctness (e.g. ETag in S3 + GET If-Match: ETag)
• Asynchronously poll external store to reap deleted / discover new objects
– Integrate detected changes into the NN
– Update BlockMap on file creation/deletion
• Consensus, shared log, etc.
– Tighter NS integration complements provided store abstraction
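The “ETag in S3 + GET If-Match: ETag” point is plain HTTP semantics; a hedged sketch using java.net.HttpURLConnection against an illustrative object URL (real S3 access also needs request signing or a pre-signed URL, omitted here).

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ConditionalGet {
  public static void main(String[] args) throws Exception {
    // Illustrative URL and ETag; in S3 the ETag is recorded as the block's nonce.
    URL object = new URL("https://my-bucket.s3.amazonaws.com/c/d/f/z1");
    String expectedETag = "\"9b2cf535f27731c974343645a3985328\"";

    HttpURLConnection conn = (HttpURLConnection) object.openConnection();
    conn.setRequestProperty("If-Match", expectedETag);
    conn.setRequestProperty("Range", "bytes=0-1048575");   // one block's worth

    int status = conn.getResponseCode();
    if (status == 412) {
      // 412 Precondition Failed: the object changed out of band; the cached
      // alias is stale and the NameNode must refresh its view.
      System.out.println("nonce mismatch, not serving stale data");
    } else if (status == 200 || status == 206) {
      try (InputStream in = conn.getInputStream()) {
        // stream the block bytes to the client / local cache ...
      }
    }
    conn.disconnect();
  }
}
```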
Assumptions
• Churn is rare and relatively predictable
– Analytic workloads, ETL into external/cloud storage, compute in cluster
• Clusters are either consumers/producers for a subtree/region
– FileSystem has too little information to resolve conflicts
[Diagram: an ingest/ETL pipeline fills the Raw Data Bucket; the analytics cluster consumes it and writes to the Analytic Results Bucket.]
Implementation roadmap
• Read-only image (with periodic, naive refresh)
– ViewFS-based: NN configured to refresh from root
– Mount within an existing NN
– Refresh view of remote cluster and sync
• Write-through
– Cloud backup: no namespace in external store, replication only
– Return to writer only when data are committed to external store
• Write-back
– Lazily replicate to external store
• Dynamic Mounting
– Existing NN where an administrator wants to add tiered storage
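The first roadmap stage is “ViewFS-based”; for orientation, this is what a client-side ViewFS mount table looks like when set programmatically. Cluster name, paths, and bucket are illustrative, and the provided-storage mount itself is a server-side NN feature configured differently.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsMountExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Mount table "demo": /user lives in an HDFS cluster, /archive in an object store.
    conf.set("fs.viewfs.mounttable.demo.link./user", "hdfs://nn-a:8020/user");
    conf.set("fs.viewfs.mounttable.demo.link./archive", "s3a://my-bucket/archive");

    FileSystem viewFs = FileSystem.get(URI.create("viewfs://demo/"), conf);
    for (FileStatus st : viewFs.listStatus(new Path("/"))) {
      System.out.println(st.getPath());   // lists the mount points
    }
  }
}
```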
Dataworks Summit Hadoop San Jose, June 13-15
• Today’s talk was about the implemented read path
• This year’s SJ talk (co-presented with Microsoft) will be about the next steps:
– Write path
– Dynamic mounting
Resources + Q&A
• Tiered Storage HDFS-9806 [issues.apache.org]
– Design documentation
– List of subtasks, lots of linked tickets – take one!
– Discussion of scope, implementation, and feedback
• Joint work Microsoft – Western Digital
– {thomas.demoor, ewan.higgs}@wdc.com
– {cdoug,vijala}@microsoft.com
Bonus slide: Write & Dynamic mounting
• Write == multipart upload
• Mounting 1 Object Store into 1 Namenode read-only is relatively easy.
– Namenode can dictate the Block IDs.
• Mounting 1 Object Store into multiple Namenodes r/w is hard: the Key-Value store must be shared.
– Multiple Namenodes may disagree over the next Block ID to write.
• Dynamically unmounting 1 Object Store is hard
– Option 1: Blocks simply become inaccessible as though storage was removed. Not ideal!
– Option 2: Provided Blocks have a special header (like in Erasure Coding; See HDFS-10867).
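For context on “Write == multipart upload”: a minimal sketch with the AWS SDK for Java v1 (my illustration, not code from the talk); bucket and key names are made up, and in the tiered-storage design it would be the Datanode/Namenode, not the client, driving this.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.PartETag;
import com.amazonaws.services.s3.model.UploadPartRequest;

public class MultipartUploadSketch {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    String bucket = "my-bucket", key = "d/f/z1";

    // 1. Start the multipart upload; each HDFS block could become one part.
    String uploadId = s3.initiateMultipartUpload(
        new InitiateMultipartUploadRequest(bucket, key)).getUploadId();

    // 2. Upload parts (a single 5 MB part here, just to show the shape;
    //    non-final parts must be at least 5 MB).
    byte[] block = new byte[5 * 1024 * 1024];
    List<PartETag> etags = new ArrayList<>();
    etags.add(s3.uploadPart(new UploadPartRequest()
        .withBucketName(bucket).withKey(key).withUploadId(uploadId)
        .withPartNumber(1)
        .withInputStream(new ByteArrayInputStream(block))
        .withPartSize(block.length)).getPartETag());

    // 3. Complete: the object becomes visible atomically, which is what makes
    //    multipart upload a good fit for committing HDFS writes.
    s3.completeMultipartUpload(
        new CompleteMultipartUploadRequest(bucket, key, uploadId, etags));
  }
}
```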
Bonus slide: Authentication
• Cloud credentials use tokens
• HDFS credentials use user/Kerberos
• In the case of HDFS-first deployments, PROVIDED is useful because the authentication remains with HDFS.
• In the case of Cloud-first deployments, PROVIDED is useful because it provides a caching layer for reads and writes.
• With PROVIDED, applications deployed on HDFS-first systems can be redeployed on Cloud-first systems with no changes.


Editor's Notes

• #5 Please join us. We have a design document posted to JIRA and an active discussion of the implementation choices on HDFS-9806. We have a few minutes for questions, but please find us after the talk. There are far more details than we can possibly cover in a single presentation and we’re still setting the design, so we’re very open to collaboration. Thanks, and... let’s take a couple of questions.
  • #6 Hadoop gained traction by putting all of an org’s data in one place, in common formats, to be processed by common tools. Different applications get a consistent view of their data from HDFS. Data is protected and managed by a set of user and operator invariants that assign quotas, authenticate users, encrypt data, and distribute it across heterogeneous media. If you have only one source of data to process using that abstraction, then you get to enjoy nice things and the rest of us will sullenly resent you.
• #7 In most cases, these multiple clusters and different tiers of storage are managed today using two main techniques. The first is to use the framework: for example, people run distcp jobs to copy data over from one storage cluster to another. While this allows clients to process local copies of data, and leaves no visible intermediate state, it needs compute resources and manual capacity planning. The second is to use the application to handle multiple clusters: the application can be made aware of the fact that data is in multiple clusters, and it can read the data from each one separately while reasoning about the data’s consistency. However, now each application must implement techniques to coordinate these reads and authenticate to different sources, and this leaves us with no opportunities for transparent caching or prefetching to improve performance.
  • #8 Our proposal is to use the platform to manage multiple storage clusters. So, we propose to use the storage layer to manage the multiple external storages. This allows us to use different storages for multiple applications and users in a transparent manner, we can use local storage to cache data from remote storage and have a single uniform namespace across multiple storage systems, which can be in the same building or on the other side of the world, in the cloud. In this talk, we are going to describe how we can enable HDFS to do this – how we can mount external storage systems in HDFS. This allows us to exploit all the capabilities and features that HDFS supports such as quotas, and security in accessing the different storage systems.
  • #9 We have seen Hortonworks explaining how they see the evolution of the data lake and we agree with it. So this slide is heavily inspired by a slide in a deck that Chris Nauroth has presented in the past. A lot of organizations start off with a HDFS system and take snapshots which get backed up onto cloud storage. This is time consuming and allows for a window where data could potentially be lost. [] So we move to pull data in from the cloud storage directly and write to HDFS. This makes sense as you may need to reuse the results of any intermediate job in an analytic pipeline. When the pipeline is done, the data is staged out to the cloud storage. [] Finally we see people reading and writing directly to the cloud storage. It is now the single source of truth and HDFS is used as a caching layer between jobs in a pipeline. Is anyone working like this now? (Raise hands?) Of course, this isn’t the end of the evolution. It’s not the goal of a cloud native system…
  • #10 Application shouldn’t manage multiple storage systems : persistent & temp Write through (write back later) Load on-demand. Pre load / Pre-tier: mount provided store, change tier Auto tiering to HDD + Provided / Callback when finished
  • #11 cloud hadoop clusters can be permanent, with cloud as cold tier or ephemeral (kubernetes, …)
  • #14 hadoop clusters can be permanent, with cloud as cold tier
  • #15 We are waiting on Virajith to finish this slide. Regarding low utilisation: check out Christina Delimitrou’s work on Quasar: http://www.csl.cornell.edu/~delimitrou/papers/2014.asplos.quasar.pdf http://www.csl.cornell.edu/~delimitrou/slides/2014.asplos.quasar.slides.pdf
  • #16 There are a few challenges. These can be broadly be grouped into the read path and the write path. In the read path, we're mostly focused on caching and synchronizing changes to the object storage. In the write path, we're concerned with writing new blocks and dynamically mounting object stores. We consider this phase 2.
  • #18 To understand how this would work in practice, let’s look at a simple example where we want to access an external cloud storage through HDFS. Let’s ignore writes for now. -> Now suppose, this is the part of the namespace we want to -> mount in HDFS. -> if the mount is successful, we should be able to access data in the cloud through HDFS. That is -> if a client comes and requests for a particular file, say /d/e, from HDFS, then HDFS should be -> able to read the file from the external store, -> get back the data from the external store and -> stream the data back to the client. Now, I will hand it over to Chris to explain how we make all of this to happen using the PROVIDED abstraction I just introduced.
• #19 For anyone who isn’t familiar with HDFS, let’s review how it currently retrieves data. [] the NN will report all the local replicas, and NN will select a single PROV replica, say closest to the client. This avoids reporting every DN as a valid target, which is accurate, but not helpful for applications. [] The NN will look up the BlockInfo (and DatanodeStorageInfo) information in the Block Map and return this information to the client as part of the block location. This includes stuff like Datanode, Storage Type, etc. [] When the client requests the block from the DN, it looks up the data from the relevant storage types.
• #20  XXX CUT XXX We introduce a new provided storage type which will be a peer to existing storage types. So, in HDFS today, the NN is partitioned into a namespace (FSNamesystem) that maps files to block IDs, and the block lifecycle management in the BlockManager. Each file is a sequence of block IDs in the namespace. Each block ID is mapped to a list of replicas resident on a storage attached to a datanode. A storage is a device (DISK, SSD, memory region) attached to a Datanode. Because HDFS understands blocks, even for files in the provided storage, we have a similar mapping. However, we also need to have some mapping of these blocks and how data is laid out in the provided store. For this, replicas of a block in “provided” storage are mapped to an alias. An alias is simply a tuple: a reference resolvable in the namespace of the external store, and a nonce to verify that the reference still locates the data matching that block. If my external store is another FileSystem, then my reference may be a (URI, offset, length), and the nonce includes an inode/fileID, modification time, etc. Finally, we have provided volumes in Datanodes which are used to read and write data from the external store. The provided volume essentially implements a client that is capable of talking to the external store.
  • #21 Let’s drill down into an example. Assume we want to mount this external namespace into HDFS. Rather, this subtree. [] We can generate a mirror of the metadata as an FSImage (checkpoint of NN state). For every file, we also partition it into blocks, and store the reference in the blockmap with the corresponding nonce. [] Note that the image contains only the block IDs and storage policy, while the blockmap stores the block alias. So if file /c/d/e were 1GB, the image could record 4 logical blocks. For each block, the blockmap would record the reference (URI,offset,len) and a nonce (inodeId, LMT) sufficient to detect inconsistency.
  • #22 A quick note on block reports, if those are unfamiliar. By the way: if any of this is unfamiliar, please speak up The NN persists metadata about blocks, but their location in the cluster is reported by DNs. Each DN reports the volumes (HDD, SSD) attached to it, and a list of block IDs stored in each. At startup, the NN comes out of safe mode (starts accepting writes) when some fraction of its namespace is available. [] When a DN reports its provided storage, it does not send a full block report for the provided storage (which is, recall, a peer of its local media). It only reports that any block stored therein is reachable through it. As long as the NN has at least one DN reporting that provided storage, it considers all the blocks in the block map as reachable. The NN scans the block map to discover DN blocks in that provided storage. This is in contrast to some existing work supporting read-only replicas, where every DN sends a block report of the shared data, as when multiple DNs mount a filer.
  • #23 Inside the NN, we relax the invariant that a storage- a HDD/SDD- belongs to only one DN. So when a client requests the block locations for a file (here z1) [] the NN will report all the local replicas, and NN will select a single PROV replica, say closest to the client. This avoids reporting every DN as a valid target, which is accurate, but not helpful for applications. [] The NN will look up the Block Alias information in the Block Map and return this information to the client as part of the block location. [] When the client requests the PROV block from the DN, it passes the block alias given by the NN. This will also be in the BlockTokenIdentifier to prevent rogue users from editing the request information. [] request the block data from the external store [] and return the data to the client, having verified the nonce [] because the block is read through the DN, we can also cache the data as a local block.
  • #24 There are a few points worth calling out, here. * First, this is a relatively small change to HDFS. The only client-visible change adds a new storage type. As a user, this is simpler than coordinating with copying jobs. In our cloud example, all the cluster’s data is immediately available once it’s in the namespace, even if the replication policy hasn’t prefetched data into local media. * Second, particularly for read-only mounts, this is a narrow API to implement. For cloud backup scenarios- where the NN is the only writer to the namespace- then we only need the block to object ID map and NN metadata to mount a prefix/snapshot of the cluster. * Third, because the client reads through the DN, it can cache a copy of the block on read. Pointedly, the NN can direct the client to any DN that should cache a copy on read, opening some interesting combinations of placement policies and read-through caching. The DN isn’t necessarily the closest to the client, but it may follow another objective function or replication policy. * Finally, in our example the cloud credentials are hidden from the client. S3/WAS both authenticate clients to containers using a single key. Because HDFS owns and protects the external store’s credentials, the client only accesses data permitted by HDFS. Generally, we can use features of HDFS that aren’t directly supported by the backing store if we can define the mapping.
  • #25 It’s imperative that we never return the wrong data. If a file were overwritten in the backing store, we will never return part of the first file, and part of the second. The nonce is what we use to protect ourselves from that. But there needs to be some way to ingest new data into HDFS. If our external store has a namespace compatible with FS, then we can always scan it, but... while refresh is limited to scans, the view to the client can be inconsistent. A client may see some unpromoted output, some promoted output, and a sentinel file declaring it completely promoted. Better cooperation with external stores can tighten the namespace integration, to expose meaningful states. For example, if the external store could expose meaningful snapshots, then HDFS could move from one to the next, maintaining a read-only version while it updates. If the path remains valid while the NN updates itself, we have clearer guarantees. For anyone familiar with CORFU and Tango (MSR, Dahlia Malkhi, Mahesh Balakrishnan, Ted Wobber), or with WANdisco’s integration of their Paxos engine with the NN, we can make the metadata sync tight and meaningful. We still need the logic at the block layer we’re adding as provided storage. After correctness, we also need to be mindful of efficiency. Output is often promoted by renaming it, and if the NN were to interpret that as a deletion and creation, our HDFS cluster would discard blocks just to recopy them, right at the moment they are consumed. One of our goals is to conservatively identify these cases based on actual workloads.
  • #26 Since I mentioned strong consensus engines, this isn’t a “real” shared namespace. Even the read-only case is eventually consistent; in the base case we’re scanning the entire subtree in the external store. That’s obviously not workable, but most bigdata workloads don’t hit pathological cases. The typical year/month/day/hour layouts common to analytics clusters are mostly additive, and this is sufficient for that case. * When writes conflict, there is only so much the FS can do to merge conflicts. Set aside really complex cases like compactions; even simple cases may not merge cleanly. If a user creates a directory that is also present in the external store, can that be merged? Maybe not; successful creation might be gating access; many frameworks in the Hadoop ecosystem follow conventions that rely on atomicity of operations in HDFS. * The permissions, timestamps, or storage policy may not match, and there isn’t a “correct” answer for the merged result (absent application semantics). * So we assume that, generally (or by construction), clusters will be either producers or consumers for some part of the shared namespace. Fundamentally: no magic, here. We haven’t made any breakthroughs in consensus, but provided storage is a tractable solution that happens to cover some common cases/deployments in its early incarnations, and from a R&D perspective, some very interesting problems in the policy space. Please find us after the talk, we love to talk about this.
  • #27 The implementation will be staged. The read-only case is relatively straightforward; we implemented a proof-of-concept spread over a few weeks. A link is posted to JIRA. We will start with a NN managing an external store, merged using federation (ViewFS). This lets us defer the mounting logic, which would otherwise interfere with NN operation. We will then explore strategies for creating and maintaining mounts in the primary NN, alongside other data. For those familiar with the NN locking and the formidable challenge of relaxing it, note that most of the invariants we’d enforce don’t apply inside the mount. Quotas aren’t enforced, renames outside can be disallowed, etc. So it may be possible to embed this in the NN. Refresh will start as naive scans, then improve. Identifying subtrees that change and/or are accessed more frequently could improve the general case, but polling is fundamentally limited. Given some experience, we can recognize the common abstractions when tiering over stores that expose version information, snapshots, etc. and write some tighter integrations. Writes are complex, so we will move from working system to working system. We’re wiring the PROV type into the DN state machines, so the write-through case should be tractable, particularly when the external store behind the provided abstraction is an object store. Ultimately, we’d like to use local storage to batch- or even prioritize- writes to the external store. Because HDFS sits between the client and the external store: if we have limited bandwidth, want to apply cost or priority models, etc. these can be embedded in HDFS.
  • #29 Please join us. We have a design document posted to JIRA, an active discussion of the implementation choices, and we’ll be starting a branch to host these changes. The existing work on READ_ONLY_SHARED replicas has a superlative design doc, if you want to contribute but need some orientation in the internal details. We have a few minutes for questions, but please find us after the talk. There are far more details than we can possibly cover in a single presentation and we’re still setting the design, so we’re very open to collaboration. Thanks, and... let’s take a couple questions.