HDFS Tiered Storage:
mounting object stores in HDFS
Thomas Demoor – Product Owner
Ewan Higgs – System Architect
Engineering, Data Center Systems,
Western Digital
• Engineering Owner for object storage datapath
– Amazon S3-compatible API
– Customer-facing datapath features
– Hadoop integration
• Apache Hadoop contributions:
– S3a filesystem improvements (Hadoop 2.6–2.8+)
– Object store optimizations: rename-free committer (HADOOP-13786)
– HDFS Tiering: Provided Storage (HDFS-9806)
• Ex: Queueing Theory PhD @ Ghent Uni, Belgium
• Tweets @thodemoor
Thomas Demoor
Ewan Higgs
• Software Architect at Western Digital
– Focused on Hadoop Integration
• HDFS Contributions
– Protocol level changes to the Block Token Identifier (HDFS-11026, HDFS-6708, HDFS-9807)
– Provided Storage (HDFS-9806)
• [F]OSS work
– Contributed to: HDFS, Hue, hanythingondemand, …
– My own work: Spark Terasort, spark-config-gen, csv-game
– Co-organized: FOSDEM HPC, Big Data, and Data Science Devroom (201{6,7})
Resources
• Tiered Storage HDFS-9806 [issues.apache.org]
– Design documentation
– List of subtasks, lots of linked tickets – take one!
– Discussion of scope, implementation, and feedback
• Joint work Microsoft – Western Digital
– {thomas.demoor, ewan.higgs}@wdc.com
– {cdoug,vijala}@microsoft.com
Data in Hadoop
• All data in one place
• Tools written against abstractions
– Compatible FileSystems (HDFS/Azure/S3/etc.)
• Multi-tenancy & management APIs
– Authorization
– Authentication
– Quotas
– Encryption
• Storage Tiering & Policies
– Tiers: Hot, Warm, Cold, …
– Media: RAM, SSD, HDD, Archive
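A minimal sketch (not from the slides) of what “tools written against abstractions” means in practice: the same Hadoop FileSystem code path works whether the URI points at HDFS, Azure, or S3. Class name, paths, and URIs below are illustrative.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemeAgnosticRead {
  // Reads the first line of a file; the scheme (hdfs://, s3a://, wasb://, file://)
  // selects the FileSystem implementation, the application code does not change.
  static String firstLine(String uri, Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(new Path(uri)), StandardCharsets.UTF_8))) {
      return in.readLine();
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative URIs only.
    System.out.println(firstLine("hdfs://nn:8020/data/events/part-00000", conf));
    System.out.println(firstLine("s3a://my-bucket/data/events/part-00000", conf));
  }
}
```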
Managing Multiple Clusters: Today
• Why multiple Hadoop clusters?
– NameNode as a scaling bottleneck
– Separate production from staging / testing
– Security & Compliance
• Copy by using the compute cluster
– Copy data (distcp) between clusters
– (+) Clients process local copies, no visible partial copies
– (-) Uses compute resources, requires capacity planning
• Resolve inside the application
– Directly access data in multiple clusters
– (+) Consistency managed at client
– (-) Auth to all data sources, consistency is hard, no opportunities for transparent caching
[Diagram: two clusters, hdfs://a/ and hdfs://b/. Left: data D is copied between the clusters and application A reads/writes its local copy. Right: application A reads/writes both clusters directly.]
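As a hedged illustration of the “resolve inside the application” option above (cluster addresses and paths are made up): the application opens both clusters itself, so it must hold credentials for each and reason about consistency on its own.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class TwoClusterCopy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The application talks to both namespaces directly (illustrative addresses).
    FileSystem a = FileSystem.get(URI.create("hdfs://a/"), conf);
    FileSystem b = FileSystem.get(URI.create("hdfs://b/"), conf);

    // Read from cluster a, write to cluster b. The partial copy is visible to
    // other readers of cluster b until the write completes; the application
    // has to manage that itself (e.g. write to a temp path, then rename).
    try (FSDataInputStream in = a.open(new Path("/data/input/part-00000"));
         FSDataOutputStream out = b.create(new Path("/staging/part-00000"))) {
      IOUtils.copyBytes(in, out, conf, false);
    }
  }
}
```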
Managing Multiple Clusters: Our Proposal
• Use the platform to mount external store as provided storage tier
– Synchronize storage by mounting remote namespace
– (+) Transparent to users, caching/prefetching, unified namespace
– (-) Conflicts may not be automatically mergeable
• Mount hadoop-compatible filesystem as provided store
– hdfs://
– file://
– object stores (s3a://, wasb://)
– …
• Goal: use HDFS to coordinate external storage
– No explicit data copying
– Present uniform namespace
– Multi-protocol access
– No capability or performance gap: storage types (RAM/SSD/DISK/PROVIDED), rebalancing, security, quotas, etc.
[Diagram: application A reads/writes a single namespace on hdfs://a/, which mounts the remote namespace hdfs://b/.]
Hadoop Cloud Storage Utilization Evolution
Evolution towards Cloud Storage as the Primary Data Lake
[Diagram, three stages of cloud storage use: (1) application input/output on HDFS, with backup/restore to cloud storage; (2) input read from cloud storage, output written to HDFS and copied back out; (3) input/output on cloud storage directly, with HDFS holding only temporary data.]
Hadoop Cloud Storage Utilization Evolution: Next
[Diagram: the same three stages, plus a fourth: application input/output on HDFS, which writes through to cloud storage and loads data on demand.]
Evolution towards Cloud Storage as the Primary Data Lake
Use-Case: Object Store as Archive Tier
[Diagram: application input/output on HDFS, which writes through to the object store and loads data on demand.]
• Western Digital moving up the stack
– Belgian Object Storage startup Amplidata (started 2008)
– Acquired by Western Digital in 2015
• Scale-out object storage system for Private & Public Cloud
• Key features:
– Compatible with Amazon S3 API
– Strong consistency (not eventual!)
– Erasure coding for efficient storage
– Linear scalability in # racks, objects, throughput
• Scale:
– Big system -- X100:
• 588 disks/rack = 5.8 PB raw = 4-4.5 PB usable
• 5B objects
– Small system -- P100:
• 72 disks/module = 720 TB raw = 508 TB usable
• 600M objects
• More info HERE
WD Active Archive Object Storage
• Target: Hadoop users with archival storage needs
• Archival storage → large footprint (1PB+ Hadoop clusters)
• Large footprint → already heavily invested in HDFS ecosystem
– Existing workflows, scripts, tools
– Migrating to cloud would require application changes
• Provided storage offers the best of both worlds:
– Familiar HDFS abstractions
– Data-locality for hot data
– Scale-out object storage for cold data
Use-Case: Object Storage as Archive Tier
[Diagram: application input/output on HDFS, which writes through to the object store and loads data on demand.]
Use-Case: Ephemeral Hadoop Clusters
[Diagram: multiple (ephemeral) Hadoop clusters, each with its own HDFS performing write-through and load-on-demand against cloud object storage.]
• Low utilization on clusters running latency-sensitive applications
– Seen at Microsoft (Zhang et al. [OSDI’16]) and Google (Lo et al. [ACM TOCS’16])
– Remedy: Co-locate analytics cluster on same hardware, as secondary tenant
• To handle scale (10,000s machines), run multiple HDFS Namenodes in federation
– Related: Router-based HDFS federation (HDFS-10467)
• Require Quality of Service for latency-sensitive applications
– Preempt machines running analytics
– E.g., updating the search index → kill 1000 nodes of the analytics cluster
– Rapid changes in load on Namenodes entail re-balancing between Namenodes
• During rebalancing, use tiering to “mount” the source sub-tree in the destination Namenode
– Metadata operation, much faster than moving data (Alt.: run a distcp job, as proposed in HDFS-10467)
– Can lazily copy data to destination NN
– Data available even before the copying is complete
Use-Case: Harvesting spare cycles in datacenters
Challenges
• Synchronize metadata without copying data
– Dynamically page in “blocks” on demand
– Define policies to prefetch and evict local replicas
• Mirror changes in remote namespace
– Handle out-of-band churn in remote storage
– Avoid dropping valid, cached data (e.g., rename)
• Handle writes consistently
– Writes committed to the backing store must “make sense”
• Dynamic mounting
– Efficient/clean mount-unmount behavior
– 1 Object Store mapping to multiple Namenodes
Big Picture: Read from External Store
[Diagram: the subtree /c/d of the external namespace ext://nn is mounted into the HDFS cluster (NN, DN1, DN2) as /d. A client issues read(/d/e); HDFS translates it into read(/c/d/e) against the external store and streams the file data back to the client.]
Review: Retrieving the contents of a file in HDFS
• Block locations stored in NN
– Resolved in getBlockLocation() to a single DN and the relevant storage type
• Replicas stored in the DN
[Diagram: DFSClient calls getBlockLocation(“/d/f/z1”, 0, L) on the NN (FSImage + BlockManager); the NN returns LocatedBlocks {{DN2, b_i}}; the client then reads the replica from DN2’s local storage (RAM_DISK / SSD / DISK).]
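For reference, a sketch of mine (not the slides’ code) showing the public API that surfaces this lookup, FileSystem#getFileBlockLocations; the path is illustrative, and recent Hadoop releases also expose the storage types on BlockLocation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus st = fs.getFileStatus(new Path("/d/f/z1"));   // illustrative path
    // The NameNode resolves each block to hosts and storage types.
    for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
      System.out.println(loc.getOffset() + "+" + loc.getLength()
          + " -> hosts=" + String.join(",", loc.getHosts())
          + " storageTypes=" + java.util.Arrays.toString(loc.getStorageTypes()));
    }
  }
}
```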
New! Provided Storage Type
• Peer to RAM, SSD, DISK in HDFS (HDFS-2832)
• Data in external store mapped to HDFS blocks
– Each block associated with an Alias = (REF, nonce)
• Used to map blocks to external data
• Nonce used to detect changes on backing store
• E.g.: REF = (file URI, offset, length); nonce = GUID
• E.g.: REF= (s3a://bucket/file, 0, 1024); nonce = <ETag>
– Mapping stored in a BlockMap
• KV store accessible by NN and all DNs
• KV can be external service or in the NN
• ProvidedVolume on Datanodes reads/writes data from/to external store
[Diagram: in the NN, the FSNamesystem maps files to blocks (/a/foo → b_i … b_j, /adl/bar → b_k … b_l) and the BlockManager maps blocks to storages (b_i → {s1, s2, s3}, b_k → {s_PROVIDED}); the BlockMap maps b_k → Alias_k. DataNode storage types: RAM_DISK, SSD, DISK, PROVIDED; PROVIDED volumes on DN1/DN2 front the external store.]
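A hedged sketch of the Alias = (REF, nonce) idea from this slide; the class and field names here are illustrative, not the names used in HDFS-9806.

```java
import java.net.URI;
import java.util.Objects;

/**
 * Illustrative only: one way to model Alias = (REF, nonce) for a PROVIDED block.
 * REF locates the bytes in the external store; the nonce (e.g. an S3 ETag, or a
 * (fileId, mtime) pair) detects out-of-band changes to the backing object.
 */
public final class BlockAlias {
  private final URI file;      // e.g. s3a://bucket/file
  private final long offset;   // byte offset of the block within the object
  private final long length;   // block length
  private final String nonce;  // e.g. ETag / GUID

  public BlockAlias(URI file, long offset, long length, String nonce) {
    this.file = file;
    this.offset = offset;
    this.length = length;
    this.nonce = nonce;
  }

  /** True if the nonce observed in the external store still matches. */
  public boolean stillValid(String observedNonce) {
    return Objects.equals(nonce, observedNonce);
  }

  public URI getFile() { return file; }
  public long getOffset() { return offset; }
  public long getLength() { return length; }
  public String getNonce() { return nonce; }
}
```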
Example: Using an immutable cloud store
• Create FSImage and BlockMap
– Block StoragePolicy can be set as required
– E.g. {rep=2, PROVIDED, DISK }
[Diagram: from the external namespace ext://nn, the FSImage records /d/e → {b1, b2, …} and /d/f/z1 → {b_i, b_i+1, …}, with b_i → {rep = 1, PROVIDED}; the BlockMap records b_i → {(ext://nn/c/d/f/z1, 0, L), inodeId1} and b_i+1 → {(ext://nn/c/d/f/z1, L, 2L), inodeId1}.]
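A toy sketch of how such FSImage/BlockMap entries could be generated: walk the remote namespace and cut each file into block-sized regions. Assumptions are mine: a 128 MB block size, s3a:// standing in for the slide’s ext://nn, printed output standing in for the real image/BlockMap tooling.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative only: cut each remote file into PROVIDED block entries. */
public class BuildToyBlockMap {
  static final long BLOCK_SIZE = 128L * 1024 * 1024;  // assumption: 128 MB HDFS blocks

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // s3a:// stands in for the slide's ext://nn external namespace.
    URI remote = URI.create("s3a://my-bucket/c/d/");
    FileSystem fs = FileSystem.get(remote, conf);

    long nextBlockId = 1;
    for (FileStatus st : fs.listStatus(new Path(remote))) {  // non-recursive, for brevity
      if (!st.isFile()) {
        continue;
      }
      List<Long> blockIds = new ArrayList<>();
      for (long off = 0; off < st.getLen(); off += BLOCK_SIZE) {
        long len = Math.min(BLOCK_SIZE, st.getLen() - off);
        long blockId = nextBlockId++;
        blockIds.add(blockId);
        // BlockMap entry: block -> (REF = uri/offset/length, nonce = mtime here).
        System.out.printf("b%d -> (%s, %d, %d, nonce=%d)%n",
            blockId, st.getPath(), off, len, st.getModificationTime());
      }
      // FSImage entry: file -> ordered list of block IDs, all PROVIDED.
      System.out.println(st.getPath() + " -> " + blockIds);
    }
  }
}
```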
Example: Using an immutable cloud store
• Start NN with the FSImage
– Replication > 1 → start copying to local media
• All blocks reachable from NN when a DN with PROVIDED storage heartbeats in
– In contrast to READ_ONLY_SHARED (HDFS-5318)
[Diagram: the NN (FSImage, BlockMap, BlockManager) mirrors the mounted subtree d/{e, f, g} of the external namespace; DN1 and DN2 report their PROVIDED storage.]
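The “Replication > 1 → start copying to local media” point can be driven through the standard replication API; a minimal sketch (path illustrative, and whether PROVIDED replicas are re-replicated onto local media this way is up to the tiering implementation, not stock HDFS).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrefetchByReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Raise the replication factor; the NameNode schedules the extra replicas,
    // which here would be copied from the PROVIDED tier onto local media.
    boolean accepted = fs.setReplication(new Path("/d/f/z1"), (short) 2);
    System.out.println("replication change accepted: " + accepted);
  }
}
```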
Example: Using an immutable cloud store
• Block locations stored as a composite DN
– Contains all DNs with the storage configured
– Resolved in getBlockLocation() to a single DN
• DN uses Alias to read from external store
– Data can be cached locally as it is read (read-through cache)
[Diagram: DFSClient calls getBlockLocation(“/d/f/z1”, 0, L); the NN resolves b_i via the BlockMap and returns LocatedBlocks {{DN2, b_i, “ext:///c/d/f/z1”}}; DN2 looks up the Alias (“ext:///c/d/f/z1”, 0, L, GUID1) and reads the range from the external store.]
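A hedged sketch of what the DataNode-side read-through does conceptually (my illustration, not the actual ProvidedVolume code): open the aliased byte range in the external store, check the nonce, and stream the bytes, optionally caching them locally.

```java
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative only: read one PROVIDED block's byte range from the external store. */
public class ProvidedBlockRead {
  static void readBlock(URI file, long offset, long length, String expectedNonce,
                        OutputStream sink, Configuration conf) throws Exception {
    FileSystem ext = FileSystem.get(file, conf);
    // Cheap nonce check: here the nonce is the file's modification time; an
    // object store ETag (checked via a conditional GET) would be stronger.
    FileStatus st = ext.getFileStatus(new Path(file));
    if (!Long.toString(st.getModificationTime()).equals(expectedNonce)) {
      throw new IllegalStateException("backing object changed, refusing stale read");
    }
    byte[] buf = new byte[64 * 1024];
    try (FSDataInputStream in = ext.open(new Path(file))) {
      in.seek(offset);
      long remaining = length;
      while (remaining > 0) {
        int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
        if (n < 0) {
          break;
        }
        sink.write(buf, 0, n);  // also the point where a local (read-through)
        remaining -= n;         // cache copy of the block could be written
      }
    }
  }
}
```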
Benefits of the PROVIDED design
• Use existing HDFS features to enforce quotas, limits on storage tiers
– Simpler implementation, no mismatch between HDFS invariants and framework
• Supports different types of back-end storages
– org.apache.hadoop.FileSystem, blob stores, etc.
• Credentials hidden from client
– Only NN and DNs require credentials of external store
– HDFS can be used to enforce access controls for remote store
• Enables several policies to improve performance
– Set replication in FSImage to pre-fetch
– Read-through cache
– Actively pre-fetch while cluster is running
• Set StoragePolicy for the file to prefetch
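The last bullet (“Set StoragePolicy for the file to prefetch”) maps onto the existing storage-policy API; a minimal sketch with an illustrative path and the stock “HOT” policy name — the policy that actually pairs PROVIDED with local media is defined by the tiering work, not by stock HDFS.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class PrefetchByPolicy {
  public static void main(String[] args) throws Exception {
    // Assumes fs.defaultFS points at the HDFS cluster.
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());
    // Switching to a hotter policy asks the NameNode/Mover to place replicas on
    // local media instead of (or in addition to) the PROVIDED tier.
    dfs.setStoragePolicy(new Path("/archive/2017/01/events"), "HOT");
  }
}
```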
Handling out-of-band changes
• Nonce for correctness (e.g. ETag in S3 + GET If-Match: ETag)
• Asynchronously poll external store to reap deleted / discover new objects
– Integrate detected changes into the NN
– Update BlockMap on file creation/deletion
• Consensus, shared log, etc.
– Tighter NS integration complements provided store abstraction
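The “ETag in S3 + GET If-Match: ETag” point is plain HTTP semantics; a hedged sketch using java.net.HttpURLConnection against an illustrative object URL (real S3 access also needs request signing or a pre-signed URL, omitted here).

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ConditionalGet {
  public static void main(String[] args) throws Exception {
    // Illustrative URL and ETag; in S3 the ETag is recorded as the block's nonce.
    URL object = new URL("https://my-bucket.s3.amazonaws.com/c/d/f/z1");
    String expectedETag = "\"9b2cf535f27731c974343645a3985328\"";

    HttpURLConnection conn = (HttpURLConnection) object.openConnection();
    conn.setRequestProperty("If-Match", expectedETag);
    conn.setRequestProperty("Range", "bytes=0-1048575");   // one block's worth

    int status = conn.getResponseCode();
    if (status == 412) {
      // 412 Precondition Failed: the object changed out of band; the cached
      // alias is stale and the NameNode must refresh its view.
      System.out.println("nonce mismatch, not serving stale data");
    } else if (status == 200 || status == 206) {
      try (InputStream in = conn.getInputStream()) {
        // stream the block bytes to the client / local cache ...
      }
    }
    conn.disconnect();
  }
}
```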
Assumptions
• Churn is rare and relatively predictable
– Analytic workloads, ETL into external/cloud storage, compute in cluster
• Clusters are either consumers/producers for a subtree/region
– FileSystem has too little information to resolve conflicts
[Diagram: an ingest/ETL pipeline fills the Raw Data Bucket; the analytics cluster consumes it and writes to the Analytic Results Bucket.]
Implementation roadmap
• Read-only image (with periodic, naive refresh)
– ViewFS-based: NN configured to refresh from root
– Mount within an existing NN
– Refresh view of remote cluster and sync
• Write-through
– Cloud backup: no namespace in external store, replication only
– Return to writer only when data are committed to external store
• Write-back
– Lazily replicate to external store
• Dynamic Mounting
– Existing NN where an administrator wants to add tiered storage
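The first roadmap stage is “ViewFS-based”; for orientation, this is what a client-side ViewFS mount table looks like when set programmatically. Cluster name, paths, and bucket are illustrative, and the provided-storage mount itself is a server-side NN feature configured differently.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsMountExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Mount table "demo": /user lives in an HDFS cluster, /archive in an object store.
    conf.set("fs.viewfs.mounttable.demo.link./user", "hdfs://nn-a:8020/user");
    conf.set("fs.viewfs.mounttable.demo.link./archive", "s3a://my-bucket/archive");

    FileSystem viewFs = FileSystem.get(URI.create("viewfs://demo/"), conf);
    for (FileStatus st : viewFs.listStatus(new Path("/"))) {
      System.out.println(st.getPath());   // lists the mount points
    }
  }
}
```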
Dataworks Summit Hadoop San Jose, June 13-15
• Today’s talk was about the implemented read path
• This year’s SJ talk (co-presented with Microsoft) will be about the next steps:
– Write path
– Dynamic mounting
Resources + Q&A
• Tiered Storage HDFS-9806 [issues.apache.org]
– Design documentation
– List of subtasks, lots of linked tickets – take one!
– Discussion of scope, implementation, and feedback
• Joint work Microsoft – Western Digital
– {thomas.demoor, ewan.higgs}@wdc.com
– {cdoug,vijala}@microsoft.com
Bonus slide: Write & Dynamic mounting
• Write == multipart upload
• Mounting 1 Object Store into 1 Namenode read-only is relatively easy.
– Namenode can dictate the Block IDs.
• Mounting 1 Object Store into multiple Namenodes r/w is hard: the Key-Value store must be shared.
– Multiple Namenodes may disagree over the next Block ID to write.
• Dynamically unmounting 1 Object Store is hard
– Option 1: Blocks simply become inaccessible as though storage was removed. Not ideal!
– Option 2: Provided Blocks have a special header (like in Erasure Coding; See HDFS-10867).
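For context on “Write == multipart upload”: a minimal sketch with the AWS SDK for Java v1 (my illustration, not code from the talk); bucket and key names are made up, and in the tiered-storage design it would be the Datanode/Namenode, not the client, driving this.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.PartETag;
import com.amazonaws.services.s3.model.UploadPartRequest;

public class MultipartUploadSketch {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    String bucket = "my-bucket", key = "d/f/z1";

    // 1. Start the multipart upload; each HDFS block could become one part.
    String uploadId = s3.initiateMultipartUpload(
        new InitiateMultipartUploadRequest(bucket, key)).getUploadId();

    // 2. Upload parts (a single 5 MB part here, just to show the shape;
    //    non-final parts must be at least 5 MB).
    byte[] block = new byte[5 * 1024 * 1024];
    List<PartETag> etags = new ArrayList<>();
    etags.add(s3.uploadPart(new UploadPartRequest()
        .withBucketName(bucket).withKey(key).withUploadId(uploadId)
        .withPartNumber(1)
        .withInputStream(new ByteArrayInputStream(block))
        .withPartSize(block.length)).getPartETag());

    // 3. Complete: the object becomes visible atomically, which is what makes
    //    multipart upload a good fit for committing HDFS writes.
    s3.completeMultipartUpload(
        new CompleteMultipartUploadRequest(bucket, key, uploadId, etags));
  }
}
```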
Bonus slide: Authentication
• Cloud credentials use tokens
• HDFS credentials use user/Kerberos
• In the case of HDFS-first deployments, PROVIDED is useful because the authentication remains with HDFS.
• In the case of Cloud-first deployments, PROVIDED is useful because it provides a caching layer for reads and writes.
• With PROVIDED, applications deployed on HDFS-first systems can be redeployed on Cloud-first systems with no changes.


Editor's Notes

• #5 Please join us. We have a design document posted to JIRA and an active discussion of the implementation choices on HDFS-9806. We have a few minutes for questions, but please find us after the talk. There are far more details than we can possibly cover in a single presentation and we’re still setting the design, so we’re very open to collaboration. Thanks, and... let’s take a couple of questions.
  • #6 Hadoop gained traction by putting all of an org’s data in one place, in common formats, to be processed by common tools. Different applications get a consistent view of their data from HDFS. Data is protected and managed by a set of user and operator invariants that assign quotas, authenticate users, encrypt data, and distribute it across heterogeneous media. If you have only one source of data to process using that abstraction, then you get to enjoy nice things and the rest of us will sullenly resent you.
• #7 In most cases, these multiple clusters and different tiers of storage are managed today using two main techniques. The first is to use the framework: for example, people run distcp jobs to copy data over from one storage cluster to another. While this allows clients to process local copies of data, and leaves no visible intermediate state, it needs compute resources and manual capacity planning. The second is to use the application to handle multiple clusters: the application can be made aware of the fact that data is in multiple clusters, and it can read the data from each one separately while reasoning about the data’s consistency. However, now each application must implement techniques to coordinate these reads and authenticate to different sources, and this leaves us with no opportunities for transparent caching or prefetching to improve performance.
  • #8 Our proposal is to use the platform to manage multiple storage clusters. So, we propose to use the storage layer to manage the multiple external storages. This allows us to use different storages for multiple applications and users in a transparent manner, we can use local storage to cache data from remote storage and have a single uniform namespace across multiple storage systems, which can be in the same building or on the other side of the world, in the cloud. In this talk, we are going to describe how we can enable HDFS to do this – how we can mount external storage systems in HDFS. This allows us to exploit all the capabilities and features that HDFS supports such as quotas, and security in accessing the different storage systems.
  • #9 We have seen Hortonworks explaining how they see the evolution of the data lake and we agree with it. So this slide is heavily inspired by a slide in a deck that Chris Nauroth has presented in the past. A lot of organizations start off with a HDFS system and take snapshots which get backed up onto cloud storage. This is time consuming and allows for a window where data could potentially be lost. [] So we move to pull data in from the cloud storage directly and write to HDFS. This makes sense as you may need to reuse the results of any intermediate job in an analytic pipeline. When the pipeline is done, the data is staged out to the cloud storage. [] Finally we see people reading and writing directly to the cloud storage. It is now the single source of truth and HDFS is used as a caching layer between jobs in a pipeline. Is anyone working like this now? (Raise hands?) Of course, this isn’t the end of the evolution. It’s not the goal of a cloud native system…
  • #10 Application shouldn’t manage multiple storage systems : persistent & temp Write through (write back later) Load on-demand. Pre load / Pre-tier: mount provided store, change tier Auto tiering to HDD + Provided / Callback when finished
  • #11 cloud hadoop clusters can be permanent, with cloud as cold tier or ephemeral (kubernetes, …)
  • #14 hadoop clusters can be permanent, with cloud as cold tier
  • #15 We are waiting on Virajith to finish this slide. Regarding low utilisation: check out Christina Delimitrou’s work on Quasar: http://www.csl.cornell.edu/~delimitrou/papers/2014.asplos.quasar.pdf http://www.csl.cornell.edu/~delimitrou/slides/2014.asplos.quasar.slides.pdf
  • #16 There are a few challenges. These can be broadly be grouped into the read path and the write path. In the read path, we're mostly focused on caching and synchronizing changes to the object storage. In the write path, we're concerned with writing new blocks and dynamically mounting object stores. We consider this phase 2.
  • #18 To understand how this would work in practice, let’s look at a simple example where we want to access an external cloud storage through HDFS. Let’s ignore writes for now. -> Now suppose, this is the part of the namespace we want to -> mount in HDFS. -> if the mount is successful, we should be able to access data in the cloud through HDFS. That is -> if a client comes and requests for a particular file, say /d/e, from HDFS, then HDFS should be -> able to read the file from the external store, -> get back the data from the external store and -> stream the data back to the client. Now, I will hand it over to Chris to explain how we make all of this to happen using the PROVIDED abstraction I just introduced.
• #19 For anyone who isn’t familiar with HDFS, let’s review how it currently retrieves data. [] the NN will report all the local replicas, and NN will select a single PROV replica, say closest to the client. This avoids reporting every DN as a valid target, which is accurate, but not helpful for applications. [] The NN will look up the BlockInfo (and DatanodeStorageInfo) information in the Block Map and return this information to the client as part of the block location. This includes stuff like Datanode, Storage Type, etc. [] When the client requests the block from the DN, it looks up the data from the relevant storage types.
• #20  XXX CUT XXX We introduce a new provided storage type which will be a peer to existing storage types. So, in HDFS today, the NN is partitioned into a namespace (FSNamesystem) that maps files to block IDs, and the block lifecycle management in the BlockManager. Each file is a sequence of block IDs in the namespace. Each block ID is mapped to a list of replicas resident on a storage attached to a datanode. A storage is a device (DISK, SSD, memory region) attached to a Datanode. Because HDFS understands blocks, even for files in the provided storage, we have a similar mapping. However, we also need to have some mapping of these blocks and how data is laid out in the provided store. For this, replicas of a block in “provided” storage are mapped to an alias. An alias is simply a tuple: a reference resolvable in the namespace of the external store, and a nonce to verify that the reference still locates the data matching that block. If my external store is another FileSystem, then my reference may be a (URI, offset, length), and the nonce includes an inode/fileID, modification time, etc. Finally, we have provided volumes in Datanodes which are used to read and write data from the external store. The provided volume essentially implements a client that is capable of talking to the external store.
  • #21 Let’s drill down into an example. Assume we want to mount this external namespace into HDFS. Rather, this subtree. [] We can generate a mirror of the metadata as an FSImage (checkpoint of NN state). For every file, we also partition it into blocks, and store the reference in the blockmap with the corresponding nonce. [] Note that the image contains only the block IDs and storage policy, while the blockmap stores the block alias. So if file /c/d/e were 1GB, the image could record 4 logical blocks. For each block, the blockmap would record the reference (URI,offset,len) and a nonce (inodeId, LMT) sufficient to detect inconsistency.
  • #22 A quick note on block reports, if those are unfamiliar. By the way: if any of this is unfamiliar, please speak up The NN persists metadata about blocks, but their location in the cluster is reported by DNs. Each DN reports the volumes (HDD, SSD) attached to it, and a list of block IDs stored in each. At startup, the NN comes out of safe mode (starts accepting writes) when some fraction of its namespace is available. [] When a DN reports its provided storage, it does not send a full block report for the provided storage (which is, recall, a peer of its local media). It only reports that any block stored therein is reachable through it. As long as the NN has at least one DN reporting that provided storage, it considers all the blocks in the block map as reachable. The NN scans the block map to discover DN blocks in that provided storage. This is in contrast to some existing work supporting read-only replicas, where every DN sends a block report of the shared data, as when multiple DNs mount a filer.
  • #23 Inside the NN, we relax the invariant that a storage- a HDD/SDD- belongs to only one DN. So when a client requests the block locations for a file (here z1) [] the NN will report all the local replicas, and NN will select a single PROV replica, say closest to the client. This avoids reporting every DN as a valid target, which is accurate, but not helpful for applications. [] The NN will look up the Block Alias information in the Block Map and return this information to the client as part of the block location. [] When the client requests the PROV block from the DN, it passes the block alias given by the NN. This will also be in the BlockTokenIdentifier to prevent rogue users from editing the request information. [] request the block data from the external store [] and return the data to the client, having verified the nonce [] because the block is read through the DN, we can also cache the data as a local block.
  • #24 There are a few points worth calling out, here. * First, this is a relatively small change to HDFS. The only client-visible change adds a new storage type. As a user, this is simpler than coordinating with copying jobs. In our cloud example, all the cluster’s data is immediately available once it’s in the namespace, even if the replication policy hasn’t prefetched data into local media. * Second, particularly for read-only mounts, this is a narrow API to implement. For cloud backup scenarios- where the NN is the only writer to the namespace- then we only need the block to object ID map and NN metadata to mount a prefix/snapshot of the cluster. * Third, because the client reads through the DN, it can cache a copy of the block on read. Pointedly, the NN can direct the client to any DN that should cache a copy on read, opening some interesting combinations of placement policies and read-through caching. The DN isn’t necessarily the closest to the client, but it may follow another objective function or replication policy. * Finally, in our example the cloud credentials are hidden from the client. S3/WAS both authenticate clients to containers using a single key. Because HDFS owns and protects the external store’s credentials, the client only accesses data permitted by HDFS. Generally, we can use features of HDFS that aren’t directly supported by the backing store if we can define the mapping.
  • #25 It’s imperative that we never return the wrong data. If a file were overwritten in the backing store, we will never return part of the first file, and part of the second. The nonce is what we use to protect ourselves from that. But there needs to be some way to ingest new data into HDFS. If our external store has a namespace compatible with FS, then we can always scan it, but... while refresh is limited to scans, the view to the client can be inconsistent. A client may see some unpromoted output, some promoted output, and a sentinel file declaring it completely promoted. Better cooperation with external stores can tighten the namespace integration, to expose meaningful states. For example, if the external store could expose meaningful snapshots, then HDFS could move from one to the next, maintaining a read-only version while it updates. If the path remains valid while the NN updates itself, we have clearer guarantees. For anyone familiar with CORFU and Tango (MSR, Dahlia Malkhi, Mahesh Balakrishnan, Ted Wobber), or with WANdisco’s integration of their Paxos engine with the NN, we can make the metadata sync tight and meaningful. We still need the logic at the block layer we’re adding as provided storage. After correctness, we also need to be mindful of efficiency. Output is often promoted by renaming it, and if the NN were to interpret that as a deletion and creation, our HDFS cluster would discard blocks just to recopy them, right at the moment they are consumed. One of our goals is to conservatively identify these cases based on actual workloads.
  • #26 Since I mentioned strong consensus engines, this isn’t a “real” shared namespace. Even the read-only case is eventually consistent; in the base case we’re scanning the entire subtree in the external store. That’s obviously not workable, but most bigdata workloads don’t hit pathological cases. The typical year/month/day/hour layouts common to analytics clusters are mostly additive, and this is sufficient for that case. * When writes conflict, there is only so much the FS can do to merge conflicts. Set aside really complex cases like compactions; even simple cases may not merge cleanly. If a user creates a directory that is also present in the external store, can that be merged? Maybe not; successful creation might be gating access; many frameworks in the Hadoop ecosystem follow conventions that rely on atomicity of operations in HDFS. * The permissions, timestamps, or storage policy may not match, and there isn’t a “correct” answer for the merged result (absent application semantics). * So we assume that, generally (or by construction), clusters will be either producers or consumers for some part of the shared namespace. Fundamentally: no magic, here. We haven’t made any breakthroughs in consensus, but provided storage is a tractable solution that happens to cover some common cases/deployments in its early incarnations, and from a R&D perspective, some very interesting problems in the policy space. Please find us after the talk, we love to talk about this.
  • #27 The implementation will be staged. The read-only case is relatively straightforward; we implemented a proof-of-concept spread over a few weeks. A link is posted to JIRA. We will start with a NN managing an external store, merged using federation (ViewFS). This lets us defer the mounting logic, which would otherwise interfere with NN operation. We will then explore strategies for creating and maintaining mounts in the primary NN, alongside other data. For those familiar with the NN locking and the formidable challenge of relaxing it, note that most of the invariants we’d enforce don’t apply inside the mount. Quotas aren’t enforced, renames outside can be disallowed, etc. So it may be possible to embed this in the NN. Refresh will start as naive scans, then improve. Identifying subtrees that change and/or are accessed more frequently could improve the general case, but polling is fundamentally limited. Given some experience, we can recognize the common abstractions when tiering over stores that expose version information, snapshots, etc. and write some tighter integrations. Writes are complex, so we will move from working system to working system. We’re wiring the PROV type into the DN state machines, so the write-through case should be tractable, particularly when the external store behind the provided abstraction is an object store. Ultimately, we’d like to use local storage to batch- or even prioritize- writes to the external store. Because HDFS sits between the client and the external store: if we have limited bandwidth, want to apply cost or priority models, etc. these can be embedded in HDFS.
  • #29 Please join us. We have a design document posted to JIRA, an active discussion of the implementation choices, and we’ll be starting a branch to host these changes. The existing work on READ_ONLY_SHARED replicas has a superlative design doc, if you want to contribute but need some orientation in the internal details. We have a few minutes for questions, but please find us after the talk. There are far more details than we can possibly cover in a single presentation and we’re still setting the design, so we’re very open to collaboration. Thanks, and... let’s take a couple questions.