HDFS Tiered Storage
Chris Douglas, Virajith Jalaparti
Microsoft CISL
  1. HDFS Tiered Storage
     Chris Douglas, Virajith Jalaparti
     Microsoft CISL
  2. >id
     Microsoft Cloud and Information Services Lab (CISL)
       Applied research group in large-scale systems and machine learning
     Contributions to Apache Hadoop YARN
       Preemption, reservations/planning, federation, distributed scheduling
     Apache REEF: control-plane for big data systems
     Chris Douglas (cdoug@microsoft.com)
       Contributor to Apache Hadoop since 2007, member of its PMC
     Virajith Jalaparti (vijala@microsoft.com)
  3. Data in Hadoop
     All data in one place
     Tools written against abstractions
       Compatible FileSystems (Azure/S3/etc.)
     Multi-tenant
     Management APIs: quotas, auth, encryption, media
     Works well if all data is in one cluster
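     The "tools written against abstractions" point is the org.apache.hadoop.fs.FileSystem API: the same client code runs against HDFS, Azure, S3, etc. by switching the URI. A minimal sketch, assuming placeholder URIs and that the wasb:// connector is on the classpath:

        import java.net.URI;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class SchemeAgnosticListing {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Same client code, different backing stores; only the URI changes.
            // (URIs are placeholders for illustration.)
            for (String uri : new String[] {
                "hdfs://a/", "wasb://container@account.blob.core.windows.net/"}) {
              FileSystem fs = FileSystem.get(URI.create(uri), conf);
              // List the root of each store through the same abstraction.
              for (FileStatus st : fs.listStatus(new Path("/"))) {
                System.out.println(uri + " -> " + st.getPath());
              }
            }
          }
        }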
  4. In most cases, we have multiple clusters…
     Multiple storage clusters
       Production/research partitioning
       Compliance and regulatory restrictions
       Datasets can be shared
     Geographically distributed clusters
       Disaster recovery
       Cloud backup/hybrid clouds
     Heterogeneous storage tiers in a cluster
     (Diagram: two compute + storage clusters, hdfs://a/ and hdfs://b/, alongside a cloud store, wasb://…)
  5. Managing multiple clusters: Today
     Using the framework
       Copy data (distcp) between clusters
       (+) Clients process local copies, no visible partial copies
       (-) Uses compute resources, requires capacity planning
     Using the application
       Directly access data in multiple clusters
       (+) Consistency managed at client
       (-) Auth to all data sources, consistency is hard, no opportunities for transparent caching
     (Diagram: dataset A copied from hdfs://a/ to hdfs://b/, vs. an application reading/writing hdfs://a/ and hdfs://b/ directly)
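     A minimal sketch of the "using the application" path: the client authenticates to both clusters and copies data itself, which is why consistency and partial copies become the client's problem. Paths and cluster URIs are placeholders; the "using the framework" path would instead submit a distcp job.

        import java.net.URI;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;

        public class CrossClusterCopy {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The client talks to, and must authenticate against, both clusters.
            FileSystem src = FileSystem.get(URI.create("hdfs://a/"), conf);
            FileSystem dst = FileSystem.get(URI.create("hdfs://b/"), conf);
            Path from = new Path("/data/A/part-00000");  // placeholder paths
            Path to = new Path("/data/A/part-00000");
            // No transparent caching; a failure here leaves a partial copy
            // that the client itself must detect and clean up.
            IOUtils.copyBytes(src.open(from), dst.create(to),
                conf.getInt("io.file.buffer.size", 4096), true);
          }
        }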
  6. Managing multiple clusters: Our proposal
     Tiering: Using the platform
       Synchronize storage with remote namespace
       (+) Transparent to users, caching/prefetching, unified namespace
       (-) Conflicts may be unresolvable
     Use HDFS to coordinate external storage
       No capability or performance gap
       Support for heterogeneous media (RAM/SSD/DISK), rebalancing, security, quotas, etc.
     (Diagram: dataset A in hdfs://a/ mounted into hdfs://b/ and accessed r/w through the mount)
  7. Challenges
     Synchronize metadata without copying data
       Dynamically page in “blocks” on demand
       Define policies to prefetch and evict local replicas
     Mirror changes in remote namespace
       Handle out-of-band churn in remote storage
       Avoid dropping valid, cached data (e.g., rename)
     Handle writes consistently
       Writes committed to the backing store must “make sense”
  8. Proposal: Provided Storage Type
     Peer to RAM, SSD, DISK in HDFS (HDFS-2832)
     Data in external store mapped to HDFS blocks
       Each block associated with an Alias = (REF, nonce)
         Used to map blocks to external data
         Nonce used to detect changes on backing store
         E.g.: REF = (file URI, offset, length); nonce = GUID
       Mapping stored in a BlockMap
         KV store accessible by NN and all DNs
     ProvidedVolume on Datanodes reads/writes data from/to external store
     (Diagram: the NN's FSNamesystem maps /a/foo → {b_i, …, b_j} and /adl/bar → {b_k, …, b_l}; the BlockManager maps b_i → {s1, s2, s3} and b_k → {s_PROVIDED}; the BlockMap maps b_k → Alias_k; DNs expose RAM_DISK, SSD, DISK, and PROVIDED storages, the last backed by the external store)
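     A sketch of the Alias and BlockMap described above, following the slide's Alias = (REF, nonce) with REF = (file URI, offset, length); the Java names and the exact interface are illustrative, not the HDFS-9806 classes:

        import java.net.URI;

        // Per-block alias from the slide: REF = (file URI, offset, length),
        // plus a nonce to detect out-of-band change in the backing store.
        final class BlockAlias {
          final URI file;      // location of the data in the external store
          final long offset;   // byte offset of this block within the external file
          final long length;   // number of bytes backing the HDFS block
          final String nonce;  // e.g., a GUID/etag captured when the alias was created

          BlockAlias(URI file, long offset, long length, String nonce) {
            this.file = file;
            this.offset = offset;
            this.length = length;
            this.nonce = nonce;
          }
        }

        // Illustrative view of the BlockMap: a KV store keyed by HDFS block id,
        // readable by the NN and by every DN with a PROVIDED volume.
        interface BlockMap {
          BlockAlias get(long blockId);              // resolve a block to external data
          void put(long blockId, BlockAlias alias);  // record a mapping on create
          void remove(long blockId);                 // drop a mapping on delete
        }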
  9. Example: Using an immutable cloud store
     (Diagram: an external namespace ext://nn with root / and children a, b, c; the subtree c/d, containing e, f, g, is mounted into the HDFS cluster. A client issues read(/d/e) to the NN; a DN issues read(/c/d/e) against the external store and streams the file data back to the client.)
  10. Example: Using an immutable cloud store
     Create FSImage and BlockMap
     Block StoragePolicy can be set as required, e.g. {rep=2, PROVIDED, DISK}
     FSImage:
       /d/e    → {b_1, b_2, …}
       /d/f/z1 → {b_i, b_i+1, …}
       …
       b_i → {rep = 1, PROVIDED}
       …
     BlockMap:
       b_i   → {(ext://nn/c/d/f/z1, 0, L), inodeId1}
       b_i+1 → {(ext://nn/c/d/f/z1, L, 2L), inodeId1}
       …
     (Diagram: external namespace ext://nn as in slide 9)
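     A sketch of the "create FSImage and BlockMap" step: walk the external subtree and carve each file into block-sized aliases, emitting one FSImage entry per file and one BlockMap entry per block. The tool name, block size, and text output are assumptions for illustration; the HDFS-9806 subtasks define the actual image format:

        import java.net.URI;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.LocatedFileStatus;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.fs.RemoteIterator;

        public class GenerateProvidedImage {
          static final long L = 128L * 1024 * 1024;  // assumed HDFS block size

          public static void main(String[] args) throws Exception {
            // args[0]: URI of the external store (ext://nn on the slide; in practice
            //          any Hadoop-compatible store, e.g. wasb:// or s3a://)
            // args[1]: subtree to mount, e.g. /c/d
            FileSystem remote = FileSystem.get(URI.create(args[0]), new Configuration());
            long blockId = 1;
            RemoteIterator<LocatedFileStatus> it = remote.listFiles(new Path(args[1]), true);
            while (it.hasNext()) {
              LocatedFileStatus f = it.next();
              StringBuilder blocks = new StringBuilder();
              for (long off = 0; off < f.getLen(); off += L, blockId++) {
                long len = Math.min(L, f.getLen() - off);
                blocks.append(" b").append(blockId);
                // BlockMap entry: block id -> (file URI, offset, length) alias.
                System.out.println("BlockMap: b" + blockId + " -> {"
                    + f.getPath().toUri() + ", " + off + ", " + len + "}");
              }
              // FSImage entry: file -> ordered list of PROVIDED blocks, rep = 1.
              System.out.println("FSImage:  " + f.getPath() + " -> {" + blocks + " }");
            }
          }
        }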
  11. Example: Using an immutable cloud store
     Start NN with the FSImage
     Replication > 1 starts copying to local media
     All blocks reachable from NN once a DN with PROVIDED storage heartbeats in
       In contrast to READ_ONLY_SHARED (HDFS-5318)
     (Diagram: the NN's BlockManager and the mounted subtree d/{e, f, g}, served by DN1 and DN2 and backed by the external namespace from slide 9)
  12. Example: Using an immutable cloud store
     Block locations stored as a composite DN
       Contains all DNs with the storage configured
       Resolved in getBlockLocation() to a single DN
     DN looks up block in BlockMap, uses Alias to read from external store
     Data can be cached locally as it is read (read-through cache)
     (Diagram: DFSClient calls getBlockLocation("/d/f/z1", 0, L); the NN returns LocatedBlocks {{DN2, b_i, PROVIDED}}; DN2 looks up b_i in the BlockMap, obtaining ("/c/d/f/z1", 0, L, GUID1), and reads from the external store)
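     A sketch of the DN-side read once the composite DN has been resolved: look up the block's Alias, open the external file, seek to the alias offset, and stream length bytes back to the client (optionally teeing them into a local replica for the read-through cache). All class and method names here are illustrative:

        import java.io.OutputStream;
        import java.net.URI;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class ProvidedBlockReader {

          // Stand-in for the result of a BlockMap lookup: (URI, offset, length, nonce).
          static final class Alias {
            final URI file; final long offset; final long length; final String nonce;
            Alias(URI file, long offset, long length, String nonce) {
              this.file = file; this.offset = offset; this.length = length; this.nonce = nonce;
            }
          }

          // Serve one PROVIDED block to the client from the external store.
          static void readBlock(Alias alias, OutputStream toClient, Configuration conf)
              throws Exception {
            FileSystem external = FileSystem.get(alias.file, conf);
            try (FSDataInputStream in = external.open(new Path(alias.file))) {
              in.seek(alias.offset);          // start of this block in the external file
              byte[] buf = new byte[64 * 1024];
              long remaining = alias.length;
              while (remaining > 0) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n < 0) break;             // external file shorter than expected
                toClient.write(buf, 0, n);    // could also be written to a local replica
                remaining -= n;
              }
            }
          }
        }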
  13. Benefits of the PROVIDED design
     Use existing HDFS features to enforce quotas, limits on storage tiers
       Simpler implementation, no mismatch between HDFS invariants and framework
     Supports different types of back-end stores
       org.apache.hadoop.fs.FileSystem, blob stores, etc.
     Enables several policies to improve performance
       Set replication in FSImage to pre-fetch
       Read-through cache
       Actively pre-fetch while cluster is running
         Set StoragePolicy for the file to prefetch
     Credentials hidden from client
       Only NN and DNs require credentials of external store
       HDFS can be used to enforce access controls for the remote store
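     A minimal sketch of the "actively pre-fetch" policy: raise the replication factor and set a storage policy that spans PROVIDED and local DISK, so the NN schedules copies onto local media while the cluster runs. The policy name below is a placeholder for whatever the proposal defines; the file path is from the earlier example:

        import java.net.URI;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class PrefetchFromProvided {
          public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(URI.create("hdfs://b/"), new Configuration());
            Path hot = new Path("/d/f/z1");   // file mounted from the external store
            // Ask for two replicas: the PROVIDED "replica" plus one on local media.
            fs.setReplication(hot, (short) 2);
            // Placeholder policy name: a policy spanning PROVIDED and DISK,
            // as in the {rep=2, PROVIDED, DISK} example on slide 10.
            fs.setStoragePolicy(hot, "PROVIDED");
          }
        }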
  14. Handling out-of-band changes
     Nonce for correctness
     Asynchronously poll external store
       Integrate detected changes to the NN
       Update BlockMap on file creation/deletion
       Consensus, shared log, etc.
     Tighter NS integration complements the provided store abstraction
       Operations like rename can cause unnecessary evictions
       Heuristics based on common rename scenarios (e.g., output promotion) to assign block ids
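     One way to read the nonce check, as a sketch: before serving or refreshing an alias, recompute a fingerprint of the external file and compare it with the nonce recorded in the BlockMap; a mismatch means out-of-band churn, so the cached mapping must not be trusted. The slide's GUID would come from the store itself (e.g., an object version or etag); the modification-time/length fingerprint below is only a stand-in:

        import java.net.URI;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class NonceCheck {

          // Stand-in nonce: a real store would expose a version id or etag.
          static String fingerprint(FileStatus st) {
            return st.getModificationTime() + ":" + st.getLen();
          }

          // True if the external file still matches the nonce recorded when the
          // alias was created; false indicates out-of-band change in the store.
          static boolean aliasStillValid(URI file, String recordedNonce, Configuration conf)
              throws Exception {
            FileSystem external = FileSystem.get(file, conf);
            FileStatus current = external.getFileStatus(new Path(file));
            return fingerprint(current).equals(recordedNonce);
          }
        }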
  15. Assumptions
     Churn is rare and relatively predictable
       Analytic workloads, ETL into external/cloud storage, compute in cluster
       Clusters are either consumers or producers for a subtree/region
     FileSystem has too little information to resolve conflicts
       Clients can recognize/ignore inconsistent states
       External stores can tighten these semantics
       Independent of PROVIDED storage
  16. Implementation roadmap
     Read-only image (with periodic, naive refresh)
       ViewFS-based: NN configured to refresh from root
     Mount within an existing NN
       Refresh view of remote cluster and sync
     Write-through
       Cloud backup: no namespace in external store, replication only
       Return to writer only when data are committed to external store
     Write-back
       Lazily replicate to external store
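     For the read-only, ViewFS-based stage, a client-side mount table can stitch the regular namespace together with the mounted subtree. The configuration keys below are the standard ViewFS ones; the mount-table name and link targets are placeholders, and the NN-side refresh-from-root behavior in the roadmap is not shown:

        import org.apache.hadoop.conf.Configuration;

        public class ViewFsMountSketch {
          public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Route through the ViewFS mount table named "tiered" (placeholder name).
            conf.set("fs.defaultFS", "viewfs://tiered/");
            // Regular HDFS namespace for everything under /user.
            conf.set("fs.viewfs.mounttable.tiered.link./user", "hdfs://b/user");
            // Subtree served by the NN that carries the PROVIDED image of
            // ext://nn/c/d (slides 9-12), surfaced read-only at /d.
            conf.set("fs.viewfs.mounttable.tiered.link./d", "hdfs://b/d");
            // Clients now resolve viewfs://tiered/d/f/z1 through this mount table.
          }
        }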
  17. Resources
     Tiered Storage HDFS-9806 [issues.apache.org]
       Design documentation
       List of subtasks – take one!
       Discussion of scope, implementation, and feedback
     Read-only replicas HDFS-5318 [issues.apache.org]
       Related READ_ONLY_SHARED work; excellent design doc
     {cdoug,vijala}@microsoft.com
  18. Alternative approaches: Client-driven tiering
     Existing solutions: ViewFS/HADOOP-12077
     Challenges
       Maintain synchronized client views
       Enforcing storage quotas, rate-limiting reads, etc. falls on the client
       Clients need sufficient privileges to read/write data
       Client is responsible for keeping the system in a consistent state
         Need to recover partially completed operations from other clients
