Virajith Jalaparti and Ashvin Agrawal of Microsoft present their work on supporting mounted remote stores in HDFS. They show how HDFS can act as a caching proxy for remote stores such as ADLS and S3, letting clients remain unaware of where their data lives while improving efficiency in the process.
This is taken from the Apache Hadoop Contributors Meetup on January 30, hosted by LinkedIn in Mountain View.
2. Motivation: Multiple clusters are inevitable
Data rarely lives in one cluster
◦ Compliance & Regulatory restrictions
◦ Production / Research partitions
◦ Heterogeneous access needs
◦ Backup / Archival
Storage–Compute disaggregation is prevalent
◦ Public Cloud deployments
◦ Serverless Computing
◦ Ephemeral Hadoop clusters ($$)
3. Available Solutions
Hadoop Compatible FS: s3a://, wasb://, adl://, …
Direct IO between Hadoop apps and remote stores
HDFS NameNode functions taken up by Object Store
HDFS managed external stores [HDFS-9806]
4. Enabled using the PROVIDED StorageType
Peer to RAM, SSD, DISK in HDFS (HDFS-2832)
Data in the remote store mapped to HDFS blocks on PROVIDED storage
◦ Each block associated with a BlockAlias = (REF, nonce); see the sketch below
◦ Nonce used to detect changes on the external store
◦ REF = (file URI, offset, length); nonce = GUID
◦ e.g., REF = (s3a://bucket/file, 0, 1024); nonce = <ETag>
◦ Mapping stored in an AliasMap
◦ Can use a KV store external to, or inside, the NN
PROVIDED volume on DataNodes reads/writes data from/to the remote store
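To make the alias concrete, here is a minimal Java sketch of the (REF, nonce) pair; the class and field names are illustrative, not the actual internals of HDFS-9806.

    import java.net.URI;

    /** REF: a location resolvable in the remote store's namespace (illustrative). */
    class BlockRef {
        final URI file;    // e.g. s3a://bucket/file
        final long offset; // byte offset of this block within the remote file
        final long length; // block length, e.g. 1024

        BlockRef(URI file, long offset, long length) {
            this.file = file;
            this.offset = offset;
            this.length = length;
        }
    }

    /** BlockAlias = (REF, nonce); the nonce detects out-of-band changes. */
    class BlockAlias {
        final BlockRef ref;
        final String nonce; // a GUID, an inode/file ID, or an S3 ETag

        BlockAlias(BlockRef ref, String nonce) {
            this.ref = ref;
            this.nonce = nonce;
        }
    }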
[Diagram: inside the NN, the FSNamesystem maps /a/foo to blocks b_i, …, b_j and /remote/bar to blocks b_k, …, b_l; the BlockManager maps b_i to storages {s1, s2, s3} and b_k to {s_PROVIDED}; the AliasMap maps b_k to Alias_k. DataNodes DN1 and DN2 carry RAM, SSD, DISK, and PROVIDED storage, with the PROVIDED volume backed by the remote store.]
5. Benefits of the PROVIDED design
Use existing HDFS features to enforce quotas and limits on storage tiers
No change to existing HDFS workflows
◦ Hadoop apps interact with HDFS as before (fully transparent)
◦ Data tiering happens asynchronously in the background
Supports different types of back-end stores
◦ org.apache.hadoop.fs.FileSystem implementations, blob stores, etc.
6. Benefits of the PROVIDED design
Enables several policies to improve performance
◦ Set replication in the FSImage to prefetch (see the sketch below)
◦ Read-through cache
Credentials hidden from the client
◦ Only the NN and DNs require credentials for the external store
◦ HDFS can be used to enforce access controls for the remote store
Pain point: scale is tied to the HDFS NameNode
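As a rough illustration of the prefetch policy, the public FileSystem API can raise a file's replication factor, nudging HDFS to copy remote-only data onto local media; the path below is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PrefetchByReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Asking for 3 replicas of a file whose only replica is PROVIDED
            // causes HDFS to materialize copies on local media.
            fs.setReplication(new Path("/remote/bar"), (short) 3);
        }
    }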
7. Provided Storage in Hadoop 3.1
1. Set the Provided storage config in hdfs-site.xml
2. Distribute the config xml to all NNs and DNs
3. Create the FSImage
4. Start the cluster
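A minimal sketch of step 1 follows; the keys are taken from the Hadoop 3.1 provided-storage documentation, but verify the names against your release. They would normally be written into hdfs-site.xml rather than set in code, and the bucket path is hypothetical.

    import org.apache.hadoop.conf.Configuration;

    public class ProvidedStorageConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Enable PROVIDED storage on the NameNode.
            conf.setBoolean("dfs.namenode.provided.enabled", true);
            // AliasMap implementation; a text-file-backed KV store here.
            conf.set("dfs.provided.aliasmap.class",
                "org.apache.hadoop.hdfs.server.common.blockaliasmap.impl.TextFileRegionAliasMap");
            // DataNode volumes: one local DISK volume plus one PROVIDED
            // volume backed by the remote store.
            conf.set("dfs.datanode.data.dir",
                "[DISK]/mnt/data/dn,[PROVIDED]s3a://bucket/data/");
        }
    }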
8. Status
3.1
• Unified Namespace
• Alias Map
• On-Demand Remote Reads
Today
• Dynamic Mounts
• Multiple Storage & Config Isolation
• Security
• HA support
• Writes and Backup
Future
• Efficient Refresh
• Cache and Prefetch Policies
• Performance optimizations
10. NameNode changes for mounting remote stores
• User provides mount info to the NN
• NN dynamically creates required INodes and Blocks
• NN distributes mount info to DNs in response to heartbeats
• NN persists mount info in xAttrs and the edit log for HA (rough sketch below)
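The NN performs this persistence internally, but as a rough picture of what "mount info in xAttrs" means, the public xattr API can store a similar record; the xattr name and record format here are hypothetical.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MountInfoXAttr {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path mountPoint = new Path("/remote/bar");
            byte[] mountInfo = "remote=s3a://bucket/data".getBytes(StandardCharsets.UTF_8);
            // An xattr survives restarts and is replayed from the edit log,
            // which is what makes the mount recoverable under HA.
            fs.setXAttr(mountPoint, "user.provided.mount", mountInfo);
        }
    }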
11. DataNode changes for mounting remote stores
• Receives Provided Storage UUID and metadata from NN
• Creates and activates Provided Volume
• Volume is part of reports to NN
• Replicas loaded from AliasMap “on-demand”
12. Security
Remote store credentials need to be distributed to all nodes
Data is accessed with these credentials; no passthrough
Authorization
◦ ACLs mirrored from the remote store
◦ Pluggable ACL mapping policy (see the sketch below)
Authentication
◦ Federated Kerberos
◦ OAuth-based tokens, e.g. ADLS client credentials
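One plausible shape for the pluggable ACL mapping policy: translate the remote store's permission model into HDFS ACL entries at mount time. This interface is illustrative, not an actual HDFS extension point.

    import java.util.List;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.permission.AclEntry;

    /** Hypothetical plug-in point for mirroring remote ACLs into HDFS. */
    public interface RemoteAclMapper {
        /** Map a remote object's metadata to the HDFS ACL entries to apply. */
        List<AclEntry> toHdfsAcl(FileStatus remoteStatus);
    }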
13. Netco: Workload-aware Cache policies [SOCC ’17]
Learn characteristics of recurring jobs
Prefetch datasets when possible
Cache data for the jobs that benefit from it the most
In today’s age of Big Data, an organization’s data typically lives across multiple clusters; this is the norm, not the exception. The reasons vary: regulatory or compliance restrictions, separate clusters for production and test or research workloads, or different divisions owning and operating their own clusters.
With the cloud, this becomes even more prevalent because of storage-compute disaggregation. Most cloud providers separate the compute layer from their primary storage offerings because of the benefits such an architecture brings. This, along with the advent of serverless computing, requires applications to access data that is not co-located with, or managed by, the same cluster.
For this, we introduce a new storage type called Provided which will be a peer to existing storage types. The Provided storage type is used to refer to data in the remote store.
-> So, Datanodes can now support 4 kinds of storage types.
-> Data in the remote store is mapped to HDFS blocks on provided storage.
So, in HDFS today, the NN is partitioned into a namespace (FSNamesystem), which maps files to sequences of block IDs, and the BlockManager, which is responsible for block lifecycle management and for maintaining the locations of the blocks of each file. In this example, the file /a/foo is mapped to blocks with IDs b_i to b_j. Each block ID is mapped to a list of replicas resident on storages attached to DataNodes; for example, block b_i is mapped to storages s1, s2, and s3.
-> As HDFS understands blocks, we use a similar mapping for files in provided storage. So, a file /remote/bar is mapped to blocks b_k to b_l, and each of these blocks is mapped to a provided storage.
-> However, this is not sufficient to locate the data in the remote store. We need some mapping between these blocks and how the data is laid out in the remote store. For this, every block in “provided” storage is mapped to an alias. An alias is simply a tuple: a reference, which is something resolvable in the namespace of the remote store, and a nonce to verify that the reference still locates the data matching that block.
For example, if the remote store is another FileSystem, the reference may be a (URI, offset, length) triple, and the nonce can be a GUID like an inode or file ID. If the remote store is a blob store like S3, the REF can be a (blob name, offset, length) triple, and the nonce can be an ETag.
-> We also maintain an AliasMap, which contains the mapping between block IDs and their aliases. This can live in the NN or in an external KV store.
-> Finally, we have provided volumes on DataNodes, which are used to read and write data from the external store. The provided volume essentially implements a client capable of talking to the external store.
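Assuming the remote store is exposed through org.apache.hadoop.fs.FileSystem, a provided volume's read path might look roughly like this; the (URI, offset, length) fields mirror the hypothetical reference described above.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ProvidedVolumeRead {
        /** Fetch the bytes of one PROVIDED block from the remote store. */
        static byte[] readBlock(URI file, long offset, int length, Configuration conf)
                throws Exception {
            byte[] block = new byte[length];
            FileSystem remote = FileSystem.get(file, conf); // e.g. an s3a:// filesystem
            try (FSDataInputStream in = remote.open(new Path(file))) {
                in.readFully(offset, block); // positioned read of exactly one block
            }
            return block;
        }
    }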
Summary: the AliasMap helps us map HDFS metadata to metadata on the remote store, and the PROVIDED storage type helps HDFS understand that the data is actually remote.
There are a few points worth calling out here.
* First, this is a relatively small change to HDFS. The only client-visible change is a new storage type. As a user, this is simpler than coordinating copy jobs. In our cloud example, all the cluster’s data is immediately available once it’s in the namespace, even if the replication policy hasn’t prefetched data onto local media.
* Second, particularly for read-only mounts, this is a narrow API to implement. For cloud backup scenarios, where the NN is the only writer to the namespace, we only need the block-to-object-ID map and NN metadata to mount a prefix/snapshot of the cluster.
* Third, because the client reads through the DN, the DN can cache a copy of the block on read. Notably, the NN can direct the client to any DN that should cache a copy on read, opening up interesting combinations of placement policies and read-through caching. That DN isn’t necessarily the closest to the client, but it may follow another objective function or replication policy.
* Finally, in our example the cloud credentials are hidden from the client. S3/WAS both authenticate clients to containers using a single key. Because HDFS owns and protects the external store’s credentials, the client only accesses data permitted by HDFS. Generally, we can use HDFS features that aren’t directly supported by the backing store, as long as we can define the mapping.