Virajith Jalaparti and Ashvin Agrawal of Microsoft present their work on supporting mounted remote stores in HDFS. They show how HDFS can act as a caching proxy for remote stores such as ADLS and S3, letting clients remain unaware of where their data lives while improving efficiency in the process.
This is taken from the Apache Hadoop Contributors Meetup on January 30, hosted by LinkedIn in Mountain View.
2. Motivation: Multiple clusters are inevitable
Data rarely lives in one cluster
◦ Compliance & Regulatory restrictions
◦ Production / Research partitions
◦ Heterogeneous access needs
◦ Backup / Archival
Storage–Compute disaggregation is prevalent
◦ Public Cloud deployments
◦ Serverless Computing
◦ Ephemeral Hadoop clusters ($$)
3. Available Solutions
Hadoop Compatible FS: s3a://, wasb://, adl://, …
Direct IO between Hadoop apps and remote stores
HDFS NameNode functions taken up by Object Store
HDFS managed external stores [HDFS-9806]
4. Enabled using the PROVIDED StorageType
Peer to RAM, SSD, DISK in HDFS (HDFS-2832)
Data in the remote store mapped to HDFS blocks on PROVIDED storage
◦ Each block associated with a BlockAlias = (REF, nonce); see the sketch below
◦ Nonce used to detect changes on the external store
◦ REF = (file URI, offset, length); nonce = GUID
◦ e.g., REF = (s3a://bucket/file, 0, 1024); nonce = <ETag>
◦ Mapping stored in an AliasMap
◦ Can use a KV store external to, or inside, the NN
PROVIDED volume on DataNodes reads/writes data from/to the remote store
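To make the alias concrete, here is a minimal Java sketch of the (REF, nonce) pair; the class and field names are illustrative, not the actual internals of HDFS-9806.

    import java.net.URI;

    /** REF: a location resolvable in the remote store's namespace (illustrative). */
    class BlockRef {
        final URI file;    // e.g. s3a://bucket/file
        final long offset; // byte offset of this block within the remote file
        final long length; // block length, e.g. 1024

        BlockRef(URI file, long offset, long length) {
            this.file = file;
            this.offset = offset;
            this.length = length;
        }
    }

    /** BlockAlias = (REF, nonce); the nonce detects out-of-band changes. */
    class BlockAlias {
        final BlockRef ref;
        final String nonce; // a GUID, an inode/file ID, or an S3 ETag

        BlockAlias(BlockRef ref, String nonce) {
            this.ref = ref;
            this.nonce = nonce;
        }
    }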
[Diagram: inside the NN, the FSNamesystem maps /a/foo to blocks b_i, …, b_j and /remote/bar to blocks b_k, …, b_l; the BlockManager maps b_i to storages {s1, s2, s3} and b_k to {s_PROVIDED}; the AliasMap maps b_k to Alias_k. DataNodes DN1 and DN2 carry RAM, SSD, DISK, and PROVIDED storage, with the PROVIDED volume backed by the remote store.]
5. Benefits of the PROVIDED design
Use existing HDFS features to enforce quotas and limits on storage tiers
No change to existing HDFS workflows
◦ Hadoop apps interact with HDFS as before (fully transparent)
◦ Data tiering happens asynchronously in the background
Supports different types of back-end stores
◦ org.apache.hadoop.fs.FileSystem implementations, blob stores, etc.
6. Benefits of the PROVIDED design
Enables several policies to improve performance
◦ Set replication in the FSImage to prefetch (see the sketch below)
◦ Read-through cache
Credentials hidden from the client
◦ Only the NN and DNs require credentials for the external store
◦ HDFS can be used to enforce access controls for the remote store
Pain point: scale is tied to the HDFS NameNode
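As a rough illustration of the prefetch policy, the public FileSystem API can raise a file's replication factor, nudging HDFS to copy remote-only data onto local media; the path below is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PrefetchByReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Asking for 3 replicas of a file whose only replica is PROVIDED
            // causes HDFS to materialize copies on local media.
            fs.setReplication(new Path("/remote/bar"), (short) 3);
        }
    }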
7. Provided Storage in Hadoop 3.1
1. Set the Provided storage config in hdfs-site.xml
2. Distribute the config xml to all NNs and DNs
3. Create the FSImage
4. Start the cluster
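A minimal sketch of step 1 follows; the keys are taken from the Hadoop 3.1 provided-storage documentation, but verify the names against your release. They would normally be written into hdfs-site.xml rather than set in code, and the bucket path is hypothetical.

    import org.apache.hadoop.conf.Configuration;

    public class ProvidedStorageConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Enable PROVIDED storage on the NameNode.
            conf.setBoolean("dfs.namenode.provided.enabled", true);
            // AliasMap implementation; a text-file-backed KV store here.
            conf.set("dfs.provided.aliasmap.class",
                "org.apache.hadoop.hdfs.server.common.blockaliasmap.impl.TextFileRegionAliasMap");
            // DataNode volumes: one local DISK volume plus one PROVIDED
            // volume backed by the remote store.
            conf.set("dfs.datanode.data.dir",
                "[DISK]/mnt/data/dn,[PROVIDED]s3a://bucket/data/");
        }
    }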
8. Status
3.1
• Unified Namespace
• Alias Map
• On-Demand Remote Reads
Today
• Dynamic Mounts
• Multiple Storage & Config Isolation
• Security
• HA support
• Writes and Backup
Future
• Efficient Refresh
• Cache and Prefetch Policies
• Performance optimizations
10. NameNode changes for mounting remote stores
• User provides mount info to the NN
• NN dynamically creates required INodes and Blocks
• NN distributes mount info to DNs in response to heartbeats
• NN persists mount info in xAttrs and the edit log for HA (rough sketch below)
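The NN performs this persistence internally, but as a rough picture of what "mount info in xAttrs" means, the public xattr API can store a similar record; the xattr name and record format here are hypothetical.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MountInfoXAttr {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path mountPoint = new Path("/remote/bar");
            byte[] mountInfo = "remote=s3a://bucket/data".getBytes(StandardCharsets.UTF_8);
            // An xattr survives restarts and is replayed from the edit log,
            // which is what makes the mount recoverable under HA.
            fs.setXAttr(mountPoint, "user.provided.mount", mountInfo);
        }
    }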
11. DataNode changes for mounting remote stores
• Receives Provided Storage UUID and metadata from NN
• Creates and activates Provided Volume
• Volume is part of reports to NN
• Replicas loaded from AliasMap “on-demand”
12. Security
Remote store credentials need to be distributed to all nodes
Data is accessed with these credentials; no passthrough
Authorization
◦ ACLs mirrored from the remote store
◦ Pluggable ACL mapping policy (see the sketch below)
Authentication
◦ Federated Kerberos
◦ OAuth-based tokens, e.g. ADLS client credentials
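One plausible shape for the pluggable ACL mapping policy: translate the remote store's permission model into HDFS ACL entries at mount time. This interface is illustrative, not an actual HDFS extension point.

    import java.util.List;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.permission.AclEntry;

    /** Hypothetical plug-in point for mirroring remote ACLs into HDFS. */
    public interface RemoteAclMapper {
        /** Map a remote object's metadata to the HDFS ACL entries to apply. */
        List<AclEntry> toHdfsAcl(FileStatus remoteStatus);
    }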
13. Netco: Workload-aware Cache policies [SOCC ’17]
Learn characteristics of recurring jobs
Prefetch datasets when possible
Cache data for the jobs that benefit from it the most
In today’s age of Big Data, an organization’s data typically lives across multiple clusters; this is the norm, not the exception. The reasons vary: regulatory or compliance restrictions, separate clusters for production and test or research workloads, or different divisions owning and operating their own clusters.
With the cloud, this becomes even more prevalent because of storage-compute disaggregation. Most cloud providers separate the compute layer from their primary storage offerings because of the benefits such an architecture brings. This, along with the advent of serverless computing, requires applications to access data that is not co-located with, or managed by, the same cluster.
For this, we introduce a new storage type called Provided which will be a peer to existing storage types. The Provided storage type is used to refer to data in the remote store.
-> So, Datanodes can now support 4 kinds of storage types.
-> Data in the remote store is mapped to HDFS blocks on provided storage.
So, in HDFS today, the NN is partitioned into a namespace (FSNamesystem), which maps files to sequences of block IDs, and the BlockManager, which is responsible for block lifecycle management and for maintaining the locations of the blocks of each file. In this example, the file /a/foo is mapped to blocks with IDs b_i to b_j. Each block ID is mapped to a list of replicas resident on storages attached to DataNodes; for example, block b_i is mapped to storages s1, s2, and s3.
-> As HDFS understands blocks, we use a similar mapping for files in provided storage. So, a file /remote/bar is mapped to blocks b_k to b_l, and each of these blocks is mapped to a provided storage.
-> However, this is not sufficient to locate the data in the remote store. We need some mapping between these blocks and how the data is laid out in the remote store. For this, every block in “provided” storage is mapped to an alias. An alias is simply a tuple: a reference, which is something resolvable in the namespace of the remote store, and a nonce to verify that the reference still locates the data matching that block.
For example, if the remote store is another FileSystem, the reference may be a (URI, offset, length) triple, and the nonce can be a GUID like an inode or file ID. If the remote store is a blob store like S3, the REF can be a (blob name, offset, length) triple, and the nonce can be an ETag.
-> We also maintain an AliasMap, which contains the mapping between block IDs and their aliases. This can live in the NN or in an external KV store.
-> Finally, we have provided volumes on DataNodes, which are used to read and write data from the external store. The provided volume essentially implements a client capable of talking to the external store.
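Assuming the remote store is exposed through org.apache.hadoop.fs.FileSystem, a provided volume's read path might look roughly like this; the (URI, offset, length) fields mirror the hypothetical reference described above.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ProvidedVolumeRead {
        /** Fetch the bytes of one PROVIDED block from the remote store. */
        static byte[] readBlock(URI file, long offset, int length, Configuration conf)
                throws Exception {
            byte[] block = new byte[length];
            FileSystem remote = FileSystem.get(file, conf); // e.g. an s3a:// filesystem
            try (FSDataInputStream in = remote.open(new Path(file))) {
                in.readFully(offset, block); // positioned read of exactly one block
            }
            return block;
        }
    }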
Summary: the AliasMap helps us map HDFS metadata to metadata on the remote store, and the PROVIDED storage type helps HDFS understand that the data is actually remote.
There are a few points worth calling out here.
* First, this is a relatively small change to HDFS. The only client-visible change is a new storage type. As a user, this is simpler than coordinating copy jobs. In our cloud example, all the cluster’s data is immediately available once it’s in the namespace, even if the replication policy hasn’t prefetched data onto local media.
* Second, particularly for read-only mounts, this is a narrow API to implement. For cloud backup scenarios, where the NN is the only writer to the namespace, we only need the block-to-object-ID map and NN metadata to mount a prefix/snapshot of the cluster.
* Third, because the client reads through the DN, the DN can cache a copy of the block on read. Notably, the NN can direct the client to any DN that should cache a copy on read, opening up interesting combinations of placement policies and read-through caching. That DN isn’t necessarily the closest to the client, but it may follow another objective function or replication policy.
* Finally, in our example the cloud credentials are hidden from the client. S3/WAS both authenticate clients to containers using a single key. Because HDFS owns and protects the external store’s credentials, the client only accesses data permitted by HDFS. Generally, we can use HDFS features that aren’t directly supported by the backing store, as long as we can define the mapping.