7. Cluster Topology
[Diagram: Hadoop/Spark cluster running on virtual machines (node1, node2, … nodeN).
• Each node: OS, Hadoop binaries and HDFS on a SAN block-storage volume (iSCSI/FC, XFS/Ext4)
• Data lake files (Spark, Hive) on NAS/object storage over NFS or S3 (S3A, HTTP only)
• HBase on the same NFS mount point on all nodes]
9. Hadoop S3A Library
Hadoop DFS protocol
• The communication protocol between the NameNode (NN), DataNodes (DN) and clients
• Default implementation: HDFS
• Other implementations: S3A, Azure, local FS, etc.
Hadoop S3A library
• Hadoop DFS protocol implementation for S3-compatible storage such as Amazon S3 and Pure Storage FlashBlade.
• Enables the Hadoop ecosystem (Spark, Hive, MR, etc.) to store and process data in S3 object storage (configuration sketch below).
• Several years in production; heavily used in the cloud.
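A minimal sketch, assuming a Spark shell where sc is the usual SparkContext; the endpoint, bucket and credentials are placeholders, and in a real deployment these keys typically live in core-site.xml:

// Point S3A at an S3-compatible endpoint (values are hypothetical).
sc.hadoopConfiguration.set("fs.s3a.endpoint", "https://s3.flashblade.example.com")
sc.hadoopConfiguration.set("fs.s3a.access.key", "ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "SECRET_KEY")
sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
val tweets = sc.textFile("s3a://deephub/tweets/")   // reads now go through the S3A implementation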
11. THE ALL-FLASH DATA HUB FOR MODERN ANALYTICS
• Density: 10+ PB per rack
• File & object converged
• Big + fast: 75 GB/s bandwidth, 7.5M+ NFS ops
• Simple elastic scale: just add a blade
12. Hadoop Ecosystem and S3
How does it work?
• Hadoop ecosystem (Spark, Hive, MR, etc.) uses HDFS client internally.
• Spark executor -> HDFS client -> storage
• Hive on Tez container -> HDFS client -> storage
• The HDFS client speaks the Hadoop DFS protocol.
• The client automatically chooses the proper implementation based on the URI scheme (see the sketch after this list).
• /user/joe/data -> HDFS
• file:///user/joe/data -> local FS (including NFS)
• s3a://user/joe/data -> S3A
• Exception: HBase (details covered later)
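A sketch of how the scheme selects the implementation; the paths and bucket are hypothetical, and s3a:// resolution assumes hadoop-aws is on the classpath:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val conf = new Configuration()   // picks up core-site.xml if present
// The URI scheme (none, file://, s3a://) determines which FileSystem class is returned.
val hdfs  = new Path("/user/joe/data").getFileSystem(conf)                  // DistributedFileSystem (HDFS)
val local = new Path("file:///mnt/nfs/user/joe/data").getFileSystem(conf)   // LocalFileSystem (incl. NFS mounts)
val s3a   = new Path("s3a://deephub/user/joe/data").getFileSystem(conf)     // S3AFileSystem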
13. Spark on S3
Spark submit
• Changes: use s3a:// paths as input/output.
• Temporary data I/O: YARN -> HDFS
• User data I/O: Spark executors -> S3A -> S3
[Diagram: spark-submit -> YARN RM -> YARN container; temporary data goes to HDFS, user data goes to S3.]
val flatJSON = sc.textFile("s3a://deephub/tweets/")
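A sketch extending that snippet so user data is written back through S3A as well; the output path is hypothetical and sc is the usual SparkContext:

val flatJSON = sc.textFile("s3a://deephub/tweets/")                       // input read via S3A
val counts   = flatJSON.flatMap(_.split("\\s+")).map(w => (w, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("s3a://deephub/tweets-wordcount/")                  // output written via S3A, not HDFS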
14. Hadoop on S3 Challenges
Consistency model
• Amazon S3: eventual consistency, use S3Guard
• S3-compatible storage (e.g. FlashBlade S3) supports strong consistency
Slow “rename”
• “rename” is critical in HDFS to support atomic commits, like Linux “mv”
• S3 does not support “rename” natively.
• S3A simulates “rename” as LIST + COPY + DELETE (conceptual sketch below)
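A conceptual sketch (not the actual S3A internals) of why a simulated rename is expensive: every object under the source prefix is listed, its bytes copied, and only then deleted. The helper name and paths are hypothetical:

import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// LIST + COPY + DELETE instead of an atomic metadata move.
def simulatedRename(fs: FileSystem, src: Path, dst: Path): Boolean = {
  val children = fs.listStatus(src)                        // LIST everything under the source prefix
  children.foreach { st =>
    FileUtil.copy(fs, st.getPath, fs, new Path(dst, st.getPath.getName),
      false /* deleteSource */, fs.getConf)                // COPY each entry's bytes to the destination
  }
  fs.delete(src, true)                                     // DELETE the old prefix only at the end
}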
16. Hadoop on S3 Updates
Make S3A cloud native
• Hundreds of JIRAs
• Robustness, scale and performance
• S3Guard
• Zero-rename S3A committers.
Use Hadoop 3.1 or later
17. S3A Committers
Originally, S3A used FileOutputCommitter, which relies on “rename”
Zero-rename, cloud-native S3A Committers
• Staging committer: directory & partitioned
• Magic committer
18. S3A Committers
Staging Committer
• Does not require strong consistency.
• Proven to work at Netflix.
• Requires large local FS space.
Magic committer
• Requires strong consistency: S3Guard on AWS, FlashBlade S3, etc.
• Faster; uses less local FS space.
• Less stable/tested than the staging committer.
Common Key Points
• Fast, no “rename”
• Both leverage S3 transactional multipart upload (configuration sketch below)
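A sketch of switching Spark output on s3a:// paths to an S3A committer; the property names come from the Hadoop S3A committer documentation, but the right committer ("directory", "partitioned" or "magic") depends on your storage and Hadoop build:

// Select the directory staging committer for s3a:// output (assumed choice).
sc.hadoopConfiguration.set("fs.s3a.committer.name", "directory")
sc.hadoopConfiguration.set(
  "mapreduce.outputcommitter.factory.scheme.s3a",
  "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")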
23. HBase & HDFS
What does HBase want from HDFS?
• HFile & WAL durability
• Scale & performance
• Mostly latency, but also throughput
What HBase does NOT want from HDFS?
• Noisy neighbors
• Co-locating compute: YARN, Spark
• Co-locating data: Hive data warehouse
• Complexity: operations & upgrades
24. HBase on NFS
How does it work?
• Use NFS as the HBase root and staging directory.
• Same NFS mount point on all RegionServer nodes
• Point HBase to store data in that mount point (configuration sketch below)
• Leverage the HDFS local FS implementation (file:///mnt/hbase)
No application changes.
• Clients only see HBase tables.
[Diagram: clients use the HBase Table API against the HMaster and RegionServers; RegionServers store their data on NFS.]
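A sketch of the relevant settings, written in Scala here for illustration; in a real cluster these properties belong in each server's hbase-site.xml, and the mount path is the example from the slide:

import org.apache.hadoop.hbase.HBaseConfiguration

val conf = HBaseConfiguration.create()
// Store HFiles and WALs on the shared NFS mount via the local FS implementation.
conf.set("hbase.rootdir", "file:///mnt/hbase")
// Assumption for HBase 2.x: local FS does not advertise hflush/hsync, so capability
// enforcement must be relaxed; durability comes from the NFS back end (see WAL slide).
conf.set("hbase.unsafe.stream.capability.enforce", "false")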
25. HFile Durability
• HDFS uses 3x replication to protect HFile.
• HFile replication is not necessary on enterprise NFS
• Erasure coding or RAID-like data protection within/across storage arrays
• Amazon EFS stores data within and across multiple AZs
• FlashBlade supports N+2 data durability and high availability
26. HBase WAL Durability
• HDFS uses the “hflush/hsync” API to ensure the WAL is safely flushed to multiple data nodes before acknowledging clients.
• Not necessary in enterprise NFS
• FlashBlade acknowledges writes after data is persisted in NVRAM on 3 blades.
27. NFS Performance for HBase
• Depends on the NFS implementation.
• NFS is generally good for random access.
• Also check throughput.
• Flash storage is ideal for HBase.
• All-flash scale-out NFS such as Pure Storage FlashBlade
28. HBase PE on FlashBlade NFS
[Charts: HBase PerformanceEvaluation (PE) on FlashBlade NFS, 7 blades (sample PE invocations below).
• Random writes: 1 RegionServer, 1M rows/client, 10 clients. Memstore flush storm?
• Random reads: 1 RegionServer, 100K rows/client, 20 clients. Block cache affects the result.
• Latency seen by storage is stable.]
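For reference, a sketch of the kind of HBase PerformanceEvaluation (PE) invocations behind these charts; the exact options used in the benchmark are not given in the slides, so the flags below are assumptions:

hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --rows=1000000 randomWrite 10
hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --rows=100000 randomRead 20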
29. HBase PE on Amazon EFS
[Charts: HBase PE on Amazon EFS with 1024 MB/s provisioned throughput.
• Random writes: 1 RegionServer, 1M rows/client, 10 clients
• Random reads: 1 RegionServer, 100K rows/client, 20 clients
• Region too busy; memstore flush is slow]
30. Key Takeaways
• Storage options for cloud-era Hadoop/Spark
• Hive & Spark on S3 with cloud-native S3A Committers
• HBase on enterprise NFS
• Available in cloud and on premise (Pure Storage FlashBlade)
• Additional benefits: always-on compression, encryption, etc.
• Proven to work
• Simple, reliable, performant
• No more HDFS operations
• Virtualize your Hadoop/Spark cluster