Hive/Spark/HBase on S3 & NFS
– No More HDFS Operation
(Yifeng Jiang)
@uprush
March 14th, 2019
Yifeng Jiang
• APJ Solution Architect, Data Science @ Pure Storage
• Big data, machine learning, cloud, PaaS, web systems
Prior to Pure
• Hadooper since 2009
• HBase book author
• Software engineer, PaaS, Cloud
Agenda
• Separate compute and storage
• Hive & Spark on S3 with S3A Committers
• HBase on NFS
• Demo, benchmark and case study
Hadoop Common Pain Points
• Hardware complexity: racks of servers, cables, power
• Availability SLAs / HDFS operation
• Unbalanced CPU and storage demands
• Software complexity: 20+ components
• Performance tuning
Separate Compute & Storage
Separating compute and storage addresses each of these pain points:
• Hardware complexity: racks of servers, cables, power
• Availability SLAs / HDFS operation
• Unbalanced CPU and storage demands
• Software complexity: 20+ components
• Performance tuning
[Diagram: compute on virtual machines, storage on S3/NFS, each scaled independently]
Modernizing Data Analytics
[Diagram: data ingest into Search, ETL, Query/Store, and Deep Learning workloads, all sharing an NFS and S3 storage backend]
Cluster Topology
Hadoop/Spark cluster on virtual machines (node1, node2, … nodeN), backed by three types of storage:
• SAN block storage over iSCSI/FC (XFS/Ext4): OS, Hadoop binaries, HDFS on SAN volumes
• NAS/object storage over S3 (S3A, HTTP only): data lake files for Spark and Hive
• NAS storage over NFS (same mount point on all nodes): HBase
Hive/Spark on S3
Hadoop S3A Library
Hadoop DFS protocol
• The communication protocol between the NameNode, DataNodes and clients
• Default implementation: HDFS
• Other implementations: S3A, Azure, local FS, etc.
Hadoop S3A library
• Hadoop DFS protocol implementation for S3-compatible storage such as Amazon S3 and Pure Storage FlashBlade.
• Enables the Hadoop ecosystem (Spark, Hive, MR, etc.) to store and process data in S3 object storage.
• Several years in production, heavily used in the cloud. (A minimal configuration sketch follows.)
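A minimal spark-shell sketch of pointing S3A at an S3-compatible endpoint. The endpoint, credentials and bucket names are placeholders; the fs.s3a.* properties are standard hadoop-aws settings (hadoop-aws must be on the classpath).

// Configure S3A for an S3-compatible endpoint (all values are placeholders).
// In production these usually live in core-site.xml.
val hc = sc.hadoopConfiguration
hc.set("fs.s3a.endpoint", "https://s3.example.internal")  // FlashBlade / on-prem S3 endpoint
hc.set("fs.s3a.access.key", "<access-key>")
hc.set("fs.s3a.secret.key", "<secret-key>")
hc.set("fs.s3a.path.style.access", "true")                // often needed for non-AWS S3

// Any Hadoop-ecosystem component can now read and write s3a:// paths.
val df = spark.read.json("s3a://datalake/tweets/")
df.write.mode("overwrite").parquet("s3a://datalake/tweets-parquet/")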
Hadoop S3A Demo
The All-Flash Data Hub for Modern Analytics
• Big + fast: 75 GB/s bandwidth, 7.5M+ NFS ops
• Density: 10+ PBs per rack
• File & object converged
• Simple elastic scale: just add a blade
Hadoop Ecosystem and S3
How it works
• The Hadoop ecosystem (Spark, Hive, MR, etc.) uses the HDFS client internally.
• Spark executor -> HDFS client -> storage
• Hive on Tez container -> HDFS client -> storage
• The HDFS client speaks the Hadoop DFS protocol.
• The client automatically chooses the proper implementation based on the URI scheme (see the sketch below):
• /user/joe/data -> HDFS
• file:///user/joe/data -> local FS (including NFS)
• s3a://user/joe/data -> S3A
• Exception: HBase (details covered later)
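A quick illustration of the scheme-based dispatch, using the public Hadoop FileSystem API; resolving "s3a" assumes hadoop-aws is on the classpath.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// Resolve the FileSystem implementation class by URI scheme only.
val conf = new Configuration()
for (scheme <- Seq("hdfs", "file", "s3a")) {
  // Prints e.g. DistributedFileSystem, LocalFileSystem, S3AFileSystem
  println(s"$scheme -> " + FileSystem.getFileSystemClass(scheme, conf).getSimpleName)
}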
Spark on S3
Spark submit
• Changes: use s3a:// as input/output.
• Temporary data I/O: YARN -> HDFS
• User data I/O: Spark executors -> S3A -> S3
[Diagram: spark-submit to the YARN RM; YARN containers write temporary data to HDFS and user data to S3 via S3A]
val flatJSON = sc.textFile("s3a://deephub/tweets/")
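Extending the snippet above into a small spark-shell sketch that both reads and writes via s3a:// URIs; the output path and DataFrame example are illustrative, the deephub bucket is from the slide.

// Input and output are plain s3a:// URIs; no other application change is needed.
val flatJSON = sc.textFile("s3a://deephub/tweets/")
println(s"lines: ${flatJSON.count()}")

// DataFrame I/O works the same way (output path is illustrative).
val tweets = spark.read.json("s3a://deephub/tweets/")
tweets.write.mode("overwrite").parquet("s3a://deephub/tweets-parquet/")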
Hadoop on S3 Challenges
Consistency model
• Amazon S3: eventual consistency; use S3Guard
• S3-compatible storage (e.g. FlashBlade S3) supports strong consistency
Slow “rename”
• “rename” is critical in HDFS to support atomic commits, like Linux “mv”
• S3 does not support “rename” natively.
• S3A simulates “rename” as LIST + COPY + DELETE (see the sketch below)
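Conceptually, the simulated rename looks like the sketch below (AWS SDK for Java v1; bucket and prefixes are illustrative, pagination is omitted). This is not S3A's actual code, but it shows why "rename" costs a copy and delete of every object instead of a single metadata operation.

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import scala.collection.JavaConverters._

// Sketch of a LIST + COPY + DELETE "rename" of s3://datalake/tmp/job-output/ to s3://datalake/final/
val s3 = AmazonS3ClientBuilder.defaultClient()
val bucket = "datalake"                          // illustrative
val (src, dst) = ("tmp/job-output/", "final/")   // illustrative prefixes

val objects = s3.listObjects(bucket, src).getObjectSummaries.asScala   // LIST (first 1000 keys only)
for (obj <- objects) {
  val newKey = dst + obj.getKey.stripPrefix(src)
  s3.copyObject(bucket, obj.getKey, bucket, newKey)                    // COPY, per object
  s3.deleteObject(bucket, obj.getKey)                                  // DELETE, per object
}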
Slow S3 “Rename”
Image source: https://stackoverflow.com/questions/42822483/extremely-slow-s3-write-times-from-emr-spark/42835927
Hadoop on S3 Updates
Make S3A cloud native
• Hundreds of JIRAs
• Robustness, scale and performance
• S3Guard
• Zero-rename S3A committers
Use Hadoop 3.1 or later (dependency sketch below).
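If you build Spark/Hadoop jobs yourself, the S3A connector ships in the hadoop-aws module; a minimal sbt sketch, with versions as examples for the Hadoop 3.1 line (match your cluster's release):

// build.sbt (versions are illustrative)
libraryDependencies ++= Seq(
  // S3A filesystem and the zero-rename committers; pulls in aws-java-sdk-bundle transitively
  "org.apache.hadoop" % "hadoop-aws" % "3.1.2",
  "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided"
)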
S3A Committers
Originally, S3A used the FileOutputCommitter, which relies on “rename”.
Zero-rename, cloud-native S3A committers:
• Staging committer: directory & partitioned
• Magic committer
S3A Committers
Staging committer
• Does not require strong consistency.
• Proven to work at Netflix.
• Requires large local FS space.
Magic committer
• Requires strong consistency: S3Guard on AWS, FlashBlade S3, etc.
• Faster; uses less local FS space.
• Less stable/tested than the staging committer.
Common key points
• Fast, no “rename”
• Both leverage S3's transactional multipart upload (see the configuration sketch below)
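A minimal sketch of enabling the committers from spark-shell, assuming Hadoop 3.1+; the property names are from the S3A committer documentation, the values are examples.

// Enable the zero-rename directory (staging) committer for s3a:// output.
val hc = sc.hadoopConfiguration
hc.set("fs.s3a.committer.name", "directory")               // or "partitioned" / "magic"
hc.set("fs.s3a.committer.staging.conflict-mode", "replace")
// For the magic committer instead (needs a consistent store: S3Guard or FlashBlade S3):
//   hc.set("fs.s3a.committer.name", "magic")
//   hc.set("fs.s3a.committer.magic.enabled", "true")
// Spark SQL/DataFrame writes additionally need the cloud committer bindings
// (spark-hadoop-cloud module on the classpath), e.g.:
//   spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
//   spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter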
S3A Committer Demo
S3A Committer Benchmark
• 1TB MR random text writes (Genwords)
• FlashBlade S3, 15 blades
• Plenty of compute nodes
[Chart: 1TB Genwords benchmark, elapsed time in seconds by output committer (file, directory, magic); y-axis 0–1200 s]
Teragen Benchmark
[Chart: Teragen results. File committer: slow, reads from S3. Staging committer: fast, reads from local FS. Magic committer: faster, less local FS read.]
HBase on NFS
The peace of a volcano
HBase & HDFS
What does HBase want from HDFS?
• HFile & WAL durability
• Scale & performance
• Mostly latency, but also throughput
What does HBase NOT want from HDFS?
• Noisy neighbors
• Co-locating compute: YARN, Spark
• Co-locating data: Hive data warehouse
• Complexity: operation & upgrade
HBase on NFS
How it works
• Use NFS as the HBase root and staging directory.
• Same NFS mount point on all RegionServer nodes
• Point HBase to store data in that mount point
• Leverage the HDFS local FS implementation (file:///mnt/hbase)
No change to applications.
• Clients only see HBase tables (see the client sketch below).
[Diagram: clients use the Table API against the HMaster and RegionServers; RegionServers store data on NFS]
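Because clients only see HBase tables, application code is identical whether the RegionServers store HFiles on HDFS or on an NFS mount. A minimal client sketch; the table name, column family and ZooKeeper quorum are illustrative.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

// Standard HBase Table API; nothing here depends on the storage backend.
// (On the server side, hbase.rootdir would point at the NFS mount, e.g. file:///mnt/hbase.)
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3")          // illustrative quorum

val conn  = ConnectionFactory.createConnection(conf)
val table = conn.getTable(TableName.valueOf("usertable"))  // illustrative table

table.put(new Put(Bytes.toBytes("row1"))
  .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value")))
val result = table.get(new Get(Bytes.toBytes("row1")))
println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))))

table.close(); conn.close()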
HFile Durability
• HDFS uses 3x replication to protect HFiles.
• HFile replication is not necessary on enterprise NFS
• Erasure coding or RAID-like data protection within/across storage arrays
• Amazon EFS stores data within and across multiple AZs
• FlashBlade supports N+2 data durability and high availability
HBase WAL Durability
• HDFS uses the “hflush/hsync” API to ensure the WAL is safely flushed to multiple data nodes before acknowledging clients.
• Not necessary on enterprise NFS
• FlashBlade acknowledges writes after data is persisted in NVRAM on 3 blades.
NFS Performance for HBase
• Depends on the NFS implementation.
• NFS is generally good for random access.
• Also check throughput.
• Flash storage is ideal for HBase.
• All-flash scale-out NFS such as Pure Storage FlashBlade
HBase PE on FlashBlade NFS
[Charts: HBase PerformanceEvaluation latency on FlashBlade NFS (7 blades), 1 RegionServer]
• Random writes: 1M rows/client, 10 clients
• Random reads: 100K rows/client, 20 clients
• Observations: possible memstore flush storm; block cache affects the read result; latency seen by storage is stable
HBase PE on Amazon EFS
[Charts: HBase PerformanceEvaluation latency on Amazon EFS (1024 MB/s provisioned throughput), 1 RegionServer]
• Random writes: 1M rows/client, 10 clients
• Random reads: 100K rows/client, 20 clients
• Observation: region too busy; memstore flush is slow
Key Takeaways
• Storage options for cloud-era Hadoop/Spark
• Hive & Spark on S3 with cloud-native S3A committers
• HBase on enterprise NFS
• Available in the cloud and on-premises (Pure Storage FlashBlade)
• Additional benefits: always-on compression, encryption, etc.
• Proven to work
• Simple, reliable, performant
• No more HDFS operation
• Virtualize your Hadoop/Spark cluster
