Elastify Cloud-Native Spark Application with Persistent Memory
Saisai Shao (Tencent), Peiyu Zhuang (MemVerge)
About Us
Saisai Shao — Expert Software Engineer at Tencent Cloud; Apache Spark Committer and Apache Livy (incubating) PPMC member
Peiyu Zhuang — Software Engineer at MemVerge
Tencent Cloud
[Slide: Tencent Cloud at scale, organized by Data / Tech / Model / Scenario — headline figures include 12,000+ nodes in its largest big data cluster; 100PB+, 500TB, and 500PB+ of data; 3+ trillion records per day; 3.5+ trillion computations; 20+ billion ads per day]
Tencent Cloud Big Data and AI
[Slide: Tencent Cloud's product stack, from AI services down to solutions]
• AI Services: Smart Search, Face/Human Identification, GrandEye, Intelligent Recommendation, Smart Conference, Live Broadcasting, AI Conversation (XiaoWei), Intelligent Customer Service
• AI Platform Services: TI-ML, Natural Language Processing, Image Recognition, Voice Recognition
• AI Foundations
• Big Data Services: Elastic MapReduce, Elasticsearch Service, SaaS BI, Stream Compute Service, Snova Data Warehouse, Sparkling Data Warehouse Suite, RayData Data Visualization
• Solutions
About MemVerge
• Up to 768TB total memory per cluster
• Up to 72GB/s read bandwidth per node
• < 1μs access latency
MemVerge is a startup based in San Jose, founded in 2017. We are delivering the world's first Memory-Converged Infrastructure (MCI) system, called Distributed Memory Objects (DMO).
Back to the Days of MapReduce
How did we design data applications then, when networks ran at ~1Gbps?
• Network bandwidth was scarce relative to disk throughput
• Move code rather than moving data
• Fast small memory vs. slow large disk
• Optimize for sequential R/W
The Trends of HW in the DC
[Charts: "Enterprise Byte Shipments: HDD and SSD" and "Datacenter Bandwidth Migration"]
* https://www.cisco.com/c/dam/en/us/products/collateral/switches/nexus-9000-series-switches/white-paper-c11-734328.pdf
* https://www.backblaze.com/blog/hdd-vs-ssd-in-data-centers/
Modern DC Architecture
What has changed in the modern DC?
• Data and computation are separated, with a high-speed (25~100Gbps) network between compute nodes and storage boxes
• Tiered storage for hot and cold data
[Diagram: compute nodes and accelerators connected to storage boxes over a 25~100Gbps network]
Reimagining the DC Memory and Storage Hierarchy
[Diagram: the memory/storage pyramid — DRAM (memory, hot), SSD (warm), HDD/tape (storage, cold) — annotated with the goals: improving memory capacity, improving SSD performance, efficient and scalable storage]
Embrace the New Architecture
Intel® Optane™ DC Persistent Memory
• Low latency and high throughput, like DRAM
  – Latency: 200~400ns
  – Bandwidth: up to 8GB/s read, up to 3GB/s write
• High density and non-volatility, like NAND
  – Up to 6TB per server
• Enables memory-speed storage systems
How to Use DCPMM
[Diagram: two deployment models — DCPMM installed in each compute node versus a DCPMM-centric architecture reached over RDMA/DPDK]
MemVerge Elastic Spark Solution
[Diagram: Spark's RDD caching and storage, shuffle data, and data source connected through an Ethernet switch]
A PMEM Centric Data Platform
[Diagram: MemVerge Spark Adaptors on top of MemVerge DMO, a cluster-wide shared persistent memory layer spanning nodes 1..N, each equipped with DRAM and PMEM]
Spark Integration
MemVerge DMO integrates with Spark at three points (a sketch follows below):
• RDD caching and storage — Spark with additional RDD persist APIs
• Shuffle data — a new generic shuffle manager
• Data source — Hadoop-compatible storage APIs
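As an illustration only, a minimal sketch of a job touching all three integration points. The `dmo://` URI scheme and the `SplashShuffleManager` class name are assumptions for this sketch, not confirmed API — the real names come from the MemVerge adaptors and the Splash project.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object DmoIntegrationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dmo-integration-sketch")
      // Shuffle data: route shuffle through a generic shuffle manager.
      // Class name is an assumption borrowed from the Splash project.
      .config("spark.shuffle.manager", "org.apache.spark.shuffle.SplashShuffleManager")
      .getOrCreate()

    // Data source: with a Hadoop-compatible FileSystem registered for DMO,
    // the normal read/write APIs work unchanged ("dmo://" is a hypothetical scheme).
    val df = spark.read.parquet("dmo://warehouse/events")

    // RDD caching and storage: the standard persist API; the slide's
    // "additional RDD persist APIs" for PMEM would slot in alongside this.
    val cached = df.rdd.persist(StorageLevel.MEMORY_AND_DISK)
    println(cached.count())

    spark.stop()
  }
}
```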
DCPMM Equipped Shuffle Service
Shuffle & Block Manager
• The block manager persists data to memory or disk on the local node.
• Losing an executor means recomputing the shuffle tasks whose output it held.
• The storage and network implementations are coupled to the shuffle implementation.
[Diagram: inside a compute node, the Spark executor's shuffle manager persists and retrieves data through the block manager's memory store and disk store, and shuffle output lands on local disk]
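For reference, the stock setup this slide describes pins shuffle output to executor-local disks and, for dynamic allocation, to the NodeManager-hosted external shuffle service. A minimal sketch using standard Spark properties — the directory paths are placeholders:

```scala
import org.apache.spark.SparkConf

// Stock Spark: shuffle files land on each node's local disks, and dynamic
// allocation leans on the NodeManager-hosted external shuffle service.
// Standard Spark properties; the directory paths are placeholders.
val localShuffleConf = new SparkConf()
  .set("spark.local.dir", "/data1/spark,/data2/spark") // shuffle output and spill directories
  .set("spark.shuffle.service.enabled", "true")        // external shuffle service on the NM
  .set("spark.dynamicAllocation.enabled", "true")      // requires the shuffle service above
```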
The Problems of Current Shuffle
• Poor elasticity
  – A node failure leads to shuffle data loss
• Heavy overhead on the NodeManager
  – Co-locating the shuffle service with the NM imposes heavy overhead under heavy workloads
• Unsuitable for cloud environments
  – A data/computation separation architecture gains nothing from local shuffle
• The community is also working on these problems:
  – SPARK-25299: Use remote storage for persisting shuffle data
  – SPARK-26268: Decouple shuffle data from Spark deployment
MemVerge Splash Shuffle Manager
• A flexible shuffle manager
  – Supports user-defined storage backends and network transports for shuffle data
• Open source
  – https://github.com/MemVerge/splash
• Spark JIRA: SPARK-25299
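A minimal sketch of opting a job into Splash. The property and class names below follow the Splash README of this era — treat them as assumptions and verify against the repo before use:

```scala
import org.apache.spark.SparkConf

// Swap Spark's shuffle manager for Splash and pick a storage plugin.
// Names per the Splash README of this era; verify against
// https://github.com/MemVerge/splash for the current ones.
val splashConf = new SparkConf()
  .set("spark.shuffle.manager", "org.apache.spark.shuffle.SplashShuffleManager")
  .set("spark.shuffle.splash.storageFactory", "com.memverge.splash.shared.SharedFSFactory")
```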
Splash Shuffle Manager
• Create a new shuffle manager that implements Spark's shuffle manager interface (abridged below)
• Extract the storage and network implementations into a storage plugin interface
• Apply different plugins for different storage & network combinations
• Separate storage and compute
• Tolerate node failures
• Support dynamic allocation
[Diagram: Splash inside Executor 1 and Executor 2 on Worker 1 and Worker 2, each writing and reading shuffle through a storage plugin backed by a shared storage system (NFS, local FS, HDFS, S3, DMO, …)]
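For context, the contract Splash implements — Spark 2.x's pluggable shuffle SPI, abridged from Spark's source (visibility modifiers omitted; the signatures changed again in Spark 3.x):

```scala
import org.apache.spark.{ShuffleDependency, TaskContext}
import org.apache.spark.shuffle.{ShuffleBlockResolver, ShuffleHandle, ShuffleReader, ShuffleWriter}

// A shuffle manager registers each shuffle, hands writers to map tasks and
// readers to reduce tasks, and resolves where shuffle blocks physically live.
trait ShuffleManager {
  def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle

  def getWriter[K, V](
      handle: ShuffleHandle,
      mapId: Int,
      context: TaskContext): ShuffleWriter[K, V]

  def getReader[K, C](
      handle: ShuffleHandle,
      startPartition: Int,
      endPartition: Int,
      context: TaskContext): ShuffleReader[K, C]

  def unregisterShuffle(shuffleId: Int): Boolean
  def shuffleBlockResolver: ShuffleBlockResolver
  def stop(): Unit
}
```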
Persist Shuffle Data to PMEM
• Distributed Memory Objects (DMO) is a distributed file system built on PMEM.
• The storage plugin lets us persist shuffle data into the DMO system, a separate storage cluster.
• Using PMEM and fast network technologies (RDMA or DPDK) in the storage cluster speeds up the shuffle.
[Diagram: the Splash shuffle manager implements the shuffle manager interface; its DMO plugin implements the storage plugin interface on top of the DMO system and persistent memory]
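To make the plugin idea concrete, a hypothetical sketch of such a storage plugin — this is not Splash's actual interface (that lives in the Splash repo), just the shape of the abstraction: shuffle files become named streams on a swappable backend.

```scala
import java.io.{InputStream, OutputStream}
import java.nio.file.{Files, Path}

// Hypothetical storage-plugin abstraction, for illustration only. The shuffle
// manager asks the plugin for streams instead of touching local disk, so a
// DMO-, HDFS-, or S3-backed implementation can be swapped in transparently.
trait ShuffleStoragePlugin {
  def writeShuffleData(shuffleId: Int, mapId: Int): OutputStream
  def readShuffleData(shuffleId: Int, mapId: Int): InputStream
  def deleteShuffle(shuffleId: Int): Unit
}

// A trivial local-filesystem backend sketch (paths are placeholders).
class LocalFsPlugin(root: Path) extends ShuffleStoragePlugin {
  private def dataFile(shuffleId: Int, mapId: Int): Path =
    root.resolve(s"shuffle_${shuffleId}_${mapId}.data")

  def writeShuffleData(shuffleId: Int, mapId: Int): OutputStream =
    Files.newOutputStream(dataFile(shuffleId, mapId))

  def readShuffleData(shuffleId: Int, mapId: Int): InputStream =
    Files.newInputStream(dataFile(shuffleId, mapId))

  def deleteShuffle(shuffleId: Int): Unit = {
    // Sketch only: a real plugin would enumerate and remove every file
    // belonging to this shuffle on the backend.
  }
}
```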
Benchmark Settings
Common
• 4 compute nodes
• 10GbE network
• Driver memory 4g
• Executor memory 6g
• Total cores 160
• Executor cores 4
Baseline
• 4 HDDs, 7200 RPM
DMO
• 2 storage nodes
• 400GB PMEM/node
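Written out as Spark properties, the common settings amount to the sketch below. 160 total cores at 4 cores per executor implies 40 executors — an inference, not a number stated on the slide.

```scala
import org.apache.spark.SparkConf

// The slide's common benchmark settings as Spark properties.
// spark.cores.max caps total cores in standalone/Mesos deployments.
val benchConf = new SparkConf()
  .set("spark.driver.memory", "4g")
  .set("spark.executor.memory", "6g")
  .set("spark.executor.cores", "4")
  .set("spark.cores.max", "160")
```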
TeraSort Results
TeraSort 400GB, 216G shuffle write — per-stage durations in minutes, reconstructed from the chart's data labels (series assignment follows the legend order):

                     Baseline   DMO with UDP   DMO with DPDK
Reduce stage (min)      9.2          6              6.5
Map stage (min)        22           11             10
Total (min)            31.2         17             16.5
TPC-DS Results
[Chart: TPC-DS at 1.2TB — duration in seconds (0–1800) per query, Baseline vs. DMO, covering queries 78, 4, 64, 24a, 24b, 80, 23a, 23b, 25, 17, 29, 11, 93, 74, 50, 16, 40]
TPC-DS Results, cont.
[Charts: TPC-DS queries 80, 4, 23, and 24 at 400GB, 800GB, and 1200GB scale — duration in seconds for Baseline vs. DMO, with the percentage improvement plotted on a secondary axis]
Benchmark in Cloud
• 4 Sparkling hosts:
  – 32 cores
  – 128GB DRAM
  – 50GB cloud system disk
  – 4 × 11TB SATA data disks
  – 10G network
• 3 DMO hosts:
  – 32 cores
  – 256GB DRAM (200GB PMEM)
  – 50GB cloud system disk
  – 10G network
• Spark conf (see the sketch below):
  – Driver memory 4G
  – Executor memory 30G
  – Executor memory overhead 2G
  – Executor cores 4
  – Executor instances 12
• TPC-DS
  – Data size 1TB
  – The 10 queries with the largest shuffle data
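The same configuration expressed as Spark properties (a sketch; `spark.executor.memoryOverhead` is the Spark 2.3+ name, while older YARN deployments used `spark.yarn.executor.memoryOverhead`):

```scala
import org.apache.spark.SparkConf

// The cloud benchmark configuration as Spark properties.
val cloudConf = new SparkConf()
  .set("spark.driver.memory", "4g")
  .set("spark.executor.memory", "30g")
  .set("spark.executor.memoryOverhead", "2g") // Spark 2.3+ property name
  .set("spark.executor.cores", "4")
  .set("spark.executor.instances", "12")
```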
TPC-DS Results in Cloud
[Chart: TPC-DS at 1TB in the cloud — duration in seconds (roughly 200–1100) for q78, q64, q23a, q24a, q23b, q80, q24b, q4, q25, q17, Baseline vs. DMO, with the percentage improvement (-10% to 20%) on a secondary axis]
Future Work
• Verify the solution in production environments
• Data-path performance tuning for the Splash shuffle manager
• Enable map-side merge in the Splash shuffle manager
• Support a Java NIO-style I/O interface in the Splash shuffle manager
• Land the Splash shuffle service in cloud environments
Summary
• PMEM will bring fundamental changes to ALL data centers and enable a data-driven future
• MemVerge and Tencent Cloud deliver better scalability and performance at a lower cost, and not just for Spark
  – AI, big data, banking, animation studios, gaming, IoT, etc.
  – Machine learning, analytics, and online systems
• Thank you, Intel, for supporting our work!
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
