Elastify Cloud-Native Spark Application with Persistent Memory
Saisai Shao (Tencent), Peiyu Zhuang (MemVerge)
About Us
Saisai Shao — Expert Software Engineer at Tencent Cloud; Apache Spark Committer and Apache Livy (incubating) PPMC member
Peiyu Zhuang — Software Engineer at MemVerge
Tencent Cloud
[Slide: Tencent Cloud at scale, organized by Data / Tech / Model / Scenario — headline figures include 12,000+ nodes in its largest big data cluster; 100PB+, 500TB, and 500PB+ of data; 3+ trillion records per day; 3.5+ trillion computations; 20+ billion ads per day]
Tencent Cloud Big Data and AI
[Slide: Tencent Cloud's product stack, from AI services down to solutions]
• AI Services: Smart Search, Face/Human Identification, GrandEye, Intelligent Recommendation, Smart Conference, Live Broadcasting, AI Conversation (XiaoWei), Intelligent Customer Service
• AI Platform Services: TI-ML, Natural Language Processing, Image Recognition, Voice Recognition
• AI Foundations
• Big Data Services: Elastic MapReduce, Elasticsearch Service, SaaS BI, Stream Compute Service, Snova Data Warehouse, Sparkling Data Warehouse Suite, RayData Data Visualization
• Solutions
About MemVerge
• Up to 768TB total memory per cluster
• Up to 72GB/s read bandwidth per node
• < 1μs access latency
MemVerge is a startup based in San Jose, founded in 2017. We are delivering the world's first Memory-Converged Infrastructure (MCI) system, called Distributed Memory Objects (DMO).
Back to the Days of MapReduce
How did we design data applications then, when networks ran at ~1Gbps?
• Network bandwidth was scarce relative to disk throughput
• Move code rather than moving data
• Fast small memory vs. slow large disk
• Optimize for sequential R/W
The Trends of HW in the DC
[Charts: "Enterprise Byte Shipments: HDD and SSD" and "Datacenter Bandwidth Migration"]
* https://www.cisco.com/c/dam/en/us/products/collateral/switches/nexus-9000-series-switches/white-paper-c11-734328.pdf
* https://www.backblaze.com/blog/hdd-vs-ssd-in-data-centers/
Modern DC Architecture
What has changed in the modern DC?
• Data and computation are separated, with a high-speed (25~100Gbps) network between compute nodes and storage boxes
• Tiered storage for hot and cold data
[Diagram: compute nodes and accelerators connected to storage boxes over a 25~100Gbps network]
Reimagining the DC Memory and Storage Hierarchy
[Diagram: the memory/storage pyramid — DRAM (memory, hot), SSD (warm), HDD/tape (storage, cold) — annotated with the goals: improving memory capacity, improving SSD performance, efficient and scalable storage]
Embrace the New Architecture
Intel® Optane™ DC Persistent Memory
• Low latency and high throughput, like DRAM
  – Latency: 200~400ns
  – Bandwidth: up to 8GB/s read, up to 3GB/s write
• High density and non-volatility, like NAND
  – Up to 6TB per server
• Enables memory-speed storage systems
How to Use DCPMM
[Diagram: two deployment models — DCPMM installed in each compute node versus a DCPMM-centric architecture reached over RDMA/DPDK]
MemVerge Elastic Spark Solution
[Diagram: Spark's RDD caching and storage, shuffle data, and data source connected through an Ethernet switch]
A PMEM Centric Data Platform
[Diagram: MemVerge Spark Adaptors on top of MemVerge DMO, a cluster-wide shared persistent memory layer spanning nodes 1..N, each equipped with DRAM and PMEM]
Spark Integration
MemVerge DMO integrates with Spark at three points (a sketch follows below):
• RDD caching and storage — Spark with additional RDD persist APIs
• Shuffle data — a new generic shuffle manager
• Data source — Hadoop-compatible storage APIs
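As an illustration only, a minimal sketch of a job touching all three integration points. The `dmo://` URI scheme and the `SplashShuffleManager` class name are assumptions for this sketch, not confirmed API — the real names come from the MemVerge adaptors and the Splash project.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object DmoIntegrationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dmo-integration-sketch")
      // Shuffle data: route shuffle through a generic shuffle manager.
      // Class name is an assumption borrowed from the Splash project.
      .config("spark.shuffle.manager", "org.apache.spark.shuffle.SplashShuffleManager")
      .getOrCreate()

    // Data source: with a Hadoop-compatible FileSystem registered for DMO,
    // the normal read/write APIs work unchanged ("dmo://" is a hypothetical scheme).
    val df = spark.read.parquet("dmo://warehouse/events")

    // RDD caching and storage: the standard persist API; the slide's
    // "additional RDD persist APIs" for PMEM would slot in alongside this.
    val cached = df.rdd.persist(StorageLevel.MEMORY_AND_DISK)
    println(cached.count())

    spark.stop()
  }
}
```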
DCPMM Equipped Shuffle Service
Shuffle & Block Manager
• The block manager persists data to memory or disk on the local node.
• Losing an executor means recomputing the shuffle tasks whose output it held.
• The storage and network implementations are coupled to the shuffle implementation.
[Diagram: inside a compute node, the Spark executor's shuffle manager persists and retrieves data through the block manager's memory store and disk store, and shuffle output lands on local disk]
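For reference, the stock setup this slide describes pins shuffle output to executor-local disks and, for dynamic allocation, to the NodeManager-hosted external shuffle service. A minimal sketch using standard Spark properties — the directory paths are placeholders:

```scala
import org.apache.spark.SparkConf

// Stock Spark: shuffle files land on each node's local disks, and dynamic
// allocation leans on the NodeManager-hosted external shuffle service.
// Standard Spark properties; the directory paths are placeholders.
val localShuffleConf = new SparkConf()
  .set("spark.local.dir", "/data1/spark,/data2/spark") // shuffle output and spill directories
  .set("spark.shuffle.service.enabled", "true")        // external shuffle service on the NM
  .set("spark.dynamicAllocation.enabled", "true")      // requires the shuffle service above
```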
The Problems of Current Shuffle
• Poor elasticity
  – A node failure leads to shuffle data loss
• Heavy overhead on the NodeManager
  – Co-locating the shuffle service with the NM imposes heavy overhead under heavy workloads
• Unsuitable for cloud environments
  – A data/computation separation architecture gains nothing from local shuffle
• The community is also working on these problems:
  – SPARK-25299: Use remote storage for persisting shuffle data
  – SPARK-26268: Decouple shuffle data from Spark deployment
MemVerge Splash Shuffle Manager
• A flexible shuffle manager
  – Supports user-defined storage backends and network transports for shuffle data
• Open source
  – https://github.com/MemVerge/splash
• Spark JIRA: SPARK-25299
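A minimal sketch of opting a job into Splash. The property and class names below follow the Splash README of this era — treat them as assumptions and verify against the repo before use:

```scala
import org.apache.spark.SparkConf

// Swap Spark's shuffle manager for Splash and pick a storage plugin.
// Names per the Splash README of this era; verify against
// https://github.com/MemVerge/splash for the current ones.
val splashConf = new SparkConf()
  .set("spark.shuffle.manager", "org.apache.spark.shuffle.SplashShuffleManager")
  .set("spark.shuffle.splash.storageFactory", "com.memverge.splash.shared.SharedFSFactory")
```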
Splash Shuffle Manager
• Create a new shuffle manager that implements Spark's shuffle manager interface (abridged below)
• Extract the storage and network implementations into a storage plugin interface
• Apply different plugins for different storage & network combinations
• Separate storage and compute
• Tolerate node failures
• Support dynamic allocation
[Diagram: Splash inside Executor 1 and Executor 2 on Worker 1 and Worker 2, each writing and reading shuffle through a storage plugin backed by a shared storage system (NFS, local FS, HDFS, S3, DMO, …)]
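For context, the contract Splash implements — Spark 2.x's pluggable shuffle SPI, abridged from Spark's source (visibility modifiers omitted; the signatures changed again in Spark 3.x):

```scala
import org.apache.spark.{ShuffleDependency, TaskContext}
import org.apache.spark.shuffle.{ShuffleBlockResolver, ShuffleHandle, ShuffleReader, ShuffleWriter}

// A shuffle manager registers each shuffle, hands writers to map tasks and
// readers to reduce tasks, and resolves where shuffle blocks physically live.
trait ShuffleManager {
  def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle

  def getWriter[K, V](
      handle: ShuffleHandle,
      mapId: Int,
      context: TaskContext): ShuffleWriter[K, V]

  def getReader[K, C](
      handle: ShuffleHandle,
      startPartition: Int,
      endPartition: Int,
      context: TaskContext): ShuffleReader[K, C]

  def unregisterShuffle(shuffleId: Int): Boolean
  def shuffleBlockResolver: ShuffleBlockResolver
  def stop(): Unit
}
```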
Persist Shuffle Data to PMEM
• Distributed Memory Objects (DMO) is a distributed file system built on PMEM.
• The storage plugin lets us persist shuffle data into the DMO system, a separate storage cluster.
• Using PMEM and fast network technologies (RDMA or DPDK) in the storage cluster speeds up the shuffle.
[Diagram: the Splash shuffle manager implements the shuffle manager interface; its DMO plugin implements the storage plugin interface on top of the DMO system and persistent memory]
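To make the plugin idea concrete, a hypothetical sketch of such a storage plugin — this is not Splash's actual interface (that lives in the Splash repo), just the shape of the abstraction: shuffle files become named streams on a swappable backend.

```scala
import java.io.{InputStream, OutputStream}
import java.nio.file.{Files, Path}

// Hypothetical storage-plugin abstraction, for illustration only. The shuffle
// manager asks the plugin for streams instead of touching local disk, so a
// DMO-, HDFS-, or S3-backed implementation can be swapped in transparently.
trait ShuffleStoragePlugin {
  def writeShuffleData(shuffleId: Int, mapId: Int): OutputStream
  def readShuffleData(shuffleId: Int, mapId: Int): InputStream
  def deleteShuffle(shuffleId: Int): Unit
}

// A trivial local-filesystem backend sketch (paths are placeholders).
class LocalFsPlugin(root: Path) extends ShuffleStoragePlugin {
  private def dataFile(shuffleId: Int, mapId: Int): Path =
    root.resolve(s"shuffle_${shuffleId}_${mapId}.data")

  def writeShuffleData(shuffleId: Int, mapId: Int): OutputStream =
    Files.newOutputStream(dataFile(shuffleId, mapId))

  def readShuffleData(shuffleId: Int, mapId: Int): InputStream =
    Files.newInputStream(dataFile(shuffleId, mapId))

  def deleteShuffle(shuffleId: Int): Unit = {
    // Sketch only: a real plugin would enumerate and remove every file
    // belonging to this shuffle on the backend.
  }
}
```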
Benchmark Settings
Common
• 4 compute nodes
• 10GbE network
• Driver memory 4g
• Executor memory 6g
• Total cores 160
• Executor cores 4
Baseline
• 4 HDDs, 7200 RPM
DMO
• 2 storage nodes
• 400GB PMEM/node
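Written out as Spark properties, the common settings amount to the sketch below. 160 total cores at 4 cores per executor implies 40 executors — an inference, not a number stated on the slide.

```scala
import org.apache.spark.SparkConf

// The slide's common benchmark settings as Spark properties.
// spark.cores.max caps total cores in standalone/Mesos deployments.
val benchConf = new SparkConf()
  .set("spark.driver.memory", "4g")
  .set("spark.executor.memory", "6g")
  .set("spark.executor.cores", "4")
  .set("spark.cores.max", "160")
```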
TeraSort Results
TeraSort 400GB, 216G shuffle write — per-stage durations in minutes, reconstructed from the chart's data labels (series assignment follows the legend order):

                     Baseline   DMO with UDP   DMO with DPDK
Reduce stage (min)      9.2          6              6.5
Map stage (min)        22           11             10
Total (min)            31.2         17             16.5
TPC-DS Results
[Chart: TPC-DS at 1.2TB — duration in seconds (0–1800) per query, Baseline vs. DMO, covering queries 78, 4, 64, 24a, 24b, 80, 23a, 23b, 25, 17, 29, 11, 93, 74, 50, 16, 40]
TPC-DS Results, cont.
[Charts: TPC-DS queries 80, 4, 23, and 24 at 400GB, 800GB, and 1200GB scale — duration in seconds for Baseline vs. DMO, with the percentage improvement plotted on a secondary axis]
Benchmark in Cloud
• 4 Sparkling hosts:
  – 32 cores
  – 128GB DRAM
  – 50GB cloud system disk
  – 4 × 11TB SATA data disks
  – 10G network
• 3 DMO hosts:
  – 32 cores
  – 256GB DRAM (200GB PMEM)
  – 50GB cloud system disk
  – 10G network
• Spark conf (see the sketch below):
  – Driver memory 4G
  – Executor memory 30G
  – Executor memory overhead 2G
  – Executor cores 4
  – Executor instances 12
• TPC-DS
  – Data size 1TB
  – The 10 queries with the largest shuffle data
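The same configuration expressed as Spark properties (a sketch; `spark.executor.memoryOverhead` is the Spark 2.3+ name, while older YARN deployments used `spark.yarn.executor.memoryOverhead`):

```scala
import org.apache.spark.SparkConf

// The cloud benchmark configuration as Spark properties.
val cloudConf = new SparkConf()
  .set("spark.driver.memory", "4g")
  .set("spark.executor.memory", "30g")
  .set("spark.executor.memoryOverhead", "2g") // Spark 2.3+ property name
  .set("spark.executor.cores", "4")
  .set("spark.executor.instances", "12")
```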
TPC-DS Results in Cloud
[Chart: TPC-DS at 1TB in the cloud — duration in seconds (roughly 200–1100) for q78, q64, q23a, q24a, q23b, q80, q24b, q4, q25, q17, Baseline vs. DMO, with the percentage improvement (-10% to 20%) on a secondary axis]
Future Work
• Verify the solution in production environments
• Data-path performance tuning for the Splash shuffle manager
• Enable map-side merge in the Splash shuffle manager
• Support a Java NIO-style I/O interface in the Splash shuffle manager
• Land the Splash shuffle service in cloud environments
Summary
• PMEM will bring fundamental changes to ALL data centers and enable a data-driven future
• MemVerge and Tencent Cloud deliver better scalability and performance at a lower cost, and not just for Spark
  – AI, big data, banking, animation studios, gaming, IoT, etc.
  – Machine learning, analytics, and online systems
• Thank you, Intel, for supporting our work!
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
