This talk presents how we accelerated deep learning processing, from preprocessing to inference and training, on Apache Spark at SK Telecom. At SK Telecom, we have half of the Korean population as our customers. To support them, we operate 400,000 cell towers, which generate logs with geospatial tags.
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Training on Apache Spark in SK Telecom
1.
2. Vectorized Deep Learning
Acceleration from Preprocessing to
Inference and Training on Apache
Spark in SK Telecom
Hongchan Roh
Team Leader, SK Telecom
Jason Dai
Senior Principal Engineer, Intel Corporation
3. Agenda
Hongchan Roh
▪ Demo for Network Quality Analysis and
Vectorized aggregation acceleration
▪ Network Quality Analysis and Prediction in SK
Telecom Wrap-up (SAIS 2019 summary)
▪ Vectorized preprocessing acceleration via
Aggregation Push Down from Apache Spark
Jason Dai
▪ Unified Big Data + AI Software Architecture in SKT
using Analytics Zoo
▪ Future Work (Project Zouwu)
4. Demo for Network Quality Analysis and Vectorized
aggregation acceleration
5. Network Quality Analysis demo
Analyze and visualize network quality indicators,
e.g. CQI, RSRP, RSRQ, SINR, and more
Data size: 0.1 billion records
https://youtu.be/Wip_b7hUb7w
6. COVID-19 route analysis with population demo
Analyze and visualize the floating population and confirmed COVID-19 routes
Data size: 0.2 billion records (population), 10,000 records (COVID-19)
https://youtu.be/nuRBxXS2Fms
7. Network Quality Analysis
SK Telecom: the largest telecommunications provider in South Korea
• 27 million subscribers
• 300,000 cells
Target data: radio cell tower logs
• time-series data with timestamp tags, generated every 10 seconds
• geospatial data with geographical coordinates (latitude and longitude) derived from each cell tower's location
8. Network Quality Analysis
Data ingestion requirements
• Ingest 1.4 million records/sec (500-byte records, 200~500 columns)
• 120 billion records/day, 60 TB/day
• Data retention period: 7~30 days
Query requirements
• Web dashboard and ad-hoc queries: respond within 3 sec to predicates on a specific time and region (cell), across multiple sessions
• Daily batch queries: respond within hours for long query pipelines with heavy operations such as joins and aggregations
9. New In-memory Datastore for Spark (FlashBase)
• Designed a new in-memory (DRAM/SSD) datastore for Spark (FlashBase, 2016) to support far more partitions
[Diagram: legacy architecture (Spark over HDFS with an SSD cache, on Linux OS, CPU/DRAM, HDD storage) vs. the new architecture (SQL queries from Web/Jupyter into Spark-SQL, plus a data loader over JDBC, File, HTTP, Kafka, going through the Data Source APIs to a forked DRAM store and a customized Flash store, with tiering between them)]
• Assembled the best open source candidates: one for the query engine, one for the DRAM key-value store, and one for the SSD key-value store
• SSDs as the main storage devices for small-sized parallel I/O with short latencies
10. Ingestion Performance and other features
Node spec: E5 2680 v4 CPU, 256GB DRAM, 20TB SSDs
• Ingestion performance: 500,000 records/sec/node
• In-memory datastore: DRAM only, or DRAM-to-SSD tiering
• Massively parallel processing: 100 Redis processes per node
• Extreme partitioning: up to 2 billion partitions on a single node
• Filter acceleration: using fine-grained partitions and pushed-down filters
• Column-store: column-store by default (row-store option)
• Column value transformation: defined in Java grammar in the schema table properties
• Compression: Gzip-level compression ratio with LZ4-level speed
• Vector processing: filter and aggregation acceleration (SIMD, AVX)
• Etc.: recovery, replication, scale-out
11. Extreme Partitioning feature
A partition combination for network quality analysis
• 300K (cell towers) × 100K (time slots) = 30B (total partitions)
[Diagram: partition grid of time partitions (201804221100, 201804221105, 201804221110, … 201904221105, 201904221110) × cell tower partitions (geohash IDs wvcv3, wvcyw, wvfj6, … wyf9w, wyfb1)]
Node spec: E5 2680 v4 CPU, 256GB DRAM, 20TB SSDs
FlashBase
• Up to 2 billion partitions in a single node
• Needs 15 nodes to store 30 billion partitions
Oracle
• Up to 1 million partitions in a single cluster
✓ 2000-fold reduction in computation and I/O for specific time and region queries!
12. Network Quality Analysis Example
Network quality analysis query for one day and a single cell tower
• 0.142 trillion (142 billion) records in the ue_rf_sum table (7 days of data, 42TB)
  (ue_rf_sum: user_equipment_radio_frequency_summary table)
• 14,829 satisfying records

select * from ue_rf_sum
where event_time between '201910070000' and '201910080000' and cell_tower_id = 'snjJlAF5W' and rsrp < -85;

Spark with HDFS: half an hour
• HDFS cluster: 20 nodes (E5 2650 v4 CPU, 256GB DRAM, 24TB HDDs)
• Partition filtering with time only: 1/10080
Spark with FlashBase: less than 1 sec
• FlashBase cluster: 16 nodes (E5 2680 v4 CPU, 256GB DRAM, 20TB SSDs)
• Partition filtering with time and cell tower: 1/(10080 × 30000), worked out below
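For intuition, the time-only factor presumably counts one-minute time partitions over the 7-day retention window (an assumption, but consistent with the numbers above):

$$7\ \text{days} \times 24\ \text{h/day} \times 60\ \text{min/h} = 10{,}080
\quad\Rightarrow\quad
\frac{1}{10{,}080} \;\to\; \frac{1}{10{,}080 \times 30{,}000}\ \text{once the cell-tower key is added.}$$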
13. Introduction of Network Quality Prediction
• Predict network quality indicators (CQI, RSRP, RSRQ, SINR, …) for anomaly detection and real-time management
• Goal: unify geospatial visualization & network prediction on Spark
* CQI: Channel Quality Indicator
* RSRP: Reference Signal Received Power
* RSRQ: Reference Signal Received Quality
* SINR: Signal to Interference Noise Ratio
14. Memory augmented model
[Diagram: seven memory encoders (Encoder1 over memory1 … memory7, covering 1 week of data) and one current encoder (Encoder2) feed an attention layer; the attended memory is concatenated with the current encoding and passed through an FCNN to produce the final prediction $\hat{y}_{t+1}$]
• Current: the most recent 50 min of data at a 5-min period
• Memory: the previous 7 days of historical data, each covering the same time band as the current and target
• Target: network quality after 5 min
• Encoder: 1-NN (autoregressive term); a toy sketch follows below
  Encoder1: $h_t = c + w_1\,y_{t-\text{1week}-1} + \dots + w_{11}\,y_{t-\text{1week}-11}$
  Encoder2: $h'_t = c' + w'_1\,y_{t-1} + \dots + w'_{10}\,y_{t-10}$
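As a toy illustration of the autoregressive encoder form above, a minimal sketch; the oldest-to-newest layout of y and the helper name arEncode are assumptions, not the talk's implementation:

import scala.collection.immutable.IndexedSeq

// Minimal sketch of the 1-NN autoregressive encoder: h = c + sum_i w_i * y_{t-i}.
// For Encoder2 this would be the 10 most recent 5-minute observations.
def arEncode(y: IndexedSeq[Double], w: IndexedSeq[Double], c: Double): Double = {
  require(y.length >= w.length, "need at least w.length past observations")
  // pair w_1..w_n with the n most recent observations y_{t-1}..y_{t-n}
  c + w.zipWithIndex.map { case (wi, i) => wi * y(y.length - 1 - i) }.sum
}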
15. Memory augmented model - Test result
[Chart: actual vs. forecast curves with MAE error (score = error × 100) for the memory-augmented model and a Seq2Seq baseline]
Improved predictions for sudden changes!
16. Training & Inference Architecture
Build an in-memory pipeline between FlashBase and Intel Analytics Zoo
[Diagram: the FlashBase data layer (Spark-SQL and data loader over the Data Source APIs (File, HTTP, Kafka), the forked DRAM store and the customized Flash store with tiering) feeds (1) preprocessing, (2) an RDD of Tensors, and (3) the TF model code into DL training & inferencing, all inside one Spark cluster]
• The data layer and the inferencing & training layer are integrated into the same Spark cluster and share the same Spark session
• Source code: https://github.com/mnms/ARMemNet-BigDL
• Intel Analytics Zoo: used to seamlessly unify the TF model into the Spark pipeline
• Intel BigDL: inference & training engine
• Inferencing & training can be distributed across the Spark cluster
SIMD Acceleration
19. Apache Spark pushes down filters and projection
▪ Apache Spark pushes down only filters and projection to the data source (see the sketch after this list)
▪ The data source must still transfer all surviving rows to Spark via socket
▪ Aggregation operations are performed entirely in Spark
▪ Spark data source filter push down
▪ And, Or, Not, Like, Limit
▪ EQ, GT, GTE, LT, LTE, IN, IsNULL, IsNotNULL, EqualNullSafe
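A minimal sketch of what this looks like on the data source side, using Spark's stock PrunedFilteredScan API; FlashBaseRelation and scanFlashBase are hypothetical names, not FlashBase's actual implementation:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

// Hypothetical relation: Spark hands it the projection (requiredColumns)
// and the supported filters; everything else -- including aggregation --
// still runs inside Spark.
class FlashBaseRelation(override val sqlContext: SQLContext,
                        override val schema: StructType)
    extends BaseRelation with PrunedFilteredScan {

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] =
    scanFlashBase(requiredColumns, filters)

  // placeholder for the real scan; a real source applies the filters and
  // column pruning, then ships every surviving row back over the socket
  private def scanFlashBase(cols: Array[String],
                            filters: Array[Filter]): RDD[Row] =
    sqlContext.sparkContext.emptyRDD[Row]
}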
20. What if Aggregation can be done in the Data source?
▪ Network I/O between the data source and Spark can be greatly reduced.
▪ The data source just transfers the aggregated results.
▪ Vectorized aggregation in the data source
▪ The data source accelerates aggregation operations (MIN, MAX, SUM, COUNT) using Intel's SIMD API.
21. Aggregation pushdown – Architecture
▪ The FlashBase DataSource extends Catalyst with a custom optimization rule
▪ Pushes down aggregation without customizing the Spark source code.
▪ Define a custom optimization rule and add it to the set of custom optimization rules in Spark (see the sketch after this list):
sqlContext.experimental.extraOptimizations ++= Seq( PropagateAggregationRule )
▪ The custom rule is executed after Catalyst's built-in optimization rules finish.
- PropagateAggregationRule is executed after the optimization phase.
- PropagateAggregationRule pushes down aggregations to the DataSource
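A minimal sketch of that registration, with the rule reduced to a pass-through skeleton (the real matching logic is walked through on the next slides):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Skeleton of the custom rule; the Aggregate/Filter/LogicalRelation
// matching is filled in on the following slides.
object PropagateAggregationRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case other => other // replace matched Aggregate plans here
  }
}

val spark = SparkSession.builder().appName("agg-pushdown").getOrCreate()
// extraOptimizations are applied after Catalyst's built-in optimizer batches
spark.experimental.extraOptimizations ++= Seq(PropagateAggregationRule)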
22. Aggregation pushdown – Before push down
SELECT SUBSTR(EVENT_TIME, 0, 4) AS YEAR, COUNT(1), AVG(HEIGHT)
FROM TBL_PEOPLE
WHERE JOB = 'ENGINEER'
GROUP BY YEAR

Aggregate
  Project (EVENT_TIME, HEIGHT)
    Filter (JOB = 'ENGINEER')
      LogicalRelation (Attrs: EVENT_TIME, JOB, HEIGHT)
        Relation

The "Relation" plan builds an RDD of the data in a data source.
The LogicalRelation defines the attributes in the data source.
The RDD's data are filtered by the "Filter" plan.
The selected columns are pruned by the "Project" plan.
The pruned & filtered data are finally aggregated by the "Aggregate" plan.
23. Aggregation pushdown – Push down GB & AGG
[Plan: Aggregate → Project (EVENT_TIME, HEIGHT) → Filter (JOB = 'ENGINEER') → LogicalRelation (Attrs: EVENT_TIME, JOB, HEIGHT), with the Relation replaced by RelationForAggregation]
aggregateExpressions: the expression trees of GROUP BY & AGGREGATE
• ALIAS → SUBSTRING → Attribute(EVENT_TIME)
• ALIAS → COUNT function → Literal(1)
• ALIAS → AVG function → Attribute(HEIGHT)
- Create a RelationForAggregation and replace the original Relation with it
- Push down the expression trees of the GROUP BY and AGGREGATE functions to the RelationForAggregation.
24. Aggregation pushdown – Build Aggregated Project
[Plan before: Aggregate → Project (EVENT_TIME, HEIGHT) → Filter (JOB = 'ENGINEER') → LogicalRelation (Attrs: EVENT_TIME, JOB, HEIGHT)]
aggregateExpressions: the expression trees of GROUP BY & AGGREGATE
• ALIAS → SUBSTRING → Attribute(EVENT_TIME)  <tree of GROUP BY>
• ALIAS → COUNT function → Literal(1)  <trees of aggregation functions>
• ALIAS → AVG function → Attribute(HEIGHT)
Transform the trees into Attributes and wrap them with a Project plan (see the sketch after this slide):
Project
• ALIAS → Attribute(SUBSTR(EVENT_TIME))
• ALIAS → Attribute(COUNT(1))
• ALIAS → Attribute(AVG(HEIGHT))
over RelationForAggregation (holding the expression trees of GROUP BY & AGG)
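A hedged sketch of that rewrite as a Catalyst transform; RelationForAggregation is FlashBase's own plan node, so its construction is only stubbed here, and the names and arguments are illustrative rather than the real signature:

import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeReference, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Filter, LogicalPlan, Project}
import org.apache.spark.sql.execution.datasources.LogicalRelation

def propagateAggregation(plan: LogicalPlan): LogicalPlan = plan transform {
  // match "Aggregate over Project over Filter over LogicalRelation"
  case Aggregate(groupingExprs, aggExprs,
                 Project(_, Filter(condition, rel: LogicalRelation))) =>
    // 1) push the GROUP BY / aggregate trees and the filter into a new
    //    relation that evaluates them inside the data source (stubbed)
    val aggRelation: LogicalPlan =
      ??? // e.g. RelationForAggregation(rel, groupingExprs, aggExprs, condition)
    // 2) replace each pushed-down tree with a plain attribute of the new
    //    relation's output, and wrap the attributes with a Project plan
    val projectList: Seq[NamedExpression] = aggExprs.map { e =>
      Alias(AttributeReference(e.name, e.dataType, nullable = true)(), e.name)()
    }
    Project(projectList, aggRelation)
}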
25. Aggregation pushdown – After push down
SELECT SUBSTR(EVENT_TIME, 0, 4) AS YEAR, COUNT(1), AVG(HEIGHT)
FROM TBL_PEOPLE
WHERE JOB = 'ENGINEER'
GROUP BY YEAR

[Plan after: Project (ALIAS(SUBSTR(EVENT_TIME)), ALIAS(COUNT(1)), ALIAS(AVG(HEIGHT))) → LogicalRelation (Attrs: SUBSTR(EVENT_TIME), COUNT(1), AVG(HEIGHT)) → RelationForAggregation, which holds the expression trees of GROUP BY & AGG plus the filters to be applied before aggregation]

== Optimized Logical Plan ==
Project [substring(EVENT_TIME#24, 0, 4)#59 AS YEAR#34, count(1)#60L AS count(1)#57L,
avg(HEIGHT#28)#61 AS avg(HEIGHT)#58]
+- Relation[substring(EVENT_TIME#24, 0, 4)#59,count(1)#60L,avg(HEIGHT#28)#61]
R2RelationForAggregation(ArrayBuffer(substring(EVENT_TIME#24, 0,
4)#59),ArrayBuffer(substring(EVENT_TIME#24, 0, 4)),ArrayBuffer(count(1)#60L,
avg(HEIGHT#28)#61),ArrayBuffer(count(1), avg(HEIGHT#28)),WrappedArray(IsNotNull(JOB),
EqualTo(JOB,ENGINEER)))
26. Vectorized Aggregation in Data Source
▪ When the query is executed, RelationForAggregation builds an Aggregation-RDD.
▪ The Aggregation-RDD sends an AGGREGATE command to FlashBase with the GROUP BY & AGG functions and the filter as arguments.
▪ Each node of FlashBase executes vectorized aggregation using Intel's AVX-512. (A sketch of the RDD side follows below.)
[Diagram: RelationForAggregation (expression trees of GROUP BY & AGG, filters to be applied before aggregation) → Aggregation RDD → AGGREGATE command → data source of FlashBase → vectorized aggregation via Intel's AVX-512]
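A heavily hedged sketch of the Aggregation-RDD idea: one partition per FlashBase node, each sending a single AGGREGATE command and iterating over the aggregated rows that come back. The class shape and sendAggregateCommand are hypothetical, not FlashBase's real client API:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// One partition per FlashBase node; compute() ships the aggregation to
// the node instead of pulling raw rows back into Spark.
class AggregationRDD(sc: SparkContext,
                     groupBy: Seq[String],
                     aggs: Seq[String],
                     filters: Seq[String],
                     nodes: Array[String]) extends RDD[Row](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    nodes.indices.map(i => new Partition { override val index: Int = i })
      .toArray[Partition]

  override def compute(split: Partition, context: TaskContext): Iterator[Row] =
    // the node applies the filters, runs the grouped aggregation with its
    // AVX-512 kernels, and returns only the aggregated rows
    sendAggregateCommand(nodes(split.index), groupBy, aggs, filters)

  // hypothetical client call standing in for the real AGGREGATE command
  private def sendAggregateCommand(node: String, groupBy: Seq[String],
                                   aggs: Seq[String],
                                   filters: Seq[String]): Iterator[Row] =
    Iterator.empty
}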
27. Acceleration Results – Training Data Set and H/W
▪ Samples 2,164 cells from the 300K total radio towers
▪ Covers the last 42 days, at 10-second intervals
▪ With 8 KPI columns*, 725,726,100 rows in total
▪ 3 servers with Intel Gold 6240
* CQI, RSRP, RSRQ, DL_PRB_USAGE_RATE, SINR, UE_TX_POWER, PHR, UE_CONN_TOT_CNT
28. Acceleration Results - Normalization
▪ MinMax-based scaling requires the min and max values of the training data set (a scaling sketch follows below)
▪ Standard-deviation-based scaling requires the mean and standard deviation

Operation | Aggregation Pushdown Off | Aggregation Pushdown On | Performance Gain
Min | 16.7 s | 2.0 s | 8.35x
  dataframe.agg(
    min("CQI").as("CQI"), min("RSRP").as("RSRP"), min("RSRQ").as("RSRQ"), …
  ).first
Max | 16.2 s | 2.0 s | 8.1x
  dataframe.agg(
    max("CQI").as("CQI"), max("RSRP").as("RSRP"), max("RSRQ").as("RSRQ"), …
  ).first
Mean | 16.0 s | 5.0 s | 3.2x
  dataframe.agg(
    avg("CQI").as("CQI"), avg("RSRP").as("RSRP"), avg("RSRQ").as("RSRQ"), …
  ).first
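A minimal sketch of how these pushed-down aggregates feed the scaling step, for one KPI column assumed to be of DoubleType; the column name and the minMaxScale helper are illustrative:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, max, min}

// MinMax-scale one numeric column in place: x' = (x - min) / (max - min).
def minMaxScale(df: DataFrame, column: String): DataFrame = {
  // a single aggregate-only query -- exactly the Min/Max pattern above,
  // so it is answered inside the data source when pushdown is on
  val row = df.agg(min(column).as("mn"), max(column).as("mx")).first
  val (mn, mx) = (row.getAs[Double]("mn"), row.getAs[Double]("mx"))
  df.withColumn(column, (col(column) - lit(mn)) / lit(mx - mn))
}

// e.g. val scaled = minMaxScale(dataframe, "CQI")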
29. Acceleration Results – 5 min window average
▪ Aggregation operations for 5 min window aggregation

Operation | Aggregation Pushdown Off | Aggregation Pushdown On | Performance Gain
Training | 22.67 s | 12 s | 1.89x
  dataframe.groupBy("EVENT_TIME", "UNIQ_ID").agg(
    avg("CQI").as("CQI"), avg("RSRP").as("RSRP"), avg("RSRQ").as("RSRQ"), …
  )
Inference | 0.53 s | 0.27 s | 1.96x
  dataframe.where(filter).groupBy("UNIQ_ID", "EVENT_TIME").agg(
    avg("CQI"), avg("RSRP"), avg("RSRQ"), …
  )
30. Acceleration Factor Breakdown (min/max)
[Diagram: the Spark job sends a scan command through the Data Source API to FlashBase (data loader over File/HTTP/Kafka, forked DRAM store, customized Flash store, tiering); (1) the aggregation is pushed down, (2) FlashBase accelerates the aggregation via AVX-512, and (3) only the aggregated results flow back to the Spark job for preprocessing]
1. Spark pushes the aggregation down to FlashBase, and FlashBase sends only the aggregated results back to Spark
   → reduces Spark's shuffle-write size and computation to 1/5
2. FlashBase accelerates the aggregation with vector processing via Intel's AVX-512 (Intel Math Kernel Library)
   → 1.5x faster aggregation
→ 8x faster overall
31. Agenda
Jason Dai
▪ Unified Big Data + AI Software Architecture in SKT
using Analytics Zoo
▪ Future Work (Project Zouwu)
32. Unified Big Data + AI Software
Architecture in SKT using Analytics Zoo
33. AI on Big Data
Accelerating Data Analytics + AI Solutions At Scale
BigDL: distributed, high-performance deep learning framework for Apache Spark*
https://github.com/intel-analytics/bigdl
Analytics Zoo: unified analytics + AI platform for TensorFlow*, PyTorch*, Keras*, BigDL, Ray* and Apache Spark*
https://github.com/intel-analytics/analytics-zoo
34. Analytics Zoo
Unified Data Analytics and AI Platform (https://github.com/intel-analytics/analytics-zoo)
• Integrated Analytics & AI Pipelines: Distributed TensorFlow & PyTorch on Spark, Spark Dataframes & ML Pipelines for DL, RayOnSpark, InferenceModel
• Automated ML Workflow: AutoML for Time Series, Automatic Cluster Serving
• Models & Algorithms: Recommendation, Time Series, Computer Vision, NLP
• Built on Python libraries (Numpy/Pandas/sklearn/…), DL frameworks (TF/PyTorch/OpenVINO/…), and distributed analytics (Spark/Flink/Ray/…); powered by oneAPI
• Compute environment: laptop, Hadoop cluster, K8s cluster, Spark cluster
35. Distributed TensorFlow on Spark in Analytics Zoo
#pyspark code
train_rdd = spark.hadoopFile(…).map(…)
dataset = TFDataset.from_rdd(train_rdd, …)

#tensorflow code
import tensorflow as tf
slim = tf.contrib.slim
images, labels = dataset.tensors
with slim.arg_scope(lenet.lenet_arg_scope()):
    logits, end_points = lenet.lenet(images, …)
loss = tf.reduce_mean(
    tf.losses.sparse_softmax_cross_entropy(
        logits=logits, labels=labels))

#distributed training on Spark
optimizer = TFOptimizer.from_loss(loss, Adam(…))
optimizer.optimize(end_trigger=MaxEpoch(5))

Write TensorFlow inline with Spark code (Analytics Zoo API in blue)
36. Spark Dataframe & ML Pipeline for DL
#Spark dataframe code
parquetfile = spark.read.parquet(…)
train_df = parquetfile.withColumn(…)

#Keras API
model = Sequential() \
    .add(Convolution2D(32, 3, 3)) \
    .add(MaxPooling2D(pool_size=(2, 2))) \
    .add(Flatten()).add(Dense(10))

#Spark ML pipeline code
estimator = NNEstimator(model, CrossEntropyCriterion()) \
    .setMaxEpoch(5) \
    .setFeaturesCol("image")
nnModel = estimator.fit(train_df)

Analytics Zoo API in blue
37. Unified Big Data + AI Pipeline in SKT using Analytics Zoo
*Other names and brands may be claimed as the property of others
[Diagram: SQL queries (Web, Jupyter) and the data loader feed Spark-SQL and the Data Source APIs over FlashBase (forked DRAM store and customized Flash store with tiering); the preprocessing stage then feeds the Analytics Zoo pipeline]
38. Performance Improvement by Analytics Zoo
TCO-optimized AI performance with [1] Analytics Zoo, [2] Intel-optimized TensorFlow, [3] distributed AI processing
Test data: 80K cell towers, 8 days, 5-min period, 8 quality indicators

[1] Pre-processing & inference latency (seconds)
• Python preprocessing (Pandas) & inference on GPU: 74.26 s
• Python distributed preprocessing (DASK) & inference on GPU: 10.24 s
• Intel Analytics Zoo, 1 server (Xeon 6240): 3.24 s (~3X vs. DASK + GPU)
• Intel Analytics Zoo, 3 servers (Xeon 6240): 1.61 s (~6X vs. DASK + GPU)

[2] Time-To-Training performance
[Chart: training time (0–1800 seconds) across batch sizes BS 4,096 / 8,192 / 16,384 / 32,768 / 65,536 for Intel Analytics Zoo on 1 server (Xeon 6240), a GPU vendor, and Intel Analytics Zoo on 3 servers (distributed training scalability case, Xeon 6240)]

Performance test validation @ SK Telecom Testbed
All performance testing and validation results were provided by SK Telecom Testbed. Intel does not control or audit third-party data.
You should consult other sources to evaluate accuracy. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
40. Project Zouwu: Analytics Zoo Time Series for Telco
Project Zouwu (link)
▪ Use case: reference time series use cases for Telco (such as network traffic forecasting)
▪ Models: built-in models for time series analysis (such as LSTM and MTNet)
▪ "AutoTS": AutoML support for building E2E time series analysis pipelines (including automatic feature generation, model selection and hyperparameter tuning)
[Diagram: Project Zouwu's use-case, model, and autots layers on top of Analytics Zoo's built-in models, ML workflow, AutoML workflow, and integrated analytics & AI pipelines]
https://github.com/intel-analytics/analytics-zoo/tree/master/pyzoo/zoo/zouwu
41. Project Zouwu: Analytics Zoo Time Series for Telco
▪ Built for common Telco use cases
• Time series analysis
• Network KPI forecasting
• Anomaly detection
• AIOps
▪ Optimized and scalable solutions on Xeon
• Integrates Intel-optimized libraries on Xeon (TF, PyTorch, OpenVINO, MKL-DNN, etc.)
• Scales out TensorFlow/PyTorch/OpenVINO models across clusters
• AutoML for building end-to-end AI pipelines automatically