Vectorized Deep Learning Acceleration from Preprocessing to Inference and Training on Apache Spark in SK Telecom
Hongchan Roh
Team Leader, SK Telecom
Jason Dai
Senior Principal Engineer, Intel Corporation
Agenda
Hongchan Roh
▪ Demo for Network Quality Analysis and Vectorized aggregation acceleration
▪ Network Quality Analysis and Prediction in SK Telecom Wrap-up (SAIS 2019 summary)
▪ Vectorized preprocessing acceleration via Aggregation Push Down from Apache Spark
Jason Dai
▪ Unified Big Data + AI Software Architecture in SKT using Analytics Zoo
▪ Future Work (Project Zouwu)
Demo for Network Quality Analysis and Vectorized aggregation acceleration
https://youtu.be/Wip_b7hUb7w
Network Quality Analysis demo
▪ Analyze and visualize network quality indicators, e.g. CQI, RSRP, RSRQ, SINR, and more
▪ Data size: 0.1 billion records
COVID-19 route analysis with population demo
▪ Analyze and visualize the floating population and COVID-19 confirmed-case routes
▪ Data size: 0.2 billion records (population), 10,000 records (COVID-19)
https://youtu.be/nuRBxXS2Fms
Network Quality Analysis and Prediction in SK Telecom Wrap-up (SAIS 2019 summary)
Network Quality Analysis
SK Telecom: the largest telecommunications provider in South Korea
• 27 million subscribers
• 300,000 cells
Target Data: Radio cell tower logs
• Time-series data with timestamp tags, generated every 10 seconds
• Geospatial data with geographical coordinates (latitude and longitude) derived from the cell tower's location
Network Quality Analysis
Data Ingestion Requirements
• Ingestion: 1.4 million records/sec (500-byte records, 200~500 columns)
• 120 billion records/day, 60 TB/day
• Data retention period: 7~30 days
Query Requirements
• Web dashboard and ad-hoc queries: response within 3 sec for predicates on a specific time and region (cell), across multiple sessions
• Daily batch queries: response within hours for long query pipelines with heavy operations such as joins and aggregations
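The daily figures in the ingestion requirements follow from the per-second rate. A quick check in Python, assuming a steady 1.4 million records/sec, 500 bytes per record, and decimal terabytes (the function and constant names are illustrative and do not come from the deck):

```python
from typing import Tuple

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def daily_volume(records_per_sec: int, bytes_per_record: int) -> Tuple[int, float]:
    """Return (records per day, terabytes per day) for a steady ingestion rate."""
    records_per_day = records_per_sec * SECONDS_PER_DAY
    terabytes_per_day = records_per_day * bytes_per_record / 1e12  # decimal TB
    return records_per_day, terabytes_per_day


records, terabytes = daily_volume(records_per_sec=1_400_000, bytes_per_record=500)
print(f"{records / 1e9:.2f} billion records/day, {terabytes:.2f} TB/day")
```

Both values round to the 120 billion records/day and 60 TB/day stated in the requirements.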
New In-memory Datastore for Spark (FlashBase)
• Designed a new data store for Spark (FlashBase) to support far more partitions
• Best open source candidates to assemble: one for the query engine, one for the DRAM key-value store, and one for the SSD key-value store
• SSDs as main storage devices for small-sized parallel I/O with short latencies
• New in-memory (DRAM/SSD) datastore for Spark (2016)
[Diagram, Legacy Architecture: Spark, HDFS, SSD Cache, Linux OS, CPU/DRAM, Storage (HDD)]
[Diagram, New Architecture: SQL Queries (Web, Jupyter), JDBC, Spark-SQL, Data Source APIs, Data Loading, Data Loader (File, HTTP, Kafka), DRAM Store (forked), Flash Store (customized), tiering]
Ingestion Performance and other features
Node spec: E5 2680 v4 CPU, 256GB DRAM, 20TB SSDs
Features | Details
Ingestion performance | 500,000 records/sec/node
In-memory datastore | DRAM only, DRAM to SSD tiering
Massively Parallel Processing | 100 Redis processes per node
Extreme partitioning | Up to 2 billion partitions per node
Filter acceleration | Using fine-grained partitions and pushed-down filters
Column-store | Column-store by default (row-store option)
Column value transformation | Defined with Java grammar in the schema table properties
Compression | Gzip-level compression ratio with LZ4-level speed
Vector processing | Filter and aggregation acceleration (SIMD, AVX)
Etc. | Recovery, replication, scale-out
Extreme Partitioning feature
A partition combination for network quality analysis
• 300K (cell towers) × 100K (time slots) = 30B (total partitions)
[Figure: partition grid. Cell tower partitions: wvcv3, wvcyw, wvfj6, …, wyfb1, wyf9w. Time partitions: 201804221100, 201804221105, 201804221110, …, 201904221105, 201904221110]
Node spec: E5 2680 v4 CPU, 256GB DRAM, 20TB SSDs
FlashBase
• Up to 2 billion partitions in a single node
• Needs 15 nodes to store 30 billion partitions
Oracle
• Up to 1 million partitions in a single cluster
✓ 2000 times reduction in computation and I/O for specific time and region queries!
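The partition counts on this slide can be checked with a few lines of Python. The constants come from the slide. Treating the "2000 times" figure as the ratio of FlashBase partitions per node to Oracle partitions per cluster is an assumption about how that number was derived, not something the deck states.

```python
import math

CELL_TOWERS = 300_000
TIME_SLOTS = 100_000
FLASHBASE_PARTITIONS_PER_NODE = 2_000_000_000
ORACLE_PARTITIONS_PER_CLUSTER = 1_000_000

# One partition per (cell tower, time slot) pair.
total_partitions = CELL_TOWERS * TIME_SLOTS

# Nodes needed if every node is filled to its partition limit.
nodes_needed = math.ceil(total_partitions / FLASHBASE_PARTITIONS_PER_NODE)

# Assumed reading of the "2000 times" claim: partition-count ratio.
partition_ratio = FLASHBASE_PARTITIONS_PER_NODE // ORACLE_PARTITIONS_PER_CLUSTER

print(total_partitions, nodes_needed, partition_ratio)
```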
Network Quality Analysis Example
Network quality analysis query for one day and a single cell tower
• 0.142 trillion (142 billion) records in the ue_rf_sum¹ table (7 days of data, 42TB)
• 14,829 satisfying records
select * from ue_rf_sum
where event_time between '201910070000' and '201910080000' and cell_tower_id = 'snjJlAF5W' and rsrp < -85;
Spark with HDFS: half an hour
• Partition filtering with time: 1/10080
• HDFS Cluster: 20 nodes (E5 2650 v4 CPU, 256GB DRAM, 24TB HDDs)
Spark with FlashBase: less than 1 sec
• Partition filtering with time and cell tower: 1/(10080 * 30000)
• FlashBase Cluster: 16 nodes (E5 2680 v4 CPU, 256GB DRAM, 20TB SSDs)
¹ user_equipment_radio_frequency_summary table
Introduction of Network Quality Prediction
• Predict Network Quality Indicators (CQI, RSRP, RSRQ, SINR, …) for anomaly detection and real-time management
• Goal: Unify geospatial visualization & network prediction on Spark
* CQI: Channel Quality Indicator
* RSRP: Reference Signal Received Power
* RSRQ: Reference Signal Received Quality
* SINR: Signal to Interference plus Noise Ratio
Memory augmented model
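[Diagram: 1-week data is split into seven memory blocks (memory1 … memory7) alongside the current block. Each memory block passes through Encoder1 and the current block through Encoder2; the encoded memories pass through an attention layer; the results are concatenated (Concat) and fed to an FCNN, which outputs the final prediction ŷ.]
Current: recent 50 min of data with a 5 min period
Memory: previous 7 days of historical data, each covering the same time band as current and target
Target: network quality after 5 min
• Encoder: 1-NN (autoregressive term)
Encoder1: $h_t = c + w_1\, y_{t-\tau-1} + \dots + w_{11}\, y_{t-\tau-11}$
Encoder2: $h'_t = c' + w'_1\, y_{t-1} + \dots + w'_{10}\, y_{t-10}$
($\tau$ denotes the time offset of the memory block; its label is not legible in the source.)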
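Both encoders are described as autoregressive terms: a bias plus a weighted sum of lagged values. A minimal Python sketch of that form follows. The function name, weights, and sample values are hypothetical, and this is not the ARMemNet implementation, where the weights are learned.

```python
from typing import Sequence


def ar_encode(history: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Compute h_t = bias + sum_i weights[i-1] * y_{t-i}.

    history is ordered oldest to newest, so history[-1] is y_{t-1}.
    """
    if len(history) < len(weights):
        raise ValueError("history must contain at least len(weights) values")
    # Newest value first, so it lines up with w_1, w_2, ...
    lagged = list(history[-len(weights):])[::-1]
    return bias + sum(w * y for w, y in zip(weights, lagged))


# Ten lagged values, matching a 50 min window sampled every 5 min.
current = [float(v) for v in range(1, 11)]
h_t = ar_encode(current, weights=[0.1] * 10, bias=0.5)
print(h_t)
```

With ten lags this matches the shape of Encoder2; an eleven-lag call would match the shape of Encoder1.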
Memory augmented model - Test result
[Charts: actual vs. forecast values for the Mem-model and for Seq2Seq. Error: MAE. Score: Error*100]
Improved predictions for sudden change!
Training & Inference Architecture
• Build an in-memory pipeline between FlashBase and Intel Analytics Zoo
• The data layer and the inferencing & training layer are integrated into the same Spark cluster and share the same Spark session
• Intel Analytics Zoo: used to unify the TF model into the Spark pipeline seamlessly
• Intel BigDL: inference & training engine
• Inferencing and training can be distributed across the Spark cluster
• Source code: https://github.com/mnms/ARMemNet-BigDL
[Diagram: Data Loader (File, HTTP, Kafka), Data Loading, Data Source APIs, Spark-SQL, DRAM Store (forked), Flash Store (customized), tiering, SIMD Acceleration, Preprocess, RDD of Tensor, Model Code of TF, DL Training & Inferencing, Data, Model, Spark Cluster]
Vectorized preprocessing acceleration via Aggregation Push Down from Apache Spark
What is push down?
Apache Spark pushes down filters and projection
▪ Apache Spark pushes down only filters and projections to the data source
▪ The data source must transfer all remaining data to Spark via socket
▪ Aggregation operations are performed entirely in Spark
▪ Spark data source filter push down
▪ And, Or, Not, Like, Limit
▪ EQ, GT, GTE, LT, LTE, IN, IsNULL, IsNotNULL, EqualNullSafe
What if aggregation can be done in the data source?
▪ Network I/O between the data source and Spark can be largely reduced.
▪ The data source transfers only aggregated results.
▪ Vectorized aggregation in the data source
▪ The data source accelerates aggregation operations (MIN, MAX, SUM, COUNT) using the Intel SIMD API.
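The difference between the two modes described above can be illustrated with a small in-memory stand-in for a data source. This is a sketch in plain Python, not Spark or FlashBase code; the rows and function names are made up for illustration.

```python
from typing import Dict, List, Tuple, Union

Row = Dict[str, Union[str, float]]

ROWS: List[Row] = [
    {"job": "ENGINEER", "height": 170.0},
    {"job": "ENGINEER", "height": 180.0},
    {"job": "DOCTOR", "height": 165.0},
    {"job": "ENGINEER", "height": 175.0},
]


def scan_with_filter_pushdown(rows: List[Row], job: str) -> List[float]:
    """The source applies the filter and projection, then ships every surviving value."""
    return [float(row["height"]) for row in rows if row["job"] == job]


def scan_with_aggregation_pushdown(rows: List[Row], job: str) -> Tuple[float, float, float, int]:
    """The source also aggregates and ships a single (min, max, sum, count) tuple."""
    heights = [float(row["height"]) for row in rows if row["job"] == job]
    return min(heights), max(heights), sum(heights), len(heights)


# Without aggregation pushdown: the values cross the wire and the engine aggregates.
shipped = scan_with_filter_pushdown(ROWS, "ENGINEER")
engine_side = (min(shipped), max(shipped), sum(shipped), len(shipped))

# With aggregation pushdown: only the aggregate crosses the wire.
pushed = scan_with_aggregation_pushdown(ROWS, "ENGINEER")

print(len(shipped), "values shipped without aggregation pushdown, 1 tuple with it")
```

Both paths produce the same aggregate; only the amount of data transferred differs, which is the saving the slide points to.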
Aggregation pushdown – Architecture
▪ The FlashBase DataSource extends the Catalyst optimization rules
▪ Pushes down aggregation without customizing the Spark source code.
▪ Define a custom optimization rule and add it to the set of extra optimization rules in Spark.
sqlContext.experimental.extraOptimizations ++= Seq( PropagateAggregationRule )
▪ The custom rule runs after Catalyst has finished its own optimization rules.
- PropagateAggregationRule is executed after the optimization phase.
- PropagateAggregationRule pushes down aggregations to the DataSource.
Aggregation pushdown – Before push down
SELECT SUBSTR(EVENT_TIME, 0, 4) AS YEAR, COUNT(1), AVG(HEIGHT)
FROM TBL_PEOPLE
WHERE JOB = 'ENGINEER'
GROUP BY YEAR
Logical plan (top to bottom):
Aggregate
  Project (EVENT_TIME, HEIGHT)
    Filter (JOB = 'ENGINEER')
      LogicalRelation (Attrs: EVENT_TIME, JOB, HEIGHT)
        Relation
▪ The "Relation" plan builds an RDD of the data in a data source.
▪ The LogicalRelation defines the attributes in a data source.
▪ The RDD's data are filtered by the "Filter" plan.
▪ The selected columns are pruned by the "Project" plan.
▪ The pruned and filtered data are finally aggregated by the "Aggregate" plan.
Aggregation pushdown – Push down GB & AGG
Aggregate
  Project (EVENT_TIME, HEIGHT)
    Filter (JOB = 'ENGINEER')
      LogicalRelation (Attrs: EVENT_TIME, JOB, HEIGHT)
        RelationForAggregation
aggregateExpressions: expression trees of GROUP BY & AGGREGATE
  ALIAS → SUBSTRING → Attribute (EVENT_TIME)
  ALIAS → COUNT Function → Literal(1)
  ALIAS → AVG Function → Attribute (HEIGHT)
- Create a RelationForAggregation and replace the original Relation with it.
- Push down the expression trees of the GROUP BY and AGGREGATE functions to the RelationForAggregation.
Aggregation pushdown – Build Aggregated Project
Aggregate
  Project (EVENT_TIME, HEIGHT)
    Filter (JOB = 'ENGINEER')
      LogicalRelation (Attrs: EVENT_TIME, JOB, HEIGHT)
        RelationForAggregation (holds the expression trees of GROUP BY & AGG)
aggregateExpressions: expression trees of GROUP BY & AGGREGATE
  <Tree of GROUP BY>: ALIAS → SUBSTRING → Attribute (EVENT_TIME)
  <Trees of Aggregation Functions>: ALIAS → COUNT Function → Literal(1); ALIAS → AVG Function → Attribute (HEIGHT)
Transform the trees to Attributes and wrap them with a Project plan:
Project
  ALIAS → Attribute (SUBSTR(EVENT_TIME))
  ALIAS → Attribute (COUNT(1))
  ALIAS → Attribute (AVG(HEIGHT))
Aggregation pushdown – After push down
Project (ALIAS(SUBSTR(EVENT_TIME)), ALIAS(COUNT(1)), ALIAS(AVG(HEIGHT)))
  LogicalRelation (Attrs: SUBSTR(EVENT_TIME), COUNT(1), AVG(HEIGHT))
    RelationForAggregation
      - The expression trees of GROUP BY & AGG
      - Filters to be applied before aggregation
SELECT SUBSTR(EVENT_TIME, 0, 4) AS YEAR, COUNT(1), AVG(HEIGHT)
FROM TBL_PEOPLE
WHERE JOB = 'ENGINEER'
GROUP BY YEAR
…
== Optimized Logical Plan ==
Project [substring(EVENT_TIME#24, 0, 4)#59 AS YEAR#34, count(1)#60L AS count(1)#57L, avg(HEIGHT#28)#61 AS avg(HEIGHT)#58]
+- Relation[substring(EVENT_TIME#24, 0, 4)#59,count(1)#60L,avg(HEIGHT#28)#61] R2RelationForAggregation(ArrayBuffer(substring(EVENT_TIME#24, 0, 4)#59),ArrayBuffer(substring(EVENT_TIME#24, 0, 4)),ArrayBuffer(count(1)#60L, avg(HEIGHT#28)#61),ArrayBuffer(count(1), avg(HEIGHT#28)),WrappedArray(IsNotNull(JOB), EqualTo(JOB,ENGINEER)))
…
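The plan rewrite walked through in the preceding slides can be mimicked with a toy plan tree. The sketch below is plain Python, not the Catalyst API and not the FlashBase rule. The `Plan` class and `push_down_aggregation` function are hypothetical, and the LogicalRelation wrapper is left out to keep the tree short.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Plan:
    name: str
    args: List[str]
    child: Optional["Plan"] = None


def push_down_aggregation(plan: Plan) -> Plan:
    """Rewrite Aggregate -> Project -> Filter -> Relation
    into Project -> RelationForAggregation.

    Any other plan shape is returned unchanged.
    """
    if plan.name != "Aggregate":
        return plan
    project = plan.child
    if project is None or project.name != "Project":
        return plan
    filt = project.child
    if filt is None or filt.name != "Filter":
        return plan
    relation = filt.child
    if relation is None or relation.name != "Relation":
        return plan
    # The new relation carries the GROUP BY / AGG expressions plus the filters
    # that must run before aggregation.
    pushed = Plan(
        "RelationForAggregation",
        list(plan.args) + ["filters: " + ", ".join(filt.args)],
    )
    # The remaining Project only renames the pre-aggregated columns.
    return Plan("Project", list(plan.args), pushed)


before = Plan(
    "Aggregate",
    ["SUBSTR(EVENT_TIME, 0, 4)", "COUNT(1)", "AVG(HEIGHT)"],
    Plan(
        "Project",
        ["EVENT_TIME", "HEIGHT"],
        Plan("Filter", ["JOB = 'ENGINEER'"], Plan("Relation", ["EVENT_TIME", "JOB", "HEIGHT"])),
    ),
)
after = push_down_aggregation(before)
print(after.name, "->", after.child.name)
```

A plan that does not match the expected shape falls through untouched, mirroring how an optimizer rule leaves non-matching plans alone.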
Vectorized Aggregation in Data Source
▪ When the query is executed, RelationForAggregation builds an Aggregation-RDD.
▪ The Aggregation-RDD sends an AGGREGATE command to FlashBase, with the GROUP BY, aggregation functions, and filters as arguments.
▪ Each FlashBase node executes vectorized aggregation using Intel's AVX-512.
[Diagram: RelationForAggregation (the expression trees of GROUP BY & AGG; filters to be applied before aggregation) → Aggregation RDD → AGGREGATE command → Data Source of FlashBase (vectorized aggregation via Intel's AVX-512)]
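The deck does not show how FlashBase implements its vectorized aggregation. As a rough illustration of the general idea only, the Python sketch below keeps one running partial per lane and reduces across lanes at the end, which is the usual shape of a SIMD aggregation loop. Pure Python executes this lane by lane, so it mirrors the data layout, not the speed; the lane count assumes 32-bit values in a 512-bit register.

```python
from typing import List, Sequence, Tuple

LANES = 16  # a 512-bit register holds sixteen 32-bit values


def lanewise_aggregate(values: Sequence[float]) -> Tuple[float, float, float, int]:
    """Return (min, max, sum, count) using LANES independent running partials."""
    if not values:
        raise ValueError("values must not be empty")
    mins: List[float] = [float("inf")] * LANES
    maxs: List[float] = [float("-inf")] * LANES
    sums: List[float] = [0.0] * LANES
    for start in range(0, len(values), LANES):
        chunk = values[start:start + LANES]
        # A SIMD instruction would update all lanes of the chunk at once.
        for lane, value in enumerate(chunk):
            mins[lane] = min(mins[lane], value)
            maxs[lane] = max(maxs[lane], value)
            sums[lane] += value
    # Horizontal reduction across lanes.
    return min(mins), max(maxs), sum(sums), len(values)


data = [float(v) for v in range(1, 101)]
result = lanewise_aggregate(data)
print(result)
```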
Acceleration Results – Training Data set and H/W
▪ Samples 2,164 cells from a total of 300K radio towers
▪ Covers the last 42 days, sampled every 10 seconds
▪ 8 KPI columns*, 725,726,100 rows in total
▪ 3 servers with Intel Xeon Gold 6240
* CQI, RSRP, RSRQ, DL_PRB_USAGE_RATE, SINR, UE_TX_POWER, PHR, UE_CONN_TOT_CNT
Acceleration Results - normalization
▪ MinMax-based scaling requires the min and max values from the training data set
▪ Standard-deviation-based scaling requires the mean and standard deviation
Operation | Aggregation Pushdown Off | Aggregation Pushdown On | Performance Gain
Min | 16.7 s | 2.0 s | 8.35x
Max | 16.2 s | 2.0 s | 8.1x
Mean | 16.0 s | 5.0 s | 3.2x
Queries:
Min: dataframe.agg(min("CQI").as("CQI"), min("RSRP").as("RSRP"), min("RSRQ").as("RSRQ"), …).first
Max: dataframe.agg(max("CQI").as("CQI"), max("RSRP").as("RSRP"), max("RSRQ").as("RSRQ"), …).first
Mean: dataframe.agg(avg("CQI").as("CQI"), avg("RSRP").as("RSRP"), avg("RSRQ").as("RSRQ"), …).first
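The aggregates measured here feed the two normalization methods named on this slide. A minimal Python sketch of both follows, using made-up RSRP values. The function names are illustrative, and the population standard deviation is an assumption, since the deck does not say which variant is used.

```python
import statistics
from typing import List, Sequence


def min_max_scale(values: Sequence[float], lo: float, hi: float) -> List[float]:
    """Map values to [0, 1] using a precomputed min (lo) and max (hi)."""
    if hi == lo:
        raise ValueError("min and max must differ")
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values: Sequence[float], mean: float, std: float) -> List[float]:
    """Center on a precomputed mean and divide by a precomputed standard deviation."""
    if std == 0:
        raise ValueError("standard deviation must be non-zero")
    return [(v - mean) / std for v in values]


rsrp = [-100.0, -90.0, -80.0]  # hypothetical sample values
scaled = min_max_scale(rsrp, min(rsrp), max(rsrp))
standardized = standard_scale(rsrp, statistics.mean(rsrp), statistics.pstdev(rsrp))
print(scaled, standardized)
```

In the pipeline described here, the min, max, and mean would come from the pushed-down aggregation queries rather than from Python built-ins.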
Acceleration Results – 5min window average
▪ Aggregation operations for the 5-minute window average
Operation | Aggregation Pushdown Off | Aggregation Pushdown On | Performance Gain
Training | 22.67 | 12 | 1.89x
Inference | 0.53 s | 0.27 s | 1.96x
Queries:
Training: dataframe.groupBy("EVENT_TIME", "UNIQ_ID").agg(avg("CQI").as("CQI"), avg("RSRP").as("RSRP"), avg("RSRQ").as("RSRQ"), …)
Inference: dataframe.where(filter).groupBy("UNIQ_ID", "EVENT_TIME").agg(avg("CQI"), avg("RSRP"), avg("RSRQ"), …)
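The 5-minute window average amounts to grouping samples by cell and window, then averaging each group. A small Python sketch of that grouping follows. It assumes epoch-second timestamps and buckets them explicitly, whereas the deck does not show how EVENT_TIME is mapped to windows; the sample data and names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW_SECONDS = 5 * 60


def window_average(samples: Iterable[Tuple[str, int, float]]) -> Dict[Tuple[str, int], float]:
    """Average values per (cell id, window start); timestamps are epoch seconds."""
    totals: Dict[Tuple[str, int], List[float]] = defaultdict(lambda: [0.0, 0.0])
    for cell_id, timestamp, value in samples:
        window_start = timestamp - timestamp % WINDOW_SECONDS
        bucket = totals[(cell_id, window_start)]
        bucket[0] += value  # running sum
        bucket[1] += 1      # running count
    return {key: total / count for key, (total, count) in totals.items()}


samples = [
    ("cell-a", 0, 10.0),
    ("cell-a", 10, 12.0),
    ("cell-a", 300, 8.0),
    ("cell-b", 20, 5.0),
]
averages = window_average(samples)
print(averages)
```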
Acceleration Factor Breakdown (min/max)
1. Spark pushes down aggregation to FlashBase, and FlashBase sends aggregated results to Spark
→ Reduces shuffle write size and Spark computation to 1/5
2. FlashBase accelerates aggregation with vector processing via Intel's AVX-512 (Intel Math Kernel Library)
→ 1.5 times faster aggregation
→ 8 times faster
[Diagram: Spark Job (Preprocess), Scan Cmd., Data Source API, Data Loader (File, HTTP, Kafka), Data Loading, DRAM Store (forked), Flash Store (customized), tiering. Steps: (1) Pushdown aggregation, (2) Accelerate aggregation via AVX-512, (3) Aggregated results]
Agenda
Jason Dai
▪ Unified Big Data + AI Software Architecture in SKT using Analytics Zoo
▪ Future Work (Project Zouwu)
Unified Big Data + AI Software Architecture in SKT using Analytics Zoo
AI on Big Data
Accelerating Data Analytics + AI Solutions At Scale
BigDL: Distributed, High-Performance Deep Learning Framework for Apache Spark*
https://github.com/intel-analytics/bigdl
Analytics Zoo: Unified Analytics + AI Platform for TensorFlow*, PyTorch*, Keras*, BigDL, Ray* and Apache Spark*
https://github.com/intel-analytics/analytics-zoo
Analytics Zoo: Unified Data Analytics and AI Platform
Models & Algorithms | Recommendation, Time Series, Computer Vision, NLP
Automated ML Workflow | AutoML for Time Series, Automatic Cluster Serving
Integrated Analytics & AI Pipelines | Distributed TensorFlow & PyTorch on Spark, Spark Dataframes & ML Pipelines for DL, RayOnSpark, InferenceModel
Compute Environment | Laptop, K8s Cluster, Hadoop Cluster, Spark Cluster
Libraries | Python Libraries (Numpy/Pandas/sklearn/…), DL Frameworks (TF/PyTorch/OpenVINO/…), Distributed Analytics (Spark/Flink/Ray/…)
Powered by oneAPI
https://github.com/intel-analytics/analytics-zoo
Distributed TensorFlow on Spark in Analytics Zoo
# Import paths as in Analytics Zoo 0.x; adjust to the installed version
from zoo.tfpark import TFDataset, TFOptimizer
from bigdl.optim.optimizer import Adam, MaxEpoch

# PySpark code
train_rdd = spark.hadoopFile(…).map(…)
dataset = TFDataset.from_rdd(train_rdd, …)

# TensorFlow code
import tensorflow as tf
from nets import lenet  # LeNet definition from the TF-Slim model library
slim = tf.contrib.slim
images, labels = dataset.tensors
with slim.arg_scope(lenet.lenet_arg_scope()):
    logits, end_points = lenet.lenet(images, …)
loss = tf.reduce_mean(
    tf.losses.sparse_softmax_cross_entropy(
        logits=logits, labels=labels))

# Distributed training on Spark
optimizer = TFOptimizer.from_loss(loss, Adam(…))
optimizer.optimize(end_trigger=MaxEpoch(5))
Write TensorFlow inline with Spark code
Analytics Zoo APIs: TFDataset, TFOptimizer
Spark Dataframe & ML Pipeline for DL
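# Import paths as in Analytics Zoo 0.x; adjust to the installed version
from zoo.pipeline.api.keras.models import Sequential
from zoo.pipeline.api.keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense
from zoo.pipeline.nnframes import NNEstimator
from bigdl.nn.criterion import CrossEntropyCriterion

# Spark DataFrame code
parquetfile = spark.read.parquet(…)
train_df = parquetfile.withColumn(…)

# Keras API
model = Sequential()
model.add(Convolution2D(32, 3, 3))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(10))

# Spark ML pipeline code
estimator = NNEstimator(model, CrossEntropyCriterion()) \
    .setMaxEpoch(5) \
    .setFeaturesCol("image")
nnModel = estimator.fit(train_df)
Analytics Zoo APIs: Sequential and its layers, NNEstimator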
*Other names and brands may be claimed as the property of others
Unified Big Data + AI Pipeline in SKT using Analytics Zoo
[Diagram: SQL Queries (Web, Jupyter), Spark-SQL, Preprocess, Data Source APIs, Data Loader, DRAM Store (forked), Flash Store (customized), tiering]
Performance Improvement by Analytics Zoo
TCO optimized AI performance with [1] Analytics Zoo, [2] Intel Optimized TensorFlow, [3] Distributed AI Processing
Test Data: 80K cell towers, 8 days, 5 min period, 8 quality indicators
[1] Pre-processing & Inference Latency (seconds)
Configuration | Latency
Python Preprocessing (Pandas) & Inference on GPU | 74.26
Python Distributed Preprocessing (DASK) & Inference on GPU | 10.24
Intel Analytics Zoo, 1 Server (Xeon 6240) | 3.24
Intel Analytics Zoo, 3 Servers (Xeon 6240) | 1.61
Chart annotations: 3X, 6X
[2] Time-To-Training Performance
[Chart: y-axis 0 to 1,800; x-axis batch size (BS 4,096, BS 8,192, BS 16,384, BS 32,768, BS 65,536); series: Intel Analytics Zoo - 1 Server (Xeon 6240), GPU Vendor, Intel Analytics Zoo - 3 Servers (distributed training scalability case, Xeon 6240)]
Performance test validation @ SK Telecom Testbed
All performance testing and validation results were provided by SK Telecom Testbed. Intel does not control or audit third-party data.
You should consult other sources to evaluate accuracy. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
Future Work (Project Zouwu)
Project Zouwu: Analytics Zoo Time Series for Telco
Project Zouwu: https://github.com/intel-analytics/analytics-zoo/tree/master/pyzoo/zoo/zouwu
▪ Use case: reference time series use cases for Telco (such as network traffic forecasting, etc.)
▪ Models: built-in models for time series analysis (such as LSTM and MTNet)
▪ "AutoTS": AutoML support for building E2E time series analysis pipelines (including automatic feature generation, model selection and hyperparameter tuning)
[Diagram: Project Zouwu (use-case, model, autots) on top of Built-in Models, ML Workflow, AutoML Workflow, and Integrated Analytics & AI Pipelines]
Project Zouwu: Analytics Zoo Time Series for Telco
▪ Built for common Telco use cases
• Time series analysis
• Network KPI forecast
• Anomaly detection
• AIOps
▪ Optimized and Scalable solutions on Xeon
• Integrated Intel optimized libraries on Xeon (TF, PyTorch, OpenVINO, MKL-DNN, etc.)
• Scaling out TensorFlow/PyTorch/OpenVINO models across clusters
• AutoML for building end-to-end AI pipelines automatically
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Training on Apache Spark in SK Telecom

More Related Content

What's hot

Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Databricks
 
Pandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkPandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkLi Jin
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
 
Superworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and FugueSuperworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and FugueDatabricks
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkDatabricks
 
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSABuilding the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSADatabricks
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsDatabricks
 
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Khai Tran
 
Lessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics PlatformLessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics PlatformDatabricks
 
How Spark Fits into Baidu's Scale-(James Peng, Baidu)
How Spark Fits into Baidu's Scale-(James Peng, Baidu)How Spark Fits into Baidu's Scale-(James Peng, Baidu)
How Spark Fits into Baidu's Scale-(James Peng, Baidu)Spark Summit
 
Keeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkKeeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkDatabricks
 
Harnessing Spark Catalyst for Custom Data Payloads
Harnessing Spark Catalyst for Custom Data PayloadsHarnessing Spark Catalyst for Custom Data Payloads
Harnessing Spark Catalyst for Custom Data PayloadsSimeon Fitch
 
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Databricks
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaDatabricks
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopDataWorks Summit
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark Summit
 
Operationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleOperationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleDatabricks
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Wee Hyong Tok
 
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn SchedulerCloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn SchedulerDatabricks
 
Geosp.AI.tial: Applying Big Data and Machine Learning to Solve the World's To...
Geosp.AI.tial: Applying Big Data and Machine Learning to Solve the World's To...Geosp.AI.tial: Applying Big Data and Machine Learning to Solve the World's To...
Geosp.AI.tial: Applying Big Data and Machine Learning to Solve the World's To...Databricks
 

What's hot (20)

Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
Pandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkPandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySpark
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
Superworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and FugueSuperworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and Fugue
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSABuilding the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
 
Lessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics PlatformLessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics Platform
 
How Spark Fits into Baidu's Scale-(James Peng, Baidu)
How Spark Fits into Baidu's Scale-(James Peng, Baidu)How Spark Fits into Baidu's Scale-(James Peng, Baidu)
How Spark Fits into Baidu's Scale-(James Peng, Baidu)
 
Keeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkKeeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache Spark
 
Harnessing Spark Catalyst for Custom Data Payloads
Harnessing Spark Catalyst for Custom Data PayloadsHarnessing Spark Catalyst for Custom Data Payloads
Harnessing Spark Catalyst for Custom Data Payloads
 
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on Hadoop
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
 
Operationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleOperationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At Scale
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
 
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn SchedulerCloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
 
Geosp.AI.tial: Applying Big Data and Machine Learning to Solve the World's To...
Geosp.AI.tial: Applying Big Data and Machine Learning to Solve the World's To...Geosp.AI.tial: Applying Big Data and Machine Learning to Solve the World's To...
Geosp.AI.tial: Applying Big Data and Machine Learning to Solve the World's To...
 

Similar to Vectorized Deep Learning Acceleration from Preprocessing to Inference and Training on Apache Spark in SK Telecom

Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...Databricks
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun JeongSpark Summit
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsYousun Jeong
 
The Apache Spark config behind the indsutry's first 100TB Spark SQL benchmark
The Apache Spark config behind the indsutry's first 100TB Spark SQL benchmarkThe Apache Spark config behind the indsutry's first 100TB Spark SQL benchmark
The Apache Spark config behind the indsutry's first 100TB Spark SQL benchmarkLenovo Data Center
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeongYousun Jeong
 
Strata + Hadoop 2015 Slides
Strata + Hadoop 2015 SlidesStrata + Hadoop 2015 Slides
Strata + Hadoop 2015 SlidesJun Liu
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks
 
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Databricks
 
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Training on Apache Spark in SK Telecom

  • 1.
  • 2. Vectorized Deep Learning Acceleration from Preprocessing to Inference and Training on Apache Spark in SK Telecom. Hongchan Roh (Team Leader, SK Telecom); Jason Dai (Senior Principal Engineer, Intel Corporation)
  • 3. Agenda. Hongchan Roh: ▪ Demo for Network Quality Analysis and vectorized aggregation acceleration ▪ Network Quality Analysis and Prediction in SK Telecom wrap-up (SAIS 2019 summary) ▪ Vectorized preprocessing acceleration via Aggregation Push Down from Apache Spark. Jason Dai: ▪ Unified Big Data + AI Software Architecture in SKT using Analytics Zoo ▪ Future Work (Project Zouwu)
  • 4. Demo for Network Quality Analysis and Vectorized aggregation acceleration
  • 5. Network Quality Analysis demo (https://youtu.be/Wip_b7hUb7w): analyze and visualize network quality indicators, e.g. CQI, RSRP, RSRQ, SINR, and more. Data size: 0.1 billion records. COVID-19 route analysis with population demo (https://youtu.be/nuRBxXS2Fms): analyze and visualize floating population and COVID-19 confirmed routes. Data size: 0.2 billion records (population), 10,000 records (COVID-19).
  • 6. Network Quality Analysis and Prediction in SK Telecom Wrap-up (SAIS 2019 summary)
  • 7. Network Quality Analysis. SK Telecom: the largest telecommunications provider in South Korea • 27 million subscribers • 300,000 cells. Target data: radio cell tower logs • time-series data with timestamp tags, generated every 10 seconds • geospatial data with geographical coordinates (latitude and longitude) derived from the cell tower's location
  • 8. Network Quality Analysis. Data ingestion requirements • Ingestion: 1.4 million records/sec (500-byte records, 200~500 columns) • 120 billion records/day, 60 TB/day • Data retention period: 7~30 days. Query requirements • Web dashboard and ad-hoc queries: response within 3 sec for queries with specific time and region (cell) predicates, across multiple sessions • Daily batch queries: response within hours for long query pipelines with heavy operations such as joins and aggregations
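The daily figures on this slide follow from the per-second ingestion rate. A quick arithmetic check, as a sketch that assumes a uniform 500-byte record and a constant rate over 24 hours (the variable names are illustrative, not from the deck):

```python
# Back-of-the-envelope check of the ingestion figures quoted on the slide.
RECORDS_PER_SEC = 1_400_000      # from the slide
RECORD_BYTES = 500               # from the slide (approximate record size)
SECONDS_PER_DAY = 24 * 60 * 60

records_per_day = RECORDS_PER_SEC * SECONDS_PER_DAY
bytes_per_day = records_per_day * RECORD_BYTES

print(f"{records_per_day / 1e9:.1f} billion records/day")
print(f"{bytes_per_day / 1e12:.1f} TB/day")
```

Both results land close to the rounded "120 billion records/day" and "60 TB/day" stated on the slide.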
  • 9. New In-memory Datastore for Spark (FlashBase) • Tried to design a new data store for Spark (FlashBase) to support many more partitions. Legacy architecture: Spark on HDFS, running on Linux OS with CPU/DRAM, HDD storage, and an SSD cache. New architecture: SQL queries (Web, Jupyter) via JDBC to Spark-SQL; data loading from File, HTTP, and Kafka through the Data Loader and Data Source APIs; a forked DRAM store and a customized flash store with tiering. • Best open source candidates (2016) to assemble a new in-memory (DRAM/SSD) datastore for Spark: one for the query engine, one for the DRAM key-value store, and one for the SSD key-value store • SSDs as main storage devices for small-sized parallel I/O with short latencies
  • 10. Ingestion Performance and other features. Node spec: E5 2680 v4 CPU, 256GB DRAM, 20TB SSDs. Features and details: Ingestion performance: 500,000 records/sec/node; In-memory datastore: DRAM only, or DRAM to SSD tiering; Massively parallel processing: 100 redis processes per node; Extreme partitioning: up to 2 billion partitions per node; Filter acceleration: using fine-grained partitions and pushed-down filters; Column-store: column-store by default (row-store option); Column value transformation: defined by Java grammar in schema tbl properties; Compression: Gzip-level compression ratio with LZ4-level speed; Vector processing: filter and aggregation acceleration (SIMD, AVX); Etc.: recovery, replication, scale-out
  • 11. Extreme Partitioning feature. A partition combination for network quality analysis • 300K (cell towers) × 100K (time slots) = 30B (total partitions). Time partitions: 201804221100, 201804221105, 201804221110, …, 201904221105, 201904221110. Cell tower partitions: wvcv3, wvcyw, wvfj6, …, wyfb1, wyf9w. Node spec: E5 2680 v4 CPU, 256GB DRAM, 20TB SSDs. FlashBase • Up to 2 billion partitions in a single node • Needs 15 nodes to store 30 billion partitions. Oracle • Up to 1 million partitions in a single cluster. ✓ 2000 times reduction in computation and I/O for specific time and region queries!
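The node count on this slide can be reproduced from the partition numbers. A minimal sketch, assuming one partition per (cell tower, time slot) pair and no replication overhead; the constant names are illustrative:

```python
import math

# Figures quoted on the slide.
CELL_TOWERS = 300_000
TIME_SLOTS = 100_000
MAX_PARTITIONS_PER_NODE = 2_000_000_000

# One partition per (cell tower, time slot) combination.
total_partitions = CELL_TOWERS * TIME_SLOTS

# Minimum number of nodes needed to hold every partition.
nodes_needed = math.ceil(total_partitions / MAX_PARTITIONS_PER_NODE)

print(total_partitions, nodes_needed)
```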
  • 12. Network Quality Analysis Example. Network quality analysis query for one day and a single cell tower • 0.142 trillion (142 billion) records in the ue_rf_sum (user_equipment_radio_frequency_summary) table (7-day data, 42TB) • 14,829 satisfying records. Query: select * from ue_rf_sum where event_time between '201910070000' and '201910080000' and cell_tower_id = 'snjJlAF5W' and rsrp < -85; Spark with FlashBase: less than 1 sec; partition filtering 1/(10080 * 30000) with time and cell tower; FlashBase cluster: 16 nodes (E5 2680 v4 CPU, 256GB DRAM, 20TB SSDs). Spark with HDFS: half an hour; partition filtering 1/10080 with time; HDFS cluster: 20 nodes (E5 2650 v4 CPU, 256GB DRAM, 24TB HDDs).
  • 13. Introduction of Network Quality Prediction • Predict network quality indicators (CQI, RSRP, RSRQ, SINR, …) for anomaly detection and real-time management • Goal: unify geospatial visualization and network prediction on Spark. CQI: Channel Quality Indicator; RSRP: Reference Signal Received Power; RSRQ: Reference Signal Received Quality; SINR: Signal to Interference Noise Ratio
  • 14. Memory augmented model. Inputs: Current: the most recent 50 min of data with a 5 min period. Memory: the previous 7 days of historical data (memory1 … memory7, 1-week data), each covering the same time band as current and target. Target: network quality after 5 min. Diagram: Encoder1 encodes each memory and Encoder2 encodes current; an attention layer, Concat, and an FCNN follow, producing the final prediction $\hat{y}_{t+1}$. • Encoder: 1-NN (autoregressive term). Encoder1: $h_t = c + w_1 y_{t-\Delta-1} + \dots + w_{11} y_{t-\Delta-11}$, where $\Delta$ is the memory's offset from the current time. Encoder2: $h'_t = c' + w'_1 y_{t-1} + \dots + w'_{10} y_{t-10}$.
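The autoregressive encoder on this slide is a single linear unit: a bias plus a weighted sum of lagged observations. A minimal sketch in plain Python follows; the function name `ar_encode`, the sample values, and the uniform weights are hypothetical and only illustrate the formula, not the trained model in the authors' repository.

```python
def ar_encode(window, weights, bias):
    """Autoregressive term: bias + sum_k weights[k] * window[k]."""
    if len(window) != len(weights):
        raise ValueError("window and weights must have the same length")
    return bias + sum(w * y for w, y in zip(weights, window))

# Hypothetical input: 10 samples (50 min at a 5 min period), most recent first.
current_window = [float(v) for v in range(1, 11)]
# Hypothetical weights; a real model would learn these during training.
uniform_weights = [0.1] * 10

h_current = ar_encode(current_window, uniform_weights, bias=0.5)
print(h_current)
```

The memory encoder has the same form, applied to an 11-sample window taken from each historical day.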
  • 15. Memory augmented model - Test result. Actual vs. forecast plots for the memory model (Mem-model) and Seq2Seq. Error: MAE; Score: Error*100. Mem-model: improved predictions for sudden changes!
  • 16. Training & Inference Architecture. Build an in-memory pipeline between FlashBase and Intel Analytics Zoo. (1) The data layer and the inferencing & training layer are integrated into the same Spark cluster and share the same Spark session. (2) Intel Analytics Zoo: used to unify the TF model into the Spark pipeline seamlessly. (3) Intel BigDL: inference & training engine; inferencing and training can be distributed across the Spark cluster. Source code: https://github.com/mnms/ARMemNet-BigDL. Diagram: within one Spark cluster, Data (Spark-SQL; data loading from File, HTTP, and Kafka through the Data Loader and Data Source APIs; forked DRAM store and customized flash store with tiering) → Preprocess → RDD of Tensor → DL Training & Inferencing with the Model (model code of TF), with SIMD acceleration.
  • 17. Vectorized preprocessing acceleration via Aggregation Push Down from Apache Spark
  • 18. What is push down?
  • 19. Apache Spark pushes down filters and projection ▪ Apache Spark pushes down only filters and projections to the data source ▪ The data source must transfer the entire data to Spark via socket ▪ Aggregation operations are performed entirely in Spark ▪ Spark data source filter push down: And, Or, Not, Like, Limit; EQ, GT, GTE, LT, LTE, IN, IsNULL, IsNotNULL, EqualNullSafe
  • 20. What if aggregation can be done in the data source? ▪ Network I/O between the data source and Spark can be largely reduced: the data source transfers only the aggregated results. ▪ Vectorized aggregation in the data source: the data source accelerates aggregation operations (MIN, MAX, SUM, COUNT) using the Intel SIMD API.
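To make the I/O argument concrete, the sketch below contrasts the two strategies on a toy in-memory "data source". Everything here is hypothetical (the rows, the function names, and the count-and-average-by-year query); it does not use Spark or FlashBase and only shows how many rows would cross the wire in each case.

```python
from collections import defaultdict

# Hypothetical rows held by the data source: (event_time, job, height).
ROWS = [
    ("20190101", "ENGINEER", 170.0),
    ("20190315", "ENGINEER", 180.0),
    ("20200101", "ENGINEER", 175.0),
    ("20200202", "TEACHER", 165.0),
]

def scan_with_filter(rows, job):
    """Filter push-down only: every matching row is sent to the engine."""
    return [row for row in rows if row[1] == job]

def scan_with_aggregation(rows, job):
    """Aggregation push-down: one (count, average) result per year is sent."""
    groups = defaultdict(lambda: [0, 0.0])  # year -> [count, sum of heights]
    for event_time, row_job, height in rows:
        if row_job == job:
            acc = groups[event_time[:4]]
            acc[0] += 1
            acc[1] += height
    return {year: (n, total / n) for year, (n, total) in groups.items()}

transferred_rows = scan_with_filter(ROWS, "ENGINEER")
aggregated = scan_with_aggregation(ROWS, "ENGINEER")
print(len(transferred_rows), "rows vs", len(aggregated), "groups")
```

With filter push-down alone the transfer grows with the number of matching rows; with aggregation push-down it grows only with the number of groups.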
  • 21. Aggregation pushdown – Architecture ▪ The FlashBase DataSource extends the Catalyst optimization rules ▪ Aggregation is pushed down without customizing the Spark source ▪ Define a custom optimization rule and add it to the set of custom optimization rules in Spark: sqlContext.experimental.extraOptimizations ++= Seq(PropagateAggregationRule) ▪ The custom rule is executed after Catalyst's own optimization rules finish: PropagateAggregationRule runs after the optimization phase and pushes aggregations down to the DataSource
  • 22. Aggregation pushdown – Before push down. Query: SELECT SUBSTR(EVENT_TIME, 0, 4) AS YEAR, COUNT(1), AVG(HEIGHT) FROM TBL_PEOPLE WHERE JOB = 'ENGINEER' GROUP BY YEAR. Logical plan, top to bottom: Aggregate → Project (EVENT_TIME, HEIGHT) → Filter (JOB = 'ENGINEER') → LogicalRelation (Attrs: EVENT_TIME, JOB, HEIGHT) → Relation. The “Relation” plan builds an RDD of the data in a data source. The LogicalRelation defines the attributes in a data source. The RDD's data are filtered by the “Filter” plan. The selected columns are pruned by the “Project” plan. The pruned and filtered data are finally aggregated by the “Aggregate” plan.
  • 23. Aggregation pushdown – Push down GB & AGG. Plan, top to bottom: Aggregate → Project (EVENT_TIME, HEIGHT) → Filter (JOB = 'ENGINEER') → LogicalRelation (Attrs: EVENT_TIME, JOB, HEIGHT) → RelationForAggregation. aggregateExpressions (expression trees of GROUP BY & AGGREGATE): ALIAS → SUBSTRING → Attribute (EVENT_TIME); ALIAS → COUNT Function → Literal(1); ALIAS → AVG Function → Attribute (HEIGHT). - Create a RelationForAggregation and replace the original Relation with it - Push down the expression trees of the GROUP BY and AGGREGATE functions to RelationForAggregation.
  • 24. Aggregation pushdown – Build Aggregated Project Aggregate Project (EVENT_TIME, HEIGHT) Filter (JOB = 'ENGINEER') LogicalRelation (Attrs : EVENT_TIME, JOB, HEIGHT) ALIAS COUNT Function Literal(1) ALIAS SUBSTRING Attribute (EVENT_TIME) ALIAS AVG Function Attribute (HEIGHT) aggregateExpressions : Expression Trees of GROUP BY & AGGREGATE Transform trees to Attributes and wrap them with a Project plan Project ALIAS Attribute (SUBSTR(EVENT_TIME)) ALIAS Attribute (COUNT(1)) ALIAS Attribute (AVG(HEIGHT)) <Tree of GROUP BY> <Trees of Aggregation Functions> RelationForAggregation The expression trees of GROUP BY & AGG
  • 25. Aggregation pushdown – After push down Project (ALIAS(SUBSTR(EVENT_TIME)), ALIAS(COUNT(1)), ALIAS(AVG(HEIGHT))) LogicalRelation (Attrs : SUBSTR(EVENT_TIME), COUNT(1), AVG(HEIGHT)) RelationForAggregation The expression trees of GROUP BY & AGG Filters to be applied before aggregation SELECT SUBSTR(EVENT_TIME, 0, 4) AS YEAR, COUNT(1), AVG(HEIGHT) FROM TBL_PEOPLE WHERE JOB = 'ENGINEER' GROUP BY YEAR … == Optimized Logical Plan == Project [substring(EVENT_TIME#24, 0, 4)#59 AS YEAR#34, count(1)#60L AS count(1)#57L, avg(HEIGHT#28)#61 AS avg(HEIGHT)#58] +- Relation[substring(EVENT_TIME#24, 0, 4)#59,count(1)#60L,avg(HEIGHT#28)#61] R2RelationForAggregation(ArrayBuffer(substring(EVENT_TIME#24, 0, 4)#59),ArrayBuffer(substring(EVENT_TIME#24, 0, 4)),ArrayBuffer(count(1)#60L, avg(HEIGHT#28)#61),ArrayBuffer(count(1), avg(HEIGHT#28)),WrappedArray(IsNotNull(JOB), EqualTo(JOB,ENGINEER))) …
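The plan rewrite walked through on slides 22 to 25 can be sketched as a toy tree transformation. This is not Catalyst and not the FlashBase rule: the dataclasses and `push_down_aggregation` below are hypothetical stand-ins that only mirror the shape of the rewrite, Aggregate over Project over Filter over Relation becoming Project over RelationForAggregation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Relation:
    attrs: List[str]


@dataclass
class Filter:
    condition: str
    child: object


@dataclass
class Project:
    columns: List[str]
    child: object


@dataclass
class Aggregate:
    group_by: List[str]
    functions: List[str]
    child: object


@dataclass
class RelationForAggregation:
    group_by: List[str]
    functions: List[str]
    filters: List[str]


def push_down_aggregation(plan):
    """Rewrite Aggregate(Project(Filter(Relation))) to Project(RelationForAggregation).

    Any plan that does not match this exact shape is returned unchanged.
    """
    if not isinstance(plan, Aggregate):
        return plan
    project = plan.child
    if not isinstance(project, Project):
        return plan
    flt = project.child
    if not (isinstance(flt, Filter) and isinstance(flt.child, Relation)):
        return plan
    # The GROUP BY and aggregate expressions plus the filter move into the relation.
    pushed = RelationForAggregation(plan.group_by, plan.functions, [flt.condition])
    # The outer Project only exposes the already aggregated columns.
    return Project(plan.group_by + plan.functions, pushed)


before = Aggregate(
    ["SUBSTR(EVENT_TIME, 0, 4)"],
    ["COUNT(1)", "AVG(HEIGHT)"],
    Project(
        ["EVENT_TIME", "HEIGHT"],
        Filter("JOB = 'ENGINEER'", Relation(["EVENT_TIME", "JOB", "HEIGHT"])),
    ),
)
after = push_down_aggregation(before)
```

The real rule operates on Catalyst expression trees and attribute references; the sketch keeps expressions as plain strings to show only the structural change.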
  • 26. Vectorized Aggregation in Data Source ▪ When a query is executed, RelationForAggregation builds an Aggregation-RDD. ▪ The Aggregation-RDD sends an AGGREGATE command to FlashBase with the "GROUP BY & AGG Functions & Filter" as arguments. ▪ Each FlashBase node executes vectorized aggregation using Intel's AVX-512. RelationForAggregation The expression trees of GROUP BY & AGG Filters to be applied before aggregation Aggregation RDD Data Source of FlashBase Vectorized aggregation via Intel's AVX-512 AGGREGATE command
  • 27. Acceleration Results – Training Data set and H/W ▪ Samples 2,164 cells from a total of 300K radio towers ▪ Includes the last 42 days, every 10 seconds ▪ With 8 KPI columns (CQI, RSRP, RSRQ, DL_PRB_USAGE_RATE, SINR, UE_TX_POWER, PHR, UE_CONN_TOT_CNT), 725,726,100 rows in total ▪ 3 servers with Intel Xeon Gold 6240
  • 28. Acceleration Results - normalization
    ▪ MinMax based scaling requires the min and max values from the training data set
    ▪ Standard deviation based scaling requires the mean and standard deviation
    ▪ Operation | Aggregation Pushdown Off | Aggregation Pushdown On | Performance Gain | Query
    ▪ Min | 16.7 s | 2.0 s | 8.35 x | dataframe.agg( min("CQI").as("CQI"), min("RSRP").as("RSRP"), min("RSRQ").as("RSRQ"), … ).first
    ▪ Max | 16.2 s | 2.0 s | 8.1 x | dataframe.agg( max("CQI").as("CQI"), max("RSRP").as("RSRP"), max("RSRQ").as("RSRQ"), … ).first
    ▪ Mean | 16.0 s | 5.0 s | 3.2 x | dataframe.agg( avg("CQI").as("CQI"), avg("RSRP").as("RSRP"), avg("RSRQ").as("RSRQ"), … ).first
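As a reminder of why these aggregates are needed before training, the two scalings named on the slide can be sketched in plain Python. The sample CQI values and function names are hypothetical, and the use of the population standard deviation is an assumption; in the deck the min, max and mean come from the pushed-down `dataframe.agg(...)` queries.

```python
import math


def min_max_scale(values, lo, hi):
    """MinMax scaling: map [lo, hi] onto [0, 1] using precomputed min and max."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def standard_scale(values, mean, std):
    """Standard scaling: subtract the mean and divide by the standard deviation."""
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical CQI samples, for illustration only.
cqi = [3.0, 7.0, 11.0, 15.0]

# These statistics are what the aggregation queries would return.
lo, hi = min(cqi), max(cqi)
mean = sum(cqi) / len(cqi)
std = math.sqrt(sum((v - mean) ** 2 for v in cqi) / len(cqi))  # population std (assumption)

scaled_min_max = min_max_scale(cqi, lo, hi)
scaled_standard = standard_scale(cqi, mean, std)
```

Both scalings need a full pass over the training set to obtain the statistics, which is the pass the pushdown accelerates.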
  • 29. Acceleration Results – 5min window average
    ▪ Aggregation operations for 5 min window aggregation
    ▪ Operation | Aggregation Pushdown Off | Aggregation Pushdown On | Performance Gain | Query
    ▪ Training | 22.67 | 12 | 1.89 x | dataframe.groupBy("EVENT_TIME", "UNIQ_ID").agg( avg("CQI").as("CQI"), avg("RSRP").as("RSRP"), avg("RSRQ").as("RSRQ"), … )
    ▪ Inference | 0.53 s | 0.27 s | 1.96 x | dataframe.where(filter).groupBy("UNIQ_ID", "EVENT_TIME").agg( avg("CQI"), avg("RSRP"), avg("RSRQ"), … )
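A 5 minute window average per cell can be sketched without Spark as follows. This is an illustration only: it assumes timestamps are epoch seconds and buckets them explicitly, whereas the slide's queries group by an `EVENT_TIME` column whose bucketing is not shown. The record layout and function names are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # 5 minutes


def window_start(timestamp):
    """Round an epoch-second timestamp down to the start of its 5 minute window."""
    return timestamp - timestamp % WINDOW_SECONDS


def five_min_average(records):
    """Average a KPI per (uniq_id, window); records are (uniq_id, timestamp, value)."""
    sums = defaultdict(lambda: [0.0, 0])
    for uniq_id, timestamp, value in records:
        key = (uniq_id, window_start(timestamp))
        sums[key][0] += value
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Hypothetical 10 second samples for two cells.
records = [
    ("cell-a", 0, 10.0),
    ("cell-a", 10, 12.0),
    ("cell-a", 300, 8.0),
    ("cell-b", 290, 6.0),
]
averages = five_min_average(records)
```

With 10 second samples, each window collapses up to 30 rows into one, so pushing this grouping into the data source reduces what Spark has to receive.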
  • 30. Acceleration Factor Breakdown (min/max)
    ▪ Diagram components: Data Loader (File, HTTP, Kafka), Data Loading, Data source API, DRAM Store, Flash Store, tiering (labels: forked, customized), Spark Job, Scan Cmd., Preprocess
    ▪ Diagram steps: (1) Pushdown aggregation (2) Accelerate aggregation via AVX-512 (3) Aggregated results
    ▪ 1. Spark pushes down aggregation to FlashBase; FlashBase sends aggregated results to Spark → Reduces shuffle write size and Spark computation to 1/5
    ▪ 2. FlashBase accelerates aggregation with vector processing via Intel's AVX-512 (Intel Math Kernel Library) → 1.5 times faster aggregation
    ▪ Overall → 8 times faster
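The gains reported on slide 28 can be rechecked from the timings, and the two factors on this slide can be combined as a rough sanity check. Treating the 1/5 reduction and the 1.5 times speedup as multiplicative is an assumption made here for illustration; the slide does not state how the factors compose.

```python
# Timings in seconds from slide 28 (pushdown off, pushdown on).
min_gain = 16.7 / 2.0
max_gain = 16.2 / 2.0
mean_gain = 16.0 / 5.0

# Assumption: the two factors on this slide multiply.
pushdown_factor = 5.0        # shuffle write size and Spark computation reduced to 1/5
vectorization_factor = 1.5   # AVX-512 aggregation speedup
combined_factor = pushdown_factor * vectorization_factor
```

Under that assumption the combined factor is 7.5, in line with the roughly 8 times gain measured for the min and max queries.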
  • 31. Agenda Jason Dai ▪ Unified Big Data + AI Software Architecture in SKT using Analytics Zoo ▪ Future Work (Project Zouwu)
  • 32. Unified Big Data + AI Software Architecture in SKT using Analytics Zoo
  • 33. AI on Big Data: Accelerating Data Analytics + AI Solutions At Scale ▪ BigDL: Distributed, High-Performance Deep Learning Framework for Apache Spark* https://github.com/intel-analytics/bigdl ▪ Analytics Zoo: Unified Analytics + AI Platform for TensorFlow*, PyTorch*, Keras*, BigDL, Ray* and Apache Spark* https://github.com/intel-analytics/analytics-zoo
  • 34. Analytics Zoo: Unified Data Analytics and AI Platform https://github.com/intel-analytics/analytics-zoo
    ▪ Models & Algorithms: Recommendation, Time Series, Computer Vision, NLP
    ▪ Automated ML Workflow: AutoML for Time Series, Automatic Cluster Serving
    ▪ Integrated Analytics & AI Pipelines: Distributed TensorFlow & PyTorch on Spark, Spark Dataframes & ML Pipelines for DL, RayOnSpark, InferenceModel
    ▪ Compute Environment: Laptop, K8s Cluster, Hadoop Cluster, Spark Cluster
    ▪ Python Libraries (Numpy/Pandas/sklearn/…), DL Frameworks (TF/PyTorch/OpenVINO/…), Distributed Analytics (Spark/Flink/Ray/…)
    ▪ Powered by oneAPI
  • 35. Distributed TensorFlow on Spark in Analytics Zoo
    ▪ Write TensorFlow inline with Spark code (Analytics Zoo API in blue on the slide)
        #pyspark code
        train_rdd = spark.hadoopFile(…).map(…)
        dataset = TFDataset.from_rdd(train_rdd,…)

        #tensorflow code
        import tensorflow as tf
        slim = tf.contrib.slim
        images, labels = dataset.tensors
        with slim.arg_scope(lenet.lenet_arg_scope()):
            logits, end_points = lenet.lenet(images, …)
        loss = tf.reduce_mean(
            tf.losses.sparse_softmax_cross_entropy(
                logits=logits, labels=labels))

        #distributed training on Spark
        optimizer = TFOptimizer.from_loss(loss, Adam(…))
        optimizer.optimize(end_trigger=MaxEpoch(5))
  • 36. Spark Dataframe & ML Pipeline for DL (Analytics Zoo API in blue on the slide)
        #Spark dataframe code
        parquetfile = spark.read.parquet(…)
        train_df = parquetfile.withColumn(…)

        #Keras API
        model = (Sequential()
                 .add(Convolution2D(32, 3, 3))
                 .add(MaxPooling2D(pool_size=(2, 2)))
                 .add(Flatten())
                 .add(Dense(10)))

        #Spark ML pipeline code
        estimator = (NNEstimator(model, CrossEntropyCriterion())
                     .setMaxEpoch(5)
                     .setFeaturesCol("image"))
        nnModel = estimator.fit(train_df)
  • 37. Unified Big Data + AI Pipeline in SKT using Analytics Zoo
    ▪ Diagram components: Data Loader, DRAM Store, Flash Store, tiering (labels: forked, customized), Data Source APIs, Spark-SQL, Preprocess, SQL Queries (Web, Jupyter)
    ▪ *Other names and brands may be claimed as the property of others
  • 38. Performance Improvement by Analytics Zoo
    ▪ TCO optimized AI performance with [ 1 ] Analytics Zoo [ 2 ] Intel Optimized Tensorflow [ 3 ] Distributed AI Processing
    ▪ Test Data: 80K Cell Tower, 8 days, 5mins period, 8 Quality Indicator
    ▪ [ 1 ] Pre-processing & Inference Latency (chart). Configurations: Python Preprocessing (Pandas) & Inference on GPU; Python Distributed Preprocessing (DASK) & Inference on GPU; Intel Analytics Zoo 1 Server Xeon 6240; Intel Analytics Zoo 3 Servers Xeon 6240. Values shown: 74.26, 10.24, 3.24, 1.61. Annotated speedups: 3X, 6X
    ▪ [ 2 ] Time-To-Training Performance (chart, axis in seconds, batch sizes BS 4,096 to BS 65,536). Series: Intel Analytics Zoo - 1 Server (Xeon 6240); GPU Vendor; Intel Analytics Zoo - 3 Servers, Distributed Training - Scalability case (Xeon 6240)
    ▪ Performance test validation @ SK Telecom Testbed. All performance testing and validation results were provided by SK Telecom Testbed. Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
  • 40. Project Zouwu: Analytics Zoo Time Series for Telco
    ▪ Project Zouwu: https://github.com/intel-analytics/analytics-zoo/tree/master/pyzoo/zoo/zouwu
    ▪ Use case: reference time series use cases for Telco (such as network traffic forecasting, etc.)
    ▪ Models: built-in models for time series analysis (such as LSTM and MTNet)
    ▪ “AutoTS”: AutoML support for building E2E time series analysis pipelines (including automatic feature generation, model selection and hyperparameter tuning)
    ▪ Diagram labels: Project Zouwu (use-case, model, autots), Built-in Models, ML Workflow, AutoML Workflow, Integrated Analytics & AI Pipelines
  • 41. Project Zouwu: Analytics Zoo Time Series for Telco ▪ Built for common Telco use cases • Time series analysis • Network KPI forecast • Anomaly detection • AIOps ▪ Optimized and Scalable solutions on Xeon • Integrated Intel optimized libraries on Xeon (TF, PyTorch, OpenVINO, MKL-DNN, etc.) • Scaling out TensorFlow/PyTorch/OpenVINO models across clusters • AutoML for building end-to-end AI pipelines automatically