#EUai9
Marcin Kulka and Michał Kaczmarczyk
9LivesData
Oct/26/2017
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark
Who are we?
• Marcin Kulka – Senior Software Engineer
• Michał Kaczmarczyk (Ph.D.) – Software Architect, Team Leader and Project Manager
2
Who are we?
• Advanced software R&D company (Warsaw, Poland)
• 75+ scientists and software engineers
• Specializing in scalable storage, distributed and big data systems
• Cooperating with partners all around the world
3
4
• Masato Asahara (Ph.D.) - Researcher, NEC Data Science Research Laboratory
• Ryohei Fujimaki (Ph.D.) - Research Fellow, NEC Data Science Research Laboratory
5
Agenda
• Typical use case for a predictive modeling problem
• Our technology - Automatic Predictive Modeling
• Design challenges
• Evaluation results
• Our observations
6
Motivation
7
Predictive analysis in industry and business
8
Driver risk assessment
Inventory optimization
Churn retention
Predictive maintenance
Product price optimization
Sales optimization
Energy/water operation mgmt
... but Predictive Modeling
• Takes a long time
• Requires advanced skills
9
Typical predictive modeling use case
11
Predictive models
Training Data
Validation Data
Test Data
Highly accurate prediction results
Predictive model design
17
Algorithm selection
Accuracy vs. transparency
Black box vs. white box
Hyperparameter tuning
Best balance
Feature selection
Determining a set of features
Sales = f(Price, Location) or Sales = f(Price, Weather)
A lot of effort, many models…
Many iterations, weeks...
Sophisticated knowledge...
Automatic predictive modeling
19
Algorithm selection
Accuracy vs. transparency
Black box vs. white box
Hyperparameter tuning
Best balance
Feature selection
Determining a set of features
Sales = f(Price, Location) or Sales = f(Price, Weather)
Highly accurate results in a short time!
Our technology
20
Exploring massive modeling possibilities
26
Data preprocessing strategies
Feature selection!
Algorithms
Hyperparameter tuning
1000s of models!
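The combinatorial growth behind "1000s of models" can be illustrated with a small sketch. The preprocessing strategies, feature sets, algorithms and hyperparameter grids below are hypothetical placeholders chosen for illustration; the talk does not list the actual search space.

```python
from itertools import product

# Hypothetical search space; none of these names or values come from the talk.
PREPROCESSING = ["none", "standardize", "binarize", "log_transform"]
FEATURE_SETS = ["all", "top_100", "top_50"]
ALGORITHMS = {
    "logistic_regression": {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l1", "l2"]},
    "decision_tree": {"max_depth": [3, 5, 8, 12], "min_leaf": [1, 10, 100]},
    "linear_svm": {"C": [0.01, 0.1, 1.0, 10.0]},
}


def enumerate_candidates():
    """Yield one dict per candidate model configuration."""
    for prep, feats in product(PREPROCESSING, FEATURE_SETS):
        for algo, grid in ALGORITHMS.items():
            names = sorted(grid)
            # Cartesian product over this algorithm's hyperparameter grid.
            for values in product(*(grid[n] for n in names)):
                yield {
                    "preprocessing": prep,
                    "features": feats,
                    "algorithm": algo,
                    "hyperparameters": dict(zip(names, values)),
                }


candidates = list(enumerate_candidates())
print(len(candidates))
```

Even this toy space yields hundreds of configurations, and each additional strategy or grid value multiplies the count, which is why exhaustive manual exploration is impractical.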
Automating and accelerating with Spark
27
Data preprocessing strategies
Feature selection!
Algorithms
Hyperparameter tuning
Complete in hours!
Modeling flow = training + validation
29
Training data → Training models → Models
Models + Validation data + Validation criteria → Validating models → Best model
Modeling and prediction flow
30
Training data → Training models → Models
Models + Validation data + Validation criteria → Validating models → Best model
Best model + Test data → Prediction → Best prediction
Design challenges and solutions
31
Challenges to achieve high execution performance
33
• Using native ML engines in Spark
• Parameter-aware scheduling
• Predictive work balancing
Using native ML engines in Spark
Why?
34
Comparison of Spark and native ML engines
40
                      Spark (+ Spark ML)   Native ML engines
Scalability           Yes                  No (or very limited)
Choice of algorithms  Some                 Many (+ possibly some custom, very efficient)
Performance           Medium               Extremely high
• Choice of algorithms: affects accuracy
• Spark performance: distributed nature, synchronization overhead
• Native ML engine performance: if data fits a single server
• We would like to combine Spark and ML engines
Combining Spark and ML engines for training
49
HDFS: Training data (parquet)
→ Data preprocessing (MapReduce)
→ Converting to RDD[Matrix]: RDD of huge, efficiently stored objects optimized for ML computations
→ Machine Learning (map operation): 'Single ML engine' on a single executor; input requirements: size & format
→ HDFS: 1000s of models
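The "Converting to RDD[Matrix]" step can be sketched in plain Python as packing many small row records into a few large, contiguous dense blocks. This is only an illustration under assumptions: the `DenseBlock` class and `rows_to_blocks` function are hypothetical names, and in Spark the packing function would typically be passed to `mapPartitions` so that each partition yields a handful of matrix objects.

```python
from array import array
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Sequence


@dataclass
class DenseBlock:
    """A row-major dense matrix stored in one contiguous buffer of doubles."""
    n_rows: int
    n_cols: int
    values: array  # typecode 'd', length n_rows * n_cols

    def row(self, i: int) -> List[float]:
        start = i * self.n_cols
        return list(self.values[start:start + self.n_cols])


def rows_to_blocks(rows: Iterable[Sequence[float]], n_cols: int,
                   max_rows: int) -> Iterator[DenseBlock]:
    """Pack an iterator of feature rows into blocks of at most max_rows rows."""
    buf = array("d")
    count = 0
    for r in rows:
        if len(r) != n_cols:
            raise ValueError("row has %d values, expected %d" % (len(r), n_cols))
        buf.extend(r)
        count += 1
        if count == max_rows:
            yield DenseBlock(count, n_cols, buf)
            buf, count = array("d"), 0
    if count:
        # Emit the final, possibly smaller, block.
        yield DenseBlock(count, n_cols, buf)


# Stand-in for the rows of one partition.
partition = [[float(i), float(i) * 2.0, 1.0] for i in range(5)]
blocks = list(rows_to_blocks(partition, n_cols=3, max_rows=2))
print([(b.n_rows, b.n_cols) for b in blocks])
```

Storing each block as one flat buffer avoids per-row object overhead and gives a native ML engine a layout it can consume directly.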
Combining Spark and ML engines for validation
55
HDFS: Validation data (parquet)
→ Data preprocessing (MapReduce)
→ Converting to RDD[Matrix]
→ Prediction (map operation): computing validation results for many models
→ Validation (MapReduce): computing validation scores
→ HDFS: Best model
Combining Spark and ML engines for prediction
56
HDFS: Test data (parquet)
→ Data preprocessing (MapReduce)
→ Convert to RDD[Matrix]
→ Predict (map operation): computations only for selected models
→ HDFS: Prediction results (parquet)
Design challenges
57
• Using native ML engines in Spark
• Parameter-aware scheduling
• Predictive work balancing
Many models to schedule
60
Algorithms, hyperparameters, data preprocessing strategies
Parameters: θ1, θ2, θ3 ...
Matrix X1, Matrix X2, Matrix X3
Machine Learning
Naive scheduling
61
Load & Convert
Parameter θ1 on Matrix X1, Matrix X2, Matrix X3
• Waste of memory
• Frequent data loading from other servers
• Frequent data-to-matrix conversion
Parameter-aware scheduling
62
Parameters θ1, θ2, θ3 on Matrix X1
• Efficient memory usage
• Infrequent data loading from other servers
• Infrequent data-to-matrix conversion
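The difference between the two task orderings can be sketched by counting load-and-convert operations. The sketch assumes, for simplicity, that an executor keeps only the most recently used matrix in memory; the function and variable names are illustrative and do not come from the talk.

```python
from itertools import product

MATRICES = ["X1", "X2", "X3"]
PARAMS = ["theta1", "theta2", "theta3"]


def count_loads(task_order):
    """Count load-and-convert operations for an executor that keeps
    only the most recently used matrix in memory (a simplifying assumption)."""
    loads = 0
    cached = None
    for matrix_id, _param in task_order:
        if matrix_id != cached:
            loads += 1
            cached = matrix_id
    return loads


# Naive order: iterate parameters first, so consecutive tasks touch different matrices.
naive_order = [(m, p) for p, m in product(PARAMS, MATRICES)]

# Parameter-aware order: run every parameter against one matrix before moving on.
parameter_aware_order = [(m, p) for m, p in product(MATRICES, PARAMS)]

print(count_loads(naive_order), count_loads(parameter_aware_order))
```

Under this model the naive order reloads a matrix for every task, while the parameter-aware order loads each matrix once, which matches the "infrequent loading and conversion" claim on the slide.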
Design challenges
63
• Using native ML engines in Spark
• Parameter-aware scheduling
• Predictive work balancing
Machine learning – most work-intensive & time-consuming part
64
HDFS: Training data (parquet)
→ Data preprocessing (MapReduce)
→ Convert to matrix
→ Machine Learning (map operation)
→ HDFS: 1000s of models
We must ensure a good balance of parallelized work
Naive balancing of models to compute
66
Complicated model: 5 min, 5 min
Decision tree model: 1 min, 1 min, then wait 8 min…
Predictive balancing
67
• Balancing complex and simple models (based on previous estimation)
• Complex models first
Each executor: 5 min, 1 min
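One common way to realize "complex models first" is longest-processing-time-first greedy assignment, sketched below. The talk does not state which exact rule is used, so treat this as an illustration under that assumption; `naive_split` and `predictive_balance` are hypothetical names, and the estimates reuse the 5-minute and 1-minute figures from the slides.

```python
import heapq


def naive_split(estimates, n_executors):
    """Assign models to executors in contiguous chunks, ignoring their cost."""
    chunk = max(1, -(-len(estimates) // n_executors))  # ceiling division
    loads = [0] * n_executors
    for idx, cost in enumerate(estimates):
        loads[idx // chunk] += cost
    return loads


def predictive_balance(estimates, n_executors):
    """Hand out the most expensive models first, always to the executor
    with the least estimated work so far."""
    heap = [(0.0, i) for i in range(n_executors)]
    heapq.heapify(heap)
    loads = [0.0] * n_executors
    for cost in sorted(estimates, reverse=True):
        load, executor = heapq.heappop(heap)
        loads[executor] = load + cost
        heapq.heappush(heap, (loads[executor], executor))
    return loads


estimates_min = [5, 5, 1, 1]  # two complicated models, two decision trees
print(naive_split(estimates_min, 2))
print(predictive_balance(estimates_min, 2))
```

With these estimates the naive split leaves one executor busy for 10 minutes while the other idles after 2, whereas the balanced assignment finishes both executors at 6 minutes.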
Evaluation
68
Evaluation – targeting Top-10%
• Prediction problem
– Comparing Top-10% precision of targeting potential positive samples
• Comparing with manual predictive modeling
– Done with scikit-learn v0.18.1
– Selected algorithms (Logistic Regression, SVM, Random Forests)
– Selected preprocessing strategies
– All algorithm parameters left at their default values
• except Random Forest (n_estimators = 200)
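A minimal sketch of the Top-10% precision metric follows, assuming it means the share of true positives among the 10% of samples with the highest predicted scores. The talk does not spell out the definition, so this interpretation and the function name are assumptions.

```python
def top_fraction_precision(scores, labels, fraction=0.10):
    """Precision among the highest-scored `fraction` of samples.

    scores: predicted scores, higher means more likely positive.
    labels: ground truth, 1 for positive and 0 for negative.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one sample is required")
    # Number of targeted samples, rounded down but never below one.
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    return hits / k


# Toy data: 20 samples, three of them positive.
scores = [i / 20 for i in range(20)]
labels = [0] * 20
for positive_index in (3, 17, 19):
    labels[positive_index] = 1

print(top_fraction_precision(scores, labels))
```

Here the two highest-scored samples are targeted and one of them is positive, so the metric rewards ranking positives at the very top rather than overall accuracy.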
69
Evaluation – data sets
• KDDCUP 2014 competition data
– 557K records for training and validation data
– 62K records for test data
– Features: 500
• KDDCUP 2015 competition data
– 108K records for training and validation data
– 12K records for test data
– Features: 500
• IJCAI 2015 competition data
– 87K records for training, validation and test data
– Features: 500
70
Evaluation – cluster specification
• Size: 3U
• Server modules: 34
• CPU: 272 cores (Intel Xeon D 2.1GHz)
– 128 cores used in the evaluation
• RAM: 2TB
• Storage: 34TB SSD
• Internal network: 10GbE
• Spark v1.6.0, Hadoop v2.7.3
71
Scalable Modular Server
(DX2000)
Evaluation results and conclusions
• Competitive results with good accuracy
73
Top-10% precision results
Data          Our technology   Logistic regression   SVM     Random Forests
KDDCUP 2014   15.6%            13.5%                 12.0%   14.8%
KDDCUP 2015   97.1%            95.5%                 93.1%   97.2%
IJCAI 2015    8.2%             8.3%                  8.1%    8.2%
Evaluation results and conclusions
• Short execution time
• Full automation of the whole process
• Handling data of any size
74
Execution time
Data          Our technology
KDDCUP 2014   172 minutes
KDDCUP 2015   45 minutes
IJCAI 2015    36 minutes
Our observations
75
Our observations
• Using RDD of huge but compact objects optimized for ML computations
• Limiting execution time overhead in tests on YARN
• Stable execution on YARN
76
78
HDFS: Training data (parquet)
→ Data preprocessing (MapReduce)
→ Converting to RDD[Matrix]
→ Machine Learning (map operation)
→ HDFS: 1000s of models
RDD[DenseMatrix]
• Spark used for parallelization
• All the necessary data for a single execution kept without memory overhead
• Performance-critical operations executed:
– On objects with optimized linear algebra operations
– By fast native ML algorithms
79
Our observations
• Using RDD of huge but compact objects optimized for fast computations
• Limiting execution time overhead in tests on YARN
• Stable execution on YARN
80
Limiting execution overhead in tests
• Submitting a Spark application takes time
81
Spark submit → Test, Spark submit → Test, Spark submit → Test
Limiting execution overhead in tests
• We submit only once
82
Spark submit → Test, Test, Test
Our observations
• Using RDD of huge but compact objects optimized for fast computations
• Limiting execution time overhead in tests on YARN
• Stable execution on YARN
83
Stable execution on YARN
• Default configuration sometimes fails with insufficient memory
• Spark Web UI:
• Giving Spark a lot of memory, but the application still fails
• Known problem in Spark
84
Stable execution on YARN
• JVM system memory suddenly spikes over the YARN limit (*)
85
(*) Shivnath and Mayuresh. “Understanding Memory Management In Spark For Fun And Profit”, Spark Summit 2016.
Figure: memory (GB) over time; a spike in JVM system memory usage exceeds the YARN limit (6GB)
Stable execution on YARN
• Tip: configure spark.yarn.executor.memoryOverhead carefully
• Recommended overhead: 6-10%
• 15% overhead required in our case
• Must be thoroughly investigated
86
(http://spark.apache.org/docs/2.1.1/running-on-yarn.html)
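As a rough illustration of the tip above, the overhead can be derived from the executor heap size and passed to spark-submit. The 15% figure is the one reported on the slide; the 384 MiB floor mirrors the default minimum documented on the linked page; the 5120 MiB executor size and the helper names are hypothetical and chosen only for the example.

```python
def executor_memory_overhead_mb(executor_memory_mb, overhead_percent=15, minimum_mb=384):
    """Off-heap overhead to request on top of the executor heap, in MiB."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return max(minimum_mb, executor_memory_mb * overhead_percent // 100)


def spark_submit_conf(executor_memory_mb, overhead_percent=15):
    """Build the memory-related spark-submit arguments as a list of strings."""
    overhead = executor_memory_overhead_mb(executor_memory_mb, overhead_percent)
    return [
        "--executor-memory", "%dm" % executor_memory_mb,
        "--conf", "spark.yarn.executor.memoryOverhead=%d" % overhead,
    ]


# Hypothetical executor size; heap plus overhead must fit the YARN container limit.
print(" ".join(spark_submit_conf(5120)))
```

The right percentage depends on the workload, so it is worth measuring actual off-heap usage rather than relying on a fixed value.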
Summary
87
Summary
• Predictive modeling problem
– Requires sophisticated knowledge
– Takes a long time
• Our technology: Automatic Predictive Modeling
– Combines Spark with native ML engines
– Fully automates the whole process
– Provides highly accurate results
– Takes at most hours
– Handles data of any size
88
Future work
• Extending to other models (e.g. deep learning)
• Speeding up by GPU
• Reducing YARN memory overhead
89
Thank you!
90

Production model lifecycle management 2016 09
 

No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin Kulka and Michał Kaczmarczyk

  • 1. No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark. Marcin Kulka and Michał Kaczmarczyk, 9LivesData. Oct/26/2017. #EUai9
  • 2. Who we are: Marcin Kulka, Senior Software Engineer; Michał Kaczmarczyk (Ph.D.), Software Architect, Team Leader and Project Manager
  • 3. Who we are: advanced software R&D company (Warsaw, Poland); 75+ scientists and software engineers; specializing in scalable storage, distributed and big data systems; cooperating with partners all around the world
  • 5. Masato Asahara (Ph.D.), Researcher, NEC Data Science Research Laboratory; Ryohei Fujimaki (Ph.D.), Research Fellow, NEC Data Science Research Laboratory
  • 6. Agenda: typical use case for a predictive modeling problem; our technology (Automatic Predictive Modeling); design challenges; evaluation results; our observations
  • 8. Predictive analysis in industry and business: driver risk assessment; inventory optimization; churn retention; predictive maintenance; product price optimization; sales optimization; energy/water operation management
  • 9. ... but predictive modeling takes a long time and requires high skills
  • 10. Typical predictive modeling use case: from training data, validation data and test data to highly accurate prediction results
  • 11. Typical predictive modeling use case: from training data, validation data and test data, through predictive models, to highly accurate prediction results
  • 12. Predictive model design. Algorithm selection: accuracy vs. transparency (black box vs. white box)
  • 13. Predictive model design. Algorithm selection: accuracy vs. transparency (black box vs. white box). Hyperparameter tuning: best balance
  • 14. Predictive model design. Algorithm selection: accuracy vs. transparency (black box vs. white box). Hyperparameter tuning: best balance. Feature selection: determining a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather)
  • 15. Predictive model design. Algorithm selection: accuracy vs. transparency (black box vs. white box). Hyperparameter tuning: best balance. Feature selection: determining a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather). A lot of effort, many models…
  • 16. Predictive model design. Algorithm selection: accuracy vs. transparency (black box vs. white box). Hyperparameter tuning: best balance. Feature selection: determining a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather). A lot of effort, many models… Many iterations, weeks...
  • 17. Predictive model design. Algorithm selection: accuracy vs. transparency (black box vs. white box). Hyperparameter tuning: best balance. Feature selection: determining a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather). A lot of effort, many models… Many iterations, weeks... Sophisticated knowledge...
  • 18. Automatic predictive modeling. Algorithm selection: accuracy vs. transparency (black box vs. white box). Hyperparameter tuning: best balance. Feature selection: determining a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather)
  • 19. Automatic predictive modeling. Algorithm selection: accuracy vs. transparency (black box vs. white box). Hyperparameter tuning: best balance. Feature selection: determining a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather). Highly accurate results in a short time!
  • 21. Exploring massive modeling possibilities: data preprocessing strategies
  • 22. Exploring massive modeling possibilities: data preprocessing strategies; algorithms
  • 23. Exploring massive modeling possibilities: data preprocessing strategies; algorithms; feature selection!
  • 24. Exploring massive modeling possibilities: data preprocessing strategies; algorithms; feature selection!; hyperparameter tuning
  • 25. Exploring massive modeling possibilities: data preprocessing strategies; algorithms; feature selection!; hyperparameter tuning. 1000s of models!
  • 26. Exploring massive modeling possibilities: data preprocessing strategies; algorithms; feature selection!; hyperparameter tuning. 1000s of models!
  • 27. Automating and accelerating with Spark: data preprocessing strategies; algorithms; feature selection!; hyperparameter tuning. Complete in hours!
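
To make the "1000s of models" figure concrete, the following minimal Python sketch enumerates a search space as the cross product of preprocessing strategies, feature sets, algorithms and hyperparameter values. Every name and value in it is a hypothetical placeholder chosen for illustration; the slides do not list the actual strategies or grids used.

```python
from itertools import product

# Hypothetical search space: all names and values are illustrative assumptions.
PREPROCESSING = ["none", "standardize", "binarize"]
FEATURE_SETS = [
    ("price", "location"),
    ("price", "weather"),
    ("price", "location", "weather"),
]
ALGORITHMS = {
    "logistic_regression": {"C": [0.01, 0.1, 1.0, 10.0]},
    "decision_tree": {"max_depth": [2, 4, 8, 16], "min_leaf": [1, 10, 100]},
    "random_forest": {"n_trees": [50, 100, 200], "max_depth": [4, 8, 16]},
}


def enumerate_candidates():
    """Yield one configuration dict per candidate model."""
    for prep, features, (algo, grid) in product(
        PREPROCESSING, FEATURE_SETS, ALGORITHMS.items()
    ):
        names = sorted(grid)
        # Cross product of all hyperparameter values for this algorithm.
        for values in product(*(grid[name] for name in names)):
            yield {
                "preprocessing": prep,
                "features": features,
                "algorithm": algo,
                "hyperparameters": dict(zip(names, values)),
            }


candidates = list(enumerate_candidates())
print(len(candidates), "candidate models")
```

Even this toy grid produces a few hundred candidates; adding one more preprocessing strategy or hyperparameter multiplies the count, which is why the search quickly reaches thousands of models.
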
  • 29. Modeling flow = training + validation. Diagram: training data → training models → models → validating models (with validation data and validation criteria) → best model; test data is shown but not yet used
  • 30. Modeling and prediction flow. Diagram: training data → training models → models → validating models (with validation data and validation criteria) → best model → prediction (on test data) → best prediction
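
The flow on the two slides above (train many models, validate them against a criterion, keep the best one, then predict on test data) can be sketched in plain Python. This is only an illustration under stated assumptions: the "models" are trivial one-feature threshold rules, the validation criterion is accuracy, and all function names are hypothetical rather than taken from the presented system.

```python
from typing import Callable, List, Sequence, Tuple

Row = Tuple[Sequence[float], int]       # (features, label)
Model = Callable[[Sequence[float]], int]


def train(data: List[Row], feature: int, threshold: float) -> Model:
    """Fit a one-feature threshold rule; training only picks the polarity."""
    hits = sum(1 for x, y in data if (x[feature] >= threshold) == (y == 1))
    positive_above = hits * 2 >= len(data)

    def model(x: Sequence[float]) -> int:
        return int((x[feature] >= threshold) == positive_above)

    return model


def accuracy(model: Model, data: List[Row]) -> float:
    """Validation criterion (an assumption): fraction of correct predictions."""
    return sum(1 for x, y in data if model(x) == y) / len(data)


def select_best(training, validation, candidates, criterion=accuracy):
    """Train every candidate, score it on validation data, return the best."""
    trained = [(params, train(training, *params)) for params in candidates]
    return max(trained, key=lambda pm: criterion(pm[1], validation))


training = [([1.0, 5.0], 0), ([2.0, 1.0], 0), ([3.0, 4.0], 1), ([4.0, 2.0], 1)]
validation = [([1.5, 3.0], 0), ([3.5, 3.0], 1), ([0.5, 9.0], 0), ([5.0, 0.0], 1)]
test_rows = [[0.0, 0.0], [10.0, 0.0]]

# Candidate parameters: (feature index, threshold).
best_params, best_model = select_best(
    training, validation, [(0, 2.5), (1, 3.0), (0, 3.5)]
)
predictions = [best_model(x) for x in test_rows]
print(best_params, predictions)
```

Note that the test data is touched only after the best model has been chosen, which mirrors the separation between the modeling flow and the prediction flow in the diagrams.
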
  • 32. Challenges to achieve high execution performance: using native ML engines in Spark; parameter-aware scheduling; predictive work balancing
  • 33. Challenges to achieve high execution performance: using native ML engines in Spark; parameter-aware scheduling; predictive work balancing
  • 34. Using native ML engines in Spark. Why?
  • 35. Comparison of Spark and native ML engines. Columns: Spark (+ Spark ML) | Native ML engines
  • 36. Comparison of Spark and native ML engines. Scalability: Spark (+ Spark ML) yes | native ML engines no (or very limited)
  • 37. Comparison of Spark and native ML engines. Scalability: Spark (+ Spark ML) yes | native ML engines no (or very limited). Choice of algorithms: Spark some | native many (+ possibly some custom, very efficient). Callout: accuracy
  • 38. Comparison of Spark and native ML engines. Scalability: Spark (+ Spark ML) yes | native ML engines no (or very limited). Choice of algorithms: Spark some | native many (+ possibly some custom, very efficient). Performance: Spark medium (distributed nature, synchronization overhead) | native extremely high (if data fits a single server). Callout: accuracy
  • 39. Comparison of Spark and native ML engines. Scalability: Spark (+ Spark ML) yes | native ML engines no (or very limited). Choice of algorithms: Spark some | native many (+ possibly some custom, very efficient). Performance: Spark medium (distributed nature, synchronization overhead) | native extremely high (if data fits a single server). Callout: accuracy
  • 40. Comparison of Spark and native ML engines. Scalability: Spark (+ Spark ML) yes | native ML engines no (or very limited). Choice of algorithms: Spark some | native many (+ possibly some custom, very efficient). Performance: Spark medium | native extremely high. We would like to combine Spark and ML engines
  • 41. Combining Spark and ML engines for training: training data (parquet) on HDFS → models
  • 43. Combining Spark and ML engines for training: training data (parquet) on HDFS → data preprocessing (MapReduce) → machine learning (map operation) → models
  • 44. Combining Spark and ML engines for training: training data (parquet) on HDFS → data preprocessing (MapReduce) → machine learning (map operation) → models. 'Single ML engine' on a single executor
  • 45. Combining Spark and ML engines for training: training data (parquet) on HDFS → data preprocessing (MapReduce) → machine learning (map operation) → models. 'Single ML engine' on a single executor. Input requirements: size & format
  • 46. Combining Spark and ML engines for training: training data (parquet) on HDFS → data preprocessing (MapReduce) → machine learning (map operation) → models
  • 47. Combining Spark and ML engines for training: training data (parquet) on HDFS → data preprocessing (MapReduce) → converting to RDD[Matrix] → machine learning (map operation) → models
  • 48. Combining Spark and ML engines for training: training data (parquet) on HDFS → data preprocessing (MapReduce) → converting to RDD[Matrix] → machine learning (map operation) → models. RDD of huge, efficiently stored objects optimized for ML computations!!!
  • 49. Combining Spark and ML engines for training: training data (parquet) on HDFS → data preprocessing (MapReduce) → converting to RDD[Matrix] → machine learning (map operation) → 1000s of models on HDFS. RDD of huge, efficiently stored objects optimized for ML computations!!!
  • 50. Combining Spark and ML engines for validation: validation data (parquet) on HDFS
  • 53. Combining Spark and ML engines for validation: validation data (parquet) on HDFS → data preprocessing (MapReduce) → converting to RDD[Matrix] → prediction (map operation). Computing validation results for many models
  • 54. Combining Spark and ML engines for validation: validation data (parquet) on HDFS → data preprocessing (MapReduce) → converting to RDD[Matrix] → prediction (map operation) → validation (MapReduce). Computing validation scores
  • 55. Combining Spark and ML engines for validation: validation data (parquet) on HDFS → data preprocessing (MapReduce) → converting to RDD[Matrix] → prediction (map operation) → validation (MapReduce) → best model on HDFS
  • 56. Combining Spark and ML engines for prediction: test data (parquet) on HDFS → data preprocessing (MapReduce) → convert to RDD[Matrix] → predict (map operation) → prediction results (parquet) on HDFS. Computations only for selected models
  • 57. Design challenges 5757 θ1 θ2 θ3 • Using native ML engines in Spark • Parameter-aware scheduling • Predictive work balancing
  • 58. Many models to schedule 58 Matrix X3 Matrix X2 Matrix X1
  • 59. Many models to schedule 59 Algorithms Hyperparameters Data preprocessing strategies Parameters: θ1, θ2, θ3 ... Matrix X3 Matrix X2 Matrix X1
  • 60. Many models to schedule. Diagram: algorithms, hyperparameters, and data preprocessing strategies give the parameters θ1, θ2, θ3, ...; machine learning runs on matrices X1, X2, X3
  • 61. Naive scheduling • Waste of memory • Frequent data loading from other servers • Frequent data-to-matrix conversion. Diagram: load & convert; parameter θ1 paired with each of matrices X1, X2, X3
  • 62. Parameter-aware scheduling • Efficient memory usage • Infrequent data loading from other servers • Infrequent data-to-matrix conversion. Diagram: parameters θ1, θ2, θ3 paired with matrix X1
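One way to read the contrast between naive and parameter-aware scheduling is sketched below in plain Python. It is a simplified illustration and not the scheduler from the talk: `naive_schedule`, `parameter_aware_schedule`, `load`, `train`, and `STORE` are all hypothetical names, and "loading" is just a dictionary lookup that gets counted.

```python
from collections import defaultdict


def naive_schedule(tasks, load, train):
    """Run each (matrix_id, theta) task on its own; the matrix is loaded every time."""
    results = {}
    for matrix_id, theta in tasks:
        matrix = load(matrix_id)
        results[(matrix_id, theta)] = train(matrix, theta)
    return results


def parameter_aware_schedule(tasks, load, train):
    """Group tasks by matrix so each matrix is loaded and converted once."""
    by_matrix = defaultdict(list)
    for matrix_id, theta in tasks:
        by_matrix[matrix_id].append(theta)
    results = {}
    for matrix_id, thetas in by_matrix.items():
        matrix = load(matrix_id)  # single load, reused for every theta
        for theta in thetas:
            results[(matrix_id, theta)] = train(matrix, theta)
    return results


# Toy stand-ins for data loading and model training (assumptions).
STORE = {"X1": [1.0, 2.0], "X2": [3.0, 4.0], "X3": [5.0, 6.0]}
load_log = []


def load(matrix_id):
    load_log.append(matrix_id)
    return STORE[matrix_id]


def train(matrix, theta):
    return theta * sum(matrix)


tasks = [(m, theta) for theta in (1, 2, 3) for m in ("X1", "X2", "X3")]

naive_results = naive_schedule(tasks, load, train)
naive_loads = len(load_log)

load_log.clear()
aware_results = parameter_aware_schedule(tasks, load, train)
aware_loads = len(load_log)
```

With three matrices and three parameter settings, the naive variant performs nine loads and the grouped variant three, while both produce the same results.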
  • 63. Design challenges • Using native ML engines in Spark • Parameter-aware scheduling • Predictive work balancing
  • 64. Machine learning: the most work-intensive and time-consuming part • We must ensure a good balance of parallelized work. Pipeline: training data (Parquet) on HDFS → data preprocessing (MapReduce) → converting to matrix → machine learning (map operation) → 1000s of models on HDFS
  • 65. Naive balancing of models to compute. Diagram: complicated model, 5 min + 5 min
  • 66. Naive balancing of models to compute. Diagram: complicated models take 5 min + 5 min, decision tree models take 1 min + 1 min, so one side waits 8 min
  • 67. Predictive balancing • Balancing complex and simple models (based on previous estimation) • Complex models first. Diagram: each side gets 5 min + 1 min
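The "complex models first" rule resembles the classic longest-processing-time-first heuristic. The Python sketch below uses that heuristic as a stand-in; it is an assumption about how such balancing could work, not the talk's actual algorithm. The job names, the estimated durations, and both function names are hypothetical.

```python
import heapq


def makespan(assignment):
    """Total time until the busiest worker finishes."""
    return max(sum(duration for _, duration in jobs) for jobs in assignment)


def naive_balance(jobs, n_workers):
    """Split jobs into contiguous chunks in arrival order, ignoring cost."""
    size = -(-len(jobs) // n_workers)  # ceiling division
    return [jobs[i * size:(i + 1) * size] for i in range(n_workers)]


def predictive_balance(jobs, n_workers):
    """Assign the most expensive jobs first, always to the least loaded worker."""
    heap = [(0, worker) for worker in range(n_workers)]
    assignment = [[] for _ in range(n_workers)]
    for name, estimate in sorted(jobs, key=lambda job: job[1], reverse=True):
        load, worker = heapq.heappop(heap)
        assignment[worker].append((name, estimate))
        heapq.heappush(heap, (load + estimate, worker))
    return assignment


# Estimated training times in minutes, mirroring the slide's example.
jobs = [("complex-a", 5), ("complex-b", 5), ("tree-a", 1), ("tree-b", 1)]

naive_makespan = makespan(naive_balance(jobs, n_workers=2))
predictive_makespan = makespan(predictive_balance(jobs, n_workers=2))
```

With two workers, the naive split finishes after 10 minutes while one worker idles for 8, and the estimate-driven split finishes after 6 minutes with both workers equally loaded.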
  • 69. Evaluation – targeting Top-10% • Prediction problem – Comparing Top-10% precision of targeting potential positive samples • Comparison with manual predictive modeling – Done with scikit-learn v0.18.1 – Selected algorithms (Logistic Regression, SVM, Random Forests) – Selected preprocessing strategies – All algorithm parameters left at their default values, except Random Forests (n_estimators = 200)
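The slides do not define Top-10% precision formally. A common reading, assumed here, is the share of true positives among the 10% of samples with the highest predicted scores. The sketch below follows that reading; `top_percent_precision` is a hypothetical helper, and rounding the cutoff down is a further assumption.

```python
def top_percent_precision(scores, labels, percent=10):
    """Precision among the top `percent` of samples ranked by score.

    labels are 1 for positive and 0 for negative. The cutoff is rounded
    down, with at least one sample kept (an assumption of this sketch).
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one sample is required")
    k = max(1, len(scores) * percent // 100)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(1 for _, label in ranked[:k] if label == 1)
    return hits / k


# 20 samples with strictly decreasing scores; the top 10% is the first two.
scores = [20 - i for i in range(20)]
labels = [1, 0] + [0] * 17 + [1]
precision = top_percent_precision(scores, labels)
```

In this toy input one of the two highest-scored samples is positive, so the precision is 0.5; the positive sample at the bottom of the ranking does not count.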
  • 70. Evaluation – data sets • KDDCUP 2014 competition data – 557K records for training and validation data – 62K records for test data – Features: 500 • KDDCUP 2015 competition data – 108K records for training and validation data – 12K records for test data – Features: 500 • IJCAI 2015 competition data – 87K records for training, validation and test data – Features: 500
  • 71. Evaluation – cluster specification • Scalable Modular Server (DX2000) • Size: 3U • Server modules: 34 • CPU: 272 cores (Intel Xeon D 2.1GHz) – 128 cores used in the evaluation • RAM: 2TB • Storage: 34TB SSD • Internal network: 10GbE • Spark v1.6.0, Hadoop v2.7.3
  • 72. Evaluation results and conclusions • Top-10% precision results (our technology / logistic regression / SVM / Random Forests): – KDDCUP 2014: 15.6% / 13.5% / 12.0% / 14.8% – KDDCUP 2015: 97.1% / 95.5% / 93.1% / 97.2% – IJCAI 2015: 8.2% / 8.3% / 8.1% / 8.2%
  • 73. Evaluation results and conclusions • Competitive results with good accuracy • Top-10% precision results (our technology / logistic regression / SVM / Random Forests): – KDDCUP 2014: 15.6% / 13.5% / 12.0% / 14.8% – KDDCUP 2015: 97.1% / 95.5% / 93.1% / 97.2% – IJCAI 2015: 8.2% / 8.3% / 8.1% / 8.2%
  • 74. Evaluation results and conclusions • Short execution time • Full automation of the whole process • Handling data of any size • Execution time of our technology: – KDDCUP 2014: 172 minutes – KDDCUP 2015: 45 minutes – IJCAI 2015: 36 minutes
  • 76. Our observations • Using RDD of huge but compact objects optimized for ML computations • Limiting execution time overhead in tests on YARN • Stable execution on YARN
  • 78. RDD[DenseMatrix]. Pipeline: training data (Parquet) on HDFS → data preprocessing (MapReduce) → converting to RDD[Matrix] → machine learning (map operation) → 1000s of models on HDFS
  • 79. RDD[DenseMatrix] • Spark used for parallelization • All the data needed for a single execution kept without memory overhead • Performance-critical operations executed: – On objects with optimized linear algebra operations – By fast native ML algorithms
  • 80. Our observations • Using RDD of huge but compact objects optimized for fast computations • Limiting execution time overhead in tests on YARN • Stable execution on YARN
  • 81. Limiting execution overhead in tests • Submitting a Spark application takes time. Diagram: every test is preceded by its own Spark submit
  • 82. Limiting execution overhead in tests • We submit only once. Diagram: a single Spark submit followed by all tests
  • 83. Our observations • Using RDD of huge but compact objects optimized for fast computations • Limiting execution time overhead in tests on YARN • Stable execution on YARN
  • 84. Stable execution on YARN • The default configuration sometimes fails for lack of memory • Spark Web UI (screenshot not preserved) • Giving Spark much more memory still leaves the application failing • Known problem in Spark
  • 85. Stable execution on YARN • JVM system memory suddenly spikes over the YARN limit (*). Chart: memory (GB) over time, with a spike of JVM system memory usage above the YARN limit (6GB). (*) Shivnath and Mayuresh. “Understanding Memory Management In Spark For Fun And Profit”, Spark Summit 2016.
  • 86. Stable execution on YARN • Tip: spark.yarn.executor.memoryOverhead must be configured carefully • Recommended overhead: 6-10% • 15% overhead required in our case • Must be thoroughly investigated (http://spark.apache.org/docs/2.1.1/running-on-yarn.html)
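The arithmetic behind the overhead tip can be sketched as follows. The 8192 MiB executor size is a hypothetical figure that does not come from the talk, and `memory_overhead_mib` is an illustrative helper. The 384 MiB floor and the megabyte unit follow the Spark 2.1.1 "Running on YARN" page linked above; they should be checked against the Spark version actually in use.

```python
MIN_OVERHEAD_MIB = 384  # minimum stated in the Spark 2.1.1 YARN docs


def memory_overhead_mib(executor_memory_mib, percent):
    """Overhead in MiB for a given executor heap size and percentage.

    Integer arithmetic, rounded up, and never below the documented floor.
    """
    overhead = (executor_memory_mib * percent + 99) // 100
    return max(MIN_OVERHEAD_MIB, overhead)


executor_memory_mib = 8192  # hypothetical executor heap, for illustration

default_like = memory_overhead_mib(executor_memory_mib, percent=10)
talk_setting = memory_overhead_mib(executor_memory_mib, percent=15)

# The value would then be passed as a Spark configuration property.
conf_flag = f"--conf spark.yarn.executor.memoryOverhead={talk_setting}"
```

For this hypothetical executor size, moving from 10% to 15% raises the overhead from 820 MiB to 1229 MiB. Which percentage keeps a given workload stable has to be found by measurement, as the slide notes.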
  • 88. Summary • Predictive modeling problem – Requires sophisticated knowledge – Takes a long time • Our technology: Automatic Predictive Modeling – Combines Spark with native ML engines – Fully automates the whole process – Provides highly accurate results – Takes at most hours – Handles data of any size
  • 89. Future work • Extending to other models (e.g. deep learning) • Speeding up with GPUs • Reducing YARN memory overhead