Automatic Features Generation And Model Training On Spark: A Bayesian Approach


Spark Summit 2016 talk by Christopher Nguyen (Arimo, Inc.), Vu Pham (Arimo, Inc.) and Huan Dao (Arimo, Inc.)

Published in: Data & Analytics


  1. Automating Data Science on Spark: A Bayesian Approach. Vu Pham, Huan Dao, Christopher Nguyen. San Francisco, June 8th, 2016.
  2. Data Science is so MANUAL! Feature Engineering, Model Selection, Hyperparameter Tuning. @arimoinc
  3. Natural Interfaces, Machine Learning, Collaboration
  4. Agenda: 1. Bayesian Optimization for Hyper-parameter Tuning; 2. Bayesian Optimization for “Automating” Data Science on Spark; 3. Experiments
  5. Bayesian Optimization for Hyper-parameter Tuning
  6. What if… we could automate this? Feature Engineering, Model Selection, Hyper-parameter Tuning
  7. Hyper-parameters:
     K-means: K
     Neural Network: # of layers, dropout, momentum, …
     Random Forest: feature set, # of trees, max depth, …
     SVM: regularization term (C)
     Gradient Descent: learning rate, number of iterations, …
  8. Manual Search
  9. Grid Search
  10. Random Search
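The two automated baselines on slides 9-10 can be sketched in a few lines of Python. This is a toy illustration; the parameter names and value grids below are our own, not from the talk:

```python
import itertools
import random

# A small, discretized 2-D hyper-parameter space (illustrative values).
space = {
    "learning_rate": [0.01, 0.03, 0.1, 0.3],
    "num_trees": [50, 100, 200, 400],
}

def grid_search(space):
    """Exhaustively enumerate every combination of the discretized space."""
    keys = list(space)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*(space[k] for k in keys))]

def random_search(space, n_samples, seed=0):
    """Draw n_samples configurations independently and uniformly at random."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()}
            for _ in range(n_samples)]

grid = grid_search(space)         # 4 * 4 = 16 configurations
sample = random_search(space, 8)  # any evaluation budget you like
```

Grid search cost grows exponentially with the number of hyper-parameters, while random search lets you pick the budget directly, which is one motivation for the smarter selection strategy on the next slides.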
  11. Hyper-parameter tuning. Maximize a utility function over the hyper-parameter space: U: S → R, c ↦ U(c), where S is the hyper-parameter space and U(c) is the performance on the validation set. How do we intelligently select the next configuration, given the observations made so far?
  12. Bayesian Inference. Courtesy: http://www2.stat.duke.edu/~mw/fineart.html
  13. Bayesian Optimization Explained:
     1. Incorporate a prior over the space of possible objective functions (a Gaussian Process, GP).
     2. Combine the prior with the likelihood to obtain a posterior over function values, given the observations.
     3. Select the next configuration to evaluate based on the posterior, according to an acquisition function.
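The three steps above can be sketched as a minimal 1-D loop. This is a toy of our own making, not Arimo's implementation: a GP with an RBF kernel as the prior, and expected improvement (EI) as the acquisition function, maximized over a discretized candidate set:

```python
import numpy as np
from math import erf

def rbf(a, b, length=0.3):
    """Squared-exponential (RBF) kernel between two 1-D point sets."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_posterior(x_obs, y_obs, x_cand, noise=1e-6):
    """Steps 1-2: GP posterior mean/std at candidates, given observations."""
    K_inv = np.linalg.inv(rbf(x_obs, x_obs) + noise * np.eye(len(x_obs)))
    K_s = rbf(x_obs, x_cand)
    mu = K_s.T @ K_inv @ y_obs
    var = 1.0 - np.sum(K_s * (K_inv @ K_s), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    """Step 3: the acquisition function (EI, for maximization)."""
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return (mu - best) * cdf + sigma * pdf

def objective(x):
    """Stand-in for the expensive black box being tuned (a made-up function)."""
    return x * np.sin(3.0 * x)

x_cand = np.linspace(0.0, 2.0, 200)   # discretized search space
x_obs = np.array([0.2, 1.8])          # two initial observations
y_obs = objective(x_obs)
for _ in range(10):                   # the BO loop: fit, acquire, evaluate
    mu, sigma = gp_posterior(x_obs, y_obs, x_cand)
    x_next = x_cand[np.argmax(expected_improvement(mu, sigma, y_obs.max()))]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))
best_x = x_obs[np.argmax(y_obs)]      # best configuration found
```

Real tools such as Spearmint use the same structure but with learned kernel hyper-parameters, continuous acquisition optimization, and many dimensions.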
  14. Bayesian Optimization Explained
  15. Bayesian Optimization globally optimizes functions that are: expensive, multi-modal, noisy, and black-box.
  16. Bayesian Optimization for “Automating” Data Science on Spark
  17. Automatic Data Science Workflow: Feature Engineering, Model Selection, Hyperparameter Tuning
  18. Machines doing part of Data Science: Dataset → Feature preprocessing (Data Cleaning, Generate new features, Feature selection, Dimensionality reduction) → Predictive modeling → Scoring
  19. A generic Machine Learning pipeline (training set and val/test set): Feature preprocessing → Truncated SVD (n: # of dims) → feature selection by f-score with the target variable (keep the top alpha percent of features) → K-means (k: # of clusters) → RandomForest #1 … #k (number of trees, maximum depth, percentage of features) → Scoring
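The preprocessing half of this pipeline can be sketched with NumPy alone. This is a toy illustration on random data; `n` and `alpha` play the same role as the tunable knobs on the slide, and the f-score here is a simple between/within-class variance ratio standing in for the real one:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                           # toy feature matrix
y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)  # toy target

def truncated_svd(X, n):
    """Project the (centered) data onto its top-n singular directions."""
    U, s, _ = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return U[:, :n] * s[:n]

def f_score(z, y):
    """Between-class over within-class variance of one feature vs. the target."""
    g0, g1 = z[y == 0], z[y == 1]
    return (g0.mean() - g1.mean()) ** 2 / (g0.var() + g1.var() + 1e-12)

def select_top(Z, y, alpha):
    """Keep the top alpha fraction of features, ranked by f-score."""
    k = max(1, int(round(alpha * Z.shape[1])))
    scores = np.array([f_score(Z[:, j], y) for j in range(Z.shape[1])])
    return Z[:, np.argsort(scores)[::-1][:k]]

Z = truncated_svd(X, n=10)           # n: # of dims (tunable)
Z_sel = select_top(Z, y, alpha=0.5)  # alpha: top-feature fraction (tunable)
```

The selected features would then feed K-means and the RandomForests; every knob in the chain (n, alpha, k, the forest parameters) is a hyper-parameter the Bayesian optimizer can tune jointly.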
  20. Pipeline on DDF and BayesOpt: the Client uses a Bayesian Optimizer to select the hyper-parameters of the pipeline so that it maximizes performance on a validation set. Architecture: Client + Bayesian Optimizer → Spark Driver (DDF, Arimo Pipeline Manager) → Spark Workers.
  21. Pipeline on DDF and BayesOpt:
     train_ddf = session.get_ddf(…)
     valid_ddf = session.get_ddf(…)
     optimizer = SpearmintOptimizer(chooser_name='GPEIperSecChooser',
                                    max_finished_jobs=max_iters,
                                    grid_size=5000, …)
     best_params, trace = auto_model(
         optimizer, train_ddf, 'arrdelay', classification=True,
         excluded_columns=['actualelapsedtime', 'arrtime', 'year'],
         validation_ddf=valid_ddf)
  22. Experimental Results
  23. Experiment 1: SF Crimes. Sample rows (Dates | Category | Descript | DayOfWeek | PdDistrict | Resolution | Address | X | Y):
     2015-05-13 23:53:00 | WARRANTS | WARRANT ARREST | Wednesday | NORTHERN | ARREST, BOOKED | OAK ST / LAGUNA ST | -122.4258 | 37.7745
     2015-05-13 23:53:00 | OTHER OFFENSES | TRAFFIC VIOLATION ARREST | Wednesday | NORTHERN | ARREST, BOOKED | OAK ST / LAGUNA ST | -122.4258 | 37.7745
     2015-05-13 23:33:00 | OTHER OFFENSES | TRAFFIC VIOLATION ARREST | Wednesday | NORTHERN | ARREST, BOOKED | VAN NESS AV / GREENWICH ST | -122.4243 | 37.8004
  24. Experiment 1: SF Crimes dataset. Search space:
     Hyper-parameter | Type | Range
     Number of hidden layers | INT | 1, 2, 3
     Number of hidden units | INT | 64, 128, 256
     Dropout at the input layer | FLOAT | [0, 0.5]
     Dropout at the hidden layers | FLOAT | [0, 0.75]
     Learning rate | FLOAT | [0.01, 0.1]
     L2 weight decay | FLOAT | [0, 0.01]
     Metrics: logloss on the validation set; running time (hours) for ~40 iterations.
  25. Experiment 1: SF Crimes dataset. Spearmint results:
     Hyper-parameter | Type | Range | Spearmint
     Number of hidden layers | INT | 1, 2, 3 | 2
     Number of hidden units | INT | 64, 128, 256 | 256
     Dropout at the input layer | FLOAT | [0, 0.5] | 0.423678
     Dropout at the hidden layers | FLOAT | [0, 0.75] | 0.091693
     Learning rate | FLOAT | [0.01, 0.1] | 0.025994
     L2 weight decay | FLOAT | [0, 0.01] | 0.00238
     Logloss on the validation set: 2.1502; running time: 15.8 hours for ~40 iterations.
  26. Experiment 1: SF Crimes dataset. Best configurations found as the run progressed:
     Completed jobs | 1 | 7 | 16 | 17 | 18 | 39
     Logloss | 2.1600 | 2.1555 | 2.1552 | 2.1551 | 2.1519 | 2.1502
     Elapsed time | 1021 | 8003 | 12742 | 1623 | 1983 | 31561
     Number of layers | 1 | 3 | 3 | 2 | 3 | 2
     Hidden units | 64 | 256 | 256 | 256 | 256 | 256
     Learning rate | 0.01 | 0.1 | 0.01 | 0.01 | 0.021 | 0.026
     Input dropout | 0 | 0.5 | 0.5 | 0.463 | 0.5 | 0.424
     Hidden dropout | 0 | 0 | 0 | 0.024 | 0.089 | 0.092
     Weight decay | 0 | 0 | 0.002 | 0.003 | 0 | 0.002
  27. Experiment #1: With SigOpt:
     Hyper-parameter | Type | Range | Spearmint | SigOpt
     Number of hidden layers | INT | 1, 2, 3 | 2 | 3
     Number of hidden units | INT | 64, 128, 256 | 256 | 256
     Dropout at the input layer | FLOAT | [0, 0.5] | 0.423678 | 0.3141
     Dropout at the hidden layers | FLOAT | [0, 0.75] | 0.091693 | 0.0944
     Learning rate | FLOAT | [0.01, 0.1] | 0.025994 | 0.0979
     L2 weight decay | FLOAT | [0, 0.01] | 0.00238 | 0.0039
     Logloss on the validation set | | | 2.1502 | 2.14892
     Running time for ~40 iterations (hours) | | | 15.8 | 20.1
  28. SF Crimes: time to results
  29. Experiment #2: Airlines data. Columns:
     Year: 1987-2008
     Month: 1-12
     DayofMonth: 1-31
     DayOfWeek: 1 (Monday) - 7 (Sunday)
     DepTime: actual departure time
     CRSDepTime: scheduled departure time
     ArrTime: actual arrival time
     CRSArrTime: scheduled arrival time
     UniqueCarrier: unique carrier code
     FlightNum: flight number
     TailNum: plane tail number
     ActualElapsedTime: in minutes
     CRSElapsedTime: in minutes
     AirTime: in minutes
     ArrDelay: arrival delay, in minutes
     DepDelay: departure delay, in minutes
     Origin: origin IATA airport code
     Dest: destination IATA airport code
     Distance: in miles
     TaxiIn: taxi-in time, in minutes
     TaxiOut: taxi-out time, in minutes
     Cancelled: was the flight cancelled?
     CancellationCode: reason for cancellation
     Diverted: 1 = yes, 0 = no
     CarrierDelay: in minutes
     WeatherDelay: in minutes
     NASDelay: in minutes
     SecurityDelay: in minutes
     LateAircraftDelay: in minutes
     Delayed: is the flight delayed?
  30. Experiment #2 pipeline (training set and val/test set): Feature preprocessing (~900 features) → Truncated SVD (n: # of dims) → feature selection by f-score with the target variable (keep the top alpha percent of features) → K-means (k: # of clusters) → RandomForest #1 … #k (number of trees, maximum depth, percentage of features) → Scoring
  31. Experiment #2: Hyper-parameters:
     Hyperparameter | Type | Range | BayesOpt
     Number of SVD dimensions | INT | [5, 100] | 98
     Top feature percentage | FLOAT | [0.1, 1] | 0.8258
     k (# of clusters) | INT | [1, 6] | 2
     Number of trees (RF) | INT | [50, 500] | 327
     Max. depth (RF) | INT | [1, 20] | 12
     Min. instances per node (RF) | INT | [1, 1000] | 414
     F1-score on the validation set: 0.8736
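For reference, the F1-score reported above is the harmonic mean of precision and recall. A minimal computation on a made-up set of labels (our own toy, not the airline experiment's data):

```python
# F1 = 2 * precision * recall / (precision + recall),
# computed from true/false positives and false negatives.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # toy ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # toy predictions

tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))   # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)   # 0.8 on this toy data
```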
  32. Summary: 1. Bayesian Optimization for Hyper-parameter Tuning; 2. Bayesian Optimization for “Automating” Data Science on Spark; 3. Experiments
  33. Getting Started:
     • Blogpost: http://goo.gl/PFyBKI
     • Open-source: Spearmint, hyperopt, SMAC, AutoML
     • Commercial: Whetlab, SigOpt, …
  34. CHECK IT OUT! http://goo.gl/PFyBKI | https://www.arimo.com | @arimoinc @pentagoniac @phvu
