ALEXIS PERRIER
@alexip
Linkedin.com/in/alexisperrier
Data Scientist
Slides:
▸ AWS Machine Learning : predictive analytics
▸ Simple, efficient but somewhat limited
▸ Auto-ML on AWS marketplace
AWS MACHINE LEARNING
PLAN
DATA SCIENCE TAKES TIME AND RESOURCES
10 FALLACIES OF DATA SCIENCE
Shane Brennan https://medium.com/towards-data-science/the-ten-fallacies-of-data-science-9b2af78a1862
1. Exists
2. Accessible
3. Consistent
4. Relevant
5. Understandable
6. Processable
7. Reproducibility
8. Compliance and security
9. Results are understood
10. Expected outcomes
Data scientist ROI
Access
Massage
Train models
Production
Presentation
The plan
“AWS WANTS TO PUT MACHINE
LEARNING IN REACH OF ANY
DEVELOPER”
April 2015 - Techcrunch
MACHINE LEARNING AS A SERVICE
AWS DATA ECOSYSTEM
AWS MACHINE LEARNING
WHAT IT DOES
▸ Supervised Predictive Analytics
▸ On structured data and text,
▸ Outcome as function of variables,
ground truth known on subset
WHAT IT DOES NOT
▸ Unsupervised learning
▸ Reinforcement learning
▸ Deep learning
WORK FLOW - AWS ML PROJECT
1. Create a datasource from S3, RDS or Redshift
2. Transform the data with recipes (optional)
3. Train a model
4. Evaluate
5. Create endpoints
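The five steps above can be sketched with the AWS SDK for Python. This is a minimal sketch, not a reference implementation: it assumes `client` behaves like `boto3.client("machinelearning")`, and the bucket path, the IDs, the `prefix` argument and the 70/30 split are all made-up placeholders. The calls are asynchronous; waiting for each resource to complete is left out.

```python
import json


def run_workflow(client, s3_uri, schema, prefix="titanic"):
    """Issue the five workflow steps against an Amazon ML client.

    `client` should behave like boto3.client("machinelearning").
    All IDs are hypothetical placeholders derived from `prefix`.
    """
    train_id, eval_id = f"ds-{prefix}-train", f"ds-{prefix}-eval"
    model_id = f"ml-{prefix}"

    # 1. Datasources from S3: first 70% of rows to train, last 30% to evaluate.
    for ds_id, begin, end in [(train_id, 0, 70), (eval_id, 70, 100)]:
        split = {"splitting": {"percentBegin": begin, "percentEnd": end}}
        client.create_data_source_from_s3(
            DataSourceId=ds_id,
            DataSourceName=ds_id,
            DataSpec={
                "DataLocationS3": s3_uri,
                "DataSchema": json.dumps(schema),
                "DataRearrangement": json.dumps(split),
            },
            ComputeStatistics=True,
        )

    # 2 and 3. Train a model. No Recipe is passed here, so the service
    # is expected to fall back to its default recipe.
    client.create_ml_model(
        MLModelId=model_id,
        MLModelName=model_id,
        MLModelType="BINARY",
        TrainingDataSourceId=train_id,
    )

    # 4. Evaluate on the held-out datasource.
    client.create_evaluation(
        EvaluationId=f"ev-{prefix}",
        EvaluationName=f"ev-{prefix}",
        MLModelId=model_id,
        EvaluationDataSourceId=eval_id,
    )

    # 5. Create a real-time endpoint. In practice the model must have
    # finished training first; that wait is omitted in this sketch.
    return client.create_realtime_endpoint(MLModelId=model_id)
```

With a real client, the call would look like `run_workflow(boto3.client("machinelearning"), "s3://<bucket>/<file>.csv", schema)`, where the S3 path is yours to fill in.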
DATASOURCE
AWS extracts the schema
AWS analyses the data
Provides simple visualization
Offers default transformation
S3
Redshift
RDS (CLI - SDK only)
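To make the schema step concrete, here is a rough sketch of what a datasource schema looks like and how attribute types could be guessed from a CSV sample. The guessing rules in `guess_type` are a deliberately naive assumption for illustration, not the service's actual inference logic; the sample rows and the `build_schema` helper are likewise hypothetical.

```python
import csv
import io


def guess_type(values):
    """Naive type guess (illustrative only, not AWS's real rules)."""
    non_empty = [v for v in values if v != ""]
    if non_empty and set(non_empty) <= {"0", "1"}:
        return "BINARY"
    try:
        for v in non_empty:
            float(v)
        return "NUMERIC"
    except ValueError:
        pass
    # Free text if any value contains a space, otherwise a category.
    if any(" " in v for v in non_empty):
        return "TEXT"
    return "CATEGORICAL"


def build_schema(csv_text, target):
    """Build a schema dict in the Amazon ML schema layout from CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    names = list(rows[0].keys())
    return {
        "version": "1.0",
        "targetAttributeName": target,
        "dataFormat": "CSV",
        "dataFileContainsHeader": True,
        "attributes": [
            {"attributeName": n,
             "attributeType": guess_type([r[n] for r in rows])}
            for n in names
        ],
    }


sample = (
    "survived,pclass,name,age,sex\n"
    '1,1,"Allen, Miss Elisabeth",29,female\n'
    '0,3,"Braund, Mr Owen",22,male\n'
)
schema = build_schema(sample, target="survived")
```

Note that `pclass` comes out as NUMERIC under these rules; overriding such a default to CATEGORICAL is exactly the kind of manual schema edit that is worth checking.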
TITANIC DATASET - DEFAULT SCHEMA
TITANIC DATASET - SIMPLE ANALYSIS
SCHEMA, RECIPES AND FEATURES
▸ From the data, AWS suggests the optimal transformations - recipe
▸ 7 transformations are available
▸ Text: N-gram, Orthogonal Sparse Bigram, Lowercase, Punctuation
▸ TF-IDF by default; no stop-word removal, POS tagging, lemmatization, …
▸ Categorical: Cartesian product
▸ Numeric: Normalization, Quantile Binning
▸ Quantile binning (QB): captures non-linearities in continuous variables by turning numeric into categorical
▸ Recipe is downloadable
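As an illustration of the quantile binning idea, the sketch below turns a numeric column into bin indices using only the Python standard library. It is a generic version of the technique, not the service's own implementation; the function name `quantile_bin` and the sample values are chosen for this example.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_bin(values, n_bins):
    """Map each numeric value to a bin index in [0, n_bins - 1].

    Cut points are the quantiles of `values`, so each bin holds
    roughly the same number of observations.
    """
    cuts = quantiles(values, n=n_bins)  # n_bins - 1 cut points
    return [bisect_right(cuts, v) for v in values], cuts


values = [1, 2, 3, 4, 5, 6, 7, 8]
bins, cuts = quantile_bin(values, 4)
# bins -> [0, 0, 1, 1, 2, 2, 3, 3]
# cuts -> [2.25, 4.5, 6.75]
```

Once each bin is treated as a category, a linear model can learn one weight per bin, which is how binning lets it fit a non-linear relationship with the original variable.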
TITANIC DATASET - DATA TRANSFORMATION
TRAIN YOUR MODEL
STOCHASTIC GRADIENT DESCENT
One model to rule them all
▸ Simple
▸ Regularization
▸ Epochs
▸ Shuffling
▸ That’s it!
STOCHASTIC GRADIENT DESCENT - TUNING
Epochs
Shuffling
Regularization
Stochastic Gradient in scikit-learn
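The three tuning knobs (epochs, shuffling, regularization) can be seen in a small from-scratch version of SGD for logistic regression. This is a teaching sketch in plain Python, assuming L2 regularization and a fixed learning rate; it is not the service's algorithm, and the parameter names are chosen here. scikit-learn's `SGDClassifier` exposes comparable settings (`max_iter`, `shuffle`, `alpha`).

```python
import math
import random


def sgd_logistic(X, y, epochs=100, l2=0.01, lr=0.1, shuffle=True, seed=0):
    """Fit logistic regression with plain stochastic gradient descent."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):              # Epochs: passes over the data
        if shuffle:                      # Shuffling: new order each pass
            rng.shuffle(order)
        for i in order:
            z = b + sum(wj * xj for wj, xj in zip(w, X[i]))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y[i]
            # Regularization: the l2 * wj term shrinks weights toward 0.
            w = [wj - lr * (err * xj + l2 * wj) for wj, xj in zip(w, X[i])]
            b -= lr * err
    return w, b


def predict(w, b, x):
    """Return class 1 when the linear score is positive, else 0."""
    return 1 if b + sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0


X = [[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = sgd_logistic(X, y)
accuracy = sum(predict(w, b, x) == t for x, t in zip(X, y)) / len(y)
```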
CONVERGENCE - POWERFUL QUANTILE BINNING
[Two plots; axis label: Accuracy]
EVALUATION
END POINT - STREAMING
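A real-time prediction against an endpoint might look like the sketch below. It assumes `client` behaves like `boto3.client("machinelearning")` and that the endpoint URL comes from the endpoint-creation response; the helper name `predict_one`, the model ID and the record fields in the comment are hypothetical.

```python
def predict_one(client, model_id, endpoint_url, record):
    """Request a single real-time prediction.

    Record values are sent as strings, so everything is converted
    with str() before the call.
    """
    response = client.predict(
        MLModelId=model_id,
        Record={key: str(value) for key, value in record.items()},
        PredictEndpoint=endpoint_url,
    )
    return response["Prediction"]


# Hypothetical usage with a real client:
#   prediction = predict_one(client, "ml-titanic", endpoint_url,
#                            {"pclass": 3, "sex": "male", "age": 22})
```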
AWS INTEGRATION
STRONG POINTS
▸ Powerful modeling: SGD + quantile binning
▸ AWS ecosystem
▸ Multiple sources (S3, RDS, Redshift)
▸ Simple to set up and use
▸ Great for benchmarking
▸ No need for production code!
▸ CLI - SDKs (python, …)
ROOM FOR IMPROVEMENT
‣ No cross validation!
‣ Can’t export your trained models*
‣ No scripting
(*) Stealing Machine Learning Models via Prediction APIs: http://www.cs.unc.edu/~reiter/papers/2016/USENIX.pdf
▸ Limited data visualization
▸ Limited feature engineering
▸ SGD model only: no forests, SVMs, Bayes, …
▸ No deep learning (EC2)
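The missing cross-validation can be approximated by hand: create one training and one evaluation datasource per fold, each defined by a data rearrangement string. The sketch below only generates those strings. It assumes the `splitting` layout with `percentBegin`, `percentEnd` and `complement` keys; the helper name `kfold_rearrangements` is made up, and the default sequential split presumes the rows are not sorted by the target.

```python
import json


def kfold_rearrangements(k):
    """Return k pairs of DataRearrangement strings, one pair per fold.

    The evaluation datasource covers one slice of the data; the
    training datasource is its complement.
    """
    folds = []
    for i in range(k):
        split = {
            "percentBegin": i * 100 // k,
            "percentEnd": (i + 1) * 100 // k,
        }
        folds.append({
            "evaluate": json.dumps({"splitting": split}),
            "train": json.dumps(
                {"splitting": {**split, "complement": "true"}}
            ),
        })
    return folds


folds = kfold_rearrangements(4)
```

Each pair would then feed a datasource, a model and an evaluation, and the k evaluation scores are averaged outside the service.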
GREAT TIME SAVER BUT
STILL A NEED FOR DOMAIN EXPERTISE
AND FEATURE ENGINEERING
AUTO-ML ON AWS
Modeling is simplified
But feature engineering still needs attention
Feature engineering
Include all available datasets
Feature importance
‣ Feature importance
‣ Composite features surfacing
‣ Complementary dataset integration
AUTOMATIC FEATURE ENGINEERING
Smart (Not naive)

Bayesian Optimization
ON AWS MARKETPLACE
EC2 INSTANCE
DATA EXPLORATION - FEATURE SELECTION
FEATURE SURFACING
AND ITERATE
+
‣ Reduced time to market (TTM) and total cost of ownership (TCO)
‣ Fewer resources
‣ Powerful benchmarking
‣ Fast iterations
‣ Holistic data integration
Jerry Hargrove @awsgeek https://www.awsgeek.com/posts/amazon-machine-learning-summary
LET’S CONNECT!
@alexip
linkedin.com/in/alexisperrier
THANK YOU

AWS Machine Learning Big Data NYC