AWS Machine Learning Big Data NYC

ALEXIS PERRIER
@alexip
Linkedin.com/in/alexisperrier
Data Scientist
Slides:

AWS MACHINE LEARNING
PLAN
▸ AWS Machine Learning: predictive analytics
▸ Simple, efficient but somewhat limited
▸ Auto-ML on AWS marketplace

DATA SCIENCE TAKES TIME AND RESOURCES

10 FALLACIES OF DATA SCIENCE
Shane Brennan https://medium.com/towards-data-science/the-ten-fallacies-of-data-science-9b2af78a1862
1. Exists
2. Accessible
3. Consistent
4. Relevant
5. Understandable
6. Processable
7. Reproducibility
8. Compliance and security
9. Results are understood
10. Expected outcomes
Data scientist ROI
Access, Massage, Train Models, Production, Presentation
The plan

“AWS WANTS TO PUT MACHINE
LEARNING IN REACH OF ANY
DEVELOPER”
April 2015 - Techcrunch
MACHINE LEARNING AS A SERVICE

AWS MACHINE LEARNING
WHAT IT DOES
▸ Supervised Predictive Analytics
▸ On structured data and text,
▸ Outcome as function of variables,
ground truth known on subset
WHAT IT DOES NOT
▸ Unsupervised learning
▸ Reinforcement learning
▸ Deep learning

WORK FLOW - AWS ML PROJECT
1. Create a datasource from S3, RDS or Redshift
2. Transform the data with recipes (optional)
3. Train a model
4. Evaluate
5. Create endpoints
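The five steps above map onto calls of the boto3 `machinelearning` client. The following is a minimal sketch, not the speaker's code: the resource IDs, the `prefix` argument, the 70/30 split and the S3 URIs are illustrative assumptions. The real service runs these operations asynchronously, so a production script would poll each resource until it is ready before moving to the next step.

```python
import json


def run_workflow(client, s3_data_uri, s3_schema_uri, prefix="titanic"):
    """Issue the five workflow calls in order and return the IDs used.

    `client` is expected to behave like boto3.client("machinelearning").
    All IDs, names and URIs are illustrative placeholders.
    """
    ids = {
        "datasource_train": f"{prefix}-ds-train",
        "datasource_eval": f"{prefix}-ds-eval",
        "model": f"{prefix}-model",
        "evaluation": f"{prefix}-eval",
    }

    # 1. Create datasources: first 70% of rows for training, last 30% for evaluation.
    for key, begin, end in [("datasource_train", 0, 70), ("datasource_eval", 70, 100)]:
        client.create_data_source_from_s3(
            DataSourceId=ids[key],
            DataSourceName=ids[key],
            DataSpec={
                "DataLocationS3": s3_data_uri,
                "DataSchemaLocationS3": s3_schema_uri,
                "DataRearrangement": json.dumps(
                    {"splitting": {"percentBegin": begin, "percentEnd": end}}
                ),
            },
            ComputeStatistics=True,
        )

    # 2. Recipe step is optional: no Recipe argument is passed below,
    #    so the service falls back to its suggested recipe.
    # 3. Train a binary classification model.
    client.create_ml_model(
        MLModelId=ids["model"],
        MLModelName=ids["model"],
        MLModelType="BINARY",
        TrainingDataSourceId=ids["datasource_train"],
    )

    # 4. Evaluate the model on the held-out datasource.
    client.create_evaluation(
        EvaluationId=ids["evaluation"],
        EvaluationName=ids["evaluation"],
        MLModelId=ids["model"],
        EvaluationDataSourceId=ids["datasource_eval"],
    )

    # 5. Create a real-time prediction endpoint.
    client.create_realtime_endpoint(MLModelId=ids["model"])
    return ids


# Usage (requires boto3 and AWS credentials; bucket name is hypothetical):
#   import boto3
#   run_workflow(boto3.client("machinelearning"),
#                "s3://my-bucket/titanic.csv",
#                "s3://my-bucket/titanic.csv.schema")
```

Passing the client in as an argument keeps the sketch easy to dry-run with a stub object before touching a real account.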

DATASOURCE
1. AWS extracts the schema
2. AWS analyses the data
3. Provides simple visualization
4. Offers default transformation
Sources: S3, Redshift, RDS (CLI - SDK only)

TITANIC DATASET - DEFAULT SCHEMA
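The schema from this slide is not reproduced in the text. As a rough illustration of the kind of schema document the service works with, the sketch below guesses an attribute type per column and emits a schema in the Amazon ML JSON layout. The type-guessing heuristic, the sample rows and the column subset are assumptions made for the example; they are not the rules AWS applies.

```python
import csv
import io
import json

# Illustrative rows only, shaped like a subset of the Titanic columns.
SAMPLE_CSV = "\n".join([
    "survived,pclass,name,sex,age,fare",
    "1,1,Passenger One,female,29,211.34",
    "0,3,Passenger Two,male,22,7.25",
    "1,2,Passenger Three,female,,30.07",
])


def guess_type(values):
    """Naive, hypothetical heuristic mapping a column to an attribute type."""
    present = [v for v in values if v != ""]
    if present and all(v in ("0", "1") for v in present):
        return "BINARY"
    try:
        for v in present:
            float(v)
        return "NUMERIC"
    except ValueError:
        pass
    # Free text is assumed to contain spaces; anything else is a category.
    if any(" " in v for v in present):
        return "TEXT"
    return "CATEGORICAL"


def build_schema(csv_text, target):
    """Build a schema dict in the Amazon ML JSON layout from CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys())
    return {
        "version": "1.0",
        "targetAttributeName": target,
        "dataFormat": "CSV",
        "dataFileContainsHeader": True,
        "attributes": [
            {"attributeName": c, "attributeType": guess_type([r[c] for r in rows])}
            for c in columns
        ],
    }


schema = build_schema(SAMPLE_CSV, target="survived")
print(json.dumps(schema, indent=2))
```

With this heuristic `pclass` comes out as NUMERIC; a guessed schema is a starting point to review, and overriding such a column to CATEGORICAL by hand is a reasonable correction.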

TITANIC DATASET - SIMPLE ANALYSIS

SCHEMA, RECIPES AND FEATURES
▸ From the data, AWS suggests the optimal transformations - recipe
▸ 7 transformations are available
▸ Text: N-gram, Orthogonal Sparse Bigram, Lowercase, Punctuation
▸ TF-IDF by default, no stop words, no POS, Lemma, …
▸ Categorical: Cartesian product
▸ Numeric: Normalization, Quantile Binning
▸ Quantile binning (QB): captures non-linearities in continuous variables by turning numeric values into categorical ones
▸ Recipe is downloadable
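A recipe is a JSON document with `groups`, `assignments` and `outputs` sections. The sketch below assembles one in Python using several of the transformations listed above. The column names (`age`, `fare`, `name`, `sex`, `embarked`), their assumed types, the group and variable names, and the bin count are illustrative choices, not the recipe AWS suggests for any particular dataset.

```python
import json


def build_recipe(numeric_columns, n_bins=10):
    """Return a recipe dict; column, group and variable names are hypothetical."""
    return {
        "groups": {
            # Numeric columns that will share the same quantile binning.
            "NUMERIC_VARS_QB": "group({})".format(",".join(numeric_columns)),
        },
        "assignments": {
            # Text: strip punctuation, lowercase, then build bigrams.
            "name_bigrams": "ngram(lowercase(no_punct(name)),2)",
            # Categorical: interaction feature from two columns.
            "sex_by_embarked": "cartesian(sex,embarked)",
        },
        "outputs": [
            "ALL_CATEGORICAL",
            "quantile_bin(NUMERIC_VARS_QB,{})".format(n_bins),
            "name_bigrams",
            "sex_by_embarked",
        ],
    }


recipe = build_recipe(["age", "fare"], n_bins=10)
print(json.dumps(recipe, indent=2))
```

The serialized string is what would be passed as the recipe when creating a model; only variables listed under `outputs` reach the learning algorithm.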

TITANIC DATASET - DATA TRANSFORMATION

TRAIN YOUR MODEL
STOCHASTIC GRADIENT DESCENT
One model to rule them all
▸ Simple
▸ Regularization
▸ Epochs
▸ Shuffling
▸ That’s it!

STOCHASTIC GRADIENT DESCENT - TUNING
Epochs
Shuffling
Regularization
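To make the three knobs concrete, here is a small pure-Python logistic regression trained by stochastic gradient descent. It is a teaching sketch under stated assumptions, not the service's implementation: the learning rate `lr`, the L2-only penalty, the default values and the toy data are all choices made for this example.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_sgd(X, y, epochs=10, l2=1e-4, lr=0.1, shuffle=True, seed=0):
    """Fit weights and bias with one gradient update per training example.

    epochs  : number of passes over the data
    l2      : L2 regularization strength (shrinks the weights, not the bias)
    shuffle : reshuffle the example order before every pass
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        if shuffle:
            rng.shuffle(order)
        for i in order:
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            err = p - y[i]  # gradient of the log loss w.r.t. the linear score
            w = [wj - lr * (err * xj + l2 * wj) for wj, xj in zip(w, X[i])]
            b -= lr * err
    return w, b


def predict(w, b, x):
    """Return the predicted class (0 or 1) for one example."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Toy, linearly separable data: negative x -> class 0, positive x -> class 1.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [0, 0, 1, 1]
w, b = train_sgd(X, y, epochs=50)
print([predict(w, b, x) for x in X])
```

More epochs let the weights keep moving toward a better fit, stronger regularization pulls them back toward zero, and shuffling keeps the order of the rows from biasing the updates.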

Stochastic Gradient in scikit-learn

CONVERGENCE - POWERFUL QUANTILE BINNING
[Figure: two plots, each labelled Accuracy]
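Quantile binning helps a linear model because each bin gets its own weight, so the model can follow a non-linear relationship in a continuous variable. The sketch below shows one simple way to compute equal-frequency bins and one-hot encode them. The exact algorithm used by the service is not described in the slides; the function names and the cut-point rule here are assumptions for illustration.

```python
import bisect


def quantile_bin_edges(values, n_bins):
    """Return n_bins - 1 cut points so each bin holds about the same number of values."""
    ordered = sorted(values)
    return [ordered[(k * len(ordered)) // n_bins] for k in range(1, n_bins)]


def to_bin(value, edges):
    """Return the 0-based index of the bin that contains value."""
    return bisect.bisect_right(edges, value)


def one_hot(index, n_bins):
    """Encode a bin index as a 0/1 vector with a single 1."""
    return [1 if i == index else 0 for i in range(n_bins)]


ages = [1, 2, 3, 4, 5, 6, 7, 8]
edges = quantile_bin_edges(ages, n_bins=4)
print(edges)
print([to_bin(a, edges) for a in ages])
print(one_hot(to_bin(4, edges), 4))
```

After this transformation a numeric column becomes a categorical one, which is the "numeric to categorical" effect mentioned in the recipe slide.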

STRONG POINTS
▸ Powerful modeling: SGD + quantile binning
▸ AWS ecosystem
▸ Multiple sources (S3, RDS, Redshift)
▸ Simple to set up and use
▸ Great for benchmarking
▸ No need for production code!
▸ CLI - SDKs (python, …)

ROOM FOR IMPROVEMENT
▸ No cross validation!
▸ Can’t export your trained models (*)
▸ No scripting
▸ Limited data visualization
▸ Limited feature engineering
▸ SGD model only: no forests, SVMs, Bayes, …
▸ No deep learning (EC2)
(*) Stealing Machine Learning Models via Prediction APIs
http://www.cs.unc.edu/~reiter/papers/2016/USENIX.pdf
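Cross validation is not built in, but it can be approximated by hand: create one training and one held-out datasource per fold, then train and evaluate a model on each pair. The sketch below only generates the `DataRearrangement` strings for k folds, relying on the `complement` flag of the splitting option to select "everything except this range". The helper name and the rounding of fold boundaries are assumptions made for this example.

```python
import json


def kfold_rearrangements(k):
    """Return k (training, held_out) pairs of DataRearrangement JSON strings."""
    folds = []
    for i in range(k):
        begin = round(100 * i / k)
        end = round(100 * (i + 1) / k)
        # Held-out fold: rows between begin% and end% of the data.
        held_out = {"splitting": {"percentBegin": begin, "percentEnd": end}}
        # Training fold: all rows outside that range.
        training = {
            "splitting": {"percentBegin": begin, "percentEnd": end, "complement": True}
        }
        folds.append((json.dumps(training), json.dumps(held_out)))
    return folds


for training, held_out in kfold_rearrangements(4):
    print(training, "|", held_out)
```

Each string would be passed as the `DataRearrangement` value when creating the corresponding datasource; averaging the k evaluation scores gives the cross-validated estimate. This multiplies the number of datasources, models and evaluations by k, and their cost with it.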

GREAT TIME SAVER BUT
STILL A NEED FOR DOMAIN EXPERTISE
AND FEATURE ENGINEERING

Modeling is simplified
But feature engineering still needs attention
▸ Feature engineering
▸ Include all available datasets
▸ Feature importance

AUTOMATIC FEATURE ENGINEERING
Smart (not naive)
Bayesian Optimization
‣ Feature importance
‣ Composite features surfacing
‣ Complementary dataset integration

ON AWS MARKETPLACE
EC2 INSTANCE

DATA EXPLORATION - FEATURE SELECTION

+
‣ Reduced time to market (TTM) and total cost of ownership (TCO)
‣ Fewer resources
‣ Powerful benchmarking
‣ Fast iterations
‣ Holistic data integration

Jerry Hargrove @awsgeek https://www.awsgeek.com/posts/amazon-machine-learning-summary

LET’S CONNECT!
@alexip
linkedin.com/in/alexisperrier
THANK YOU
