
AWS Machine Learning Big Data NYC

Presentation of the AWS Machine Learning platform at the Global Big Data conference - NYC 2017 Oct 24 - Alexis Perrier

Published in: Data & Analytics

  1. 1. ALEXIS PERRIER @alexip Linkedin.com/in/alexisperrier Data Scientist Slides:
  2. 2. AWS MACHINE LEARNING PLAN ▸ AWS Machine Learning: predictive analytics ▸ Simple, efficient but somewhat limited ▸ Auto-ML on AWS marketplace
  3. 3. DATA SCIENCE TAKES TIME AND RESOURCES
  4. 4. 10 FALLACIES OF DATA SCIENCE (Shane Brennan, https://medium.com/towards-data-science/the-ten-fallacies-of-data-science-9b2af78a1862) 1. Exists 2. Accessible 3. Consistent 4. Relevant 5. Understandable 6. Processable 7. Reproducibility 8. Compliance and security 9. Results are understood 10. Expected outcomes THE PLAN: Access, Massage, Train Models, Production, Presentation. Data scientist ROI
  5. 5. MACHINE LEARNING AS A SERVICE “AWS WANTS TO PUT MACHINE LEARNING IN REACH OF ANY DEVELOPER” (TechCrunch, April 2015)
  6. 6. AWS DATA ECOSYSTEM
  7. 7. AWS MACHINE LEARNING WHAT IT DOES ▸ Supervised predictive analytics ▸ On structured data and text ▸ Outcome as a function of variables, with ground truth known on a subset WHAT IT DOES NOT ▸ Unsupervised learning ▸ Reinforcement learning ▸ Deep learning
  8. 8. WORKFLOW - AWS ML PROJECT 1. Create a datasource from S3, RDS or Redshift 2. Transform the data with recipes (optional) 3. Train a model 4. Evaluate 5. Create endpoints
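
The five workflow steps can be sketched against the service API. This is a minimal sketch, not the presenter's code: the method and parameter names are those of the boto3 `machinelearning` client as I understand them, while `run_workflow`, the ID prefix, the 70/30 split and the S3 locations are hypothetical. Each call is asynchronous on the service side, so real code would have to poll until every resource is COMPLETED before moving to the next step.

```python
import json


def run_workflow(client, prefix, s3_data, s3_schema):
    """Issue the workflow calls in order and return the resource IDs used.

    `client` is expected to behave like boto3.client("machinelearning").
    All IDs, the 70/30 split and the S3 locations are illustrative assumptions.
    """
    ids = {name: f"{prefix}-{name}" for name in ("train", "eval", "model", "evaluation")}

    # Step 1: two datasources over the same file, 70% to train, 30% to evaluate.
    for name, begin, end in (("train", 0, 70), ("eval", 70, 100)):
        client.create_data_source_from_s3(
            DataSourceId=ids[name],
            DataSourceName=ids[name],
            DataSpec={
                "DataLocationS3": s3_data,
                "DataSchemaLocationS3": s3_schema,
                "DataRearrangement": json.dumps(
                    {"splitting": {"percentBegin": begin, "percentEnd": end}}
                ),
            },
            ComputeStatistics=True,
        )

    # Steps 2 and 3: train a binary classifier. No Recipe is passed here,
    # on the assumption that the service then falls back to its suggested recipe.
    client.create_ml_model(
        MLModelId=ids["model"],
        MLModelName=ids["model"],
        MLModelType="BINARY",
        TrainingDataSourceId=ids["train"],
    )

    # Step 4: evaluate the model on the held-out datasource.
    client.create_evaluation(
        EvaluationId=ids["evaluation"],
        EvaluationName=ids["evaluation"],
        MLModelId=ids["model"],
        EvaluationDataSourceId=ids["eval"],
    )

    # Step 5: expose the model for real-time predictions.
    client.create_realtime_endpoint(MLModelId=ids["model"])
    return ids
```

Passing the client in as an argument keeps the sketch testable without AWS credentials; in practice it would be created with `boto3.client("machinelearning")`.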
  9. 9. DATASOURCE Sources: S3, Redshift, RDS (CLI - SDK only) 1. AWS extracts the schema 2. AWS analyses the data 3. Provides simple visualization 4. Offers default transformation
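
To make the schema extraction step concrete, here is a rough sketch of guessing attribute types from a CSV sample and emitting a schema document. The inference rules and the helper names `infer_type` and `build_schema` are my own assumptions, not the service's actual logic; the schema keys follow the Amazon ML schema format as I understand it.

```python
import csv
import io


def infer_type(values):
    """Guess an attribute type from string values (hypothetical heuristic)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "CATEGORICAL"
    if set(non_empty) <= {"0", "1"}:
        return "BINARY"
    try:
        for v in non_empty:
            float(v)
        return "NUMERIC"
    except ValueError:
        pass
    # Free text usually contains spaces; short labels usually do not.
    if any(" " in v for v in non_empty):
        return "TEXT"
    return "CATEGORICAL"


def build_schema(csv_text, target):
    """Build a schema dict from CSV text that has a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    attributes = [
        {
            "attributeName": name,
            "attributeType": infer_type([row[name] for row in rows]),
        }
        for name in reader.fieldnames
    ]
    return {
        "version": "1.0",
        "targetAttributeName": target,
        "dataFormat": "CSV",
        "dataFileContainsHeader": True,
        "attributes": attributes,
    }
```

The suggested schema is only a starting point: a numeric-looking column such as a passenger class may be better treated as categorical, which is the kind of correction the schema review step allows.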
  10. 10. TITANIC DATASET - DEFAULT SCHEMA
  11. 11. TITANIC DATASET - SIMPLE ANALYSIS
  12. 12. SCHEMA, RECIPES AND FEATURES ▸ From the data, AWS suggests the optimal transformations - recipe ▸ 7 transformations are available ▸ Text: N-gram, Orthogonal Sparse Bigram, Lowercase, Punctuation ▸ TF-IDF by default, no stop words, no POS, Lemma, … ▸ Categorical: Cartesian product ▸ Numeric: Normalization, Quantile Binning ▸ QB: non linearities in continuous, numeric to categorical ▸ Recipe is downloadable
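
Quantile binning, the numeric-to-categorical transformation listed above, can be illustrated in a few lines. This is a simplified sketch under my own assumptions (lower nearest-rank quantiles, one-hot output) and not the service's implementation; `quantile_edges`, `quantile_bin` and `one_hot` are illustrative names.

```python
from bisect import bisect_right


def quantile_edges(values, n_bins):
    """Return the n_bins - 1 cut points that split values into equal-count bins."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * i) // n_bins] for i in range(1, n_bins)]


def quantile_bin(value, edges):
    """Map a numeric value to the index of its bin (0 .. len(edges))."""
    return bisect_right(edges, value)


def one_hot(bin_index, n_bins):
    """Encode a bin index as a 0/1 vector so a linear model gets one weight per bin."""
    return [1 if i == bin_index else 0 for i in range(n_bins)]
```

Because each bin gets its own weight, a linear model trained on the binned feature can follow a non-linear relationship between the original numeric value and the outcome.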
  13. 13. TITANIC DATASET - DATA TRANSFORMATION
  14. 14. TITANIC DATASET - DATA TRANSFORMATION
  15. 15. TRAIN YOUR MODEL STOCHASTIC GRADIENT DESCENT One model to rule them all ▸ Simple ▸ Regularization ▸ Epochs ▸ Shuffling ▸ That’s it!
  16. 16. STOCHASTIC GRADIENT DESCENT - TUNING Epochs Shuffling Regularization
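
The three tuning knobs (epochs, shuffling, regularization) map directly onto a plain SGD loop. Below is a minimal logistic regression sketch in pure Python, written for illustration only; the learning rate, the L2 form of the penalty and the function names are my assumptions. In the service, the corresponding training parameters are, to my knowledge, `sgd.maxPasses`, `sgd.shuffleType` and `sgd.l1RegularizationAmount` / `sgd.l2RegularizationAmount`.

```python
import math
import random


def sigmoid(z):
    # Clamp to avoid overflow in math.exp for extreme scores.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def sgd_logistic(rows, labels, epochs=10, l2=1e-6, lr=0.1, shuffle=True, seed=0):
    """Train logistic regression with SGD; returns (weights, bias)."""
    rng = random.Random(seed)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    order = list(range(len(rows)))
    for _ in range(epochs):              # epochs: passes over the data
        if shuffle:                      # shuffling: new sample order each pass
            rng.shuffle(order)
        for i in order:
            score = bias + sum(w * x for w, x in zip(weights, rows[i]))
            error = sigmoid(score) - labels[i]
            # regularization: the l2 term shrinks each weight toward zero
            weights = [w - lr * (error * x + l2 * w) for w, x in zip(weights, rows[i])]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, row):
    """Return the predicted probability of the positive class."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))
```

More epochs let the weights converge further, shuffling prevents the sample order from biasing the updates, and a larger regularization amount trades training fit for simpler weights.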
  17. 17. Stochastic Gradient in scikit-learn
  18. 18. CONVERGENCE - POWERFUL QUANTILE BINNING (two plots, axis: Accuracy)
  19. 19. EVALUATION
  20. 20. EVALUATION
  21. 21. TEXT END POINT - STREAMING
  22. 22. AWS INTEGRATION
  23. 23. STRONG POINTS ▸ Powerful modeling: SGD + quantile binning ▸ AWS ecosystem ▸ Multiple sources (S3, RDS, Redshift) ▸ Simple to set up and use ▸ Great for benchmarking ▸ No need for production code! ▸ CLI - SDKs (python, …)
  24. 24. ROOM FOR IMPROVEMENTS ▸ No cross validation! ▸ Can’t export your trained models (*) ▸ No scripting ▸ Limited data visualization ▸ Limited feature engineering ▸ SGD model only: no forests, SVMs, Bayes, … ▸ No deep learning (EC2) (*) Stealing Machine Learning Models via Prediction APIs: http://www.cs.unc.edu/~reiter/papers/2016/USENIX.pdf
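
Although cross validation is not built in, it can be approximated by hand through the `DataRearrangement` setting of a datasource. The sketch below generates the splitting strings for k folds; it assumes the `splitting` options `percentBegin`, `percentEnd`, `strategy` and `complement` behave as I recall from the service documentation, and `kfold_rearrangements` is a hypothetical helper name. Each pair would then need its own training datasource, model and evaluation.

```python
import json


def kfold_rearrangements(k):
    """Return k (training, evaluation) DataRearrangement JSON strings.

    Fold i evaluates on the i-th slice of the data and trains on the
    complement of that slice.
    """
    folds = []
    for i in range(k):
        split = {
            "percentBegin": (100 * i) // k,
            "percentEnd": (100 * (i + 1)) // k,
            "strategy": "sequential",
        }
        evaluation = json.dumps({"splitting": split})
        # Assumption: "complement" selects everything outside the slice.
        training = json.dumps({"splitting": {**split, "complement": "true"}})
        folds.append((training, evaluation))
    return folds
```

Averaging the k evaluation metrics gives a cross-validated estimate, at the cost of training k models instead of one.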
  25. 25. GREAT TIME SAVER BUT STILL A NEED FOR DOMAIN EXPERTISE AND FEATURE ENGINEERING
  26. 26. AUTO-ML ON AWS
  27. 27. Modeling is simplified, but feature engineering still needs attention ▸ Feature engineering ▸ Include all available datasets ▸ Feature importance
  28. 28. AUTOMATIC FEATURE ENGINEERING Smart (not naive) Bayesian Optimization ▸ Feature importance ▸ Composite features surfacing ▸ Complementary dataset integration
  29. 29. ON AWS MARKETPLACE EC2 INSTANCE
  30. 30. DATA EXPLORATION - FEATURE SELECTION
  31. 31. FEATURE SURFACING
  32. 32. AND ITERATE
  33. 33. + ▸ Reduced TTM + TCO ▸ Fewer resources ▸ Powerful benchmarking ▸ Fast iterations ▸ Holistic data integration
  34. 34. Jerry Hargrove @awsgeek https://www.awsgeek.com/posts/amazon-machine-learning-summary
  35. 35. LET’S CONNECT! @alexip linkedin.com/in/alexisperrier THANK YOU