
Lessons learned from building practical deep learning systems

Guest lecture at the Fullstack Deep Learning Bootcamp



  1. LESSONS LEARNED from building practical Deep Learning systems Xavier Amatriain (@xamat)
  2. A bit about myself...
  3. A bit about myself... ● PhD in Audio and Music Signal Processing and Modeling ● Researcher in Recommender Systems for several years ● Led ML Research/Engineering at Netflix ● VP of Engineering at Quora ● Currently co-founder/CTO at Curai (providing the world’s best healthcare to everyone)
  4. A bit about Curai...
  5. What are we doing? ● Mission: Provide the world's best healthcare for everyone ● Product: User-facing mobile primary care app ● Team: Building an awesome and diverse team ● Approach: State-of-the-art AI/ML + product/UX/clinical. AI-based interaction, AI + health coaches, AI + doctors
  6. Peer-reviewed research at Curai
  7. Lessons learned...
  8. Lesson 1: More Data or Better Data?
  9. More data or better models? Really?
  10. More data or better models? Sometimes, it’s not about more data
  11. More data or better models? Norvig: “Google does not have better algorithms, only more data.” Many features / low-bias models
  12. More data or better models? Sometimes you might not need all your “Big Data”. [Figure: testing accuracy vs. number of training examples (in millions), with accuracy flattening out as the training set grows]
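The flattening curve on slide 12 is easy to reproduce on synthetic data. A minimal sketch (toy data and a nearest-centroid classifier, purely illustrative; none of these numbers come from the talk): accuracy improves quickly with the first examples, then plateaus.

```python
# Sketch of the diminishing-returns curve on slide 12: test accuracy of a
# nearest-centroid classifier as the training set grows. Synthetic data,
# illustrative numbers only.
import random

random.seed(0)

def sample(n):
    """Two 1-D Gaussian classes centered at 0 and 2, alternating labels."""
    return [(random.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(n)]

test_set = sample(2000)

def accuracy_with(n_train):
    train = sample(n_train)
    # Estimate one centroid per class from the training data.
    centroids = {}
    for label in (0, 1):
        xs = [x for x, y in train if y == label]
        centroids[label] = sum(xs) / len(xs)
    correct = sum(
        1 for x, y in test_set
        if min(centroids, key=lambda c: abs(x - centroids[c])) == y
    )
    return correct / len(test_set)

sizes = [10, 100, 1000, 10000]
curve = [accuracy_with(n) for n in sizes]
for n, acc in zip(sizes, curve):
    print(f"{n:>6} examples -> test accuracy {acc:.3f}")
```

Past a certain point, extra examples barely move the centroid estimates, so accuracy stops improving: more data is not always the bottleneck.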
  13. What about Deep Learning? Breakthroughs in AI, with the first availability of the key dataset and the first proposal of the key algorithm:
      ● 1994: Human-level spontaneous speech recognition. Dataset: spoken Wall Street Journal articles and other texts (1991). Algorithm: Hidden Markov Model (1984).
      ● 1997: IBM Deep Blue defeated Garry Kasparov. Dataset: 700,000 Grandmaster chess games, aka “The Extended Book” (1991). Algorithm: NegaScout planning algorithm (1983).
      ● 2005: Google’s Arabic-to-English and Chinese-to-English translation. Dataset: 1.8 trillion tokens from Google Web and News pages (collected in 2005). Algorithm: statistical machine translation algorithm (1988).
      ● 2011: IBM Watson became the world Jeopardy! champion. Dataset: 8.6 million documents from Wikipedia, Wiktionary, Wikiquote, and Project Gutenberg (updated in 2005). Algorithm: Mixture-of-Experts algorithm (1991).
      ● 2014: Google’s GoogLeNet object classification at near-human performance. Dataset: ImageNet corpus of 1.5 million labeled images and 1,000 object categories (2010). Algorithm: convolutional neural network (1989).
      ● 2015: Google’s DeepMind achieved human parity in playing 29 Atari games by learning general control from video. Dataset: Arcade Learning Environment dataset of over 50 Atari games (2013). Algorithm: Q-learning (1992).
      The average elapsed time between key algorithm proposals and corresponding advances was about 18 years, whereas the average elapsed time between key dataset availabilities and corresponding advances was less than 3 years, about 6 times faster.
  14. What about Deep Learning? Models and Recipes: pretrained models available, trained using OpenNMT → English→German → German→English → English summarization → Multi-way FR,ES,PT,IT,RO <> FR,ES,PT,IT,RO. More models coming soon: Ubuntu Dialog Dataset, syntactic parsing, image-to-text.
  15. More data and better data
  16. Lesson 2: Simple Models >> Complex Models
  17. Football or Futbol?
  18. Occam’s razor: given two models that perform more or less equally, you should always prefer the less complex one. Deep Learning might not be preferred, even if it squeezes out a +1% in accuracy.
  19. Reasons to prefer a simpler model
  20. Reasons to prefer a simpler model: system complexity, maintenance, explainability… and many others. [Figure 3: GoogLeNet network with all the bells and whistles]
  21. A real-life example. Goal: supervised classification → 40 features → 10k examples. What did the ML engineer choose? → A multi-layer ANN trained with TensorFlow. What was the proposed next step? → Try ConvNets. Where is the problem? → Hours to train, already looking into distributing → There are much simpler approaches.
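To make slide 21 concrete: a problem of that size fits comfortably in a plain logistic regression. A minimal sketch on synthetic data (a hypothetical stand-in for the real 40-feature / 10k-example dataset): full-batch gradient descent trains in seconds, with no deep net or distributed setup.

```python
# Logistic regression on a synthetic 10k x 40 dataset, trained with
# full-batch gradient descent. Data and hyperparameters are illustrative.
import time
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 40
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)   # linearly separable labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(d)
start = time.perf_counter()
for _ in range(500):                  # full-batch gradient descent
    grad = X.T @ (sigmoid(X @ w) - y) / n
    w -= 0.5 * grad
elapsed = time.perf_counter() - start

accuracy = ((sigmoid(X @ w) > 0.5) == y).mean()
print(f"train accuracy {accuracy:.3f} in {elapsed:.2f}s")
```

The point is not that logistic regression always wins, but that it is the right baseline to beat before reaching for hours-long training runs.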
  22. Lesson 3: … but sometimes you do need a complex model
  23. Better models and features that “don’t work”. E.g., you have a linear model and have been selecting and optimizing features for that model → A more complex model with the same features → improvement not likely → More expressive features alone → improvement not likely. More complex features may require a more complex model; a more complex model may not show improvements with a feature set that is too simple.
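The classic illustration of this interplay is XOR: no linear model can fit the raw features, but adding one more expressive feature (the product x1*x2) makes a linear rule sufficient. A small self-contained check (brute-force over a grid of weights, purely for demonstration):

```python
# Feature/model interplay from slide 23: a linear threshold rule cannot fit
# XOR on raw features, but can once a cross feature x1*x2 is added.
from itertools import product

xor_data = [((x1, x2), x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]

def linear_fits(features, labels):
    """Brute-force search for a linear threshold rule over a small weight grid."""
    n_dims = len(features[0])
    grid = [w / 2 for w in range(-8, 9)]
    for ws in product(grid, repeat=n_dims):
        for b in grid:
            if all(
                (sum(w * f for w, f in zip(ws, fs)) + b > 0) == bool(y)
                for fs, y in zip(features, labels)
            ):
                return True
    return False

raw = [fs for fs, _ in xor_data]
labels = [y for _, y in xor_data]
crossed = [fs + (fs[0] * fs[1],) for fs in raw]   # add the x1*x2 cross feature

print("linear model fits raw XOR features:", linear_fits(raw, labels))        # False
print("linear model fits with cross feature:", linear_fits(crossed, labels))  # True
```

The same features that "don't work" for the linear model would be redundant for a model expressive enough to learn the cross on its own, which is exactly the point of the slide.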
  24. Lesson 4: Yes, you should care about Feature Engineering
  25. Feature Engineering example: answer ranking. What is a good Quora answer? Truthful, reusable, provides an explanation, well formatted… How are those dimensions translated into features? → Features that relate to the answer quality itself → Interaction features (upvotes/downvotes, clicks, comments…) → User features (e.g. expertise in topic)
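A hedged sketch of what the interaction features on slide 25 might look like in code. The field names and transforms are illustrative guesses, not Quora's actual feature set:

```python
# Hypothetical interaction features for answer ranking (illustrative only).
import math

def interaction_features(answer):
    upvotes = answer["upvotes"]
    downvotes = answer["downvotes"]
    votes = upvotes + downvotes
    return {
        # Ratios are more comparable across answers than raw counts.
        "upvote_ratio": upvotes / votes if votes else 0.5,
        # Log-scale counts: the 10th upvote says more than the 1000th.
        "log_upvotes": math.log1p(upvotes),
        "log_clicks": math.log1p(answer["clicks"]),
        # Engagement depth: comments relative to how often it is seen.
        "comments_per_click": answer["comments"] / max(answer["clicks"], 1),
    }

example = {"upvotes": 30, "downvotes": 10, "clicks": 400, "comments": 8}
feats = interaction_features(example)
print(feats)
```

Note how each feature encodes a judgment about the domain (ratios over counts, log scaling, normalization by exposure); that judgment is the engineering part.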
  26. Feature Engineering. Properties of a well-behaved ML feature: interpretable, reliable, reusable, transformable. [Fig. (I. Goodfellow, Deep Learning): rule-based systems vs. classic machine learning vs. representation learning vs. deep learning, moving from hand-designed programs and features to learned features; deep learning automates feature discovery]
  27. Deep Learning & Feature Engineering
  28. Deep Learning & Feature Architecture Engineering
  29. Lesson 5: Supervised vs. Unsupervised Learning
  30. Supervised/Unsupervised Learning. Unsupervised learning as dimensionality reduction (e.g. 1: clustering + kNN; e.g. 2: matrix factorization). MF can be interpreted as: Unsupervised → dimensionality reduction à la PCA, or clustering (e.g. NMF); Supervised → labeled targets ~ regression. Unsupervised learning as feature engineering: the “magic” behind combining unsupervised/supervised learning.
  31. Supervised/Unsupervised Learning. One of the “tricks” in Deep Learning is how it combines unsupervised/supervised learning → E.g. stacked autoencoders → E.g. training of convolutional nets. [Figure: stacked autoencoders, input → features I → features II → softmax classifier giving P(y=k | x)] [Figure: convolutional network (ConvNet), 83x83 input through convolution, pooling, and subsampling layers to the output. Non-linearity: half-wave rectification, shrinkage function, sigmoid. Pooling: average, L1, L2, max. Training: supervised (1988-2006), unsupervised + supervised (2006-now)] [Figure: taxonomy of neural network models along supervised (e.g. boosting, SVM, decision tree, perceptron, neural net, RNN, conv. net) and unsupervised (e.g. AE, denoising AE, RBM, sparse coding, DBN, DBM, GMM, Bayes NP) axes]
  32. Supervised/Unsupervised/Self-supervised Learning. Self-supervision → E.g. BERT and other language models
  33. Lesson 6: Everything is an ensemble
  34. Ensembles. The Netflix Prize was won by an ensemble → Initially BellKor was using GBDTs → BigChaos introduced an ANN-based ensemble. Most practical applications of ML run an ensemble → Why wouldn’t you? → At least as good as the best of your methods → Can add completely different approaches (e.g. CF and content-based) → You can use many different models at the ensemble layer: LR, GBDTs, RFs, ANNs…
  35. Ensembles & Feature Engineering. Ensembles are the way to turn any model into a feature! E.g., don’t know if the way to go is Factorization Machines, Tensor Factorization, or RNNs? → Treat each model as a “feature” → Feed them into an ensemble. [Figure: wide models, deep models, and wide & deep models, from sparse features and dense embeddings through hidden layers (rectified linear units) to sigmoid output units]
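Slide 35's "treat each model as a feature" can be sketched in a few lines: the predictions of two base models become the input columns of a simple ensemble layer, here a least-squares blend. The base models are deliberately toy stand-ins, each capturing only part of the signal:

```python
# Stacking sketch: base-model predictions become features of a linear blend.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
x = rng.uniform(-2, 2, size=n)
y_true = np.sin(x) + 0.5 * x + rng.normal(scale=0.1, size=n)

# Two imperfect base models, each capturing part of the signal.
model_a = lambda x: 0.5 * x        # linear part only
model_b = lambda x: np.sin(x)      # periodic part only

# Each base model's prediction is a "feature" of the ensemble layer.
features = np.column_stack([model_a(x), model_b(x), np.ones(n)])
weights, *_ = np.linalg.lstsq(features, y_true, rcond=None)

blend = features @ weights
mse = lambda pred: float(np.mean((pred - y_true) ** 2))
print("model A MSE:", round(mse(model_a(x)), 4))
print("model B MSE:", round(mse(model_b(x)), 4))
print("blend   MSE:", round(mse(blend), 4))
```

The blend is at least as good as either base model by construction (least squares can always recover a single model by zeroing the other weights), which is the "at least as good as the best of your methods" argument from the previous slide.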
  36. Lesson 7: There are biases in your data
  37. Defining training/testing data. Training a simple binary classifier for good/bad answers → Defining positive and negative labels is a non-trivial task → Is this a positive or a negative? → A funny uninformative answer with many upvotes → A short uninformative answer by a well-known expert in the field → A very long informative answer that nobody reads/upvotes → An informative answer with grammar/spelling mistakes → …
  38. The curse of presentation bias. Users can only click on what you decide to show → But what you decide to show is the result of what your model predicted is good. Simply treating things you show as negatives is not likely to work. Better options → Correcting for the probability a user will click on a position → attention models → Explore/exploit approaches such as multi-armed bandits (MAB)
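The explore/exploit option on slide 38 can be sketched with an epsilon-greedy multi-armed bandit: the system mostly shows what currently looks best, but keeps exploring so unseen items still collect evidence instead of being treated as negatives. The click-through rates below are invented for illustration.

```python
# Epsilon-greedy bandit sketch: explore/exploit instead of presentation bias.
import random

random.seed(42)
true_ctr = [0.1, 0.9, 0.3]           # hidden per-item click probabilities
counts = [0] * len(true_ctr)
clicks = [0] * len(true_ctr)
epsilon = 0.1
rounds = 2_000

# Initialize: show each item a few times so CTR estimates are defined.
for arm in range(len(true_ctr)):
    for _ in range(5):
        counts[arm] += 1
        clicks[arm] += random.random() < true_ctr[arm]

for _ in range(rounds):
    if random.random() < epsilon:     # explore: show a random item
        arm = random.randrange(len(true_ctr))
    else:                             # exploit: show the best-looking item
        arm = max(range(len(true_ctr)), key=lambda a: clicks[a] / counts[a])
    counts[arm] += 1
    clicks[arm] += random.random() < true_ctr[arm]

print("impressions per item:", counts)
print("estimated CTRs:", [round(clicks[a] / counts[a], 2) for a in range(3)])
```

Because every item keeps a nonzero probability of being shown, the estimated CTRs stay unbiased enough to recover from a bad initial guess, which the "treat unshown items as negatives" approach cannot do.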
  39. Bias & Fairness
  40. Lesson 8: Think about your models “in the wild”
  41. AI in the wild: desired properties ● Easily extensible ○ Incrementally/iteratively learn from a “human-in-the-loop” or from additional data ● Knows what it does not know ○ Models uncertainty in prediction ○ Enables fall-back to manual
  42. Assisted diagnosis in the wild 1. Extensibility a. Diagnosis as an ML task i. Expert systems as a prior b. Modeling less prevalent diseases i. Low-shot learning 2. Knowing what you don’t know a. Measures of uncertainty in prediction b. Allows fall-back to the “physician-in-the-loop”
  43. Lesson 9: Data and models are great. You know what’s even better? The right evaluation approach!
  44. Offline/online testing process. [Flowchart: Offline experimentation: choose model → train model → test offline → hypothesis validated? If no, try a different model or reformulate the hypothesis; if yes, move online. Online experimentation: initial hypothesis → design A/B test → choose control → deploy prototype → observe behavior → analyze results → significant improvements? If yes, deploy the feature; if no, reformulate the hypothesis.]
  45. Executing A/B tests. Measure differences in metrics across statistically identical populations that each experience a different algorithm. Overall Evaluation Criteria (OEC) = e.g. member retention at Netflix → Use long-term metrics whenever possible → Short-term metrics can be informative and allow faster decisions, but are not always aligned with the OEC. Decisions on the product are always data-driven.
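The statistics behind slide 45 boil down to testing whether a metric differs significantly between two identical populations. A minimal sketch with a two-proportion z-test (the conversion numbers are invented for illustration):

```python
# Two-proportion z-test for an A/B experiment (illustrative numbers).
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF (via the error function).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Control: 100/1000 users convert; treatment: 130/1000 convert.
z, p = two_proportion_ztest(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Here the 3-point lift clears the conventional 5% significance threshold; with smaller samples the same lift would not, which is why test design (population size, duration) matters as much as the metric.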
  46. Offline testing. Measure model performance using (IR) metrics. Offline performance is an indication used to decide on follow-up A/B tests. A critical (and mostly unsolved) issue is how offline metrics correlate with A/B test results.
  47. Lesson 10: Do not underestimate the value of systems and frameworks
  48. ML vs. software. Can you treat your ML infrastructure as you would your software infrastructure? → Yes and no. You should apply best software engineering practices (e.g. encapsulation, abstraction, cohesion, low coupling…). However, design patterns for machine learning software are not well known/documented.
  49. Software: the new frontier of ML?
  50. Lesson 11: Your AI infrastructure will have two masters
  51. Machine Learning Infrastructure → Whenever you develop any ML infrastructure, you need to target two different modes. Mode 1: ML experimentation → flexibility, ease of use, reusability. Mode 2: ML production → all of the above plus performance and scalability. → Ideally you want the two modes to be as similar as possible → How to combine them?
  52. Machine Learning Infrastructure. Option 1 → Favor experimentation and only invest in productionizing once something shows results → E.g. have ML researchers use R, then ask engineers to implement things in production when they work. Option 2 → Favor production and have “researchers” struggle to figure out how to run experiments → E.g. implement highly optimized C++ code and have ML researchers experiment only through data available in logs/DBs.
  54. Machine Learning Infrastructure. Good intermediate options → Have ML “researchers” experiment in Jupyter notebooks using Python tools (scikit-learn, PyTorch, TF…); use the same tools in production whenever possible, and implement optimized versions only when needed. → Implement abstraction layers on top of optimized implementations so they can be accessed from regular/friendly experimentation tools.
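One way to sketch the abstraction-layer idea on slide 54: experimentation and production code share a scikit-learn-style fit/predict interface, so a notebook prototype and an optimized implementation are interchangeable behind the same contract. The classes below are hypothetical illustrations, not a real framework.

```python
# A shared fit/predict contract so prototypes and optimized implementations
# are interchangeable. Hypothetical sketch, not a real framework.
from typing import Protocol, Sequence

class Model(Protocol):
    def fit(self, X: Sequence[Sequence[float]], y: Sequence[float]) -> "Model": ...
    def predict(self, X: Sequence[Sequence[float]]) -> list[float]: ...

class MeanBaseline:
    """Notebook-friendly prototype: predicts the training-set mean."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]

def evaluate(model: Model, X, y) -> float:
    """Works with any Model: prototype or optimized production version."""
    preds = model.fit(X, y).predict(X)
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)

X = [[0.0], [1.0], [2.0]]
y = [1.0, 2.0, 3.0]
print(evaluate(MeanBaseline(), X, y))   # mean = 2.0 -> MSE = 2/3
```

A C++ or GPU-backed model exposing the same two methods would drop into `evaluate` unchanged, which is what lets the experimentation and production modes stay "as similar as possible".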
  55. Lesson 12: There is ML beyond Deep Learning
  56. Other ML advances ● Factorization Machines ● Tensor Methods ● Non-parametric Bayesian models ● XGBoost ● Online Learning ● Reinforcement Learning ● Learning to rank ● …
  57. Other very successful approaches
  58. Sometimes DL does not win
  59. Conclusions
  60. 1. Choose the right metric. 2. Be thoughtful about your data. 3. Understand dependencies between data, models & systems. 4. Optimize only what matters; beware of biases. 5. Be thoughtful about your ML infrastructure/tools and about organizing your teams.
  61. LESSONS LEARNED from building practical Deep Learning systems Xavier Amatriain (@xamat)