
Fast Distributed Online Classification


  1. Fast Distributed Online Classification. Ram Sriharsha (Product Manager, Apache Spark, Databricks), Prasad Chalasani (SVP Data Science, MediaMath). 13 April 2016
  2. Summary. We leveraged recent machine-learning research to develop a fast, practical, scalable (up to 100s of millions of sparse features), online, distributed (built on Apache Spark), single-pass ML classifier that has significant advantages over most similar ML packages.
  3. Key Conceptual Take-aways: supervised machine learning; online vs. batch learning, and the importance of online; challenges in online learning; distributed implementation in Spark.
  4.–7. Supervised Machine Learning: Overview. Given: labeled training data. Goal: fit a model to predict labels on (unseen) test data. More precisely, given training data D of n labeled examples {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where x_i is a k-dimensional feature-vector and y_i is the label (0 or 1) that we want to predict, and an error (or loss) metric L(p, y) from predicting p when the true label is y: fix a family of functions f_w(x) \in F parametrised by a weight-vector w, and find the w that minimizes the average loss over D: L(w) = (1/n) \sum_{i=1}^n L_i(w) = (1/n) \sum_{i=1}^n L(f_w(x_i), y_i).
  8.–11. Logistic Regression. Logistic model: f_w(x) = 1 / (1 + e^{-w \cdot x}) (probability interpretation). Loss function: L_i(w) = -y_i \ln(f_w(x_i)) - (1 - y_i) \ln(1 - f_w(x_i)). Overall loss: L(w) = \sum_{i=1}^n L_i(w). L(w) is convex: no local minima; differentiate and follow gradients.
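
To make these formulas concrete, here is a minimal Scala sketch of the logistic prediction and the per-example log-loss (an illustration only; names like predict and logLoss are ours for this sketch, not Slider code):

    import scala.math.{exp, log}

    // Logistic model: p = f_w(x) = 1 / (1 + e^(-w.x)), for dense feature vectors.
    def predict(w: Array[Double], x: Array[Double]): Double = {
      val dot = w.zip(x).map { case (wi, xi) => wi * xi }.sum
      1.0 / (1.0 + exp(-dot))
    }

    // Per-example log-loss: L_i(w) = -y ln(p) - (1 - y) ln(1 - p), clipped to avoid log(0).
    def logLoss(p: Double, y: Double): Double = {
      val eps = 1e-15
      val q = math.min(math.max(p, eps), 1.0 - eps)
      -y * log(q) - (1.0 - y) * log(1.0 - q)
    }
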
  12. Logistic Regression: gradient descent
  13. Gradient Descent. Basic idea: start with an initial guess of the weight-vector w; at iteration t, update w to a new weight-vector w': w' = w - \lambda g_t, where g_t is the (vector) gradient of L(w) w.r.t. w at time t, and \lambda is the learning rate.
  14. Gradient Descent. Gradient g_t determines the step direction; learning rate \lambda determines the step size.
  15.–17. Gradient Descent. g_t = \partial L(w) / \partial w = \sum_i \partial L_i(w) / \partial w = \sum_i g_{ti}. This is Batch Gradient Descent (BGD): to make one weight-update, compute the gradient over the entire training data-set; repeat until convergence. BGD is not scalable to large data-sets.
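
A minimal Scala sketch of one batch-gradient-descent step for logistic regression, showing why each update touches the entire data-set (illustrative only; not taken from any library):

    // One BGD step: sum per-example gradients over ALL n examples, then update once.
    // For logistic loss the per-example gradient is (p_i - y_i) * x_i.
    def batchStep(w: Array[Double],
                  data: Seq[(Array[Double], Double)],   // (features, label) pairs
                  lambda: Double): Array[Double] = {
      def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))
      val grad = Array.fill(w.length)(0.0)
      for ((x, y) <- data) {
        val p = sigmoid(w.zip(x).map { case (wi, xi) => wi * xi }.sum)
        for (j <- w.indices) grad(j) += (p - y) * x(j)
      }
      // w' = w - lambda * g_t : a single update per full pass over the data
      w.indices.map(j => w(j) - lambda * grad(j)).toArray
    }
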
  18.–19. Online (Stochastic) Gradient Descent (SGD). A drastic simplification: instead of computing the gradient over the entire training data-set, g_t = \sum_i \partial L_i(w) / \partial w, and doing an update w' = w - \lambda g_t, shuffle the data-set (if not naturally shuffled), compute the gradient from a single example, g_{ti} = \partial L_i(w) / \partial w, and do an update w' = w - \lambda g_{ti}.
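
The corresponding single-example SGD update, again as an illustrative Scala sketch rather than actual Slider code:

    // One SGD step: gradient from a single example (x, y), followed immediately by an update.
    def sgdStep(w: Array[Double], x: Array[Double], y: Double, lambda: Double): Array[Double] = {
      val p = 1.0 / (1.0 + math.exp(-w.zip(x).map { case (wi, xi) => wi * xi }.sum))
      val g = x.map(xj => (p - y) * xj)                   // g_ti for logistic loss
      w.indices.map(j => w(j) - lambda * g(j)).toArray    // w' = w - lambda * g_ti
    }
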
  20.–21. Batch vs Online Gradient Descent. Batch: to make one step, compute the gradient w.r.t. the entire data-set; extremely slow updates; correct gradient. Online: to make one step, compute the gradient w.r.t. one example; extremely fast updates; not necessarily the correct gradient.
  22. Visualize Batch vs Stochastic Gradient Descent
  23.–24. Batch vs Online Learning. Batch learning: process a large training data-set, generate a model; use the model to predict labels of the test data-set. Drawbacks: infeasible or impractically slow for large data-sets; need to repeat the batch process to update the model with new data.
  25.–30. Batch vs Online Learning. Online learning: for each "training" example, generate a prediction (score), compare it with the true label, and update the model (weights w); for each "test" example, predict with the latest learned model (weights w).
  31.–38. Batch vs Online Learning. Online learning benefits: does not pre-process the entire training data-set; does not explicitly retain previously-seen examples; extremely light-weight, space- and time-efficient; no distinct "training" and "testing" phases, i.e. incremental, continual learning; adapts to changing patterns; easily updates an existing model with new data; better generalization to unseen observations.
  39. The Online Learning Paradigm. As each labeled example (x_i, y_i) is seen: make a prediction given only the current weight-vector w, then update the weight-vector w.
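
A minimal Scala sketch of this predict-then-update loop; sgdStep is the toy update from the slide 19 annotation above, so this illustrates the paradigm rather than Slider's implementation:

    // Online learning over a stream of labeled examples: score with the current model, then learn.
    def onlineLoop(examples: Iterator[(Array[Double], Double)],
                   w0: Array[Double],
                   lambda: Double): Array[Double] = {
      var w = w0
      for ((x, y) <- examples) {
        val p = 1.0 / (1.0 + math.exp(-w.zip(x).map { case (wi, xi) => wi * xi }.sum)) // 1. predict
        // ... act on p here (serve a decision, accumulate progressive loss, etc.) ...
        w = sgdStep(w, x, y, lambda)                                                   // 2. update
      }
      w
    }
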
  40.–45. Online Learning: Use Scenarios. Extremely large data-sets where batch learning is computationally infeasible or impractical, and it is only possible to do a single pass over the data. Data that arrives in real time, where decisions/predictions must be made quickly and the learned model needs to adapt quickly to recent observations.
  46. Online Learning Examples
  47.–50. Online Learning Example: Advertising (MediaMath). Listen to 100 billion ad-opportunities daily from ad exchanges. For each opportunity, predict whether the exposed user will buy, as a function of several features: hour_of_day, browser_type, geo_region, age, and so on. Online learning benefits: fast update of the learned model to reflect the latest observations; light-weight models that are extremely quick to compute.
  51.–55. Online Learning: IoT. Vast amounts of data; need to adapt and respond quickly. Nest thermostats: behavior data ⇒ predict preferred room temperature. Self-driving cars: (sensor data, other cars) ⇒ predict collision. Clinical: sensors (activity, vitals, ...) ⇒ predict cardiac event. Smart cities: traffic sensors ⇒ predict congestion.
  56. Online Learning: Challenge #1: Feature scale differences
  57.–60. Online Learning: Feature Scaling. Example from the wearable-devices domain: feature 1 = heart-rate, range 40 to 200; feature 2 = step-count, range 0 to 500,000. Extreme scale differences ⇒ convergence problems. Convergence is much faster when features are on the same scale: normalize each feature by dividing by its maximum possible value. But often the ranges of features are not known in advance, and we cannot make a separate pass over the data to find the ranges ⇒ we need single-pass algorithms that adaptively normalize features with each new observation. [Ross, Mineiro, Langford 2013] proposed such an algorithm, which we implemented in our online ML system.
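
To make the idea concrete, here is a deliberately simplified Scala sketch that keeps a running maximum of each feature's absolute value and scales by it. The actual [Ross, Mineiro, Langford 2013] normalized-update algorithm is more sophisticated, so treat this only as an illustration of single-pass adaptive scaling:

    // Maintains a per-feature running max of |value| and scales each incoming feature by it.
    class RunningScaler(dim: Int) {
      private val maxAbs = Array.fill(dim)(0.0)

      def scale(x: Array[Double]): Array[Double] = {
        for (j <- x.indices) maxAbs(j) = math.max(maxAbs(j), math.abs(x(j)))
        x.indices.map { j =>
          if (maxAbs(j) > 0.0) x(j) / maxAbs(j) else 0.0   // every feature ends up in [-1, 1]
        }.toArray
      }
    }
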
  61. Online Learning Challenge #2: (Sparse) feature frequency differences
  62.–65. Online Learning: Feature Frequency Differences. Some sparse features occur much more frequently than others. For example: a categorical feature country with 200 values, encoded as a vector of length 200 with exactly one entry = 1 and the rest 0; country=USA may occur much more often than country=Belgium; the indicator feature visited_site = 1 occurs much more often than purchased = 1.
  66.–69. Online Learning: Feature Frequency Differences. Often, rare features are much more predictive than frequent features. Using the same learning rate for all features ⇒ slow convergence. So rare features should have larger learning rates: bigger steps whenever a rare feature is seen, and much faster convergence. Effectively, the algorithm pays more attention to rare features, enabling it to find rare but predictive features. ADAGRAD is an algorithm for this [Duchi, Hazan, Singer 2010], and we implemented it in our learning system.
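
A minimal Scala sketch of AdaGrad-style per-feature learning rates (the gradient gi is the per-example logistic gradient from the earlier sketches; names and structure are ours, not Slider's):

    // AdaGrad: coordinate j gets step size lambda / sqrt(sum of its squared past gradients).
    // Rarely-seen features accumulate little gradient mass, so they keep a large effective step.
    class AdaGrad(dim: Int, lambda: Double, eps: Double = 1e-8) {
      private val gradSqSum = Array.fill(dim)(0.0)

      def update(w: Array[Double], gi: Array[Double]): Array[Double] = {
        w.indices.map { j =>
          gradSqSum(j) += gi(j) * gi(j)
          w(j) - lambda * gi(j) / (math.sqrt(gradSqSum(j)) + eps)
        }.toArray
      }
    }
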
  70. Online Learning Challenge #3: Encoding sparse features
  71.–79. Online Learning: Sparse Features. E.g. site_domain has a large (unknown) set of possible values: google.com, yahoo.com, cnn.com, ... We need to encode these (conceptually) as 1-hot vectors, e.g. google.com = (1, 0, 0, 0, ...), yahoo.com = (0, 1, 0, 0, ...), cnn.com = (0, 0, 1, 0, ...), and so on. But all possible values are not known in advance, we cannot pre-process the data to find all possible values, and we don't want to encode explicit (long) vectors.
  80. Online Learning: Sparse Features, Hashing Trick. Example observation: country = "china" (categorical), age = 32 (numerical), domain = "google.com" (categorical).
  81. Online Learning: Sparse Features, Hashing Trick. Hash the feature-names: hash("country_china") = 24378, hash("age") = 32905, hash("domain_google.com") = 84395.
  82.–84. Online Learning: Sparse Features, Hashing Trick. Represent the observation as a (special) Map: {24378 → 1.0, 32905 → 32.0, 84395 → 1.0}. Sparse representation (no explicit vectors); no need for a separate pass over the data (unlike Spark MLlib).
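
An illustrative Scala sketch of this hashed sparse encoding. The bucket count and hash values here are assumptions and will not reproduce the numbers on the slide; this is not the Slider implementation:

    // Hash each feature name into one of numBuckets weight indices; store only non-zero entries.
    def hashFeatures(categorical: Map[String, String],
                     numerical: Map[String, Double],
                     numBuckets: Int): Map[Int, Double] = {
      def bucket(name: String): Int = ((name.hashCode % numBuckets) + numBuckets) % numBuckets

      val cat = categorical.map { case (k, v) => bucket(s"${k}_$v") -> 1.0 }
      val num = numerical.map   { case (k, v) => bucket(k)          -> v   }
      cat ++ num   // conceptually like {24378 -> 1.0, 32905 -> 32.0, 84395 -> 1.0} on the slide
    }

    // Example:
    // hashFeatures(Map("country" -> "china", "domain" -> "google.com"),
    //              Map("age" -> 32.0), numBuckets = 1 << 18)
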
  85. Online Learning Challenge #4: Distributed Implementation of Online Learning
  86.–87. Distributed Online Logistic Regression. Stochastic Gradient Descent (SGD) is inherently sequential: how do we parallelize it? Our (Scala) implementation in Apache Spark: randomly re-partition the training data into shards; use SGD to learn a model for each shard; average the models using treeReduce (similar to "AllReduce"); this leverages Spark/Hadoop fault-tolerance.
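
A minimal Spark (Scala) sketch of this shard-train-average pattern, with a toy sequential SGD pass per partition. The real Slider code adds adaptive normalization, AdaGrad and feature hashing, so take this only as an outline of the parallelization idea:

    import org.apache.spark.rdd.RDD

    // data: RDD of (denseFeatures, label); dim: feature dimension; lambda: learning rate.
    def trainDistributed(data: RDD[(Array[Double], Double)],
                         dim: Int, lambda: Double, numShards: Int): Array[Double] = {

      // 1. Randomly re-partition the training data into shards.
      val shards = data.repartition(numShards)

      // 2. One sequential SGD pass per shard: each partition produces its own weight-vector.
      val models: RDD[Array[Double]] = shards.mapPartitions { it =>
        val w = Array.fill(dim)(0.0)
        for ((x, y) <- it) {
          val p = 1.0 / (1.0 + math.exp(-w.zip(x).map { case (wi, xi) => wi * xi }.sum))
          for (j <- w.indices) w(j) -= lambda * (p - y) * x(j)
        }
        Iterator(w)
      }

      // 3. Combine the shard models with treeReduce (an AllReduce-style aggregation), then average.
      val summed = models.treeReduce((w1, w2) => w1.zip(w2).map { case (a, b) => a + b })
      summed.map(_ / numShards)
    }
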
  88. Slider: Fast Distributed Online Learning System
  89. Slider. Fast, distributed, online, single-pass learning system: written in Scala on top of Spark; works directly with Spark DataFrames; usable as a library within other JVM systems; leverages Spark/Hadoop fault-tolerance; Stochastic Gradient Descent; online feature-scaling/normalization; adaptive (per-feature) learning-rates; single-pass; hashing-trick to encode sparse features.
  90. Slider vs. Vowpal Wabbit (VW) vs. Spark-ML (SML). The same checklist, annotated with which other systems share each property: written in Scala on top of Spark (SML); works directly with Spark DataFrames (SML); usable as a library within other JVM systems (SML); leverages Spark/Hadoop fault-tolerance (SML); Stochastic Gradient Descent (SGD) (VW, SML); online feature-scaling/normalization (VW); adaptive (per-feature) learning-rates (VW); single-pass (VW, SML); hashing-trick to encode sparse features (VW).
  91.–94. Slider example (example slides; content not captured in this transcript).
  95.–97. Slider vs Spark ML. Task: predict conversion probability from ad-impression features; 14M impressions from one ad campaign; 17 categorical features, 2 numerical features; train on the first 80%, test on the remaining 20%. Spark ML (using Pipelines): makes 17 passes over the data, one for each categorical feature; trains and scores in 40 minutes; needs iterations etc. to be specified; AUC = 0.52 on test data. Slider: makes just one pass over the data; trains and scores in 5 minutes; no tuning; AUC = 0.68 on test data.
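
For context, a typical Spark ML pipeline for a task like this looks roughly like the sketch below; each StringIndexer must scan the data when fitted to build its value-to-index map, which is where the one-pass-per-categorical-feature cost comes from. The column names are invented for illustration, and this is not the exact pipeline used in the benchmark:

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

    // Hypothetical categorical columns; the benchmark above had 17 of these plus 2 numerical ones.
    val catCols = Seq("site_domain", "browser_type", "geo_region")

    val indexers = catCols.map { c =>
      new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx")  // fitting this scans the data
    }
    val encoders = catCols.map { c =>
      new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_vec")
    }
    val assembler = new VectorAssembler()
      .setInputCols((catCols.map(c => s"${c}_vec") ++ Seq("age", "hour_of_day")).toArray)
      .setOutputCol("features")
    val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

    val stages: Array[PipelineStage] = (indexers ++ encoders :+ assembler :+ lr).toArray
    val pipeline = new Pipeline().setStages(stages)
    // val model = pipeline.fit(trainDF)   // trainDF: a DataFrame containing the columns above
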
  98. Other Work. Online version of k-means clustering; FTRL algorithm (a regularized alternative to SGD). Ongoing/future: online learning with Spark Streaming; benchmarking vs. other ML systems.
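
For the online k-means item, the classic sequential update rule is sketched below (a generic textbook rule, not necessarily the exact variant implemented here): each new point pulls its nearest centroid a small step toward itself.

    // Sequential (online) k-means: assign the point to the nearest centroid,
    // then move that centroid toward the point with a per-centroid learning rate 1/count.
    def onlineKMeansStep(centroids: Array[Array[Double]],
                         counts: Array[Long],
                         x: Array[Double]): Unit = {
      def dist2(a: Array[Double], b: Array[Double]): Double =
        a.zip(b).map { case (ai, bi) => (ai - bi) * (ai - bi) }.sum

      val k = centroids.indices.minBy(i => dist2(centroids(i), x))
      counts(k) += 1
      val eta = 1.0 / counts(k)
      for (j <- centroids(k).indices)
        centroids(k)(j) += eta * (x(j) - centroids(k)(j))
    }
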
  99. Thank you
