2013 - Andrei Zmievski: Machine learning para datos

PHP Conference Argentina - 2013

  1. 1. Small Data Machine Learning Andrei Zmievski The goal is not a comprehensive introduction, but to plant a seed in your head to get you interested in this topic. Questions - now and later
  2. 2. WORK We are all superheroes, because we help our customers keep their mission-critical apps running smoothly. If interested, I can show you a demo of what I’m working on. Come find me.
  4. 4. TRAVEL
  5. 5. TAKE PHOTOS
  6. 6. DRINK BEER
  7. 7. MAKE BEER
  8. 8. MATH
  9. 9. SOME MATH
  10. 10. AWESOME MATH
  11. 11. @a For those of you who don’t know me.. Acquired in October 2008 Had a different account earlier, but then @k asked if I wanted it.. Know many other single-letter Twitterers.
  12. 12. FAME Advantages
  13. 13. FAME FORTUNE
  14. 14. FAME FORTUNE Wall Street Journal?!
  15. 15. FAME FORTUNE FOLLOWERS
  16. 16. FAME FORTUNE FOLLOWERS lol, what?!
  17. 17. MAXIMUM REPLY SPACE! 140-length(“@a “)=137
  18. 18. CONS Disadvantages Visual filtering is next to impossible Could be a set of hard-coded rules derived empirically
  20. 20. CONS I hate humanity Disadvantages Visual filtering is next to impossible Could be a set of hard-coded rules derived empirically
  21. 21. A D D
  22. 22. Annoyance Driven Development Best way to learn something is to be annoyed enough to create a solution based on the tech.
  23. 23. Machine Learning to the Rescue!
  24. 24. REPLYCLEANER Even with false negatives, reduces garbage to where visual filtering is possible - uses trained model to classify tweets into good/bad - blocks the authors of the bad ones, since Twitter does not have a way to remove an individual tweet from the timeline
  26. 26. REPLYCLEANER
  30. 30. I still hate humanity
  32. 32. Machine Learning A branch of Artificial Intelligence No widely accepted definition
  33. 33. “Field of study that gives computers the ability to learn without being explicitly programmed.” — Arthur Samuel (1959) concerns the construction and study of systems that can learn from data
  34. 34. SPAM FILTERING
  35. 35. RECOMMENDATIONS
  36. 36. TRANSLATION
  37. 37. CLUSTERING And many more: medical diagnoses, detecting credit card fraud, etc.
  38. 38. supervised unsupervised Labeled dataset, training maps input to desired outputs Example: regression - predicting house prices, classification - spam filtering
  39. 39. supervised unsupervised no labels in the dataset, algorithm needs to find structure Example: clustering We will be talking about classification, a supervised learning process.
  40. 40. Feature individual measurable property of the phenomenon under observation usually numeric
  41. 41. Feature Vector a set of features for an observation Think of it as an array
  42. 42. features: [1, # of rooms, sq. m, house age, yard?] Feature vector and weights vector; the 1 is added to pad the vector (accounts for the initial offset / bias / intercept weight, simplifies calculation). The dot product produces a linear predictor.
  43. 43. features and parameters: [1, # of rooms, sq. m, house age, yard?] with weights [45.7, 102.3, 0.94, -10.1, 83.0]
  46. 46. features · parameters = prediction: [1, # of rooms, sq. m, house age, yard?] · [45.7, 102.3, 0.94, -10.1, 83.0] = 758,013
  48. 48. dot product X = [1 x₁ x₂ …], θ = [θ₀ θ₁ θ₂ …] X - input feature vector, theta - weights
  49. 49. dot product θ·X = θ₀ + θ₁x₁ + θ₂x₂ + … X - input feature vector, theta - weights
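As a sketch, the dot product above is just a sum of pairwise products over the padded feature vector and the weights (Python here for illustration; the talk's tooling was PHP, and the house-feature values below are made up):

```python
def dot(theta, x):
    """Dot product of the weights vector and a feature vector:
    the linear predictor theta·X = theta0 + theta1*x1 + theta2*x2 + ..."""
    assert len(theta) == len(x)
    return sum(t * xi for t, xi in zip(theta, x))

# Feature vector padded with a leading 1 for the bias/intercept weight.
x = [1, 4, 120.0, 15, 1]                  # [bias, rooms, sq. m, age, yard?] (made-up values)
theta = [45.7, 102.3, 0.94, -10.1, 83.0]  # weights from the slide
price = dot(theta, x)                     # the linear predictor
```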
  50. 50. training data learning algorithm hypothesis Hypothesis (decision function): what the system has learned so far Hypothesis is applied to new data
  51. 51. hθ(X) The task of our algorithm is to determine the parameters of the hypothesis.
  52. 52. input data hθ(X) The task of our algorithm is to determine the parameters of the hypothesis.
  53. 53. input data hθ(X) parameters The task of our algorithm is to determine the parameters of the hypothesis.
  54. 54. input data hθ(X) prediction y parameters The task of our algorithm is to determine the parameters of the hypothesis.
  55. 55. LINEAR REGRESSION [scatter plot: whisky price ($) vs. whisky age (years)] Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price. Linear regression does not work well for classification because its output is unbounded. Thresholding on some value is tricky and does not produce good results.
  58. 58. LOGISTIC REGRESSION g(z) = 1 / (1 + e^(-z)) Logistic function (also sigmoid function). Asymptotes at 0 and 1. Crosses 0.5 at origin. z is just our old dot product, the linear predictor. Transforms unbounded output into bounded.
  59. 59. LOGISTIC REGRESSION g(z) = 1 / (1 + e^(-z)), z = θ·X
  60. 60. LOGISTIC REGRESSION h_θ(X) = 1 / (1 + e^(-θ·X)) Probability that y=1 for input X. If the hypothesis describes spam, then given X = body of email, h = 0.7 means there's a 70% chance it's spam. Thresholding on that is up to you.
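A minimal sketch of the logistic hypothesis (Python for illustration; the original tool was PHP):

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(X): probability that y = 1 for feature vector x."""
    z = sum(t * xi for t, xi in zip(theta, x))  # the linear predictor
    return sigmoid(z)

sigmoid(0)  # 0.5: the curve crosses 0.5 at the origin
```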
  61. 61. Building the Tool
  62. 62. Corpus collection of source data used for training and testing the model
  63. 63. Twitter MongoDB phirehose hooks into streaming API
  64. 64. Twitter MongoDB phirehose 8500 tweets hooks into streaming API
  65. 65. Feature Identification
  66. 66. independent & discriminant Independent: feature A should not co-occur with (correlate highly with) feature B. Discriminant: a feature should provide uniquely classifiable data (what letter a tweet starts with is not a good feature).
  67. 67. possible features ‣ @a at the end of the tweet ‣ @a… ‣ length < N chars ‣ # of user mentions in the tweet ‣ # of hashtags ‣ language! ‣ @a followed by punctuation and a word character (except for apostrophe) ‣ …and more
  68. 68. feature = extractor(tweet) For each feature, write a small function that takes a tweet and returns a numeric value (floating-point).
  69. 69. corpus extractors feature vectors Run the set of these functions over the corpus and build up feature vectors Array of arrays Save to DB
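The extractor-per-feature idea might look like this (a Python sketch; the talk used PHP, and these feature names and the length threshold are illustrative, not the talk's exact set):

```python
import re

# Each extractor takes a tweet's text and returns a numeric value.
def mention_count(tweet):
    return float(len(re.findall(r'@\w+', tweet)))

def hashtag_count(tweet):
    return float(len(re.findall(r'#\w+', tweet)))

def is_short(tweet):
    return 1.0 if len(tweet) < 20 else 0.0   # "length < N chars", N=20 assumed

EXTRACTORS = [mention_count, hashtag_count, is_short]

def feature_vector(tweet):
    # Leading 1 pads the vector for the bias weight.
    return [1.0] + [f(tweet) for f in EXTRACTORS]

corpus = ['@a hi', '@a #one #two some longer tweet text']
vectors = [feature_vector(t) for t in corpus]   # array of arrays, ready to save to DB
```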
  70. 70. Language Matters high correlation between the language of the tweet and its category (good/bad)
  71. 71. Indonesian or Tagalog? Garbage.
  72. 72. Top 12 Languages: id Indonesian 3548, en English 1804, tl Tagalog 733, es Spanish 329, so Somali 305, ja Japanese 300, pt Portuguese 262, ar Arabic 256, nl Dutch 150, it Italian 137, sw Swahili 118, fr French 92. I guarantee you people aren't tweeting at me in Swahili.
  74. 74. Language Detection pear/Text_LanguageDetect, pecl/textcat Can't trust the language field in user's profile data. Used character N-grams and character sets for detection. Has its own error rate, so needs some post-processing.
  75. 75. EnglishNotEnglish ✓ ✓ ✓ ✓ Clean-up text (remove mentions, links, etc) Run language detection If unknown/low weight, pretend it’s English, else: If not a character set-determined language, try harder: ✓ Tokenize into words ✓ Difference with English vocabulary ✓ If words remain, run parts-of-speech tagger on each ✓ For NNS, VBZ, and VBD run stemming algorithm ✓ If result is in English vocabulary, remove from remaining ✓ If remaining list is not empty, calculate: unusual_word_ratio = size(remaining)/size(words) ✓ If ratio < 20%, pretend it’s English A lot of this is heuristic-based, after some trial-and-error. Seems to help with my corpus.
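The final ratio check in the pipeline above can be sketched as follows (a simplified Python illustration that skips the POS-tagging and stemming steps; the toy vocabulary is a stand-in for a real English word list):

```python
import re

# Toy stand-in for a real English vocabulary corpus.
VOCAB = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}

def unusual_word_ratio(text, vocab=VOCAB):
    """size(remaining) / size(words) after the vocabulary lookup."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    remaining = [w for w in words if w not in vocab]
    return len(remaining) / len(words)

def probably_english(text, vocab=VOCAB, threshold=0.20):
    # "If ratio < 20%, pretend it's English"
    return unusual_word_ratio(text, vocab) < threshold
```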
  76. 76. BINARY CLASSIFICATION Grunt work Built a web-based tool to display tweets a page at a time and select good ones
  77. 77. INPUT feature vectors OUTPUT labels (good/bad) Had my input and output
  78. 78. BIAS CORRECTION One more thing to address
  79. 79. BIAS CORRECTION BAD vs. GOOD: 99% = bad (fewer than 100 tweets were good). Training a model as-is would not produce good results. Need to adjust the bias.
  80. 80. BIAS CORRECTION BAD GOOD
  81. 81. OVER SAMPLING Oversampling: use multiple copies of good tweets to equalize with bad Problem: bias very high, each good tweet would have to be copied 100 times, and not contribute any variance to the good category
  83. 83. UNDERSAMPLING Undersampling: drop most of the bad tweets to equalize with good Problem: total corpus ends up being < 200 tweets, not enough for training
  85. 85. Synthetic OVERSAMPLING Synthesize feature vectors by determining what constitutes a good tweet and do weighted random selection of feature values.
  86. 86. chance → feature: 90% "good" language, 70% no hashtags, 25% 1 hashtag, 5% 2 hashtags, 2% @a at the end, 85% rand length > 10. The actual synthesis is somewhat more complex and was also trial-and-error based. Synthesized tweets + existing good tweets = 2/3 of # of bad tweets in the training corpus (limited to 1000)
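The weighted random selection behind the synthesis can be sketched like this (Python; the percentages come from the slide, but the vector layout and the length ranges are assumptions, and the real synthesis was more complex):

```python
import random

def synthesize_good_tweet(rng=random):
    """Draw one synthetic 'good' feature vector using per-feature chances."""
    good_language = 1.0 if rng.random() < 0.90 else 0.0
    r = rng.random()                      # 70% none / 25% one / 5% two hashtags
    hashtags = 0.0 if r < 0.70 else (1.0 if r < 0.95 else 2.0)
    a_at_end = 1.0 if rng.random() < 0.02 else 0.0
    length = rng.uniform(11, 140) if rng.random() < 0.85 else rng.uniform(1, 10)
    return [1.0, good_language, hashtags, a_at_end, length]

synthetic = [synthesize_good_tweet() for _ in range(1000)]
```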
  91. 91. Model Training We have the hypothesis (decision function) and the training set, How do we actually determine the weights/parameters?
  92. 92. COST FUNCTION Measures how far the prediction of the system is from the reality. The cost depends on the parameters. The less the cost, the closer we’re to the ideal parameters for the model.
  93. 93. REALITY COST FUNCTION PREDICTION Measures how far the prediction of the system is from the reality. The cost depends on the parameters. The less the cost, the closer we’re to the ideal parameters for the model.
  94. 94. COST FUNCTION J(θ) = (1/m) Σᵢ₌₁..ᵐ Cost(h_θ(x⁽ⁱ⁾), y⁽ⁱ⁾) Measures how far the prediction of the system is from the reality. The cost depends on the parameters. The less the cost, the closer we're to the ideal parameters for the model.
  95. 95. LOGISTIC COST Cost(h_θ(x), y) = −log(h_θ(x)) if y = 1, −log(1 − h_θ(x)) if y = 0
  96. 96. LOGISTIC COST [plot: cost vs. h(x) for y = 1 and y = 0; correct guess → cost = 0, incorrect guess → cost = huge] When y=1 and h(x) is 1 (good guess), cost is 0, but the closer h(x) gets to 0 (wrong guess), the more we penalize the algorithm. Same for y=0.
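The per-example cost and the averaged cost function J(θ) can be sketched as (Python, illustrative):

```python
import math

def hypothesis(theta, x):
    """Logistic hypothesis over the linear predictor theta·x."""
    return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))

def cost(h, y):
    """-log(h) if y=1, -log(1-h) if y=0: zero for a confident correct guess,
    growing without bound as the prediction approaches the wrong extreme."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

def J(theta, X, Y):
    """Average cost over the whole training set."""
    return sum(cost(hypothesis(theta, x), y) for x, y in zip(X, Y)) / len(X)
```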
  97. 97. minimize cost OVER θ Finding the best values of Theta that minimize the cost
  98. 98. GRADIENT DESCENT Random starting point. Pretend you’re standing on a hill. Find the direction of the steepest descent and take a step. Repeat. Imagine a ball rolling down from a hill.
  99. 99. GRADIENT DESCENT θᵢ := θᵢ − α · ∂J(θ)/∂θᵢ Each step adjusts the parameters according to the slope
  100. 100. θᵢ := θᵢ − α · ∂J(θ)/∂θᵢ — for each parameter. Have to update them simultaneously (the whole vector at a time).
  101. 101. θᵢ := θᵢ − α · ∂J(θ)/∂θᵢ — α is the learning rate; it controls how big a step you take. If α is big you have an aggressive gradient descent; if α is small you take tiny steps. If too small, it takes too long; if too big, it can overshoot the minimum and fail to converge.
  102. 102. θᵢ := θᵢ − α · ∂J(θ)/∂θᵢ — the derivative, aka "the slope". The slope indicates the steepness and direction of the descent step for each weight. Keep going for a number of iterations or until the cost falls below a threshold (convergence). Graph the cost function versus # of iterations and see where it starts to flatten out; past that are diminishing returns.
  103. 103. THE UPDATE ALGORITHM θᵢ := θᵢ − α · Σⱼ₌₁..ᵐ (h_θ(x⁽ʲ⁾) − y⁽ʲ⁾) · xᵢ⁽ʲ⁾ The derivative for logistic regression simplifies to this term. Have to update the weights simultaneously!
  104. 104. X1 = [1 12.0], X2 = [1 -3.5], y1 = 1, y2 = 0, θ = [0.1 0.1], α = 0.05 Hypothesis for each data point is based on the current parameters. Each parameter is updated in order and the result is saved to a temporary.
  108. 108. h(X1) = 1 / (1 + e^(-(0.1 · 1 + 0.1 · 12.0))) = 0.786
  109. 109. h(X2) = 1 / (1 + e^(-(0.1 · 1 + 0.1 · -3.5))) = 0.438
  111. 111. T0 = 0.1 - 0.05 · ((h(X1) - y1) · X1₀ + (h(X2) - y2) · X2₀) = 0.1 - 0.05 · ((0.786 - 1) · 1 + (0.438 - 0) · 1) = 0.088
  114. 114. T1 = 0.1 - 0.05 · ((h(X1) - y1) · X1₁ + (h(X2) - y2) · X2₁) = 0.1 - 0.05 · ((0.786 - 1) · 12.0 + (0.438 - 0) · -3.5) = 0.305 Note that the hypotheses don't change within the iteration.
  115. 115. X1 = [1 12.0] X2 = [1 -3.5] y1 = 1 y2 = 0 θ = [T0 T1] Replace parameter (weights) vector with the temporaries. ↵ = 0.05
  116. 116. X1 = [1 12.0] X2 = [1 -3.5] y1 = 1 y2 = 0 ↵ = 0.05 θ = [0.088 0.305] Do next iteration
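One whole iteration can be sketched as a batch gradient-descent step (Python; this reproduces the numbers from the worked example above, up to rounding):

```python
import math

def h(theta, x):
    """Logistic hypothesis over the linear predictor theta·x."""
    return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))

def gradient_step(theta, X, Y, alpha):
    """One update: all hypotheses use the old theta, and every parameter
    is computed into a temporary before the vector is replaced."""
    preds = [h(theta, x) for x in X]   # hypotheses don't change within the iteration
    return [theta[i] - alpha * sum((preds[j] - Y[j]) * X[j][i] for j in range(len(X)))
            for i in range(len(theta))]

X = [[1, 12.0], [1, -3.5]]
Y = [1, 0]
theta = gradient_step([0.1, 0.1], X, Y, alpha=0.05)
# theta ≈ [0.089, 0.305] (the slide truncates the first value to 0.088)
```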
  117. 117. CROSS-VALIDATION Used to assess the results of the training.
  118. 118. DATA
  119. 119. TRAINING DATA
  120. 120. TEST TRAINING DATA Train model on training set, then test results on test set. Rinse, lather, repeat feature selection/synthesis/training until results are "good enough". Pick the best parameters and save them (DB, other).
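A minimal train/test split sketch (Python; the 70/30 ratio and fixed seed are illustrative choices, not from the talk):

```python
import random

def train_test_split(examples, test_fraction=0.3, seed=42):
    """Shuffle labeled examples and cut them into training and test sets."""
    data = list(examples)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]

examples = [([1.0, float(i)], i % 2) for i in range(100)]  # (features, label) pairs
train, test = train_test_split(examples)
# len(train) == 70, len(test) == 30
```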
  121. 121. Putting It All Together Let's put our model to use, finally. The tool hooks into the Twitter Streaming API, and naturally that comes with the need to do certain error handling, etc. Once we get the actual tweet though…
  122. 122. Load the model The weights we have calculated via training Easiest is to load them from DB (can be used to test different models).
  123. 123. HARD CODED RULES SKIP: truncated retweets ("RT @A ..."), @ mentions of friends, tweets from friends. We apply some hardcoded rules to filter out the tweets we are certain are good or bad. The truncated RT ones don't show up on the Web or other tools anyway, so it's fine to skip those.
  127. 127. Classifying Tweets GOOD / BAD This is the moment we've been waiting for.
  130. 130. Remember this? h_θ(X) = 1 / (1 + e^(-θ·X)) First is our hypothesis.
  131. 131. θ·X = θ₀ + θ₁X₁ + θ₂X₂ + …
  132. 132. Finally h_θ(X) = 1 / (1 + e^(-(θ₀ + θ₁X₁ + θ₂X₂ + …))) If h > threshold, tweet is bad, otherwise good. Remember that the output of h() is 0..1 (a probability). Threshold is in [0, 1]; adjust it for your degree of tolerance. I used 0.9 to reduce false positives.
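Putting the hypothesis and threshold together (a Python sketch; the 0.9 threshold is the one mentioned on the slide, while the weights and feature vector below are hypothetical, just to exercise the function):

```python
import math

def is_bad(theta, x, threshold=0.9):
    """True if the model says the tweet is bad. h() outputs a probability
    in 0..1; a high threshold reduces false positives (blocking good tweets)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    h = 1.0 / (1.0 + math.exp(-z))
    return h > threshold

# Hypothetical weights and feature vector.
is_bad([3.0, 1.5], [1.0, 2.0])   # h = sigmoid(6.0) ≈ 0.998 → True
```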
  133. 133. 3 simple steps: extract features, run the model, act on the result. Invoke the feature extractor to construct the feature vector for this tweet. Evaluate the decision function over the feature vector (input the calculated feature parameters into the equation). Use the output of the classifier.
  136. 136. BAD? Block user! Also save the tweet to DB for future analysis.
  137. 137. Lessons Learned Twitter API is a pain in the rear: connection handling, backoff in case of problems, undocumented API errors, etc. Blocking is the only option (and is final): no way for the blocked person to get ahold of you via Twitter anymore, so when training the model, err on the side of caution. Streaming API delivery is incomplete: some tweets are shown on the website, but never seen through the API. ReplyCleaner judged to be ~80% effective; lots of room for improvement. PHP sucks at math-y stuff.
  143. 143. NEXT STEPS ★ Realtime feedback: click on the tweets that are bad and it immediately incorporates them into the model ★ More features ★ Grammar analysis: eliminate the common "@a bar" or "two @a time" occurrences ★ Support Vector Machines or decision trees (SVMs are more appropriate for biased data sets) ★ Clockwork Raven for manual classification (farm it out to Mechanical Turk) ★ Other minimization algos: BFGS, conjugate gradient (may help avoid local minima, no need to pick alpha, often faster than gradient descent) ★ Wish pecl/scikit-learn existed
  144. 144. TOOLS ★ MongoDB (great fit for JSON data) ★ pear/Text_LanguageDetect ★ English vocabulary corpus: http://corpus.byu.edu/ or http://www.keithv.com/software/wlist/ ★ Parts-Of-Speech tagging ★ SplFixedArray (memory savings and slightly faster in PHP) ★ phirehose ★ Python's scikit-learn (for validation) ★ Code sample
  145. 145. LEARN ★ Coursera.org ML course ★ Ian Barber's blog ★ FastML.com
  146. 146. Questions?
