
Modeling Social Data, Lecture 8: Classification

http://modelingsocialdata.org


  1. 1. classification chris.wiggins@columbia.edu 2017-03-10
  2. 2. wat?
  3. 3. example: spam/ham (cf. jake’s great deck on this)
  4. 4. Learning by example Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 2 / 16
  5. 5. Learning by example • How did you solve this problem? • Can you make this process explicit (e.g. write code to do so)? Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 2 / 16
  6. 6. classification? build a theory of 3’s?
  7. 7. 1-slide summary of classification • banana or orange? what would Gauss do?
  8. 8. 1-slide summary of classification what would Gauss do? length height • banana or orange?
  9. 9. 1-slide summary of classification length, height, price, smell, time of purchase • banana or orange?
  10. 10. 1-slide summary of classification length height game theory: “assume the worst” • banana or orange?
  11. 11. 1-slide summary of classification length height large deviation theory: “maximum margin” • banana or orange?
  12. 12. 1-slide summary of classification length height large deviation theory: “maximum margin” • banana or orange?
  13. 13. 1-slide summary of classification length, height, price, smell, time of purchase; boosting (1997), SVMs (1990s) • banana or orange?
  14. 14. 1-slide summary of classification “acgt” & gene 45 down? “cat” & gene 11 up? “tag” & gene 34 up? “gataca” & gene 37 down? learn predictive features from data “gaga” & gene 1066 up? “gataca” & gene 37 down? • up- or down- regulated? “gaga” & gene 137 up?
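
To make the toy examples above concrete: the following is a minimal sketch of a maximum-margin linear classifier on two made-up fruit features (length and height). The measurements and labels are invented for illustration, and scikit-learn's LinearSVC stands in for the large-margin idea named on the slides; this is not code from the course.

```python
# Sketch: maximum-margin linear classifier on two toy features (length, height).
# The fruit measurements below are invented for illustration only.
import numpy as np
from sklearn.svm import LinearSVC

# rows: [length_cm, height_cm]; label 1 = banana, 0 = orange
X = np.array([[18.0, 3.5], [20.0, 4.0], [17.5, 3.2], [19.0, 3.8],   # bananas
              [7.5, 7.0], [8.0, 7.8], [7.0, 6.9], [8.5, 8.2]])      # oranges
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

clf = LinearSVC(C=1.0).fit(X, y)               # fits a large-margin separating line
print(clf.coef_, clf.intercept_)               # the learned linear decision boundary
print(clf.predict([[16.0, 3.0], [7.8, 7.5]]))  # -> banana (1), orange (0)
```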
  15. 15. example: bad bananas
  16. 16. example @NYT in CAR (computer assisted reporting) Figure 1: Tabuchi article
  17. 17. example in CAR (computer assisted reporting) ◮ cf. Friedman’s “Statistical models and Shoe Leather”1
  18. 18. example in CAR (computer assisted reporting) ◮ cf. Friedman’s “Statistical models and Shoe Leather”1 ◮ Takata airbag fatalities
  19. 19. example in CAR (computer assisted reporting) ◮ cf. Friedman’s “Statistical models and Shoe Leather”1 ◮ Takata airbag fatalities ◮ 2219 labeled2 examples from 33,204 comments
  20. 20. example in CAR (computer assisted reporting) ◮ cf. Friedman’s “Statistical models and Shoe Leather”1 ◮ Takata airbag fatalities ◮ 2219 labeled2 examples from 33,204 comments ◮ cf. Box’s “Science and Statistics”3
  21. 21. computer assisted reporting ◮ Impact Figure 3: impact
  22. 22. conjecture: cost function?
  23. 23. fallback: probability
  24. 24. review: regression as probability
  25. 25. classification as probability binary/dichotomous/boolean features + NB
  26. 26. digression: bayes rule generalize, maintain linearity
  27. 27. Learning by example • How did you solve this problem? • Can you make this process explicit (e.g. write code to do so)? Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 2 / 16
  28. 28. Diagnoses a la Bayes1 • You’re testing for a rare disease: • 1% of the population is infected • You have a highly sensitive and specific test: • 99% of sick patients test positive • 99% of healthy patients test negative • Given that a patient tests positive, what is the probability the patient is sick? 1 Wiggins, SciAm 2006 Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 3 / 16
  29. 29. Diagnoses a la Bayes Population 10,000 ppl: 1% Sick → 100 ppl (99% Test + → 99 ppl, 1% Test − → 1 ppl); 99% Healthy → 9900 ppl (1% Test + → 99 ppl, 99% Test − → 9801 ppl) Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 4 / 16
  30. 30. Diagnoses a la Bayes Population 10,000 ppl: 1% Sick → 100 ppl (99% Test + → 99 ppl, 1% Test − → 1 ppl); 99% Healthy → 9900 ppl (1% Test + → 99 ppl, 99% Test − → 9801 ppl) So given that a patient tests positive (198 ppl), there is a 50% chance the patient is sick (99 ppl)! Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 4 / 16
  31. 31. Diagnoses a la Bayes Population 10,000 ppl: 1% Sick → 100 ppl (99% Test + → 99 ppl, 1% Test − → 1 ppl); 99% Healthy → 9900 ppl (1% Test + → 99 ppl, 99% Test − → 9801 ppl) The small error rate on the large healthy population produces many false positives. Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 4 / 16
  32. 32. Natural frequencies a la Gigerenzer2 2 http://bit.ly/ggbbc Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 5 / 16
  33. 33. Inverting conditional probabilities Bayes’ Theorem Equate the far right- and left-hand sides of the product rule p(y|x) p(x) = p(x, y) = p(x|y) p(y) and divide to get the probability of y given x from the probability of x given y: p(y|x) = p(x|y) p(y) / p(x), where p(x) = Σ_{y∈Ω_Y} p(x|y) p(y) is the normalization constant. Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 6 / 16
  34. 34. Diagnoses a la Bayes Given that a patient tests positive, what is the probability the patient is sick? p(sick|+) = p(+|sick) p(sick) / p(+) = (99/100)(1/100) / (198/100²) = 99/198 = 1/2, where p(+) = p(+|sick) p(sick) + p(+|healthy) p(healthy) = 99/100² + 99/100² = 198/100². Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 7 / 16
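
A minimal sketch reproducing the diagnosis arithmetic above, both by Bayes' rule and by natural frequencies; all numbers are taken directly from the slides.

```python
# Sketch: P(sick | test +) for the rare-disease example, two equivalent ways.
p_sick = 0.01                 # 1% of the population is infected
p_pos_given_sick = 0.99       # sensitivity: 99% of sick patients test positive
p_pos_given_healthy = 0.01    # 1% of healthy patients test positive

# Bayes' rule: p(sick|+) = p(+|sick) p(sick) / p(+)
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)
print(p_pos_given_sick * p_sick / p_pos)    # 0.5

# Natural frequencies on a population of 10,000 people
sick, healthy = 100, 9900
true_pos = 0.99 * sick        # 99 ppl
false_pos = 0.01 * healthy    # 99 ppl
print(true_pos / (true_pos + false_pos))    # 0.5
```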
  35. 35. (Super) Naive Bayes We can use Bayes’ rule to build a one-word spam classifier: p(spam|word) = p(word|spam) p(spam) / p(word), where we estimate these probabilities with ratios of counts: p̂(word|spam) = (# spam docs containing word) / (# spam docs); p̂(word|ham) = (# ham docs containing word) / (# ham docs); p̂(spam) = (# spam docs) / (# docs); p̂(ham) = (# ham docs) / (# docs). Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 8 / 16
  36. 36. (Super) Naive Bayes $ ./enron_naive_bayes.sh meeting 1500 spam examples 3672 ham examples 16 spam examples containing meeting 153 ham examples containing meeting estimated P(spam) = .2900 estimated P(ham) = .7100 estimated P(meeting|spam) = .0106 estimated P(meeting|ham) = .0416 P(spam|meeting) = .0923 Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 9 / 16
  37. 37. (Super) Naive Bayes $ ./enron_naive_bayes.sh money 1500 spam examples 3672 ham examples 194 spam examples containing money 50 ham examples containing money estimated P(spam) = .2900 estimated P(ham) = .7100 estimated P(money|spam) = .1293 estimated P(money|ham) = .0136 P(spam|money) = .7957 Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 10 / 16
  38. 38. (Super) Naive Bayes $ ./enron_naive_bayes.sh enron 1500 spam examples 3672 ham examples 0 spam examples containing enron 1478 ham examples containing enron estimated P(spam) = .2900 estimated P(ham) = .7100 estimated P(enron|spam) = 0 estimated P(enron|ham) = .4025 P(spam|enron) = 0 Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 11 / 16
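
The arithmetic behind these outputs is easy to restate. Below is a hedged Python sketch of the same one-word classifier; the function and its name are mine (not the course's enron_naive_bayes.sh), and the counts are copied from the "money" slide above.

```python
# Sketch: one-word naive Bayes spam score from document counts,
# mirroring the ./enron_naive_bayes.sh output shown above.
def p_spam_given_word(n_spam_with_word, n_spam, n_ham_with_word, n_ham):
    p_spam = n_spam / (n_spam + n_ham)                 # prior P(spam)
    p_ham = 1.0 - p_spam                               # prior P(ham)
    p_word_given_spam = n_spam_with_word / n_spam      # P(word|spam)
    p_word_given_ham = n_ham_with_word / n_ham         # P(word|ham)
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word         # Bayes' rule

# counts from the "money" slide: 1500 spam docs, 3672 ham docs
print(p_spam_given_word(194, 1500, 50, 3672))   # ~0.80, the slide's P(spam|money) up to rounding
```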
  39. 39. Naive Bayes Represent each document by a binary vector x = (x_1, . . . , x_J), where x_j = 1 if the j-th word appears in the document (x_j = 0 otherwise). Modeling each word as an independent Bernoulli random variable, the probability of observing a document x of class c is: p(x|c) = ∏_j θ_jc^{x_j} (1 − θ_jc)^{1−x_j}, where θ_jc denotes the probability that the j-th word occurs in a document of class c. Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 12 / 16
  40. 40. Naive Bayes Using this likelihood in Bayes’ rule and taking a logarithm, we have: log p(c|x) = log [p(x|c) p(c) / p(x)] = Σ_j x_j log(θ_jc / (1 − θ_jc)) + Σ_j log(1 − θ_jc) + log θ_c − log p(x), where θ_c is the probability of observing a document of class c. Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 13 / 16
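
A minimal sketch of the Bernoulli naive Bayes scoring rule above, written directly from the formula (dropping the constant −log p(x), which is the same for every class, so comparing class scores is unaffected). The toy vocabulary and parameter values are invented for illustration.

```python
# Sketch: log p(c|x) for Bernoulli naive Bayes, up to the constant -log p(x).
import numpy as np

def log_posterior(x, theta_c, theta_jc):
    """x: binary word-presence vector; theta_c: class prior p(c);
    theta_jc: per-word occurrence probabilities theta_{jc} for class c."""
    return (np.sum(x * np.log(theta_jc / (1 - theta_jc)))   # sum_j x_j log(theta/(1-theta))
            + np.sum(np.log(1 - theta_jc))                  # + sum_j log(1 - theta)
            + np.log(theta_c))                               # + log p(c)

# toy 3-word vocabulary; all parameter values are made up for illustration
x = np.array([1, 0, 1])                      # words 1 and 3 appear in the document
score_spam = log_posterior(x, 0.29, np.array([0.13, 0.01, 0.20]))
score_ham  = log_posterior(x, 0.71, np.array([0.04, 0.40, 0.05]))
print("spam" if score_spam > score_ham else "ham")
```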
  41. 41. (a) big picture: surrogate convex loss functions
  42. 42. general Figure 4: Reminder: Surrogate Loss Functions
  43. 43. boosting Figure 5: ‘Cited by 12599’
  44. 44. tangent: logistic function as surrogate loss function ◮ define f (x) ≡ log p(y = 1|x)/p(y = −1|x) ∈ R
  45. 45. tangent: logistic function as surrogate loss function ◮ define f (x) ≡ log p(y = 1|x)/p(y = −1|x) ∈ R ◮ p(y = 1|x) + p(y = −1|x) = 1 → p(y|x) = 1/(1 + exp(−yf ))
  46. 46. tangent: logistic function as surrogate loss function ◮ define f (x) ≡ log p(y = 1|x)/p(y = −1|x) ∈ R ◮ p(y = 1|x) + p(y = −1|x) = 1 → p(y|x) = 1/(1 + exp(−yf )) ◮ − log₂ p({y_i}_1^N) = Σ_i log₂(1 + e^{−y_i f(x_i)}) ≡ Σ_i ℓ(y_i f(x_i))
  47. 47. tangent: logistic function as surrogate loss function ◮ define f (x) ≡ log p(y = 1|x)/p(y = −1|x) ∈ R ◮ p(y = 1|x) + p(y = −1|x) = 1 → p(y|x) = 1/(1 + exp(−yf )) ◮ − log₂ p({y_i}_1^N) = Σ_i log₂(1 + e^{−y_i f(x_i)}) ≡ Σ_i ℓ(y_i f(x_i)) ◮ ℓ′′ > 0, ℓ(µ) > 1[µ < 0] ∀µ ∈ R.
  48. 48. tangent: logistic function as surrogate loss function ◮ define f (x) ≡ log p(y = 1|x)/p(y = −1|x) ∈ R ◮ p(y = 1|x) + p(y = −1|x) = 1 → p(y|x) = 1/(1 + exp(−yf )) ◮ − log₂ p({y_i}_1^N) = Σ_i log₂(1 + e^{−y_i f(x_i)}) ≡ Σ_i ℓ(y_i f(x_i)) ◮ ℓ′′ > 0, ℓ(µ) > 1[µ < 0] ∀µ ∈ R. ◮ ∴ maximizing log-likelihood is minimizing a surrogate convex loss function for classification
  49. 49. tangent: logistic function as surrogate loss function ◮ define f (x) ≡ log p(y = 1|x)/p(y = −1|x) ∈ R ◮ p(y = 1|x) + p(y = −1|x) = 1 → p(y|x) = 1/(1 + exp(−yf )) ◮ − log₂ p({y_i}_1^N) = Σ_i log₂(1 + e^{−y_i f(x_i)}) ≡ Σ_i ℓ(y_i f(x_i)) ◮ ℓ′′ > 0, ℓ(µ) > 1[µ < 0] ∀µ ∈ R. ◮ ∴ maximizing log-likelihood is minimizing a surrogate convex loss function for classification ◮ but Σ_i log₂(1 + e^{−y_i w^T h(x_i)}) is not as easy as Σ_i e^{−y_i w^T h(x_i)}
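
To see the surrogate-loss point numerically, the sketch below evaluates the 0-1 loss 1[µ < 0], the logistic loss ℓ(µ) = log₂(1 + e^{−µ}), and the exponential loss e^{−µ} on a grid of margins µ = y f(x); both smooth, convex losses upper-bound the 0-1 loss, and the exponential one is the loss the boosting slides that follow minimize in closed form. The grid and printout are illustrative choices, not from the lecture.

```python
# Sketch: convex surrogate losses as upper bounds on the 0-1 loss,
# as functions of the margin mu = y * f(x).
import numpy as np

mu = np.linspace(-3, 3, 13)
zero_one = (mu < 0).astype(float)          # 1[mu < 0]
logistic = np.log2(1 + np.exp(-mu))        # ell(mu) from the slides
exponential = np.exp(-mu)                  # the loss boosting minimizes instead

assert np.all(logistic > zero_one)         # ell(mu) > 1[mu < 0] for all mu
assert np.all(exponential > zero_one)      # the exponential loss is also an upper bound
for m, z, l, e in zip(mu, zero_one, logistic, exponential):
    print(f"mu={m:+.1f}  0-1={z:.0f}  logistic={l:.3f}  exp={e:.3f}")
```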
  50. 50. boosting1 L: exponential surrogate loss function, summed over examples: ◮ L[F] = Σ_i exp(−y_i F(x_i))
  51. 51. boosting1 L: exponential surrogate loss function, summed over examples: ◮ L[F] = Σ_i exp(−y_i F(x_i)) ◮ = Σ_i exp(−y_i Σ_{t′≤t} w_{t′} h_{t′}(x_i)) ≡ L_t(w_t)
  52. 52. boosting1 L: exponential surrogate loss function, summed over examples: ◮ L[F] = Σ_i exp(−y_i F(x_i)) ◮ = Σ_i exp(−y_i Σ_{t′≤t} w_{t′} h_{t′}(x_i)) ≡ L_t(w_t) ◮ Draw h_t ∈ H, a large space of rules s.t. h(x) ∈ {−1, +1}
  53. 53. boosting1 L: exponential surrogate loss function, summed over examples: ◮ L[F] = Σ_i exp(−y_i F(x_i)) ◮ = Σ_i exp(−y_i Σ_{t′≤t} w_{t′} h_{t′}(x_i)) ≡ L_t(w_t) ◮ Draw h_t ∈ H, a large space of rules s.t. h(x) ∈ {−1, +1} ◮ label y ∈ {−1, +1}
  54. 54. boosting1 L: exponential surrogate loss function, summed over examples: ◮ L_{t+1}(w_t; w) ≡ Σ_i d_i^t exp(−y_i w h_{t+1}(x_i)) Punchlines: sparse, predictive, interpretable, fast (to execute), and easy to extend, e.g., trees, flexible hypothesis spaces, L1, L∞4, . . .
  55. 55. boosting1 L: exponential surrogate loss function, summed over examples: ◮ L_{t+1}(w_t; w) ≡ Σ_i d_i^t exp(−y_i w h_{t+1}(x_i)) ◮ = Σ_{y_i=h_{t+1}(x_i)} d_i^t e^{−w} + Σ_{y_i≠h_{t+1}(x_i)} d_i^t e^{+w} ≡ e^{−w} D_+ + e^{+w} D_− Punchlines: sparse, predictive, interpretable, fast (to execute), and easy to extend, e.g., trees, flexible hypothesis spaces, L1, L∞4, . . .
  56. 56. boosting1 L: exponential surrogate loss function, summed over examples: ◮ L_{t+1}(w_t; w) ≡ Σ_i d_i^t exp(−y_i w h_{t+1}(x_i)) ◮ = Σ_{y_i=h_{t+1}(x_i)} d_i^t e^{−w} + Σ_{y_i≠h_{t+1}(x_i)} d_i^t e^{+w} ≡ e^{−w} D_+ + e^{+w} D_− ◮ ∴ w_{t+1} = argmin_w L_{t+1}(w) = (1/2) log(D_+/D_−) Punchlines: sparse, predictive, interpretable, fast (to execute), and easy to extend, e.g., trees, flexible hypothesis spaces, L1, L∞4, . . .
  57. 57. boosting1 L: exponential surrogate loss function, summed over examples: ◮ L_{t+1}(w_t; w) ≡ Σ_i d_i^t exp(−y_i w h_{t+1}(x_i)) ◮ = Σ_{y_i=h_{t+1}(x_i)} d_i^t e^{−w} + Σ_{y_i≠h_{t+1}(x_i)} d_i^t e^{+w} ≡ e^{−w} D_+ + e^{+w} D_− ◮ ∴ w_{t+1} = argmin_w L_{t+1}(w) = (1/2) log(D_+/D_−) ◮ L_{t+1}(w_{t+1}) = 2√(D_+ D_−) = 2√(ν_+(1 − ν_+)) D, where 0 ≤ ν_+ ≡ D_+/D = D_+/L_t ≤ 1 Punchlines: sparse, predictive, interpretable, fast (to execute), and easy to extend, e.g., trees, flexible hypothesis spaces, L1, L∞4, . . .
  58. 58. boosting1 L: exponential surrogate loss function, summed over examples: ◮ L_{t+1}(w_t; w) ≡ Σ_i d_i^t exp(−y_i w h_{t+1}(x_i)) ◮ = Σ_{y_i=h_{t+1}(x_i)} d_i^t e^{−w} + Σ_{y_i≠h_{t+1}(x_i)} d_i^t e^{+w} ≡ e^{−w} D_+ + e^{+w} D_− ◮ ∴ w_{t+1} = argmin_w L_{t+1}(w) = (1/2) log(D_+/D_−) ◮ L_{t+1}(w_{t+1}) = 2√(D_+ D_−) = 2√(ν_+(1 − ν_+)) D, where 0 ≤ ν_+ ≡ D_+/D = D_+/L_t ≤ 1 ◮ update example weights d_i^{t+1} = d_i^t e^{∓w} Punchlines: sparse, predictive, interpretable, fast (to execute), and easy to extend, e.g., trees, flexible hypothesis spaces, L1, L∞4, . . . 4 Duchi + Singer “Boosting with structural sparsity” ICML ’09
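
A minimal sketch of one boosting round under the exponential loss, following the closed-form updates above: given the current example weights d_i^t, the labels, and a weak rule's predictions, it computes D_+, D_−, the optimal w_{t+1} = (1/2) log(D_+/D_−), and the reweighted d_i^{t+1}. The toy labels and rule are invented; this illustrates the update and is not the lecture's own code.

```python
# Sketch: one round of boosting with the exponential surrogate loss,
# following the closed-form updates on the slides above.
import numpy as np

def boost_round(d, y, h):
    """d: current example weights d_i^t; y, h: labels and weak-rule
    predictions in {-1, +1}. Returns (w_{t+1}, updated weights d^{t+1})."""
    correct = (y == h)
    D_plus = d[correct].sum()               # weight where the rule is right
    D_minus = d[~correct].sum()             # weight where the rule is wrong
    w = 0.5 * np.log(D_plus / D_minus)      # argmin_w of e^{-w} D_+ + e^{+w} D_-
    d_new = d * np.exp(-w * y * h)          # d_i^{t+1} = d_i^t e^{-/+ w}
    return w, d_new

# toy example: 6 points, a rule that gets 4 of them right
y = np.array([+1, +1, +1, -1, -1, -1])
h = np.array([+1, +1, -1, -1, -1, +1])
w, d = boost_round(np.ones(6), y, h)
print(w, d)                                          # mistakes upweighted, correct examples downweighted
print(np.isclose(d.sum(), 2 * np.sqrt(4.0 * 2.0)))   # L_{t+1}(w_{t+1}) = 2 sqrt(D_+ D_-)
```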
  59. 59. svm
