# Modeling Social Data, Lecture 8: Classification

Published on http://modelingsocialdata.org

1. classification chris.wiggins@columbia.edu 2017-03-10
2. wat?
3. example: spam/ham (cf. jake’s great deck on this)
4. Learning by example Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 2 / 16
5. Learning by example • How did you solve this problem? • Can you make this process explicit (e.g. write code to do so)? Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 2 / 16
6. classification? build a theory of 3’s?
7. 1-slide summary of classification • banana or orange? what would Gauss do?
8. 1-slide summary of classification what would Gauss do? length height • banana or orange?
9. 1-slide summary of classification length height price smell time of purchase • banana or orange?
10. 1-slide summary of classification length height game theory: “assume the worst” • banana or orange?
11. 1-slide summary of classification length height large deviation theory: “maximum margin” • banana or orange?
12. 1-slide summary of classification length height large deviation theory: “maximum margin” • banana or orange?
13. 1-slide summary of classification length height price smell time of purchase boosting (1997) SVMs (1990s) • banana or orange?
14. 1-slide summary of classification “acgt” & gene 45 down? “cat” & gene 11 up? “tag” & gene 34 up? “gataca” & gene 37 down? learn predictive features from data “gaga” & gene 1066 up? “gataca” & gene 37 down? • up- or down- regulated? “gaga” & gene 137 up?
16. example@NYT in CAR (computer assisted reporting) Figure 1: Tabuchi article
17. example in CAR (computer assisted reporting) ◮ cf. Freedman’s “Statistical Models and Shoe Leather”
18. example in CAR (computer assisted reporting) ◮ cf. Freedman’s “Statistical Models and Shoe Leather” ◮ Takata airbag fatalities
19. example in CAR (computer assisted reporting) ◮ cf. Freedman’s “Statistical Models and Shoe Leather” ◮ Takata airbag fatalities ◮ 2219 labeled examples from 33,204 comments
20. example in CAR (computer assisted reporting) ◮ cf. Freedman’s “Statistical Models and Shoe Leather” ◮ Takata airbag fatalities ◮ 2219 labeled examples from 33,204 comments ◮ cf. Box’s “Science and Statistics”
21. computer assisted reporting ◮ Impact Figure 3: impact
22. conjecture: cost function?
23. fallback: probability
24. review: regression as probability
25. classification as probability binary/dichotomous/boolean features + NB
26. digression: bayes rule generalize, maintain linearity
27. Learning by example • How did you solve this problem? • Can you make this process explicit (e.g. write code to do so)? Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 2 / 16
28. Diagnoses a la Bayes [1] • You’re testing for a rare disease: • 1% of the population is infected • You have a highly sensitive and specific test: • 99% of sick patients test positive • 99% of healthy patients test negative • Given that a patient tests positive, what is the probability the patient is sick? [1] Wiggins, SciAm 2006. Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 3 / 16
29. Diagnoses a la Bayes: Population 10,000 ppl → 1% Sick (100 ppl): 99% Test + (99 ppl), 1% Test − (1 ppl); 99% Healthy (9,900 ppl): 1% Test + (99 ppl), 99% Test − (9,801 ppl). Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 4 / 16
30. Diagnoses a la Bayes: Population 10,000 ppl → 1% Sick (100 ppl): 99% Test + (99 ppl), 1% Test − (1 ppl); 99% Healthy (9,900 ppl): 1% Test + (99 ppl), 99% Test − (9,801 ppl). So given that a patient tests positive (198 ppl), there is a 50% chance the patient is sick (99 ppl)! Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 4 / 16
31. Diagnoses a la Bayes: Population 10,000 ppl → 1% Sick (100 ppl): 99% Test + (99 ppl), 1% Test − (1 ppl); 99% Healthy (9,900 ppl): 1% Test + (99 ppl), 99% Test − (9,801 ppl). The small error rate on the large healthy population produces many false positives. Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 4 / 16
32. Natural frequencies a la Gigerenzer (http://bit.ly/ggbbc). Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 5 / 16
33. Inverting conditional probabilities: Bayes’ Theorem. Equate the far right- and left-hand sides of the product rule p(y|x) p(x) = p(x, y) = p(x|y) p(y) and divide to get the probability of y given x from the probability of x given y: p(y|x) = p(x|y) p(y) / p(x), where p(x) = Σ_{y∈Ω_Y} p(x|y) p(y) is the normalization constant. Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 6 / 16
34. Diagnoses a la Bayes. Given that a patient tests positive, what is the probability the patient is sick? p(sick|+) = p(+|sick) p(sick) / p(+) = (99/100)(1/100) / (99/100² + 99/100²) = (99/100²) / (198/100²) = 99/198 = 1/2, where p(+) = p(+|sick) p(sick) + p(+|healthy) p(healthy). Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 7 / 16
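A minimal sketch of this calculation in Python, with the numbers taken from the slides (1% prevalence, 99% sensitivity and specificity); the function name is mine, for illustration only:

```python
# Bayes' rule for the rare-disease example: p(sick|+) = p(+|sick) p(sick) / p(+)
def posterior_sick_given_positive(p_sick=0.01, sensitivity=0.99, specificity=0.99):
    p_healthy = 1.0 - p_sick
    p_pos_given_sick = sensitivity           # p(+|sick)
    p_pos_given_healthy = 1.0 - specificity  # p(+|healthy), the false-positive rate
    # normalization constant: p(+) = p(+|sick) p(sick) + p(+|healthy) p(healthy)
    p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * p_healthy
    return p_pos_given_sick * p_sick / p_pos

print(round(posterior_sick_given_positive(), 3))  # 0.5, i.e. the 99/198 on the slide
```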
35. (Super) Naive Bayes. We can use Bayes’ rule to build a one-word spam classifier: p(spam|word) = p(word|spam) p(spam) / p(word), where we estimate these probabilities with ratios of counts: p̂(word|spam) = (# spam docs containing word) / (# spam docs), p̂(word|ham) = (# ham docs containing word) / (# ham docs), p̂(spam) = (# spam docs) / (# docs), p̂(ham) = (# ham docs) / (# docs). Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 8 / 16
36. (Super) Naive Bayes \$ ./enron_naive_bayes.sh meeting 1500 spam examples 3672 ham examples 16 spam examples containing meeting 153 ham examples containing meeting estimated P(spam) = .2900 estimated P(ham) = .7100 estimated P(meeting|spam) = .0106 estimated P(meeting|ham) = .0416 P(spam|meeting) = .0923 Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 9 / 16
37. (Super) Naive Bayes \$ ./enron_naive_bayes.sh money 1500 spam examples 3672 ham examples 194 spam examples containing money 50 ham examples containing money estimated P(spam) = .2900 estimated P(ham) = .7100 estimated P(money|spam) = .1293 estimated P(money|ham) = .0136 P(spam|money) = .7957 Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 10 / 16
38. (Super) Naive Bayes \$ ./enron_naive_bayes.sh enron 1500 spam examples 3672 ham examples 0 spam examples containing enron 1478 ham examples containing enron estimated P(spam) = .2900 estimated P(ham) = .7100 estimated P(enron|spam) = 0 estimated P(enron|ham) = .4025 P(spam|enron) = 0 Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 11 / 16
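The enron_naive_bayes.sh script itself is not reproduced here; below is a hedged Python sketch of the same count-based estimate, using the counts for "money" read off the slide:

```python
# One-word "super naive" Bayes: p(spam|word) = p(word|spam) p(spam) / p(word)
def p_spam_given_word(n_spam, n_ham, n_spam_with_word, n_ham_with_word):
    n_docs = n_spam + n_ham
    p_spam, p_ham = n_spam / n_docs, n_ham / n_docs
    p_word_given_spam = n_spam_with_word / n_spam
    p_word_given_ham = n_ham_with_word / n_ham
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham  # normalization p(word)
    return p_word_given_spam * p_spam / p_word

# counts for "money" from the slide: 1500 spam / 3672 ham docs, 194 / 50 containing the word
print(round(p_spam_given_word(1500, 3672, 194, 50), 4))
# 0.7951 (the slide's .7957 differs slightly, presumably from rounded intermediate values)
```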
39. Naive Bayes. Represent each document by a binary vector x⃗ where x_j = 1 if the j-th word appears in the document (x_j = 0 otherwise). Modeling each word as an independent Bernoulli random variable, the probability of observing a document x⃗ of class c is: p(x⃗|c) = Π_j θ_jc^(x_j) (1 − θ_jc)^(1−x_j), where θ_jc denotes the probability that the j-th word occurs in a document of class c. Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 12 / 16
40. Naive Bayes. Using this likelihood in Bayes’ rule and taking a logarithm, we have: log p(c|x⃗) = log[p(x⃗|c) p(c) / p(x⃗)] = Σ_j x_j log(θ_jc / (1 − θ_jc)) + Σ_j log(1 − θ_jc) + log θ_c − log p(x⃗), where θ_c is the probability of observing a document of class c. Jake Hofman (Columbia University) Classification: Naive Bayes February 27, 2015 13 / 16
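A short Python sketch of this Bernoulli naive Bayes model; the Laplace smoothing constant and all variable names are my additions, not part of the slides:

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    # X: (n_docs, n_words) binary matrix; y: array of class labels
    classes = np.unique(y)
    # smoothed estimate of theta_jc = p(word j appears | class c)
    theta = {c: (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
             for c in classes}
    prior = {c: (y == c).mean() for c in classes}  # theta_c = p(class c)
    return theta, prior

def log_posterior(x, theta, prior):
    # unnormalized log p(c|x) = sum_j x_j log(theta_jc/(1-theta_jc)) + sum_j log(1-theta_jc) + log theta_c
    return {c: x @ np.log(t / (1 - t)) + np.log(1 - t).sum() + np.log(prior[c])
            for c, t in theta.items()}
```

The −log p(x⃗) term is the same for every class, so it is dropped when comparing classes.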
41. (a) big picture: surrogate convex loss functions
42. general Figure 4: Reminder: Surrogate Loss Functions
43. boosting Figure 5: ‘Cited by 12599’
44. tangent: logistic function as surrogate loss function ◮ define f(x) ≡ log[p(y = 1|x) / p(y = −1|x)] ∈ R
45. tangent: logistic function as surrogate loss function ◮ define f(x) ≡ log[p(y = 1|x) / p(y = −1|x)] ∈ R ◮ p(y = 1|x) + p(y = −1|x) = 1 → p(y|x) = 1/(1 + exp(−y f(x)))
46. tangent: logistic function as surrogate loss function ◮ define f(x) ≡ log[p(y = 1|x) / p(y = −1|x)] ∈ R ◮ p(y = 1|x) + p(y = −1|x) = 1 → p(y|x) = 1/(1 + exp(−y f(x))) ◮ −log₂ p({y_i}, i = 1..N) = Σ_i log₂(1 + e^(−y_i f(x_i))) ≡ Σ_i ℓ(y_i f(x_i))
47. tangent: logistic function as surrogate loss function ◮ define f(x) ≡ log[p(y = 1|x) / p(y = −1|x)] ∈ R ◮ p(y = 1|x) + p(y = −1|x) = 1 → p(y|x) = 1/(1 + exp(−y f(x))) ◮ −log₂ p({y_i}, i = 1..N) = Σ_i log₂(1 + e^(−y_i f(x_i))) ≡ Σ_i ℓ(y_i f(x_i)) ◮ ℓ′′ > 0, ℓ(µ) > 1[µ < 0] ∀ µ ∈ R.
48. tangent: logistic function as surrogate loss function ◮ define f(x) ≡ log[p(y = 1|x) / p(y = −1|x)] ∈ R ◮ p(y = 1|x) + p(y = −1|x) = 1 → p(y|x) = 1/(1 + exp(−y f(x))) ◮ −log₂ p({y_i}, i = 1..N) = Σ_i log₂(1 + e^(−y_i f(x_i))) ≡ Σ_i ℓ(y_i f(x_i)) ◮ ℓ′′ > 0, ℓ(µ) > 1[µ < 0] ∀ µ ∈ R. ◮ ∴ maximizing log-likelihood is minimizing a surrogate convex loss function for classification
49. tangent: logistic function as surrogate loss function ◮ define f(x) ≡ log[p(y = 1|x) / p(y = −1|x)] ∈ R ◮ p(y = 1|x) + p(y = −1|x) = 1 → p(y|x) = 1/(1 + exp(−y f(x))) ◮ −log₂ p({y_i}, i = 1..N) = Σ_i log₂(1 + e^(−y_i f(x_i))) ≡ Σ_i ℓ(y_i f(x_i)) ◮ ℓ′′ > 0, ℓ(µ) > 1[µ < 0] ∀ µ ∈ R. ◮ ∴ maximizing log-likelihood is minimizing a surrogate convex loss function for classification ◮ but Σ_i log₂(1 + e^(−y_i wᵀh(x_i))) not as easy as Σ_i e^(−y_i wᵀh(x_i))
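A quick numeric check (mine, not from the slides) that ℓ(µ) = log₂(1 + e^(−µ)) behaves as claimed: it upper-bounds the 0-1 loss 1[µ < 0] and is convex:

```python
import numpy as np

mu = np.linspace(-3, 3, 13)              # margins mu = y * f(x)
logistic = np.log2(1 + np.exp(-mu))      # surrogate loss l(mu)
zero_one = (mu < 0).astype(float)        # 0-1 loss 1[mu < 0]

assert np.all(logistic > zero_one)       # l(mu) > 1[mu < 0] at every grid point
assert np.all(np.diff(logistic, 2) > 0)  # positive second differences, consistent with l'' > 0
```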
50. boosting: L = exponential surrogate loss function, summed over examples: ◮ L[F] = Σ_i exp(−y_i F(x_i))
51. boosting: L = exponential surrogate loss function, summed over examples: ◮ L[F] = Σ_i exp(−y_i F(x_i)) ◮ = Σ_i exp(−y_i Σ_{t′≤t} w_{t′} h_{t′}(x_i)) ≡ L_t(w_t)
52. boosting: L = exponential surrogate loss function, summed over examples: ◮ L[F] = Σ_i exp(−y_i F(x_i)) ◮ = Σ_i exp(−y_i Σ_{t′≤t} w_{t′} h_{t′}(x_i)) ≡ L_t(w_t) ◮ Draw h_t ∈ H, a large space of rules, s.t. h(x) ∈ {−1, +1}
53. boosting: L = exponential surrogate loss function, summed over examples: ◮ L[F] = Σ_i exp(−y_i F(x_i)) ◮ = Σ_i exp(−y_i Σ_{t′≤t} w_{t′} h_{t′}(x_i)) ≡ L_t(w_t) ◮ Draw h_t ∈ H, a large space of rules, s.t. h(x) ∈ {−1, +1} ◮ label y ∈ {−1, +1}
54. boosting: L = exponential surrogate loss function, summed over examples: ◮ L_{t+1}(w_t; w) ≡ Σ_i d_i^t exp(−y_i w h_{t+1}(x_i)) Punchlines: sparse, predictive, interpretable, fast (to execute), and easy to extend, e.g., trees, flexible hypothesis spaces, L1, L∞ [4], . . .
55. boosting: L = exponential surrogate loss function, summed over examples: ◮ L_{t+1}(w_t; w) ≡ Σ_i d_i^t exp(−y_i w h_{t+1}(x_i)) ◮ = Σ_{i: y_i = h_{t+1}(x_i)} d_i^t e^(−w) + Σ_{i: y_i ≠ h_{t+1}(x_i)} d_i^t e^(+w) ≡ e^(−w) D_+ + e^(+w) D_− Punchlines: sparse, predictive, interpretable, fast (to execute), and easy to extend, e.g., trees, flexible hypothesis spaces, L1, L∞ [4], . . .
56. boosting: L = exponential surrogate loss function, summed over examples: ◮ L_{t+1}(w_t; w) ≡ Σ_i d_i^t exp(−y_i w h_{t+1}(x_i)) ◮ = Σ_{i: y_i = h_{t+1}(x_i)} d_i^t e^(−w) + Σ_{i: y_i ≠ h_{t+1}(x_i)} d_i^t e^(+w) ≡ e^(−w) D_+ + e^(+w) D_− ◮ ∴ w_{t+1} = argmin_w L_{t+1}(w) = (1/2) log(D_+/D_−) Punchlines: sparse, predictive, interpretable, fast (to execute), and easy to extend, e.g., trees, flexible hypothesis spaces, L1, L∞ [4], . . .
57. boosting: L = exponential surrogate loss function, summed over examples: ◮ L_{t+1}(w_t; w) ≡ Σ_i d_i^t exp(−y_i w h_{t+1}(x_i)) ◮ = Σ_{i: y_i = h_{t+1}(x_i)} d_i^t e^(−w) + Σ_{i: y_i ≠ h_{t+1}(x_i)} d_i^t e^(+w) ≡ e^(−w) D_+ + e^(+w) D_− ◮ ∴ w_{t+1} = argmin_w L_{t+1}(w) = (1/2) log(D_+/D_−) ◮ L_{t+1}(w_{t+1}) = 2√(D_+ D_−) = 2√(ν_+(1 − ν_+)) D, where 0 ≤ ν_+ ≡ D_+/D = D_+/L_t ≤ 1 Punchlines: sparse, predictive, interpretable, fast (to execute), and easy to extend, e.g., trees, flexible hypothesis spaces, L1, L∞ [4], . . .
58. boosting: L = exponential surrogate loss function, summed over examples: ◮ L_{t+1}(w_t; w) ≡ Σ_i d_i^t exp(−y_i w h_{t+1}(x_i)) ◮ = Σ_{i: y_i = h_{t+1}(x_i)} d_i^t e^(−w) + Σ_{i: y_i ≠ h_{t+1}(x_i)} d_i^t e^(+w) ≡ e^(−w) D_+ + e^(+w) D_− ◮ ∴ w_{t+1} = argmin_w L_{t+1}(w) = (1/2) log(D_+/D_−) ◮ L_{t+1}(w_{t+1}) = 2√(D_+ D_−) = 2√(ν_+(1 − ν_+)) D, where 0 ≤ ν_+ ≡ D_+/D = D_+/L_t ≤ 1 ◮ update example weights d_i^{t+1} = d_i^t e^(∓w) Punchlines: sparse, predictive, interpretable, fast (to execute), and easy to extend, e.g., trees, flexible hypothesis spaces, L1, L∞ [4], . . . [4] Duchi + Singer, “Boosting with structural sparsity”, ICML ’09
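A sketch of a single round of this update in Python, as I read the slides (weights, labels, and weak-rule predictions are assumed given; all names are mine):

```python
import numpy as np

def boosting_round(d, y, h_pred):
    # d: current example weights d_i^t; y, h_pred: labels and weak-rule predictions in {-1, +1}
    D_plus = d[y == h_pred].sum()        # total weight of correctly classified examples
    D_minus = d[y != h_pred].sum()       # total weight of misclassified examples
    w = 0.5 * np.log(D_plus / D_minus)   # argmin_w of e^(-w) D_+ + e^(+w) D_-
    d_new = d * np.exp(-w * y * h_pred)  # d_i^{t+1} = d_i^t e^(-w) if correct, e^(+w) if not
    loss = 2 * np.sqrt(D_plus * D_minus) # L_{t+1}(w_{t+1}) at the optimal w
    return w, d_new, loss
```

Iterating this step with a new rule h_{t+1} drawn from H each round gives an AdaBoost-style ensemble F(x) = Σ_t w_t h_t(x).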
59. svm