Uncertainty sampling. Maintain a single hypothesis, based on labels seen so far. Query the point about which this hypothesis is most uncertain.
Region of uncertainty. Current version space: the portion of H consistent with the labels so far. "Region of uncertainty" = the part of data space about which there is still some uncertainty.
Region of uncertainty. Algorithm [CAL92]: of the unlabeled points which lie in the region of uncertainty, pick one at random to query.
Region of uncertainty. The number of labels needed depends on H and also on P. Special case: H = {linear separators in Rd}, P = uniform distribution.
Expected infogain of a sample: uncertainty sampling!
  1. Active Learning. COMS 6998-4: Learning and Empirical Inference. Irina Rish, IBM T.J. Watson Research Center
  2. Outline: Motivation • Active learning approaches (membership queries, uncertainty sampling, information-based loss functions, uncertainty-region sampling, query by committee) • Applications (active collaborative prediction, active Bayes net learning)
  3. Standard supervised learning model. Given m labeled points, want to learn a classifier with misclassification rate < ε, chosen from a hypothesis class H with VC dimension d < ∞. VC theory: need m to be roughly d/ε, in the realizable case.
  4. Active learning. In many situations – like speech recognition and document retrieval – unlabeled data is easy to come by, but there is a charge for each label. What is the minimum number of labels needed to achieve the target error rate?
  6. What is Active Learning? Unlabeled data are readily available; labels are expensive. Want to use adaptive decisions to choose which labels to acquire for a given dataset. The goal is an accurate classifier with minimal cost.
  7. Active learning warning. The choice of data is only as good as the model itself. Assume a linear model, and two data points are sufficient. What happens when the data are not linear?
  8. Active Learning Flavors: Selective Sampling vs. Membership Queries • Pool vs. Sequential • Myopic vs. Batch
  9. Active Learning Approaches: Membership queries • Uncertainty Sampling • Information-based loss functions • Uncertainty-Region Sampling • Query by committee
  12. Problem. Many results in this framework, even for complicated hypothesis classes. [Baum and Lang, 1991] tried fitting a neural net to handwritten characters; the synthetic instances created were incomprehensible to humans! [Lewis and Gale, 1992] tried training text classifiers: "an artificial text created by a learning algorithm is unlikely to be a legitimate natural language expression, and probably would be uninterpretable by a human teacher."
  16. Uncertainty Sampling [Lewis & Gale, 1994]. Query the event that the current classifier is most uncertain about. If uncertainty is measured in Euclidean distance, query the point closest to the decision boundary. (Figure: labeled points on either side of a separator, with the query nearest the boundary.) Used trivially in SVMs, graphical models, etc.
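As a concrete illustration (not from the slides), the uncertainty-sampling loop fits in a few lines; the classifier and pool below are hypothetical stand-ins for any model that outputs class probabilities:

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def most_uncertain(predict_proba, pool):
    """Index of the unlabeled example with maximum predictive entropy."""
    return max(range(len(pool)),
               key=lambda i: entropy(predict_proba(pool[i])))

# Toy stand-in classifier: each pool item is already P(class = 1).
predict_proba = lambda x: (x, 1 - x)
pool = [0.02, 0.48, 0.97, 0.60]
print(most_uncertain(predict_proba, pool))   # 1 -> the 0.48 example is queried
```

The example near probability 0.5 has the highest predictive entropy, so it is the one queried.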
  17. Information-based Loss Functions [MacKay, 1992]. Maximize the KL-divergence between posterior and prior • Maximize the reduction in model entropy between posterior and prior • Minimize the cross-entropy between posterior and prior. All of these are notions of information gain.
  18. Query by Committee [Seung et al. 1992, Freund et al. 1997]. Assume a prior distribution over hypotheses; sample a set of classifiers from that distribution; query an example based on the degree of disagreement between the committee of classifiers. (Figure: labeled points and three candidate queries A, B, C.)
  19. Infogain-based Active Learning
  20. Notation. We have: • Dataset, D • Model parameter space, W • Query algorithm, q
  21. Dataset (D) Example:
      t  Sex  Age    TestA  TestB  TestC  Disease
      0  M    40-50  0      1      1      ?
      1  F    50-60  0      1      0      ?
      2  F    30-40  0      0      0      ?
      3  F    60+    1      1      1      ?
      4  M    10-20  0      1      0      ?
      5  M    40-50  0      0      1      ?
      6  F    0-10   0      0      0      ?
      7  M    30-40  1      1      0      ?
      8  M    20-30  0      0      1      ?
  22. Notation. We have: • Dataset, D • Model parameter space, W • Query algorithm, q
  23. Model Example: a probabilistic classifier with class node St and observation node Ot. Notation: T = number of examples; Ot = vector of features of example t; St = class of example t.
  24. Model Example. Patient state (St): St = DiseaseState. Patient observations (Ot): Ot1 = Gender, Ot2 = Age, Ot3 = TestA, Ot4 = TestB, Ot5 = TestC.
  25. Possible Model Structures. (Figure: two candidate network structures over the class node S and the features Gender, Age, TestA, TestB, TestC.)
  26. Model Space. Model: St → Ot. Model parameters: P(St), P(Ot|St). Generative model: must be able to compute P(St=i, Ot=ot | w).
  27. Model Parameter Space (W). W = the space of possible parameter values. Prior on parameters: P(W). Posterior over models: P(W|D) ∝ P(D|W) P(W) ∝ [∏_{t=1..T} P(St, Ot | W)] P(W)
  28. Notation. We have: • Dataset, D • Model parameter space, W • Query algorithm, q. q(W,D) returns t*, the next sample to label.
  29. Game: while NotDone • Learn P(W | D) • q chooses the next example to label • Expert adds the label to D
  30. Simulation. (Figure: examples O1…O7 with hidden states S1…S7; some St are labeled true/false, the rest are still "?", and q must decide which one to query next.)
  31. Active Learning Flavors: • Pool ("random access" to patients) • Sequential (must decide as patients walk in the door)
  32. q? Recall: q(W,D) returns the "most interesting" unlabelled example. Well, what makes a doctor curious about a patient?
  33. 1994
  34. Score Function: score_uncert(St) = uncertainty(P(St | Ot)) = H(St) = −Σ_i P(St = i) log P(St = i)
  35. Uncertainty Sampling Example:
      t  Sex  Age    TestA  TestB  TestC  St     P(St)  H(St)
      1  M    20-30  0      1      1      ?      0.02   0.043
      2  F    20-30  0      1      0      ?      0.01   0.024
      3  F    30-40  1      0      0      ?      0.05   0.086
      4  F    60+    1      1      0      FALSE  0.12   0.159
      5  M    10-20  0      1      0      ?      0.01   0.024
      6  M    20-30  1      1      1      ?      0.96   0.073
  36. Uncertainty Sampling Example (after updating):
      t  Sex  Age    TestA  TestB  TestC  St     P(St)  H(St)
      1  M    20-30  0      1      1      ?      0.01   0.024
      2  F    20-30  0      1      0      ?      0.02   0.043
      3  F    30-40  1      0      0      ?      0.04   0.073
      4  F    60+    1      1      0      FALSE  0.00   0.00
      5  M    10-20  0      1      0      TRUE   0.06   0.112
      6  M    20-30  1      1      1      ?      0.97   0.059
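The H(St) column can be reproduced with a binary entropy function; the numbers on the slide match logarithms taken to base 10 (an assumption inferred from the values, e.g. H(0.12) ≈ 0.159):

```python
import math

def binary_entropy(p, base=10):
    """H(St) for a binary class variable with P(St = true) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p, base) + (1 - p) * math.log(1 - p, base))

# P(St) values from the example table
for p in (0.02, 0.01, 0.05, 0.12, 0.96):
    print(round(binary_entropy(p), 3))
# prints 0.043, 0.024, 0.086, 0.159, 0.073 -- the H(St) values on the slide
```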
  37. Uncertainty Sampling. GOOD: couldn't be easier. GOOD: often performs pretty well. BAD: H(St) measures information gain about the samples, not the model, and is sensitive to noisy samples.
  38. Can we do better than uncertainty sampling?
  39. 1992: Informative with respect to what?
  40. Model Entropy. (Figure: three posteriors P(W|D) over W, from flat — H(W) high — through intermediate — "better" — to a point mass with H(W) = 0.)
  41. Information-Gain. Choose the example that is expected to most reduce H(W), i.e., maximize H(W) − H(W | St): the current model-space entropy minus the expected model-space entropy if we learn St.
  42. Score Function: score_IG(St) = MI(St; W) = H(W) − H(W | St)
  43. We usually can't just integrate over all models to compute H(W) = −∫ P(w) log P(w) dw … but we can sample a committee C from P(W | D): H(W) ≈ H(C) = −Σ_{c∈C} P(c) log P(c)
  44. Conditional Model Entropy:
      H(W) = −∫ P(w) log P(w) dw
      H(W | St = i) = −∫ P(w | St = i) log P(w | St = i) dw
      H(W | St) = −Σ_i P(St = i) ∫ P(w | St = i) log P(w | St = i) dw
  45. Score Function: score_IG(St) = H(C) − H(C | St)
  46. Information-gain example:
      t  Sex  Age    TestA  TestB  TestC  St  P(St)  Score = H(C) − H(C|St)
      1  M    20-30  0      1      1      ?   0.02   0.53
      2  F    20-30  0      1      0      ?   0.01   0.58
      3  F    30-40  1      0      0      ?   0.05   0.40
      4  F    60+    1      1      1      ?   0.12   0.49
      5  M    10-20  0      1      0      ?   0.01   0.57
      6  M    20-30  0      0      1      ?   0.02   0.52
  47. Score Function: score_IG(St) = H(C) − H(C | St) = H(St) − H(St | C). Familiar?
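The identity on the slide is just the symmetry of mutual information, and it can be checked numerically with a small sampled committee (the committee predictions below are made-up numbers, not from the talk):

```python
import math

def H(dist, base=2):
    """Entropy of a discrete distribution."""
    return -sum(p * math.log(p, base) for p in dist if p > 0)

def infogain(committee_preds):
    """committee_preds[c] = P(St = 1 | Ot, model c); committee weights uniform.
    Returns the score computed both ways to illustrate MI symmetry."""
    n = len(committee_preds)
    p1 = sum(committee_preds) / n            # P(St = 1), averaged over committee
    marg = (p1, 1 - p1)

    # Form 1: H(C) - H(C | St)
    h_c = math.log(n, 2)                     # uniform committee: H(C) = log2 n
    h_c_given = 0.0
    for i, p_i in enumerate(marg):           # i = 0 -> St = 1, i = 1 -> St = 0
        lik = [(p if i == 0 else 1 - p) / n for p in committee_preds]
        post = [l / p_i for l in lik]        # P(c | St = i), by Bayes' rule
        h_c_given += p_i * H(post)
    form1 = h_c - h_c_given

    # Form 2: H(St) - H(St | C)
    form2 = H(marg) - sum(H((p, 1 - p)) for p in committee_preds) / n
    return form1, form2

f1, f2 = infogain([0.9, 0.2, 0.6])
print(abs(f1 - f2) < 1e-9)   # True: the two forms coincide
```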
  48. Uncertainty Sampling & Information Gain: score_Uncertain(St) = H(St); score_InfoGain(St) = H(St) − H(St | C)
  49. But there is a problem…
  50. If our objective is to reduce the prediction error, then "the expected information gain of an unlabeled sample is NOT a sufficient criterion for constructing good queries."
  51. Strategy #2: Query by Committee. Temporary assumptions: Pool → Sequential; P(W | D) → Version Space; Probabilistic → Noiseless. QBC attacks the size of the "version space."
  52. (Figure: two models drawn from the version space both classify the next example FALSE — they agree.)
  53. (Figure: both models classify it TRUE — again they agree.)
  54. (Figure: Model #1 says FALSE, Model #2 says TRUE.) "Ooh, now we're going to learn something for sure! One of them is definitely wrong."
  55. The Original QBC Algorithm. As each example arrives… • Choose a committee, C (usually of size 2), randomly from the version space • Have each member of C classify it • If the committee disagrees, select it (query its label).
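The three steps above can be sketched on a toy version space of threshold classifiers (the hypotheses, stream, and labeling rule are illustrative assumptions, not from the talk):

```python
import random

def qbc_stream(stream, version_space, committee_size=2, seed=0):
    """Sequential query-by-committee (sketch): for each arriving example,
    draw a small random committee from the current version space and ask
    for the label only when the committee disagrees; then prune the
    version space to hypotheses consistent with the new label."""
    rng = random.Random(seed)
    queried = []
    for x, true_label in stream:        # the oracle's label, for the demo
        if len(version_space) < committee_size:
            break
        committee = rng.sample(version_space, committee_size)
        if len({h(x) for h in committee}) > 1:   # disagreement -> query
            queried.append(x)
            version_space = [h for h in version_space if h(x) == true_label]
    return queried, version_space

# Toy version space: threshold classifiers h_t(x) = 1 iff x >= t on [0, 1].
hyps = [lambda x, t=t: int(x >= t) for t in (0.1, 0.3, 0.5, 0.7, 0.9)]
stream = [(x, int(x >= 0.5)) for x in (0.05, 0.45, 0.2, 0.8, 0.55)]
queried, survivors = qbc_stream(stream, hyps)
print(all(h(x) == int(x >= 0.5) for h in survivors for x in queried))  # True
```

Every surviving hypothesis agrees with the true labels on the queried points, by construction of the pruning step.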
  56. 1992
  57. Infogain vs Query by Committee [Seung, Opper, Sompolinsky, 1992; Freund, Seung, Shamir, Tishby 1997]. First idea: try to rapidly reduce the volume of the version space? Problem: this doesn't take the data distribution into account. Which pair of hypotheses in H is closest? It depends on the data distribution P. Distance measure on H: d(h, h′) = P(h(x) ≠ h′(x)).
  58. Query-by-committee. First idea: try to rapidly reduce the volume of the version space? Problem: this doesn't take the data distribution into account. To keep things simple, say d(h, h′) = Euclidean distance. (Figure: in H, the error can remain large even as the volume shrinks.)
  59. Query-by-committee. An elegant scheme which decreases the volume in a manner that is sensitive to the data distribution. Bayesian setting: given a prior π on H, set H1 = H. For t = 1, 2, …: receive an unlabeled point xt drawn from P; choose two hypotheses h, h′ randomly from (π, Ht); if h(xt) ≠ h′(xt), ask for xt's label and set Ht+1. [Informally: is there a lot of disagreement about xt in Ht?] Problem: how to implement it efficiently?
  60. Query-by-committee. For t = 1, 2, …: receive an unlabeled point xt drawn from P; choose two hypotheses h, h′ randomly from (π, Ht); if h(xt) ≠ h′(xt), ask for xt's label and set Ht+1. Observation: the probability of getting the pair (h, h′) in the inner loop (when a query is made) is proportional to π(h) π(h′) d(h, h′).
  62. Query-by-committee label bound: for H = {linear separators in Rd} and P = uniform distribution, just d log 1/ε labels are needed to reach a hypothesis with error < ε. Implementation: need to randomly pick h according to (π, Ht), e.g. H = {linear separators in Rd}, π = uniform distribution. How do you pick a random point from a convex body?
  63. Sampling from convex bodies: by random walk! • Ball walk • Hit-and-run. [Gilad-Bachrach, Navot, Tishby 2005] studies random walks and also ways to kernelize QBC.
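As a toy illustration of the hit-and-run walk (here over the unit ball rather than a version space of separators — a simplifying assumption so the chord can be computed in closed form):

```python
import math
import random

def hit_and_run_ball(n_steps=1000, dim=3, seed=1):
    """Hit-and-run walk over the unit ball {x : ||x|| <= 1}: from the current
    point, draw a uniformly random direction, find the chord where that line
    meets the body, and jump to a uniform point on the chord."""
    rng = random.Random(seed)
    x = [0.0] * dim
    for _ in range(n_steps):
        d = [rng.gauss(0, 1) for _ in range(dim)]      # random direction
        norm = math.sqrt(sum(di * di for di in d))
        d = [di / norm for di in d]
        # Solve ||x + t d|| = 1 (with ||d|| = 1): t^2 + 2 b t + c = 0
        b = sum(xi * di for xi, di in zip(x, d))
        c = sum(xi * xi for xi in x) - 1.0             # <= 0 inside the ball
        disc = math.sqrt(b * b - c)
        t = rng.uniform(-b - disc, -b + disc)          # uniform on the chord
        x = [xi + t * di for xi, di in zip(x, d)]
    return x

x = hit_and_run_ball()
print(sum(xi * xi for xi in x) <= 1.0 + 1e-9)   # True: the walk stays inside
```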
  65. Some challenges: [1] For linear separators, analyze the label complexity for some distribution other than uniform! [2] How to handle nonseparable data? Need a robust base learner.
  66. Active Collaborative Prediction
  67. Approach: Collaborative Prediction (CP). Two example matrices: a QoS measure (e.g. bandwidth) between clients and servers, and movie ratings by users — both only partially observed. Given previously observed ratings R(x,y), where x is a "user" and y is a "product", predict unobserved ratings: will Alina like "The Matrix"? (unlikely) Will Client 86 have fast download from Server 39? Will member X of the funding committee approve our project Y?
  68. Collaborative Prediction = Matrix Approximation (e.g. a 100 clients × 100 servers matrix). Important assumption: matrix entries are NOT independent, e.g. similar users have similar tastes. Approaches: mainly factorized models assuming hidden 'factors' that affect ratings (pLSA, MCVQ, SVD, NMF, MMMF, …).
  69. (Figure: a sparse row of ratings decomposed into user 'weights' and movie 'factors'.) Assumptions: there is a number of (hidden) factors behind the user preferences that relate to (hidden) movie properties; movies have intrinsic values associated with such factors; users have intrinsic weights on such factors; user ratings are weighted (linear) combinations of the movies' factor values.
  72. (Figure: a sparse rating matrix Y approximated by a rank-k product X = UV′.) Objective: find a factorizable X = UV′ that approximates Y, X = arg min_X Loss(X, Y), and satisfies some "regularization" constraints (e.g. rank(X) ≤ k). The loss function depends on the nature of your problem.
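A minimal sketch of this objective on a toy matrix, using stochastic gradient descent over the observed entries with a small norm penalty (an illustrative stand-in for the regularization constraint; the data and hyperparameters are made up, and this is not the SDP formulation of MMMF discussed below):

```python
import random

def factorize(observed, n_rows, n_cols, k=2, lr=0.02, epochs=2000, seed=0):
    """Fit X = U V' to the observed entries of Y by SGD on squared loss
    plus a small norm penalty on the factor entries."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_rows)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_cols)]
    lam = 0.01
    for _ in range(epochs):
        for (i, j), y in observed.items():
            err = sum(U[i][f] * V[j][f] for f in range(k)) - y
            for f in range(k):
                u, v = U[i][f], V[j][f]
                U[i][f] -= lr * (err * v + lam * u)
                V[j][f] -= lr * (err * u + lam * v)
    return U, V

# Rank-1 toy matrix (rows [1,2,3] times columns [1,2]) with one entry held out
observed = {(0, 0): 1, (0, 1): 2, (1, 0): 2, (1, 1): 4, (2, 1): 6}
U, V = factorize(observed, 3, 2)
pred_20 = sum(U[2][f] * V[0][f] for f in range(2))
print(2.0 < pred_20 < 4.0)   # True: the held-out entry (2,0) lands near 3
```

The norm penalty plus small initialization biases the fit toward a low-norm (here, effectively rank-1) solution, which is what lets the missing entry be recovered at all from only five observations.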
  73. Matrix Factorization Approaches. Singular value decomposition (SVD) gives a low-rank approximation, but assumes a fully observed Y and sum-squared loss. In collaborative prediction, Y is only partially observed, and low-rank approximation becomes a non-convex problem with many local minima. Furthermore, we may not want sum-squared loss, but instead: accurate predictions (0/1 loss, approximated by hinge loss); cost-sensitive predictions (missing a good server vs. suggesting a bad one); ranking cost (e.g., suggest the k 'best' movies for a user) — NON-CONVEX PROBLEMS! Use instead the state-of-the-art Max-Margin Matrix Factorization [Srebro 05]: it replaces the bounded-rank constraint by a bounded norm on the U, V′ vectors, yields a convex optimization problem that can be solved exactly by semi-definite programming, and strongly relates to learning max-margin classifiers (SVMs). Exploit MMMF's properties to augment it with active sampling!
  74. Key Idea of MMMF. Rows are feature vectors and columns are linear classifiers; the "margin" is the distance from a sample to the classifier's separating line. Xij = signij × marginij; Predictorij = signij: if signij > 0, classify as +1, otherwise classify as −1.
  75. MMMF: Simultaneous Search for Low-norm Feature Vectors and Max-margin Classifiers
  76. Active Learning with MMMF. We extend MMMF to Active-MMMF using margin-based active sampling, and investigate the exploitation vs. exploration trade-offs imposed by different heuristics. Margin-based heuristics: min-margin (most uncertain); min-margin positive ("good" uncertain); max-margin ('safe' choice, but no info).
  77. Active Max-Margin Matrix Factorization. A-MMMF(M, s): 1. Given a sparse matrix Y, learn the approximation X = MMMF(Y). 2. Using the current predictions, actively select the "best s" samples and request their labels (e.g., test a client/server pair via an 'enforced' download). 3. Add the new samples to Y. 4. Repeat 1–3. Issues: beyond simple greedy margin-based heuristics? Theoretical guarantees? Not so easy with non-trivial learning methods and non-trivial data distributions (any suggestions?).
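Step 2 — the margin-based selection — can be sketched as follows (a hypothetical helper with made-up numbers; the entries of X carry the predicted sign and the margin as their magnitude, as on slide 74):

```python
def select_queries(X, observed, s=2, heuristic="min-margin"):
    """Pick the s unobserved entries of the prediction matrix X according
    to a margin-based heuristic (sign = predicted label, |value| = margin)."""
    candidates = [(i, j) for i in range(len(X)) for j in range(len(X[0]))
                  if (i, j) not in observed]
    if heuristic == "min-margin":            # most uncertain prediction
        key = lambda ij: abs(X[ij[0]][ij[1]])
    elif heuristic == "min-margin-positive": # uncertain but predicted "good"
        candidates = [ij for ij in candidates if X[ij[0]][ij[1]] > 0]
        key = lambda ij: X[ij[0]][ij[1]]
    else:                                    # "max-margin": safe, little info
        key = lambda ij: -abs(X[ij[0]][ij[1]])
    return sorted(candidates, key=key)[:s]

X = [[0.9, -0.1], [0.05, -0.8]]
print(select_queries(X, observed={(0, 0)}, s=2))   # [(1, 0), (0, 1)]
```

The two smallest-margin unobserved entries are chosen first; switching the heuristic trades this exploration off against 'safer' high-margin choices.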
  78. Empirical Results: network latency prediction • bandwidth prediction (peer-to-peer) • movie ranking prediction • sensor-net connectivity prediction
  79. Empirical Results: Latency Prediction (P2Psim data, NLANR-AMP data). Active sampling with the most-uncertain (and most-uncertain-positive) heuristics provides consistent improvement over random and least-uncertain-next sampling.
  80. Movie Rating Prediction (MovieLens)
  81. Sensor Network Connectivity
  82. Introducing Cost: Exploration vs. Exploitation (DownloadGrid: bandwidth prediction; PlanetLab: latency prediction). Active sampling yields lower prediction errors at lower costs (saving 100s of samples); better prediction means better server-assignment decisions and faster downloads. Active sampling achieves a good exploration vs. exploitation trade-off: reduced decision cost AND information gain.
  83. Conclusions. A common challenge in many applications: the need for cost-efficient sampling. This talk: linear hidden-factor models with active sampling. Active sampling improves predictive accuracy while keeping sampling complexity low in a wide variety of applications. Future work: better active-sampling heuristics? Theoretical analysis of active-sampling performance? Dynamic matrix factorizations: tracking time-varying matrices. Incremental MMMF? (Solving from scratch every time is too costly.)
  84. References — some of the most influential papers:
  • Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research, 2:45–66, 2001.
  • Y. Freund, H. S. Seung, E. Shamir, N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133–168, 1997.
  • David Cohn, Zoubin Ghahramani, Michael Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996.
  • David Cohn, Les Atlas, Richard Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
  • D. J. C. MacKay. Information-Based Objective Functions for Active Data Selection. Neural Computation, 4(4):590–604, 1992.
  85. NIPS papers:
  • Francis Bach. Active learning for misspecified generalized linear models. NIPS-06
  • Ran Gilad-Bachrach, Amir Navot, Naftali Tishby. Query by Committee Made Real. NIPS-05
  • Brent Bryan, Jeff Schneider, Robert Nichol, Christopher Miller, Christopher Genovese, Larry Wasserman. Active Learning for Identifying Function Threshold Boundaries. NIPS-05
  • Rui Castro, Rebecca Willett, Robert Nowak. Faster Rates in Regression via Active Learning. NIPS-05
  • Sanjoy Dasgupta. Coarse sample complexity bounds for active learning. NIPS-05
  • Masashi Sugiyama. Active Learning for Misspecified Models. NIPS-05
  • Brigham Anderson, Andrew Moore. Fast Information Value for Graphical Models. NIPS-05
  • Dan Pelleg, Andrew W. Moore. Active Learning for Anomaly and Rare-Category Detection. NIPS-04
  • Sanjoy Dasgupta. Analysis of a greedy active learning strategy. NIPS-04
  • T. Jaakkola, H. Siegelmann. Active Information Retrieval. NIPS-01
  • M. K. Warmuth et al. Active Learning in the Drug Discovery Process. NIPS-01
  • Jonathan D. Nelson, Javier R. Movellan. Active Inference in Concept Learning. NIPS-00
  • Simon Tong, Daphne Koller. Active Learning for Parameter Estimation in Bayesian Networks. NIPS-00
  • Thomas Hofmann, Joachim M. Buhmann. Active Data Clustering. NIPS-97
  • K. Fukumizu. Active Learning in Multilayer Perceptrons. NIPS-95
  • Anders Krogh, Jesper Vedelsby. Neural Network Ensembles, Cross Validation, and Active Learning. NIPS-94
  • Kah Kay Sung, Partha Niyogi. Active Learning for Function Approximation. NIPS-94
  • David Cohn, Zoubin Ghahramani, Michael I. Jordan. Active Learning with Statistical Models. NIPS-94
  • Sebastian B. Thrun, Knut Möller. Active Exploration in Dynamic Environments. NIPS-91
  86. ICML and COLT papers:
  • Maria-Florina Balcan, Alina Beygelzimer, John Langford. Agnostic Active Learning. ICML-06
  • Steven C. H. Hoi, Rong Jin, Jianke Zhu, Michael R. Lyu. Batch Mode Active Learning and Its Application to Medical Image Classification. ICML-06
  • Sriharsha Veeramachaneni, Emanuele Olivetti, Paolo Avesani. Active Sampling for Detecting Irrelevant Features. ICML-06
  • Kai Yu, Jinbo Bi, Volker Tresp. Active Learning via Transductive Experimental Design. ICML-06
  • Rohit Singh, Nathan Palmer, David Gifford, Bonnie Berger, Ziv Bar-Joseph. Active Learning for Sampling in Time-Series Experiments With Application to Gene Expression Analysis. ICML-05
  • Prem Melville, Raymond Mooney. Diverse Ensembles for Active Learning. ICML-04
  • Klaus Brinker. Active Learning of Label Ranking Functions. ICML-04
  • Hieu Nguyen, Arnold Smeulders. Active Learning Using Pre-clustering. ICML-04
  • Greg Schohn, David Cohn. Less is More: Active Learning with Support Vector Machines. ICML-00
  • Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. ICML-00
  • S. Dasgupta, A. Kalai, C. Monteleoni. Analysis of perceptron-based active learning. COLT-05
  • H. S. Seung, M. Opper, H. Sompolinsky. Query by committee. COLT-92, pages 287–294.
  87. Journal papers:
  • Antoine Bordes, Seyda Ertekin, Jason Weston, Leon Bottou. Fast Kernel Classifiers with Online and Active Learning. Journal of Machine Learning Research, 6:1579–1619, 2005.
  • Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research, 2:45–66, 2001.
  • Y. Freund, H. S. Seung, E. Shamir, N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133–168, 1997.
  • David Cohn, Zoubin Ghahramani, Michael Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996.
  • David Cohn, Les Atlas, Richard Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
  • D. J. C. MacKay. Information-Based Objective Functions for Active Data Selection. Neural Computation, 4(4):590–604, 1992.
  • D. Haussler, M. Kearns, R. E. Schapire. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14:83–113, 1994.
  • V. V. Fedorov. Theory of Optimal Experiments. Academic Press, 1972.
  • M. Saar-Tsechansky, F. Provost. Active Sampling for Class Probability Estimation and Ranking. Machine Learning, 54(2):153–178, 2004.
88. Workshops
• http://domino.research.ibm.com/comm/researc
89. Appendix
90. Active Learning of Bayesian Networks
91. Entropy Function
• A measure of information in a random event X with possible outcomes {x1, …, xn}:
  H(X) = - Σi p(xi) log2 p(xi)
• Comments on the entropy function:
  – Entropy of an event is zero when the outcome is known
  – Entropy is maximal when all outcomes are equally likely
• The entropy is the average minimum number of yes/no questions needed to determine the outcome (connection to binary search) [Shannon, 1948]
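The entropy formula above can be sketched directly; this is a minimal illustration (not code from the slides), with the function name `entropy` chosen here for clarity:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H(X) = -sum_i p(x_i) * log2 p(x_i).

    Terms with p(x_i) = 0 contribute nothing (0 * log 0 := 0).
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A known outcome has zero entropy; a uniform distribution maximizes it.
print(entropy([1.0, 0.0]))   # 0.0
print(entropy([0.5, 0.5]))   # 1.0 bit
print(entropy([0.25] * 4))   # 2.0 bits
```

The last case matches the binary-search intuition: identifying one of four equally likely outcomes takes two yes/no questions on average.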
92. Kullback-Leibler Divergence
• P is the true distribution; Q is the distribution used to encode the data instead of P
• The KL divergence is the expected extra message length per datum that must be transmitted when using Q:
  DKL(P || Q) = Σi P(xi) log (P(xi)/Q(xi))
              = Σi P(xi) log P(xi) - Σi P(xi) log Q(xi)
              = H(P, Q) - H(P)
              = cross-entropy - entropy
• A measure of how "wrong" Q is with respect to the true distribution P
93. Learning Bayesian Networks
Data + Prior Knowledge → Learner → a Bayesian network (e.g. over nodes E, B, R, A, C) with a conditional probability table per node, such as P(A | E, B):

  E   B  | P(a | E, B)  P(¬a | E, B)
  e   b  |    .9            .1
  e  ¬b  |    .7            .3
 ¬e   b  |    .8            .2
 ¬e  ¬b  |    .99           .01

• Model building
• Parameter estimation
• Causal structure discovery
• Passive learning vs. active learning
94. Active Learning
Two flavors:
• Selective active learning
• Interventional active learning
Generic loop:
1. Obtain a measure of quality of the current model
2. Choose the query that most improves quality
3. Update the model
95. Active Learning: Parameter Estimation [Tong & Koller, NIPS-2000]
• Given: a BN structure G and a prior distribution p(θ)
• The learner requests a particular instantiation q (the query) and receives a response x as training data
(Figure: the initial network (G, p(θ)) feeds an active learner, which issues a query q, receives a response x, and produces an updated distribution p′(θ).)
Two questions:
• How to update the parameter density
• How to select the next query based on p
96. Updating the Parameter Density
(Example network from the slide: nodes B, A, J, M, with the query fixing A)
• Do not update A, since we are fixing its value
• If we *select* cases with A = a, do not update B: we would be sampling from P(B | A = a) ≠ P(B)
• If we *force* A := a (an intervention), we can update B: we are sampling from P(B | A := a) = P(B)*
• Update all other nodes as usual
• Obtain the new density p(θ | A = a, X = x)
*Pearl 2000
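The select-vs-intervene distinction can be sketched with Beta pseudo-counts for a binary node B whose child A is queried. This is my own minimal illustration under those assumptions (the names `BetaCounts` and `observe_case` are hypothetical, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class BetaCounts:
    """Beta density over P(B = 1), tracked as pseudo-counts."""
    alpha: float = 1.0   # pseudo-counts for B = 1
    beta: float = 1.0    # pseudo-counts for B = 0

    def update(self, b):
        if b:
            self.alpha += 1
        else:
            self.beta += 1

    def mean(self):
        return self.alpha / (self.alpha + self.beta)

def observe_case(counts_B, b, intervened_on_A):
    """Update p(theta_B) from a case in which A was fixed.

    Under an intervention A := a, B is still drawn from P(B), so its
    counts may be updated.  Under mere selection of cases with A = a,
    the observed B is biased (P(B | A = a) != P(B)), so B's density is
    left untouched.  A's own parameters are never updated for a value
    we fixed ourselves.
    """
    if intervened_on_A:
        counts_B.update(b)
```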
97. Bayesian Point Estimation
• Goal: a single estimate θ̃, instead of a distribution p over θ
• If we choose θ̃ and the true model is θ′, we incur some loss L(θ′ || θ̃)
98. Bayesian Point Estimation
• We do not know the true θ′
• The density p represents our beliefs over θ′
• Choose the θ̃ that minimizes the expected loss:
  θ̃ = argminθ ∫ p(θ′) L(θ′ || θ) dθ′
• Call θ̃ the Bayesian point estimate
• Use the expected loss of the Bayesian point estimate as a measure of quality of p(θ):
  Risk(p) = ∫ p(θ′) L(θ′ || θ̃) dθ′
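For a single Bernoulli parameter these quantities can be approximated on a grid. A minimal sketch, assuming KL between Bernoullis as the loss and a Beta(3, 2) posterior as a stand-in for p (all names here are my own, not from the slides); for this loss the minimizer is the posterior mean, which the grid search recovers:

```python
import math

def bern_kl(t_true, t_est):
    """KL(Bern(t_true) || Bern(t_est)) in nats."""
    def term(p, q):
        return p * math.log(p / q) if p > 0 else 0.0
    return term(t_true, t_est) + term(1 - t_true, 1 - t_est)

# Discretized posterior: unnormalized Beta(3, 2) density t^2 (1 - t).
grid = [i / 200 for i in range(1, 200)]
weights = [t ** 2 * (1 - t) for t in grid]
Z = sum(weights)
posterior = [w / Z for w in weights]

def expected_loss(theta):
    """Expected loss of reporting theta when theta' ~ posterior."""
    return sum(w * bern_kl(tp, theta) for w, tp in zip(posterior, grid))

theta_hat = min(grid, key=expected_loss)   # Bayesian point estimate
risk = expected_loss(theta_hat)            # Risk(p)
print(theta_hat)   # close to the Beta(3, 2) mean, 3/5 = 0.6
```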
99. The Querying Component
• Set the controllable variables so as to minimize the expected posterior risk:
  ExPRisk(p | Q = q) = Σx P(X = x | Q = q) ∫ p(θ | x) KL(θ || θ̃) dθ
• KL divergence is used as the loss, in the form of conditional KL divergence:
  KL(θ || θ′) = Σi KL(Pθ(Xi | Ui) || Pθ′(Xi | Ui))
100. Algorithm Summary
• For each potential query q, compute ΔRisk(X | q)
• Choose the q for which ΔRisk(X | q) is greatest
• Cost of computing ΔRisk(X | q): one round of Bayesian network inference
• Overall complexity: O(|Q| · cost of inference)
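The outer loop of the algorithm is just an argmax over candidate queries. A schematic sketch (hypothetical helper names; in the real algorithm `delta_risk` hides a full BN inference per query):

```python
def select_query(candidate_queries, delta_risk):
    """Pick the query with the greatest expected risk reduction.

    delta_risk(q) stands in for the DeltaRisk(X | q) computation,
    whose cost is one Bayesian-network inference per query.
    """
    return max(candidate_queries, key=delta_risk)

# Toy risk-reduction table standing in for actual inference:
toy_delta = {"q1": 0.02, "q2": 0.11, "q3": 0.07}
best = select_query(toy_delta.keys(), toy_delta.get)
print(best)   # "q2", the query with the largest risk reduction
```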
101. Uncertainty Sampling
Maintain a single hypothesis, based on the labels seen so far.
Query the point about which this hypothesis is most "uncertain".
Problem: the confidence of a single hypothesis may not accurately represent the true diversity of opinion in the hypothesis class.
(Figure: labeled + and − points on a line, with a query point X where the single maintained hypothesis is confident even though other consistent hypotheses disagree.)
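For a linear model, "most uncertain" is commonly taken to be the pool point closest to the decision boundary. A minimal sketch under that assumption (my own illustration, not code from the slides):

```python
def most_uncertain(pool, w, b):
    """Return the pool point closest to the boundary w.x + b = 0.

    |w.x + b| is used as a proxy for the (lack of) confidence of the
    single maintained linear hypothesis.
    """
    def margin(x):
        return abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return min(pool, key=margin)

pool = [(0.0, 3.0), (2.0, 2.0), (0.1, 0.0), (-3.0, -1.0)]
w, b = (1.0, 0.0), 0.0              # boundary: x1 = 0
print(most_uncertain(pool, w, b))   # (0.0, 3.0), which lies on the boundary
```

This also makes the slide's caveat concrete: the chosen point depends entirely on the one hypothesis (w, b), not on how other consistent hypotheses would label it.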
103. Region of Uncertainty
Current version space: the portion of H consistent with the labels so far.
"Region of uncertainty": the part of the data space about which there is still some uncertainty, i.e. disagreement within the version space.
(Figure: data lie on a circle in R2, hypotheses are linear separators; the spaces X and H are superimposed, showing the current version space and the region of uncertainty in data space.)
104. Region of Uncertainty
Algorithm [CAL92]: of the unlabeled points which lie in the region of uncertainty, pick one at random to query.
(Figure: data and hypothesis spaces superimposed, both the surface of the unit sphere in Rd, with the region of uncertainty in data space shaded.)
105. Region of Uncertainty
The number of labels needed depends on H and also on P.
Special case: H = {linear separators in Rd}, P = uniform distribution over the unit sphere.
Then just d log(1/ε) labels are needed to reach a hypothesis with error rate < ε, compared with the d/ε labels required by plain supervised learning; this rate is also the best we can hope for.
106. Region of Uncertainty
Algorithm [CAL92]: of the unlabeled points which lie in the region of uncertainty, pick one at random to query.
For more general distributions this is suboptimal: we need a way to measure the quality of a query, or alternatively the size of the version space.
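The [CAL92] rule can be sketched for the simplest hypothesis class, 1-D thresholds h_t(x) = sign(x − t). This is my own illustration (the name `cal_step` is hypothetical): the version space after some labels is an interval (lo, hi), the region of uncertainty is the set of unlabeled points inside it, and one of them is queried at random.

```python
import random

def cal_step(labeled, unlabeled):
    """One CAL query for threshold classifiers on the real line.

    labeled: list of (x, y) pairs with y in {-1, +1};
    unlabeled: list of x values.  Returns a point to query, or None
    when the region of uncertainty contains no unlabeled points.
    """
    lo = max((x for x, y in labeled if y == -1), default=float("-inf"))
    hi = min((x for x, y in labeled if y == +1), default=float("inf"))
    region = [x for x in unlabeled if lo < x < hi]   # disputed points
    return random.choice(region) if region else None

labeled = [(0.2, -1), (0.9, +1)]
unlabeled = [0.1, 0.5, 0.7, 0.95]
q = cal_step(labeled, unlabeled)
print(q)   # 0.5 or 0.7: the only points the version space disagrees on
```

Points outside (0.2, 0.9) are never queried, since every consistent threshold already labels them the same way; this is exactly why CAL can get away with exponentially fewer labels in favorable cases.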
107. Expected infogain of a sample ⇒ uncertainty sampling!
