
- 1. Active Learning. COMS 6998-4: Learning and Empirical Inference. Irina Rish, IBM T.J. Watson Research Center.
- 2. Outline: Motivation; Active learning approaches (membership queries, uncertainty sampling, information-based loss functions, uncertainty-region sampling, query by committee); Applications (active collaborative prediction, active Bayes net learning).
- 3. Standard supervised learning model: given m labeled points, we want to learn a classifier with misclassification rate < ε, chosen from a hypothesis class H with VC dimension d < ∞. VC theory: we need m to be roughly d/ε, in the realizable case.
- 4. Active learning: in many situations, like speech recognition and document retrieval, unlabeled data is easy to come by, but there is a charge for each label. What is the minimum number of labels needed to achieve the target error rate?
- 6. What is Active Learning? Unlabeled data are readily available; labels are expensive. We want to use adaptive decisions to choose which labels to acquire for a given dataset. The goal is an accurate classifier at minimal cost.
- 7. Active learning warning: the choice of data is only as good as the model itself. Assume a linear model; then two data points are sufficient. What happens when the data are not linear?
- 8. Active Learning Flavors: selective sampling vs membership queries; pool vs sequential; myopic vs batch.
- 9. Active Learning Approaches: membership queries; uncertainty sampling; information-based loss functions; uncertainty-region sampling; query by committee.
- 12. Problem: many results in this framework, even for complicated hypothesis classes. [Baum and Lang, 1991] tried fitting a neural net to handwritten characters: the synthetic instances created were incomprehensible to humans! [Lewis and Gale, 1992] tried training text classifiers: "an artificial text created by a learning algorithm is unlikely to be a legitimate natural language expression, and probably would be uninterpretable by a human teacher."
- 16. Uncertainty Sampling [Lewis & Gale, 1994]: query the event that the current classifier is most uncertain about. (figure: labeled points on either side of a decision boundary, with uncertainty measured as Euclidean distance to the boundary.) Used trivially in SVMs, graphical models, etc.
- 17. Information-based Loss Functions [MacKay, 1992]: maximize the KL-divergence between posterior and prior; maximize the reduction in model entropy between posterior and prior; minimize the cross-entropy between posterior and prior. All of these are notions of information gain.
- 18. Query by Committee [Seung et al. 1992, Freund et al. 1997]: assume a prior distribution over hypotheses; sample a set of classifiers from that distribution; query an example based on the degree of disagreement among the committee of classifiers. (figure: committee members A, B, C splitting the same data differently.)
- 19. Infogain-based Active Learning
- 20. Notation. We have: dataset, D; model parameter space, W; query algorithm, q.
- 21. Dataset (D) example:
  t | Sex | Age   | TestA | TestB | TestC | Disease
  0 | M   | 40-50 |   0   |   1   |   1   |    ?
  1 | F   | 50-60 |   0   |   1   |   0   |    ?
  2 | F   | 30-40 |   0   |   0   |   0   |    ?
  3 | F   | 60+   |   1   |   1   |   1   |    ?
  4 | M   | 10-20 |   0   |   1   |   0   |    ?
  5 | M   | 40-50 |   0   |   0   |   1   |    ?
  6 | F   | 0-10  |   0   |   0   |   0   |    ?
  7 | M   | 30-40 |   1   |   1   |   0   |    ?
  8 | M   | 20-30 |   0   |   0   |   1   |    ?
- 23. Model example: a probabilistic classifier relating St and Ot. Notation: T = number of examples; Ot = vector of features of example t; St = class of example t.
- 24. Model example. Patient state: St = DiseaseState. Patient observations: Ot1 = Gender, Ot2 = Age, Ot3 = TestA, Ot4 = TestB, Ot5 = TestC.
- 25. Possible model structures. (figure: alternative Bayes net structures connecting the class S to Gender, Age, TestA, TestB, TestC.)
- 26. Model space. Model: St → Ot. Model parameters: P(St), P(Ot|St). Generative model: we must be able to compute P(St = i, Ot = ot | w).
- 27. Model Parameter Space (W). W = space of possible parameter values; prior on parameters: P(W); posterior over models: P(W | D) ∝ P(D | W) P(W) ∝ [∏_{t=1}^{T} P(St, Ot | W)] P(W).
- 28. Notation. We have: dataset, D; model parameter space, W; query algorithm, q. q(W, D) returns t*, the next sample to label.
- 29. The game: while not done, (1) learn P(W | D), (2) q chooses the next example to label, (3) the expert adds the label to D.
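A minimal Python sketch of this game loop, with the learner, query strategy, and expert passed in as callables (the names `learn`, `query`, and `label_fn` are placeholders of mine, not from the deck):

```python
def active_learning_loop(learn, query, label_fn, D, pool, budget):
    """Pool-based sketch of the game above.

    learn(D)           -> model, i.e. 'learn P(W | D)'
    query(model, pool) -> index t* of the next example to label (this is q)
    label_fn(x)        -> the expert's label for x
    """
    for _ in range(budget):
        model = learn(D)               # learn P(W | D)
        t = query(model, pool)         # q chooses the next example to label
        x = pool.pop(t)
        D.append((x, label_fn(x)))     # expert adds the label to D
    return learn(D)
```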
- 30. Simulation. (figure: a stream of examples with observations O1–O7; some class labels St are revealed as true/false while others remain "?", and q must decide which one to query next.)
- 31. Active Learning Flavors: pool ("random access" to patients) vs sequential (must decide as patients walk in the door).
- 32. q? Recall: q(W, D) returns the "most interesting" unlabelled example. Well, what makes a doctor curious about a patient?
- 33. (figure: 1994 paper)
- 34. Score function: score_uncert(St) = uncertainty(P(St | Ot)) = H(St) = −∑_i P(St = i) log P(St = i).
- 35. Uncertainty sampling example:
  t | Sex | Age   | TestA | TestB | TestC | St | P(St) | H(St)
  1 | M   | 20-30 |   0   |   1   |   1   | ?  | 0.02  | 0.043
  2 | F   | 20-30 |   0   |   1   |   0   | ?  | 0.01  | 0.024
  3 | F   | 30-40 |   1   |   0   |   0   | ?  | 0.05  | 0.086
  4 | F   | 60+   |   1   |   1   |   0   | ?  | 0.12  | 0.159  <- highest H(St): query this one (answer: FALSE)
  5 | M   | 10-20 |   0   |   1   |   0   | ?  | 0.01  | 0.024
  6 | M   | 20-30 |   1   |   1   |   1   | ?  | 0.96  | 0.073
- 36. Uncertainty sampling example, after updating on the query:
  t | Sex | Age   | TestA | TestB | TestC | St    | P(St) | H(St)
  1 | M   | 20-30 |   0   |   1   |   1   |   ?   | 0.01  | 0.024
  2 | F   | 20-30 |   0   |   1   |   0   |   ?   | 0.02  | 0.043
  3 | F   | 30-40 |   1   |   0   |   0   |   ?   | 0.04  | 0.073
  4 | F   | 60+   |   1   |   1   |   0   | FALSE | 0.00  | 0.00
  5 | M   | 10-20 |   0   |   1   |   0   |   ?   | 0.06  | 0.112  <- next query (answer: TRUE)
  6 | M   | 20-30 |   1   |   1   |   1   |   ?   | 0.97  | 0.059
- 37. Uncertainty sampling. GOOD: couldn't be easier. GOOD: often performs pretty well. BAD: H(St) measures information gain about the samples, not the model, and it is sensitive to noisy samples.
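A concrete sketch of slides 34–37: score every unlabeled example by the entropy of its predicted class distribution and query the argmax. The example reuses the P(St) values from the slide-35 table (base-2 logs, so entropies are in bits and differ in scale from the table's H column), and it picks the same patient:

```python
import numpy as np

def entropy(p):
    """H(S_t) = -sum_i P(S_t = i) log2 P(S_t = i), in bits."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log2(p)).sum())

def uncertainty_query(class_probs):
    """score_uncert(S_t) = H(S_t); query the argmax over unlabeled rows.
    `class_probs` is an (n_examples, n_classes) array of P(S_t | O_t)."""
    return int(np.argmax([entropy(row) for row in class_probs]))

# Six unlabeled patients with predicted (P(healthy), P(disease)):
probs = np.array([[0.98, 0.02], [0.99, 0.01], [0.95, 0.05],
                  [0.88, 0.12], [0.99, 0.01], [0.04, 0.96]])
print(uncertainty_query(probs))  # -> 3: the P(disease) = 0.12 patient
```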
- 38. Can we do better than uncertainty sampling?
- 39. (figure: 1992 paper) Informative with respect to what?
- 40. Model entropy. (figure: three posteriors P(W | D) over W, from flat, H(W) = high, through "…better…", to a point mass, H(W) = 0.)
- 41. Information gain: choose the example that is expected to most reduce H(W), i.e., maximize H(W) − H(W | St): the current model-space entropy minus the expected model-space entropy if we learn St.
- 42. Score function: score_IG(St) = MI(St; W) = H(W) − H(W | St).
- 43. We usually can't just integrate over all models to get H(W): H(W) = −∫_w P(w) log P(w) dw. But we can sample a committee C from P(W | D): H(W) ≈ H(C) = −∑_{c∈C} P(c) log P(c).
- 44. Conditional model entropy: H(W) = −∫_w P(w) log P(w) dw; H(W | St = i) = −∫_w P(w | St = i) log P(w | St = i) dw; H(W | St) = −∑_i P(St = i) ∫_w P(w | St = i) log P(w | St = i) dw.
- 45. Score function: score_IG(St) = H(C) − H(C | St).
- 46. Information-gain scores for the examples:
  t | Sex | Age   | TestA | TestB | TestC | St | P(St) | Score = H(C) − H(C|St)
  1 | M   | 20-30 |   0   |   1   |   1   | ?  | 0.02  | 0.53
  2 | F   | 20-30 |   0   |   1   |   0   | ?  | 0.01  | 0.58
  3 | F   | 30-40 |   1   |   0   |   0   | ?  | 0.05  | 0.40
  4 | F   | 60+   |   1   |   1   |   1   | ?  | 0.12  | 0.49
  5 | M   | 10-20 |   0   |   1   |   0   | ?  | 0.01  | 0.57
  6 | M   | 20-30 |   0   |   0   |   1   | ?  | 0.02  | 0.52
- 47. Score function: score_IG(St) = H(C) − H(C | St) = H(St) − H(St | C). Familiar? (Mutual information is symmetric.)
- 48. Uncertainty sampling vs information gain: score_Uncertain(St) = H(St); score_InfoGain(St) = H(St) − H(St | C).
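A sketch of the committee approximation from slides 43–48: sample a committee C from P(W | D), estimate H(St) from the committee-averaged prediction, and H(St | C) as the average entropy of the individual members (this is my reading of the slides; the array shapes are assumptions). The toy example shows how it differs from plain uncertainty sampling: an example where members confidently disagree scores high, while one where every member is individually unsure scores near zero:

```python
import numpy as np

def H(p):
    """Entropy along the last axis, in bits."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log2(p)).sum(axis=-1)

def score_infogain(committee_probs):
    """score_IG(S_t) = H(S_t) - H(S_t | C), estimated from a committee.

    committee_probs: (n_members, n_examples, n_classes) array, member c's
    prediction P(S_t | O_t, c). H(S_t) is the entropy of the averaged
    prediction; H(S_t | C) is the mean of the members' own entropies."""
    h_st = H(committee_probs.mean(axis=0))          # H(S_t)
    h_st_given_c = H(committee_probs).mean(axis=0)  # H(S_t | C)
    return h_st - h_st_given_c

# Example 0: members confidently disagree -> high score.
# Example 1: every member is individually unsure -> score ~ 0,
# even though uncertainty sampling rates both examples identically.
probs = np.array([[[0.9, 0.1], [0.5, 0.5]],
                  [[0.1, 0.9], [0.5, 0.5]]])
print(score_infogain(probs))  # -> approx [0.53, 0.0]
```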
- 49. But there is a problem…
- 50. If our objective is to reduce the prediction error, then "the expected information gain of an unlabeled sample is NOT a sufficient criterion for constructing good queries."
- 51. Strategy #2: Query by Committee. Temporary assumptions: pool → sequential; P(W | D) → version space; probabilistic → noiseless. QBC attacks the size of the "version space."
- 52. (figure: both committee models classify the new example FALSE; they agree, so querying it teaches us little.)
- 53. (figure: both committee models classify the new example TRUE; again, no disagreement.)
- 54. (figure: Model #1 says FALSE, Model #2 says TRUE.) Ooh, now we're going to learn something for sure: one of them is definitely wrong!
- 55. The original QBC algorithm. As each example arrives: choose a committee C (usually of size 2) randomly from the version space; have each member of C classify it; if the committee disagrees, select it.
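A minimal sketch of this stream-based algorithm (the threshold-classifier example at the bottom is illustrative, not from the deck):

```python
import random

def qbc_select(stream, version_space, committee_size=2, rng=random):
    """Original QBC sketch: for each arriving example, draw a random
    committee from the version space; select the example iff the
    committee members disagree on its label."""
    selected = []
    for x in stream:
        committee = rng.sample(version_space, committee_size)
        if len({h(x) for h in committee}) > 1:   # disagreement -> query
            selected.append(x)
    return selected

# Illustrative version space: threshold classifiers on the line.
hyps = [lambda x, t=t: x > t for t in (0.2, 0.4, 0.6, 0.8)]
print(qbc_select([0.1, 0.5, 0.9], hyps))
# 0.1 and 0.9: all hypotheses agree, so they are never selected.
# 0.5: the version space is split, so a random pair disagrees with
#      probability 2/3 and 0.5 is usually selected.
```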
- 56. (figure: 1992 paper)
- 57. Infogain vs Query by Committee [Seung, Opper, Sompolinsky, 1992; Freund, Seung, Shamir, Tishby, 1997]. First idea: try to rapidly reduce the volume of the version space? Problem: this doesn't take the data distribution into account. (figure: hypothesis space H.) Which pair of hypotheses is closest? It depends on the data distribution P. Distance measure on H: d(h, h′) = P(h(x) ≠ h′(x)).
- 58. Query by committee. First idea: try to rapidly reduce the volume of the version space? Problem: this doesn't take the data distribution into account. To keep things simple, say d(h, h′) = Euclidean distance. (figure: the version space shrinks in volume, but) the error is likely to remain large!
- 59. Query by committee: an elegant scheme which decreases the volume in a manner that is sensitive to the data distribution. Bayesian setting: given a prior π on H, set H1 = H. For t = 1, 2, …:
  receive an unlabeled point xt drawn from P;
  choose two hypotheses h, h′ randomly from (π, Ht) [informally: is there a lot of disagreement about xt in Ht?];
  if h(xt) ≠ h′(xt): ask for xt's label;
  set Ht+1.
  Problem: how to implement it efficiently?
- 60. Query by committee. For t = 1, 2, …: receive an unlabeled point xt drawn from P; choose two hypotheses h, h′ randomly from (π, Ht); if h(xt) ≠ h′(xt), ask for xt's label and set Ht+1. Observation: the probability of getting the pair (h, h′) in the inner loop (when a query is made) is proportional to π(h) π(h′) d(h, h′). (figure: two committee draws from Ht compared.)
- 62. Query by committee. Label bound: for H = {linear separators in Rd} and P = uniform distribution, just d log 1/ε labels are needed to reach a hypothesis with error < ε. Implementation: we need to randomly pick h according to (π, Ht); e.g. H = {linear separators in Rd}, π = uniform distribution. How do you pick a random point from a convex body (the version space Ht)?
- 63. Sampling from convex bodies: by random walk! E.g. the ball walk and hit-and-run. [Gilad-Bachrach, Navot, Tishby 2005] studies random walks and also ways to kernelize QBC.
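For intuition, a rough hit-and-run sketch: pick a random direction, approximate the chord of the body through the current point along that direction (here with a crude bisection against a membership oracle), and jump to a uniform point on it. The bisection depth and step counts are arbitrary choices of mine, not from the cited paper; a real implementation needs care about mixing time:

```python
import numpy as np

def hit_and_run(inside, x0, n_steps=500, rng=None):
    """Hit-and-run walk over a convex body given by a membership oracle
    `inside`. From the current point: draw a uniform random direction,
    shrink [lo, hi] by halving until both chord endpoints lie inside,
    then move to a uniform point on that segment. Convexity guarantees
    the whole segment is inside once its endpoints are."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        d = rng.normal(size=x.shape)
        d /= np.linalg.norm(d)
        lo, hi = -2.0, 2.0                 # body assumed inside the unit ball
        for _ in range(40):                # crude chord approximation
            if not inside(x + hi * d):
                hi /= 2.0
            if not inside(x + lo * d):
                lo /= 2.0
        x = x + rng.uniform(lo, hi) * d
    return x

# Version space of homogeneous linear separators consistent with the
# labels, intersected with the unit ball -- a convex body:
X = np.array([[1.0, 0.2], [0.8, -0.1]])
y = np.array([1.0, 1.0])
inside = lambda w: np.linalg.norm(w) <= 1.0 and bool(np.all(y * (X @ w) > 0))
w = hit_and_run(inside, x0=[0.5, 0.0])
print(w, inside(w))
```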
- 65. Some challenges. [1] For linear separators, analyze the label complexity for some distribution other than uniform! [2] How to handle nonseparable data? We need a robust base learner. (figure: + and − points straddling the true boundary.)
- 66. Active Collaborative Prediction
- 67. Approach: Collaborative Prediction (CP). (figure: two partially observed matrices side by side — movie ratings, users × movies, and a QoS measure such as bandwidth, clients × servers.) Given previously observed ratings R(x, y), where X is a "user" and Y is a "product", predict the unobserved ratings: will Alina like "The Matrix"? (unlikely) Will Client 86 have a fast download from Server 39? Will member X of the funding committee approve our project Y?
- 68. Collaborative prediction = matrix approximation. (figure: a partially observed 100 clients × 100 servers matrix.) Important assumption: matrix entries are NOT independent, e.g. similar users have similar tastes. Approaches: mainly factorized models assuming hidden "factors" that affect ratings (pLSA, MCVQ, SVD, NMF, MMMF, …).
- 69. (figure: a ratings row [2 4 5 1 4 2] expressed as the user's weights times the movies' factors.) Assumptions: there are a number of (hidden) factors behind the user preferences that relate to (hidden) movie properties; movies have intrinsic values associated with such factors; users have intrinsic weights on such factors; user ratings are weighted (linear) combinations of the movies' factor values.
- 72. (figure: the partially observed matrix Y approximated by a rank-k product X = UV′.) Objective: find a factorizable X = UV′ that approximates Y, X = argmin_X Loss(X, Y), and satisfies some "regularization" constraints (e.g. rank(X) < k). The loss function depends on the nature of your problem.
- 73. Matrix factorization approaches. Singular value decomposition (SVD) gives a low-rank approximation, but it assumes a fully observed Y and sum-squared loss. In collaborative prediction, Y is only partially observed, so low-rank approximation becomes a non-convex problem with many local minima. Furthermore, we may not want sum-squared loss, but instead: accurate predictions (0/1 loss, approximated by hinge loss); cost-sensitive predictions (missing a good server vs suggesting a bad one); ranking cost (e.g., suggest the k "best" movies for a user). These are NON-CONVEX PROBLEMS! Use instead the state-of-the-art Max-Margin Matrix Factorization [Srebro 05]: it replaces the bounded-rank constraint by a bounded norm of the U, V′ vectors, yielding a convex optimization problem that can be solved exactly by semi-definite programming, and it strongly relates to learning max-margin classifiers (SVMs). We exploit MMMF's properties to augment it with active sampling!
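To make the factorized-model idea concrete, here is a deliberately simplified sketch: gradient descent on U and V with squared loss over the observed entries only, plus a norm penalty. This is NOT MMMF (which bounds the trace norm, uses hinge loss, and is solvable as an SDP); it only illustrates "fit X = UV′ to the observed part of Y and read predictions off the missing part":

```python
import numpy as np

def factorize(Y, mask, k=2, lr=0.02, steps=3000, lam=0.05, seed=0):
    """Fit X = U V' to the observed entries of Y (mask == 1) by gradient
    descent on squared loss + norm penalty on U, V. A toy stand-in for
    the factorized models above, not MMMF itself."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = rng.normal(0.0, 0.1, (n, k))
    V = rng.normal(0.0, 0.1, (m, k))
    for _ in range(steps):
        E = mask * (U @ V.T - Y)        # residuals on observed entries only
        U, V = U - lr * (E @ V + lam * U), V - lr * (E.T @ U + lam * V)
    return U @ V.T                      # predictions, including missing cells

Y = np.array([[5., 4., 0.], [4., 5., 0.], [1., 0., 5.]])
mask = (Y > 0).astype(float)            # here 0 marks an unobserved entry
print(np.round(factorize(Y, mask), 1))  # filled-in rating matrix
```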
- 74. Key idea of MMMF: rows are feature vectors, columns are linear classifiers (weight vectors); the "margin" here = dist(sample, separating line). Xij = sign_ij × margin_ij; predictor_ij = sign_ij: if sign_ij > 0, classify as +1, otherwise classify as −1. (figure: feature vectors and a separating line with margins.)
- 75. MMMF: simultaneous search for low-norm feature vectors and max-margin classifiers.
- 76. Active learning with MMMF: we extend MMMF to Active-MMMF using margin-based active sampling, and we investigate the exploitation vs exploration trade-offs imposed by different heuristics. (figure: a matrix of predicted margins in [−1, 1].) Margin-based heuristics: min-margin (most uncertain); min-margin positive ("good" uncertain); max-margin ("safe" choice, but no info).
- 77. Active Max-Margin Matrix Factorization, A-MMMF(M, s): 1. Given a sparse matrix Y, learn the approximation X = MMMF(Y). 2. Using the current predictions, actively select the "best" s samples and request their labels (e.g., test a client/server pair via an "enforced" download). 3. Add the new samples to Y. 4. Repeat 1–3. (A minimal sketch of this loop follows.) Issues: beyond simple greedy margin-based heuristics? Theoretical guarantees? Not so easy with non-trivial learning methods and non-trivial data distributions (any suggestions?).
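Continuing the simplified sketch above: `factorize` (from the previous block) stands in for MMMF, and `oracle(i, j)` is a hypothetical callable performing the costly measurement for entry (i, j), e.g. an enforced download. The heuristic is min-margin, reading |Xij| as the prediction margin per slide 74's Xij = sign × margin:

```python
import numpy as np

def active_mmmf(Y, mask, oracle, rounds=5, s=3, k=2):
    """A-MMMF sketch built on the toy `factorize` above (a stand-in
    for real MMMF)."""
    Y, mask = Y.copy(), mask.copy()
    for _ in range(rounds):
        X = factorize(Y, mask, k=k)            # 1. learn approximation
        margin = np.abs(X)                     # |X_ij| = confidence
        margin[mask == 1] = np.inf             # never re-query known entries
        for idx in np.argsort(margin, axis=None)[:s]:  # 2. s most uncertain
            i, j = np.unravel_index(idx, Y.shape)
            Y[i, j] = oracle(i, j)             # 3. request label, add to Y
            mask[i, j] = 1.0
    return factorize(Y, mask, k=k)             # 4. final model
```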
- 78. Empirical results: network latency prediction; bandwidth prediction (peer-to-peer); movie-ranking prediction; sensor-net connectivity prediction.
- 79. Empirical results: latency prediction (P2PSim and NLANR-AMP data). Active sampling with the most-uncertain (and most-uncertain-positive) heuristics provides a consistent improvement over random and least-uncertain-next sampling.
- 80. Movie rating prediction (MovieLens).
- 81. Sensor network connectivity.
- 82. Introducing cost: exploration vs exploitation. Download Grid: bandwidth prediction; PlanetLab: latency prediction. Active sampling gives lower prediction errors at lower cost (saving 100s of samples); better predictions give better server-assignment decisions and thus faster downloads. Active sampling achieves a good exploration vs exploitation trade-off: reduced decision cost AND information gain.
- 83. Conclusions. A common challenge in many applications: the need for cost-efficient sampling. This talk: linear hidden-factor models with active sampling. Active sampling improves predictive accuracy while keeping sampling complexity low in a wide variety of applications. Future work: better active-sampling heuristics? Theoretical analysis of active-sampling performance? Dynamic matrix factorizations: tracking time-varying matrices; incremental MMMF? (solving from scratch every time is too costly).
- 84. References: some of the most influential papers.
  • Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research, 2:45–66, 2001.
  • Y. Freund, H. S. Seung, E. Shamir, N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133–168, 1997.
  • David Cohn, Zoubin Ghahramani, Michael Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996.
  • David Cohn, Les Atlas, Richard Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
  • D. J. C. MacKay. Information-Based Objective Functions for Active Data Selection. Neural Computation, 4(4):590–604, 1992.
- 85. NIPS papers.
  • Francis Bach. Active learning for misspecified generalized linear models. NIPS-06.
  • Ran Gilad-Bachrach, Amir Navot, Naftali Tishby. Query by Committee Made Real. NIPS-05.
  • Brent Bryan, Jeff Schneider, Robert Nichol, Christopher Miller, Christopher Genovese, Larry Wasserman. Active Learning for Identifying Function Threshold Boundaries. NIPS-05.
  • Rui Castro, Rebecca Willett, Robert Nowak. Faster Rates in Regression via Active Learning. NIPS-05.
  • Sanjoy Dasgupta. Coarse sample complexity bounds for active learning. NIPS-05.
  • Masashi Sugiyama. Active Learning for Misspecified Models. NIPS-05.
  • Brigham Anderson, Andrew Moore. Fast Information Value for Graphical Models. NIPS-05.
  • Dan Pelleg, Andrew W. Moore. Active Learning for Anomaly and Rare-Category Detection. NIPS-04.
  • Sanjoy Dasgupta. Analysis of a greedy active learning strategy. NIPS-04.
  • T. Jaakkola, H. Siegelmann. Active Information Retrieval. NIPS-01.
  • M. K. Warmuth et al. Active Learning in the Drug Discovery Process. NIPS-01.
  • Jonathan D. Nelson, Javier R. Movellan. Active Inference in Concept Learning. NIPS-00.
  • Simon Tong, Daphne Koller. Active Learning for Parameter Estimation in Bayesian Networks. NIPS-00.
  • Thomas Hofmann, Joachim M. Buhmann. Active Data Clustering. NIPS-97.
  • K. Fukumizu. Active Learning in Multilayer Perceptrons. NIPS-95.
  • Anders Krogh, Jesper Vedelsby. Neural Network Ensembles, Cross Validation, and Active Learning. NIPS-94.
  • Kah Kay Sung, Partha Niyogi. Active Learning for Function Approximation. NIPS-94.
  • David Cohn, Zoubin Ghahramani, Michael I. Jordan. Active Learning with Statistical Models. NIPS-94.
  • Sebastian B. Thrun, Knut Moller. Active Exploration in Dynamic Environments. NIPS-91.
- 86. ICML papers.
  • Maria-Florina Balcan, Alina Beygelzimer, John Langford. Agnostic Active Learning. ICML-06.
  • Steven C. H. Hoi, Rong Jin, Jianke Zhu, Michael R. Lyu. Batch Mode Active Learning and Its Application to Medical Image Classification. ICML-06.
  • Sriharsha Veeramachaneni, Emanuele Olivetti, Paolo Avesani. Active Sampling for Detecting Irrelevant Features. ICML-06.
  • Kai Yu, Jinbo Bi, Volker Tresp. Active Learning via Transductive Experimental Design. ICML-06.
  • Rohit Singh, Nathan Palmer, David Gifford, Bonnie Berger, Ziv Bar-Joseph. Active Learning for Sampling in Time-Series Experiments With Application to Gene Expression Analysis. ICML-05.
  • Prem Melville, Raymond Mooney. Diverse Ensembles for Active Learning. ICML-04.
  • Klaus Brinker. Active Learning of Label Ranking Functions. ICML-04.
  • Hieu Nguyen, Arnold Smeulders. Active Learning Using Pre-clustering. ICML-04.
  • Greg Schohn, David Cohn. Less is More: Active Learning with Support Vector Machines. ICML-00.
  • Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. ICML-00.
  COLT papers.
  • S. Dasgupta, A. Kalai, C. Monteleoni. Analysis of perceptron-based active learning. COLT-05.
  • H. S. Seung, M. Opper, H. Sompolinsky. Query by committee. COLT-92, pages 287–294.
- 87. Journal papers.
  • Antoine Bordes, Seyda Ertekin, Jason Weston, Leon Bottou. Fast Kernel Classifiers with Online and Active Learning. Journal of Machine Learning Research, 6:1579–1619, 2005.
  • Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research, 2:45–66, 2001.
  • Y. Freund, H. S. Seung, E. Shamir, N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133–168, 1997.
  • David Cohn, Zoubin Ghahramani, Michael Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996.
  • David Cohn, Les Atlas, Richard Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
  • D. J. C. MacKay. Information-Based Objective Functions for Active Data Selection. Neural Computation, 4(4):590–604, 1992.
  • Haussler, D., Kearns, M., Schapire, R. E. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14:83–113, 1994.
  • Fedorov, V. V. Theory of Optimal Experiments. Academic Press, 1972.
  • Saar-Tsechansky, M., F. Provost. Active Sampling for Class Probability Estimation and Ranking. Machine Learning, 54(2):153–178, 2004.
- 88. Workshops: http://domino.research.ibm.com/comm/researc
- 89. Appendix
- 90. Active Learning of Bayesian Networks
- 91. Entropy function: a measure of the information in a random event X with possible outcomes {x1, …, xn}: H(X) = −∑_i p(xi) log2 p(xi). Comments: the entropy of an event is zero when the outcome is known; entropy is maximal when all outcomes are equally likely. It is the average minimum number of yes/no questions needed to identify the outcome (connection to binary search) [Shannon, 1948].
- 92. Kullback–Leibler divergence: P is the true distribution; Q is the distribution used to encode the data instead of P. The KL divergence is the expected extra message length per datum that must be transmitted when using Q: DKL(P || Q) = ∑_i P(xi) log (P(xi)/Q(xi)) = ∑_i P(xi) log P(xi) − ∑_i P(xi) log Q(xi) = H(P, Q) − H(P) = cross-entropy − entropy. It is a measure of how "wrong" Q is with respect to the true distribution P.
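The two quantities on slides 91–92 translate directly into code; a small sketch (base-2 logs, so everything is in bits; the example distributions are made up):

```python
import numpy as np

def entropy(P):
    """H(P) = -sum_i P(x_i) log2 P(x_i)."""
    P = np.asarray(P, dtype=float)
    return float(-(P * np.log2(P)).sum())

def cross_entropy(P, Q):
    """H(P, Q) = -sum_i P(x_i) log2 Q(x_i): the expected message length
    when data from P are encoded with a code built for Q."""
    return float(-(np.asarray(P) * np.log2(np.asarray(Q))).sum())

def kl(P, Q):
    """D_KL(P || Q) = cross-entropy - entropy = H(P, Q) - H(P)."""
    return cross_entropy(P, Q) - entropy(P)

P, Q = [0.5, 0.5], [0.9, 0.1]
print(entropy(P))   # 1.0 bit: maximal for two equally likely outcomes
print(kl(P, Q))     # ~0.74 bits of expected overhead; kl(P, P) == 0.0
```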
- 93. Learning Bayesian networks: Data + Prior Knowledge → Learner → model. (figure: network over E, B, R, A, C, with the CPT:)
  E  B  | P(A | E, B)
  e  b  |  .9   .1
  e  ¬b |  .7   .3
  ¬e b  |  .8   .2
  ¬e ¬b |  .99  .01
  Tasks: model building; parameter estimation; causal structure discovery; passive learning vs active learning.
- 94. Active learning: selective active learning vs interventional active learning. Loop: obtain a measure of the quality of the current model; choose the query that most improves quality; update the model.
- 95. Active learning: parameter estimation [Tong & Koller, NIPS-2000]. Given a BN structure G and a prior distribution p(θ), the learner requests a particular instantiation q (the query). (figure: loop between the active learner and the expert: initial network (G, p(θ)) → query q → response x → training data → updated distribution p′(θ).) Two questions: how to update the parameter density, and how to select the next query based on p.
- 96. Updating the parameter density (figure: network with nodes B, A, J, M): do not update A, since we are fixing it. If we merely select A (observe it), then do not update B, since sampling from P(B | A = a) ≠ P(B). If we force A (intervene), then we can update B, since sampling from P(B | A := a) = P(B)* [*Pearl 2000]. Update all other nodes as usual. Obtain the new density p(θ | A = a, X = x).
- 97. Bayesian point estimation. Goal: a single estimate θ̃ instead of a distribution p over θ. (figure: density p(θ) with the estimate θ̃ and the true model θ′.) If we choose θ̃ and the true model is θ′, then we incur some loss, L(θ′ || θ̃).
- 98. Bayesian point estimation. We do not know the true θ′; the density p represents our optimal beliefs over θ′. Choose the θ̃ that minimizes the expected loss: θ̃ = argmin_θ ∫ p(θ′) L(θ′ || θ) dθ′. Call θ̃ the Bayesian point estimate. Use the expected loss of the Bayesian point estimate as a measure of the quality of p(θ): Risk(p) = ∫ p(θ′) L(θ′ || θ̃) dθ′.
- 99. The querying component: set the controllable variables so as to minimize the expected posterior risk, ExPRisk(p | Q = q) = ∑_x P(X = x | Q = q) ∫ p(θ | x) KL(θ || θ̃) dθ, where θ̃ is the Bayesian point estimate of the updated density. The KL divergence is used as the loss; conditional KL-divergence: KL(θ || θ′) = ∑_i KL(P_θ(Xi | Ui) || P_θ′(Xi | Ui)).
- 100. Algorithm summary: for each potential query q, compute ΔRisk(X | q); choose the q for which ΔRisk(X | q) is greatest. The cost of computing ΔRisk(X | q) is the cost of Bayesian-network inference, so the overall complexity is O(|Q| · cost of inference).
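Since Risk(p) does not depend on q, maximizing ΔRisk(X | q) is the same as minimizing ExPRisk(p | Q = q). A skeletal sketch of the selection step, with the two expensive pieces (BN inference for P(X = x | Q = q) and the posterior risk) abstracted behind hypothetical callables:

```python
def choose_query(queries, p_x_given_q, posterior_risk):
    """Pick argmax_q ΔRisk(X | q) = argmin_q ExPRisk(p | Q = q).

    p_x_given_q(q)       -> dict {response x: P(X = x | Q = q)}  (BN inference)
    posterior_risk(q, x) -> Risk(p') for the density updated on (q, x)
    Both callables are stand-ins for the real Bayesian-network machinery.
    """
    def expected_posterior_risk(q):
        return sum(p * posterior_risk(q, x)
                   for x, p in p_x_given_q(q).items())
    return min(queries, key=expected_posterior_risk)
```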
- 101. Uncertainty sampling: maintain a single hypothesis based on the labels seen so far, and query the point about which this hypothesis is most "uncertain". Problem: the confidence of a single hypothesis may not accurately represent the true diversity of opinion in the hypothesis class. (figure: + and − points with a single separator.)
- 103. Region of uncertainty. Current version space: the portion of H consistent with the labels so far. "Region of uncertainty" = the part of the data space about which there is still some uncertainty (i.e. disagreement within the version space). (figure: data lie on a circle in R²; hypotheses are linear separators; the spaces X and H are superimposed, with the current version space and the region of uncertainty in data space marked.)
- 104. Region of uncertainty. Algorithm [CAL92]: of the unlabeled points which lie in the region of uncertainty, pick one at random to query. (figure: data and current version space / hypothesis spaces superimposed; both are the surface of the unit sphere in Rd, with the region of uncertainty in data space marked.)
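A sketch of the [CAL92] rule, approximating the version space by a finite sample of consistent hypotheses: a point lies in the region of uncertainty exactly when the sampled hypotheses disagree on it.

```python
import random

def cal_query(pool, version_space, rng=random):
    """[CAL92] sketch: among unlabeled points on which the (sampled,
    finite) version space disagrees, pick one uniformly at random.
    `version_space` is a list of hypotheses, each a callable x -> label."""
    region = [x for x in pool if len({h(x) for h in version_space}) > 1]
    return rng.choice(region) if region else None
```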
- 105. Region of uncertainty. The number of labels needed depends on H and also on P. Special case: H = {linear separators in Rd}, P = uniform distribution over the unit sphere. Then just d log 1/ε labels are needed to reach a hypothesis with error rate < ε [1][2]. [1] Supervised learning: d/ε labels. [2] The best we can hope for.
- 106. Region of uncertainty. Algorithm [CAL92]: of the unlabeled points which lie in the region of uncertainty, pick one at random to query. For more general distributions this is suboptimal: we need to measure the quality of a query, or alternatively the size of the version space.
- 107. (figure: the expected infogain of a sample reduces to uncertainty sampling!)
