Active learning lecture
  • Explain why this isn’t pathological: issue of margins
  • New properties of our application vs CF: dynamically changing ratings; possibilities for active sampling; constraint on resources vs CF; absence of a rating may mean that the server was NOT selected before on purpose (?). Future improvements: use client and server ‘features’ (IP etc.)
  • Red box on top left: which data were actually used in the experiments? Color version. Histograms.
  • Examples: movies and dGrid – motivate active learning in CF scenarios. Minimize the # of questions/downloads but maximize their ‘informativeness’. Incremental algorithm – future work.
  • Error bars; sparsity for those matrices; % of samples added on the X axis. For the future: effect of the initial training set on active learning.
  • When you don’t have enough data to even fit a hypothesis, should you trust its confidence judgements?

Active learning lecture: Presentation Transcript

  • Active Learning. COMS 6998-4: Learning and Empirical Inference. Irina Rish, IBM T.J. Watson Research Center
  • Outline Motivation Active learning approaches  Membership queries  Uncertainty Sampling  Information-based loss functions  Uncertainty-Region Sampling  Query by committee Applications  Active Collaborative Prediction  Active Bayes net learning 2
  • Standard supervised learning model: Given m labeled points, want to learn a classifier with misclassification rate < ε, chosen from a hypothesis class H with VC dimension d < ∞. VC theory: need m to be roughly d/ε, in the realizable case. 3
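As a point of reference (not stated on the slide), the realizable-case bound being paraphrased is usually written with the confidence parameter δ made explicit; up to constants it reads:

```latex
m = O\!\left(\frac{1}{\varepsilon}\left(d\,\log\frac{1}{\varepsilon} + \log\frac{1}{\delta}\right)\right)
```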
  • Active learning: In many situations – like speech recognition and document retrieval – unlabeled data is easy to come by, but there is a charge for each label. What is the minimum number of labels needed to achieve the target error rate? 4
  • What is Active Learning? Unlabeled data are readily available; labels are expensive. We want to use adaptive decisions to choose which labels to acquire for a given dataset. The goal is an accurate classifier with minimal cost. 6
  • Active learning warning: the choice of data is only as good as the model itself. Assume a linear model; then two data points are sufficient. What happens when the data are not linear? 7
  • Active Learning Flavors: Selective Sampling / Membership Queries; Pool / Sequential; Myopic / Batch 8
  • Active Learning Approaches Membership queries Uncertainty Sampling Information-based loss functions Uncertainty-Region Sampling Query by committee 9
  • Problem: Many results exist in this framework, even for complicated hypothesis classes. [Baum and Lang, 1991] tried fitting a neural net to handwritten characters; the synthetic instances created were incomprehensible to humans! [Lewis and Gale, 1992] tried training text classifiers: “an artificial text created by a learning algorithm is unlikely to be a legitimate natural language expression, and probably would be uninterpretable by a human teacher.” 12
  • Uncertainty Sampling [Lewis &amp; Gale, 1994]: Query the event that the current classifier is most uncertain about (e.g., if uncertainty is measured as Euclidean distance to the decision boundary, the point closest to it). Used trivially in SVMs, graphical models, etc. 16
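A minimal Python sketch of this idea (not code from the lecture): it assumes a scikit-learn-style classifier exposing predict_proba and scores candidates by predictive entropy; for a margin-based model one would use distance to the decision boundary instead. The function name and arguments are illustrative.

```python
import numpy as np

def uncertainty_sampling(clf, X_unlabeled):
    """Return the index of the unlabeled point the classifier is least sure about."""
    probs = clf.predict_proba(X_unlabeled)                 # shape (n_points, n_classes)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return int(np.argmax(entropy))                         # most uncertain point
```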
  • Information-based Loss Functions [MacKay, 1992]: Maximize the KL-divergence between posterior and prior; maximize the reduction in model entropy between posterior and prior; minimize the cross-entropy between posterior and prior. All of these are notions of information gain. 17
  • Query by Committee [Seung et al. 1992, Freund et al. 1997]: Assumes a prior distribution over hypotheses; samples a set of classifiers from that distribution; queries an example based on the degree of disagreement among the committee of classifiers. 18
  • Infogain-based Active Learning
  • Notation. We have: a dataset D; a model parameter space W; a query algorithm q. 20
  • Dataset (D) example 21
    t | Sex | Age   | Test A | Test B | Test C | Disease
    0 | M   | 40-50 | 0 | 1 | 1 | ?
    1 | F   | 50-60 | 0 | 1 | 0 | ?
    2 | F   | 30-40 | 0 | 0 | 0 | ?
    3 | F   | 60+   | 1 | 1 | 1 | ?
    4 | M   | 10-20 | 0 | 1 | 0 | ?
    5 | M   | 40-50 | 0 | 0 | 1 | ?
    6 | F   | 0-10  | 0 | 0 | 0 | ?
    7 | M   | 30-40 | 1 | 1 | 0 | ?
    8 | M   | 20-30 | 0 | 0 | 1 | ?
  • Notation. We have: a dataset D; a model parameter space W; a query algorithm q. 22
  • Model example: a probabilistic classifier relating St and Ot. Notation: T = number of examples; Ot = vector of features of example t; St = class of example t. 23
  • Model example: patient state (St): St = DiseaseState. Patient observations (Ot): Ot1 = Gender, Ot2 = Age, Ot3 = TestA, Ot4 = TestB, Ot5 = TestC. 24
  • Possible model structures: alternative Bayes net structures over S and the observed features (Gender, Age, TestA, TestB, TestC). 25
  • Model space. Model: St → Ot. Model parameters: P(St), P(Ot|St). Generative model: must be able to compute P(St=i, Ot=ot | w). 26
  • Model Parameter Space (W): W = space of possible parameter values; prior on parameters P(W); posterior over models: P(W | D) ∝ P(D | W) P(W) ∝ [ ∏_{t=1..T} P(St, Ot | W) ] P(W) 27
  • Notation. We have: a dataset D; a model parameter space W; a query algorithm q: q(W,D) returns t*, the next sample to label. 28
  • The game (see the sketch below): while not done, (1) learn P(W | D), (2) q chooses the next example to label, (3) the expert adds its label to D. 29
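A hedged sketch of this loop in Python; all names (learn, q, expert, budget) are placeholders for whatever model, query heuristic, and labeling oracle are used, not an implementation from the lecture.

```python
def active_learning_loop(labeled, pool, learn, q, expert, budget):
    """Generic active-learning 'game': fit, query, label, repeat.

    labeled: list of (x, y) pairs seen so far
    pool:    list of unlabeled examples x
    learn:   labeled data -> fitted model (a stand-in for the posterior P(W | D))
    q:       (model, pool) -> index of the next example to label
    expert:  x -> label
    """
    for _ in range(budget):
        model = learn(labeled)           # learn P(W | D)
        t_star = q(model, pool)          # q chooses the next example to label
        x = pool.pop(t_star)
        labeled.append((x, expert(x)))   # expert adds the label to D
    return learn(labeled)
```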
  • Simulation: observations O1–O7 with hidden states S1–S7; some states are already labeled (true/false), the rest are still ‘?’, and q must decide which one to query next. 30
  • Active Learning Flavors: Pool (“random access” to patients) vs Sequential (must decide as patients walk in the door). 31
  • q? Recall: q(W,D) returns the “most interesting” unlabelled example. Well, what makes a doctor curious about a patient? 32
  • Score function: score_uncert(St) = uncertainty(P(St | Ot)) = H(St) = −∑_i P(St = i) log P(St = i) 34
  • Uncertainty Sampling example 35
    t | Sex | Age   | Test A | Test B | Test C | St        | P(St) | H(St)
    1 | M   | 20-30 | 0 | 1 | 1 | ?         | 0.02 | 0.043
    2 | F   | 20-30 | 0 | 1 | 0 | ?         | 0.01 | 0.024
    3 | F   | 30-40 | 1 | 0 | 0 | ?         | 0.05 | 0.086
    4 | F   | 60+   | 1 | 1 | 0 | ? → FALSE | 0.12 | 0.159
    5 | M   | 10-20 | 0 | 1 | 0 | ?         | 0.01 | 0.024
    6 | M   | 20-30 | 1 | 1 | 1 | ?         | 0.96 | 0.073
    (Row 4 has the highest entropy, so it is queried; its label comes back FALSE.)
  • Uncertainty Sampling example, after labeling 36
    t | Sex | Age   | Test A | Test B | Test C | St       | P(St) | H(St)
    1 | M   | 20-30 | 0 | 1 | 1 | ?        | 0.01 | 0.024
    2 | F   | 20-30 | 0 | 1 | 0 | ?        | 0.02 | 0.043
    3 | F   | 30-40 | 1 | 0 | 0 | ?        | 0.04 | 0.073
    4 | F   | 60+   | 1 | 1 | 0 | FALSE    | 0.00 | 0.00
    5 | M   | 10-20 | 0 | 1 | 0 | ? → TRUE | 0.06 | 0.112
    6 | M   | 20-30 | 1 | 1 | 1 | ?        | 0.97 | 0.059
    (With row 4 labeled, row 5 now has the highest entropy and is queried next; its label comes back TRUE.)
  • Uncertainty Sampling. GOOD: couldn’t be easier. GOOD: often performs pretty well. BAD: H(St) measures information gain about the samples, not the model; sensitive to noisy samples. 37
  • Can we do better than uncertainty sampling? 38
  • 1992 Informative with respect to what? 39
  • Model entropy: sketches of P(W|D) over W, from a flat posterior (H(W) = high), through intermediate (“…better…”), to a sharply peaked posterior (H(W) = 0). 40
  • Information-Gain: Choose the example that is expected to most reduce H(W), i.e., maximize H(W) − H(W | St), where H(W) is the current model-space entropy and H(W | St) is the expected model-space entropy if we learn St. 41
  • Score function: score_IG(St) = MI(St; W) = H(W) − H(W | St) 42
  • We usually can’t sum (or integrate) over all models to get H(W) = −∫ P(w) log P(w) dw … but we can sample a committee C from P(W | D) and use H(W) ≈ H(C) = −∑_{c∈C} P(c) log P(c). 43
  • Conditional model entropy:
    H(W) = −∫ P(w) log P(w) dw
    H(W | St = i) = −∫ P(w | St = i) log P(w | St = i) dw
    H(W | St) = −∑_i P(St = i) ∫ P(w | St = i) log P(w | St = i) dw 44
  • Score function: score_IG(St) = H(C) − H(C | St) 45
  • Info-gain scores for the example dataset 46
    t | Sex | Age   | Test A | Test B | Test C | St | P(St) | Score = H(C) − H(C|St)
    1 | M   | 20-30 | 0 | 1 | 1 | ? | 0.02 | 0.53
    2 | F   | 20-30 | 0 | 1 | 0 | ? | 0.01 | 0.58
    3 | F   | 30-40 | 1 | 0 | 0 | ? | 0.05 | 0.40
    4 | F   | 60+   | 1 | 1 | 1 | ? | 0.12 | 0.49
    5 | M   | 10-20 | 0 | 1 | 0 | ? | 0.01 | 0.57
    6 | M   | 20-30 | 0 | 0 | 1 | ? | 0.02 | 0.52
  • Score function: score_IG(St) = H(C) − H(C | St) = H(St) − H(St | C). Familiar? 47
  • Uncertainty Sampling & Information GainscoreUncertain ( St ) = H ( St )score InfoGain ( St ) = H ( St ) − H ( St | C ) 48
  • But there is a problem… 49
  • If our objective is to reduce the prediction error, then “the expected information gain of an unlabeled sample is NOT a sufficient criterion for constructing good queries” 50
  • Strategy #2: Query by Committee. Temporary assumptions: Pool → Sequential; P(W | D) → Version Space; Probabilistic → Noiseless. QBC attacks the size of the “version space”. 51
  • Committee of two models on a new example: Model #1 says FALSE, Model #2 says FALSE; they agree, so nothing is queried. 52
  • Committee of two models on a new example: Model #1 says TRUE, Model #2 says TRUE; again they agree. 53
  • Committee of two models on a new example: Model #1 says FALSE, Model #2 says TRUE. “Ooh, now we’re going to learn something for sure! One of them is definitely wrong.” 54
  • The original QBC algorithm. As each example arrives: choose a committee C (usually of size 2) randomly from the version space; have each member of C classify it; if the committee disagrees, select (query) it. 55
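A minimal sketch of this stream-based loop in Python (illustrative; version_space_sample and expert are assumed helpers, not functions defined in the lecture):

```python
def qbc_stream(stream, version_space_sample, expert, committee_size=2):
    """Query-by-Committee over a stream of unlabeled examples.

    version_space_sample(k, labeled) is assumed to draw k hypotheses consistent
    with the labels seen so far; each hypothesis is a callable h(x) -> label.
    """
    labeled = []
    for x in stream:
        committee = version_space_sample(committee_size, labeled)
        votes = {h(x) for h in committee}
        if len(votes) > 1:                 # committee disagrees -> query the label
            labeled.append((x, expert(x)))
    return labeled
```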
  • Infogain vs Query by Committee [Seung, Opper, Sompolinsky, 1992; Freund, Seung, Shamir, Tishby 1997]. First idea: try to rapidly reduce the volume of the version space? Problem: doesn’t take the data distribution into account. Which pair of hypotheses in H is closest? Depends on the data distribution P. Distance measure on H: d(h,h’) = P(h(x) ≠ h’(x)) 57
  • Query-by-committee. First idea: try to rapidly reduce the volume of the version space? Problem: doesn’t take the data distribution into account. To keep things simple, say d(h,h’) = Euclidean distance; then the error is likely to remain large! 58
  • Query-by-committee: an elegant scheme which decreases volume in a manner that is sensitive to the data distribution. Bayesian setting: given a prior π on H, set H1 = H. For t = 1, 2, …: receive an unlabeled point xt drawn from P [informally: is there a lot of disagreement about xt in Ht?]; choose two hypotheses h, h’ randomly from (π, Ht); if h(xt) ≠ h’(xt), ask for xt’s label and set Ht+1 accordingly. Problem: how to implement it efficiently? 59
  • Query-by-committee. For t = 1, 2, …: receive an unlabeled point xt drawn from P; choose two hypotheses h, h’ randomly from (π, Ht); if h(xt) ≠ h’(xt), ask for xt’s label and set Ht+1. Observation: the probability of getting pair (h, h’) in the inner loop (when a query is made) is proportional to π(h) π(h’) d(h, h’). 60
  • Query-by-committee label bound: for H = {linear separators in Rd} and P = uniform distribution, just d log 1/ε labels suffice to reach a hypothesis with error < ε. Implementation: need to randomly pick h according to (π, Ht); e.g. for H = {linear separators in Rd}, π = uniform distribution: how do you pick a random point from a convex body? 62
  • Sampling from convex bodies: by random walk, e.g. the ball walk or hit-and-run. [Gilad-Bachrach, Navot, Tishby 2005] studies such random walks and also ways to kernelize QBC. 63
  • Some challenges: [1] For linear separators, analyze the label complexity for some distribution other than uniform! [2] How to handle nonseparable data? Need a robust base learner. 65
  • Active Collaborative Prediction
  • Approach: Collaborative Prediction (CP). Examples: a client × server matrix of a QoS measure (e.g. bandwidth), and a user × movie ratings matrix; both are only partially observed. Given previously observed ratings R(x,y), where X is a “user” and Y is a “product”, predict the unobserved ratings: will Alina like “The Matrix”? (unlikely) Will Client 86 have a fast download from Server 39? Will member X of the funding committee approve our project Y? 67
  • Collaborative Prediction = Matrix Approximation (e.g. a 100 clients × 100 servers matrix). Important assumption: matrix entries are NOT independent, e.g. similar users have similar tastes. Approaches: mainly factorized models assuming hidden ‘factors’ that affect ratings (pLSA, MCVQ, SVD, NMF, MMMF, …) 68
  • Hidden-factor view of a rating row (e.g. 2 4 5 1 4 2): users have ‘weights’ associated with hidden factors, and movies have values for those factors. Assumptions: there is a number of (hidden) factors behind the user preferences that relate to (hidden) movie properties; movies have intrinsic values associated with such factors; users have intrinsic weights on such factors; user ratings are weighted (linear) combinations of the movie’s values. 69
  • Rank-k matrix factorization: approximate the partially observed ratings matrix Y by a factorizable X = UV’. Objective: X = arg min_X Loss(X, Y), subject to some “regularization” constraints (e.g. rank(X) < k). Loss functions: depends on the nature of your problem. 72
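A toy sketch of this objective (my own illustration, using squared loss and a simple norm penalty rather than MMMF's hinge loss and trace-norm constraint): fit a rank-k factorization X = U Vᵀ to the observed entries of Y only.

```python
import numpy as np

def factorize_observed(Y, mask, k=5, lr=0.01, reg=0.1, epochs=500, seed=0):
    """Gradient descent on the squared error over observed entries of Y.

    mask[i, j] is True where Y[i, j] is observed (unobserved entries of Y may
    hold any placeholder, e.g. 0).  Returns the completed matrix U @ V.T.
    """
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    for _ in range(epochs):
        err = np.where(mask, U @ V.T - Y, 0.0)   # residuals on observed entries only
        U_grad = err @ V + reg * U
        V_grad = err.T @ U + reg * V
        U -= lr * U_grad
        V -= lr * V_grad
    return U @ V.T
```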
  • Matrix Factorization Approaches. Singular value decomposition (SVD) gives a low-rank approximation, but assumes a fully observed Y and sum-squared loss. In collaborative prediction, Y is only partially observed, so low-rank approximation becomes a non-convex problem with many local minima. Furthermore, we may not want sum-squared loss, but instead: accurate predictions (0/1 loss, approximated by hinge loss); cost-sensitive predictions (missing a good server vs suggesting a bad one); ranking cost (e.g., suggest the k ‘best’ movies for a user). NON-CONVEX PROBLEMS! Use instead the state-of-the-art Max-Margin Matrix Factorization [Srebro 05]: it replaces the bounded-rank constraint by a bounded norm of the U, V’ vectors; it is a convex optimization problem that can be solved exactly by semi-definite programming; and it strongly relates to learning max-margin classifiers (SVMs). Exploit MMMF’s properties to augment it with active sampling! 73
  • Key Idea of MMMF: rows are feature vectors, columns are linear classifiers (weight vectors); the “margin” here = distance from a sample to the separating line. Xij = signij × marginij; Predictorij = signij: if signij > 0, classify as +1, otherwise classify as −1. 74
  • MMMF: Simultaneous Search for Low-norm Feature Vectors and Max-margin Classifiers 75
  • Active Learning with MMMF: we extend MMMF to Active-MMMF using margin-based active sampling, and we investigate the exploitation vs exploration trade-offs imposed by different heuristics. Margin-based heuristics: min-margin (most uncertain); min-margin positive (“good” uncertain); max-margin (‘safe choice’, but no info). 76
  • Active Max-Margin Matrix Factorization, A-MMMF(M, s): 1. Given a sparse matrix Y, learn the approximation X = MMMF(Y). 2. Using the current predictions, actively select the “best s” samples and request their labels (e.g., test a client/server pair via an ‘enforced’ download). 3. Add the new samples to Y. 4. Repeat 1–3. Issues: beyond simple greedy margin-based heuristics? Theoretical guarantees? Not so easy with non-trivial learning methods and non-trivial data distributions (any suggestions?). 77
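A sketch of step 2, the margin-based selection (illustrative; the function and argument names are mine): X_pred holds the current real-valued MMMF predictions, so |X_pred| plays the role of the margin and its sign is the predicted label.

```python
import numpy as np

def select_queries(X_pred, observed, s=10, heuristic="min_margin"):
    """Return (row, col) indices of the s unobserved entries to query next."""
    candidates = np.argwhere(~observed)          # unobserved (i, j) pairs, row-major
    values = X_pred[~observed]                   # predictions, same order as candidates
    margins = np.abs(values)
    if heuristic == "min_margin":                # most uncertain entries
        scores = margins
    elif heuristic == "min_margin_positive":     # uncertain entries predicted positive
        scores = np.where(values > 0, margins, np.inf)
    elif heuristic == "max_margin":              # 'safe' choice, but little information
        scores = -margins
    else:
        raise ValueError(f"unknown heuristic: {heuristic}")
    order = np.argsort(scores)
    return candidates[order[:s]]
```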
  • Empirical Results Network latency prediction Bandwidth prediction (peer-to-peer) Movie Ranking Prediction Sensor net connectivity prediction 78
  • Empirical Results: Latency Prediction (P2Psim data, NLANR-AMP data). Active sampling with the most-uncertain (and most-uncertain-positive) heuristics provides consistent improvement over random and least-uncertain-next sampling. 79
  • Movie Rating Prediction (MovieLens) 80
  • Sensor Network Connectivity 81
  • Introducing Cost: Exploration vs Exploitation. DownloadGrid: bandwidth prediction; PlanetLab: latency prediction. Active sampling → lower prediction errors at lower costs (saves 100s of samples); better prediction → better server assignment decisions → faster downloads. Active sampling achieves a good exploration vs exploitation trade-off: reduced decision cost AND information gain. 82
  • Conclusions Common challenge in many applications: need for cost-efficient sampling This talk: linear hidden factor models with active sampling Active sampling improves predictive accuracy while keeping sampling complexity low in a wide variety of applications Future work:  Better active sampling heuristics?  Theoretical analysis of active sampling performance?  Dynamic Matrix Factorizations: tracking time-varying matrices  Incremental MMMF? (solving from scratch every time is too costly) 83
  • ReferencesSome of the most influential papers• Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification . Journal of Machine Learning Research. Volume 2, pages 45-66. 2001.• Y. Freund, H. S. Seung, E. Shamir, N. Tishby. 1997. Selective sampling using the query by committee algorithm. Machine Learning, 28:133—168• David Cohn, Zoubin Ghahramani, and Michael Jordan. Active learning with statistical models, Journal of Artificial Intelligence Research, (4): 129-145, 1996.• David Cohn, Les Atlas and Richard Ladner. Improving generalization with active learning, Machine Learning 15(2):201-221, 1994.• D. J. C. Mackay. Information-Based Objective Functions for Active Data Selection. Neural Comput., vol. 4, no. 4, pp. 590--604, 1992. 84
  • NIPS papers• Francis Bach. Active learning for misspecified generalized linear models. NIPS-06• Ran Gilad-Bachrach, Amir Navot, Naftali Tishby. Query by Committee Made Real. NIPS-05• Brent Bryan, Jeff Schneider, Robert Nichol, Christopher Miller, Christopher Genovese, Larry Wasserman. Active Learning For Identifying Function Threshold Boundaries. NIPS-05• Rui Castro, Rebecca Willett, Robert Nowak. Faster Rates in Regression via Active Learning. NIPS-05• Sanjoy Dasgupta. Coarse sample complexity bounds for active learning. NIPS-05• Masashi Sugiyama. Active Learning for Misspecified Models. NIPS-05• Brigham Anderson, Andrew Moore. Fast Information Value for Graphical Models. NIPS-05• Dan Pelleg, Andrew W. Moore. Active Learning for Anomaly and Rare-Category Detection. NIPS-04• Sanjoy Dasgupta. Analysis of a greedy active learning strategy. NIPS-04• T. Jaakkola and H. Siegelmann. Active Information Retrieval. NIPS-01• M. K. Warmuth et al. Active Learning in the Drug Discovery Process. NIPS-01• Jonathan D. Nelson, Javier R. Movellan. Active Inference in Concept Learning. NIPS-00• Simon Tong, Daphne Koller. Active Learning for Parameter Estimation in Bayesian Networks. NIPS-00• Thomas Hofmann and Joachim M. Buhmann. Active Data Clustering. NIPS-97• K. Fukumizu. Active Learning in Multilayer Perceptrons. NIPS-95• Anders Krogh, Jesper Vedelsby. Neural Network Ensembles, Cross Validation, and Active Learning. NIPS-94• Kah Kay Sung, Partha Niyogi. Active Learning for Function Approximation. NIPS-94• David Cohn, Zoubin Ghahramani, Michael I. Jordan. Active Learning with Statistical Models. NIPS-94• Sebastian B. Thrun and Knut Moller. Active Exploration in Dynamic Environments. NIPS-91 85
  • ICML papers• Maria-Florina Balcan, Alina Beygelzimer, John Langford. Agnostic Active Learning. ICML-06• Steven C. H. Hoi, Rong Jin, Jianke Zhu, Michael R. Lyu. Batch Mode Active Learning and Its Application to Medical Image Classification. ICML-06• Sriharsha Veeramachaneni, Emanuele Olivetti, Paolo Avesani. Active Sampling for Detecting Irrelevant Features. ICML-06• Kai Yu, Jinbo Bi, Volker Tresp. Active Learning via Transductive Experimental Design. ICML-06• Rohit Singh, Nathan Palmer, David Gifford, Bonnie Berger, Ziv Bar-Joseph. Active Learning for Sampling in Time-Series Experiments With Application to Gene Expression Analysis. ICML-05• Prem Melville, Raymond Mooney. Diverse Ensembles for Active Learning. ICML-04• Klaus Brinker. Active Learning of Label Ranking Functions. ICML-04• Hieu Nguyen, Arnold Smeulders. Active Learning Using Pre-clustering. ICML-04• Greg Schohn and David Cohn. Less is More: Active Learning with Support Vector Machines. ICML-00• Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. ICML-00.• COLT papers• S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. COLT-05.• H. S. Seung, M. Opper, and H. Sompolinsky. 1992. Query by committee. COLT-92, pages 287--294. 86
  • Journal Papers• Antoine Bordes, Seyda Ertekin, Jason Weston, Leon Bottou. Fast Kernel Classifiers with Online and Active Learning. Journal of Machine Learning Research (JMLR), vol. 6, pp. 1579-1619, 2005.• Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research. Volume 2, pages 45-66. 2001.• Y. Freund, H. S. Seung, E. Shamir, N. Tishby. 1997. Selective sampling using the query by committee algorithm. Machine Learning, 28:133--168• David Cohn, Zoubin Ghahramani, and Michael Jordan. Active learning with statistical models, Journal of Artificial Intelligence Research, (4): 129-145, 1996.• David Cohn, Les Atlas and Richard Ladner. Improving generalization with active learning, Machine Learning 15(2):201-221, 1994.• D. J. C. Mackay. Information-Based Objective Functions for Active Data Selection. Neural Comput., vol. 4, no. 4, pp. 590--604, 1992.• Haussler, D., Kearns, M., and Schapire, R. E. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension . Machine Learning, 14, 83--113• Fedorov, V. V. 1972. Theory of optimal experiment. Academic Press.• Saar-Tsechansky, M. and F. Provost. Active Sampling for Class Probability Estimation and Ranking. Machine Learning 54:2 2004, 153-178. 87
  • Workshops: http://domino.research.ibm.com/comm/researc 88
  • Appendix 89
  • Active Learning of Bayesian Networks
  • Entropy Function: a measure of information in a random event X with possible outcomes {x1,…,xn}: H(X) = − Σi p(xi) log2 p(xi). Comments on the entropy function: the entropy of an event is zero when the outcome is known; entropy is maximal when all outcomes are equally likely. It is the average minimum number of yes/no questions needed to answer some question (connection to binary search) [Shannon, 1948]. 91
  • Kullback-Leibler divergence: P is the true distribution; the distribution Q is used to encode data instead of P. The KL divergence is the expected extra message length per datum that must be transmitted when using Q: DKL(P || Q) = Σi P(xi) log (P(xi)/Q(xi)) = Σi P(xi) log P(xi) − Σi P(xi) log Q(xi) = H(P,Q) − H(P) = cross-entropy − entropy. A measure of how “wrong” Q is with respect to the true distribution P. 92
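For concreteness, a small Python sketch of both quantities in base 2 (bits); the helper names are mine:

```python
import numpy as np

def entropy(p):
    """H(P) = -sum_i p_i log2 p_i, in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i log2(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * (np.log2(p[nz]) - np.log2(q[nz]))))

# Example: entropy([0.5, 0.5]) == 1.0 bit; kl_divergence([0.5, 0.5], [0.5, 0.5]) == 0.0,
# while kl_divergence([0.9, 0.1], [0.5, 0.5]) is about 0.53 bits.
```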
  • Learning Bayesian Networks: Data + prior knowledge → Learner → a Bayesian network (the figure shows a network over E, B, R, A, C with a CPT P(A | E,B)). Tasks: model building; parameter estimation; causal structure discovery; passive learning vs active learning. 93
  • Active Learning: selective active learning; interventional active learning. Obtain a measure of the quality of the current model; choose the query that most improves quality; update the model. 94
  • Active Learning: Parameter Estimation [Tong &amp; Koller, NIPS-2000]. Given a BN structure G and a prior distribution p(θ), the learner requests a particular instantiation q (the query), receives a response x, and adds it to the training data; the initial distribution G, p(θ) is updated to p´(θ). Two questions: how to update the parameter density, and how to select the next query based on p. 95
  • Updating the parameter density: do not update A, since we are fixing it. If we select A, then do not update B (sampling from P(B|A=a) ≠ P(B)). If we force A, then we can update B (sampling from P(B|A:=a) = P(B)*). Update all other nodes as usual. Obtain the new density p(θ | A = a, X = x). (*Pearl 2000) 96
  • Bayesian point estimation. Goal: a single estimate θ̃, instead of a distribution p over θ. If we choose θ̃ and the true model is θ′, then we incur some loss L(θ′ || θ̃). 97
  • Bayesian point estimation. We do not know the true θ′; the density p represents optimal beliefs over θ′. Choose the θ̃ that minimizes the expected loss: θ̃ = argmin_θ ∫ p(θ′) L(θ′ || θ) dθ′. Call θ̃ the Bayesian point estimate. Use the expected loss of the Bayesian point estimate as a measure of the quality of p(θ): Risk(p) = ∫ p(θ′) L(θ′ || θ̃) dθ′. 98
  • The querying component: set the controllable variables so as to minimize the expected posterior risk: ExPRisk(p | Q=q) = ∑_x P(X=x | Q=q) ∫ p(θ | x) KL(θ || θ̃) dθ. KL divergence is used as the loss, with the conditional KL-divergence KL(θ || θ′) = ∑_i KL(Pθ(Xi|Ui) || Pθ′(Xi|Ui)). 99
  • Algorithm summary: for each potential query q, compute ∆Risk(X|q); choose the q for which ∆Risk(X|q) is greatest. Cost of computing ∆Risk(X|q): the cost of Bayesian network inference. Complexity: O(|Q| · cost of inference). 100
  • Uncertainty sampling: maintain a single hypothesis, based on the labels seen so far, and query the point about which this hypothesis is most “uncertain”. Problem: the confidence of a single hypothesis may not accurately represent the true diversity of opinion in the hypothesis class. 101
  • Region of uncertainty. Current version space: the portion of H consistent with the labels so far. “Region of uncertainty” = the part of data space about which there is still some uncertainty (i.e. disagreement within the version space). Example: suppose the data lie on a circle in R2 and the hypotheses are linear separators (spaces X and H superimposed in the figure, showing the current version space and the region of uncertainty in data space). 103
  • Region of uncertainty. Algorithm [CAL92]: of the unlabeled points which lie in the region of uncertainty, pick one at random to query. In the figure, the data and hypothesis spaces are superimposed (both are the surface of the unit sphere in Rd), showing the region of uncertainty in data space. 104
  • Region of uncertainty. The number of labels needed depends on H and also on P. Special case: H = {linear separators in Rd}, P = uniform distribution over the unit sphere. Then just d log 1/ε labels are needed to reach a hypothesis with error rate < ε. [1] Supervised learning: d/ε labels. [2] The best we can hope for. 105
  • Region of uncertainty. Algorithm [CAL92]: of the unlabeled points which lie in the region of uncertainty, pick one at random to query. For more general distributions this is suboptimal: we need to measure the quality of a query, or alternatively the size of the version space. 106
  • Expected infogain of a sample: uncertainty sampling! 107