Active Learning

COMS 6998-4: Learning and Empirical Inference

Irina Rish
IBM T.J. Watson Research Center
Outline

   Motivation
   Active learning approaches
      Membership queries
      Uncertainty Sampling
      Information-based loss functions
      Uncertainty-Region Sampling
      Query by committee

   Applications
      Active Collaborative Prediction
      Active Bayes net learning
Standard supervised learning model

Given m labeled points, we want to learn a classifier with
misclassification rate < ε, chosen from a hypothesis class H with
VC dimension d < ∞.

VC theory: we need m to be roughly d/ε, in the realizable case.
Active learning

In many situations – like speech recognition and document
retrieval – unlabeled data is easy to come by, but there is a
charge for each label.

What is the minimum number of labels needed to achieve the
target error rate?
What is Active Learning?

   Unlabeled data are readily available; labels are expensive

   Want to use adaptive decisions to choose which labels to
    acquire for a given dataset

   Goal is an accurate classifier with minimal cost
Active learning warning

   Choice of data is only as good as the model itself
   Assume a linear model; then two data points are sufficient
   What happens when the data are not linear?
Active Learning Flavors

   Membership queries  vs.  selective sampling
   Selective sampling: pool-based  vs.  sequential
   Query selection: myopic  vs.  batch
Active Learning Approaches

 Membership queries
 Uncertainty Sampling
 Information-based loss functions
 Uncertainty-Region Sampling
 Query by committee
Problem

Many results exist in this framework, even for complicated
hypothesis classes.

[Baum and Lang, 1991] tried fitting a neural net to handwritten
characters. The synthetic instances created were incomprehensible
to humans!

[Lewis and Gale, 1992] tried training text classifiers:
"an artificial text created by a learning algorithm is unlikely to
be a legitimate natural language expression, and probably would
be uninterpretable by a human teacher."
Uncertainty Sampling
[Lewis & Gale, 1994]

   Query the event that the current classifier is most
    uncertain about

   If uncertainty is measured in Euclidean distance:

   [Figure: points on a line; the query is the point closest to
    the current decision boundary]

   Used trivially in SVMs, graphical models, etc.
Information-based Loss Function
[MacKay, 1992]

   Maximize KL-divergence between posterior and prior

   Maximize reduction in model entropy between posterior
    and prior

   Minimize cross-entropy between posterior and prior

   All of these are notions of information gain
Query by Committee
[Seung et al. 1992, Freund et al. 1997]

   Assume a prior distribution over hypotheses
   Sample a set of classifiers from that distribution
   Query an example based on the degree of
    disagreement between the committee of classifiers

   [Figure: points on a line; committee members A, B, C place
    their decision boundaries at different points, and the
    examples falling between those boundaries are the ones the
    committee disagrees on]
Infogain-based Active Learning
Notation

We have:

• Dataset, D
• Model parameter space, W
• Query algorithm, q
Dataset (D) Example

 t   Sex   Age     Test A   Test B   Test C   Disease
 0   M     40-50   0        1        1        ?
 1   F     50-60   0        1        0        ?
 2   F     30-40   0        0        0        ?
 3   F     60+     1        1        1        ?
 4   M     10-20   0        1        0        ?
 5   M     40-50   0        0        1        ?
 6   F     0-10    0        0        0        ?
 7   M     30-40   1        1        0        ?
 8   M     20-30   0        0        1        ?
Model Example

   [Figure: probabilistic classifier with hidden state St
    generating observations Ot]

Notation

T  : number of examples
Ot : vector of features of example t
St : class of example t
Model Example

 Patient state (St)
   St : DiseaseState

 Patient observations (Ot)
   Ot1 : Gender
   Ot2 : Age
   Ot3 : TestA
   Ot4 : TestB
   Ot5 : TestC
Possible Model Structures

   [Figure: two candidate Bayes net structures relating the class
    S to the observations Gender, Age, TestA, TestB, TestC, with
    different dependency patterns among the observations]
Model Space

Model: St → Ot

Model parameters: P(St), P(Ot | St)

Generative model:
must be able to compute P(St = i, Ot = ot | w)
Model Parameter Space (W)

• W = space of possible parameter values

• Prior on parameters: P(W)

• Posterior over models:

     P(W | D) ∝ P(D | W) P(W) ∝ [ ∏t=1..T P(St, Ot | W) ] P(W)
Notation

We have:

• Dataset, D
• Model parameter space, W
• Query algorithm, q

q(W, D) returns t*, the next sample to label
Game

while NotDone:
  • Learn P(W | D)
  • q chooses the next example to label
  • Expert adds the label to D
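A minimal sketch of this game in Python; fit_posterior, q, and oracle
are hypothetical stand-ins for the learner, the query algorithm, and
the expert, with a fixed budget standing in for NotDone:

    def active_learning_game(unlabeled, oracle, q, fit_posterior, budget=20):
        """Generic active learning loop (sketch)."""
        labeled = {}                            # t -> label; the growing dataset D
        for _ in range(budget):
            posterior = fit_posterior(labeled)  # learn P(W | D)
            t_star = q(posterior, unlabeled)    # q picks the next example to label
            labeled[t_star] = oracle(t_star)    # expert adds the label to D
            unlabeled.remove(t_star)
        return fit_posterior(labeled)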
Simulation

   [Figure: a stream of examples (O1…O7, S1…S7); some classes are
    already labeled (S2 = false, S5 = true, S7 = false), and q
    puzzles ("hmm…") over which of the remaining "?" examples to
    query next]
Active Learning Flavors

• Pool
  ("random access" to patients)

• Sequential
  (must decide as patients walk in the door)
q?

• Recall: q(W, D) returns the "most interesting"
  unlabeled example.

• Well, what makes a doctor curious about a patient?
[Slide: screenshot of a 1994 paper]
Score Function

scoreuncert(St) = uncertainty(P(St | Ot))
                = H(St)
                = −Σi P(St = i) log P(St = i)
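A small sketch of this score in Python (numpy assumed); posteriors is
taken to hold one row of class probabilities P(St = i | Ot) per
unlabeled example:

    import numpy as np

    def uncertainty_scores(posteriors, eps=1e-12):
        """Entropy H(St) of each row of class posteriors P(St = i | Ot)."""
        p = np.clip(np.asarray(posteriors), eps, 1.0)
        return -(p * np.log2(p)).sum(axis=1)

    # Query the example whose class posterior is most uncertain:
    # t_star = np.argmax(uncertainty_scores(posteriors))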
Uncertainty Sampling Example

 t   Sex   Age     Test A   Test B   Test C   St          P(St)   H(St)
 1   M     20-30   0        1        1        ?           0.02    0.043
 2   F     20-30   0        1        0        ?           0.01    0.024
 3   F     30-40   1        0        0        ?           0.05    0.086
 4   F     60+     1        1        0        ? → FALSE   0.12    0.159
 5   M     10-20   0        1        0        ?           0.01    0.024
 6   M     20-30   1        1        1        ?           0.96    0.073

(Row 4 has the highest entropy, so it is queried and labeled FALSE.)
Uncertainty Sampling Example

 t   Sex   Age     Test A   Test B   Test C   St          P(St)   H(St)
 1   M     20-30   0        1        1        ?           0.01    0.024
 2   F     20-30   0        1        0        ?           0.02    0.043
 3   F     30-40   1        0        0        ?           0.04    0.073
 4   F     60+     1        1        0        FALSE       0.00    0.00
 5   M     10-20   0        1        0        ? → TRUE    0.06    0.112
 6   M     20-30   1        1        1        ?           0.97    0.059

(After learning row 4's label, the posteriors are updated; row 5 is
now the most uncertain, so it is queried next.)
Uncertainty Sampling

GOOD: couldn't be easier
GOOD: often performs pretty well

BAD: H(St) measures information gain about the
     samples, not the model

BAD: sensitive to noisy samples
Can we do better than uncertainty sampling?
[Slide: screenshot of a 1992 paper]

   Informative with respect to what?
Model Entropy

   [Figure: three posteriors P(W | D) over W, from flat
    (H(W) = high), to more peaked (…better…), to a point mass
    (H(W) = 0)]
Information-Gain

• Choose the example that is expected to most reduce H(W)

• I.e., maximize H(W) − H(W | St),
  where H(W) is the current model-space entropy, and
  H(W | St) is the expected model-space entropy if we learn St
Score Function

scoreIG(St) = MI(St; W)
            = H(W) − H(W | St)
We usually can't just integrate over all models to get H(W):

   H(W) = −∫w P(w) log P(w) dw

…but we can sample a committee C of models from P(W | D):

   H(W) ≈ H(C) = −Σc∈C P(c) log P(c)
Conditional Model Entropy

   H(W) = −∫w P(w) log P(w) dw

   H(W | St = i) = −∫w P(w | St = i) log P(w | St = i) dw

   H(W | St) = Σi P(St = i) H(W | St = i)
             = −Σi P(St = i) ∫w P(w | St = i) log P(w | St = i) dw
Score Function

scoreIG(St) = H(C) − H(C | St)
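A sketch of this committee-based score in Python. The interface is
assumed, not from the slides: committee is a list of models c sampled
from P(W | D), each with a hypothetical predict_proba(x) returning
P(St = i | x, c), and weights holds the current posterior P(c) over
the members:

    import numpy as np

    def entropy(p, eps=1e-12):
        p = np.clip(p, eps, None)
        p = p / p.sum()                         # normalize before taking entropy
        return -(p * np.log2(p)).sum()

    def infogain_score(x, committee, weights):
        """Estimate H(C) - H(C | St) for one unlabeled example x."""
        probs = np.array([c.predict_proba(x) for c in committee])  # |C| x classes
        h_c = entropy(weights)
        p_label = weights @ probs               # predictive P(St = i)
        h_c_given_st = 0.0
        for i, p_i in enumerate(p_label):
            post = weights * probs[:, i]        # reweight members by P(St = i | c)
            h_c_given_st += p_i * entropy(post)
        return h_c - h_c_given_st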
 t   Sex   Age     Test A   Test B   Test C   St   P(St)   Score = H(C) − H(C|St)
 1   M     20-30   0        1        1        ?    0.02    0.53
 2   F     20-30   0        1        0        ?    0.01    0.58
 3   F     30-40   1        0        0        ?    0.05    0.40
 4   F     60+     1        1        1        ?    0.12    0.49
 5   M     10-20   0        1        0        ?    0.01    0.57
 6   M     20-30   0        0        1        ?    0.02    0.52
Score Function

scoreIG(St) = H(C) − H(C | St)
            = H(St) − H(St | C)

(mutual information is symmetric)

Familiar?
Uncertainty Sampling & Information Gain

scoreUncertain(St) = H(St)

scoreInfoGain(St) = H(St) − H(St | C)
But there is a problem…
If our objective is to reduce the prediction error, then

"the expected information gain of an unlabeled sample is NOT
a sufficient criterion for constructing good queries"
Strategy #2: Query by Committee

Temporary assumptions:
   pool          → sequential
   P(W | D)      → version space
   probabilistic → noiseless

   QBC attacks the size of the "version space"
   [Figures: two models drawn from the version space classify a
    stream of examples O1…O7.

    When both models say FALSE, or both say TRUE, the committee
    agrees and the example teaches us nothing.

    When Model #1 says FALSE and Model #2 says TRUE:
    "Ooh, now we're going to learn something for sure!
     One of them is definitely wrong."]
The Original QBC Algorithm

As each example arrives…

•  Choose a committee, C (usually of size 2),
   randomly from the version space

•  Have each member of C classify it

•  If the committee disagrees, select it.
[Slide: screenshot of a 1992 paper]
Infogain vs Query by Committee
[Seung, Opper, Sompolinsky, 1992; Freund, Seung, Shamir, Tishby 1997]

First idea: try to rapidly reduce the volume of the version space?
Problem: doesn't take the data distribution into account.

   [Figure: hypothesis class H]

Which pair of hypotheses is closest? Depends on the data distribution P.
Distance measure on H: d(h, h') = P(h(x) ≠ h'(x))
Query-by-committee

First idea: try to rapidly reduce the volume of the version space?
Problem: doesn't take the data distribution into account.
To keep things simple, say d(h, h') = Euclidean distance.

   [Figure: a volume-halving split of H that ignores the data
    distribution — the error is likely to remain large!]
Query-by-committee

Elegant scheme which decreases the volume in a manner which is
sensitive to the data distribution.

Bayesian setting: given a prior π on H

H1 = H
For t = 1, 2, …
        receive an unlabeled point xt drawn from P
        [informally: is there a lot of disagreement about xt in Ht?]
        choose two hypotheses h, h' randomly from (π, Ht)
        if h(xt) ≠ h'(xt): ask for xt's label
        set Ht+1 = {h ∈ Ht : h consistent with the labels so far}
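A minimal sketch of this scheme in Python, using 1-D threshold
classifiers as a stand-in hypothesis class (chosen here only for
illustration): the version space Ht stays an interval of thresholds,
and π is uniform on it:

    import numpy as np

    def qbc_thresholds(stream, oracle, rng=None):
        """QBC for hypotheses h_w(x) = [x >= w], thresholds w in [0, 1]."""
        rng = np.random.default_rng() if rng is None else rng
        lo, hi = 0.0, 1.0                        # version space H_t = [lo, hi]
        n_queries = 0
        for x in stream:                         # unlabeled xt drawn from P
            w1, w2 = rng.uniform(lo, hi, size=2) # two hypotheses from (pi, H_t)
            if (x >= w1) != (x >= w2):           # committee disagrees on xt
                n_queries += 1
                if oracle(x):                    # label 1: consistent w <= x
                    hi = min(hi, x)
                else:                            # label 0: consistent w > x
                    lo = max(lo, x)
        return (lo + hi) / 2, n_queries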
Problem: how to implement it efficiently?

Query-by-committee
For t = 1, 2, …
         receive an unlabeled point xt drawn from P
         choose two hypotheses h, h' randomly from (π, Ht)
         if h(xt) ≠ h'(xt): ask for xt's label
         set Ht+1

Observation: the probability of getting the pair (h, h') in the inner
loop (when a query is made) is proportional to π(h) π(h') d(h, h').

   [Figure: within Ht, a far-apart pair of hypotheses is much more
    likely to trigger a query than a nearby pair]
Query-by-committee

Label bound:
For H = {linear separators in Rd}, P = uniform distribution,
just d log 1/ε labels suffice to reach a hypothesis with error < ε.

Implementation: need to randomly pick h according to (π, Ht).

E.g. H = {linear separators in Rd}, π = uniform distribution:

   [Figure: Ht as a convex body — how do you pick a random point
    from a convex body?]
Sampling from convex bodies

 By random walk!
   1. Ball walk
   2. Hit-and-run

[Gilad-Bachrach, Navot, Tishby 2005] studies random walks and
also ways to kernelize QBC.
Some challenges

[1] For linear separators, analyze the label complexity for
some distribution other than uniform!

[2] How to handle nonseparable data?
Need a robust base learner.

   [Figure: +/− points with a noisy point on the wrong side of
    the true boundary]
Active Collaborative Prediction

Approach: Collaborative Prediction (CP)

   QoS measure (e.g. bandwidth):

              Server1   Server2   Server3
     Client1  3674                18
     Client2  187       567
     Client3                      1688
     Client4  3009      703       ?

   Movie ratings:

             Matrix   Geisha   Shrek
     Alina   ?        1        3
     Gerry   4        2
     Irina   9                 10
     Raja             4

   Given previously observed ratings R(x, y), where X is a
   "user" and Y is a "product", predict unobserved ratings:

      - will Alina like "The Matrix"? (unlikely)
      - will Client 86 have fast download from Server 39?
      - will member X of a funding committee approve our project Y?
Collaborative Prediction = Matrix Approximation

   [Figure: a 100-clients × 100-servers matrix with scattered
    observed entries]

• Important assumption:
  matrix entries are NOT independent,
  e.g. similar users have similar tastes

• Approaches:
  mainly factorized models assuming hidden "factors" that affect
  ratings (pLSA, MCVQ, SVD, NMF, MMMF, …)
   [Figure: a user's rating row (2 4 5 1 4 2) decomposed into the
    user's weights associated with "factors", times the factors]

Assumptions:
   - there is a number of (hidden) factors behind the user
     preferences that relate to (hidden) movie properties
   - movies have intrinsic values associated with such factors
   - users have intrinsic weights on such factors;
     user ratings are weighted (linear) combinations of the
     movies' values
   [Figure: a sparsely observed ratings matrix Y next to a fully
    filled-in rank-k approximation X]

Objective: find a factorizable X = UV' that approximates Y:

   X = arg min_{X'} Loss(X', Y)

and satisfies some "regularization" constraints (e.g. rank(X) < k).

Loss function: depends on the nature of your problem.
Matrix Factorization Approaches

   Singular value decomposition (SVD) – low-rank approximation
      Assumes a fully observed Y and sum-squared loss

   In collaborative prediction, Y is only partially observed
      Low-rank approximation becomes a non-convex problem with many local minima

   Furthermore, we may not want sum-squared loss, but instead
      accurate predictions (0/1 loss, approximated by hinge loss)
      cost-sensitive predictions (missing a good server vs suggesting a bad one)
      ranking cost (e.g., suggest the k "best" movies for a user)

   NON-CONVEX PROBLEMS!

   Use instead the state-of-the-art Max-Margin Matrix Factorization [Srebro 05]
      replaces the bounded-rank constraint by a bounded norm of the U, V' vectors
      convex optimization problem! – can be solved exactly by semi-definite programming
      strongly relates to learning max-margin classifiers (SVMs)

   Exploit MMMF's properties to augment it with active sampling!
Key Idea of MMMF

Rows – feature vectors, columns – linear classifiers

   [Figure: each row of U is a feature vector, each column of V' a
    linear classifier (weight vector); the "margin" of entry (i, j)
    is the distance from sample i to classifier j's decision line]

   Xij = signij × marginij
   Predictorij = signij

   If signij > 0, classify as +1; otherwise classify as −1
MMMF: Simultaneous Search for Low-norm
Feature Vectors and Max-margin Classifiers
Active Learning with MMMF

- We extend MMMF to Active-MMMF using margin-based active sampling

- We investigate the exploitation vs exploration trade-offs imposed
  by different heuristics

   [Figure: matrix of predicted margins for the unobserved entries]

   Margin-based heuristics:
      min-margin (most uncertain)
      min-margin positive ("good" uncertain)
      max-margin ("safe" choice, but no info)
Active Max-Margin Matrix Factorization

   A-MMMF(M, s)
      1. Given a sparse matrix Y, learn the approximation X = MMMF(Y)
      2. Using current predictions, actively select the "best" s samples and
         request their labels (e.g., test a client/server pair via an
         'enforced' download)
      3. Add the new samples to Y
      4. Repeat 1-3

Issues:
   Beyond simple greedy margin-based heuristics?
   Theoretical guarantees? Not so easy with non-trivial learning
    methods and non-trivial data distributions
    (any suggestions?)
Empirical Results

   Network latency prediction
   Bandwidth prediction (peer-to-peer)
   Movie rating prediction
   Sensor net connectivity prediction
Empirical Results: Latency Prediction

   [Figures: results on P2Psim data and NLANR-AMP data]

Active sampling with the most-uncertain (and most-uncertain-positive)
heuristics provides consistent improvement over random and least-
uncertain-next sampling.
Movie Rating Prediction (MovieLens)

   [Figure: results on MovieLens]
Sensor Network Connectivity

   [Figure: results on sensor network connectivity prediction]
Introducing Cost: Exploration vs Exploitation

   [Figures: DownloadGrid (bandwidth prediction) and PlanetLab
    (latency prediction)]

Active sampling → lower prediction errors at lower costs (saves 100s of samples)
(better prediction → better server assignment decisions → faster downloads)

Active sampling achieves a good exploration vs exploitation trade-off:
reduced decision cost AND information gain
Conclusions

   Common challenge in many applications:
    the need for cost-efficient sampling
   This talk: linear hidden-factor models with active sampling
   Active sampling improves predictive accuracy while keeping
    sampling complexity low in a wide variety of applications
   Future work:
      Better active sampling heuristics?
      Theoretical analysis of active sampling performance?
      Dynamic matrix factorizations: tracking time-varying matrices
      Incremental MMMF? (solving from scratch every time is too costly)
References

Some of the most influential papers:

•   Simon Tong, Daphne Koller. Support Vector Machine Active Learning with
    Applications to Text Classification. Journal of Machine Learning Research,
    vol. 2, pp. 45-66, 2001.
•   Y. Freund, H. S. Seung, E. Shamir, N. Tishby. Selective sampling using the
    query by committee algorithm. Machine Learning, 28:133-168, 1997.
•   David Cohn, Zoubin Ghahramani, Michael I. Jordan. Active learning with
    statistical models. Journal of Artificial Intelligence Research, 4:129-145, 1996.
•   David Cohn, Les Atlas, Richard Ladner. Improving generalization with active
    learning. Machine Learning, 15(2):201-221, 1994.
•   D. J. C. MacKay. Information-Based Objective Functions for Active Data
    Selection. Neural Computation, 4(4):590-604, 1992.
NIPS papers

•   Francis Bach. Active learning for misspecified generalized linear models. NIPS-06
•   Ran Gilad-Bachrach, Amir Navot, Naftali Tishby. Query by Committee Made Real. NIPS-05
•   Brent Bryan, Jeff Schneider, Robert Nichol, Christopher Miller, Christopher Genovese,
    Larry Wasserman. Active Learning For Identifying Function Threshold Boundaries. NIPS-05
•   Rui Castro, Rebecca Willett, Robert Nowak. Faster Rates in Regression via Active Learning. NIPS-05
•   Sanjoy Dasgupta. Coarse sample complexity bounds for active learning. NIPS-05
•   Masashi Sugiyama. Active Learning for Misspecified Models. NIPS-05
•   Brigham Anderson, Andrew Moore. Fast Information Value for Graphical Models. NIPS-05
•   Dan Pelleg, Andrew W. Moore. Active Learning for Anomaly and Rare-Category Detection. NIPS-04
•   Sanjoy Dasgupta. Analysis of a greedy active learning strategy. NIPS-04
•   T. Jaakkola, H. Siegelmann. Active Information Retrieval. NIPS-01
•   M. K. Warmuth et al. Active Learning in the Drug Discovery Process. NIPS-01
•   Jonathan D. Nelson, Javier R. Movellan. Active Inference in Concept Learning. NIPS-00
•   Simon Tong, Daphne Koller. Active Learning for Parameter Estimation in Bayesian Networks. NIPS-00
•   Thomas Hofmann, Joachim M. Buhmann. Active Data Clustering. NIPS-97
•   K. Fukumizu. Active Learning in Multilayer Perceptrons. NIPS-95
•   Anders Krogh, Jesper Vedelsby. Neural Network Ensembles, Cross Validation, and
    Active Learning. NIPS-94
•   Kah Kay Sung, Partha Niyogi. Active Learning for Function Approximation. NIPS-94
•   David Cohn, Zoubin Ghahramani, Michael I. Jordan. Active Learning with Statistical
    Models. NIPS-94
•   Sebastian B. Thrun, Knut Moller. Active Exploration in Dynamic Environments. NIPS-91
ICML papers

•   Maria-Florina Balcan, Alina Beygelzimer, John Langford. Agnostic Active Learning. ICML-06
•   Steven C. H. Hoi, Rong Jin, Jianke Zhu, Michael R. Lyu. Batch Mode Active Learning and
    Its Application to Medical Image Classification. ICML-06
•   Sriharsha Veeramachaneni, Emanuele Olivetti, Paolo Avesani. Active Sampling for
    Detecting Irrelevant Features. ICML-06
•   Kai Yu, Jinbo Bi, Volker Tresp. Active Learning via Transductive Experimental Design. ICML-06
•   Rohit Singh, Nathan Palmer, David Gifford, Bonnie Berger, Ziv Bar-Joseph. Active Learning for
    Sampling in Time-Series Experiments With Application to Gene Expression Analysis. ICML-05
•   Prem Melville, Raymond Mooney. Diverse Ensembles for Active Learning. ICML-04
•   Klaus Brinker. Active Learning of Label Ranking Functions. ICML-04
•   Hieu Nguyen, Arnold Smeulders. Active Learning Using Pre-clustering. ICML-04
•   Greg Schohn, David Cohn. Less is More: Active Learning with Support Vector Machines. ICML-00
•   Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to
    Text Classification. ICML-00

COLT papers

•   S. Dasgupta, A. Kalai, C. Monteleoni. Analysis of perceptron-based active learning. COLT-05
•   H. S. Seung, M. Opper, H. Sompolinsky. Query by committee. COLT-92, pages 287-294
Journal papers

•   Antoine Bordes, Seyda Ertekin, Jason Weston, Leon Bottou. Fast Kernel Classifiers with
    Online and Active Learning. Journal of Machine Learning Research, vol. 6, pp. 1579-1619, 2005.
•   Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to
    Text Classification. Journal of Machine Learning Research, vol. 2, pp. 45-66, 2001.
•   Y. Freund, H. S. Seung, E. Shamir, N. Tishby. Selective sampling using the query by
    committee algorithm. Machine Learning, 28:133-168, 1997.
•   David Cohn, Zoubin Ghahramani, Michael Jordan. Active learning with statistical models.
    Journal of Artificial Intelligence Research, 4:129-145, 1996.
•   David Cohn, Les Atlas, Richard Ladner. Improving generalization with active learning.
    Machine Learning, 15(2):201-221, 1994.
•   D. J. C. MacKay. Information-Based Objective Functions for Active Data Selection.
    Neural Computation, 4(4):590-604, 1992.
•   Haussler, D., Kearns, M., Schapire, R. E. Bounds on the sample complexity of Bayesian
    learning using information theory and the VC dimension. Machine Learning, 14:83-113, 1994.
•   Fedorov, V. V. Theory of optimal experiments. Academic Press, 1972.
•   Saar-Tsechansky, M., F. Provost. Active Sampling for Class Probability Estimation and
    Ranking. Machine Learning, 54(2):153-178, 2004.
Workshops

• http://domino.research.ibm.com/comm/researc
Appendix
Active Learning of Bayesian Networks

Entropy Function
[Shannon, 1948]

• A measure of information in a random event X with possible
  outcomes {x1, …, xn}:

     H(X) = −Σi p(xi) log2 p(xi)

• Comments on the entropy function:
   – Entropy of an event is zero when the outcome is known
   – Entropy is maximal when all outcomes are equally likely
• The average minimum number of yes/no questions needed to
  determine the outcome (connection to binary search)
Kullback-Leibler divergence

• P is the true distribution; distribution Q is used to encode
  data instead of P
• KL divergence is the expected extra message length per
  datum that must be transmitted using Q:

     DKL(P || Q) = Σi P(xi) log (P(xi)/Q(xi))
                 = Σi P(xi) log P(xi) − Σi P(xi) log Q(xi)
                 = H(P, Q) − H(P)
                 = cross-entropy − entropy

• A measure of how "wrong" Q is with respect to the true
  distribution P
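A quick numeric check of both definitions in Python (numpy assumed):

    import numpy as np

    def entropy(p):
        return -(p * np.log2(p)).sum()

    def kl(p, q):
        return (p * np.log2(p / q)).sum()

    p = np.array([0.5, 0.25, 0.25])     # true distribution P
    q = np.array([1/3, 1/3, 1/3])       # coding distribution Q
    print(entropy(p))                   # 1.5 bits
    print(kl(p, q))                     # ~0.085 bits of extra message length
    cross = -(p * np.log2(q)).sum()     # cross-entropy H(P, Q)
    print(cross - entropy(p))           # same ~0.085: cross-entropy - entropy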
Learning Bayesian Networks

   Data + prior knowledge → Learner → Bayes net

   [Figure: a learned network over nodes E, B, A, R, C, with a
    CPT for A:]

      E    B    P(A | E,B)   P(¬A | E,B)
      e    b    .9           .1
      e    ¬b   .7           .3
      ¬e   b    .8           .2
      ¬e   ¬b   .99          .01

   • Model building
      • Parameter estimation
      • Causal structure discovery
   • Passive learning vs active learning
Active Learning

• Selective active learning
• Interventional active learning

• Obtain a measure of quality of the current model
• Choose the query that most improves quality
• Update the model
Active Learning: Parameter Estimation
[Tong & Koller, NIPS-2000]

• Given a BN structure G
• A prior distribution p(θ)
• The learner requests a particular instantiation q (query);
  the expert responds with x

   [Figure: initial network (G, p(θ)) plus training data → active
    learner issues query q, receives response x → updated
    distribution p′(θ)]

Two questions:
• How to update the parameter density
• How to select the next query based on p
Updating parameter density

• Do not update A, since we are fixing it
• If we select A, then do not update B
   • Sampling from P(B | A = a) ≠ P(B)
• If we force A, then we can update B
   • Sampling from P(B | A := a) = P(B)*
• Update all other nodes as usual
• Obtain the new density p(θ | A = a, X = x)

   [Figure: a small network over nodes A, B, J, M]

                                    *Pearl 2000
Bayesian point estimation

• Goal: a single estimate θ̃
   – instead of a distribution p over θ

   [Figure: density p(θ) with the estimate θ̃ and the true
    model θ' marked]

• If we choose θ̃ and the true model is θ',
  then we incur some loss, L(θ' || θ̃)
Bayesian point estimation

• We do not know the true θ'
• The density p represents our beliefs over θ'
• Choose θ̃ that minimizes the expected loss:

     θ̃ = argminθ ∫ p(θ') L(θ' || θ) dθ'

• Call θ̃ the Bayesian point estimate
• Use the expected loss of the Bayesian point estimate
  as a measure of quality of p(θ):
   – Risk(p) = ∫ p(θ') L(θ' || θ̃) dθ'
The Querying component

• Set the controllable variables so as to minimize
  the expected posterior risk:

     ExPRisk(p | Q = q) = Σx P(X = x | Q = q) ∫ p(θ | x) KL(θ || θ̃) dθ

• KL divergence will be used as the loss
  (conditional KL-divergence):

     KL(θ || θ') = Σi KL(Pθ(Xi | Ui) || Pθ'(Xi | Ui))
Algorithm Summary

• For each potential query q
• Compute ∆Risk(X | q)
• Choose the q for which ∆Risk(X | q) is greatest
   – Cost of computing ∆Risk(X | q):
     the cost of Bayesian network inference
• Complexity: O(|Q| · cost of inference)
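A skeletal version of this selection loop in Python; candidate_queries
and delta_risk (which would hide the Bayesian network inference) are
hypothetical stand-ins:

    def choose_query(candidate_queries, delta_risk):
        """Pick the query with the largest expected risk reduction.

        delta_risk(q) returns Delta-Risk(X | q); evaluating it once costs
        one round of BN inference, so the loop is O(|Q| * inference).
        """
        return max(candidate_queries, key=delta_risk)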
Uncertainty sampling

Maintain a single hypothesis, based on the labels seen so far.
Query the point about which this hypothesis is most "uncertain".

Problem: the confidence of a single hypothesis may not accurately
represent the true diversity of opinion in the hypothesis class.

   [Figure: a mixed cloud of + and − points; the single
    hypothesis's most uncertain point X is not necessarily where
    the hypothesis class disagrees]
Region of uncertainty

Current version space: the portion of H consistent with the labels so far.
"Region of uncertainty" = the part of data space about which there is
still some uncertainty (i.e. disagreement within the version space).

   [Figure: data lies on a circle in R2; hypotheses are linear
    separators (spaces X, H superimposed). The current version
    space cuts out a region of uncertainty in data space.]
Region of uncertainty

Algorithm [CAL92]:
of the unlabeled points which lie in the region of uncertainty,
pick one at random to query.

   [Figure: data and hypothesis spaces superimposed — both are the
    surface of the unit sphere in Rd — with the current version
    space and the corresponding region of uncertainty in data space]
Region of uncertainty

The number of labels needed depends on H and also on P.

Special case: H = {linear separators in Rd}, P = uniform
distribution over the unit sphere.

Then: just d log 1/ε labels are needed to reach a hypothesis
with error rate < ε.

[1] Supervised learning: d/ε labels.
[2] The best we can hope for.
Region of uncertainty

Algorithm [CAL92]:
of the unlabeled points which lie in the region of uncertainty,
pick one at random to query.

For more general distributions: suboptimal…

   Need to measure the quality of a query – or alternatively,
   the size of the version space.
   [Figure: expected infogain of a sample reduces to…
    uncertainty sampling!]

SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 

Active learning lecture

  • 20. Notation. We have: a dataset D, a model parameter space W, and a query algorithm q.
  • 21. Dataset (D) example:
        t  Sex  Age    Test A  Test B  Test C  Disease
        0  M    40-50  0       1       1       ?
        1  F    50-60  0       1       0       ?
        2  F    30-40  0       0       0       ?
        3  F    60+    1       1       1       ?
        4  M    10-20  0       1       0       ?
        5  M    40-50  0       0       1       ?
        6  F    0-10   0       0       0       ?
        7  M    30-40  1       1       0       ?
        8  M    20-30  0       0       1       ?
  • 22. Notation (repeated). We have: a dataset D, a model parameter space W, and a query algorithm q.
  • 23. Model example: a probabilistic classifier with structure St → Ot. Notation: T is the number of examples; Ot is the vector of features of example t; St is the class of example t.
  • 24. Model example: patient state St = DiseaseState; patient observations Ot = (Ot1: Gender, Ot2: Age, Ot3: TestA, Ot4: TestB, Ot5: TestC).
  • 25. Possible model structures (figure: two candidate network structures over S, Gender, Age, TestA, TestB, TestC).
  • 26. Model space. Model: St → Ot. Model parameters: P(St) and P(Ot | St). Generative model: we must be able to compute P(St = i, Ot = ot | w).
  • 27. Model parameter space (W). W is the space of possible parameter values, with a prior on parameters P(W). Posterior over models: P(W | D) ∝ P(D | W) P(W) ∝ [∏_{t=1..T} P(St, Ot | W)] P(W).
  • 28. Notation. We have: a dataset D, a model parameter space W, and a query algorithm q, where q(W, D) returns t*, the next sample to label.
  • 29. The game: while not done, learn P(W | D); q chooses the next example to label; the expert adds the label to D. (A minimal sketch of this loop appears below.)
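A minimal pool-based sketch of this loop in Python; the helpers fit_posterior, q, and oracle are hypothetical placeholders, not from the slides:

    def active_learning_loop(pool, oracle, q, fit_posterior, budget):
        """Pool-based active learning: label one example per round."""
        D = []  # labeled dataset D, grows one (example, label) pair per round
        for _ in range(budget):
            posterior = fit_posterior(D)        # learn P(W | D)
            t_star = q(posterior, pool)         # q chooses the next example
            pool.remove(t_star)
            D.append((t_star, oracle(t_star)))  # the expert adds the label to D
        return fit_posterior(D)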
  • 30. Simulation (figure: a pool of examples with observations O1…O7 and classes S1…S7; a few classes have been revealed as true or false, the rest are still '?', and q must choose which '?' to query next).
  • 31. Active learning flavors: pool (“random access” to patients) vs. sequential (must decide as patients walk in the door).
  • 32. What should q be? Recall that q(W, D) returns the “most interesting” unlabeled example. Well, what makes a doctor curious about a patient?
  • 33. (Figure slide: a 1994 paper.)
  • 34. Score function: score_uncert(St) = uncertainty(P(St | Ot)) = H(St) = −Σᵢ P(St = i) log P(St = i).
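The entropy score is easy to compute; a small illustrative Python version (the example tables on the next slides appear to use base-10 logs, but the base only rescales the scores and does not change the ranking):

    import numpy as np

    def score_uncertainty(p):
        """H(St) of the predicted class distribution p (1-D array summing to 1)."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                       # treat 0 * log 0 as 0
        return -np.sum(p * np.log10(p))

    # Query the unlabeled example whose prediction is most uncertain, e.g.:
    # t_star = max(unlabeled, key=lambda t: score_uncertainty(predict_proba(t)))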
  • 35. Uncertainty-sampling example:
        t  Sex  Age    Test A  Test B  Test C  St             P(St)  H(St)
        1  M    20-30  0       1       1       ?              0.02   0.043
        2  F    20-30  0       1       0       ?              0.01   0.024
        3  F    30-40  1       0       0       ?              0.05   0.086
        4  F    60+    1       1       0       ? → FALSE      0.12   0.159
        5  M    10-20  0       1       0       ?              0.01   0.024
        6  M    20-30  1       1       1       ?              0.96   0.073
        (Row 4 has the highest entropy, so it is queried; its label comes back FALSE.)
  • 36. Uncertainty-sampling example, after updating on the new label:
        t  Sex  Age    Test A  Test B  Test C  St             P(St)  H(St)
        1  M    20-30  0       1       1       ?              0.01   0.024
        2  F    20-30  0       1       0       ?              0.02   0.043
        3  F    30-40  1       0       0       ?              0.04   0.073
        4  F    60+    1       1       0       FALSE          0.00   0.00
        5  M    10-20  0       1       0       ? → TRUE       0.06   0.112
        6  M    20-30  1       1       1       ?              0.97   0.059
        (Now row 5 is most uncertain and is queried next.)
  • 37. Uncertainty sampling. Good: couldn’t be easier; often performs pretty well. Bad: H(St) measures information gain about the samples, not about the model, and it is sensitive to noisy samples.
  • 38. Can we do better than uncertainty sampling?
  • 39. (Figure slide: a 1992 paper.) Informative with respect to what?
  • 40. Model entropy (figure: three posteriors P(W | D) over W, from diffuse, H(W) high, through an intermediate "better" case, to a point mass with H(W) = 0).
  • 41. Information gain: choose the example that is expected to most reduce H(W); i.e., maximize H(W) − H(W | St), the current model-space entropy minus the expected model-space entropy if we learn St.
  • 42. Score function: score_IG(St) = MI(St; W) = H(W) − H(W | St).
  • 43. We usually can’t integrate over all models to compute these entropies exactly: H(W) = −∫ P(w) log P(w) dw. But we can sample a committee C from P(W | D): H(W) ≈ H(C) = −Σ_{c∈C} P(c) log P(c).
  • 44. Conditional model entropy: H(W) = −∫ P(w) log P(w) dw; H(W | St = i) = −∫ P(w | St = i) log P(w | St = i) dw; H(W | St) = −Σᵢ P(St = i) ∫ P(w | St = i) log P(w | St = i) dw.
  • 45. Score function (committee approximation): score_IG(St) = H(C) − H(C | St). (A sketch of this estimate appears below.)
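One way this score is commonly estimated, sketched here under the assumption that we can draw a committee of models from P(W | D) and get per-model class probabilities; it uses the symmetric form H(St) − H(St | C) derived on slide 47 (array shapes are illustrative):

    import numpy as np

    def entropy(p, axis=-1):
        p = np.clip(p, 1e-12, 1.0)
        return -np.sum(p * np.log10(p), axis=axis)

    def score_infogain(committee_probs):
        """I(St; C) = H(St) - H(St | C), estimated from a sampled committee.

        committee_probs: shape (n_models, n_classes); row j holds
        P(St | Ot, w_j) for committee member w_j."""
        committee_probs = np.asarray(committee_probs, dtype=float)
        consensus = committee_probs.mean(axis=0)         # committee-averaged P(St)
        h_marginal = entropy(consensus)                  # H(St)
        h_conditional = entropy(committee_probs).mean()  # H(St | C), mean over members
        return h_marginal - h_conditional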
  • 46. Information-gain example (reconstructed table):
        t  Sex  Age    Test A  Test B  Test C  St  P(St)  Score = H(C) − H(C|St)
        1  M    20-30  0       1       1       ?   0.02   0.53
        2  F    20-30  0       1       0       ?   0.01   0.58
        3  F    30-40  1       0       0       ?   0.05   0.40
        4  F    60+    1       1       1       ?   0.12   0.49
        5  M    10-20  0       1       0       ?   0.01   0.57
        6  M    20-30  0       0       1       ?   0.02   0.52
  • 47. Score function: score_IG(St) = H(C) − H(C | St) = H(St) − H(St | C). Familiar?
  • 48. Uncertainty sampling vs. information gain: score_Uncertain(St) = H(St); score_InfoGain(St) = H(St) − H(St | C).
  • 49. But there is a problem…
  • 50. If our objective is to reduce the prediction error, then “the expected information gain of an unlabeled sample is NOT a sufficient criterion for constructing good queries”.
  • 51. Strategy #2: Query by Committee. Temporary assumptions: pool → sequential; P(W | D) → version space; probabilistic → noiseless. QBC attacks the size of the “version space”.
  • 52. (Figure: two models drawn from the version space both classify the incoming example FALSE; no disagreement, nothing to learn.)
  • 53. (Figure: both models classify it TRUE; again no disagreement.)
  • 54. (Figure: Model #1 says FALSE, Model #2 says TRUE. Ooh, now we’re going to learn something for sure! One of them is definitely wrong.)
  • 55. The original QBC algorithm. As each example arrives: choose a committee C (usually of size 2) randomly from the version space; have each member of C classify it; if the committee disagrees, select it. (A sketch appears below.)
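A few lines of Python capture the rule, assuming hypotheses are callables returning ±1 and that the version space can be sampled as a list:

    import random

    def qbc_select(x, version_space, committee_size=2):
        """Original QBC: query x iff a random committee disagrees on it."""
        committee = random.sample(version_space, committee_size)
        labels = {h(x) for h in committee}
        return len(labels) > 1   # True means: ask for the label of x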
  • 56. (Figure slide: a 1992 paper.)
  • 57. Infogain vs. Query by Committee [Seung, Opper, Sompolinsky, 1992; Freund, Seung, Shamir, Tishby, 1997]. First idea: try to rapidly reduce the volume of the version space? Problem: this doesn’t take the data distribution into account. Which pair of hypotheses in H is closest depends on the data distribution P; the natural distance measure on H is d(h, h′) = P(h(x) ≠ h′(x)).
  • 58. Query by committee (figure). First idea: rapidly reduce the volume of the version space? Problem: it ignores the data distribution. To keep things simple, say d(h, h′) is Euclidean distance; then the error is likely to remain large!
  • 59. Query by committee: an elegant scheme that decreases the volume in a manner sensitive to the data distribution. Bayesian setting: given a prior π on H, set H1 = H. For t = 1, 2, …:
        receive an unlabeled point xt drawn from P;
        choose two hypotheses h, h′ randomly from (π, Ht);
        if h(xt) ≠ h′(xt): ask for xt’s label;
        set Ht+1.
        [Informally: is there a lot of disagreement about xt in Ht?] Problem: how to implement this efficiently?
  • 60. Query by committee, continued. For t = 1, 2, …: receive an unlabeled xt drawn from P; choose two hypotheses h, h′ randomly from (π, Ht); if h(xt) ≠ h′(xt), ask for xt’s label and set Ht+1. Observation: the probability of getting the pair (h, h′) in the inner loop (when a query is made) is proportional to π(h) π(h′) d(h, h′).
  • 62. Query by committee, label bound: for H = {linear separators in R^d} and P = uniform, just d log(1/ε) labels suffice to reach a hypothesis with error < ε. Implementation: need to randomly pick h according to (π, Ht); e.g., for H = {linear separators in R^d} with uniform π, how do you pick a random point from a convex body (the version space Ht)?
  • 63. Sampling from convex bodies: by random walk! E.g., the ball walk or hit-and-run. [Gilad-Bachrach, Navot, Tishby, 2005] studies such random walks and also ways to kernelize QBC. (A hit-and-run sketch appears below.)
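A hit-and-run sketch for a version space given by linear constraints; for linear separators the rows of A could be -y_i x_i with b = 0, intersected with a box so the body is bounded. This is an illustration under those assumptions, not the kernelized algorithm of the cited paper:

    import numpy as np

    def hit_and_run(A, b, w0, n_steps, seed=0):
        """Random walk inside the convex body {w : A w <= b}.
        w0 must be strictly interior and the body must be bounded."""
        rng = np.random.default_rng(seed)
        w = np.asarray(w0, dtype=float)
        for _ in range(n_steps):
            u = rng.standard_normal(w.shape)
            u /= np.linalg.norm(u)                  # uniform random direction
            au, slack = A @ u, b - A @ w            # constraints on step size t
            t_hi = np.min(slack[au > 0] / au[au > 0])
            t_lo = np.max(slack[au < 0] / au[au < 0])
            w = w + rng.uniform(t_lo, t_hi) * u     # uniform point on the chord
        return w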
  • 65. Some challenges: [1] for linear separators, analyze the label complexity for some distribution other than uniform; [2] how to handle nonseparable data? We need a robust base learner (figure: noisy '+' and '−' points around the true boundary).
  • 67. Approach: Collaborative Prediction (CP). (Figure: two sparsely observed matrices; a QoS measure such as bandwidth over Client1…Client4 × Server1…Server3, and movie ratings for Alina, Gerry, Irina, Raja × Matrix, Geisha, Shrek.) Given previously observed ratings R(x, y), where x is a “user” and y is a “product”, predict the unobserved ratings: will Alina like “The Matrix”? (unlikely); will Client 86 have a fast download from Server 39?; will member X of a funding committee approve our project Y?
  • 68. Collaborative prediction = matrix approximation (figure: a 100-clients × 100-servers matrix). Important assumption: matrix entries are NOT independent, e.g., similar users have similar tastes. Approaches: mainly factorized models assuming hidden ‘factors’ that affect ratings (pLSA, MCVQ, SVD, NMF, MMMF, …).
  • 69. (Figure: a ratings row [2 4 5 1 4 2] decomposed into a user’s weights over hidden factors and the factors associated with each movie.) Assumptions: there is a number of (hidden) factors behind user preferences that relate to (hidden) movie properties; movies have intrinsic values associated with such factors; users have intrinsic weights over such factors, so user ratings are weighted (linear) combinations of the movies’ factor values.
  • 70–71. (Figure slides: the same ratings row [2 4 5 1 4 2], shown again as animation steps of the factor decomposition.)
  • 72. (Figure: a partially observed matrix Y approximated by a rank-k product X = UV′.) Objective: find a factorizable X = UV′ that approximates Y, X = argmin_{X′} Loss(X′, Y), subject to some “regularization” constraints (e.g., rank(X) < k). The loss function depends on the nature of your problem. (A gradient-descent sketch appears below.)
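A toy version of this objective with sum-squared loss and Frobenius-norm regularization; this is a simple stand-in for the bounded-norm constraint, whereas the actual MMMF discussed next uses hinge loss and semi-definite programming:

    import numpy as np

    def factorize(Y, mask, k, lam=0.1, lr=0.01, n_iters=1000, seed=0):
        """Fit X = U V' to the observed entries of Y (where mask == 1)."""
        rng = np.random.default_rng(seed)
        n, m = Y.shape
        U = 0.1 * rng.standard_normal((n, k))
        V = 0.1 * rng.standard_normal((m, k))
        for _ in range(n_iters):
            R = mask * (U @ V.T - Y)      # residuals on observed entries only
            gU = R @ V + lam * U          # gradient w.r.t. U
            gV = R.T @ U + lam * V        # gradient w.r.t. V
            U -= lr * gU
            V -= lr * gV
        return U @ V.T                    # dense prediction matrix X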
  • 73. Matrix factorization approaches. Singular value decomposition (SVD) gives a low-rank approximation, but it assumes a fully observed Y and sum-squared loss. In collaborative prediction, Y is only partially observed, so low-rank approximation becomes a non-convex problem with many local minima. Furthermore, we may not want sum-squared loss but instead: accurate predictions (0/1 loss, approximated by hinge loss); cost-sensitive predictions (missing a good server vs. suggesting a bad one); or a ranking cost (e.g., suggest the k ‘best’ movies for a user); all non-convex problems! Use instead the state-of-the-art Max-Margin Matrix Factorization (MMMF) [Srebro 05]: it replaces the bounded-rank constraint by a bounded norm of the U, V′ vectors, yields a convex optimization problem (solvable exactly by semi-definite programming), and strongly relates to learning max-margin classifiers (SVMs). We exploit MMMF’s properties to augment it with active sampling!
  • 74. Key idea of MMMF (figure): rows are feature vectors, columns are linear classifiers (weight vectors); the “margin” here is the distance from a sample to the separating line. Xij = signij × marginij; the predictor is signij: if signij > 0, classify as +1, otherwise classify as −1.
  • 75. MMMF: simultaneous search for low-norm feature vectors and max-margin classifiers.
  • 76. Active learning with MMMF. We extend MMMF to Active-MMMF using margin-based active sampling, and investigate the exploitation-vs.-exploration trade-offs imposed by different heuristics. (Figure: a matrix of predicted margins in [−1, 1].) Margin-based heuristics: min-margin (most uncertain); min-margin positive (“good” uncertain); max-margin (a ‘safe’ choice, but no info).
  • 77. Active Max-Margin Matrix Factorization, A-MMMF(M, s): (1) given a sparse matrix Y, learn the approximation X = MMMF(Y); (2) using current predictions, actively select the “best” s samples and request their labels (e.g., test a client/server pair via an ‘enforced’ download); (3) add the new samples to Y; (4) repeat 1–3. Issues: can we go beyond simple greedy margin-based heuristics? Theoretical guarantees are not easy with non-trivial learning methods and non-trivial data distributions (any suggestions?). (A selection sketch appears below.)
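A sketch of the margin-based selection step, assuming X holds the signed margins from the current MMMF fit and `observed` is a boolean mask of known entries; the heuristic names follow the previous slide:

    import numpy as np

    def select_queries(X, observed, s, heuristic="min-margin"):
        """Return flat indices of the s unobserved entries to query next."""
        margin = np.abs(X).astype(float)
        if heuristic == "min-margin":             # most uncertain predictions
            score = margin
        elif heuristic == "min-margin-positive":  # uncertain but predicted 'good'
            score = np.where(X > 0, margin, np.inf)
        else:                                     # 'max-margin': safe, little info
            score = -margin
        score = np.where(observed, np.inf, score) # never re-query known entries
        return np.argsort(score, axis=None)[:s]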
  • 78. Empirical results: network latency prediction; bandwidth prediction (peer-to-peer); movie ranking prediction; sensor-net connectivity prediction.
  • 79. Empirical results: latency prediction (figures: P2Psim data and NLANR-AMP data). Active sampling with the most-uncertain (and most-uncertain-positive) heuristics provides consistent improvement over random and least-uncertain-next sampling.
  • 80. Movie rating prediction (MovieLens).
  • 82. Introducing cost: exploration vs. exploitation (figures: DownloadGrid bandwidth prediction; PlanetLab latency prediction). Active sampling → lower prediction errors at lower cost (saves hundreds of samples); better prediction → better server-assignment decisions → faster downloads. Active sampling achieves a good exploration-vs.-exploitation trade-off: reduced decision cost AND information gain.
  • 83. Conclusions. A common challenge in many applications is the need for cost-efficient sampling; this talk covered linear hidden-factor models with active sampling. Active sampling improves predictive accuracy while keeping sampling complexity low in a wide variety of applications. Future work: better active-sampling heuristics; theoretical analysis of active-sampling performance; dynamic matrix factorizations (tracking time-varying matrices); incremental MMMF (solving from scratch every time is too costly).
  • 84. References: some of the most influential papers.
        • Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research, vol. 2, pp. 45–66, 2001.
        • Y. Freund, H. S. Seung, E. Shamir, N. Tishby. Selective Sampling Using the Query by Committee Algorithm. Machine Learning, 28:133–168, 1997.
        • David Cohn, Zoubin Ghahramani, Michael Jordan. Active Learning with Statistical Models. Journal of Artificial Intelligence Research, (4):129–145, 1996.
        • David Cohn, Les Atlas, Richard Ladner. Improving Generalization with Active Learning. Machine Learning, 15(2):201–221, 1994.
        • D. J. C. MacKay. Information-Based Objective Functions for Active Data Selection. Neural Computation, vol. 4, no. 4, pp. 590–604, 1992.
  • 85. NIPS papers:
        • Francis Bach. Active Learning for Misspecified Generalized Linear Models. NIPS-06.
        • Ran Gilad-Bachrach, Amir Navot, Naftali Tishby. Query by Committee Made Real. NIPS-05.
        • Brent Bryan, Jeff Schneider, Robert Nichol, Christopher Miller, Christopher Genovese, Larry Wasserman. Active Learning for Identifying Function Threshold Boundaries. NIPS-05.
        • Rui Castro, Rebecca Willett, Robert Nowak. Faster Rates in Regression via Active Learning. NIPS-05.
        • Sanjoy Dasgupta. Coarse Sample Complexity Bounds for Active Learning. NIPS-05.
        • Masashi Sugiyama. Active Learning for Misspecified Models. NIPS-05.
        • Brigham Anderson, Andrew Moore. Fast Information Value for Graphical Models. NIPS-05.
        • Dan Pelleg, Andrew W. Moore. Active Learning for Anomaly and Rare-Category Detection. NIPS-04.
        • Sanjoy Dasgupta. Analysis of a Greedy Active Learning Strategy. NIPS-04.
        • T. Jaakkola, H. Siegelmann. Active Information Retrieval. NIPS-01.
        • M. K. Warmuth et al. Active Learning in the Drug Discovery Process. NIPS-01.
        • Jonathan D. Nelson, Javier R. Movellan. Active Inference in Concept Learning. NIPS-00.
        • Simon Tong, Daphne Koller. Active Learning for Parameter Estimation in Bayesian Networks. NIPS-00.
        • Thomas Hofmann, Joachim M. Buhmann. Active Data Clustering. NIPS-97.
        • K. Fukumizu. Active Learning in Multilayer Perceptrons. NIPS-95.
        • Anders Krogh, Jesper Vedelsby. Neural Network Ensembles, Cross Validation, and Active Learning. NIPS-94.
        • Kah Kay Sung, Partha Niyogi. Active Learning for Function Approximation. NIPS-94.
        • David Cohn, Zoubin Ghahramani, Michael I. Jordan. Active Learning with Statistical Models. NIPS-94.
        • Sebastian B. Thrun, Knut Moller. Active Exploration in Dynamic Environments. NIPS-91.
  • 86. ICML papers:
        • Maria-Florina Balcan, Alina Beygelzimer, John Langford. Agnostic Active Learning. ICML-06.
        • Steven C. H. Hoi, Rong Jin, Jianke Zhu, Michael R. Lyu. Batch Mode Active Learning and Its Application to Medical Image Classification. ICML-06.
        • Sriharsha Veeramachaneni, Emanuele Olivetti, Paolo Avesani. Active Sampling for Detecting Irrelevant Features. ICML-06.
        • Kai Yu, Jinbo Bi, Volker Tresp. Active Learning via Transductive Experimental Design. ICML-06.
        • Rohit Singh, Nathan Palmer, David Gifford, Bonnie Berger, Ziv Bar-Joseph. Active Learning for Sampling in Time-Series Experiments with Application to Gene Expression Analysis. ICML-05.
        • Prem Melville, Raymond Mooney. Diverse Ensembles for Active Learning. ICML-04.
        • Klaus Brinker. Active Learning of Label Ranking Functions. ICML-04.
        • Hieu Nguyen, Arnold Smeulders. Active Learning Using Pre-clustering. ICML-04.
        • Greg Schohn, David Cohn. Less is More: Active Learning with Support Vector Machines. ICML-00.
        • Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. ICML-00.
        COLT papers:
        • S. Dasgupta, A. Kalai, C. Monteleoni. Analysis of Perceptron-Based Active Learning. COLT-05.
        • H. S. Seung, M. Opper, H. Sompolinsky. Query by Committee. COLT-92, pages 287–294.
  • 87. Journal papers:
        • Antoine Bordes, Seyda Ertekin, Jason Weston, Leon Bottou. Fast Kernel Classifiers with Online and Active Learning. Journal of Machine Learning Research (JMLR), vol. 6, pp. 1579–1619, 2005.
        • Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research, vol. 2, pp. 45–66, 2001.
        • Y. Freund, H. S. Seung, E. Shamir, N. Tishby. Selective Sampling Using the Query by Committee Algorithm. Machine Learning, 28:133–168, 1997.
        • David Cohn, Zoubin Ghahramani, Michael Jordan. Active Learning with Statistical Models. Journal of Artificial Intelligence Research, (4):129–145, 1996.
        • David Cohn, Les Atlas, Richard Ladner. Improving Generalization with Active Learning. Machine Learning, 15(2):201–221, 1994.
        • D. J. C. MacKay. Information-Based Objective Functions for Active Data Selection. Neural Computation, vol. 4, no. 4, pp. 590–604, 1992.
        • D. Haussler, M. Kearns, R. E. Schapire. Bounds on the Sample Complexity of Bayesian Learning Using Information Theory and the VC Dimension. Machine Learning, 14:83–113, 1994.
        • V. V. Fedorov. Theory of Optimal Experiments. Academic Press, 1972.
        • M. Saar-Tsechansky, F. Provost. Active Sampling for Class Probability Estimation and Ranking. Machine Learning, 54(2):153–178, 2004.
  • 89. Appendix.
  • 90. Active Learning of Bayesian Networks
  • 91. Entropy function: a measure of the information in a random event X with possible outcomes {x1, …, xn}: H(X) = −Σᵢ p(xᵢ) log₂ p(xᵢ). Comments: the entropy of an event is zero when the outcome is known, and maximal when all outcomes are equally likely; it is the average minimum number of yes/no questions needed to determine the outcome (connection to binary search) [Shannon, 1948].
  • 92. Kullback-Leibler divergence. P is the true distribution; Q is the distribution used to encode data instead of P. The KL divergence is the expected extra message length per datum that must be transmitted using Q: D_KL(P || Q) = Σᵢ P(xᵢ) log(P(xᵢ)/Q(xᵢ)) = −Σᵢ P(xᵢ) log Q(xᵢ) + Σᵢ P(xᵢ) log P(xᵢ) = H(P, Q) − H(P) = cross-entropy − entropy. It is a measure of how “wrong” Q is with respect to the true distribution P.
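In Python, a direct transcription of the definition (terms with P(xi) = 0 contribute nothing by convention):

    import numpy as np

    def kl_divergence(p, q):
        """D_KL(P || Q) = sum_i P(x_i) * log(P(x_i) / Q(x_i))."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        support = p > 0
        return np.sum(p[support] * np.log(p[support] / q[support]))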
  • 93. Learning Bayesian networks (figure: data + prior knowledge → learner → a network over E, B, R, A, C, with a conditional probability table for P(A | E, B)):
        E    B    P(A | E,B)
        e    b    .9    .1
        e    ¬b   .7    .3
        ¬e   b    .8    .2
        ¬e   ¬b   .99   .01
        Tasks: model building; parameter estimation; causal structure discovery; passive learning vs. active learning.
  • 94. Active learning: selective active learning vs. interventional active learning. Loop: obtain a measure of the quality of the current model; choose the query that most improves quality; update the model.
  • 95. Active learning: parameter estimation [Tong & Koller, NIPS-2000]. Given a BN structure G and a prior distribution p(θ), the learner requests a particular instantiation q (a query). (Figure: the initial network G, p(θ) → the active learner issues query q, receives response x, and produces an updated distribution p′(θ) from the training data.) Two questions: how to update the parameter density, and how to select the next query based on p.
  • 96. Updating the parameter density (figure: a network over A, B, J, M). Do not update the node A itself, since we are fixing it. If we merely select on A, then do not update B: sampling from P(B | A = a) ≠ P(B). If we force A (an intervention), then we can update B: sampling from P(B | A := a) = P(B)* (*Pearl 2000). Update all other nodes as usual, obtaining the new density p(θ | A = a, X = x).
  • 97. Bayesian point estimation. Goal: a single estimate θ̃ instead of a distribution p over θ. (Figure: a density p(θ) with θ̃ and θ′ marked.) If we choose θ̃ and the true model is θ′, then we incur some loss L(θ′ || θ̃).
  • 98. Bayesian point estimation. We do not know the true θ′; the density p represents our beliefs over θ′. Choose the θ that minimizes the expected loss: θ̃ = argmin_θ ∫ p(θ′) L(θ′ || θ) dθ′. Call θ̃ the Bayesian point estimate, and use its expected loss as a measure of the quality of p(θ): Risk(p) = ∫ p(θ′) L(θ′ || θ̃) dθ′. (A sketch appears below.)
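A brute-force sketch of this estimate, assuming we can draw posterior samples and scan a candidate grid; both argument lists and the loss function are placeholders for illustration:

    import numpy as np

    def bayes_point_estimate(posterior_samples, candidates, loss):
        """Approximate argmin over candidates of E_{theta' ~ p} L(theta' || theta)."""
        risks = [np.mean([loss(tp, t) for tp in posterior_samples])
                 for t in candidates]
        best = int(np.argmin(risks))
        return candidates[best], risks[best]   # the estimate and Risk(p)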
  • 99. The querying component. Set the controllable variables so as to minimize the expected posterior risk: ExPRisk(p | Q = q) = Σₓ P(X = x | Q = q) ∫ p(θ | x) KL(θ || θ̃) dθ. KL divergence is used for the loss, via the conditional KL divergence: KL(θ || θ′) = Σᵢ KL(P_θ(Xᵢ | Uᵢ) || P_θ′(Xᵢ | Uᵢ)).
  • 100. Algorithm summary: for each potential query q, compute ΔRisk(X | q); choose the q for which ΔRisk(X | q) is greatest. The cost of computing ΔRisk(X | q) is the cost of Bayesian-network inference, so the overall complexity is O(|Q| · cost of inference). (A sketch appears below.)
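The outer loop is simple once inference is available; here p_outcome and posterior_risk are hypothetical stand-ins for the Bayesian-network inference calls, not functions from the cited paper:

    def choose_query(queries, outcomes, p_outcome, posterior_risk, current_risk):
        """Pick the query with the largest expected risk reduction."""
        def expected_posterior_risk(q):
            return sum(p_outcome(x, q) * posterior_risk(q, x) for x in outcomes)
        return max(queries, key=lambda q: current_risk - expected_posterior_risk(q))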
  • 101. Uncertainty sampling: maintain a single hypothesis based on the labels seen so far, and query the point about which this hypothesis is most “uncertain”. Problem: the confidence of a single hypothesis may not accurately represent the true diversity of opinion in the hypothesis class (figure: a labeled '+'/'−' sample where the single hypothesis is confidently wrong near the point X).
  • 103. Region of uncertainty. Current version space: the portion of H consistent with the labels so far. The “region of uncertainty” is the part of the data space about which there is still some uncertainty, i.e., disagreement within the version space. (Figure: data on a circle in R², hypotheses are linear separators; the spaces X and H are superimposed, with the current version space and the region of uncertainty in data space marked.)
  • 104. Region of uncertainty, algorithm [CAL92]: of the unlabeled points that lie in the region of uncertainty, pick one at random to query. (Figure: data and hypothesis spaces superimposed, both the surface of the unit sphere in R^d, with the region of uncertainty in data space marked.)
  • 105. Region of uncertainty. The number of labels needed depends on H and also on P. Special case: H = {linear separators in R^d}, P = uniform over the unit sphere. Then just d log(1/ε) labels are needed to reach a hypothesis with error rate < ε (compare supervised learning, which needs d/ε labels; this rate is the best we can hope for).
  • 106. Region of uncertainty, algorithm [CAL92] continued: for more general distributions this random choice is suboptimal. We need to measure the quality of a query, or alternatively, the size of the version space.
  • 107. Expected infogain of a sample: uncertainty sampling!

Editor's Notes

  1–6, 15. Explain why this isn’t pathological: issue of margins.
  7. New properties of our application vs. standard CF: dynamically changing ratings; possibilities for active sampling; constraints on resources; the absence of a rating may mean that a server was deliberately NOT selected before (?). Future improvements: use client and server ‘features’ (IP etc.).
  8. Red box at top left: which data were actually used in the experiments? Color version; histograms.
  9. Examples: movies and dGrid. Motivate active learning in CF scenarios: minimize the number of questions/downloads while maximizing their ‘informativeness’. Incremental algorithm: future work.
  10–13. Error bars; sparsity for those matrices; % of samples added on the X axis. For the future: the effect of the initial training set on active learning.
  14. When you don’t have enough data to even fit a hypothesis, should you trust its confidence judgements?