Introduction to Active Learning


1. Introduction to Active Learning. Alexey Voropaev
2. Mail.Ru has its own search engine: it is not a «patched Google» but entirely our own product, with 8.3% of the market. The same engine is used for web, image, video, news, real-time, and e-mail search, etc.
3. We love Machine Learning :) Simplified search engine architecture: ML is used at every stage of the pipeline (Web → Crawler → Analyzer → Indexer → Searchers → Frontend), including StaticRank for crawling priority, antispam, antiporn, object extraction, user behaviour analysis, ranking, linguistics, query processing, antirobot, and deciding when to show vertical search.
4. What is Supervised Machine Learning? «Machine learning is the hot new thing» (John Hennessy, President of Stanford).
5. Supervised Machine Learning. There is an unknown function y = f(x). Given: a set of samples (x_i, y_i) with y_i = f(x_i). Goal: build an approximation of f using those samples.
6. Take good features: in case A the problem is impossible to solve, in case B it is possible, even with a linear model.
7. Proper model selection. Image from: Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
8. Importance of train set construction. Usually: random sampling is not the best strategy, and labeling is very expensive.
9. Information about train set construction in modern ML books: Pattern Recognition and Machine Learning: nothing; Elements of Statistical Learning: nothing; An Introduction to Information Retrieval: one paragraph. A book devoted to train set construction: Burr Settles, Active Learning (publication date: 2 July 2012).
10. Idea of Active Learning. Image from: Burr Settles. Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
11. Active learning workflow. Main assumption: obtaining an unlabeled instance is free. Image from: Burr Settles. Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
12. Sampling scenarios. Image from: Burr Settles. Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
13. Uncertainty Sampling. Take the instances about which the model is least certain how to label. Least confident: x* = argmax_x [1 - P(ŷ | x)], where ŷ is the most probable label. Margin sampling: x* = argmin_x [P(ŷ1 | x) - P(ŷ2 | x)], where ŷ1 and ŷ2 are the two most probable labels. Entropy: x* = argmax_x -Σ_i P(y_i | x) log P(y_i | x).
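A minimal sketch of the three measures, assuming a probabilistic classifier whose class probabilities for a pool of unlabeled instances are already available (the array and function names below are illustrative, not from the slides):

```python
import numpy as np

def uncertainty_scores(proba, strategy="entropy"):
    """Score a pool of instances by predictive uncertainty.

    proba: array of shape (n_instances, n_classes) with class probabilities.
    Higher score = more uncertain = better candidate for labeling.
    """
    proba = np.asarray(proba)
    if strategy == "least_confident":
        # 1 - P(most probable label | x)
        return 1.0 - proba.max(axis=1)
    if strategy == "margin":
        # A small margin between the two top labels means high uncertainty,
        # so negate it to keep "higher = more uncertain".
        part = np.sort(proba, axis=1)
        return -(part[:, -1] - part[:, -2])
    if strategy == "entropy":
        # -sum_i P(y_i | x) log P(y_i | x)
        return -(proba * np.log(proba + 1e-12)).sum(axis=1)
    raise ValueError(strategy)

# Example: pick the single most uncertain instance from a pool.
pool_proba = np.array([[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]])
query_idx = int(np.argmax(uncertainty_scores(pool_proba, "margin")))
print(query_idx)  # -> 1, the instance with the smallest margin
```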
14. Query-By-Committee (illustration).
15. Query-By-Committee (illustration, continued).
16. Query-By-Committee (illustration, continued).
17. Query-By-Committee. Measures of committee disagreement: Vote entropy: x* = argmax_x -Σ_i (V(y_i)/C) log(V(y_i)/C), where V(y_i) is the number of committee members voting for label y_i and C is the committee size. Kullback-Leibler (KL) divergence: x* = argmax_x (1/C) Σ_c KL(P_c(y|x) || P_consensus(y|x)), the average divergence of each member's predictive distribution from the committee consensus.
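A sketch of both disagreement measures, assuming hard votes (for vote entropy) and per-member class-probability vectors (for the KL measure); the array shapes and function names are assumptions for illustration:

```python
import numpy as np

def vote_entropy(votes, n_classes):
    """votes: hard labels given to one instance by the C committee members."""
    votes = np.asarray(votes)
    c = len(votes)
    # V(y_i) / C for each label y_i
    freq = np.bincount(votes, minlength=n_classes) / c
    return -np.sum(freq * np.log(freq + 1e-12))

def mean_kl_to_consensus(member_probs):
    """member_probs: shape (C, n_classes), one distribution per member.

    Average KL divergence of each member's distribution from the
    consensus (mean) distribution over the committee.
    """
    p = np.asarray(member_probs)
    consensus = p.mean(axis=0)
    kl = np.sum(p * np.log((p + 1e-12) / (consensus + 1e-12)), axis=1)
    return kl.mean()

# Instances with the highest disagreement are queried first.
print(vote_entropy([0, 0, 1, 1, 1], n_classes=2))
print(mean_kl_to_consensus([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]))
```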
18. Query-By-Committee (QBag). Input: T, a labeled train set; C, the size of the committee; A, a learning algorithm; U, a set of unlabeled objects. 1. Uniformly resample T to obtain T1...TC, where |Ti| < |T|. 2. For each Ti build a model Hi using A. 3. Select x* = argmin_{x ∈ U} | #{i: Hi(x)=1} - #{i: Hi(x)=0} |. 4. Pass x* to the oracle and update T. 5. Repeat from step 1 until convergence.
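A minimal sketch of one QBag iteration for a binary problem, using scikit-learn decision trees as the base learner A; the oracle call is a placeholder you would replace with a human assessor:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def qbag_step(X_train, y_train, X_pool, committee_size=11, sample_frac=0.7):
    """One QBag iteration: bag a committee and return the index of the
    pool instance on which the binary vote is most evenly split."""
    rng = np.random.default_rng()
    n = len(X_train)
    k = max(1, int(sample_frac * n))          # |T_i| < |T|
    votes = np.zeros(len(X_pool), dtype=int)  # number of "1" votes per pool item
    for _ in range(committee_size):
        idx = rng.choice(n, size=k, replace=True)   # uniformly resample T
        model = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes += model.predict(X_pool)
    # | #{H_i(x)=1} - #{H_i(x)=0} | is smallest where the committee is split
    disagreement = np.abs(votes - (committee_size - votes))
    return int(np.argmin(disagreement))

# Usage sketch: query the oracle, move the instance from the pool to T, repeat.
# x_star = qbag_step(X_train, y_train, X_pool)
# y_star = oracle(X_pool[x_star])            # human labeling step
```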
19. QBag: Quality. From: K. Dwyer, R. Holte. Decision Tree Instability and Active Learning, 2007.
20. QBag: Stability. From: K. Dwyer, R. Holte. Decision Tree Instability and Active Learning, 2007.
21. Density-Weighted Methods. Idea: take into account whether a candidate lies in a dense or a sparse region of the input space, so that queries are representative of the underlying distribution rather than isolated outliers.
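The slide gives no formula; one common concrete instance is Settles' information-density weighting, where a base uncertainty score is multiplied by the instance's average similarity to the pool raised to a power β. A sketch under that assumption:

```python
import numpy as np

def information_density(uncertainty, X_pool, beta=1.0):
    """Weight a base uncertainty score by how representative each
    instance is of the unlabeled pool (cosine similarity here)."""
    X = np.asarray(X_pool, dtype=float)
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    sim = Xn @ Xn.T                 # pairwise cosine similarities
    density = sim.mean(axis=1)      # average similarity to the pool
    return np.asarray(uncertainty) * density ** beta

# Dense, uncertain instances get the highest scores; isolated outliers
# that happen to be uncertain are down-weighted.
```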
22. Other methods: Expected Model Change, Expected Error Reduction, Variance Reduction, and many «model dependent» methods.
23. Example 1: Recognition of sentence boundaries. «I recommend Mitchell (1997) or Duda et al. (2001). I have strived...» Idea: 1. Make a set of simple regexp-based rules such as «capital letter to the right» or «digits to the right and left» (about 40 rules). 2. Build a classifier on top of them: Gradient Boosted Decision Trees. Reference: A. Kudinov, A. Voropaev, A. Kalinin. A Highly Precisional Method for the Recognition of Sentence Boundaries. Computational Linguistics and Intellectual Technologies, 2011.
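An illustrative sketch of the idea: a handful of regexp rules in the spirit of the slide (not the actual 40 rules from the paper) produce binary features around each candidate period, and a gradient-boosted tree classifier is trained on them:

```python
import re
from sklearn.ensemble import GradientBoostingClassifier

# Each rule is a regexp applied to the text left or right of the candidate '.'
RULES = [
    ("capital_letter_right", "right", re.compile(r"^\s+[A-ZА-Я]")),
    ("digit_left",           "left",  re.compile(r"\d$")),
    ("digit_right",          "right", re.compile(r"^\s*\d")),
    ("abbrev_left",          "left",  re.compile(r"\b(et al|Dr|Mr|e\.g)$")),
]

def features(left, right):
    """Binary feature vector: which rules fire around the candidate '.'"""
    ctx = {"left": left, "right": right}
    return [int(bool(pat.search(ctx[side]))) for _, side, pat in RULES]

# X: feature vectors for candidate dots, y: 1 if it is a real sentence boundary.
X = [features("Duda et al", " (2001)"), features("(2001)", " I have strived")]
y = [0, 1]
clf = GradientBoostingClassifier().fit(X, y)
print(clf.predict([features("(1997)", " I recommend")]))
```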
24. Example 1: Recognition of sentence boundaries, error rates (train set: 9820 examples, validation: 500 examples): OpenNLP: 41.2%; «AOT» project: 30.4%; Mail.Ru Sentence Boundary Detector with random sampling: 8.2%; Mail.Ru Sentence Boundary Detector with least-confident sampling: 0.8%.
25. Example 2: Ranking formula. Idea: 1. A query-document pair is represented by a feature vector x, where x1 is the tf-idf rank, x2 indicates whether the query is geographical, x3 indicates whether the user's geo-region equals the document's region, plus about 600 other factors. 2. Build a train set with graded labels: 5 – vital, 4 – exact answer, 3 – useful, 2 – slightly useful, 1 – off topic, 0 – cannot be labeled (unknown language, etc.). 3. Train a ranking formula using a modified LambdaRank.
26. Example 2: Self-organizing map (SOM). Idea: a topology-preserving mapping from the N-dimensional feature space to a 2-dimensional grid, so that similar query-document vectors land in nearby map cells.
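A hand-rolled NumPy sketch of a small SOM (a simplification; the grid size, learning-rate schedule and helper names are assumptions), showing how a feature vector is mapped to the 2-D cell of its best-matching unit:

```python
import numpy as np

def train_som(X, grid=(20, 20), epochs=10, lr0=0.5, sigma0=3.0, seed=0):
    """Train a small self-organizing map and return its weight grid."""
    rng = np.random.default_rng(seed)
    h, w = grid
    W = rng.normal(size=(h, w, X.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1)
    n_steps = epochs * len(X)
    for t, i in enumerate(rng.integers(0, len(X), size=n_steps)):
        x = X[i]
        frac = t / n_steps
        lr, sigma = lr0 * (1 - frac), sigma0 * (1 - frac) + 0.5
        # Best-matching unit: the cell whose weight vector is closest to x.
        bmu = np.unravel_index(np.argmin(((W - x) ** 2).sum(axis=2)), (h, w))
        # Pull the BMU and its grid neighbours toward x.
        d2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
        nbh = np.exp(-d2 / (2 * sigma ** 2))[..., None]
        W += lr * nbh * (x - W)
    return W

def map_cell(W, x):
    """2-D coordinates of the map cell a feature vector x falls into."""
    return np.unravel_index(np.argmin(((W - x) ** 2).sum(axis=2)), W.shape[:2])

# Usage: W = train_som(features_of_query_documents); cell = map_cell(W, x)
```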
27. Example 2: Map of the train set for ranking (illustration).
28. Example 2: Density of query-documents on the map (illustration: regions of high and low density).
29. Example 2: Result of the "SOM miss" selection (illustration: old documents vs. new documents).
30. Example 2: QBag + SOM heuristic. 1. Transform the regression task into a binary classification problem: the model should order pairs of query-documents; apply QBag and select pairs of query-documents. 2. Apply the SOM heuristic: construct the initial train set T, then filter the selected pairs into A) don't take the pair, B) maybe, C) take the pair.
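A sketch of step 1, turning graded relevance labels into binary "is document a preferred to document b for the same query?" examples on which QBag can then be run (function and variable names are illustrative):

```python
import numpy as np
from itertools import combinations

def to_pairwise(features, labels, query_ids):
    """Build pairwise examples within each query: x = f(a) - f(b),
    y = 1 if document a has the higher graded label than document b."""
    features, labels, query_ids = map(np.asarray, (features, labels, query_ids))
    X_pairs, y_pairs = [], []
    for q in np.unique(query_ids):
        idx = np.flatnonzero(query_ids == q)
        for a, b in combinations(idx, 2):
            if labels[a] == labels[b]:
                continue  # no preference, skip the pair
            X_pairs.append(features[a] - features[b])
            y_pairs.append(int(labels[a] > labels[b]))
    return np.array(X_pairs), np.array(y_pairs)

# QBag then selects the pairs on which a bagged committee disagrees most,
# and the SOM density map is used to filter pairs from over-represented
# regions before sending them to assessors.
```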
31. Example 2: QBag + SOM heuristic: results (plot).
32. Examples 3 and 4: Antispam, Antiporn. Antispam: neural net, SOM-based balancing. Antiporn: decision trees, uncertainty sampling.
33. Thank you! References:
- http://active-learning.net/
- Burr Settles. Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
