
Vowpal Wabbit: A Machine Learning System


  1. Vowpal Wabbit: A Machine Learning System. John Langford, Microsoft Research. http://hunch.net/~vw/  git clone git://github.com/JohnLangford/vowpal_wabbit.git
  2-4. Why does Vowpal Wabbit exist? 1. Prove research. 2. Curiosity. 3. Perfectionism. 4. Solve the problem better.
  5-7. A user base becomes addictive: 1. A mailing list of >400. 2. The official strawman for large-scale logistic regression @ NIPS :-) 3.
  8. An example: wget http://hunch.net/~jl/VW_raw.tar.gz ; vw -c rcv1.train.raw.txt -b 22 --ngram 2 --skips 4 -l 0.25 --binary provides stellar performance in 12 seconds. (-c caches the parsed data, -b 22 uses a 2^22-entry feature table, --ngram/--skips add n-gram and skip-gram features, -l sets the learning rate, --binary reports 0/1 loss.)
  9-13. Surface details: 1. BSD license, automated test suite, github repository. 2. VW supports all I/O modes: executable, library, port, daemon, service (see next). 3. VW has a reasonable++ input format: sparse, dense, namespaces, etc. 4. Mostly C++, but bindings in other languages of varying maturity (Python, C#, and Java are good). 5. A substantial user base + developer base. Thanks to many who have helped.
  14-15. VW service: http://tinyurl.com/vw-azureml  Problem: how do you deploy a model for large-scale use? Solution: a hosted VW service on AzureML (the link above).
  16-19. This tutorial in 4 parts. How do you: 1. use all the data? 2. solve the right problem? 3. solve complex joint problems? 4. solve interactive problems?
  20-21. Using all the data: Step 1. Small RAM + large data ⇒ online learning. An active research area; 4-5 papers related to online learning algorithms are implemented in VW. A minimal online-update sketch follows.
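To make the online-learning step concrete, here is a minimal sketch (names are illustrative only; this is not VW's actual update, which is adaptive, normalized, and importance-invariant) of an online squared-loss update over sparse features:

    # Minimal online SGD sketch for squared loss on sparse features.
    # Illustrative only; VW's real updates are considerably more refined.
    def learn_one(weights, features, label, lr=0.25):
        # features: dict mapping feature index -> value
        prediction = sum(weights.get(i, 0.0) * v for i, v in features.items())
        gradient = prediction - label  # derivative of 0.5 * (p - y)^2
        for i, v in features.items():
            weights[i] = weights.get(i, 0.0) - lr * gradient * v
        return prediction

    weights = {}
    for features, label in [({0: 1.0, 3: 0.5}, 1.0), ({1: 1.0}, -1.0)]:
        learn_one(weights, features, label)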
  22-23. Using all the data: Step 2. 1. 3.2 × 10^6 labeled emails. 2. 433,167 users. 3. ~40 × 10^6 unique tokens. How do we construct a spam filter which is personalized, yet uses global information? Bad answer: construct a global filter + 433,167 personalized filters using a conventional hashmap to specify features. This might require 433167 × 40 × 10^6 × 4 ≈ 70 terabytes of RAM.
  24. Using hashing. Use hashing to predict according to ⟨w, φ(x)⟩ + ⟨w, φ_u(x)⟩: the text document (email) is tokenized into a bag of words (NEU Votre Apotheke en ligne Euro ...), duplicated with user-prefixed copies (USER123_NEU USER123_Votre USER123_Apotheke USER123_en USER123_ligne USER123_Euro ...), and hashed into a sparse vector for classification. (In VW: specify the userid as a feature and use -q. A hashing sketch follows.)
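As a rough sketch of the hashing trick just described (the hash function and names are illustrative, not VW's implementation): global tokens and user-prefixed tokens are hashed into one fixed 2^b table, so memory no longer depends on vocabulary size or user count.

    # Hashing-trick sketch: global + personalized features in one 2^b table.
    import hashlib

    def h(token, b=26):
        # stable hash of a token into [0, 2^b)
        digest = hashlib.md5(token.encode()).digest()
        return int.from_bytes(digest[:8], "big") % (1 << b)

    def featurize(tokens, user_id, b=26):
        indices = [h(t, b) for t in tokens]                   # global features
        indices += [h(user_id + "_" + t, b) for t in tokens]  # personalized copies
        return indices

    print(featurize(["NEU", "Votre", "Apotheke"], "USER123")[:3])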
  25. Results: [bar chart comparing spam-filtering error of the hashed global + personalized filters against baselines at various hash-table sizes]. 2^26 parameters = 64M parameters = 256 MB of RAM. A ~270,000× savings in RAM requirements.
  26-28. Applying for a fellowship in 1997. Interviewer: So, what do you want to do? John: I'd like to solve AI. I: How? J: I want to use parallel learning algorithms to create fantastic learning machines! I: You fool! The only thing parallel machines are good for is computational wind tunnels! The worst part: he had a point.
  29-31. Using all the data: Step 3. Given 2.1 terafeatures of data, how can you learn a good linear predictor f_w(x) = Σ_i w_i x_i? 17B examples, 16M parameters, 1K nodes. How long does it take? 70 minutes = 500M features/second: faster than the I/O bandwidth of a single machine ⇒ faster than all possible single-machine linear learning algorithms.
  32. MPI-style AllReduce. [Diagram: allreduce initial state; seven nodes hold the values 7, 2, 3, 4, 6, 5, 1.]
  33. MPI-style AllReduce. [Diagram: allreduce final state; every node holds 28.] Properties: 1. How long does it take? 2. How much bandwidth? 3. How hard is it to program?
  34. MPI-style AllReduce. [Diagram: create a binary tree over the nodes.]
  35. MPI-style AllReduce. [Diagram: reducing, step 1 — children's values are summed into their parents.]
  36. MPI-style AllReduce. [Diagram: reducing, step 2 — the root now holds the total, 28.]
  37. MPI-style AllReduce. [Diagram: broadcast, step 1 — the root's total, 28, is passed down the tree.]
  38-40. MPI-style AllReduce. [Diagram: allreduce final state; every node holds 28.] AllReduce = Reduce + Broadcast. Properties: 1. How long does it take? O(1) time(*). 2. How much bandwidth? O(1) bits(*). 3. How hard is it to program? Very easy. (*) When done right. A toy simulation follows.
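A toy sequential simulation of the reduce-then-broadcast pattern (an array of values stands in for real nodes; the actual implementation runs over sockets between machines):

    # Toy AllReduce-by-sum over an implicit binary tree (node i's children
    # are 2i+1 and 2i+2): reduce child values into parents, then broadcast.
    def allreduce_sum(values):
        totals = list(values)
        n = len(totals)
        for i in reversed(range(n)):          # reduce: leaves toward root
            for c in (2 * i + 1, 2 * i + 2):
                if c < n:
                    totals[i] += totals[c]
        return [totals[0]] * n                # broadcast the root's total

    print(allreduce_sum([7, 2, 3, 4, 6, 5, 1]))  # every node ends with 28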
  41-42. An example algorithm: weight averaging.
    n = AllReduce(1)
    While (pass number < max):
      1. While (examples left): do online update.
      2. AllReduce(weights)
      3. For each weight: w ← w/n
  Code tour.
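The averaging loop itself might look like the following sketch, assuming an allreduce(vector) primitive that sums a vector elementwise across nodes (faked here for a single node):

    # Weight averaging via AllReduce (sketch; single-node stand-in).
    def allreduce(vec):
        return list(vec)  # a real implementation sums across all nodes

    def train(data, weights, update, max_passes):
        n = allreduce([1.0])[0]               # counts participating nodes
        for _ in range(max_passes):
            for example in data:
                update(weights, example)      # local online update
            summed = allreduce(weights)
            weights[:] = [w / n for w in summed]  # replace by the average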
  43-45. What is Hadoop AllReduce? 1. The map job moves the program to the data. 2. Delayed initialization: most failures are disk failures, so first read (and cache) all data before initializing allreduce. 3. Speculative execution: in a busy cluster, one node is often slow; use speculative execution to start additional mappers.
  46. Robustness / speedup. [Plot: speedup (0-10) vs. number of nodes (10-100) for Average_10, Min_10, Max_10, and a linear reference.]
  47. Splice site recognition. [Plot: auPRC (0.2-0.55) vs. iteration for Online, L-BFGS w/ 5 online passes, L-BFGS w/ 1 online pass, and L-BFGS.]
  48. Splice site recognition. [Plot: auPRC (0-0.6) vs. effective number of passes over the data for L-BFGS w/ one online pass, Zinkevich et al., and Dekel et al.]
  49. This tutorial in 4 parts. How do you: 1. use all the data? 2. solve the right problem? 3. solve complex joint problems easily? 4. solve interactive problems?
  50-51. Applying machine learning in practice: 1. Ignore the mismatch. Often faster. 2. Understand the problem and find a more suitable tool. Often better.
  52. Importance-weighted classification. Given training data {(x1, y1, c1), ..., (xn, yn, cn)}, produce a classifier h: X → {0, 1}. There is an unknown underlying distribution D over X × {0, 1} × [0, ∞). Find h with small expected cost: ℓ(h, D) = E_{(x,y,c)∼D}[ c · 1(h(x) ≠ y) ].
  53. Where does this come up? 1. Spam prediction (ham predicted as spam is much worse than spam predicted as ham). 2. Distribution shifts (optimize search engine results for monetizing queries). 3. Boosting (reweight problem examples for residual learning). 4. Large-scale learning (downsample the common class and importance-weight to compensate). A naive update sketch follows.
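The naive way to consume an importance weight c in an online learner is to scale the update by c, as in this sketch (VW's actual handling is more careful: its importance-invariant updates behave like c copies of the example):

    # Naive importance-weighted squared-loss update: scale the step by c.
    def iw_update(weights, features, label, c, lr=0.1):
        p = sum(weights.get(i, 0.0) * v for i, v in features.items())
        g = (p - label) * c                   # importance-scaled gradient
        for i, v in features.items():
            weights[i] = weights.get(i, 0.0) - lr * g * v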
  54-55. Multiclass classification. Distribution D over X × Y, where Y = {1, ..., k}. Find a classifier h: X → Y minimizing the multiclass loss on D: ℓ_k(h, D) = Pr_{(x,y)∼D}[ h(x) ≠ y ]. 1. Categorization: which of k things is it? 2. Actions: which of k choices should be made?
  56. Use in VW. Multiclass label format: Label [Importance] ['Tag]. Methods: --oaa k: one-against-all prediction, O(k) time; the baseline. --ect k: error-correcting tournament, O(log k) time. --log_multi n: adaptive logarithmic time, O(log n).
  57. One-against-all (OAA). Create k binary problems, one per class. For class i, predict "is the label i or not?": (x, y) → (x, 1(y = 1)), (x, 1(y = 2)), ..., (x, 1(y = k)). A sketch of the reduction follows.
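A minimal sketch of the reduction (the learn/predict interface here is hypothetical): training turns one multiclass example into k binary examples, and prediction takes the highest-scoring class.

    # One-against-all as a reduction to binary scorers (hypothetical API).
    def oaa_learn(learners, x, y, k):
        for i in range(1, k + 1):
            learners[i].learn(x, 1.0 if y == i else 0.0)  # "is it class i?"

    def oaa_predict(learners, x, k):
        return max(range(1, k + 1), key=lambda i: learners[i].predict(x))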
  58-59. The inconsistency problem. Given an optimal binary classifier, one-against-all doesn't produce an optimal multiclass classifier. Example: three labels with conditional probabilities 1/2 − δ, 1/4 + δ/2, 1/4 + δ/2. Every label has probability below 1/2, so each optimal binary classifier (1 vs 23, 2 vs 13, 3 vs 12) predicts "not my class", and one-against-all yields no prediction. Solution: always use one-against-all regression.
  60-61. Cost-sensitive multiclass classification. Distribution D over X × [0, 1]^k, where a vector in [0, 1]^k specifies the cost of each of the k choices. Find a classifier h: X → {1, ..., k} minimizing the expected cost: cost(h, D) = E_{(x,c)∼D}[ c_{h(x)} ]. 1. Is this packet {normal, error, attack}? 2. A subroutine used later...
  62-63. Use in VW. Label information via a sparse vector. A test example: |Namespace Feature. A test example with only classes 1, 2, 4 valid: 1: 2: 4: |Namespace Feature. A training example with only classes 1, 2, 4 valid: 1:0.4 2:3.1 4:2.2 |Namespace Feature. Methods: --csoaa k: cost-sensitive OAA prediction, O(k) time. --csoaa_ldf: label-dependent-features OAA. --wap_ldf: LDF weighted-all-pairs.
  64. Code tour.
  65. This tutorial in 4 parts. How do you: 1. use all the data? 2. solve the right problem? 3. solve complex joint problems easily? 4. solve interactive problems?
  66-70. The problem: joint prediction. How? 1. Each prediction is independent. 2. Multitask learning. 3. Assume a tractable graphical model, optimize. 4. Hand-crafted approaches.
  71-75. What makes a good solution? 1. Programming complexity. Most complex problems are addressed independently; they are too complex to do otherwise. 2. Prediction accuracy. It had better work well. 3. Train speed. Debug/development productivity + maximum data input. 4. Test speed. Application efficiency.
  76. A program-complexity comparison. [Bar chart: lines of code for POS tagging, log scale 1-1000, for CRFSGD, CRF++, S-SVM, and Search.]
  77. POS tagging (tuned hyperparameters). [Plot: per-word accuracy (90.7-96.6%) vs. training time (seconds to hours, log scale) for OAA, L2S, L2S (ft), CRFsgd, CRF++, StrPerc, StrSVM, StrSVM2.]
  78. Prediction (test-time) speed. [Bar chart: thousands of tokens per second on NER and POS for L2S, L2S (ft), CRFsgd, CRF++, StrPerc, StrSVM, StrSVM2; L2S variants are fastest by a wide margin.]
  79-80. How do you program?
    Sequential_RUN(examples):
      for i = 1 to len(examples):
        prediction ← predict(examples[i], examples[i].label)
        loss(prediction ≠ examples[i].label)
  In essence, write the decoder, providing a little bit of side information for training.
  81. RunParser(sentence):
      stack S ← {Root}
      buffer B ← [words in sentence]
      arcs A ← ∅
      while B ≠ ∅ or |S| > 1:
        ValidActs ← GetValidActions(S, B)
        features ← GetFeat(S, B, A)
        ref ← GetGoldAction(S, B)
        action ← predict(features, ref, ValidActs)
        S, B, A ← Transition(S, B, A, action)
      loss(A[w] ≠ A*[w], ∀w ∈ sentence)
      return output
  82. How does it work? An application of learning-to-search algorithms (e.g. Searn, DAgger, LOLS [ICML 2015]). The decoder is run many times at train time to optimize predict(...) for loss(...). See the tutorial with Hal Daumé @ ICML 2015 + the LOLS paper @ ICML 2015.
  83. Named entity recognition: is this word part of an organization, a person, or neither? [Plot: F-score per entity (73.3-80.0) vs. training time (10 seconds to 10 minutes), tuned hyperparameters.]
  84. Entity relation. Goal: find the entities, then find their relations.
    Method           Entity F1   Relation F1   Train time
    Structured SVM   88.00       50.04         300 seconds
    L2S              92.51       52.03         13 seconds
  Requires about 100 LOC.
  85. Dependency parsing. Goal: find the dependency structure of a sentence. [Bar chart: UAS (higher = better, 70-95) per language (Ar*, Bu, Ch, Da, Du, En, Ja, Po*, Sl*, Sw, Avg) for L2S, Dyna, SNN.] Requires about 300 LOC.
  86. A demonstration: wget http://bilbo.cs.uiuc.edu/~kchang10/tmp/wsj.vw.zip ; vw -b 24 -d wsj.train.vw -c --search_task sequence --search 45 --search_alpha 1e-8 --search_neighbor_features -1:w,1:w --affix -1w,+1w -f foo.reg ; vw -t -i foo.reg wsj.test.vw
  87. This tutorial in 4 parts. How do you: 1. use all the data? 2. solve the right problem? 3. solve complex joint problems easily? 4. solve interactive problems?
  88. Examples of interactive learning. Repeatedly: 1. A user comes to Microsoft (with a history of previous visits, IP address, data related to an account). 2. Microsoft chooses information to present (urls, ads, news stories). 3. The user reacts to the presented information (clicks on something, clicks, comes back and clicks again, ...). Microsoft wants to interactively choose content and use the observed feedback to improve future choices.
  89. Another example: clinical decision making. Repeatedly: 1. A patient comes to a doctor with symptoms, medical history, test results. 2. The doctor chooses a treatment. 3. The patient responds to it. The doctor wants a policy for choosing targeted treatments for individual patients.
  90. The contextual bandit setting. For t = 1, ..., T: 1. The world produces some context x ∈ X. 2. The learner chooses an action a ∈ A. 3. The world reacts with reward r_a ∈ [0, 1]. Goal: learn a good policy for choosing actions given context.
  91-100. The direct method. Use past data to learn a reward predictor r̂(x, a), and act according to argmax_a r̂(x, a). Example: the deployed policy always takes a1 on x1 and a2 on x2, so only those cells are directly observed; fitting the predictor fills in the rest. Final table (observed / estimated / true):
           a1              a2
    x1     .8 / .8 / .8    ? / .514 / 1
    x2     .3 / .3 / .3    .2 / .014 / .2
  Basic observation 1: generalization is insufficient. Basic observation 2: exploration is required. Basic observation 3: errors ≠ exploration.
  101-102. The evaluation problem. Let π: X → A be a policy mapping features to actions. How do we evaluate it? Method 1: deploy the algorithm in the world. Very expensive!
  103-107. The importance-weighting trick. Let π: X → A be a policy mapping features to actions. How do we evaluate it? One answer: collect T exploration samples (x, a, r_a, p_a), where x = context, a = action, r_a = reward for the action, and p_a = probability of action a; then evaluate: Value(π) = Average( r_a · 1(π(x) = a) / p_a ). Theorem: for all policies π and all IID data distributions D, Value(π) is an unbiased estimate of the expected reward of π: E_{(x,r)∼D}[ r_{π(x)} ] = E[ Value(π) ]. Proof: E_{a∼p}[ r_a · 1(π(x) = a) / p_a ] = Σ_a p_a · r_a · 1(π(x) = a) / p_a = r_{π(x)}. Example with two actions, rewards (0.5, 1), and probabilities (1/4, 3/4): if action 1 is sampled the per-action estimates are (2, 0); if action 2 is sampled they are (0, 4/3). A sketch of the estimator follows.
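The estimator is short enough to state in code; this sketch reproduces the slide's two-action example for a policy that always plays action 1:

    # IPS value estimate: average r_a * 1(pi(x) = a) / p_a over logged data.
    def ips_value(logged, pi):
        # logged: list of (x, a, r_a, p_a) exploration samples
        return sum(r * (pi(x) == a) / p for x, a, r, p in logged) / len(logged)

    pi = lambda x: 1                              # always plays action 1
    print(ips_value([(None, 1, 0.5, 0.25)], pi))  # action 1 sampled -> 2.0
    print(ips_value([(None, 2, 1.0, 0.75)], pi))  # action 2 sampled -> 0.0
    # In expectation: 1/4 * 2 + 3/4 * 0 = 0.5, the policy's true reward.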
  108. How do you test things? Use the format: action:cost:probability | features. Example: 1:1:0.5 | tuesday year million short compan vehicl line stat financ commit exchang plan corp subsid credit issu debt pay gold bureau prelimin ren billion telephon time draw basic relat le spokesm reut secur acquir form prospect period interview regist toront resourc barrick ontario qualif bln prospectus convertibl vinc borg arequip ...
  109-111. How do you train? Reduce to cost-sensitive classification.
    vw --cb 2 --cb_type dr rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.25    (progressive 0/1 loss: 0.04582)
    vw --cb 2 --cb_type ips rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.125  (progressive 0/1 loss: 0.05065)
    vw --cb 2 --cb_type dm rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.125   (progressive 0/1 loss: 0.04679)
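The three --cb_type options are standard estimators: ips is inverse propensity scoring as above, dm is the direct method (a plain reward regressor), and dr is the doubly robust combination of the two. A sketch of the doubly robust reward estimate for one logged sample (function names illustrative):

    # Doubly robust estimate: regression estimate plus an importance-
    # weighted residual correction on the action actually taken.
    def dr_estimate(a, a_logged, r_logged, p_logged, r_hat):
        correction = 0.0
        if a == a_logged:
            correction = (r_logged - r_hat(a)) / p_logged
        return r_hat(a) + correction

A standard property of doubly robust estimators is that they are accurate whenever either the logged propensities or the regressor r_hat is accurate, consistent with dr achieving the best loss above.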
  112. Reminder: the contextual bandit setting. For t = 1, ..., T: 1. The world produces some context x ∈ X. 2. The learner chooses an action a ∈ A. 3. The world reacts with reward r_a ∈ [0, 1]. Goal: learn a good policy for choosing actions given context. What does learning mean? Efficiently competing with some large reference class of policies Π = {π: X → A}: Regret = max_{π∈Π} average_t( r_{π(x)} − r_a ).
  113. Building an algorithm. For t = 1, ..., T: 1. The world produces some context x ∈ X. 3. The learner chooses an action a ∈ A. 4. The world reacts with reward r_a ∈ [0, 1]. (Step 2 is added on the next slide.)
  114. Building an algorithm. Let Q1 = the uniform distribution. For t = 1, ..., T: 1. The world produces some context x ∈ X. 2. Draw π ∼ Qt. 3. The learner chooses an action a using π(x). 4. The world reacts with reward r_a ∈ [0, 1]. 5. Update Q_{t+1}.
  115. What is a good Qt? Exploration: Qt allows discovery of good policies. Exploitation: Qt is small on bad policies. The simplest choice is sketched below.
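The simplest such Qt is epsilon-greedy, one of the baselines in the plots that follow: put mass 1 − ε on the current best policy and spread ε uniformly over the actions. A sketch:

    # Epsilon-greedy as a simple Q_t (sketch): exploit the current best
    # policy most of the time, explore uniformly with probability eps.
    import random

    def choose(best_policy, x, actions, eps=0.05):
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = best_policy(x)
        # record the action's probability for later unbiased evaluation
        p = eps / len(actions) + (1 - eps) * (a == best_policy(x))
        return a, p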
  116-117. How do you find Qt? By reduction... [details complex, but coded].
  118. How well does this work? [Bar chart: losses on the CCAT RCV1 problem for eps-greedy, tau-first, LinUCB*, and Cover.]
  119. How long does this take? [Bar chart, log scale 1 to 10^6 seconds: running times on the CCAT RCV1 problem for eps-greedy, tau-first, LinUCB*, and Cover.]
  120. Next for Vowpal Wabbit. The version 8 series has just started. Primary goal: new research (as always) + tackling deployability: 1. Backwards model compatibility across VW versions. 2. More serious testing. 3. More serious documentation. 4. More and better library interfaces. 5. Dynamic module loading.
  121. Further reading. VW wiki: https://github.com/JohnLangford/vowpal_wabbit/wiki. Search: NYU large scale learning class. NIPS tutorial on exploration: http://hunch.net/~jl/interact.pdf. ICML tutorial on learning to search: coming soon.
  122. Bibliography.
    Release: Vowpal Wabbit, 2007, http://github.com/JohnLangford/vowpal_wabbit/wiki
    Terascale: A. Agarwal et al., A Reliable Effective Terascale Linear Learning System, http://arxiv.org/abs/1110.4198
    Reduce: A. Beygelzimer et al., Learning Reductions that Really Work, http://arxiv.org/abs/1502.02704
    LOLS: K. Chang et al., Learning to Search Better than Your Teacher, ICML 2015, http://arxiv.org/abs/1502.02206
    Explore: A. Agarwal et al., Taming the Monster..., http://arxiv.org/abs/1402.0555
