  Bandit Basics – A Different Take on Online Optimization
  Who is this guy?
  Matt Gershoff
  CEO: Conductrics
  Many Years in Database Marketing (New York and Paris) and a bit of Web
  What Are We Going to Hear?
  • Optimization Basics
  • Multi-Armed Bandit
    – It's a Problem, Not a Method
  • Some Methods
    – AB Testing
    – Epsilon Greedy
    – Upper Confidence Bound (UCB)
  • Some Results
  Choices, Targeting, Learning, Optimization
  OPTIMIZATION
  If THIS Then THAT
  ITTT brings together:
  1. Decision Rules
  2. Predictive Analytics
  3. Choice Optimization
  OPTIMIZATION
  Find and Apply the Rule with the Most Value
  [Figure: a stack of candidate "If THIS Then THAT" rules]
  OPTIMIZATION
  THIS (Inputs): Variables whose Values Are Given to You, e.g. Facebook, High Spend, Urban GEO, ..., Home Page, App Use
  THAT (Outputs): Variables whose Values You Control, e.g. Offer A, Offer B, Offer C, ..., Offer Y, Offer Z
  [Figure: a Predictive Model with features F1, F2, ..., Fm links the Inputs to a Value_i for the Outputs]
  But …
  1. We Don't Have Data on 'THAT'
  2. Need to Collect – Sample THAT
  3. How to Sample Efficiently?
  [Figure: Offer A through Offer Z, each marked with an unknown value "?"]
  Where
  Marketing Applications:
  • Websites
  • Mobile
  • Social Media Campaigns
  • Banner Ads
  Pharma: Clinical Trials
  What is a Multi-Armed Bandit?
  One-Armed Bandit -> Slot Machine
  The problem: How to pick between Slot Machines so that you walk out with the most $$$ from the Casino at the end of the Night?
  A OR B
  Objective
  Pick so as to get as much return/profit as you can over time.
  Technical term: Minimize Regret
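
"Minimize Regret" can be written down explicitly. What follows is the standard textbook definition, not something taken from the slides. The symbols are notation assumed here for illustration: $T$ is the number of selections made, $a_t$ is the action picked at step $t$, and $\mu_a$ is the true (unknown) mean payoff of action $a$.

```latex
R_T \;=\; T\,\mu^{*} \;-\; \sum_{t=1}^{T} \mu_{a_t},
\qquad \mu^{*} = \max_{a} \mu_{a}
```

Regret is the gap between what always playing the best machine would earn on average and what the choices actually made earn. As a hypothetical example, if machine A pays $5 on average and machine B pays $4, every pull of B adds $1 of regret.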
  Sequential Selection
  … but how to Pick? A OR B
  Need to Sample, but do it efficiently
  Explore – Collect Data (A OR B)
  • Data Collection is costly – an Investment
  • Be Efficient – Balance the potential value of collecting new data with exploiting what you currently know.
  Multi-Armed Bandits
  "Bandit problems embody in essential form a conflict evident in all human action: choosing actions which yield immediate reward vs. choosing actions … whose benefit will come only later."* - Peter Whittle
  *Source: Qing Zhao, UC Davis. Plenary talk at SPAWC, June 2010.
  Exploration / Exploitation
  1) Explore/Learn – Try out different actions to learn how they perform over time. This is a data collection task.
  2) Exploit/Earn – Take advantage of what you have learned to get the highest payoff, using your current best guess.
  Not A New Problem
  1933 – first work on competing options
  1940 – WWII problem the Allies attempt to tackle
  1953 – Bellman formulates it as a Dynamic Programming problem
  Source:
  Testing
  • Explore First
    – All actions have an equal chance of selection (uniform random).
    – Use hypothesis testing to select a 'Winner'.
  • Then Exploit
    – Keep only the 'Winner' for selection
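
The "explore first, then exploit" recipe can be sketched for two actions as below. This is a minimal illustration, not the speaker's implementation: it assumes the rewards are conversions (yes/no), uses a pooled two-proportion z-test as the hypothesis test, and the function names and the `alpha=0.05` threshold are hypothetical choices.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both actions share one conversion rate."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (rate_a - rate_b) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return 2 * (1 - cdf)

def pick_winner(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Return 'A' or 'B' if the difference is significant, else None."""
    p_value = two_proportion_p_value(conv_a, n_a, conv_b, n_b)
    if p_value >= alpha:
        return None  # keep exploring, or accept that there is no clear winner
    return "A" if conv_a / n_a > conv_b / n_b else "B"
```

After the exploration phase ends, only the action returned by `pick_winner` would be served. Note that the sample size has to be fixed up front for the p-value to mean what it claims.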
  Learn First
  [Figure: timeline. Data Collection/Sample (Explore/Learn) comes first, followed by Apply Learning (Exploit/Earn), over Time]
  P-Values: A Digression
  P-Values:
  • NOT the probability that the Null is True: P(Null=True | DATA)
  • They are P(DATA (or more extreme) | Null=True)
  • Not a great tool for deciding when to stop sampling
  See:
  A Couple Other Methods
  1. Epsilon Greedy: Nice and Simple
  2. Upper Confidence Bounds (UCB): Adapts to Uncertainty
  1) Epsilon-Greedy
  Greedy
  What do you mean by ‘Greedy’?
  Make whatever choice seems best at the moment.
  Epsilon Greedy
  What do you mean by ‘Epsilon Greedy’?
  • Explore – randomly select an action ε percent of the time (say 20%)
  • Exploit – play greedy (pick the current best) the remaining 1 - ε of the time (say 80%)
  Epsilon Greedy
  [Figure: each User is routed either to Explore/Learn (20%): Select Randomly, like AB Testing; or to Exploit/Earn (80%): Select Current Best (Be Greedy)]
  Epsilon Greedy
  20% Random, 80% Select Best
  Action  Value
  A       $5.00
  B       $4.00
  C       $3.00
  D       $2.00
  E       $1.00
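
The selection rule for the table above fits in a few lines. This is a minimal sketch: the function name, the `rng` parameter, and the dict layout are assumptions made for illustration, and the values are simply the ones from the slide.

```python
import random

def epsilon_greedy(values, epsilon=0.2, rng=random):
    """Pick an action name from a dict of estimated values."""
    if rng.random() < epsilon:
        # Explore: every action has an equal chance of selection.
        return rng.choice(list(values))
    # Exploit: play greedy and take the current best estimate.
    return max(values, key=values.get)

# Estimated values from the slide.
estimates = {"A": 5.00, "B": 4.00, "C": 3.00, "D": 2.00, "E": 1.00}
choice = epsilon_greedy(estimates, epsilon=0.2)
```

Because the random draw can also land on the current best, action A is served about 84% of the time here (80% greedy plus one fifth of the 20% exploration), and each other action about 4%.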
  Continuous Sampling
  [Figure: Explore/Learn and Exploit/Earn both continue over Time]
  Epsilon Greedy
  – Super simple/low cost to implement
  – Tends to be surprisingly effective
  – Less affected by 'Seasonality'
  – Not optimal (hard to pick the best ε)
  – Doesn't use a measure of variance
  – Should exploration decrease over time, and if so, how?
  Upper Confidence Bound
  Basic Idea:
  1) Calculate both the mean and a measure of uncertainty (variance) for each action.
  2) Make Greedy selections based on mean + uncertainty bonus
  Confidence Interval Review
  Confidence Interval = mean +/- z*Std
  [Figure: interval running from Mean - 2*Std to Mean + 2*Std]
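
The interval formula can be computed directly from a list of observed rewards. This is a sketch under one stated assumption: "Std" is read here as the standard error of the mean (sample standard deviation divided by the square root of the sample size), so that the interval narrows as data accumulates. The function name and the default `z=2.0` are illustrative.

```python
import math
import statistics

def confidence_interval(rewards, z=2.0):
    """Return (lower, upper) as mean +/- z * standard error of the mean."""
    mean = statistics.mean(rewards)
    # Assumption: "Std" on the slide is taken to be the standard error.
    std_err = statistics.stdev(rewards) / math.sqrt(len(rewards))
    return mean - z * std_err, mean + z * std_err
```

With only two observations, $4 and $6, the interval runs from $3 to $7. Repeating the same pattern over 100 observations shrinks it to roughly $4.80 to $5.20, which is the "more data, less uncertainty" effect the next slides rely on.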
  Upper Confidence
  Score each option using the upper portion of the interval as a Bonus: Mean + Bonus
  Upper Confidence Bound
  1) Use upper portion of CI as 'Bonus'
  2) Make Greedy Selections
  [Figure: Estimated Reward intervals for A, B, and C on a $0 to $10 axis; Select A]
  Upper Confidence Bound
  1) Selecting Action 'A' reduces its uncertainty bonus (because more data)
  2) Action 'C' now has the highest score
  [Figure: Estimated Reward intervals for A, B, and C on a $0 to $10 axis; Select C]
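
The two steps above can be sketched as a scoring function plus a greedy pick. This follows the slides' "mean + bonus" idea with a bonus of z standard errors. It is not the classic UCB1 formula, and the function names, the `z=2.0` default, and the rule that actions with fewer than two observations go first are all assumptions made for illustration.

```python
import math
import statistics

def ucb_score(rewards, z=2.0):
    """Mean plus an uncertainty bonus of z standard errors."""
    if len(rewards) < 2:
        # Assumption: too little data means maximal uncertainty, so try it first.
        return math.inf
    std_err = statistics.stdev(rewards) / math.sqrt(len(rewards))
    return statistics.mean(rewards) + z * std_err

def select_action(history, z=2.0):
    """history maps each action name to its list of observed rewards."""
    return max(history, key=lambda action: ucb_score(history[action], z))

# Hypothetical data: A and C share the same mean, but C has been tried less.
history = {"A": [5.0, 7.0, 3.0, 5.0], "B": [1.0, 2.0, 1.0, 2.0], "C": [2.0, 8.0]}
chosen = select_action(history)
```

Here A and C both average $5, but C scores about $11 against A's roughly $6.63 because its two noisy observations leave a larger bonus, so C is selected. Each further observation of an action shrinks its standard error, which is how the method moves between exploring and exploiting without a separate ε setting.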
  Upper Confidence Bound
  • Like A/B Test – uses variance measure
  • Unlike A/B Test – no hypothesis test
  • Automatically Balances Exploration with Exploitation
  Case Study
  Treatment  Conversion Rate  Served
  V2V3       9.9%             14,893
  V2V2       9.7%             9,720
  V2V1       8.0%             2,441
  V1V3       3.3%             2,090
  V2V3       2.6%             1,849
  V2V2       2.0%             1,817
  V1V1       1.8%             1,926
  V3V1       1.8%             1,821
  V1V2       1.5%             1,873
  Case Study
  Test Method   Conversion Rate
  Adaptive      7%
  Non Adaptive  4.5%
  AB Testing vs. Bandit
  [Figure: Option A ->, Option B ->, Option C ->]
  Why Should I Care?
  • More Efficient Learning
  • Automation
  • Changing World
  Questions?
  Thank You!
  Matt Gershoff
  p) 646-384-5151
  e)
  t) @mgershoff