TokyoR #35
Multi-Armed Bandit Problem
Masayuki Isobe
Adfive, Inc.
Who am I?
• Masayuki Isobe
– MS in computer science.
• My graduation research built a statistical model of
game players in logic programming, fitted by EM learning.

– Now I head my own company, Adfive Inc.
• dedicated to ad-tech consulting and related system
development.

– Interested in “science and business”.

• Twitter: @chiral, Facebook: masayuki.isobe.14
– Anyone in this room is welcome to reach me on
these SNS.
What is the multi-armed bandit?
• A problem in search theory.
– Given: a finite set of investment targets (like the
arms of slot machines in a casino), each with its own
reward probability, finite resources to bet, and a
limited number of trials.
– The problem: “what is the best bet in each trial?”

• Typical dilemma:
– explore vs. harvest (the classic exploration/
exploitation trade-off): it is hard to judge how much
to trust the partial knowledge gathered so far versus
keep exploring for future rewards.
– Detailed in the English Wikipedia entry
“Multi-armed_bandit”.
Formulation
• M arms, N units of resources in each trial, T trials.
– M, N, T are natural numbers.

• P(Reward): each arm’s reward probability.
– Binomial, Poisson, etc.
• In the usual formulation, arms have no mutual
relations (rewards are independent across arms).

– Parameters may be allowed to vary across trials
in more complex models.

• Goal: maximize the sum of rewards over all
trials (a minimal R sketch of this setup follows).
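
A minimal sketch of this setup in R, assuming Bernoulli (0/1)
rewards and a bet of N = 1 unit per trial; the probabilities in p
are made-up values for illustration, not from the talk.

set.seed(1)
n_arms   <- 5     # M in the formulation
n_trials <- 1000  # T in the formulation
p <- c(0.10, 0.15, 0.20, 0.25, 0.30)  # made-up hidden P(Reward) per arm

# pull(i): bet one unit (N = 1) on arm i and observe a 0/1 reward
pull <- function(i) rbinom(1, 1, p[i])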
Betting strategy
• Strategies can be put into four typical categories
(per Wikipedia):
– Semi-uniform
• epsilon-greedy and its more sophisticated variants
(a minimal sketch follows this list).
• Sample source from an O’Reilly book:
– https://github.com/johnmyleswhite/BanditsBook

– Probability matching
• Assume some reward distributions and bet according
to the inferred parameters (e.g. Thompson sampling,
also sketched after this list).
• Package ‘bandit’ on CRAN.

– Pricing
• evaluate the worth of current knowledge for future
rewards.

– Strategies with ethical constraints
• avoid arms found to be inferior.
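
Minimal R sketches of the first two categories, reusing pull(),
n_arms and n_trials from the formulation sketch. The epsilon
default and the Beta(1, 1) priors are illustrative choices, not
the implementations referenced above.

# Semi-uniform: epsilon-greedy. With probability epsilon explore a
# random arm, otherwise exploit the best empirical mean so far.
eps_greedy <- function(n_trials, n_arms, epsilon = 0.1) {
  wins  <- rep(0, n_arms)
  pulls <- rep(0, n_arms)
  total <- 0
  for (t in seq_len(n_trials)) {
    if (runif(1) < epsilon || all(pulls == 0)) {
      i <- sample(n_arms, 1)
    } else {
      i <- which.max(wins / pmax(pulls, 1))
    }
    r <- pull(i)
    pulls[i] <- pulls[i] + 1
    wins[i]  <- wins[i]  + r
    total    <- total + r
  }
  total
}

# Probability matching: Thompson sampling with illustrative Beta(1, 1)
# priors. Draw a plausible reward rate per arm, bet on the largest draw.
thompson <- function(n_trials, n_arms) {
  wins  <- rep(0, n_arms)
  pulls <- rep(0, n_arms)
  total <- 0
  for (t in seq_len(n_trials)) {
    theta <- rbeta(n_arms, 1 + wins, 1 + pulls - wins)
    i <- which.max(theta)
    r <- pull(i)
    pulls[i] <- pulls[i] + 1
    wins[i]  <- wins[i]  + r
    total    <- total + r
  }
  total
}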
Simulation in R
• Code is in chiral’s Gists.
– https://gist.github.com/chiral/7340722
– Includes the simulation code and two simple
strategies: “random” and “epsilon-greedy” (a rough
stand-in harness is sketched below).
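
The gist above is the authoritative code; as a rough stand-in, a
comparable harness could average total reward over repeated runs.
run_many and random_bet are hypothetical helpers, not from the gist.

# Average total reward of a strategy over repeated simulated runs.
run_many <- function(strategy, runs = 100, ...) {
  mean(replicate(runs, strategy(n_trials, n_arms, ...)))
}

# Baseline: bet on a uniformly random arm in every trial.
random_bet <- function(n_trials, n_arms) {
  sum(replicate(n_trials, pull(sample(n_arms, 1))))
}

run_many(random_bet)
sapply(c(0.1, 0.3, 0.5), function(e) run_many(eps_greedy, epsilon = e))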

• Results:
[Plots: cumulative reward of the simple random strategy, and of
epsilon-greedy with epsilon parameters from 0.1 to 0.5]

epsilon-greedy seems hardly worth applying here.
Conclusion & future works
• Compared to simple random betting, the simple
epsilon-greedy strategy shows little difference.
– More sophisticated strategies are needed (one
standard candidate, UCB1, is sketched after this list).
– Some useful algorithms are implemented in
Google Analytics.

• A high-performing algorithm for the bandit problem
could be applied to programmatic marketing
such as RTB.
– So I’ll continue exploring in R.
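
As one example of a more sophisticated strategy, a minimal sketch
of UCB1 (Auer et al.), again reusing pull() and run_many() from the
sketches above; UCB1 is a standard textbook algorithm, not something
covered in the talk itself.

# UCB1: try each arm once, then pick the arm maximizing the
# empirical mean plus an optimism bonus sqrt(2 * log(t) / pulls).
ucb1 <- function(n_trials, n_arms) {
  wins  <- rep(0, n_arms)
  pulls <- rep(0, n_arms)
  total <- 0
  for (t in seq_len(n_trials)) {
    if (any(pulls == 0)) {
      i <- which(pulls == 0)[1]  # initialization: one pull per arm
    } else {
      i <- which.max(wins / pulls + sqrt(2 * log(t) / pulls))
    }
    r <- pull(i)
    pulls[i] <- pulls[i] + 1
    wins[i]  <- wins[i]  + r
    total    <- total + r
  }
  total
}

run_many(ucb1)  # compare against the random and epsilon-greedy numbers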
