2. Basics of Probability and Linear Algebra:
Probability:
Experiment: An action or procedure that produces outcomes.
Sample Space (S): The set of all possible outcomes.
Event: A subset of the sample space.
Probability: A measure of the likelihood of an event, represented as a number between 0 and 1, inclusive.
Basic Concepts:
Complementary Events: If A is an event, then its complement A′ is the event that A does not occur.
Conditional Probability: P(A∣B) is the probability of event A given that event B has occurred.
Independence: Two events A and B are independent if P(A∩B)=P(A)P(B).
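For reference, conditional probability can be computed as P(A∣B) = P(A∩B)/P(B) (for P(B) > 0); when A and B are independent, this reduces to P(A∣B) = P(A).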
3. Basics of Probability and Linear Algebra:
Linear Algebra:
Vector: An ordered list of numbers.
Matrix: A rectangular array of numbers.
Dot Product: Given two vectors u and v with components u₁, ..., uₙ and v₁, ..., vₙ, their dot product is u⋅v = u₁v₁ + u₂v₂ + ... + uₙvₙ.
Matrix Multiplication: The product of two matrices A and B is another matrix whose (i, j) entry is the dot product of row i of A with column j of B.
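As a minimal sketch, both operations look like this in Python with NumPy (the array values are illustrative):

import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])
print(np.dot(u, v))        # dot product: 1*4 + 2*5 + 3*6 = 32

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])
print(A @ B)               # entry (i, j) = dot product of row i of A with column j of B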
4. Definition of a Stochastic Multi-Armed Bandit
• A multi-armed bandit problem is a classic sequential decision problem in which an agent interacts with multiple options
(the "arms"); at each interaction, the agent must choose which arm to "pull" to receive a reward. The reward
is drawn from a probability distribution associated with the chosen arm, but the distributions are initially
unknown to the agent.
• Stochastic: The term "stochastic" implies that the rewards are random variables with some underlying (but
unknown) probability distribution.
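As a minimal sketch, such an environment can be modeled in Python, assuming Gaussian rewards (the class name BanditEnv and the arm means are illustrative):

import numpy as np

class BanditEnv:
    """A K-armed stochastic bandit: each arm has a fixed reward
    distribution whose mean is hidden from the agent."""
    def __init__(self, means, rng=None):
        self.means = np.asarray(means)            # unknown to the agent
        self.rng = rng or np.random.default_rng()

    def pull(self, arm):
        # Reward: one random draw from the chosen arm's distribution.
        return self.rng.normal(self.means[arm], 1.0)

env = BanditEnv([0.1, 0.5, 0.3])
print(env.pull(1))   # pull arm 1, observe one noisy reward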
5. Definition of Regret
• Regret is a measure of the difference between the reward an agent receives by following its policy and the reward
it would have received by always selecting the best arm. Mathematically:
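In the standard formulation, with μ* = maxᵢ μᵢ the mean reward of the best arm and μ_{a_t} the mean reward of the arm pulled at round t, the expected regret after T rounds is R(T) = T⋅μ* − E[ Σ_{t=1}^{T} μ_{a_t} ].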
6. Definition of Regret
In our slot machine scenario, regret is the difference between:
• The total reward you'd get if you always played the best machine (knowing future outcomes, which you can't)
and the actual reward you get.
• It's a measure of how well you're making decisions compared to the "best possible" decisions.
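• For example, if the best machine pays 1.0 per play on average and your choices average 0.8 over 100 plays, your regret is 100 × (1.0 − 0.8) = 20.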
7. Achieving Sublinear Regret:
• Sublinear regret implies that the average regret per round goes to zero as the number of rounds T increases. For a
bandit algorithm, achieving sublinear regret means that, in the long run, the performance of the algorithm
approaches the performance of the best arm.
• As you play more, your average regret per play should decrease if you're making good decisions. If your regret
grows slower than the number of times you play (sublinearly), then, over time, you're getting close to the best
possible outcome.
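• Formally, sublinear regret means R(T) = o(T), i.e., R(T)/T → 0 as T → ∞. UCB-style algorithms, for instance, achieve regret of order O(√(K⋅T⋅log T)) over K arms, which grows much more slowly than T.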
8. UCB Algorithm (Upper Confidence Bound):
• The UCB (Upper Confidence Bound) algorithm balances exploration (trying out all arms) with exploitation (playing the
best-known arm). At each step, it chooses the arm with the highest upper confidence bound:
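In the classic UCB1 form, at round t the algorithm plays a_t = argmaxᵢ [ x̄ᵢ + √(2 ln t / nᵢ) ], where x̄ᵢ is the empirical mean reward of arm i and nᵢ is the number of times arm i has been pulled so far.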
9. UCB Algorithm (Upper Confidence Bound):
• This is a strategy to deal with the slot machines. Rather than just considering the average reward of each
machine, it also considers how uncertain we are about each machine's reward. It then plays the machine that has
the highest potential to be the best.
• The formula gives an "optimistic" estimate of each machine's potential.
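A minimal sketch of UCB1 in Python (the Bernoulli success rates in the demo are illustrative):

import numpy as np

def ucb1(pull, K, T):
    # Play each arm once, then repeatedly pick the arm maximizing
    # its empirical mean plus the exploration bonus sqrt(2 ln t / n_i).
    counts = np.zeros(K)
    sums = np.zeros(K)
    for arm in range(K):
        sums[arm] += pull(arm)
        counts[arm] += 1
    for t in range(K, T):
        bonus = np.sqrt(2 * np.log(t + 1) / counts)
        arm = int(np.argmax(sums / counts + bonus))
        sums[arm] += pull(arm)
        counts[arm] += 1
    return sums / counts   # empirical mean reward of each arm

rng = np.random.default_rng(0)
rates = [0.2, 0.5, 0.7]    # hidden success rates of three Bernoulli arms
print(ucb1(lambda a: rng.binomial(1, rates[a]), K=3, T=1000))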
10. KL-UCB:
• KL-UCB is an extension of UCB, incorporating the Kullback-Leibler (KL) divergence to determine the
confidence intervals. Instead of the confidence interval from UCB, KL-UCB uses the equation:
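For Bernoulli rewards, the KL-UCB index of arm i at round t is Uᵢ(t) = max{ q ∈ [0, 1] : nᵢ ⋅ KL(p̂ᵢ, q) ≤ ln t + c ⋅ ln ln t }, where p̂ᵢ is the empirical mean of arm i, KL(p, q) is the KL divergence between Bernoulli(p) and Bernoulli(q), and c is a small constant (c = 3 in the theoretical analysis; c = 0 works well in practice).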
• It is a refined version of UCB. Instead of a simple measure of uncertainty, it uses the Kullback-Leibler
divergence, a concept from information theory. This can sometimes give better estimates of each
machine's potential.
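Because this index has no closed form, it is usually computed numerically. A minimal sketch for the Bernoulli case using bisection (the helper names kl_bernoulli and kl_ucb_index are illustrative):

import math

def kl_bernoulli(p, q, eps=1e-12):
    # KL divergence between Bernoulli(p) and Bernoulli(q).
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(p_hat, n, t, c=0.0, iters=50):
    # Largest q >= p_hat with n * KL(p_hat, q) <= ln t + c ln ln t,
    # found by bisection (KL(p_hat, q) is increasing in q for q >= p_hat).
    bound = math.log(t) + c * math.log(max(math.log(t), 1.0))
    lo, hi = p_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if n * kl_bernoulli(p_hat, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

print(kl_ucb_index(0.6, n=10, t=100))   # arm: 6 successes in 10 pulls at round 100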
11. Thompson Sampling:
• Thompson Sampling is a Bayesian approach to the bandit problem:
• Maintain a posterior distribution over the expected reward of each arm.
• At each step, sample from the posterior distribution of each arm.
• Choose the arm with the highest sample.
• The key to Thompson Sampling is the updating of the posterior distribution based on observed rewards. It
naturally balances exploration and exploitation through the stochastic nature of the sampling process.
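A minimal sketch for Bernoulli rewards with Beta posteriors (the Beta(1, 1) prior and the demo success rates are illustrative):

import numpy as np

def thompson_bernoulli(pull, K, T, rng=None):
    rng = rng or np.random.default_rng()
    alpha = np.ones(K)                 # Beta(1, 1) uniform prior per arm
    beta = np.ones(K)
    for _ in range(T):
        theta = rng.beta(alpha, beta)  # one posterior sample per arm
        arm = int(np.argmax(theta))    # play the arm with the highest sample
        r = pull(arm)                  # observe a 0/1 reward
        alpha[arm] += r                # posterior update: successes
        beta[arm] += 1 - r             # posterior update: failures
    return alpha / (alpha + beta)      # posterior mean of each arm

rng = np.random.default_rng(0)
rates = [0.2, 0.5, 0.7]               # hidden success rates
print(thompson_bernoulli(lambda a: rng.binomial(1, rates[a]), K=3, T=1000, rng=rng))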