Multi-armed Bandit Problems#
The multi-armed bandit problem is a fundamental problem in machine learning and decision-making. It refers to a scenario where an agent must repeatedly choose between multiple actions, each with an unknown reward distribution. Each action can be considered a slot machine, or a one-armed bandit, with potentially different payout distributions.
The goal of the agent is to maximize its total reward over time while learning the reward distributions of the different actions. This presents a tradeoff between exploitation, where the agent chooses the action with the highest estimated reward so far, and exploration, where it selects an action with an uncertain reward to learn more about its distribution. The multi-armed bandit problem has practical applications in various fields, such as clinical trials, online advertising, and recommender systems.
Binary Bernoulli bandit problems#
When an agent needs to choose between different actions, each with an unknown reward distribution, they face a crucial trade-off between exploration and exploitation. Exploitation involves selecting the action with the highest estimated reward based on experience, while exploration involves choosing an action with an uncertain reward to learn more about its distribution. To maximize the total reward over time, it’s essential to balance these two objectives. Various algorithms and strategies have been developed to address this trade-off, and studying their performance is crucial to understanding and solving multi-armed bandit problems.
A general bandit problem is a sequential game played between an agent and the environment. The game is played over \(n\) rounds called the horizon. In each round the agent chooses an action \(a\in\mathcal{A}\) and the environment then reveals a reward \(r\). Let’s look at a particular sub-type of the multi-armed bandit problem, namely the Bernoulli bandit problem, and at Thompson sampling, a particular solution approach to this class of problem [Russo et al., 2017]:
(Binary Bernoulli bandit problem)
A Binary Bernoulli bandit problem is a type of multi-armed bandit problem where each action has a binary reward distribution with an unknown probability of success. In this problem:
An agent has a choice between \(K\) possible actions \(\mathcal{A}=\left\{a_{1},a_{2},\dots,a_{K}\right\}\) where the success or failure of any action \(a_{i}\in\mathcal{A}\) is governed by a Bernoulli random variable (see Definition 8).
The success probability \(p_{a_{i}}\) of each action, while static over the horizon of the game, is unknown to the agent, but it can be learned by experimentation.
The agent receives a reward of \(R_{a_{i}} = 1\) (success) with probability \(p_{a_{i}}\), and \(R_{a_{i}} = 0\) (failure) otherwise.
The agent’s objective is to maximize the cumulative number of successes over the game horizon, i.e., over \(T\) periods, where \(T\gg K\).
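To make this setup concrete, here is a minimal Python sketch of a Binary Bernoulli bandit environment; the class name `BernoulliBandit`, the `pull` method, and the example success probabilities are illustrative assumptions, not part of the definition.

```python
import numpy as np

class BernoulliBandit:
    """Minimal sketch of the environment in Definition 16 (names are illustrative).

    Each of the K actions has a fixed, hidden success probability p_k;
    pulling action k returns 1 with probability p_k and 0 otherwise.
    """

    def __init__(self, success_probabilities, seed=None):
        self.p = np.asarray(success_probabilities, dtype=float)  # hidden from the agent
        self.K = len(self.p)
        self.rng = np.random.default_rng(seed)

    def pull(self, k):
        """Play action k and return a binary reward R_k ~ Bernoulli(p_k)."""
        return int(self.rng.random() < self.p[k])

# Example: a 3-armed bandit the agent will interact with for T >> K rounds.
bandit = BernoulliBandit([0.10, 0.45, 0.60], seed=0)
print(bandit.pull(2))  # 1 with probability 0.60, else 0
```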
There are multiple heuristic strategies to solve the Binary Bernoulli bandit problem described in Definition 16:
Thompson sampling, which is shown in Algorithm 1, consists of choosing the action that maximizes the expected reward with respect to a randomly drawn belief, where the belief distribution for each action follows a Beta distribution.
(Beta-Thompson-sampling)
Inputs the number of time periods in the horizon \(T\), the dimension of the action space \(K = |\mathcal{A}|\), and the initial (prior) values \(\alpha_{1},\dots,\alpha_{K}\) and \(\beta_{1},\dots,\beta_{K}\) for the parameters that appear in the Beta-distribution for each action
Outputs posterior values for the parameters \(\alpha_{1},\dots,\alpha_{K}\) and \(\beta_{1},\dots,\beta_{K}\) that appear in the Beta-distributions for each action
Initialize a Beta-distribution for each action using initial values \((\alpha,\beta)\).
Main
for t \(\in\) 1 to T
for k \(\in\) 1 to K
sample \(\hat{\theta}_{k} \sim \text{Beta}(\alpha_{k}, \beta_{k})\)
Select action: \(a_{t} = \text{arg}\max_{k}\hat{\theta}_{k}\)
Apply \(a_{t}\) and observe \(r_{t}\).
Update distribution parameters:
\((\alpha_{a_{t}},\beta_{a_{t}})\leftarrow\left(\alpha_{a_{t}}+r_{t},\beta_{a_{t}}+(1-r_{t})\right)\)
Return \((\alpha_{1},\dots,\alpha_{K})\) and \((\beta_{1},\dots,\beta_{K})\)
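The pseudocode above translates almost line for line into code. Below is a minimal Python sketch of Algorithm 1 run against a simulated environment whose true success probabilities are known only to the simulator; the function name `beta_thompson_sampling` and the uniform \(\text{Beta}(1,1)\) priors are illustrative choices, not prescribed by the algorithm.

```python
import numpy as np

def beta_thompson_sampling(true_p, T, seed=None):
    """Sketch of Algorithm 1: Beta-Thompson sampling on a simulated Bernoulli bandit.

    true_p : hidden success probabilities (used only to simulate rewards)
    T      : number of time periods in the horizon
    Returns the posterior Beta parameters (alpha, beta) for each action.
    """
    rng = np.random.default_rng(seed)
    K = len(true_p)
    alpha = np.ones(K)  # Beta(1, 1) priors for each action (an assumed default)
    beta = np.ones(K)

    for _ in range(T):
        # Sample a belief theta_k ~ Beta(alpha_k, beta_k) for every action ...
        theta = rng.beta(alpha, beta)
        # ... and play the action whose sampled belief is largest.
        a = int(np.argmax(theta))
        # Observe a binary reward r ~ Bernoulli(p_a) from the environment.
        r = int(rng.random() < true_p[a])
        # Update only the played action's posterior: success -> alpha, failure -> beta.
        alpha[a] += r
        beta[a] += 1 - r

    return alpha, beta

alpha, beta = beta_thompson_sampling([0.10, 0.45, 0.60], T=2_000, seed=0)
print(alpha / (alpha + beta))  # posterior mean estimate of each p_k
```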
An obvious (but naive) alternative approach is to allocate a fixed fraction of the time steps in the time horizon, say \(\epsilon\), to purely random exploratory actions, while in the remaining periods the agent exploits, choosing the action it believes to be best via Thompson sampling or another exploitation approach. This strategy, called \(\epsilon\)-greedy exploration, is shown in Algorithm 2 with a Thompson sampling exploitation step:
(\(\epsilon\)-greedy exploration)
Inputs the pure exploration parameter \(\epsilon\), the number of time periods in the horizon \(T\), the dimension of the action space \(K = |\mathcal{A}|\), and the initial (prior) values \(\alpha_{1},\dots,\alpha_{K}\) and \(\beta_{1},\dots,\beta_{K}\) for the parameters that appear in the Beta-distribution for each action
Outputs posterior values for the parameters \(\alpha_{1},\dots,\alpha_{K}\) and \(\beta_{1},\dots,\beta_{K}\) that appear in the Beta-distributions for each action
Initialize a Beta-distribution for each action using initial values \((\alpha,\beta)\).
Main
for t \(\in\) 1 to T
if rand() \(<\epsilon\)
Select random action: \(a_{t}\sim\text{Uniform}(\mathcal{A})\)
else
for k \(\in\) 1 to K
sample \(\hat{\theta}_{k} \sim \text{Beta}(\alpha_{k}, \beta_{k})\)
Select action: \(a_{t} = \text{arg}\max_{k}\hat{\theta}_{k}\)
Apply \(a_{t}\) and observe \(r_{t}\).
Update distribution parameters:
\((\alpha_{a_{t}},\beta_{a_{t}})\leftarrow\left(\alpha_{a_{t}}+r_{t},\beta_{a_{t}}+(1-r_{t})\right)\)
Return \((\alpha_{1},\dots,\alpha_{K})\) and \((\beta_{1},\dots,\beta_{K})\)
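For comparison, here is a matching Python sketch of Algorithm 2 on the same kind of simulated environment; the function name `epsilon_greedy_thompson`, the default \(\epsilon=0.1\), and the \(\text{Beta}(1,1)\) priors are illustrative assumptions.

```python
import numpy as np

def epsilon_greedy_thompson(true_p, T, epsilon=0.1, seed=None):
    """Sketch of Algorithm 2: epsilon-greedy exploration with a
    Thompson-sampling exploitation step on a simulated Bernoulli bandit."""
    rng = np.random.default_rng(seed)
    K = len(true_p)
    alpha = np.ones(K)  # Beta(1, 1) priors for each action (an assumed default)
    beta = np.ones(K)

    for _ in range(T):
        if rng.random() < epsilon:
            # Explore: pick an action uniformly at random.
            a = int(rng.integers(K))
        else:
            # Exploit: Thompson-sampling step, as in Algorithm 1.
            a = int(np.argmax(rng.beta(alpha, beta)))
        r = int(rng.random() < true_p[a])  # observe binary reward from the environment
        alpha[a] += r                      # posterior update for the played action only
        beta[a] += 1 - r

    return alpha, beta

alpha, beta = epsilon_greedy_thompson([0.10, 0.45, 0.60], T=2_000, epsilon=0.1, seed=0)
print(alpha / (alpha + beta))  # posterior mean estimate of each p_k
```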
Applications#
Portfolio allocation#
The central theme of modern portfolio theory is balancing risk and reward. Reward is measured as the expected return of a portfolio of assets, while risk is measured as the variance of the logarithmic or fractional return of the portfolio. Thus, the objective of modern portfolio theory is to balance the composition of the portfolio so that an investor obtains a desired expected return while minimizing the variance (risk).
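To make the reward/risk trade-off concrete, the short sketch below computes the expected return \(w^{\top}\mu\) and the variance \(w^{\top}\Sigma w\) of a hypothetical three-asset portfolio; the weights, mean returns, and covariance matrix are invented purely for illustration.

```python
import numpy as np

# Hypothetical inputs (for illustration only).
mu = np.array([0.08, 0.12, 0.05])          # expected annual returns of three assets
Sigma = np.array([[0.040, 0.006, 0.002],   # covariance matrix of asset returns
                  [0.006, 0.090, 0.004],
                  [0.002, 0.004, 0.010]])
w = np.array([0.5, 0.3, 0.2])              # portfolio weights (sum to 1)

expected_return = w @ mu                   # reward: E[r_p] = w' mu
variance = w @ Sigma @ w                   # risk: Var(r_p) = w' Sigma w
print(expected_return, np.sqrt(variance))  # expected return and volatility
```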
Summary#
In this lecture series, we explored bandit problems, and covered different algorithms and strategies for balancing exploration and exploitation:
Binary Bernoulli bandit problems are games that balance exploring random actions versus exploiting knowledge already obtained through interacting with the world. Pure exploration allows agents to construct a comprehensive model of their environment, but likely at the expense of positive reward. On the other hand, pure exploitation has the agent continually choosing the action it currently thinks is best in order to accumulate reward, even though different, better actions could be taken.