Decision-Making - Beating the Bandits

Categories: decision-making, reinforcement-learning

Author: Sean Daly

Published: August 3, 2023

Decisions are hard. We can’t see how things will turn out, and we never have all the information we need to make the best decision right now. However, there are systematic ways to work through the choices we make to give us a better chance of coming out on top.

One way to view things is through the lens of reinforcement learning. This is a flexible framework for decision making that assumes there is some environment out there to act on, and that whatever we decide to do, the environment gives us feedback in the form of a reward or a punishment. Through these interactions, we can learn a better decision-making rule.

There is a little more to the framework, such as the idea of a “state” that can impact the results of our actions, and there are different ways of setting up a model for analysis. But we will set that aside for now in favour of the classic model I will talk about below: the multi-armed bandit. The way I set this up allows for the following types of decisions:

You could apply the above to decisions such as:

In all of the above, we have some number of options, and we keep going back to make these decisions repeatedly. Each option can get better or worse over time (as staff gain experience or leave, say), but we generally don’t know what impact that will have until we see how it plays out. And each of them can have days where they uncharacteristically knock it out of the park, or just do a terrible job.

Besides being genuinely useful, this type of decision is also nice to study because it’s relatively simple, and it demonstrates some key elements of a decision-making strategy, which are the main takeaways from this post (sketched in code just after this list):

  • Exploration v. Exploitation: a really key concept to grasp. Do you explore new restaurants every weekend like an epicurean nomad, or do you decide you’ve found one that’s good enough (exploiting what you already know)? Spoiler: you try to balance both.
  • Bias Towards Recency: Things change all the time - you should change with them.
  • Optimistic Initial Expectations: Believe it or not, until you get some experience you’re better off assuming the best.
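To make these three ideas concrete before we get to the slot machines, here is a minimal sketch of how they can combine into a single decision rule. This is my own illustration with made-up names; the parameter values happen to match the ones the post settles on later.

Code
import numpy as np

rng = np.random.default_rng(0)
n_options = 3
q = np.full(n_options, 1.10)  # optimistic initial expectations
alpha = 0.15                  # bias towards recency: constant step size
epsilon = 0.25                # probability of exploring at random

def choose():
    # Exploration v. Exploitation: mostly pick the best-looking option,
    # but sometimes try one completely at random.
    if rng.uniform() < epsilon:
        return int(rng.integers(n_options))
    return int(np.argmax(q))

def update(option, reward):
    # Recent rewards count for more than old ones.
    q[option] += alpha * (reward - q[option])

We’ll see each of these pieces added one at a time, and measure how much each one helps, in the simulations below.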

The Multi-Armed Bandit Model

The model we will use to demonstrate the idea is called the “multi-armed bandit”1, a classic in reinforcement learning.

Using a gambling machine allows us to bring this model squarely into the realm of probabilities, and it adds a natural system of rewards, i.e. prizes. So say, for example, you walk into a casino and start playing 3 machines. You go from one to the other over the course of 6 turns and get the following prizes.

What do you do next?

Code
import pandas as pd

df = pd.DataFrame(
    {
        "turn": [1, 2, 3, 4, 5, 6],
        "machine_id": [1, 2, 3, 1, 2, 3],
        "prize": [0.10, 1.32, 0.29, 1.18, 1.10, 0.17]
    }
)

# Format table
d = dict(selector="th",
    props=[('text-align', 'center')])

(df.style.hide()
    .format({'prize': '${:,.2f}'})
    .set_properties(**{'width':'10em', 'text-align':'center'})
    .set_table_styles([d]))
Table 1: Results from 6 Turns

turn  machine_id  prize
1     1           $0.10
2     2           $1.32
3     3           $0.29
4     1           $1.18
5     2           $1.10
6     3           $0.17

My setup for the model is as follows:

  • We are in a casino with 3 slot machines (one-armed bandits).
  • We take 500 turns, deciding which machine to play on each turn.
  • It costs $1.00 to play each turn.
  • Expected reward can change over time, but in the long run is $0.99.
  • The reward on a given play is exponentially distributed, with that expected value as its mean.

Modelling the prizes won by a player as an exponential distribution seems appropriate because it will give us many small wins and a few big wins. I thought it would also be good to use a challenging distribution. Models like these can be set up with normal distributions that have small standard deviations, but those are pretty easy to learn: a small standard deviation means you can accurately guess the actual expected value for a machine quite quickly. For the exponential distribution, the standard deviation is equal to the mean. Additionally, the skew of the distribution means there will be more curveball, high-value prizes that have to be accounted for.
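As a quick, illustrative sanity check of that claim (my own snippet, not part of the post’s analysis), drawing a large sample shows the standard deviation landing right on the mean:

Code
import numpy as np

rng = np.random.default_rng(0)
samples = rng.exponential(scale=0.99, size=1_000_000)
print(round(samples.mean(), 2), round(samples.std(), 2))  # both come out close to 0.99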

Below I have an example plot of the exponential distribution with an expected prize of $0.99, with some sample prizes scattered around that. As you can see, most of the turns will give you a prize less than $0.99, but some of the turns will give you much more.

Code
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

# Parameters
rng = np.random.default_rng(7)
mean = 0.99
lambda_param = 1.0 / mean
x = np.linspace(0, 6, 400)  # Generate x values
y = lambda_param * np.exp(-lambda_param * x)  # Exponential probability density function

# Sample 50 points and jitter for dodging
sample_points_x = rng.exponential(mean, 50)
jitter = 0.05  # Adjust this value for more/less jitter
sample_points_y = [rng.uniform(-jitter, jitter) for _ in range(50)]

palette = sns.color_palette()

sns.lineplot(x=x, y=y)
plt.plot([mean, mean], [lambda_param * np.exp(-lambda_param * mean), 0], color=palette[0])
plt.annotate("expected prize", (mean, 0.2), (1.5, 0.5),
    arrowprops=dict(facecolor=palette[0], edgecolor=palette[0],
        arrowstyle="simple,tail_width=0.07,head_width=0.7,head_length=1"))
plt.annotate("random samples", (2.5, 0), (4, 0.3),
    arrowprops=dict(facecolor=palette[0], edgecolor=palette[0],
        arrowstyle="simple,tail_width=0.07,head_width=0.7,head_length=1"))
sns.scatterplot(x=sample_points_x, y=sample_points_y)
# plt.grid(axis='y', color='grey', linestyle='--', linewidth=0.5)
sns.despine(left=True, bottom=True, top=True, right=True)
plt.tick_params(axis='x', which='both', bottom=True, left=True)

formatter = ticker.FuncFormatter(lambda x, pos: f'${x:.0f}')
plt.gca().xaxis.set_major_formatter(formatter)
plt.xlabel('prize')
plt.ylabel('probability density function')
plt.show()

A plot of the exponential distribution probability density overlain by some sample points.

Figure 1: Example of an Exponential Distribution

Machines with an identical distribution of random prizes are not interesting to model, though, because there would be no “good” machine to pick. Whether you pick one machine or hop around, you would have the same expected prize, and the actual prize you win will vary through random chance, which is not predictable.

It’s also not that interesting if the expected value never changes. A static expected value would also limit the applicability of a model like this. In real life, a restaurant (say) will get better or worse over time. If we have a decision making model that doesn’t recognise that, then it’s not that useful in reality.

Instead, we are interested in the cases where different machines have different expected outcomes, each following a random walk with mean reversion. Note that the values being walked here are expected prizes, not the actual prizes handed out.
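To show what I mean by a mean-reverting random walk, here is a stripped-down, single-machine sketch (my own illustration; the variable names are made up and it skips the log-space transformation the real simulation uses):

Code
import numpy as np

rng = np.random.default_rng(0)
mu, theta, sigma = 0.99, 0.01, 0.05  # long-run mean, reversion strength, noise scale
e_prize_path = np.empty(500)
e_prize_path[0] = mu
for t in range(1, 500):
    # Each step drifts back towards the long-run mean, plus a little noise.
    e_prize_path[t] = e_prize_path[t - 1] + theta * (mu - e_prize_path[t - 1]) + sigma * rng.normal()

The full simulation below does this for three machines at once, working in log space (so the expected prizes stay positive) and across 5000 bootstrap iterations. The resulting expected prizes look like this: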

Code
n_iterations = 5000
n_machines = 3
n_turns = 500

sigma = 0.2
cost_per_game = 1
expected_prize = 0.99
starting_value = 100

rng = np.random.default_rng(5)

mean_lin = np.ones(n_machines) * expected_prize
var_lin = np.ones(n_machines) * (sigma**2)

mean_log = np.log(mean_lin**2 / (np.sqrt(mean_lin**2 + var_lin)))
var_log = np.log(1 + var_lin / mean_lin**2) 
cov_log = np.diag(var_log)

# Mean-reverting process
noise_log = rng.multivariate_normal(np.zeros(n_machines), cov_log, size=(n_iterations, n_turns + 200))

e_rewards_log = np.ones_like(noise_log) * mean_log
theta = 0.01  # strength of mean reversion
for i in range(1, n_turns + 200):
    e_rewards_log[:, i, :] = e_rewards_log[:, i - 1, :] + (
        theta * (mean_log - e_rewards_log[:, i - 1, :]) + 0.15 * noise_log[:, i, :]
        )
e_rewards = np.exp(e_rewards_log[:, 200:])  # drop the 200 burn-in steps

# Generate rewards from expected rewards
rewards = rng.exponential(e_rewards)

machine_labels = [str(i + 1) for i in range(rewards.shape[2])]

rewards_df = pd.DataFrame(rewards[0, :, :])
rewards_df.columns = machine_labels
rewards_df["turn"] = list(range(1, rewards.shape[1] + 1))
rewards_df = rewards_df.melt(
    id_vars="turn",
    value_vars=machine_labels,
    value_name="prize",
    var_name="machine",)

e_rewards_df = pd.DataFrame(e_rewards[0, :, :])
e_rewards_df.columns = machine_labels
e_rewards_df["turn"] = list(range(1, e_rewards.shape[1] + 1))
e_rewards_df = e_rewards_df.melt(
    id_vars="turn",
    value_vars=machine_labels,
    value_name="expected_prize",
    var_name="machine",)

rewards_df = rewards_df.merge(e_rewards_df, on=("turn", "machine"), how="left")

# Plot
ax = sns.lineplot(data=rewards_df, x="turn", y="expected_prize", hue="machine")
plt.grid(axis='y', color='#E5E5E5')
sns.despine(left=True, bottom=True, top=True, right=True)
plt.tick_params(axis='x', which='both', bottom=True, left=True)
plt.legend(loc='upper left', bbox_to_anchor=(1, 1), frameon=False)
plt.ylim(0.6, 1.8)

formatter = ticker.FuncFormatter(lambda x, pos: f'${x:.2f}')
plt.gca().yaxis.set_major_formatter(formatter)
plt.show()

A line plot of the expected prizes for each of 3 machines.

Figure 2: Expected Prizes for each of 3 Machines

The actual prizes you would win if you played all the machines at the same time would look as follows:

Code
ax = sns.scatterplot(data=rewards_df, x="turn", y="prize", hue="machine", s=7)
plt.grid(axis='y', color='#E5E5E5')
sns.despine(left=True, bottom=True, top=True, right=True)
plt.tick_params(axis='x', which='both', bottom=True, left=True)
plt.legend(loc='upper left', bbox_to_anchor=(1, 1), frameon=False)
plt.ylim(0.0, 11)

formatter = ticker.FuncFormatter(lambda x, pos: f'${x:.2f}')
plt.gca().yaxis.set_major_formatter(formatter)
plt.show()

A scatter plot of the actual rewards for each of 3 machines.

Figure 3: Actual Rewards for each of 3 Machines

Default Strategies

In order to know whether a strategy is good or bad, we need to create some benchmarks. Here we will use the Oracle strategy, which assumes we have perfect knowledge of the environment, and the Random strategy, which assumes we don’t know anything at all, and won’t try to learn anything either.

The Oracle Strategy

Under this strategy, we assume that we have some oracle that tells us what to expect from each machine, i.e. it knows which machine is the best to play on each turn. However, remember that the expectation is just a parameter of the distribution used to generate prize amounts. The actual prize is generated randomly, so we can’t know it in advance. So, even the oracle will not get the maximum prize on each turn.

The expected return to the oracle on each turn looks like this, i.e. it always plays the machine with the highest expected prize:

Code
oracle_e_rewards = np.max(e_rewards[0], axis=1)
turns = list(range(1, len(oracle_e_rewards)+1))

ax = sns.lineplot(data=rewards_df, x="turn", y="expected_prize", hue="machine", alpha=0.3)
sns.lineplot(x=turns, y=oracle_e_rewards, color=palette[4], label="oracle")

plt.grid(axis='y', color='#E5E5E5')
sns.despine(left=True, bottom=True, top=True, right=True)
plt.tick_params(axis='x', which='both', bottom=True, left=True)
plt.legend(loc='upper left', bbox_to_anchor=(1, 1), frameon=False)
plt.tight_layout(rect=[0, 0, 1, 1])
plt.ylim(0.6, 1.8)

formatter = ticker.FuncFormatter(lambda x, pos: f'${x:.2f}')
plt.gca().yaxis.set_major_formatter(formatter)
plt.show()

A line plot of the expected prizes under the Oracle Strategy.

Figure 4: Expected Prizes under the Oracle Strategy

The Random Strategy

Another strategy we could use is to pick a machine to play at random in each round. This might seem like an ok way to play the game, but remember, each game has a cost associated with it of $1.00, and an expected reward of $0.99. We should expect that a player following this strategy will not quite get back the money they spend on the machine.

Code
random_choice = rng.integers(0, n_machines, (n_turns))
random_choice = np.expand_dims(random_choice, 1)
random_e_rewards = np.take_along_axis(e_rewards[0], random_choice, axis=1)
random_e_rewards = np.squeeze(random_e_rewards)

# random_e_rewards = rng.choice(e_rewards[0], axis=1)
turns = list(range(1, len(oracle_e_rewards)+1))

ax = sns.lineplot(data=rewards_df, x="turn", y="expected_prize", hue="machine", alpha=0.3)
sns.lineplot(x=turns, y=random_e_rewards, color=palette[5], label="random")

plt.grid(axis='y', color='#E5E5E5')
sns.despine(left=True, bottom=True, top=True, right=True)
plt.tick_params(axis='x', which='both', bottom=True, left=True)
plt.legend(loc='upper left', bbox_to_anchor=(1, 1), frameon=False)
plt.tight_layout(rect=[0, 0, 1, 1])
plt.ylim(0.6, 1.8)

formatter = ticker.FuncFormatter(lambda x, pos: f'${x:.2f}')
plt.gca().yaxis.set_major_formatter(formatter)
plt.show()

A line plot of the expected prizes under the Random Strategy.

Figure 5: Expected Prizes under the Random Strategy

The plots above all track a single game, where a player sits down and takes 500 turns. To see how effective our strategies are, we will instead look at the average prize per turn in a Monte Carlo/Bootstrap simulation with 5000 iterations:

Code
# Oracle Strategy
oracle_choice = np.argmax(e_rewards, axis=2, keepdims=True)
oracle_rewards = np.take_along_axis(rewards, oracle_choice, axis=2)
oracle_rewards = np.squeeze(oracle_rewards)

turns = np.array(list(range(1, n_turns+1)))
average_oracle_rewards = np.cumsum(oracle_rewards, axis=1) / turns
average_oracle_rewards = np.mean(average_oracle_rewards, axis=0)

# Random Strategy
random_choice = rng.integers(0, n_machines, (n_iterations, n_turns))
random_choice = np.expand_dims(random_choice, 2)
random_rewards = np.take_along_axis(rewards, random_choice, axis=2)
random_rewards = np.squeeze(random_rewards)

turns = np.array(list(range(1, random_rewards.shape[1]+1)))
average_random_rewards = np.cumsum(random_rewards, axis=1) / turns
average_random_rewards = np.mean(average_random_rewards, axis=0)

# Plotting
sns.lineplot(x=turns, y=average_oracle_rewards, color=palette[4], label="oracle")
sns.lineplot(x=turns, y=average_random_rewards, color=palette[5], label="random")

plt.grid(axis='y', color='#E5E5E5')
sns.despine(left=True, bottom=True, top=True, right=True)
plt.tick_params(axis='x', which='both', bottom=True, left=True)
plt.legend(loc='upper left', bbox_to_anchor=(1, 1), frameon=False)
plt.tight_layout(rect=[0, 0, 0.98, 1])
plt.ylim(0.6, 1.8)
plt.xlabel("turn")
plt.ylabel("average_prize")

formatter = ticker.FuncFormatter(lambda x, pos: f'${x:.2f}')
plt.gca().yaxis.set_major_formatter(formatter)
plt.show()

Line plot of the running average of prizes won by the Oracle and Random Strategies over 5000 iterations.

Figure 6: Running Average of Prizes won by Oracle and Random Strategies over 5000 iterations

As expected, the random strategy just about fails to justify the $1.00 price to play the game. On average, it hands out a prize of just less than that.

Some Simple Strategies

Now, we will look at some simple rules that can be used to play the game.

Exploitation

Here, “exploitation” refers to exploiting the knowledge that we have. The way this one will work is that we will play each machine once. After that, we will assume that the best machine we saw in the initial round will be the best for the rest of the game. Let’s see what happens when we stick to our guns.

Code
def strategy(epsilon: float = 0.0, alpha: float | None = None, initial_value: float | None = None):
    """Runs a Strategy over a Bootstrap/Monte Carlo of rewards.

    Inputs:
    -------
    epsilon: float
        The probability that the user will try a different machine at random in a given game.
    alpha: float
        The step size parameter used for exponential weighted averaging. A value of None
        means that the mean is used.
    initial_value: float
        The prize expected by the player for each machine at the beginning of the game.
    """
    # Initialize vectors
    running_value_estimate = np.zeros((n_iterations, n_turns, n_machines))
    running_count = np.zeros((n_iterations, n_turns, n_machines), dtype=int)
    strategy_rewards = np.zeros((n_iterations, n_turns))
    running_selection = np.zeros((n_iterations, n_turns), dtype=int)
    selection = np.zeros(n_iterations, dtype=int)[:, None]

    # Instantiate all random variables up front
    random_selection = rng.integers(low=0, high=n_machines, size=(n_iterations, n_turns))
    random_explore = rng.uniform(0, 1, size=(n_iterations, n_turns))
    for i in range(n_turns):
        if i < n_machines:
            # Try all machines once.
            selection = np.array([i]*n_iterations)[:, None]
        else:
            # Explore with some probability epsilon
            explore = random_explore[:, i] < epsilon
            selection[explore] = random_selection[explore, i][:, None]
            # Otherwise, use greedy selection (select machine thought most valuable)
            selection[~explore] = np.argmax(running_value_estimate[~explore, i-1, :], axis=1)[:, None]

        running_selection[:, i] = selection[:, 0]

        strategy_rewards[:, i] = np.take_along_axis(rewards[:, i, :], selection, axis=1)[:, 0]

        if i > 0:
            running_count[:, i, :] = running_count[:, i - 1, :]
        update_count = np.zeros((n_iterations, n_machines))
        np.put_along_axis(update_count, selection, 1, axis = 1)
        running_count[:, i, :] = running_count[:, i, :] + update_count

        if i < n_machines and initial_value is None:
            # If initial_value is None, start with initial value observed in machines.
            # NOTE: initial iterations could be randomized, but iterating along machines
            # 1, 2, 3, ... is random enough for this exercise.
            if i > 0:
                # Carry forward the estimates for machines already tried.
                running_value_estimate[:, i, :] = running_value_estimate[:, i - 1, :]
            np.put_along_axis(running_value_estimate[:, i, :], selection, strategy_rewards[:, i][:, None], axis=1)
        else:
            if i == 0 and initial_value is not None:
                # If there is an initial_value, start with that.
                running_value_estimate[:, i, :] = initial_value
            else:
                running_value_estimate[:, i, :] = running_value_estimate[:, i - 1, :]

            if alpha is not None:
                # Exponential Weight Decay
                step_size = alpha
            else:
                # Incremental Mean Update
                step_size = 1/np.take_along_axis(running_count[:, i, :], selection, axis=1) 
            
            update_flat = (
                step_size
                * (
                    strategy_rewards[:, i][:, None] 
                    - np.take_along_axis(running_value_estimate[:, i, :], selection, axis=1))
            )
            update = np.zeros((n_iterations, n_machines))
            np.put_along_axis(update, selection, update_flat, axis=1)
            running_value_estimate[:, i, :] = running_value_estimate[:, i, :] + update

    return running_value_estimate, running_count, strategy_rewards, running_selection

exploitation_results = strategy(epsilon=0.0)
average_exploitation_rewards = np.mean(exploitation_results[2], axis=0)
# Plotting
sns.lineplot(x=turns, y=average_exploitation_rewards, color=palette[6], label="exploitation")
sns.lineplot(x=turns, y=average_oracle_rewards, color=palette[4], label="oracle")
sns.lineplot(x=turns, y=average_random_rewards, color=palette[5], label="random")

plt.grid(axis='y', color='#E5E5E5')
sns.despine(left=True, bottom=True, top=True, right=True)
plt.tick_params(axis='x', which='both', bottom=True, left=True)
plt.legend(loc='upper left', bbox_to_anchor=(1, 1), frameon=False)
plt.tight_layout(rect=[0, 0, 1, 1])
plt.ylim(0.9, 1.2)
plt.xlabel("turn")
plt.ylabel("average_prize")

formatter = ticker.FuncFormatter(lambda x, pos: f'${x:.2f}')
plt.gca().yaxis.set_major_formatter(formatter)
plt.show()

Line plot of the running average of prizes won by the Exploitation Strategy over 5000 iterations.

Figure 7: Running Average of Prizes won by the Exploitation Strategy over 5000 iterations

This turns out to be a bad strategy, no better than random chance (in expectation) in our example. Remember, the actual prizes in this game are randomly generated from a distribution around the expected value. The player might be more likely to see a better prize from the better machine, but because they did not play the other machines again, they never found out if one of the others might have been better after all.

Exploration

In general, we will still want to exploit the information we have found out about the machines in the game. We know that picking a random machine on every turn is a strategy that will lose in the long term. But we still need to check the other machines from time to time, to see if we’ve maybe underestimated them. This leads us to the idea of exploration. With some probability, instead of picking the machine we think is best, we will pick one completely at random.

Code
exploration_results = strategy(epsilon=0.2)
average_exploration_rewards = np.mean(exploration_results[2], axis=0)
# Plotting
sns.lineplot(x=turns, y=average_exploitation_rewards, color=palette[6], label="exploitation", alpha=0.3)
sns.lineplot(x=turns, y=average_exploration_rewards, color=palette[6], label="exploration")
sns.lineplot(x=turns, y=average_oracle_rewards, color=palette[4], label="oracle")
sns.lineplot(x=turns, y=average_random_rewards, color=palette[5], label="random")

plt.grid(axis='y', color='#E5E5E5')
sns.despine(left=True, bottom=True, top=True, right=True)
plt.tick_params(axis='x', which='both', bottom=True, left=True)
plt.legend(loc='upper left', bbox_to_anchor=(1, 1), frameon=False)
plt.tight_layout(rect=[0, 0, 1, 1])
plt.ylim(0.9, 1.2)
plt.xlabel("turn")
plt.ylabel("average_prize")

formatter = ticker.FuncFormatter(lambda x, pos: f'${x:.2f}')
plt.gca().yaxis.set_major_formatter(formatter)
plt.show()

Line plot of the running average of prizes won by the Exploration Strategy over 5000 iterations.

Figure 8: Running Average of Prizes won by the Exploration Strategy over 5000 iterations

Okay! Now we’re starting to see some progress! We’ve hit on a way to finally start making money playing these machines (in expectation at least). But there’s something weird going on: our success gets worse as the game goes on.

What’s happening here is a result of the fact that the expected prize given out by each machine is not constant over time. In fact, I’ve set the machines up so that, over a long enough time frame, they vary around our expected prize value of $0.99. Because our rule takes the average over all of a machine’s turns, when we are on turn 500 we are still weighting the prize we got on turn 1 the same as the prize we got on turn 499.

Recency Bias

In order to combat this, what we want to do (in this particular situation) is place more weight on recent observations. This will allow us to ride the wave when one machine is outperforming the others. We will do this using exponentially weighted averaging, with a step-size parameter that gives recent observations a little more weight than a plain running mean would. In this way, the ability of early wins to influence our later decision making is reduced further and further with every turn.
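Concretely, the only change is the rule we use to update our estimate of a machine’s value. Here is a minimal sketch of the two alternatives (my own illustration; the function names are made up, but the second update is what the alpha parameter in the strategy function above does):

Code
def mean_update(estimate, reward, count):
    # Incremental sample mean: every past reward carries equal weight.
    return estimate + (reward - estimate) / count

def exp_weighted_update(estimate, reward, alpha=0.15):
    # Constant step size: a reward seen k turns ago is down-weighted by a
    # factor of (1 - alpha)**k relative to the most recent one.
    return estimate + alpha * (reward - estimate)

With alpha=0.15, a prize observed 10 turns ago carries only about (1 - 0.15)**10 ≈ 20% of the weight given to the most recent one.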

Code
exp_weighting_results = strategy(epsilon=0.25, alpha=0.15)
average_exp_weighting_rewards = np.mean(exp_weighting_results[2], axis=0)
# Plotting
sns.lineplot(x=turns, y=average_exploitation_rewards, color=palette[6], label="exploitation", alpha=0.3)
sns.lineplot(x=turns, y=average_exploration_rewards, color=palette[6], label="exploration", alpha=0.3)
sns.lineplot(x=turns, y=average_exp_weighting_rewards, color=palette[6], label="exp_weighting")
sns.lineplot(x=turns, y=average_oracle_rewards, color=palette[4], label="oracle")
sns.lineplot(x=turns, y=average_random_rewards, color=palette[5], label="random")

plt.grid(axis='y', color='#E5E5E5')
sns.despine(left=True, bottom=True, top=True, right=True)
plt.tick_params(axis='x', which='both', bottom=True, left=True)
plt.legend(loc='upper left', bbox_to_anchor=(1, 1), frameon=False)
plt.tight_layout(rect=[0, 0, 1, 1])
plt.ylim(0.9, 1.2)
plt.xlabel("turn")
plt.ylabel("average_prize")

formatter = ticker.FuncFormatter(lambda x, pos: f'${x:.2f}')
plt.gca().yaxis.set_major_formatter(formatter)
plt.show()

Line plot of the running average of prizes won by the Exploration Strategy with exponentially weighted averaging over 5000 iterations.

Figure 9: Running Average of Prizes won by the Exploration Strategy with exponentially weighted averaging over 5000 iterations

Better again! But, we have a period at the start of the game where it takes us about 50 turns to “get up to speed”. Let’s see if we can do something about that.

Optimistic Starting Conditions

To try and improve performance in the early turns, we will start by optimistically assuming that each machine will hand out an expected $1.10 on each round. This seems to work, but admittedly, the gain is not as impressive as the gains from the earlier changes.

Code
optimistic_results = strategy(epsilon=0.25, alpha=0.15, initial_value=1.1)
average_optimistic_rewards = np.mean(optimistic_results[2], axis=0)
# Plotting
sns.lineplot(x=turns, y=average_exploitation_rewards, color=palette[6], label="exploitation", alpha=0.3)
sns.lineplot(x=turns, y=average_exploration_rewards, color=palette[6], label="exploration", alpha=0.3)
sns.lineplot(x=turns, y=average_exp_weighting_rewards, color=palette[6], label="exp_weighting", alpha=0.3)
sns.lineplot(x=turns, y=average_optimistic_rewards, color=palette[6], label="optimistic")
sns.lineplot(x=turns, y=average_oracle_rewards, color=palette[4], label="oracle")
sns.lineplot(x=turns, y=average_random_rewards, color=palette[5], label="random")

plt.grid(axis='y', color='#E5E5E5')
sns.despine(left=True, bottom=True, top=True, right=True)
plt.tick_params(axis='x', which='both', bottom=True, left=True)
plt.legend(loc='upper left', bbox_to_anchor=(1, 1), frameon=False)
plt.tight_layout(rect=[0, 0, 1, 1])
plt.ylim(0.9, 1.2)
plt.xlabel("turn")
plt.ylabel("average_prize")

formatter = ticker.FuncFormatter(lambda x, pos: f'${x:.2f}')
plt.gca().yaxis.set_major_formatter(formatter)
plt.show()

Line plot of the running average of prizes won by the Exploration Strategy with exponentially weighted averaging and optimistic initial expectations over 5000 iterations.

Figure 10: Running Average of Prizes won by the Exploration Strategy with exponentially weighted averaging and optimistic initial expectations over 5000 iterations

A Note on Risk

So far, we have looked at the average return to each strategy. But the average is only one part of the story. To see how we might actually end up, we will model the paths of 5000 games, each running through 500 turns. We can then plot how players are likely to finish up if they start with $100 and follow each of these strategies. The charts below are path dependent, so if a player runs out of money, they’re out of the game.

Code
def final_stats(rewards: np.ndarray, initial_holding: float = 100, cost_per_play: float = 1.0):
    # Track the player's bankroll turn by turn; a player whose holdings hit zero is out of the game.
    holdings = np.cumsum(rewards - cost_per_play, axis=1) + initial_holding
    final_holdings = holdings[:, -1]
    final_holdings[np.any(holdings <= 0, axis=1)] = 0

    sorted_final_holdings = np.sort(final_holdings)
    cdf_prob = np.arange(1, len(sorted_final_holdings) + 1) / len(sorted_final_holdings)
    return sorted_final_holdings, cdf_prob

stats_oracle = final_stats(oracle_rewards)
stats_random = final_stats(random_rewards)
stats_optimistic = final_stats(optimistic_results[2])

plt.axvline(x=100, color=palette[0])
sns.lineplot(x=stats_oracle[0], y=stats_oracle[1], color=palette[4], label="oracle")
sns.lineplot(x=stats_random[0], y=stats_random[1], color=palette[5], label="random")
sns.lineplot(x=stats_optimistic[0], y=stats_optimistic[1], color=palette[6], label="optimistic")

plt.grid(axis='y', color='#E5E5E5')
sns.despine(left=True, bottom=True, top=True, right=True)
plt.tick_params(axis='x', which='both', bottom=True, left=True)
plt.legend(loc='upper left', bbox_to_anchor=(1, 1), frameon=False)
plt.tight_layout(rect=[0, 0, 1, 1])
plt.ylim(0, 1)
plt.xlabel("final_prize_money")
plt.ylabel("continuous_density_function")

plt.annotate("breakeven", (100, 0.1), (300, 0.18),
    arrowprops=dict(facecolor=palette[0], edgecolor=palette[0],
        arrowstyle="simple,tail_width=0.07,head_width=0.7,head_length=1"))
formatter_perc = ticker.FuncFormatter(lambda x, pos: f'{x:.0%}')
formatter_dollar = ticker.FuncFormatter(lambda x, pos: f'${x:.0f}')
plt.gca().yaxis.set_major_formatter(formatter_perc)
plt.gca().xaxis.set_major_formatter(formatter_dollar)
plt.show()

Cumulative distribution function for 5000 iterations of the Exploration Strategy with exponentially weighted averaging and optimistic initial expectations, compared with the Oracle and Random Strategies.

Figure 11: Cumulative Distribution Function for 5000 iterations of the Exploration Strategy with exponentially weighted averaging and optimistic initial expectations

The plots above tell us that we would have lost money over 55% of the time if we randomly picked a machine to play on each turn. Following the strategy we built up, with exploitation and exploration, exponential weighting, and optimistic initial expectations, we would have lost money less than 30% of the time. Due to randomness, even the oracle strategy would have resulted in losses about 5% of the time.

Summary

Hopefully this was at least a little interesting and provided some useful insights into decision making. To recap, we looked at a specific type of decision: one we make repeatedly, where we only see the reward we get for the choices we make. Under these conditions, we can use the concepts of exploration, a bias towards more recent observations, and an initially optimistic view of expectations. Operating under uncertainty, where you never get to know what the oracle strategy might be, concepts like these help us to maximize our return.

Further Reading

There is a lot more to explore in reinforcement learning. First of all, there is the concept of a “state”. This allows us to incorporate more information into a decision, e.g. “I’d like Mexican food tonight” would make a difference in selecting a restaurant. There are also methods in machine learning to allow for planning, so you can make more complicated decisions, e.g. deciding which move to make next in a game of chess, and methods that allow us to learn by example, e.g. by watching someone else play chess.

For a more detailed exploration of the topic, I highly recommend the foundational Reinforcement Learning: An Introduction by Sutton and Barto.

Footnotes

  1. The name comes from the fact that older mechanical slot machines had an arm on the side to make the machine work, and over the long run they will take all your money.↩︎