Multi-armed Bandits
Preface
The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions. Evaluative feedback depends entirely on the action taken, whereas instructive feedback is independent of the action taken.
Problem Statement
Consider the following learning problem. You are faced repeatedly with a choice among n different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.
According to Wikipedia:
The multi-armed bandit problem models an agent that simultaneously attempts to acquire new knowledge (called "exploration") and optimize his or her decisions based on existing knowledge (called "exploitation"). The agent attempts to balance these competing tasks in order to maximize his total value over the period of time considered. There are many practical applications of the bandit model, for example:
clinical trials investigating the effects of different experimental treatments while minimizing patient losses,
adaptive routing efforts for minimizing delays in a network,
financial portfolio design.
Balancing Methods
In any specific case, whether it is better to explore or exploit depends in a complex way on the precise values of the estimates, the uncertainties, and the number of remaining steps. This means both exploration and exploitation are important, and we need to balance them well. There are many sophisticated methods for balancing exploration and exploitation, but most of them make strong assumptions about stationarity and prior knowledge that are either violated or impossible to verify in applications.
Action-Value Methods
We denote the true (actual) value of action $a$ as $q_*(a)$, and the estimated value on the $t$-th time step as $Q_t(a)$. Recall that the true value of an action is the mean reward received when that action is selected. One natural way to estimate this is by averaging the rewards actually received when the action was selected. In other words, if by the $t$-th time step action $a$ has been chosen $N_t(a)$ times prior to $t$, yielding rewards $R_1, R_2, \ldots, R_{N_t(a)}$, then its value is estimated to be

$$Q_t(a) = \frac{R_1 + R_2 + \cdots + R_{N_t(a)}}{N_t(a)}.$$
If $N_t(a) = 0$, then we define $Q_t(a)$ instead as some default value, such as $Q_1(a) = 0$. As $N_t(a) \to \infty$, by the law of large numbers, $Q_t(a)$ converges to $q_*(a)$.
The simplest action selection rule is to select the action (or one of the actions) with the highest estimated action value. The greedy action selection method can be written as

$$A_t = \underset{a}{\operatorname{argmax}}\; Q_t(a).$$
A simple alternative is to behave greedily most of the time, but every once in a while, say with small probability $\varepsilon$, instead select randomly from among all the actions with equal probability, independently of the action-value estimates. We call methods using this near-greedy action selection rule $\varepsilon$-greedy methods. An advantage of these methods is that, in the limit as the number of plays increases, every action will be sampled an infinite number of times, guaranteeing that $N_t(a) \to \infty$ for all $a$, and thus ensuring that all the $Q_t(a)$ converge to $q_*(a)$.
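As a concrete illustration, here is a minimal sketch of $\varepsilon$-greedy selection with sample-average estimates, assuming a hypothetical 10-armed Gaussian testbed; the function and variable names are my own, not from the book.

```python
import random

def epsilon_greedy(true_values, epsilon=0.1, steps=1000):
    """epsilon-greedy action selection with sample-average value estimates."""
    n = len(true_values)
    Q = [0.0] * n            # estimated values Q_t(a), default 0
    N = [0] * n              # times each action has been selected
    sums = [0.0] * n         # sum of rewards received for each action
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(n)                 # explore uniformly at random
        else:
            a = max(range(n), key=lambda i: Q[i])   # exploit the greedy action
        r = random.gauss(true_values[a], 1.0)       # reward drawn around q_*(a)
        N[a] += 1
        sums[a] += r
        Q[a] = sums[a] / N[a]                       # sample-average estimate
    return Q

# Hypothetical 10-armed testbed with true values drawn from N(0, 1)
q_star = [random.gauss(0.0, 1.0) for _ in range(10)]
estimates = epsilon_greedy(q_star, epsilon=0.1)
```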
Although the estimates are guaranteed to converge in theory, the cost of this exploration can be substantial in practice, especially in nonstationary environments, where the true values of the actions change over time. Even if the underlying task is stationary and deterministic, the learner faces a set of bandit-like decision tasks each of which changes over time due to the learning process itself. Therefore, a more practical method is needed.
Incremental Implementation
However, with the straightforward implementation mentioned above, the memory and computational requirements grow over time without bound. Luckily, this is not really necessary: we can devise incremental update formulas for computing averages with small, constant computation required to process each new reward. For some action, let $Q_k$ denote the estimate for its $k$-th reward, that is, the average of its first $k-1$ rewards. Given this average and a $k$-th reward for the action, $R_k$, the average of all $k$ rewards can be computed by

$$Q_{k+1} = Q_k + \frac{1}{k}\left[R_k - Q_k\right],$$
which holds even for $k = 1$, obtaining $Q_2 = R_1$ for arbitrary $Q_1$. This implementation requires memory only for $Q_k$ and $k$, and only the small computation above for each new reward.
The update rule is of a form that occurs frequently throughout this book. The general form is

$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\left[\text{Target} - \text{OldEstimate}\right].$$
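A small sketch of the incremental update in this general form; the function name and the example rewards are illustrative only.

```python
def update_average(q_k, r_k, k):
    """Incremental sample average: Q_{k+1} = Q_k + (1/k) * (R_k - Q_k)."""
    return q_k + (r_k - q_k) / k

# NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)
q = 0.0                                    # arbitrary Q_1; overwritten by the first reward
for k, reward in enumerate([1.2, 0.8, 1.5], start=1):
    q = update_average(q, reward, k)       # after k = 1, q equals the first reward
```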
Tracking a Nonstationary Problem
The averaging methods discussed so far are appropriate in a stationary environment, but not if the bandit is changing over time. One of the most popular ways of handling nonstationary problems is to use a constant step-size parameter. For example, the incremental update rule above, for updating an average $Q_k$ of the $k-1$ past rewards, is modified to be

$$Q_{k+1} = Q_k + \alpha\left[R_k - Q_k\right],$$
where the step-size parameter $\alpha \in (0, 1]$ is constant. This results in $Q_{k+1}$ being a weighted average of the past rewards and the initial estimate $Q_1$:

$$Q_{k+1} = (1-\alpha)^k Q_1 + \sum_{i=1}^{k} \alpha (1-\alpha)^{k-i} R_i.$$
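To see where this weighted average comes from, one can unroll the recursion (a standard expansion, reproduced here for completeness):

$$\begin{aligned}
Q_{k+1} &= Q_k + \alpha\left[R_k - Q_k\right] \\
&= \alpha R_k + (1-\alpha) Q_k \\
&= \alpha R_k + (1-\alpha)\left[\alpha R_{k-1} + (1-\alpha) Q_{k-1}\right] \\
&\;\;\vdots \\
&= (1-\alpha)^k Q_1 + \sum_{i=1}^{k} \alpha (1-\alpha)^{k-i} R_i.
\end{aligned}$$

The weights sum to one, $(1-\alpha)^k + \sum_{i=1}^{k} \alpha(1-\alpha)^{k-i} = 1$, and decay exponentially with how long ago a reward was received, which is why this is sometimes called an exponential recency-weighted average.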
Let $\alpha_k(a)$ denote the step-size parameter used to process the reward received after the $k$-th selection of action $a$. To assure convergence with probability 1, the step sizes must satisfy the conditions

$$\sum_{k=1}^{\infty} \alpha_k(a) = \infty \quad \text{and} \quad \sum_{k=1}^{\infty} \alpha_k^2(a) < \infty.$$
The first condition is required to guarantee that the steps are large enough to eventually overcome any initial conditions or random fluctuations. The second condition guarantees that eventually the steps become small enough to assure convergence. Note that both conditions hold for the sample-average case, $\alpha_k(a) = 1/k$, but not for a constant step-size parameter; with constant $\alpha$ the estimates never completely converge but continue to respond to the most recent rewards, which is exactly what we want in a nonstationary environment.
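As a rough sketch of what constant-$\alpha$ tracking looks like in code, assuming a hypothetical single action whose true value drifts as a slow random walk (the drift model and names are my own illustrative choices):

```python
import random

def constant_alpha_tracking(steps=10000, alpha=0.1):
    """Track a single drifting action value with a constant step-size."""
    q_true = 0.0        # true value, which drifts over time (nonstationary)
    q_est = 0.0         # exponential recency-weighted average estimate
    for _ in range(steps):
        q_true += random.gauss(0.0, 0.01)      # slow random-walk drift
        r = random.gauss(q_true, 1.0)          # observed reward
        q_est += alpha * (r - q_est)           # Q_{k+1} = Q_k + alpha * (R_k - Q_k)
    return q_true, q_est
```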
Optimistic Initial Values
This method applies a wildly optimistic initial estimate to encourage exploration. Suppose that instead of setting the initial action values to zero, as we did in the 10-armed testbed, we set them all to $+5$. Recall that the $q_*(a)$ in this problem are selected from a normal distribution with mean 0 and variance 1. An initial estimate of $+5$ is thus wildly optimistic. But this optimism encourages action-value methods to explore. Whichever actions are initially selected, the reward is less than the starting estimates; the learner switches to other actions, being “disappointed” with the rewards it is receiving. The result is that all actions are tried several times before the value estimates converge. The system does a fair amount of exploration even if greedy actions are selected all the time.
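A minimal sketch of optimistic initialization, assuming the same Gaussian testbed as in the earlier examples; only the initialization (and purely greedy selection) differs:

```python
import random

def optimistic_greedy(true_values, q0=5.0, alpha=0.1, steps=1000):
    """Greedy selection with optimistic initial estimates and constant step-size."""
    n = len(true_values)
    Q = [q0] * n                                  # wildly optimistic Q_1(a) = +5
    for _ in range(steps):
        a = max(range(n), key=lambda i: Q[i])     # purely greedy selection
        r = random.gauss(true_values[a], 1.0)     # early rewards "disappoint"
        Q[a] += alpha * (r - Q[a])                # constant step-size update
    return Q
```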
However, it is not well suited to nonstationary problems because its drive for exploration is inherently temporary. If the task changes, creating a renewed need for exploration, this method cannot help. Indeed, any method that focuses on the initial state in any special way is unlikely to help with the general nonstationary case. The beginning of time occurs only once, and thus we should not focus on it too much. This criticism applies as well to the sample-average methods, which also treat the beginning of time as a special event, averaging all subsequent rewards with equal weights. Nevertheless, all of these methods are very simple, and one of them or some simple combination of them is often adequate in practice.
Upper-Confidence-Bound Action Selection
Now let's come back to the action selection strategy. In practice, $\varepsilon$-greedy action selection is usually more effective than purely greedy selection, but when it explores it chooses equally among all actions, showing no preference for those that are nearly greedy or particularly uncertain. It would be better to select among the non-greedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates. One effective way of doing this is to select actions as

$$A_t = \underset{a}{\operatorname{argmax}}\left[Q_t(a) + c\,\sqrt{\frac{\ln t}{N_t(a)}}\right],$$
where $\ln t$ denotes the natural logarithm of $t$ (the number that $e \approx 2.71828$ would have to be raised to in order to equal $t$), $N_t(a)$ is the number of times action $a$ has been selected prior to $t$, and the number $c > 0$ controls the degree of exploration. If $N_t(a) = 0$, then $a$ is considered to be a maximizing action.
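A possible sketch of UCB selection under the same testbed assumptions as before; the names and the default $c$ are illustrative choices.

```python
import math
import random

def ucb_select(Q, N, t, c=2.0):
    """Pick the action maximizing Q_t(a) + c * sqrt(ln t / N_t(a))."""
    for a, count in enumerate(N):
        if count == 0:
            return a                              # untried actions are maximizing
    return max(range(len(Q)),
               key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))

def run_ucb(true_values, steps=1000, c=2.0):
    n = len(true_values)
    Q, N = [0.0] * n, [0] * n
    for t in range(1, steps + 1):
        a = ucb_select(Q, N, t, c)
        r = random.gauss(true_values[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                 # incremental sample-average update
    return Q, N
```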
Gradient Bandits
The following proof is so beautiful that I have copied it almost completely from the book, making only some adjustments so the document fits the markdown grammar and GitBook typesetting better.
In this section we consider learning a numerical preference $H_t(a)$ for each action $a$. The larger the preference, the more often that action is taken, but the preference has no interpretation in terms of reward. Only the relative preference of one action over another is important; if we add 1000 to all the preferences there is no effect on the action probabilities, which are determined according to a soft-max distribution (i.e., Gibbs or Boltzmann distribution) as follows:

$$\Pr\{A_t = a\} = \frac{e^{H_t(a)}}{\sum_{b=1}^{n} e^{H_t(b)}} = \pi_t(a),$$
where here we have also introduced a useful new notation, $\pi_t(a)$, for the probability of taking action $a$ at time $t$. Initially all preferences are the same (e.g., $H_1(a) = 0$ for all $a$) so that all actions have an equal probability of being selected.
On each step, after selecting the action $A_t$ and receiving the reward $R_t$, the preferences are updated by:

$$H_{t+1}(A_t) = H_t(A_t) + \alpha\left(R_t - \bar{R}_t\right)\left(1 - \pi_t(A_t)\right), \quad \text{and}$$
$$H_{t+1}(a) = H_t(a) - \alpha\left(R_t - \bar{R}_t\right)\pi_t(a), \quad \text{for all } a \neq A_t,$$
where $\alpha > 0$ is a step-size parameter, and $\bar{R}_t \in \mathbb{R}$ is the average of all the rewards up through and including time $t$.
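A minimal sketch of the gradient-bandit algorithm, again assuming the Gaussian testbed; the incremental baseline and the function names are my own choices.

```python
import math
import random

def gradient_bandit(true_values, alpha=0.1, steps=1000):
    """Gradient bandit: soft-max over preferences H, baseline = average reward."""
    n = len(true_values)
    H = [0.0] * n                                  # preferences, H_1(a) = 0
    baseline = 0.0                                 # running average reward, R_bar
    for t in range(1, steps + 1):
        exp_h = [math.exp(h) for h in H]
        total = sum(exp_h)
        pi = [e / total for e in exp_h]            # soft-max action probabilities
        a = random.choices(range(n), weights=pi)[0]
        r = random.gauss(true_values[a], 1.0)
        baseline += (r - baseline) / t             # average of rewards through time t
        for b in range(n):                         # preference update
            if b == a:
                H[b] += alpha * (r - baseline) * (1 - pi[b])
            else:
                H[b] -= alpha * (r - baseline) * pi[b]
    return H
```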
To give a deeper insight into this algorithm, we can understand it as a stochastic approximation to gradient ascent. In exact gradient ascent, each preference $H_t(a)$ would be incremented proportionally to the increment's effect on performance:

$$H_{t+1}(a) = H_t(a) + \alpha\, \frac{\partial\, \mathbb{E}[R_t]}{\partial H_t(a)},$$
where the measure of performance here is the expected reward:

$$\mathbb{E}[R_t] = \sum_b \pi_t(b)\, q_*(b),$$

and the measure of the increment's effect is the partial derivative of this performance measure with respect to the preference.
Of course, it is not possible to implement gradient ascent exactly in our case because by assumption we do not know the $q_*(b)$, but in fact the expected update of our algorithm is equal to this exact gradient, making the algorithm an instance of stochastic gradient ascent.
The calculations showing this require only beginning calculus, but take several steps. If you are mathematically inclined, then you will enjoy the rest of this section in which we go through these steps. First we take a closer look at the exact performance gradient:

$$\frac{\partial\, \mathbb{E}[R_t]}{\partial H_t(a)}
= \frac{\partial}{\partial H_t(a)}\left[\sum_b \pi_t(b)\, q_*(b)\right]
= \sum_b q_*(b)\, \frac{\partial \pi_t(b)}{\partial H_t(a)}
= \sum_b \left(q_*(b) - X_t\right) \frac{\partial \pi_t(b)}{\partial H_t(a)},$$
where $X_t$ can be any scalar that does not depend on $b$. We can include it here because the gradient sums to zero over all the actions, $\sum_b \frac{\partial \pi_t(b)}{\partial H_t(a)} = 0$. As $H_t(a)$ is changed, some actions' probabilities go up and some go down, but the sum of the changes must be zero because the sum of the probabilities must remain one.
The equation is now in the form of an expectation, summing over all possible values $b$ of the random variable $A_t$, then multiplying by the probability of taking those values. Thus:

$$\frac{\partial\, \mathbb{E}[R_t]}{\partial H_t(a)}
= \sum_b \pi_t(b)\left(q_*(b) - X_t\right)\frac{\partial \pi_t(b)}{\partial H_t(a)} \Big/ \pi_t(b)
= \mathbb{E}\!\left[\left(q_*(A_t) - X_t\right)\frac{\partial \pi_t(A_t)}{\partial H_t(a)} \Big/ \pi_t(A_t)\right]
= \mathbb{E}\!\left[\left(R_t - \bar{R}_t\right)\frac{\partial \pi_t(A_t)}{\partial H_t(a)} \Big/ \pi_t(A_t)\right],$$
where here we have chosen $X_t = \bar{R}_t$ and substituted $R_t$ for $q_*(A_t)$, which is permitted because $\mathbb{E}[R_t \mid A_t] = q_*(A_t)$ and because all the other factors are nonrandom. Shortly we will establish that $\frac{\partial \pi_t(b)}{\partial H_t(a)} = \pi_t(b)\left(\mathbb{1}_{a=b} - \pi_t(a)\right)$, where $\mathbb{1}_{a=b}$ is defined to be 1 if $a = b$, else 0. Assuming that for now, we have

$$\frac{\partial\, \mathbb{E}[R_t]}{\partial H_t(a)}
= \mathbb{E}\!\left[\left(R_t - \bar{R}_t\right)\pi_t(A_t)\left(\mathbb{1}_{a=A_t} - \pi_t(a)\right) \Big/ \pi_t(A_t)\right]
= \mathbb{E}\!\left[\left(R_t - \bar{R}_t\right)\left(\mathbb{1}_{a=A_t} - \pi_t(a)\right)\right].$$
Recall that our plan has been to write the performance gradient as an expectation of something that we can sample on each step, as we have just done, and then update on each step proportional to the sample. Substituting a sample of the expectation above for the performance gradient yields:

$$H_{t+1}(a) = H_t(a) + \alpha\left(R_t - \bar{R}_t\right)\left(\mathbb{1}_{a=A_t} - \pi_t(a)\right), \quad \text{for all } a,$$
which you will recognize as being equivalent to our original algorithm.
Thus it remains only to show that $\frac{\partial \pi_t(b)}{\partial H_t(a)} = \pi_t(b)\left(\mathbb{1}_{a=b} - \pi_t(a)\right)$, as we assumed earlier. Recall the standard quotient rule for derivatives:

$$\frac{\partial}{\partial x}\left[\frac{f(x)}{g(x)}\right] = \frac{\frac{\partial f(x)}{\partial x}\, g(x) - f(x)\, \frac{\partial g(x)}{\partial x}}{g(x)^2}.$$
Using this, we can write

$$\begin{aligned}
\frac{\partial \pi_t(b)}{\partial H_t(a)}
&= \frac{\partial}{\partial H_t(a)}\left[\frac{e^{H_t(b)}}{\sum_{c=1}^{n} e^{H_t(c)}}\right] \\
&= \frac{\frac{\partial e^{H_t(b)}}{\partial H_t(a)} \sum_{c=1}^{n} e^{H_t(c)} - e^{H_t(b)} \frac{\partial \sum_{c=1}^{n} e^{H_t(c)}}{\partial H_t(a)}}{\left(\sum_{c=1}^{n} e^{H_t(c)}\right)^2} && \text{(by the quotient rule)} \\
&= \frac{\mathbb{1}_{a=b}\, e^{H_t(b)} \sum_{c=1}^{n} e^{H_t(c)} - e^{H_t(b)}\, e^{H_t(a)}}{\left(\sum_{c=1}^{n} e^{H_t(c)}\right)^2} && \left(\text{since } \tfrac{\partial e^x}{\partial x} = e^x\right) \\
&= \frac{\mathbb{1}_{a=b}\, e^{H_t(b)}}{\sum_{c=1}^{n} e^{H_t(c)}} - \frac{e^{H_t(b)}\, e^{H_t(a)}}{\left(\sum_{c=1}^{n} e^{H_t(c)}\right)^2} \\
&= \mathbb{1}_{a=b}\, \pi_t(b) - \pi_t(b)\,\pi_t(a) \\
&= \pi_t(b)\left(\mathbb{1}_{a=b} - \pi_t(a)\right). \qquad \blacksquare
\end{aligned}$$
We have just shown that the expected update of the gradient-bandit algorithm is equal to the gradient of expected reward, and thus that the algorithm is an instance of stochastic gradient ascent. This assures us that the algorithm has robust convergence properties.
Note that we did not require anything of the reward baseline other than that it not depend on the selected action. For example, we could have set it to zero, or to 1000, and the algorithm would still have been an instance of stochastic gradient ascent. The choice of the baseline does not affect the expected update of the algorithm, but it does affect the variance of the update and thus the rate of convergence. Choosing it as the average of the rewards may not be the very best, but it is simple and works well in practice.
Associative Search (Contextual Bandits)
The problems considered above are all nonassociative tasks, in which there is no need to associate different actions with different situations. However, in a general reinforcement learning task there is more than one situation, and the goal is to learn a policy: a mapping from situations to the actions that are best in those situations.