Multi-armed Bandits
Preface
The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions. Evaluative feedback depends entirely on the action taken, whereas instructive feedback is independent of the action taken.
Problem Statement
Consider the following learning problem. You are faced repeatedly with a choice among n different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.
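To make the setup concrete, here is a minimal Python sketch of such a stationary bandit (the class name, the Gaussian reward model, and the choice of 10 arms are illustrative assumptions in the spirit of the usual 10-armed testbed, not something fixed by the problem statement):

```python
import numpy as np

class StationaryBandit:
    """An n-armed bandit whose arms pay noisy rewards around fixed true values."""

    def __init__(self, n_arms=10, seed=0):
        self.rng = np.random.default_rng(seed)
        # True action values q*(a), unknown to the learner.
        self.true_values = self.rng.normal(0.0, 1.0, size=n_arms)

    def pull(self, action):
        # Reward for the chosen arm: its true value plus unit-variance Gaussian noise.
        return self.rng.normal(self.true_values[action], 1.0)


bandit = StationaryBandit()
# Baseline: pick arms uniformly at random for 1000 time steps.
rewards = [bandit.pull(bandit.rng.integers(bandit.true_values.size)) for _ in range(1000)]
print(f"average reward of uniformly random play: {np.mean(rewards):.3f}")
```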
According to Wikipedia:
The multi-armed bandit problem models an agent that simultaneously attempts to acquire new knowledge (called "exploration") and optimize his or her decisions based on existing knowledge (called "exploitation"). The agent attempts to balance these competing tasks in order to maximize his total value over the period of time considered. There are many practical applications of the bandit model, for example:
clinical trials investigating the effects of different experimental treatments while minimizing patient losses,
adaptive routing efforts for minimizing delays in a network,
financial portfolio design.
Balancing Methods
In any specific case, whether it is better to explore or exploit depends in a complex way on the precise values of the estimates, uncertainties, and the number of remaining steps. This means both exploration and exploitation are important, and we need to balance them well. There are many sophisticated methods for balancing exploration and exploitation, but most of them make strong assumptions about stationarity and prior knowledge that are either violated or impossible to verify in applications.
Initial Value Estimates
Action-Value Methods
The simplest action selection rule is to select the action (or one of the actions) with the highest estimated action value. The greedy action selection method can be written as

$$A_t = \underset{a}{\arg\max}\; Q_t(a),$$

where $Q_t(a)$ is the current estimate of the value of action $a$ and ties are broken arbitrarily.
Although the greedy rule makes the most of the current estimates, it never explores, and the cost of that missing exploration is especially high in practical, nonstationary environments in which the true values of the actions change over time. Even if the underlying task is stationary and deterministic, the learner faces a set of bandit-like decision tasks each of which changes over time due to the learning process itself. Therefore, a more practical method is needed.
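One standard practical fix is ε-greedy selection: behave greedily most of the time, but with a small probability ε pick an arm uniformly at random so that every action keeps being sampled. A minimal sketch (the function name and the numpy scaffolding are my own):

```python
import numpy as np

def epsilon_greedy(q_estimates, epsilon, rng):
    """With probability epsilon explore at random; otherwise pick the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))   # explore
    return int(np.argmax(q_estimates))               # exploit


rng = np.random.default_rng(0)
q = np.array([0.2, 0.5, 0.1])                        # current action-value estimates
picks = [epsilon_greedy(q, epsilon=0.1, rng=rng) for _ in range(1000)]
print("fraction of picks that were the greedy arm:", picks.count(1) / len(picks))
```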
Incremental Implementation
The incremental update rule for the sample-average estimate,

$$Q_{k+1} = Q_k + \frac{1}{k}\big[R_k - Q_k\big],$$

is of a form that occurs frequently throughout this book. The general form is

$$\textit{NewEstimate} \leftarrow \textit{OldEstimate} + \textit{StepSize}\,\big[\textit{Target} - \textit{OldEstimate}\big].$$

The expression $\big[\textit{Target} - \textit{OldEstimate}\big]$ is an error in the estimate; it is reduced by taking a step toward the target.
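A minimal Python rendering of this general form (function and variable names are my own), checking that step size 1/k reproduces the sample average without storing past rewards:

```python
def incremental_update(old_estimate, target, step_size):
    """NewEstimate <- OldEstimate + StepSize * [Target - OldEstimate]."""
    return old_estimate + step_size * (target - old_estimate)


# Sample-average estimate of one arm, updated reward by reward with step size 1/k.
rewards = [1.0, 0.0, 2.0, 1.0]
q, k = 0.0, 0
for r in rewards:
    k += 1
    q = incremental_update(q, r, 1.0 / k)
print(q)   # 1.0, identical to sum(rewards) / len(rewards), with no rewards stored
```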
Tracking a Nonstationary Problem
The averaging methods discussed so far are appropriate in a stationary environment, but not if the bandit is changing over time. One of the most popular ways of handling a nonstationary problem is to use a constant step-size parameter. For example, the incremental update rule (2.3) for updating an average $Q_k$ of the $k-1$ past rewards is modified to be

$$Q_{k+1} = Q_k + \alpha\big[R_k - Q_k\big],$$

where the step-size parameter $\alpha \in (0, 1]$ is constant.
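Unrolling this update shows what the constant-α estimate really is: a weighted average of past rewards in which each reward's weight decays exponentially with its age, which is why it is sometimes called an exponential recency-weighted average:

$$Q_{k+1} = (1-\alpha)^{k} Q_1 + \sum_{i=1}^{k} \alpha (1-\alpha)^{k-i} R_i,$$

and the weights sum to one, $(1-\alpha)^{k} + \sum_{i=1}^{k} \alpha (1-\alpha)^{k-i} = 1$.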
Sometimes it is convenient to vary the step-size parameter from step to step; let $\alpha_k(a)$ denote the step size used after the $k$-th selection of action $a$. A well-known result in stochastic approximation theory gives the conditions required to assure convergence with probability 1:

$$\sum_{k=1}^{\infty} \alpha_k(a) = \infty \qquad \text{and} \qquad \sum_{k=1}^{\infty} \alpha_k^2(a) < \infty.$$

The first condition is required to guarantee that the steps are large enough to eventually overcome any initial conditions or random fluctuations. The second condition guarantees that eventually the steps become small enough to assure convergence. Both conditions hold for the sample-average case $\alpha_k(a) = \frac{1}{k}$, but not for a constant step size $\alpha_k(a) = \alpha$: with a constant step size the estimates never completely converge but continue to vary in response to the most recently received rewards, which is exactly what we want in a nonstationary environment.
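A small illustrative experiment (the random-walk drift, the step size α = 0.1, and the horizon are arbitrary choices of mine, not anything specified above): when the arm's true value drifts, the sample average is weighed down by stale rewards, while the constant-step-size estimate keeps tracking the current value.

```python
import numpy as np

rng = np.random.default_rng(1)
true_value = 0.0                 # the arm's true value, which will drift
q_avg, q_const, alpha = 0.0, 0.0, 0.1

for k in range(1, 10_001):
    true_value += rng.normal(0.0, 0.01)        # slow random-walk drift
    reward = rng.normal(true_value, 1.0)
    q_avg += (reward - q_avg) / k              # sample average: step size 1/k
    q_const += alpha * (reward - q_const)      # constant step size: tracks recent rewards

print(f"true value {true_value:.2f} | sample average {q_avg:.2f} | constant alpha {q_const:.2f}")
```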
Optimistic Initial Values
Setting the initial action-value estimates optimistically, say to $+5$ when the true values are drawn from a standard normal distribution, is a simple way to encourage exploration: whichever action is selected first yields a disappointing reward relative to its estimate, so the learner switches to other actions, and every action ends up being tried several times before the estimates settle down. However, this trick is not well suited to nonstationary problems because its drive for exploration is inherently temporary. If the task changes, creating a renewed need for exploration, this method cannot help. Indeed, any method that focuses on the initial conditions in any special way is unlikely to help with the general nonstationary case. The beginning of time occurs only once, and thus we should not focus on it too much. This criticism applies as well to the sample-average methods, which also treat the beginning of time as a special event, averaging all subsequent rewards with equal weights. Nevertheless, all of these methods are very simple, and one of them or some simple combination of them is often adequate in practice.
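For concreteness, here is a sketch of the optimistic-initialization idea being criticized (the +5 initial estimate and Gaussian rewards mirror the book's 10-armed testbed; the surrounding scaffolding is my own): even a purely greedy learner tries every arm early on, because each arm "disappoints" when sampled and the untried ones still look better.

```python
import numpy as np

rng = np.random.default_rng(2)
true_values = rng.normal(0.0, 1.0, size=10)    # rewards are centered near 0

q = np.full(10, 5.0)                           # wildly optimistic initial estimates
counts = np.zeros(10, dtype=int)

for t in range(200):
    a = int(np.argmax(q))                      # purely greedy selection, no epsilon
    reward = rng.normal(true_values[a], 1.0)   # the reward "disappoints" relative to 5
    counts[a] += 1
    q[a] += (reward - q[a]) / counts[a]        # sample-average update

print("pull counts after 200 greedy steps:", counts)   # every arm gets tried early on
```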
Upper-Confidence-Bound Action Selection
Gradient Bandits
The following proof is so beautiful that I made a nearly complete copy from the book, with only some adjustments to make the document fit the markdown grammar and GitBook typesetting better.
To give a deeper insight into this algorithm, we can understand it as a stochastic approximation to gradient ascent. In exact gradient ascent, each action preference $H_t(a)$ would be incremented in proportion to the increment's effect on performance:

$$H_{t+1}(a) = H_t(a) + \alpha\,\frac{\partial\, \mathbb{E}[R_t]}{\partial H_t(a)},$$

where the measure of performance here is the expected reward:

$$\mathbb{E}[R_t] = \sum_b \pi_t(b)\, q_*(b),$$

in which $\pi_t(b) = \Pr\{A_t = b\} = \frac{e^{H_t(b)}}{\sum_{c=1}^{n} e^{H_t(c)}}$ is the soft-max probability of selecting action $b$ at time $t$ and $q_*(b)$ is that action's true value. Of course, exact gradient ascent cannot be implemented here because the $q_*(b)$ are unknown, but the updates of the gradient-bandit algorithm are equal to the update above in expected value, making the algorithm an instance of stochastic gradient ascent.
The calculations showing this require only beginning calculus, but take several steps. If you are mathematically inclined, then you will enjoy the rest of this section in which we go through these steps. First we take a closer look at the exact performance gradient:

$$\frac{\partial\, \mathbb{E}[R_t]}{\partial H_t(a)} = \frac{\partial}{\partial H_t(a)}\left[\sum_b \pi_t(b)\, q_*(b)\right] = \sum_b q_*(b)\,\frac{\partial \pi_t(b)}{\partial H_t(a)} = \sum_b \big(q_*(b) - X_t\big)\frac{\partial \pi_t(b)}{\partial H_t(a)},$$

where $X_t$, called the baseline, can be any scalar that does not depend on $b$. We can include it without changing the equality because the gradient sums to zero over all the actions, $\sum_b \frac{\partial \pi_t(b)}{\partial H_t(a)} = 0$: as $H_t(a)$ changes, some actions' probabilities go up and some go down, but the sum of the changes must be zero because the probabilities always sum to one. Next we multiply each term of the sum by $\pi_t(b)/\pi_t(b)$:

$$\frac{\partial\, \mathbb{E}[R_t]}{\partial H_t(a)} = \sum_b \pi_t(b)\,\big(q_*(b) - X_t\big)\frac{\partial \pi_t(b)}{\partial H_t(a)} \Big/ \pi_t(b).$$
The equation is now in the form of an expectation, summing over all possible values $b$ of the random variable $A_t$, then multiplying by the probability of taking those values. Thus:

$$\frac{\partial\, \mathbb{E}[R_t]}{\partial H_t(a)} = \mathbb{E}\left[\big(q_*(A_t) - X_t\big)\frac{\partial \pi_t(A_t)}{\partial H_t(a)} \Big/ \pi_t(A_t)\right] = \mathbb{E}\left[\big(R_t - \bar{R}_t\big)\frac{\partial \pi_t(A_t)}{\partial H_t(a)} \Big/ \pi_t(A_t)\right],$$

where here we have chosen the baseline $X_t = \bar{R}_t$ and substituted $R_t$ for $q_*(A_t)$, which is permitted because $\mathbb{E}[R_t \mid A_t] = q_*(A_t)$. Shortly we will establish that $\frac{\partial \pi_t(b)}{\partial H_t(a)} = \pi_t(b)\big(\mathbf{1}_{a=b} - \pi_t(a)\big)$, where $\mathbf{1}_{a=b}$ is defined to be $1$ if $a = b$ and $0$ otherwise. Assuming that for now, we have

$$\frac{\partial\, \mathbb{E}[R_t]}{\partial H_t(a)} = \mathbb{E}\Big[\big(R_t - \bar{R}_t\big)\,\pi_t(A_t)\big(\mathbf{1}_{a=A_t} - \pi_t(a)\big) \Big/ \pi_t(A_t)\Big] = \mathbb{E}\Big[\big(R_t - \bar{R}_t\big)\big(\mathbf{1}_{a=A_t} - \pi_t(a)\big)\Big].$$
Recall that our plan has been to write the performance gradient as an expectation of something that we can sample on each step, as we have just done, and then update on each step proportional to the sample. Substituting a sample of the expectation above for the performance gradient yields:

$$H_{t+1}(a) = H_t(a) + \alpha\big(R_t - \bar{R}_t\big)\big(\mathbf{1}_{a=A_t} - \pi_t(a)\big), \qquad \text{for all } a,$$
which you will recognize as being equivalent to our original algorithm.
Thus it remains only to show that $\frac{\partial \pi_t(b)}{\partial H_t(a)} = \pi_t(b)\big(\mathbf{1}_{a=b} - \pi_t(a)\big)$. Recall the standard quotient rule for derivatives:

$$\frac{\partial}{\partial x}\left[\frac{f(x)}{g(x)}\right] = \frac{f'(x)\,g(x) - f(x)\,g'(x)}{g(x)^2}.$$

Using this, we can write

$$\begin{aligned}
\frac{\partial \pi_t(b)}{\partial H_t(a)} &= \frac{\partial}{\partial H_t(a)}\left[\frac{e^{H_t(b)}}{\sum_{c=1}^{n} e^{H_t(c)}}\right]\\
&= \frac{\frac{\partial e^{H_t(b)}}{\partial H_t(a)} \sum_{c=1}^{n} e^{H_t(c)} - e^{H_t(b)} \frac{\partial \sum_{c=1}^{n} e^{H_t(c)}}{\partial H_t(a)}}{\left(\sum_{c=1}^{n} e^{H_t(c)}\right)^2} \qquad \text{(by the quotient rule)}\\
&= \frac{\mathbf{1}_{a=b}\, e^{H_t(b)} \sum_{c=1}^{n} e^{H_t(c)} - e^{H_t(b)}\, e^{H_t(a)}}{\left(\sum_{c=1}^{n} e^{H_t(c)}\right)^2} \qquad \left(\text{because } \tfrac{\partial e^{x}}{\partial x} = e^{x}\right)\\
&= \frac{\mathbf{1}_{a=b}\, e^{H_t(b)}}{\sum_{c=1}^{n} e^{H_t(c)}} - \frac{e^{H_t(b)}\, e^{H_t(a)}}{\left(\sum_{c=1}^{n} e^{H_t(c)}\right)^2}\\
&= \mathbf{1}_{a=b}\,\pi_t(b) - \pi_t(b)\,\pi_t(a)\\
&= \pi_t(b)\big(\mathbf{1}_{a=b} - \pi_t(a)\big). \qquad\text{Q.E.D.}
\end{aligned}$$
We have just shown that the expected update of the gradient-bandit algorithm is equal to the gradient of expected reward, and thus that the algorithm is an instance of stochastic gradient ascent. This assures us that the algorithm has robust convergence properties.
Note that we did not require anything of the reward baseline other than that it not depend on the selected action. For example, we could have set it to zero, or to 1000, and the algorithm would still have been an instance of stochastic gradient ascent. The choice of the baseline does not affect the expected update of the algorithm, but it does affect the variance of the update and thus the rate of convergence. Choosing it as the average of the rewards may not be the very best, but it is simple and works well in practice.
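Putting the result to work, here is a compact sketch of the gradient-bandit update just derived, with soft-max action probabilities and the running average reward as the baseline (the bandit environment, α = 0.1, and the run length are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
true_values = rng.normal(0.0, 1.0, size=10)

h = np.zeros(10)          # action preferences H_t(a)
avg_reward = 0.0          # baseline: running average of all rewards so far
alpha = 0.1

for t in range(1, 2001):
    pi = np.exp(h - h.max())
    pi /= pi.sum()                              # soft-max action probabilities pi_t(a)
    a = int(rng.choice(10, p=pi))
    reward = rng.normal(true_values[a], 1.0)
    avg_reward += (reward - avg_reward) / t     # incremental update of the baseline
    indicator = np.zeros(10)
    indicator[a] = 1.0
    # H_{t+1}(a) = H_t(a) + alpha * (R_t - baseline) * (1{a = A_t} - pi_t(a)), for all a
    h += alpha * (reward - avg_reward) * (indicator - pi)

print("best arm:", int(np.argmax(true_values)), "| most preferred arm:", int(np.argmax(h)))
```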
Associative Search (Contextual Bandits)
The problems considered so far are nonassociative tasks, in which there is no need to associate different actions with different situations. However, in a general reinforcement learning task there is more than one situation, and the goal is to learn a policy: a mapping from situations to the actions that are best in those situations.
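As a tiny illustration of the associative setting (the two contexts, the per-context value table, and the ε-greedy rule inside each context are all hypothetical choices of mine, not something defined above): the learner simply keeps a separate set of action-value estimates for each situation it observes, and learns a different best action in each.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two situations ("contexts"); the best action differs between them.
true_values = {"red": np.array([1.0, 0.0]), "green": np.array([0.0, 1.0])}

q = {ctx: np.zeros(2) for ctx in true_values}              # one estimate table per context
counts = {ctx: np.zeros(2, dtype=int) for ctx in true_values}

for t in range(2000):
    ctx = str(rng.choice(list(true_values)))               # the situation is handed to us
    if rng.random() < 0.1:
        a = int(rng.integers(2))                           # explore
    else:
        a = int(np.argmax(q[ctx]))                         # exploit within this context
    reward = rng.normal(true_values[ctx][a], 1.0)
    counts[ctx][a] += 1
    q[ctx][a] += (reward - q[ctx][a]) / counts[ctx][a]     # sample-average update

print({ctx: int(np.argmax(q[ctx])) for ctx in q})          # learned best action per context
```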