Dynamic Programming
Note before all: throughout this chapter we assume that the environment is a finite MDP.
The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP).
The key idea of DP, and of reinforcement learning generally, is the use of value functions to organize and structure the search for good policies.
DP algorithms are obtained by turning Bellman equations (such as the Bellman optimality equations for $v_*$ and $q_*$) into assignments, that is, into update rules for improving approximations of the desired value functions.
Policy Evaluation
Let's use the Bellman equation for $v_\pi$ as an update rule:

$$v_{k+1}(s) = \mathbb{E}_\pi\big[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s\big] = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_k(s')\big]$$
for all $s \in \mathcal{S}$. Clearly, $v_k = v_\pi$ is a fixed point for this update rule because the Bellman equation for $v_\pi$ assures us of equality in this case. Indeed, the sequence $\{v_k\}$ can be shown in general to converge to $v_\pi$ as $k \to \infty$ under the same conditions that guarantee the existence of $v_\pi$. This algorithm is called iterative policy evaluation.
To produce each successive approximation, $v_{k+1}$ from $v_k$, iterative policy evaluation applies the same operation to each state $s$: it replaces the old value of $s$ with a new value obtained from the old values of the successor states of $s$, and the expected immediate rewards, along all the one-step transitions possible under the policy being evaluated. We call this kind of operation a full backup. Each iteration of iterative policy evaluation backs up the value of every state once to produce the new approximate value function $v_{k+1}$.
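As a concrete illustration, here is a minimal Python sketch of iterative policy evaluation on a tabular MDP. The interface is an assumption made for this example, not something defined above: `p(s, a)` returns a list of `(prob, next_state, reward)` triples, and `policy(s, a)` returns the probability of taking action `a` in state `s`.

```python
def iterative_policy_evaluation(states, actions, p, policy, gamma=0.9, theta=1e-8):
    """Evaluate a fixed policy on a finite MDP (in-place variant).

    Assumed interface (illustrative, not from the text):
      p(s, a)      -> list of (prob, next_state, reward) triples
      policy(s, a) -> probability of choosing action a in state s
    """
    V = {s: 0.0 for s in states}          # arbitrary initial approximation v_0
    while True:
        delta = 0.0
        for s in states:                   # one full sweep over the state set
            # Full backup: expected one-step return under the policy,
            # computed from the current values of the successor states.
            new_v = sum(
                policy(s, a) * prob * (r + gamma * V[s_next])
                for a in actions
                for prob, s_next, r in p(s, a)
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:                  # stop when a sweep barely changes the values
            return V
```

This sketch is the in-place variant: new values overwrite old ones within a sweep instead of keeping separate arrays for $v_k$ and $v_{k+1}$, which usually converges at least as fast.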
Policy Improvement
The reason for computing the value function for a policy is to help find better policies. With the value function $v_\pi$ we know how good it is to follow the current policy from $s$, but we do not know whether it would be better or worse to choose some other action $a \neq \pi(s)$ ($s$ is the current state). Therefore, it is natural to consider the value of selecting $a$ in state $s$ and thereafter following $\pi$. That is,

$$q_\pi(s, a) = \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a\big] = \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_\pi(s')\big]$$
The key criterion is whether this is greater than or less than $v_\pi(s)$. If it is greater, that is, if it is better to select $a$ once in $s$ and thereafter follow $\pi$ than it would be to follow $\pi$ all the time, then one would expect it to be better still to select $a$ every time $s$ is encountered, and that the new policy would in fact be a better one overall.
Now let's generalize this to a broader setting. Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all $s \in \mathcal{S}$,

$$q_\pi(s, \pi'(s)) \ge v_\pi(s).$$
Then the policy $\pi'$ must be as good as, or better than, $\pi$. That is, it must obtain greater or equal expected return from all states $s \in \mathcal{S}$:

$$v_{\pi'}(s) \ge v_\pi(s).$$
The proof of the policy improvement theorem repeatedly expands the $q_\pi$ side and reapplies the assumption $q_\pi(s, \pi'(s)) \ge v_\pi(s)$:

$$
\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s)) \\
&= \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = \pi'(s)\big] \\
&= \mathbb{E}_{\pi'}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\big] \\
&\le \mathbb{E}_{\pi'}\big[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s\big] \\
&= \mathbb{E}_{\pi'}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t = s\big] \\
&\;\;\vdots \\
&\le \mathbb{E}_{\pi'}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\big] = v_{\pi'}(s).
\end{aligned}
$$
It is then a natural extension to consider changes at all states and to all possible actions, selecting at each state the action that appears best according to $q_\pi(s, a)$. Consider the new greedy policy $\pi'$:

$$\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_\pi(s')\big].$$
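The sketch below shows this greedy improvement step, reusing the hypothetical `p(s, a)` model interface assumed in the earlier policy evaluation example; it is an illustration, not code from the text.

```python
def greedy_policy(states, actions, p, V, gamma=0.9):
    """Return the deterministic policy that is greedy with respect to V."""
    policy = {}
    for s in states:
        # q(s, a) computed from the assumed model p and the current value estimates V
        q = {
            a: sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in p(s, a))
            for a in actions
        }
        policy[s] = max(q, key=q.get)  # argmax over actions (ties broken arbitrarily)
    return policy
```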
So far in this section we have considered the special case of deterministic policies, but all of these ideas extend naturally to stochastic policies by taking expectations over the policy's action probabilities; in particular, $q_\pi(s, \pi'(s))$ is replaced by $\sum_a \pi'(a \mid s)\, q_\pi(s, a)$.
Policy Iteration
Once a policy $\pi$ has been improved using $v_\pi$ to yield a better policy $\pi'$, we can then compute $v_{\pi'}$ and improve it again to yield an even better $\pi''$. We can thus obtain a sequence of monotonically improving policies and value functions:

$$\pi_0 \xrightarrow{\;E\;} v_{\pi_0} \xrightarrow{\;I\;} \pi_1 \xrightarrow{\;E\;} v_{\pi_1} \xrightarrow{\;I\;} \pi_2 \xrightarrow{\;E\;} \cdots \xrightarrow{\;I\;} \pi_* \xrightarrow{\;E\;} v_*$$
where $\xrightarrow{\;E\;}$ denotes a policy evaluation and $\xrightarrow{\;I\;}$ denotes a policy improvement.
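Combining the two steps gives a compact policy iteration loop. The sketch below reuses the hypothetical `iterative_policy_evaluation` and `greedy_policy` helpers from the earlier examples and stops when the greedy policy no longer changes.

```python
def policy_iteration(states, actions, p, gamma=0.9):
    """Alternate full policy evaluation (E) and greedy improvement (I) until the policy is stable."""
    pi = {s: actions[0] for s in states}   # start from an arbitrary deterministic policy
    while True:
        # Policy evaluation (E): compute v_pi for the current policy.
        V = iterative_policy_evaluation(
            states, actions, p,
            policy=lambda s, a, pi=pi: 1.0 if pi[s] == a else 0.0,
            gamma=gamma,
        )
        # Policy improvement (I): act greedily with respect to v_pi.
        new_pi = greedy_policy(states, actions, p, V, gamma)
        if new_pi == pi:                   # policy stable: return the (approximately) optimal pair
            return new_pi, V
        pi = new_pi
```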
Asynchronous Dynamic Programming
Asynchronous DP algorithms are in-place iterative DP algorithms that are not organized in terms of systematic sweeps of the state set. These algorithms back up the values of states in any order whatsoever, using whatever values of other states happen to be available. The values of some states may be backed up several times before the values of others are backed up once. To converge correctly, however, an asynchronous algorithm must continue to back up the values of all the states: it cannot ignore any state after some point in the computation.
Note that avoiding sweeps does not necessarily mean we can get away with less computation. It just means that an algorithm does not need to get locked into a hopelessly long sweep before it can make progress improving a policy.
Asynchronous algorithms also make it easier to intermix computation with real-time interaction. To solve a given MDP, we can run an iterative DP algorithm at the same time that an agent is actually experiencing the MDP.
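As one illustration of this idea (a sketch under the same assumed `p(s, a)` interface, not an algorithm specified above), the function below applies in-place max-backups to whatever stream of states it is handed; that stream could simply be the sequence of states the agent happens to visit.

```python
def async_value_backups(state_stream, actions, p, V, gamma=0.9):
    """Apply in-place max-backups to states in an arbitrary order.

    state_stream may repeat some states many times and leave others untouched
    for long stretches; convergence requires that every state keeps being
    backed up eventually.
    """
    for s in state_stream:
        V[s] = max(
            sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in p(s, a))
            for a in actions
        )
    return V
```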
Generalized Policy Iteration
Policy iteration consists of two simultaneous, interacting processes, one making the value function consistent with the current policy (policy evaluation), and the other making the policy greedy with respect to the current value function (policy improvement). In policy iteration, these two processes alternate, each completing before the other begins, but this is not really necessary. In value iteration, for example, only a single iteration of policy evaluation is performed in between each policy improvement. In asynchronous DP methods, the evaluation and improvement processes are interleaved at an even finer grain.
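For concreteness, here is a minimal sketch of the value iteration case mentioned above: each sweep performs a single max-backup per state, which fuses the truncated evaluation with the improvement step. It reuses the hypothetical `greedy_policy` helper from earlier to extract a policy at the end.

```python
def value_iteration(states, actions, p, gamma=0.9, theta=1e-8):
    """Value iteration: one max-backup per state per sweep until the values converge."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(
                sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in p(s, a))
                for a in actions
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            # Extract the greedy policy with respect to the converged values.
            return greedy_policy(states, actions, p, V, gamma), V
```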
We use the term generalized policy iteration (GPI) to refer to the general idea of letting policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes.
The evaluation and improvement processes in GPI can be viewed as both competing and cooperating. They compete in the sense that they pull in opposing directions. Making the policy greedy with respect to the value function typically makes the value function incorrect for the changed policy, and making the value function consistent with the policy typically causes that policy no longer to be greedy.
Efficiency of Dynamic Programming
DP may not be practical for very large problems, but compared with other methods for solving MDPs, DP methods are actually quite efficient. If we ignore a few technical details, then the (worst case) time DP methods take to find an optimal policy is polynomial in the number of states and actions.
On problems with large state spaces, asynchronous DP methods are often preferred.
Summary
Finally, we note one last special property of DP methods. All of them update estimates of the values of states based on estimates of the values of successor states. That is, they update estimates on the basis of other estimates. We call this general idea bootstrapping. Many reinforcement learning methods perform bootstrapping, even those that do not require, as DP requires, a complete and accurate model of the environment.