Dynamic Programming


Note before all: throughout this chapter we assume that the environment is a finite MDP.

The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP).

The key idea of DP, and of reinforcement learning generally, is the use of value functions to organize and structure the search for good policies.

DP algorithms are obtained by turning Bellman equations such as those of the previous chapter into assignments, that is, into update rules for improving approximations of the desired value functions.

Policy

Evaluation

Let's use the Bellman equation for $v_{\pi}$ as an update rule:

$$\begin{align*} v_{k+1}(s) &= \mathbb{E} \left[ R_{t+1} + \gamma v_k(S_{t+1}) | S_t = s \right] \\ &= \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a) \left[ r + \gamma v_k(s') \right] \end{align*}$$

for all $s \in \mathcal{S}$. Clearly, $v_k = v_{\pi}$ is a fixed point for this update rule because the Bellman equation for $v_{\pi}$ assures us of equality in this case. Indeed, the sequence $\{v_k\}$ can be shown in general to converge to $v_{\pi}$ as $k \rightarrow \infty$ under the same conditions that guarantee the existence of $v_{\pi}$. This algorithm is called iterative policy evaluation.

To produce each successive approximation, $v_{k+1}$ from $v_k$, iterative policy evaluation applies the same operation to each state $s$: it replaces the old value of $s$ with a new value obtained from the old values of the successor states of $s$, and the expected immediate rewards, along all the one-step transitions possible under the policy being evaluated. We call this kind of operation a full backup. Each iteration of iterative policy evaluation backs up the value of every state once to produce the new approximate value function $v_{k+1}$.
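As an illustration, here is a minimal Python sketch of iterative policy evaluation. The transition table `P[s][a]` of `(prob, next_state, reward)` triples and the `policy[s][a]` probability array are assumed encodings of the finite MDP and of $\pi$, not structures defined elsewhere in these notes:

```python
import numpy as np

def iterative_policy_evaluation(P, policy, n_states, gamma=0.9, theta=1e-8):
    """Approximate v_pi by sweeping full Bellman expectation backups.

    P[s][a]  -- list of (prob, next_state, reward) triples (assumed MDP encoding)
    policy   -- array of shape (n_states, n_actions) with policy[s][a] = pi(a|s)
    """
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)            # full backup: new values from old ones
        for s in range(n_states):
            for a, pi_sa in enumerate(policy[s]):
                for prob, s_next, reward in P[s][a]:
                    V_new[s] += pi_sa * prob * (reward + gamma * V[s_next])
        delta = np.max(np.abs(V_new - V))     # largest change in this sweep
        V = V_new
        if delta < theta:                     # stop once updates become negligible
            return V
```

This version keeps two arrays and performs a fully synchronous backup per sweep; an in-place variant that overwrites `V[s]` immediately also converges and usually does so faster.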

Improvement

The reason for computing the value function for a policy is to help find better policies. With the value function $v_{\pi}$ we know how good it is to follow the policy $\pi$, but we do not know whether it would be better or worse to choose another action $a \neq \pi(s)$ ($s$ is the current state). Therefore, we would naturally like to consider the value of selecting action $a$ in state $s$ and thereafter following $\pi$. That is,

$$\begin{align*} q_{\pi}(s,a) &= \mathbb{E}_{\pi} \left[ R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t=s, A_t=a \right] \\ &= \sum_{s',r} p(s',r| s,a) \left[ r + \gamma v_{\pi}(s') \right] \end{align*}$$

The key criterion is whether $q_{\pi}(s,a)$ is greater than or less than $v_{\pi}(s)$. If it is greater, that is, if it is better to select $a$ once in $s$ and thereafter follow $\pi$ than it would be to follow $\pi$ all the time, then one would expect it to be better still to select $a$ every time $s$ is encountered, and that the new policy would in fact be a better one overall.

Now let's generalize this to a more universal situation. Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all $s \in \mathcal{S}$,

$$q_{\pi}(s, \pi'(s)) \ge v_{\pi}(s).$$

Then the policy $\pi'$ must be as good as, or better than, $\pi$. That is, it must obtain greater or equal expected return from all states $s \in \mathcal{S}$:

$$v_{\pi'}(s) \ge v_{\pi}(s).$$

This is the policy improvement theorem. Its proof repeatedly expands the $q_{\pi}$ side and reapplies the assumption:

$$\begin{align*} v_{\pi}(s) &\le q_{\pi}(s, \pi'(s)) \\ &= \mathbb{E}_{\pi'} \left[ R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t=s \right] \\ &\le \mathbb{E}_{\pi'} \left[ R_{t+1} + \gamma q_{\pi}(S_{t+1}, \pi'(S_{t+1})) | S_t=s \right] \\ &= \mathbb{E}_{\pi'} \left[ R_{t+1} + \gamma \mathbb{E}_{\pi'} [R_{t+2} + \gamma v_{\pi}(S_{t+2})] | S_t=s \right] \\ &= \mathbb{E}_{\pi'} \left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 v_{\pi}(S_{t+2}) | S_t=s \right] \\ &\le \mathbb{E}_{\pi'} \left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 v_{\pi}(S_{t+3}) | S_t=s \right] \\ &\;\;\vdots \\ &\le \mathbb{E}_{\pi'} \left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots | S_t=s \right] \\ &= v_{\pi'}(s). \end{align*}$$

It is then a natural extension to consider changes at all states and to all possible actions, selecting at each state the action that appears best according to $q_{\pi}(s,a)$. Consider the new greedy policy $\pi'$:

$$\begin{align*} \pi'(s) &= \mathop{\arg\max}_a q_{\pi}(s,a) \\ &= \mathop{\arg\max}_a \mathbb{E}\left[ R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t=s, A_t=a \right] \\ &= \mathop{\arg\max}_a \sum_{s',r} p(s',r|s,a) \left[ r + \gamma v_{\pi}(s') \right] \end{align*}$$

So far in this section we have considered the special case of deterministic policies, but it is natural to extend all of this to stochastic policies by interpreting $q_{\pi}(s,\pi'(s))$ as an expectation over the actions of $\pi'$:

$$q_{\pi}(s,\pi'(s)) = \sum_a \pi'(a|s) q_{\pi}(s,a)$$
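Under the same assumed `P` encoding, the greedy improvement step can be sketched as follows: compute $q_{\pi}(s,a)$ from the current value estimate `V` and pick the maximizing action in each state.

```python
import numpy as np

def policy_improvement(P, V, n_states, n_actions, gamma=0.9):
    """Return the deterministic greedy policy pi'(s) = argmax_a q_pi(s, a)."""
    new_policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = np.zeros(n_actions)
        for a in range(n_actions):
            for prob, s_next, reward in P[s][a]:
                q[a] += prob * (reward + gamma * V[s_next])   # q_pi(s, a) given V = v_pi
        new_policy[s] = int(np.argmax(q))     # act greedily with respect to q_pi(s, .)
    return new_policy
```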

Iteration

Once a policy, $\pi$, has been improved using $v_{\pi}$ to yield a better policy, $\pi'$, we can then compute $v_{\pi'}$ and improve it again to yield an even better $\pi''$. We can thus obtain a sequence of monotonically improving policies and value functions:

$$\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_*,$$

where $\xrightarrow{E}$ denotes a policy evaluation and $\xrightarrow{I}$ denotes a policy improvement. Because a finite MDP has only a finite number of deterministic policies, this process converges to an optimal policy and optimal value function in a finite number of iterations.
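Putting the two steps together gives a sketch of policy iteration; `iterative_policy_evaluation` and `policy_improvement` are the hypothetical helpers sketched above, and the deterministic policy is expanded into a one-hot $\pi(a|s)$ table for the evaluation step.

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.9):
    """Alternate evaluation (E) and improvement (I) until the policy is stable."""
    policy = np.zeros(n_states, dtype=int)        # start from an arbitrary policy
    while True:
        # E: evaluate the current deterministic policy (one-hot pi(a|s) table).
        policy_table = np.eye(n_actions)[policy]
        V = iterative_policy_evaluation(P, policy_table, n_states, gamma)
        # I: make the policy greedy with respect to v_pi.
        new_policy = policy_improvement(P, V, n_states, n_actions, gamma)
        if np.array_equal(new_policy, policy):    # greedy policy unchanged => done
            return policy, V
        policy = new_policy
```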

Asynchronous Dynamic Programming

Asynchronous DP algorithms are in-place iterative DP algorithms that are not organized in terms of systematic sweeps of the state set. These algorithms back up the values of states in any order whatsoever, using whatever values of other states happen to be available. The values of some states may be backed up several times before the values of others are backed up once. To converge correctly, however, an asynchronous algorithm must continue to back up the values of all the states: it cannot ignore any state after some point in the computation.
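For instance, here is a sketch of an in-place backup routine (same assumed `P` encoding as above) that processes whatever states it is handed, in whatever order, always using the most recent values; the `max` backup shown is just one possible choice of update.

```python
import numpy as np

def asynchronous_backup(P, V, states_to_update, n_actions, gamma=0.9):
    """Back up the given states in place, in whatever order they arrive."""
    for s in states_to_update:                    # e.g. a random permutation, or the
        q = np.zeros(n_actions)                   # states the agent has just visited
        for a in range(n_actions):
            for prob, s_next, reward in P[s][a]:
                q[a] += prob * (reward + gamma * V[s_next])   # uses the latest V values
        V[s] = np.max(q)                          # in-place value-iteration-style backup
    return V
```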

Note that avoiding systematic sweeps does not mean we can get away with less computation overall; it just means that an algorithm does not need to get locked into a hopelessly long sweep before it can make progress improving a policy.

Asynchronous algorithms are especially applicable when it comes to real-time interaction. To solve a given MDP, we can run an iterative DP algorithm at the same time that an agent is actually experiencing the MDP.

Generalized Policy Iteration

Policy iteration consists of two simultaneous, interacting processes, one making the value function consistent with the current policy (policy evaluation), and the other making the policy greedy with respect to the current value function (policy improvement). In policy iteration, these two processes alternate, each completing before the other begins, but this is not really necessary. In value iteration, for example, only a single iteration of policy evaluation is performed in between each policy improvement. In asynchronous DP methods, the evaluation and improvement processes are interleaved at an even finer grain.
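To make the contrast concrete, here is a sketch of value iteration under the same assumptions; no explicit policy is maintained during the loop, since each sweep fuses one evaluation backup with the greedy improvement.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    """One sweep = a single truncated evaluation backup fused with greedy improvement."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            q = np.zeros(n_actions)
            for a in range(n_actions):
                for prob, s_next, reward in P[s][a]:
                    q[a] += prob * (reward + gamma * V[s_next])
            best = np.max(q)                      # v_{k+1}(s) = max_a q(s, a)
            delta = max(delta, abs(best - V[s]))
            V[s] = best                           # in-place update
        if delta < theta:
            return V      # a greedy policy can then be read off V, e.g. with the
                          # policy_improvement sketch above
```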

We use the term generalized policy iteration (GPI) to refer to the general idea of letting policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes.

The evaluation and improvement processes in GPI can be viewed as both competing and cooperating. They compete in the sense that they pull in opposing directions. Making the policy greedy with respect to the value function typically makes the value function incorrect for the changed policy, and making the value function consistent with the policy typically causes that policy no longer to be greedy.

Efficiency of Dynamic Programming

DP may not be practical for very large problems, but compared with other methods for solving MDPs, DP methods are actually quite efficient. If we ignore a few technical details, then the (worst case) time DP methods take to find an optimal policy is polynomial in the number of states and actions.

On problems with large state spaces, asynchronous DP methods are often preferred.

Summary

Finally, we note one last special property of DP methods. All of them update estimates of the values of states based on estimates of the values of successor states. That is, they update estimates on the basis of other estimates. We call this general idea bootstrapping. Many reinforcement learning methods perform bootstrapping, even those that do not require, as DP requires, a complete and accurate model of the environment.
