
Finite Markov Decision Processes

Preface

A somewhat bold claim made by the book:

In this chapter we introduce the problem that we try to solve in the rest of the book. For us, this problem defines the field of reinforcement learning: any method that is suited to solving this problem we consider to be a reinforcement learning method.

The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment. ... In fact, in some cases the agent may know everything about how its environment works and still face a difficult reinforcement learning task, just as we may know exactly how a puzzle like Rubik’s cube works, but still be unable to solve it. The agent–environment boundary represents the limit of the agent’s absolute control, not of its knowledge.

As an aside, I noticed an interesting idea here: even if the agent knows everything about how its environment works, it may still face a difficult learning task. There is something philosophical about this, I think. Knowing everything does not mean you can do it. It indicates, to some degree, the gap between epistemology and methodology.

A model of learning goal-directed behavior:

Three signals passing back and forth between an agent and its environment: one signal to represent the choices made by the agent (the actions), one signal to represent the basis on which the choices are made (the states), and one signal to define the agent’s goal (the rewards).

Some tips:

  • The reward signal is your way of communicating to the robot what you want it to achieve, not how you want it achieved.

  • Rewards are computed in the environment rather than in the agent.

  • The agent's ultimate goal should be something over which it has imperfect control: it should not be able, for example, to simply decree that the reward has been received in the same way that it might arbitrarily change its actions.

Returns

The agent's goal is to maximize the cumulative reward it receives in the long run. Suppose the sequence of rewards received after time step $t$ is denoted $R_{t+1}, R_{t+2}, R_{t+3}, \dots$; we seek to maximize the expected return, where the return $G_t$ is defined as some specific function of the reward sequence.

Discrete Situation

In the simplest case the return is the sum of the rewards:

$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T$$

where $T$ is a final time step.

This approach makes sense in applications in which there is a natural notion of final time step, that is, when the agent–environment interaction breaks naturally into subsequences, which we call episodes, such as plays of a game, trips through a maze, or any sort of repeated interactions. Each episode ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states. Tasks with episodes of this kind are called episodic tasks.

Continual Situation

On the other hand, in many cases the agent–environment interaction does not break naturally into identifiable episodes, but goes on continually without limit. For example, this would be the natural way to formulate a continual process-control task, or an application to a robot with a long life span. We call these continuing tasks.

The additional concept that we need is that of discounting. According to this approach, the agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized. In particular, it chooses $A_t$ to maximize the expected discounted return:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

where $\gamma$ is a parameter, $0 \le \gamma \le 1$, called the discount rate.

The discount rate determines the present value of future rewards: a reward received $k$ time steps in the future is worth only $\gamma^{k-1}$ times what it would be worth if it were received immediately. If $\gamma < 1$, the infinite sum has a finite value as long as the reward sequence $\{R_k\}$ is bounded. If $\gamma = 0$, the agent is "myopic" in being concerned only with maximizing immediate rewards: its objective in this case is to learn how to choose $A_t$ so as to maximize only $R_{t+1}$. If each of the agent's actions happened to influence only the immediate reward, not future rewards as well, then a myopic agent could maximize the equation by separately maximizing each immediate reward. But in general, acting to maximize immediate reward can reduce access to future rewards so that the return may actually be reduced. As $\gamma$ approaches 1, the objective takes future rewards into account more strongly: the agent becomes more farsighted.
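To make this concrete, here is a small Python sketch of my own (not from the book) that computes $G_t$ for every step of a finite reward sequence by working backwards with the recursion $G_t = R_{t+1} + \gamma G_{t+1}$:

```python
# Minimal sketch: compute the discounted return G_t for every time step of a
# finite reward sequence, using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
def discounted_returns(rewards, gamma=0.9):
    returns = [0.0] * len(rewards)
    g = 0.0
    # Walk backwards so each step reuses the return of the following step.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

if __name__ == "__main__":
    # Rewards R_1, R_2, R_3, R_4 received after time steps t = 0, 1, 2, 3.
    print(discounted_returns([1.0, 0.0, 0.0, 10.0], gamma=0.9))
```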

Markov Decision Processes

Definition

A reinforcement learning task that satisfies the Markov property is called a Markov decision process, or MDP. If the state and action spaces are finite, then it is called a finite Markov decision process (finite MDP).

A particular finite MDP is defined by its state and action sets and by the one-step dynamics of the environment. Given any state and action, $s$ and $a$, the probability of each possible pair of next state and reward, $s'$, $r$, is denoted

$$p(s', r \mid s, a) = \Pr\{S_{t+1}=s',\, R_{t+1}=r \mid S_t=s,\, A_t=a\}$$

These quantities completely specify the dynamics of a finite MDP.

With the dynamics, we can compute anything we want to know, such as the expected rewards for state-action pairs,

$$r(s,a) = \mathbb{E}\left[ R_{t+1} \mid S_t=s, A_t=a \right] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a),$$

the state-transition probabilities,

$$p(s' \mid s, a) = \Pr\{ S_{t+1}=s' \mid S_t=s, A_t=a \} = \sum_{r \in \mathcal{R}} p(s', r \mid s, a),$$

and the expected rewards for the state-action-next-state triples,

$$r(s,a,s') = \mathbb{E}\left[ R_{t+1} \mid S_t=s, A_t=a, S_{t+1}=s' \right] = \frac{\sum_{r \in \mathcal{R}} r\, p(s', r \mid s, a)}{p(s' \mid s, a)}$$
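To see how the bookkeeping works, here is a small sketch of my own: it stores the dynamics $p(s', r \mid s, a)$ of a made-up two-state MDP in a dictionary and computes $r(s,a)$, $p(s' \mid s,a)$, and $r(s,a,s')$ with exactly the sums above (the states, actions, and numbers are invented purely for illustration):

```python
# Sketch of a finite MDP given by its one-step dynamics p(s', r | s, a).
# The dictionary maps (s, a) to a list of (s', r, probability) triples.
# States 'A', 'B' and actions 'left', 'right' are made up for illustration.
dynamics = {
    ("A", "left"):  [("A", 0.0, 0.7), ("B", 1.0, 0.3)],
    ("A", "right"): [("B", 2.0, 1.0)],
    ("B", "left"):  [("A", 0.0, 1.0)],
    ("B", "right"): [("B", -1.0, 0.5), ("A", 3.0, 0.5)],
}

def r_sa(s, a):
    """Expected reward r(s,a): sum over (s', r) of r * p(s', r | s, a)."""
    return sum(p * r for (_, r, p) in dynamics[(s, a)])

def p_next(s, a, s_next):
    """State-transition probability p(s' | s, a): sum over r of p(s', r | s, a)."""
    return sum(p for (sp, _, p) in dynamics[(s, a)] if sp == s_next)

def r_sas(s, a, s_next):
    """Expected reward r(s, a, s') for a state-action-next-state triple."""
    num = sum(p * r for (sp, r, p) in dynamics[(s, a)] if sp == s_next)
    return num / p_next(s, a, s_next)

print(r_sa("A", "left"))          # 0.3
print(p_next("B", "right", "A"))  # 0.5
print(r_sas("B", "right", "A"))   # 3.0
```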

Value Functions

Recall that a policy, $\pi$, is a mapping from each state, $s \in \mathcal{S}$, and action, $a \in \mathcal{A}(s)$, to the probability $\pi(a \mid s)$ of taking action $a$ when in state $s$. For MDPs, we can define $v_\pi(s)$ (the value of a state $s$ under a policy $\pi$) formally as

$$v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t=s \right] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t=s \right],$$

where $\mathbb{E}_\pi[\cdot]$ denotes the expected value of a random variable given that the agent follows policy $\pi$, and $t$ is any time step. Note that the value of the terminal state, if any, is always zero. We call the function $v_\pi$ the state-value function for policy $\pi$.

Similarly, we define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_\pi(s,a)$, as the expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$:

$$q_\pi(s,a) = \mathbb{E}_\pi\left[ G_t \mid S_t=s, A_t=a \right] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t=s, A_t=a \right]$$

We call $q_\pi$ the action-value function for policy $\pi$.
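Since $v_\pi$ and $q_\pi$ are just expected returns, one direct (if naive) way to approximate $v_\pi(s)$ is to average sampled returns over many episodes that start in $s$ and follow $\pi$; this Monte Carlo idea is developed later in the book, but a rough sketch looks like the following (the `policy` and `step` callables and all parameters are assumptions of mine, not part of this chapter):

```python
# Sketch: estimate v_pi(s) as the average of sampled returns G_t from episodes
# that start in s and follow policy pi. The "average of returns" view is
# exactly the expectation in the definition; `policy(s)` is assumed to sample
# an action from pi(.|s) and `step(s, a)` to sample (s', r, done) from the
# environment dynamics.
def estimate_v(start_state, policy, step, gamma=0.9, episodes=1000, horizon=100):
    total = 0.0
    for _ in range(episodes):
        s, g, discount = start_state, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)            # a ~ pi(.|s)
            s, r, done = step(s, a)  # s', r ~ p(.,.|s,a)
            g += discount * r
            discount *= gamma
            if done:
                break
        total += g
    return total / episodes
```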

For any policy $\pi$ and any state $s$, the following consistency condition holds between the value of $s$ and the values of its possible successor states:

$$\begin{aligned} v_\pi(s) &= \mathbb{E}_\pi\left[ G_t \mid S_t=s \right] \\ &= \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t=s \right] \\ &= \mathbb{E}_\pi\left[ R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \,\middle|\, S_t=s \right], \end{aligned}$$

here, because of the Markov property

$$\begin{gathered} \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \,\middle|\, S_{t+1}=s', S_t=s \right] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \,\middle|\, S_{t+1}=s' \right] \\ \Longrightarrow \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \,\middle|\, S_t=s \right] = \sum_{s'} \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \,\middle|\, S_{t+1}=s' \right] p(s' \mid s), \end{gathered}$$

and we know

$$\begin{gathered} r(s,a) = \mathbb{E}\left[ R_{t+1} \mid S_t=s, A_t=a \right] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a) \\ p(s' \mid s, a) = \Pr\{ S_{t+1}=s' \mid S_t=s, A_t=a \} = \sum_{r \in \mathcal{R}} p(s', r \mid s, a) \end{gathered}$$

Therefore,

$$\begin{aligned} v_\pi(s) &= \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a) \left[ r + \gamma\, \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \,\middle|\, S_{t+1}=s' \right] \right] \\ &= \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right], \end{aligned}$$

This is the Bellman Equation for vπv_{\pi}vπ​. It expresses a relationship between the value of a state and the values of its successor states. It averages over all the possibilities, weighting each by its probability of occurring. It states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way.
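The Bellman equation can also be used directly as an update rule, which is the idea behind iterative policy evaluation in the Dynamic Programming chapter. A minimal sketch, assuming the made-up `dynamics` dictionary from the earlier example and a policy `pi` given as per-state action probabilities:

```python
# Sketch: iterative policy evaluation, using the Bellman equation for v_pi as
# an update rule. `dynamics` maps (s, a) to (s', r, prob) triples as in the
# earlier sketch; `pi` maps each state to a dict of action probabilities.
def evaluate_policy(dynamics, pi, states, gamma=0.9, theta=1e-8):
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
            new_v = sum(
                prob_a * p * (r + gamma * v[sp])
                for a, prob_a in pi[s].items()
                for (sp, r, p) in dynamics[(s, a)]
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

# Example: a uniformly random policy over the two made-up actions.
pi = {s: {"left": 0.5, "right": 0.5} for s in ("A", "B")}
# v = evaluate_policy(dynamics, pi, states=("A", "B"))
```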

Optimal Value Function

For finite MDPs, we can precisely define an optimal policy in the following way. Value functions define a partial ordering over policies. A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states. In other words, $\pi \ge \pi'$ if and only if $v_\pi(s) \ge v_{\pi'}(s)$ for all $s \in \mathcal{S}$. There is always at least one policy that is better than or equal to all other policies. This is an optimal policy. Although there may be more than one, we denote all the optimal policies by $\pi_*$. They share the same state-value function, called the optimal state-value function, denoted $v_*$, and defined as

$$v_*(s) = \max_\pi v_\pi(s),$$

for all $s \in \mathcal{S}$.

Optimal policies also share the same optimal action-value function, denoted $q_*$, and defined as

$$q_*(s,a) = \max_\pi q_\pi(s,a),$$

for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$.

Note: $v_*(s)$ does not simply repeat the $v_\pi(s)$ expression above; because it is the value function of an optimal policy, its consistency condition can be written in a special form without reference to any specific policy. Precisely speaking,

$$\begin{aligned} v_*(s) &= \max_{a \in \mathcal{A}(s)} q_{\pi_*}(s,a) \\ &= \max_a \mathbb{E}_{\pi_*}\left[ G_t \mid S_t=s, A_t=a \right] \\ &= \max_a \mathbb{E}_{\pi_*}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t=s, A_t=a \right] \\ &= \max_a \mathbb{E}_{\pi_*}\left[ R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \,\middle|\, S_t=s, A_t=a \right] \\ &= \max_a \mathbb{E}\left[ R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t=s, A_t=a \right] \\ &= \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right] \end{aligned}$$

Similarly,

$$\begin{aligned} q_*(s,a) &= \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\middle|\, S_t=s, A_t=a \right] \\ &= \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right] \end{aligned}$$
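These are the Bellman optimality equations, and turning the $v_*$ backup into an update rule gives value iteration, another topic of the Dynamic Programming chapter. A minimal sketch of my own, reusing the made-up `dynamics` representation from the earlier example and an assumed `actions` table mapping each state to its available actions:

```python
# Sketch: value iteration, using the Bellman optimality equation for v_* as an
# update rule. `dynamics` maps (s, a) to (s', r, prob) triples as before;
# `actions` maps each state to its available actions (an assumption of mine).
def value_iteration(dynamics, states, actions, gamma=0.9, theta=1e-8):
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
            best = max(
                sum(p * (r + gamma * v[sp]) for (sp, r, p) in dynamics[(s, a)])
                for a in actions[s]
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            break
    # A policy that is greedy with respect to v_* is an optimal policy.
    policy = {
        s: max(
            actions[s],
            key=lambda a: sum(p * (r + gamma * v[sp]) for (sp, r, p) in dynamics[(s, a)]),
        )
        for s in states
    }
    return v, policy
```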

Optimality and Approximation

Even if the agent has a complete and accurate environment model, the agent is typically unable to perform enough computation per time step to fully use it. The memory available is also an important constraint. Memory may be required to build up accurate approximations of value functions, policies, and models. In most cases of practical interest there are far more states than could possibly be entries in a table, and approximations must be made.
