Finite Markov Decision Processes
Preface
A somewhat aggressive idea proposed by the book:
In this chapter we introduce the problem that we try to solve in the rest of the book. For us, this problem defines the field of reinforcement learning: any method that is suited to solving this problem we consider to be a reinforcement learning method.
The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment. ... In fact, in some cases the agent may know everything about how its environment works and still face a difficult reinforcement learning task, just as we may know exactly how a puzzle like Rubik’s cube works, but still be unable to solve it. The agent–environment boundary represents the limit of the agent’s absolute control, not of its knowledge.
A side note, slightly beside the point: I noticed an interesting idea mentioned here, that even if the agent knows everything about how its environment works, it may still face a difficult learning task. It's somewhat philosophical, I think: you can know everything and still be unable to do it. It hints at the gap between methodology and epistemology, to some degree.
A model of learning goal-directed behavior:
Three signals passing back and forth between an agent and its environment: one signal to represent the choices made by the agent (the actions), one signal to represent the basis on which the choices are made (the states), and one signal to define the agent’s goal (the rewards).
Some tips:
The reward signal is your way of communicating to the robot what you want it to achieve, not how you want it achieved.
Rewards are computed in the environment rather than in the agent.
The agent's ultimate goal should be something over which it has imperfect control: it should not be able, for example, to simply decree that the reward has been received in the same way that it might arbitrarily change its actions.
Returns
The agent's goal is to maximize the cumulative reward it receives in the long run. Suppose the sequence of rewards received after time step $t$ is denoted $R_{t+1}, R_{t+2}, R_{t+3}, \dots$; then we seek to maximize the expected return, where the return $G_t$ is defined as some specific function of the reward sequence.
Discrete Situation
In the simplest case the return is the sum of the rewards:

$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T,$$

where $T$ is a final time step.
This approach makes sense in applications in which there is a natural notion of final time step, that is, when the agent–environment interaction breaks naturally into subsequences, which we call episodes, such as plays of a game, trips through a maze, or any sort of repeated interactions. Each episode ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states. Tasks with episodes of this kind are called episodic tasks.
Continual Situation
On the other hand, in many cases the agent–environment interaction does not break naturally into identifiable episodes, but goes on continually without limit. For example, this would be the natural way to formulate a continual process-control task, or an application to a robot with a long life span. We call these continuing tasks.
The additional concept that we need is that of discounting. According to this approach, the agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized. In particular, it chooses $A_t$ to maximize the expected discounted return:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},$$

where $\gamma$ is a parameter, $0 \le \gamma \le 1$, called the discount rate.
The discount rate determines the present value of future rewards: a reward received $k$ time steps in the future is worth only $\gamma^{k-1}$ times what it would be worth if it were received immediately. If $\gamma < 1$, the infinite sum has a finite value as long as the reward sequence is bounded. If $\gamma = 0$, the agent is "myopic" in being concerned only with maximizing immediate rewards: its objective in this case is to learn how to choose $A_t$ so as to maximize only $R_{t+1}$. If each of the agent's actions happened to influence only the immediate reward, not future rewards as well, then a myopic agent could maximize the equation by separately maximizing each immediate reward. But in general, acting to maximize immediate reward can reduce access to future rewards, so that the return may actually be reduced. As $\gamma$ approaches 1, the objective takes future rewards into account more strongly: the agent becomes more farsighted.
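As a quick sanity check (my own sketch, not from the book), the return can be computed from a finite reward sequence with a single backward pass, using $G_t = R_{t+1} + \gamma G_{t+1}$; with $\gamma = 1$ it reduces to the plain episodic sum above:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence.

    `rewards` lists R_{t+1}, R_{t+2}, ... collected after time step t.
    With gamma = 1 this is just the undiscounted episodic return.
    """
    g = 0.0
    # Accumulate from the last reward backwards: G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1, 0, 0, 2], gamma=0.9))  # 1 + 0.9**3 * 2 = 2.458
print(discounted_return([1, 0, 0, 2], gamma=1.0))  # plain sum = 3
```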
Markov Decision Processes
Definition
A reinforcement learning task that satisfies the Markov property is called a Markov decision process, or MDP. If the state and action spaces are finite, then it is called a finite Markov decision process (finite MDP).
A particular finite MDP is defined by its state and action sets and by the one-step dynamics of the environment. Given any state $s$ and action $a$, the probability of each possible pair of next state and reward, $s'$, $r$, is denoted

$$p(s', r \mid s, a) = \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a\}.$$
These quantities completely specify the dynamics of a finite MDP.
With the dynamics, we can compute anything we want to know, such as the expected rewards for state–action pairs,

$$r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{r} r \sum_{s'} p(s', r \mid s, a),$$

the state-transition probabilities,

$$p(s' \mid s, a) = \Pr\{S_{t+1} = s' \mid S_t = s, A_t = a\} = \sum_{r} p(s', r \mid s, a),$$

and the expected rewards for state–action–next-state triples,

$$r(s, a, s') = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] = \frac{\sum_{r} r \, p(s', r \mid s, a)}{p(s' \mid s, a)}.$$
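To make these formulas concrete, here is a small sketch (my own illustration, not from the book) that stores the dynamics $p(s', r \mid s, a)$ of a hypothetical two-state MDP as a Python dictionary and derives $r(s, a)$, $p(s' \mid s, a)$, and $r(s, a, s')$ from it:

```python
# Hypothetical two-state, two-action MDP: p[(s, a)] maps (s', r) -> probability.
p = {
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "go"):   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "stay"): {("s1", 2.0): 1.0},
    ("s1", "go"):   {("s0", 0.0): 0.9, ("s1", 2.0): 0.1},
}

def expected_reward(s, a):
    """r(s, a) = sum over (s', r) of r * p(s', r | s, a)."""
    return sum(prob * r for (_, r), prob in p[(s, a)].items())

def transition_prob(s, a, s_next):
    """p(s' | s, a) = sum over r of p(s', r | s, a)."""
    return sum(prob for (sp, _), prob in p[(s, a)].items() if sp == s_next)

def expected_reward_triple(s, a, s_next):
    """r(s, a, s') = sum_r r * p(s', r | s, a) / p(s' | s, a)."""
    num = sum(prob * r for (sp, r), prob in p[(s, a)].items() if sp == s_next)
    return num / transition_prob(s, a, s_next)

print(expected_reward("s0", "go"))               # 0.8 * 1.0 = 0.8
print(transition_prob("s0", "go", "s1"))         # 0.8
print(expected_reward_triple("s0", "go", "s1"))  # 1.0
```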
Value Functions
Recall that a policy, $\pi$, is a mapping from each state, $s \in \mathcal{S}$, and action, $a \in \mathcal{A}(s)$, to the probability $\pi(a \mid s)$ of taking action $a$ when in state $s$. For MDPs, we can define $v_\pi(s)$ (the value of a state $s$ under a policy $\pi$) formally as

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right],$$

where $\mathbb{E}_\pi[\cdot]$ denotes the expected value of a random variable given that the agent follows policy $\pi$, and $t$ is any time step. Note that the value of the terminal state, if any, is always zero. We call the function $v_\pi$ the state-value function for policy $\pi$.
Similarly, we define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_\pi(s, a)$, as the expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$:

$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right].$$

We call $q_\pi$ the action-value function for policy $\pi$.
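Because $v_\pi(s)$ is just an expected return, it can in principle be estimated by averaging sampled returns. A minimal sketch (my own, not from the book), assuming a hypothetical sampling interface `env_step(s, a) -> (next_state, reward, done)` and a policy given as `policy(s) -> {action: probability}`:

```python
import random

def estimate_v(env_step, policy, start_state, gamma=0.9, episodes=1000, max_steps=100):
    """Estimate v_pi(start_state) by averaging sampled discounted returns.

    `env_step(s, a)` is assumed to return (next_state, reward, done);
    `policy(s)` is assumed to return a dict mapping action -> probability.
    """
    total = 0.0
    for _ in range(episodes):
        s, g, discount = start_state, 0.0, 1.0
        for _ in range(max_steps):
            actions, probs = zip(*policy(s).items())
            a = random.choices(actions, weights=probs)[0]  # sample a ~ pi(.|s)
            s, r, done = env_step(s, a)
            g += discount * r
            discount *= gamma
            if done:
                break
        total += g
    return total / episodes
```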
For any policy $\pi$ and any state $s$, the following consistency condition holds between the value of $s$ and the value of its possible successor states:

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s].$$

Here, because of the Markov property,

$$\mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a)\Bigl[r + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']\Bigr],$$

and we know

$$\mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] = v_\pi(s').$$

Therefore,

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr], \quad \text{for all } s \in \mathcal{S}.$$

This is the Bellman equation for $v_\pi$. It expresses a relationship between the value of a state and the values of its successor states. It averages over all the possibilities, weighting each by its probability of occurring. It states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way.
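The Bellman equation gives one linear equation per state, so $v_\pi$ can be found by solving the system directly or by repeatedly applying it as an update rule. A minimal sketch of the latter approach (my own, not from the book), reusing the hypothetical dynamics dictionary `p` from the earlier sketch and a policy table `pi[s][a]`:

```python
def evaluate_policy(p, pi, states, gamma=0.9, theta=1e-8):
    """Repeatedly apply the Bellman equation for v_pi as an update until convergence.

    `p[(s, a)]` maps (s', r) -> probability; `pi[s]` maps a -> probability.
    """
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # v(s) <- sum_a pi(a|s) sum_{s', r} p(s', r | s, a) [r + gamma * v(s')]
            new_v = sum(
                pi[s][a] * sum(prob * (r + gamma * v[sp])
                               for (sp, r), prob in p[(s, a)].items())
                for a in pi[s]
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

# Example: a uniformly random policy over the two actions of the earlier MDP.
pi = {s: {"stay": 0.5, "go": 0.5} for s in ("s0", "s1")}
# print(evaluate_policy(p, pi, states=("s0", "s1")))
```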
Optimal Value Function
For finite MDPs, we can precisely define an optimal policy in the following way. Value functions define a partial ordering over policies. A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states. In other words, $\pi \ge \pi'$ if and only if $v_\pi(s) \ge v_{\pi'}(s)$ for all $s \in \mathcal{S}$. There is always at least one policy that is better than or equal to all other policies. This is an optimal policy. Although there may be more than one, we denote all the optimal policies by $\pi_*$. They share the same state-value function, called the optimal state-value function, denoted $v_*$, and defined as

$$v_*(s) = \max_\pi v_\pi(s)$$

for all $s \in \mathcal{S}$.
Optimal policies also share the same optimal action-value function, denoted $q_*$, and defined as

$$q_*(s, a) = \max_\pi q_\pi(s, a)$$

for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$.
Note: $q_*$ here does not have exactly the same meaning as the $q_\pi$ we defined above. Precisely speaking, $q_*(s, a)$ gives the expected return for taking action $a$ in state $s$ and thereafter following an optimal policy, so it can be written in terms of $v_*$ as

$$q_*(s, a) = \mathbb{E}\bigl[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a\bigr].$$
Similarly, $v_*$ can be written in terms of $q_*$:

$$v_*(s) = \max_{a \in \mathcal{A}(s)} q_*(s, a).$$
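Putting these two relations together, here is a sketch (again my own, on the hypothetical dynamics dictionary from above, and assuming $v_*$ has already been computed somehow): $q_*$ follows from $v_*$ and the one-step dynamics, and acting greedily with respect to $q_*$ gives an optimal policy.

```python
def q_star_from_v_star(p, v_star, s, a, gamma=0.9):
    """q_*(s, a) = E[R_{t+1} + gamma * v_*(S_{t+1}) | S_t = s, A_t = a],
    computed from the dynamics p[(s, a)]: (s', r) -> probability."""
    return sum(prob * (r + gamma * v_star[sp])
               for (sp, r), prob in p[(s, a)].items())

def greedy_policy(p, v_star, states, actions, gamma=0.9):
    """A deterministic optimal policy picks, in each state, an action that
    maximizes q_*(s, a); equivalently v_*(s) = max_a q_*(s, a)."""
    return {s: max(actions, key=lambda a: q_star_from_v_star(p, v_star, s, a, gamma))
            for s in states}
```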
Optimality and Approximation
Even if the agent has a complete and accurate model of the environment, it is typically unable to perform enough computation per time step to fully use it. The memory available is also an important constraint. Memory may be required to build up accurate approximations of value functions, policies, and models. In most cases of practical interest there are far more states than could possibly be entries in a table, and approximations must be made.