Reinforcement learning

Markov decision process

The agent-environment interaction in a Markov Decision Process.

  • Agent: decision maker

  • Action: the agent's decision, which impacts the environment ($A_t$)

  • State: the observation of the environment ($S_t$)

  • Reward: a special numerical value ($R_t$)

  • Environment: everything outside the agent

At each time step $t \in \{0, 1, 2, \dots\}$, the agent receives some representation of the environment's state, $S_t \in \mathcal{S}$, and on that basis selects an action, $A_t \in \mathcal{A}(s)$, where the set of available actions depends on the state $s$. One step later, the agent receives a numerical reward, $R_{t+1} \in \mathcal{R}$.

The history is the sequence of states, actions, rewards:

$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$$
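
A minimal sketch of this interaction loop in Python, assuming a hypothetical environment object with `reset()` and `step(action)` methods and a `policy` callable (these interfaces are assumptions for illustration, not any particular library's API):

```python
def run_episode(env, policy, max_steps=100):
    """Roll out one episode and record the history S_0, A_0, R_1, S_1, A_1, R_2, ...

    `env` is assumed to expose reset() -> state and step(action) ->
    (next_state, reward, done); `policy` maps a state to an action.
    Both interfaces are hypothetical and used only for illustration.
    """
    history = []
    state = env.reset()          # S_0
    for _ in range(max_steps):
        action = policy(state)   # A_t, chosen from the current state
        next_state, reward, done = env.step(action)
        history.append((state, action, reward))  # (S_t, A_t, R_{t+1})
        state = next_state
        if done:
            break
    return history
```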

The conditional probability of $s'$ and $r$ at time $t$, given the preceding state and action, is:

$$p(s',r|s,a) = P(S_t = s', R_t = r \,|\, S_{t-1} = s, A_{t-1} = a)$$

State-transition probability

$$p(s'|s,a) = P(S_t = s' \,|\, S_{t-1} = s, A_{t-1} = a) = \sum_{r \in \mathcal{R}} p(s',r|s,a)$$

Expected rewards for state-action pairs

$$r(s,a) = \mathbb{E}[R_t \,|\, S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s',r|s,a)$$

Expected rewards for state-action-next-state triples

$$r(s,a,s') = \mathbb{E}[R_t \,|\, S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r \in \mathcal{R}} r \, \dfrac{p(s',r|s,a)}{p(s'|s,a)}$$
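
The three derived quantities above can be computed directly from a tabular model. A small sketch, assuming (purely for illustration) that the dynamics $p(s',r|s,a)$ are stored as a dictionary mapping each $(s,a)$ pair to a list of $(s', r, \text{probability})$ triples:

```python
def transition_prob(dynamics, s, a, s_next):
    """p(s'|s,a): marginalize the reward out of p(s',r|s,a)."""
    return sum(p for sp, r, p in dynamics[(s, a)] if sp == s_next)

def expected_reward(dynamics, s, a):
    """r(s,a): expected reward for a state-action pair."""
    return sum(r * p for sp, r, p in dynamics[(s, a)])

def expected_reward_triple(dynamics, s, a, s_next):
    """r(s,a,s'): expected reward given the next state as well."""
    p_next = transition_prob(dynamics, s, a, s_next)
    return sum(r * p for sp, r, p in dynamics[(s, a)] if sp == s_next) / p_next

# Tiny made-up example: from state 0, action 0 moves to state 1 with reward 1
# half of the time, otherwise stays in state 0 with reward 0.
dynamics = {(0, 0): [(1, 1.0, 0.5), (0, 0.0, 0.5)]}
print(transition_prob(dynamics, 0, 0, 1))         # 0.5
print(expected_reward(dynamics, 0, 0))            # 0.5
print(expected_reward_triple(dynamics, 0, 0, 1))  # 1.0
```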

  • Return: the (possibly discounted) total reward the agent accumulates from time $t$ onward ($G_t$)

  • Policy: a probability distribution over actions given a state ($\pi(a|s)$)

  • Value: the expected return from a state under a policy ($v_\pi(s)$)

Return or discounted return

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma \le 1$$
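
For a finite reward sequence the discounted return can be accumulated back to front; a minimal sketch (the reward values are made up):

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite list of rewards
    R_{t+1}, R_{t+2}, ..., accumulated back to front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```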

The state-value function for policy π\pi

$$v_\pi(s) = \mathbb{E}_\pi \left[ G_t \,|\, S_t = s \right]$$

The action-value function for policy π\pi

$$q_\pi(s,a) = \mathbb{E}_\pi \left[ G_t \,|\, S_t = s, A_t = a \right]$$
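
Because both value functions are expectations of the return, they can be estimated by averaging sampled returns. A first-visit Monte Carlo sketch for $v_\pi$, assuming episodes are given as lists of $(S_t, A_t, R_{t+1})$ triples generated by following $\pi$ (the example episodes are made up):

```python
from collections import defaultdict

def mc_state_values(episodes, gamma=0.9):
    """First-visit Monte Carlo estimate of v_pi(s): average the return G_t
    observed the first time each state is visited in each episode."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        first_visit_returns = {}
        # Sweep backwards so g is always the return from the current step onward.
        for state, _action, reward in reversed(episode):
            g = reward + gamma * g
            first_visit_returns[state] = g  # earlier-in-time visits overwrite later ones
        for state, g_first in first_visit_returns.items():
            returns[state].append(g_first)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Two made-up episodes from a two-state MDP, just to show the call.
episodes = [
    [(0, "a", 0.0), (1, "b", 1.0)],
    [(0, "a", 1.0)],
]
print(mc_state_values(episodes, gamma=0.9))  # {1: 1.0, 0: 0.95}
```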

The Bellman equation for vπ(s)v_\pi(s)

$$v_\pi(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a) \left[ r + \gamma v_\pi(s') \right]$$

The Bellman equation for qπ(s,a)q_\pi(s,a)

$$q_\pi(s,a) = \sum_{s',r} p(s',r|s,a) \left[ r + \gamma \sum_{a'} \pi(a'|s') \, q_\pi(s',a') \right]$$
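
These Bellman equations can be used as update rules. A sketch of iterative policy evaluation for $v_\pi$, reusing the tabular $(s,a) \mapsto [(s', r, p), \dots]$ representation assumed earlier (the two-state MDP below is made up):

```python
def policy_evaluation(states, actions, dynamics, policy, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation: apply the Bellman equation for v_pi as an
    update rule until the value function changes by less than `theta`.

    `policy[s][a]` is pi(a|s); `dynamics[(s, a)]` is a list of (s', r, p)
    triples giving p(s', r | s, a). Both structures are assumptions of this sketch.
    """
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                policy[s][a] * sum(p * (r + gamma * v[sp]) for sp, r, p in dynamics[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

# Two-state example: action 0 moves 0 -> 1 with reward 1; state 1 is absorbing
# with reward 0. The single policy always picks action 0.
states, actions = [0, 1], [0]
dynamics = {(0, 0): [(1, 1.0, 1.0)], (1, 0): [(1, 0.0, 1.0)]}
policy = {0: {0: 1.0}, 1: {0: 1.0}}
print(policy_evaluation(states, actions, dynamics, policy))  # v(0) ~= 1.0, v(1) ~= 0.0
```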

Optimal state-value function

$$v_*(s) = \max_\pi v_\pi(s)$$

Prediction & Control

  • Prediction: estimating the value function for a given policy $\pi$

  • Control: finding the optimal policy
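
As one concrete control method, value iteration backs up $v(s) \leftarrow \max_a \sum_{s',r} p(s',r|s,a)\left[ r + \gamma v(s') \right]$ and then reads off a greedy policy. A sketch under the same assumed tabular representation (the MDP below is made up):

```python
def value_iteration(states, actions, dynamics, gamma=0.9, theta=1e-8):
    """Value iteration: repeatedly back up
    v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')],
    then extract a greedy policy. `dynamics[(s, a)]` is a list of (s', r, p)
    triples, the same assumed representation as in the earlier sketches."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(
                sum(p * (r + gamma * v[sp]) for sp, r, p in dynamics[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            break
    # Greedy policy with respect to the converged value function.
    policy = {
        s: max(actions, key=lambda a: sum(p * (r + gamma * v[sp])
                                          for sp, r, p in dynamics[(s, a)]))
        for s in states
    }
    return v, policy

# Tiny example: from state 0, action 0 gives reward 0 and stays put, action 1
# gives reward 1 and moves to the absorbing state 1 (reward 0 thereafter).
states, actions = [0, 1], [0, 1]
dynamics = {
    (0, 0): [(0, 0.0, 1.0)], (0, 1): [(1, 1.0, 1.0)],
    (1, 0): [(1, 0.0, 1.0)], (1, 1): [(1, 0.0, 1.0)],
}
v_star, pi_star = value_iteration(states, actions, dynamics)
print(v_star)   # v*(0) ~= 1.0, v*(1) ~= 0.0
print(pi_star)  # optimal action in state 0 is 1
```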
