Reinforcement learning

Markov decision process

Figure: the agent-environment interaction in a Markov decision process.

  • Agent: the decision maker

  • Action: the agent's decision, which impacts the environment ($A_t$)

  • State: the observation of the environment ($S_t$)

  • Reward: a special numerical value ($R_t$)

  • Environment: comprising everything outside the agent

At each time step $t \in \{0, 1, 2, \dots\}$, the agent receives some representation of the environment's state, $S_t \in \mathcal{S}$, and on that basis selects an action, $A_t \in \mathcal{A}(s)$, where the set of available actions depends on the state. One time step later, the agent receives a numerical reward, $R_{t+1} \in \mathcal{R}$.

The history is the sequence of states, actions, and rewards:

$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$$

The joint conditional probability of $s'$ and $r$ at time $t$, given the preceding state and action, defines the dynamics of the MDP:

$$p(s', r \mid s, a) = P(S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a)$$

State-transition probability

$$p(s' \mid s, a) = P(S_t = s' \mid S_{t-1} = s, A_{t-1} = a) = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)$$

Expected rewards for state-action pairs

$$r(s, a) = \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)$$

Expected rewards for state-action-next-state triples

$$r(s, a, s') = \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r \in \mathcal{R}} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}$$
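
The four-argument dynamics function $p(s', r \mid s, a)$ fully characterizes a finite MDP, and the three quantities above are just marginalizations of it. Below is a minimal sketch in plain Python; the two-state MDP (the `dynamics` dict, states `s0`/`s1`, actions `stay`/`move`) is made up purely for illustration.

```python
# Hypothetical two-state MDP: dynamics[(s, a)] lists (s', r, p(s', r | s, a)) triples.
dynamics = {
    ("s0", "stay"): [("s0", 0.0, 0.9), ("s1", 1.0, 0.1)],
    ("s0", "move"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "move"): [("s0", 5.0, 0.3), ("s1", 0.0, 0.7)],
}

def transition_prob(s, a, s_next):
    """p(s'|s,a): marginalize the reward out of p(s',r|s,a)."""
    return sum(p for sn, r, p in dynamics[(s, a)] if sn == s_next)

def expected_reward(s, a):
    """r(s,a): sum of r * p(s',r|s,a) over all s' and r."""
    return sum(r * p for sn, r, p in dynamics[(s, a)])

def expected_reward_given_next(s, a, s_next):
    """r(s,a,s'): sum over r of r * p(s',r|s,a) / p(s'|s,a)."""
    num = sum(r * p for sn, r, p in dynamics[(s, a)] if sn == s_next)
    return num / transition_prob(s, a, s_next)

print(transition_prob("s0", "move", "s1"))             # 0.8
print(expected_reward("s0", "move"))                   # 0.8
print(expected_reward_given_next("s1", "move", "s0"))  # 5.0
```
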
  • Return: the total (discounted) reward the agent collects from time $t$ onward ($G_t$)

  • Policy: a probability distribution over actions given a state ($\pi(a \mid s)$)

  • Value: the expected return from a state under a given policy ($v_\pi(s)$)

Return or discounted return

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma \le 1$$
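
As a quick sanity check, the return of a finite episode can be computed by accumulating backwards, since $G_t = R_{t+1} + \gamma G_{t+1}$. The reward values and $\gamma$ below are made up:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite episode."""
    g = 0.0
    for r in reversed(rewards):   # G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9*0 + 0.81*2 = 2.62
```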

The state-value function for policy $\pi$

$$v_\pi(s) = \mathbb{E}_\pi \left[ G_t \mid S_t = s \right]$$

The action-value function for policy $\pi$

$$q_\pi(s, a) = \mathbb{E}_\pi \left[ G_t \mid S_t = s, A_t = a \right]$$

The Bellman equation for $v_\pi(s)$

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$

The Bellman equation for $q_\pi(s, a)$

$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \sum_{a'} \pi(a' \mid s') \, q_\pi(s', a') \right]$$
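
Because the value function appears on both sides, either Bellman equation can be turned directly into an update rule; sweeping the $v_\pi$ version repeatedly over all states is iterative policy evaluation. A minimal sketch for an equiprobable policy on the hypothetical two-state `dynamics` dict from the first sketch (repeated here so the snippet runs on its own):

```python
# Same hypothetical two-state MDP as in the first sketch.
dynamics = {
    ("s0", "stay"): [("s0", 0.0, 0.9), ("s1", 1.0, 0.1)],
    ("s0", "move"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "move"): [("s0", 5.0, 0.3), ("s1", 0.0, 0.7)],
}
states, actions, gamma = ["s0", "s1"], ["stay", "move"], 0.9
policy = {s: {a: 0.5 for a in actions} for s in states}  # equiprobable pi(a|s)

v = {s: 0.0 for s in states}
for _ in range(1000):  # sweep the Bellman backup until it has (roughly) converged
    v = {
        s: sum(
            policy[s][a] * p * (r + gamma * v[s_next])
            for a in actions
            for s_next, r, p in dynamics[(s, a)]
        )
        for s in states
    }
print(v)  # approximate v_pi(s) for the equiprobable policy
```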

Optimal state-value function

$$v_*(s) = \max_\pi v_\pi(s)$$
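
Taken literally, this definition can be implemented for a tiny finite MDP by evaluating every deterministic policy exactly (via the linear system $v_\pi = r_\pi + \gamma P_\pi v_\pi$) and taking the per-state maximum; a deterministic optimal policy always exists for finite MDPs, so this recovers $v_*$. The sketch below is written against the `dynamics` representation from the first sketch and is only feasible for very small state and action sets, since it enumerates $|\mathcal{A}|^{|\mathcal{S}|}$ policies.

```python
import itertools
import numpy as np

def evaluate(dynamics, states, policy, gamma=0.9):
    """Exact v_pi for a deterministic policy: solve v = r_pi + gamma * P_pi v."""
    n = len(states)
    P, r = np.zeros((n, n)), np.zeros(n)
    for i, s in enumerate(states):
        for s_next, reward, prob in dynamics[(s, policy[s])]:
            P[i, states.index(s_next)] += prob
            r[i] += prob * reward
    return np.linalg.solve(np.eye(n) - gamma * P, r)

def optimal_state_values(dynamics, states, actions, gamma=0.9):
    """v_*(s) = max over deterministic policies of v_pi(s), taken state by state."""
    policies = (dict(zip(states, choice))
                for choice in itertools.product(actions, repeat=len(states)))
    return np.max([evaluate(dynamics, states, pi, gamma) for pi in policies], axis=0)

# Usage with the hypothetical two-state MDP defined earlier:
#   optimal_state_values(dynamics, ["s0", "s1"], ["stay", "move"])
```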

Prediction & Control

  • Prediction: estimating the value function for a given policy $\pi$

  • Control: finding the optimal policy
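
Prediction and control fit together: evaluate the current policy, then improve it by acting greedily with respect to the resulting value function, and repeat (this alternation is policy iteration, one of the tabular solution methods). Below is a sketch of the greedy improvement step, written against the same `dynamics` and value representation as the sketches above; all argument names come from those sketches.

```python
def greedy_policy(dynamics, states, actions, gamma, v):
    """Control step: pi'(s) = argmax_a sum_{s',r} p(s',r|s,a) * (r + gamma * v[s'])."""
    improved = {}
    for s in states:
        q = {a: sum(p * (r + gamma * v[s_next])
                    for s_next, r, p in dynamics[(s, a)])
             for a in actions}
        improved[s] = max(q, key=q.get)  # greedy action under the current value estimate
    return improved

# Usage, e.g. with v from the iterative policy evaluation sketch:
#   greedy_policy(dynamics, ["s0", "s1"], ["stay", "move"], 0.9, v)
```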