Markov decision process
Figure: the agent-environment interaction in a Markov Decision Process.
Action: the agent's decision, which affects the environment ($A_t$)
State: the agent's observation of the environment ($S_t$)
Reward: a special numerical value ($R_t$)
Environment: everything outside the agent
At each time step $t \in \{0, 1, 2, \dots\}$, the agent receives some representation of the environment's state, $S_t \in \mathcal{S}$, and on that basis selects an action, $A_t \in \mathcal{A}(s)$, where the set of available actions depends on the state. One step later, the agent receives a numerical reward, $R_{t+1} \in \mathcal{R}$.
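As a rough illustration, this interaction loop can be written directly in code. A minimal Python sketch; the `env` object with `reset()`/`step()` and the `policy` function are hypothetical stand-ins, not defined in these notes:

```python
def run_episode(env, policy, max_steps=100):
    """Roll out one episode and record the history S0, A0, R1, S1, A1, R2, ..."""
    history = []
    state = env.reset()                      # S_0 (hypothetical env interface)
    for t in range(max_steps):
        action = policy(state)               # A_t, chosen from the current state S_t
        next_state, reward, done = env.step(action)   # environment returns S_{t+1}, R_{t+1}
        history += [state, action, reward]
        state = next_state
        if done:
            break
    return history
```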
The history is the sequence of states, actions, and rewards:

$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$$

The dynamics of the MDP are given by the joint conditional probability of the next state $s'$ and reward $r$, given the current state $s$ and action $a$:
$$p(s', r \mid s, a) = P(S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a)$$

State-transition probability
$$p(s' \mid s, a) = P(S_t = s' \mid S_{t-1} = s, A_{t-1} = a) = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)$$

Expected rewards for state-action pairs
$$r(s, a) = \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)$$

Expected rewards for state-action-next-state triples
$$r(s, a, s') = \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r \in \mathcal{R}} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}$$
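These three quantities are all marginals or expectations of the four-argument dynamics $p(s', r \mid s, a)$. A minimal Python sketch, assuming a tabular MDP stored as a dictionary `p[(s, a)] = {(s_next, r): prob}`; the states, actions, and probabilities below are made up for illustration:

```python
# Tabular dynamics: p[(s, a)] maps (s_next, r) -> probability (made-up example MDP).
p = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 0.0): 1.0},
    ("s1", "a0"): {("s1", 2.0): 1.0},
    ("s1", "a1"): {("s0", 0.0): 1.0},
}

def transition_prob(s, a, s_next):
    """p(s' | s, a) = sum over r of p(s', r | s, a)."""
    return sum(prob for (sn, r), prob in p[(s, a)].items() if sn == s_next)

def expected_reward(s, a):
    """r(s, a) = sum over s', r of r * p(s', r | s, a)."""
    return sum(r * prob for (sn, r), prob in p[(s, a)].items())

def expected_reward_triple(s, a, s_next):
    """r(s, a, s') = sum over r of r * p(s', r | s, a) / p(s' | s, a)."""
    denom = transition_prob(s, a, s_next)
    return sum(r * prob for (sn, r), prob in p[(s, a)].items() if sn == s_next) / denom

print(transition_prob("s0", "a0", "s1"))         # 0.5
print(expected_reward("s0", "a0"))               # 0.5
print(expected_reward_triple("s0", "a0", "s1"))  # 1.0
```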
Return: the total (possibly discounted) reward the agent accumulates from time $t$ onward ($G_t$)
Policy: a probability distribution over actions given a state ($\pi(a \mid s)$); see the sketch after this list
Value: the expected return from a state under a policy ($v_\pi(s)$)
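A tabular policy is just one categorical distribution over actions per state. A small sketch with made-up states, actions, and probabilities:

```python
import random

# pi[s] is a categorical distribution over actions for state s (made-up numbers).
pi = {
    "s0": {"a0": 0.8, "a1": 0.2},
    "s1": {"a0": 0.5, "a1": 0.5},
}

def sample_action(state):
    """Draw A_t ~ pi(. | S_t = state)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action("s0"))   # "a0" about 80% of the time
```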
Return or discounted return
$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma \le 1$$
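For an episode that ends after a finite number of steps, the sum is simply truncated. A quick sketch with arbitrary reward values:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum over k of gamma^k * R_{t+k+1}, truncated at the end of the episode."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81 * 2.0 = 2.62
```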
The state-value function for policy $\pi$

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

The action-value function for policy $\pi$
$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

The Bellman equation for $v_\pi(s)$
$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right]$$

The Bellman equation for $q_\pi(s, a)$
$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\right]$$
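The Bellman equation for $v_\pi$ can be applied repeatedly as an update rule until the values stop changing (iterative policy evaluation). A sketch using the same tabular `p` and `pi` representation as the earlier snippets (made-up MDP, arbitrary discount):

```python
# Iterative policy evaluation: apply the Bellman equation for v_pi as an update rule.
p = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 0.0): 1.0},
    ("s1", "a0"): {("s1", 2.0): 1.0},
    ("s1", "a1"): {("s0", 0.0): 1.0},
}
pi = {"s0": {"a0": 0.8, "a1": 0.2}, "s1": {"a0": 0.5, "a1": 0.5}}

def policy_evaluation(p, pi, gamma=0.9, tol=1e-8):
    v = {s: 0.0 for s in pi}                 # start from v(s) = 0 for every state
    while True:
        delta = 0.0
        for s in pi:
            # v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
            new_v = sum(
                pi[s][a] * sum(prob * (r + gamma * v[s_next])
                               for (s_next, r), prob in p[(s, a)].items())
                for a in pi[s]
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

print(policy_evaluation(p, pi))   # converges because gamma < 1
```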
Optimal state-value function

$$v_*(s) = \max_\pi v_\pi(s)$$

Prediction & Control
Prediction: estimating the value function for a given policy $\pi$
Control: finding the optimal policy
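The policy evaluation sketch above is an instance of prediction. One standard way to do control, not spelled out in these notes, is value iteration: replace the expectation over $\pi$ in the Bellman backup with a max over actions, then read off the greedy policy. A sketch under the same tabular assumptions as the earlier snippets:

```python
def value_iteration(p, states, actions, gamma=0.9, tol=1e-8):
    """Approximate v_* with max-over-actions backups, then extract a greedy policy."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(
                sum(prob * (r + gamma * v[s_next])
                    for (s_next, r), prob in p[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            break
    # A policy that is greedy with respect to v_* is optimal.
    greedy = {}
    for s in states:
        greedy[s] = max(
            actions,
            key=lambda a: sum(prob * (r + gamma * v[s_next])
                              for (s_next, r), prob in p[(s, a)].items()),
        )
    return v, greedy

# Usage with the made-up p dictionary from the dynamics sketch above:
# v_star, pi_star = value_iteration(p, ["s0", "s1"], ["a0", "a1"])
```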