Reinforcement learning
The agent-environment interaction in a Markov Decision Process.
Agent: decision maker
Action: the agent's decision, which affects the environment ($A_t$)
State: the agent's observation of the environment ($S_t$)
Reward: a special numerical value ($R_t$)
Environment: comprising everything outside the agent
At each time step $t$, the agent receives some representation of the environment's state, $S_t \in \mathcal{S}$, and on that basis selects an action, $A_t \in \mathcal{A}(S_t)$. One time step later, the agent receives a numerical reward, $R_{t+1} \in \mathcal{R}$.
The history is the sequence of states, actions, and rewards: $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$
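The interaction loop that produces such a history can be sketched in a few lines of Python. The two-state environment, the `step` function, and the uniformly random policy below are hypothetical and exist only for illustration.

```python
import random

# Hypothetical two-state, two-action environment (illustration only).
STATES = ["low", "high"]
ACTIONS = ["wait", "search"]

def step(state, action, rng):
    """Return (reward, next_state) for one environment transition."""
    if action == "wait":
        return 1.0, state              # small reward, state unchanged
    # "search" pays more, but the next state is drawn at random
    return 2.0, rng.choice(STATES)

def random_policy(state, rng):
    """Pick an action uniformly at random, ignoring the state."""
    return rng.choice(ACTIONS)

def rollout(num_steps, seed=0):
    """Build the history S_0, A_0, R_1, S_1, A_1, R_2, ... as a flat list."""
    rng = random.Random(seed)
    state = "high"
    history = [state]
    for _ in range(num_steps):
        action = random_policy(state, rng)
        reward, state = step(state, action, rng)
        history.extend([action, reward, state])
    return history

history = rollout(num_steps=3)
print(history)
```

Each pass through the loop appends one action, the reward that follows it, and the next state, so the list interleaves the three in the same order as the history written above.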
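The conditional probability of $S_t$ and $R_t$ at time $t$, given the preceding state and action, is: $p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$
State-transition probability: $p(s' \mid s, a) = \sum_{r} p(s', r \mid s, a)$
Expected rewards for state-action pairs: $r(s, a) = \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r} r \sum_{s'} p(s', r \mid s, a)$
Expected rewards for state-action-next-state triples: $r(s, a, s') = \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}$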
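As a rough illustration, the three derived quantities can be computed from a tabular joint distribution over (next state, reward) pairs. The table `P` and its numbers are made up for this sketch; only the summations mirror the definitions above.

```python
# Hypothetical dynamics p(s', r | s, a), stored as
# {(s, a): {(s_next, r): probability}}. The numbers are illustrative.
P = {
    ("high", "search"): {("high", 2.0): 0.7, ("low", 2.0): 0.3},
    ("high", "wait"): {("high", 1.0): 1.0},
    ("low", "search"): {("high", -3.0): 0.4, ("low", 2.0): 0.6},
    ("low", "wait"): {("low", 1.0): 1.0},
}

def transition_prob(s, a, s_next):
    """p(s' | s, a): sum the joint distribution over rewards."""
    return sum(p for (sn, _r), p in P[(s, a)].items() if sn == s_next)

def expected_reward(s, a):
    """r(s, a): expected reward over all (s', r) outcomes."""
    return sum(r * p for (_sn, r), p in P[(s, a)].items())

def expected_reward_given_next(s, a, s_next):
    """r(s, a, s'): expected reward conditioned on landing in s'.

    Undefined (division by zero) when p(s' | s, a) is 0.
    """
    joint = sum(r * p for (sn, r), p in P[(s, a)].items() if sn == s_next)
    return joint / transition_prob(s, a, s_next)

print(transition_prob("low", "search", "high"))
print(expected_reward("high", "search"))
print(expected_reward_given_next("low", "search", "high"))
```

Storing the full joint distribution keeps all three functions as one-line sums; a model that only stored $p(s' \mid s, a)$ and $r(s, a)$ could not recover $r(s, a, s')$.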
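Return or discounted return: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$, where $0 \le \gamma \le 1$ is the discount rate.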
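For a finite episode, the discounted return can be accumulated backwards using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$. The function name and the sample rewards below are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Return G_t for rewards [R_{t+1}, R_{t+2}, ...] of a finite episode."""
    g = 0.0
    # Work from the last reward back to the first: G = R + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

With `gamma=0.5` the three rewards contribute 1, 0.5, and 0.25, which is where the 1.75 comes from.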
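Optimal state-value function: $v_*(s) = \max_{\pi} v_\pi(s)$ for all states $s$, where $v_\pi$ is the state-value function for policy $\pi$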
Control: finding the optimal policy
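One standard way to solve the control problem for a small, fully known MDP is value iteration. The text above does not name a specific algorithm, so the following is only a sketch of that technique; the dynamics table `P`, the discount `GAMMA`, and the tolerance are hypothetical choices.

```python
GAMMA = 0.9  # assumed discount rate
STATES = ["high", "low"]
ACTIONS = ["search", "wait"]

# Hypothetical dynamics p(s', r | s, a): {(s, a): {(s_next, r): prob}}.
P = {
    ("high", "search"): {("high", 2.0): 0.7, ("low", 2.0): 0.3},
    ("high", "wait"): {("high", 1.0): 1.0},
    ("low", "search"): {("high", -3.0): 0.4, ("low", 2.0): 0.6},
    ("low", "wait"): {("low", 1.0): 1.0},
}

def q_from_v(v, s, a):
    """One-step lookahead: expected reward plus discounted next-state value."""
    return sum(p * (r + GAMMA * v[sn]) for (sn, r), p in P[(s, a)].items())

def value_iteration(tol=1e-10):
    """Approximate v_* by repeated max-backups, then read off a greedy policy."""
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            new_v = max(q_from_v(v, s, a) for a in ACTIONS)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            break
    policy = {s: max(ACTIONS, key=lambda a: q_from_v(v, s, a)) for s in STATES}
    return v, policy

v_star, greedy_policy = value_iteration()
print(v_star, greedy_policy)
```

The returned policy is greedy with respect to the converged values, which is what makes it optimal for this tabular model.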
Return: the total (discounted) reward an agent collects from time step $t$ onward. ($G_t$)
Policy: a probability distribution over actions, given a state. ($\pi(a \mid s)$)
Value: the expected return from a state under a policy. ($v_\pi(s)$)
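The state-value function for policy $\pi$: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
The action-value function for policy $\pi$: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$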
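The Bellman equation for $v_\pi$: $v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$
The Bellman equation for $q_\pi$: $q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \sum_{a'} \pi(a' \mid s') \, q_\pi(s', a') \right]$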
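The Bellman equation for the state-value function follows from the recursive form of the return, $G_t = R_{t+1} + \gamma G_{t+1}$. A short derivation, written in the standard notation $\pi(a \mid s)$ for the policy and $p(s', r \mid s, a)$ for the dynamics:

```latex
\begin{align*}
v_\pi(s) &= \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] \\
         &= \mathbb{E}_\pi\left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s \right] \\
         &= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)
            \left[ r + \gamma \, \mathbb{E}_\pi\left[ G_{t+1} \mid S_{t+1} = s' \right] \right] \\
         &= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)
            \left[ r + \gamma \, v_\pi(s') \right]
\end{align*}
```

The third line uses the Markov property: once $S_{t+1} = s'$ is known, the expected future return no longer depends on $s$ or $a$.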
Prediction: estimating the value function for a given policy
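For a small MDP with known dynamics, one common way to carry out prediction is iterative policy evaluation, which applies the Bellman equation for the state-value function as an update rule until the values stop changing. The sketch below assumes a hypothetical dynamics table `P`, a discount `GAMMA`, and two example policies; none of these come from the text above.

```python
GAMMA = 0.9  # assumed discount rate
STATES = ["high", "low"]
ACTIONS = ["search", "wait"]

# Hypothetical dynamics p(s', r | s, a): {(s, a): {(s_next, r): prob}}.
P = {
    ("high", "search"): {("high", 2.0): 0.7, ("low", 2.0): 0.3},
    ("high", "wait"): {("high", 1.0): 1.0},
    ("low", "search"): {("high", -3.0): 0.4, ("low", 2.0): 0.6},
    ("low", "wait"): {("low", 1.0): 1.0},
}

def bellman_backup(v, policy, s):
    """Right-hand side of the Bellman equation for v_pi at state s."""
    return sum(
        policy[s][a]
        * sum(p * (r + GAMMA * v[sn]) for (sn, r), p in P[(s, a)].items())
        for a in ACTIONS
    )

def policy_evaluation(policy, tol=1e-10):
    """Sweep over all states until the largest update falls below tol."""
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            new_v = bellman_backup(v, policy, s)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

# policy[s][a] is the probability pi(a | s).
uniform = {s: {a: 0.5 for a in ACTIONS} for s in STATES}
always_wait = {s: {"search": 0.0, "wait": 1.0} for s in STATES}

v_uniform = policy_evaluation(uniform)
v_wait = policy_evaluation(always_wait)
print(v_uniform, v_wait)
```

Under `always_wait` every step yields reward 1, so both state values approach $1 / (1 - \gamma) = 10$, which gives a quick sanity check on the loop.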