An MDP models how an agent chooses actions to maximize reward. MDP 用来描述智能体如何选择动作以最大化回报。
In reinforcement learning, we often assume the environment can be approximated as a Markov decision process, even if the real world is noisy and partially observed. 在强化学习中,我们常常假设环境可以近似为马尔可夫决策过程,即使现实世界存在噪声并且只能部分观测。