Deriving the Bellman Equation in Reinforcement Learning
Introduction
The importance of the Bellman equation in reinforcement learning needs no introduction. This article derives it using nothing beyond high-school-level mathematics.
Return
The discounted return $G_t$ is defined as:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^\infty \gamma^k R_{t+k+1} \tag{1}$$
Note that $G_t$ describes the outcome along a single trajectory: the cumulative reward collected in the future, on that trajectory, starting from some time step $t$.
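To make the definition concrete, here is a minimal Python sketch that evaluates formula (1) for a finite trajectory; the reward list and discount value are made up purely for illustration.

```python
# Minimal sketch: the discounted return G_t of formula (1),
# computed for a finite trajectory of rewards R_{t+1}, R_{t+2}, ..., R_T.
def discounted_return(rewards, gamma=0.9):
    """Sum gamma^k * R_{t+k+1} over the remaining rewards of one trajectory."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards received after time t on one (illustrative) trajectory.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```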
Value Function
The state-value function is the expected return the agent can obtain starting from a state $s$.
$$V^\pi(s) = \Bbb E[G_t|S_t=s] \tag{2}$$
It means:
- starting from state $s$,
- acting according to policy $\pi$,
- the expected value of the future cumulative return $G_t$.
It describes the average of the returns over many trajectories.
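Read literally, definition (2) suggests a Monte Carlo estimate: sample many trajectories from $s$ under $\pi$ and average their returns. A minimal sketch, assuming a hypothetical `sample_trajectory_rewards(s)` that rolls out the policy in the environment and returns one trajectory's reward sequence:

```python
# Minimal sketch: estimate V^pi(s) as the average return over many sampled
# trajectories starting from s, per definition (2). `sample_trajectory_rewards`
# is a hypothetical sampler, not something defined in this article.
def monte_carlo_value(sample_trajectory_rewards, s, gamma=0.9, n_episodes=10_000):
    total = 0.0
    for _ in range(n_episodes):
        rewards = sample_trajectory_rewards(s)                         # one rollout from s under pi
        total += sum((gamma ** k) * r for k, r in enumerate(rewards))  # its return G_t
    return total / n_episodes                                          # empirical mean of G_t given S_t = s
```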
The state-value function is defined over all states, and an RL environment may have thousands upon thousands of them; at the start we do not know the value of any of them.
So we need a way to compute $V^\pi(s)$.
That way is the Bellman equation.
Bellman Equation
State-Value Function
The value function is usually defined recursively through the Bellman equation, which provides a recurrence for computing it.
Let us derive and understand it step by step.
Recall the definition of the state-value function:
$$V^\pi(s) = \Bbb E[G_t|S_t=s]$$
where $G_t$ can be expanded as:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma G_{t+1} \tag{3}$$
Here $G_{t+1}$ is the future return, which depends on the state $S_{t+1}$ at time $t+1$.
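As an aside, the recursion in (3) is also how returns are computed in practice: one backward pass over a finite trajectory. A minimal sketch (the reward values are illustrative):

```python
# Minimal sketch: compute the return at every step of a finite trajectory
# with the recursion G_t = R_{t+1} + gamma * G_{t+1} from formula (3).
def returns_from_recursion(rewards, gamma=0.9):
    g = 0.0
    returns = []
    for r in reversed(rewards):   # walk the trajectory backwards
        g = r + gamma * g         # G_t = R_{t+1} + gamma * G_{t+1}
        returns.append(g)
    return list(reversed(returns))

print(returns_from_recursion([1.0, 0.0, 2.0], gamma=0.9))  # [2.62, 1.8, 2.0]
```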
We obtain:
$$V^\pi(s) = \Bbb E[R_{t+1} + \gamma G_{t+1}|S_t=s] \tag{4}$$
Using the linearity of expectation, we expand further:
$$\Bbb E[R_{t+1} + \gamma G_{t+1}|S_t=s] = \Bbb E[R_{t+1}|S_t=s] + \Bbb E[\gamma G_{t+1}|S_t=s] \tag{5}$$
This gives two terms, which we derive one at a time. Start with the first.
The standard form of the reward function is:
$$R(s,a,s^\prime) \tag{6}$$
That is, the reward depends on the current state, the action taken, and the next state. Assuming the reward takes values in a finite set, $r \in \mathcal R$, the first term can be written as:
$$\Bbb E[R_{t+1}|S_t=s] = \sum_{r \in \mathcal R} r \cdot p(r|s) \tag{7}$$
By conditional marginalization (you can think of it as simply covering up the shared condition $s$), we can rewrite $p(r|s)$ in the standard form:
$$p(r|s) = \sum_{s^\prime \in \mathcal S} \sum_{a \in \mathcal A} p(s^\prime, a, r|s) \tag{8}$$
Let us spell out the conditional marginalization. Recall the definition of conditional probability:
$$p(A|B) = \frac{p(A,B)}{p(B)} \tag{S1.1}$$
That is:
$$p(A,B) = p(A|B)p(B) \tag{S1.2}$$
Expanding $p(s^\prime, a, r|s)$ by the definition of conditional probability:
$$p(s^\prime, a, r|s) = \frac{p(s^\prime, a, r, s)}{p(s)} \tag{S1.3}$$
The basic marginalization formula tells us:
$$p(r,s) = \sum_{s^\prime, a} p(s^\prime, a, r, s) \tag{S1.4}$$
And $p(r|s)$ can be written as:
$$p(r|s) = \frac{p(r,s)}{p(s)} \tag{S1.5}$$
Substituting (S1.4) into the equation above:
$$\begin{aligned} p(r|s) &= \frac{\sum_{s^\prime, a} p(s^\prime, a, r, s)}{p(s)} \\ &= \frac{\sum_{s^\prime, a} p(s^\prime, a, r|s)p(s)}{p(s)} \\ &= \sum_{s^\prime, a} p(s^\prime, a, r|s) \end{aligned} \tag{S1.6}$$
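If (S1.6) still feels abstract, a tiny numeric check makes it tangible; the conditional joint probability table below is made up purely for illustration:

```python
# Tiny numeric check of (S1.6): summing p(s', a, r | s) over s' and a recovers p(r | s).
# The conditional joint table below (for one fixed s) is made up for illustration.
joint = {  # keys: (s_prime, a, r), values: p(s', a, r | s)
    ('s1', 'left',  0.0): 0.10, ('s1', 'left',  1.0): 0.20,
    ('s2', 'left',  0.0): 0.05, ('s2', 'left',  1.0): 0.15,
    ('s1', 'right', 0.0): 0.20, ('s1', 'right', 1.0): 0.10,
    ('s2', 'right', 0.0): 0.15, ('s2', 'right', 1.0): 0.05,
}
p_r_given_s = {}
for (s_prime, a, r), p in joint.items():
    p_r_given_s[r] = p_r_given_s.get(r, 0.0) + p   # marginalize out s' and a

print(p_r_given_s)  # ~{0.0: 0.5, 1.0: 0.5}; the values sum to 1, as a conditional distribution must
```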
In (8) we can factor out the policy, $p(a|s) = \pi(a|s)$:
$$p(r|s) = \sum_{s^\prime \in \mathcal S} \sum_{a \in \mathcal A} p(s^\prime, a, r|s) = \sum_{s^\prime \in \mathcal S} \sum_{a \in \mathcal A} p(a|s)p(s^\prime, r|s,a) = \sum_{s^\prime \in \mathcal S} \sum_{a \in \mathcal A} \pi(a|s)p(s^\prime, r|s,a) \tag{9}$$
Substituting back into formula (7):
$$\Bbb E[R_{t+1}|S_t=s] = \sum_{r \in \mathcal R} r \cdot p(r|s) = \sum_{r \in \mathcal R}\sum_{s^\prime \in \mathcal S} \sum_{a \in \mathcal A} r \cdot \pi(a|s)p(s^\prime, r|s,a) \tag{10}$$
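In code, formula (10) is just a triple sum over a tabular policy and model. A minimal sketch, where `policy[s][a]` stands for $\pi(a|s)$ and `model[(s, a)]` is an assumed list of `(s_prime, r, p)` triples giving $p(s^\prime, r|s,a)$ (both names are illustrative):

```python
# Minimal sketch of formula (10): the expected immediate reward in state s under pi.
# `policy[s][a]` plays the role of pi(a|s); `model[(s, a)]` is a list of
# (s_prime, r, p) triples giving p(s', r | s, a). Both are illustrative stand-ins.
def expected_reward(s, policy, model):
    total = 0.0
    for a, pi_a in policy[s].items():
        for s_prime, r, p in model[(s, a)]:
            total += r * pi_a * p      # r * pi(a|s) * p(s', r | s, a)
    return total
```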
For the second term, $\Bbb E[\gamma G_{t+1}|S_t=s] = \gamma \Bbb E[G_{t+1}|S_t=s]$, where
$$\Bbb E[G_{t+1}|S_t=s] = \sum_g g \cdot p(g|s) \tag{11}$$
Here $g$ ranges over every value that $G_{t+1}$ can take. As with the first term, we decompose $p(g|s)$:
$$p(g|s) = \sum_{r \in \mathcal R}\sum_{s^\prime \in \mathcal S} \sum_{a \in \mathcal A} p(s^\prime, r, a, g|s) \tag{12}$$
Factor the joint term using the chain rule:
$$p(s^\prime, r, a, g|s) = p(s^\prime, r, a|s)p(g|s^\prime, r, a, s) \tag{13}$$
The factorization reads, in state $s$:
- first, take action $a$, transition to $s^\prime$, and receive reward $r$; the probability of this whole event is $p(s^\prime, r, a|s)$;
- then, the cumulative return collected after that is $g$; its conditional probability is $p(g|s^\prime, r, a, s)$.
Multiplying the two gives the probability of the whole chain:
- starting from $s$,
- passing through $a$, $s^\prime$, $r$,
- and ending with return $g$.
This factorization is what lets us derive the Bellman equation recursively below.
From formula (9) we know that $p(s^\prime, r, a|s) = \pi(a|s)p(s^\prime, r|s,a)$. Substituting this into formula (13), and plugging the result into (12), gives:
$$p(g|s) = \sum_{r \in \mathcal R}\sum_{s^\prime \in \mathcal S} \sum_{a \in \mathcal A} \pi(a|s)p(s^\prime, r|s,a)p(g|s^\prime, r, a, s) \tag{14}$$
By the Markov property, future actions and the rewards they bring depend only on the current state, so:
$$p(g|s^\prime, r, a, s) = p(g|s^\prime) \tag{15}$$
Substituting into (14) gives:
$$p(g|s) = \sum_{r \in \mathcal R}\sum_{s^\prime \in \mathcal S} \sum_{a \in \mathcal A} \pi(a|s)p(s^\prime, r|s,a)p(g|s^\prime) \tag{16}$$
Substituting this into formula (11):
$$\begin{aligned} \Bbb E[G_{t+1}|S_t=s] &= \sum_g g \cdot p(g|s) \\ &= \sum_{r \in \mathcal R}\sum_{s^\prime \in \mathcal S} \sum_{a \in \mathcal A} \pi(a|s)p(s^\prime, r|s,a)\sum_g g \cdot p(g|s^\prime) \\ &= \sum_{r \in \mathcal R}\sum_{s^\prime \in \mathcal S} \sum_{a \in \mathcal A} \pi(a|s)p(s^\prime, r|s,a)V^\pi(s^\prime) \end{aligned} \tag{17}$$
Here we used $\sum_g g \cdot p(g|s^\prime) = \Bbb E[G_{t+1}|S_{t+1}=s^\prime] = V^\pi(s^\prime)$, the expected total return starting from $s^\prime$.
Together with $\Bbb E[\gamma G_{t+1}|S_t=s] = \gamma \Bbb E[G_{t+1}|S_t=s]$, substituting the results for both terms back into formula (4):
$$\begin{aligned} V^\pi(s) &= \Bbb E[R_{t+1} + \gamma G_{t+1}|S_t=s] \\ &= \Bbb E[R_{t+1}|S_t=s] + \Bbb E[\gamma G_{t+1}|S_t=s] \\ &= \sum_{r \in \mathcal R}\sum_{s^\prime \in \mathcal S} \sum_{a \in \mathcal A} r \cdot \pi(a|s)p(s^\prime, r|s,a) + \sum_{r \in \mathcal R}\sum_{s^\prime \in \mathcal S} \sum_{a \in \mathcal A} \gamma\pi(a|s)p(s^\prime, r|s,a)V^\pi(s^\prime) \\ &= \sum_{r \in \mathcal R}\sum_{s^\prime \in \mathcal S} \sum_{a \in \mathcal A} \pi(a|s)p(s^\prime, r|s,a)\left(r + \gamma V^\pi(s^\prime)\right) \\ &= \sum_{a \in \mathcal A} \pi(a|s) \sum_{r \in \mathcal R}\sum_{s^\prime \in \mathcal S} p(s^\prime, r|s,a)\left(r + \gamma V^\pi(s^\prime)\right) \end{aligned} \tag{18}$$
This equation must be applied to every $s$ in the state space $\mathcal S$, giving one equation per state.
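Equation (18) couples the value of each state to the values of its successor states, so in the tabular case it can be solved by repeatedly sweeping the right-hand side as an update rule (iterative policy evaluation). Below is a minimal sketch under an assumed tabular MDP; `states`, `policy[s][a]` for $\pi(a|s)$, and `model[(s, a)]` as `(s_prime, r, p)` triples for $p(s^\prime, r|s,a)$ are all illustrative names, not anything defined in this article.

```python
# Minimal sketch: iterative policy evaluation, sweeping Bellman equation (18)
# over every state until the value estimates stop changing.
def policy_evaluation(states, policy, model, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = 0.0
            for a, pi_a in policy[s].items():
                for s_prime, r, p in model[(s, a)]:
                    # pi(a|s) * p(s', r | s, a) * (r + gamma * V(s'))
                    new_v += pi_a * p * (r + gamma * V[s_prime])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

# Tiny illustrative MDP: s1 moves to the absorbing state s2 with reward 1.
states = ['s1', 's2']
policy = {'s1': {'go': 1.0}, 's2': {'stay': 1.0}}
model = {('s1', 'go'): [('s2', 1.0, 1.0)],
         ('s2', 'stay'): [('s2', 0.0, 1.0)]}
print(policy_evaluation(states, policy, model))  # V(s1) ≈ 1.0, V(s2) ≈ 0.0
```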
Action-Value Function
By the same reasoning, we can derive the recursive Bellman equation for the action-value function.
$$q_\pi(s,a) = \Bbb E[G_t|S_t=s, A_t=a] \tag{19}$$
$$q_\pi(s,a) = \Bbb E[R_{t+1} + \gamma G_{t+1}|S_t=s, A_t=a] \tag{20}$$
$$\Bbb E[R_{t+1}|S_t=s, A_t=a] = \sum_{s^\prime \in \mathcal S}\sum_{r \in \mathcal R} r \cdot p(s^\prime, r|s,a) \tag{21}$$
$$\begin{aligned} \Bbb E[G_{t+1}|S_t=s, A_t=a] &= \sum_g g \cdot p(g|s,a) \\ &= \sum_{s^\prime \in \mathcal S}\sum_{r \in \mathcal R} \sum_g g \cdot p(s^\prime, r, g|s,a) \\ &= \sum_{s^\prime \in \mathcal S}\sum_{r \in \mathcal R} \sum_g g \cdot p(s^\prime, r|s,a)p(g|s^\prime, r, s, a) \\ &= \sum_{s^\prime \in \mathcal S}\sum_{r \in \mathcal R} \sum_g g \cdot p(s^\prime, r|s,a)p(g|s^\prime) \\ &= \sum_{s^\prime \in \mathcal S}\sum_{r \in \mathcal R} p(s^\prime, r|s,a) \sum_g g \cdot p(g|s^\prime) \\ &= \sum_{s^\prime \in \mathcal S}\sum_{r \in \mathcal R} p(s^\prime, r|s,a) V^\pi(s^\prime) \end{aligned} \tag{22}$$
$$\begin{aligned} q_\pi(s,a) &= \Bbb E[R_{t+1} + \gamma G_{t+1}|S_t=s, A_t=a] \\ &= \sum_{s^\prime \in \mathcal S}\sum_{r \in \mathcal R} r \cdot p(s^\prime, r|s,a) + \gamma \sum_{s^\prime \in \mathcal S}\sum_{r \in \mathcal R} p(s^\prime, r|s,a) V^\pi(s^\prime) \\ &= \sum_{s^\prime \in \mathcal S}\sum_{r \in \mathcal R} p(s^\prime, r|s,a)(r + \gamma V^\pi(s^\prime)) \end{aligned} \tag{23}$$
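Equation (23) says $q_\pi(s,a)$ is a one-step lookahead through the model followed by $V^\pi$ at the next state. Continuing the illustrative tabular layout from the policy-evaluation sketch above (`model[(s, a)]` as `(s_prime, r, p)` triples, `V` as a dict of state values):

```python
# Minimal sketch of equation (23): q_pi(s, a) as a one-step lookahead over the
# assumed tabular model followed by V_pi at the next state.
def action_value(s, a, V, model, gamma=0.9):
    return sum(p * (r + gamma * V[s_prime]) for s_prime, r, p in model[(s, a)])

# Continuing the tiny MDP above: q(s1, 'go') = 1 + 0.9 * V(s2) = 1.0.
```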