snippetMinor
How to setup the Bellman Equation as a linear system of equation
Viewed 0 times
bellmanthelinearsystemsetuphowequation
Problem
I was watching a video on Reinforcement Learning by Andrew Ng, and at about minute 23 of the video he mentions that we can represent the Bellman equation as a linear system of equations. I am talking about the Hamilton-Jacobi-Bellman equation, used for discrete control problems or discrete reinforcement learning problems.
The equation as he posts it is:
$$
V^\pi(s) = R(s) + \gamma\sum_{s
$$
$V^\pi(s)$ represents the "value" at state $s$. So the idea is simple enough. But I was not clear on how to really represent this as a system of equation. I have a notion of this, but I was hoping that someone could help fill in the correct answer.
So in the linear system, the $V^\pi(s)$ is the unknown. A linear system looks like $Ku = f$. Now, I know that in some Markov Decision Process, I have a probability of the subsequent states $s'$ given the action $a$ that is chosen. So if the policy indicates go "North", then I have an 80% chance of going north, with a 10% chance of going East, and a 10% chance of going West. That is the reason for the probability.
Now for convenience I will write $v_1, v_2$, instead of $V^\pi(s_1), V^\pi(s_2),...$. The biggest point of confusion is how to handle the immediate reward $R(v_i)$.
I would basically have a system like the following, (I left out the $\gamma$ just to simplify the notation).
$$
-v_1 + 0.8v_2 + 0.1v_3 + 0.1v_4 + 0 + 0 + ... + R(v_1) = 0 \\
0.8v_1 - v_2 + 0.1v_3 + 0 + 0.1v_5 + ... + R(v_2) = 0 \\
...
$$
So I could express this as a system with a vector of $[v_1, v_2, ..., v_n]$, and a matrix of coefficients like :
$$
\begin{bmatrix}
-1 & 0.8 & 0.1 & 0.1 & 0 & 0 & 0 & ... \\
0.8 & -1 & 0.1 & 0 & 0.1 & 0 & 0 & ... \\
\vdots \\
\end{bmatrix} *
\begin{bmatrix}
v_1 \\
v_2 \\
\vdots
\end{bmatrix} =
\begin{bmatrix}
0 \\
0 \\
\vdots \\
\end{bmatrix}
$$
Or is the reward $R(v_1) = v_1$ in which case I would have 0's on the diagonal of the coefficient matrix.
Again, Ng did not write this out comple
The equation as he posts it is:
$$
V^\pi(s) = R(s) + \gamma\sum_{s
}P_{s,\pi(s)}V^\pi(s)$$
$V^\pi(s)$ represents the "value" at state $s$. So the idea is simple enough. But I was not clear on how to really represent this as a system of equation. I have a notion of this, but I was hoping that someone could help fill in the correct answer.
So in the linear system, the $V^\pi(s)$ is the unknown. A linear system looks like $Ku = f$. Now, I know that in some Markov Decision Process, I have a probability of the subsequent states $s'$ given the action $a$ that is chosen. So if the policy indicates go "North", then I have an 80% chance of going north, with a 10% chance of going East, and a 10% chance of going West. That is the reason for the probability.
Now for convenience I will write $v_1, v_2$, instead of $V^\pi(s_1), V^\pi(s_2),...$. The biggest point of confusion is how to handle the immediate reward $R(v_i)$.
I would basically have a system like the following, (I left out the $\gamma$ just to simplify the notation).
$$
-v_1 + 0.8v_2 + 0.1v_3 + 0.1v_4 + 0 + 0 + ... + R(v_1) = 0 \\
0.8v_1 - v_2 + 0.1v_3 + 0 + 0.1v_5 + ... + R(v_2) = 0 \\
...
$$
So I could express this as a system with a vector of $[v_1, v_2, ..., v_n]$, and a matrix of coefficients like :
$$
\begin{bmatrix}
-1 & 0.8 & 0.1 & 0.1 & 0 & 0 & 0 & ... \\
0.8 & -1 & 0.1 & 0 & 0.1 & 0 & 0 & ... \\
\vdots \\
\end{bmatrix} *
\begin{bmatrix}
v_1 \\
v_2 \\
\vdots
\end{bmatrix} =
\begin{bmatrix}
0 \\
0 \\
\vdots \\
\end{bmatrix}
$$
Or is the reward $R(v_1) = v_1$ in which case I would have 0's on the diagonal of the coefficient matrix.
Again, Ng did not write this out comple
Solution
I found the full answer in a video by David Silver. The idea is easy enough.
The underlying matrix equation is
$$
v = R^\pi + \gamma P^\pi_{s, s'} v
$$
Where $v$ is the value function, $R^\pi$ is the immediate reward under policy $\pi$, $\gamma$ is the discount factor, and $P^\pi_{s, s'}$ is the transition matrix.
I can write out the system of equations as below. For the sake of convenience, I omit the $\pi$ and the $\{s, s'\}$ subscripts.
$$
\begin{bmatrix}
v_1 \\
v_2 \\
\vdots \\
v_n
\end{bmatrix}
= \begin{bmatrix}
R_1 \\
R_2 \\
\vdots \\
R_n
\end{bmatrix} + \gamma
\begin{bmatrix}
p_{11} & p_{12} & \cdots & p_{1n} \\
p_{21} & p_{22} & \cdots & p_{2n} \\
\vdots \\
p_{n1} & \cdots & \cdots & p_{nn}
\end{bmatrix}
\begin{bmatrix}
v_1 \\
v_2 \\
\vdots \\
v_n
\end{bmatrix}
$$
I can solve the matrix equation with some simple linear algebra:
$$
\begin{align*}
v &= R + \gamma P v \\
v - \gamma P v &= R \\
(I - \gamma P)v &= R \\
v &= (I - \gamma P)^{-1}R
\end{align*}
$$
Now of course this solution involves inverting a matrix and that is not easy to do except in very small cases. So we can use iterative methods to solve for $v$. But iterative methods just alter the above equation such that.
$$
v^{k+1} = R^\pi + \gamma P^\pi_{s, s'} v^{k}
$$
The underlying matrix equation is
$$
v = R^\pi + \gamma P^\pi_{s, s'} v
$$
Where $v$ is the value function, $R^\pi$ is the immediate reward under policy $\pi$, $\gamma$ is the discount factor, and $P^\pi_{s, s'}$ is the transition matrix.
I can write out the system of equations as below. For the sake of convenience, I omit the $\pi$ and the $\{s, s'\}$ subscripts.
$$
\begin{bmatrix}
v_1 \\
v_2 \\
\vdots \\
v_n
\end{bmatrix}
= \begin{bmatrix}
R_1 \\
R_2 \\
\vdots \\
R_n
\end{bmatrix} + \gamma
\begin{bmatrix}
p_{11} & p_{12} & \cdots & p_{1n} \\
p_{21} & p_{22} & \cdots & p_{2n} \\
\vdots \\
p_{n1} & \cdots & \cdots & p_{nn}
\end{bmatrix}
\begin{bmatrix}
v_1 \\
v_2 \\
\vdots \\
v_n
\end{bmatrix}
$$
I can solve the matrix equation with some simple linear algebra:
$$
\begin{align*}
v &= R + \gamma P v \\
v - \gamma P v &= R \\
(I - \gamma P)v &= R \\
v &= (I - \gamma P)^{-1}R
\end{align*}
$$
Now of course this solution involves inverting a matrix and that is not easy to do except in very small cases. So we can use iterative methods to solve for $v$. But iterative methods just alter the above equation such that.
$$
v^{k+1} = R^\pi + \gamma P^\pi_{s, s'} v^{k}
$$
Context
StackExchange Computer Science Q#142128, answer score: 3
Revisions (0)
No revisions yet.