The meaning of the discount factor in reinforcement learning
Problem
After reading about Google DeepMind's achievements on Atari games, I am trying to understand Q-learning and Q-networks, but I am a little confused. The confusion arises around the concept of the discount factor.
Brief summary of what I understand.
A deep convolutional neural network is used to estimate the optimal action-value function, that is, the expected cumulative score of taking an action in a given state. The network is trained to minimize the loss function
$$
L_i=\mathbb{E}_{s,a,r}\left[(\mathbb{E}_{s'}\left[y|s,a\right]-Q(s,a;\theta_i))^2\right]
$$
where $\mathbb{E}_{s'}\left[y|s,a\right]$ is
$$
\mathbb{E}_{s'}\left[r+\gamma \max_{a'} Q(s',a';\theta^-_i)\,\middle|\,s,a\right]
$$
Here $Q$ is a cumulative score value and $r$ is the score received for the chosen action. $s,a$ and $s',a'$ are, respectively, the state and action at time $t$ and the state and action at the next time step $t'$. $\theta^-_i$ are the network weights from a previous iteration. $\gamma$ is a discount factor that accounts for the temporal difference of the score values. The subscript $i$ indexes the training iteration.
The problem here is to understand why $\gamma$ does not depend on $\theta$.
From the mathematical point of view, $\gamma$ is the discount factor and represents the likelihood of reaching the state $s'$ from the state $s$.
I guess that the network actually learns to rescale the $Q$ according to the true value of $\gamma$, so why not let $\gamma=1$?
Solution
The discount factor does not represent the likelihood of reaching the state $s'$ from the state $s$. That would be $p(s'|s,a)$, which is not used in Q-learning, since it is model-free (only model-based reinforcement learning methods use those transition probabilities).
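To make the model-free point concrete, here is a minimal sketch of a single tabular Q-learning update in Python. It is not the DeepMind implementation: the function name `q_learning_update`, the dictionary-based table `Q`, and the default values of `alpha` and `gamma` are assumptions chosen for illustration. Note that the update only needs one sampled transition $(s, a, r, s')$; $p(s'|s,a)$ never appears, and $\gamma$ enters as a fixed constant.

```python
def q_learning_update(Q, s, a, r, s_next, actions,
                      alpha=0.1, gamma=0.99, terminal=False):
    """Apply one tabular Q-learning step for a sampled transition (s, a, r, s_next).

    Q is a dict mapping (state, action) -> value; missing entries count as 0.0.
    alpha (learning rate) and gamma (discount factor) are illustrative defaults.
    """
    # Bootstrapped estimate of future value: max over next actions.
    # No transition probabilities are used, only the observed next state.
    best_next = 0.0 if terminal else max(Q.get((s_next, b), 0.0) for b in actions)

    # Target y = r + gamma * max_a' Q(s', a'), as in the loss above.
    target = r + gamma * best_next

    # Move the current estimate a fraction alpha toward the target.
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (target - old)
    return Q[(s, a)]
```

In a deep Q-network the table lookup is replaced by a network $Q(s,a;\theta)$ and the update by a gradient step on the squared error, but the role of $\gamma$ in the target is the same.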
The discount factor $\gamma$ is a hyperparameter chosen by the user that determines how much future rewards lose their value according to how far away in time they are. In the formula you refer to, the value $y$ for your current state $s$ is the immediate reward plus what you expect to receive in the future, starting from the next state $s'$. That future term is discounted because, when $\gamma < 1$, a future reward is worth less than the same reward received right now (just like we prefer to receive \$100 now instead of \$100 tomorrow). It is up to you to choose how much to depreciate future rewards (it is problem-dependent). A discount factor of 0 means that you only care about immediate rewards. The higher your discount factor, the farther your rewards will propagate through time.
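As a small illustration of this effect, the sketch below computes the discounted return $\sum_{k} \gamma^k r_{t+k}$ for a toy episode in which the only reward arrives at the last step. The helper name `discounted_return` and the reward sequence are made up for this example.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], accumulated from the last step backwards."""
    g = 0.0
    for r in reversed(rewards):
        # Each step back in time multiplies the later rewards by gamma once more.
        g = r + gamma * g
    return g


# Toy episode (hypothetical): nine steps with no reward, then a reward of 1.
rewards = [0.0] * 9 + [1.0]

for gamma in (0.0, 0.5, 0.9, 1.0):
    # With gamma = 0 the delayed reward is invisible from the first step;
    # with gamma = 1 it counts in full, however far away it is.
    print(f"gamma={gamma}: return from the first step = {discounted_return(rewards, gamma)}")
```

This also hints at why $\gamma = 1$ is not always a safe choice: if rewards are bounded by $r_{\max}$ and $\gamma < 1$, the return is bounded by $r_{\max}/(1-\gamma)$, whereas with $\gamma = 1$ the sum can grow without bound in a task that never terminates.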
I suggest that you read the Sutton & Barto book before trying deep Q-learning, so that you learn pure reinforcement learning outside the context of neural networks, which may be what is confusing you.
Context
StackExchange Computer Science Q#44905, answer score: 6