Which matrix of Q values is being used here?
Problem
This question refers to this paper: Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
In section 2.1, equations (5) and (6), I am wondering which Q values are being used to adjust the weights of the restricted Boltzmann machine (RBM):
Option 1: the Q values generated by the original MDP
Option 2: the approximate Q values obtained by calculating the (negative of the) free energy of the RBM
Follow-up question: when considering the state-action pairs $(s^t, a^t)$ and $(s^{t+1}, a^{t+1})$, how do we determine which values of $a^t$ and $a^{t+1}$ to use? Are these the optimal actions from the original MDP, or are they obtained through the alternating Gibbs sampling mentioned later on? (The latter doesn't seem to make sense, since we would not yet have the weights required for this contrastive divergence step.)
Thanks for your help
Solution
- Reinforcement learning works by bootstrapping: Q values are updated using previous estimates of themselves. Also, an MDP does not generate Q values, only rewards (Q values are estimates of long-term accumulated future reward). So the answer is option 2: the Q values are obtained by calculating the (negative of the) free energy of the RBM. You use the RBM to compute the current Q value and combine it with the estimate from the previous step to update the weights.
- The paper uses the SARSA algorithm, which is on-policy: it learns only from actions the agent actually performs. The weights are therefore updated only at the next step $t+1$, after the action $a^{t+1}$ has been chosen. In summary, you use Gibbs sampling to select actions, and at time $t+1$ you update the weights using $s^t$, $a^t$, $s^{t+1}$, and $a^{t+1}$; see the sketches below.
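To make this concrete: writing $Q(s,a) = -F(s,a)$ for the negative free energy of the RBM, the on-policy update the answer describes has the standard SARSA temporal-difference shape (this is the general form of the rule; see the paper itself for the exact per-weight expressions in equations (5) and (6)):

$$\Delta w \propto \left( r^t + \gamma\, Q(s^{t+1}, a^{t+1}) - Q(s^t, a^t) \right) \frac{\partial Q(s^t, a^t)}{\partial w}$$

where both Q terms are the RBM's own free-energy estimates, and $a^{t+1}$ is the action actually selected (and executed) at the next step.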
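Below is a minimal numerical sketch of this loop, assuming an RBM with one layer of binary hidden units and no bias terms. The network sizes, learning rate, discount, and the Boltzmann (softmax) action selection are illustrative stand-ins, not taken from the paper; the paper selects actions by alternating Gibbs sampling over the action units, which matters when the action space is too large to enumerate.

```python
import numpy as np

# A minimal sketch, assuming an RBM with binary hidden units and no bias
# terms. Sizes and hyperparameters (n_state, n_action, n_hidden, alpha,
# gamma, temperature) are illustrative, not taken from the paper.

rng = np.random.default_rng(0)

n_state, n_action, n_hidden = 8, 4, 16
W = 0.01 * rng.standard_normal((n_state, n_hidden))   # state-to-hidden weights
U = 0.01 * rng.standard_normal((n_action, n_hidden))  # action-to-hidden weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def one_hot(i, n=n_action):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def hidden_means(s, a):
    """Factorial posterior means of the hidden units given (s, a)."""
    return sigmoid(s @ W + a @ U)

def q_value(s, a):
    """Q(s, a) approximated as the negative free energy -F(s, a)."""
    x = s @ W + a @ U
    h = sigmoid(x)
    expected_energy = -np.sum(h * x)
    eps = 1e-12  # guard the logs at h = 0 or 1
    neg_entropy = np.sum(h * np.log(h + eps) + (1.0 - h) * np.log(1.0 - h + eps))
    return -(expected_energy + neg_entropy)

def select_action(s, temperature=1.0):
    """Boltzmann exploration over -F(s, a) for every action.
    (A small-action-space stand-in for the paper's Gibbs sampling.)"""
    qs = np.array([q_value(s, one_hot(a)) for a in range(n_action)])
    p = np.exp((qs - qs.max()) / temperature)
    p /= p.sum()
    return rng.choice(n_action, p=p)

def sarsa_update(s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD update of the RBM weights (the role of eqs. (5)-(6)).
    Because the hidden means minimize the free energy, the gradient of
    Q = -F with respect to w_ik is simply s_i * <h_k>."""
    global W, U
    td_error = r + gamma * q_value(s_next, a_next) - q_value(s, a)
    h = hidden_means(s, a)
    W += alpha * td_error * np.outer(s, h)
    U += alpha * td_error * np.outer(a, h)
```

Note the on-policy ordering: $a^{t+1}$ must be drawn with `select_action` before `sarsa_update` can be applied to the transition $(s^t, a^t, r^t, s^{t+1}, a^{t+1})$.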
Context
StackExchange Computer Science Q#44633, answer score: 2
Revisions (0)
No revisions yet.