Dyna-Q in non-deterministic domains
Problem
I've implemented the Dyna-Q reinforcement learning algorithm, and it works well on a discrete, deterministic environment (the cliff-walking task). However, when I apply it to a continuous environment (mountain car) with discretized states, which introduces non-determinism, it fails every time. (Using function approximation is not in question here.)
My question is: Sutton's description of Dyna-Q explicitly mentions its applicability to deterministic domains. Are there necessary tweaks for it to work in non-deterministic domains?
Solution
One approach is outlined in Tucker Balch's Machine Learning for Trading course on Udacity:
In a nutshell: create a $T_c$ table ("T count") that counts the number of times each successor state is reached after taking a given action in a given state. Initialize all entries to a small constant such as 0.00001 to avoid division-by-zero errors. While running Q-learning, observe each real transition $(s,a,s^\prime)$ and increment $T_c[s,a,s^\prime]$.
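A minimal sketch of this bookkeeping, assuming a NumPy count table and hypothetical sizes for the discretized state and action spaces (`n_states`, `n_actions`, and the `observe` helper are illustrative names, not from the original answer):

```python
import numpy as np

n_states, n_actions = 100, 4  # assumed discretization sizes for illustration

# T-count table; the small initial value avoids division by zero later
T_c = np.full((n_states, n_actions, n_states), 1e-5)

def observe(s, a, s_prime):
    """Record one real transition (s, a, s') observed while Q-learning runs."""
    T_c[s, a, s_prime] += 1
```

Each real step of the agent calls `observe`, so the table gradually approximates the environment's transition frequencies.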
The estimated probability of each $s^\prime$ given $(s,a)$ is then:
$$ \frac{T_c[s,a,s^\prime]}{\sum_i T_c[s,a,i]}$$
- The numerator is the count of times $(s,a)$ led to that particular $s^\prime$.
- The denominator is the sum over $T_c[s,a,:]$, which normalizes each count into a probability.

(In the course, these steps appear in the "Learning T" and "Learning R" lessons.)

Dyna-Q then updates the expected reward of $(s,a)$ as the sum of the rewards in each $s^\prime$, weighted by their relative probabilities.
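The probability-weighted backup described above can be sketched as follows. This is an illustrative interpretation, not the course's exact code: `R` holds a learned expected immediate reward per $(s,a)$, and `model_probs`, `expected_update`, `alpha`, and `gamma` are hypothetical names and parameters:

```python
import numpy as np

n_states, n_actions = 100, 4              # assumed discretization sizes
T_c = np.full((n_states, n_actions, n_states), 1e-5)  # T-count table
R = np.zeros((n_states, n_actions))       # learned expected immediate reward
Q = np.zeros((n_states, n_actions))       # action-value estimates
alpha, gamma = 0.2, 0.9                   # assumed learning rate and discount

def model_probs(s, a):
    """Normalize the T-count row for (s, a) into a probability distribution."""
    counts = T_c[s, a]
    return counts / counts.sum()

def expected_update(s, a):
    """One planning step: back up the probability-weighted value of successors."""
    p = model_probs(s, a)
    # expectation over s' of max_a' Q(s', a'), weighted by the learned model
    target = R[s, a] + gamma * (p * Q.max(axis=1)).sum()
    Q[s, a] += alpha * (target - Q[s, a])
```

An alternative, closer to classic Dyna-Q planning, is to sample a single $s^\prime$ from `model_probs(s, a)` per hallucinated step rather than taking the full expectation; both converge to the same model-based backup.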
Context
StackExchange Computer Science Q#56614, answer score: 2