
Dyna-Q in non-deterministic domains

Submitted by: @import:stackexchange-cs
Tags: dyna, non-deterministic, domains

Problem

I've implemented the Dyna-Q reinforcement learning algorithm, and it works perfectly on a discrete, deterministic environment (cliff walking). However, when I apply it to a continuous environment (mountain car) with discretized states, which introduces non-determinism, it fails every time. (Using function approximation is not under consideration here.)

My question is: Sutton's description of Dyna-Q explicitly mentions its applicability to deterministic domains. Are any tweaks necessary for it to work in non-deterministic domains?

Solution

One approach is outlined in Tucker Balch's Machine Learning for Trading course on Udacity:

  • Learning T (the transition model)

  • Learning R (the reward model)

In a nutshell: create a $T_c$ table ("T count") that counts the number of times each successor state is reached after taking a given action in a given state. Initialize all entries to 0.00001 to avoid division-by-zero errors. While executing Q-learning, observe each transition $s,a,s^\prime$ and increment $T_c[s,a,s^\prime]$.
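
A minimal sketch of that counting step in Python, assuming a tabular setting; the names n_states, n_actions, and observe_transition, and the sizes used, are illustrative rather than from the course:

    import numpy as np

    n_states, n_actions = 100, 3  # illustrative sizes for a discretized mountain car

    # T_c[s, a, s'] counts how often action a taken in state s led to s';
    # the small prior value avoids division-by-zero errors when normalizing
    T_c = np.full((n_states, n_actions, n_states), 1e-5)

    def observe_transition(s, a, s_next):
        # Record a real transition (s, a) -> s' observed while running Q-learning
        T_c[s, a, s_next] += 1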

Formula for calculating the probability of each $s^\prime$ from $(s,a)$:

$$ \frac{T_c[s,a,s^\prime]}{\sum_i T_c[s,a,i]}$$

  • The numerator is the count of times $(s,a)$ led to each $s^\prime$.

  • The denominator is the sum $\sum_i T_c[s,a,i]$, which normalizes each count into a probability.
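
Continuing the sketch above, the formula is a single normalization over the last axis of $T_c$ (transition_probs is an illustrative name):

    def transition_probs(s, a):
        # Estimated probability of each successor s':
        # T_c[s, a, s'] / sum_i T_c[s, a, i]
        counts = T_c[s, a, :]
        return counts / counts.sum()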



Dyna-Q then updates a state's expected reward as a weighted sum over successors: the reward at each $s^\prime$, weighted by that successor's relative probability.
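
A sketch of that planning backup under the same assumptions. The reward table R and its exponential-smoothing update are assumptions beyond what the text above specifies, which only fixes the probability weighting:

    alpha, gamma = 0.1, 0.99  # illustrative learning rate and discount factor

    R = np.zeros((n_states, n_actions))  # learned expected immediate reward
    Q = np.zeros((n_states, n_actions))

    def observe_reward(s, a, r):
        # Smooth observed rewards into the model (an assumed update rule)
        R[s, a] = (1 - alpha) * R[s, a] + alpha * r

    def planning_update(s, a):
        # One hallucinated Dyna-Q backup: expectation over s' under the model
        p = transition_probs(s, a)                      # estimated P(s' | s, a)
        target = R[s, a] + gamma * (p @ Q.max(axis=1))  # probability-weighted backup
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

Sampling a single $s^\prime$ from transition_probs(s, a) instead of taking the full expectation is a cheaper variant of the same idea.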

Context

StackExchange Computer Science Q#56614, answer score: 2
