Dyna-Q in non-deterministic domains
Problem
I've implemented the Dyna-Q reinforcement learning algorithm, and it works well on a discrete, deterministic environment (the cliff-walking task). However, when I apply it to a continuous environment (mountain car) with discretized states, which introduces non-determinism, it fails every time. (Using function approximation is not in question here.)
My question is: Sutton's description of Dyna-Q explicitly mentions its applicability to deterministic domains. Are there necessary tweaks for it to work in non-deterministic domains?
Solution
One approach is outlined in Tucker Balch's Machine Learning for Trading course on Udacity:
In a nutshell: create a $T_c$ table ("T count") that counts the number of times each successor state is reached after taking a given action in a given state. Initialize all entries to a small constant such as 0.00001 to avoid division-by-zero errors. While running Q-learning, observe each real transition $(s,a,s^\prime)$ and increment $T_c[s,a,s^\prime]$.
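A minimal sketch of this bookkeeping, assuming a NumPy count table and hypothetical sizes for the discretized state and action spaces (`n_states`, `n_actions`, and the `observe` helper are illustrative names, not from the original answer):

```python
import numpy as np

n_states, n_actions = 100, 4  # assumed discretization sizes for illustration

# T-count table; the small initial value avoids division by zero later
T_c = np.full((n_states, n_actions, n_states), 1e-5)

def observe(s, a, s_prime):
    """Record one real transition (s, a, s') observed while Q-learning runs."""
    T_c[s, a, s_prime] += 1
```

Each real step of the agent calls `observe`, so the table gradually approximates the environment's transition frequencies.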
The estimated probability of each $s^\prime$ given $(s,a)$ is then:
$$ \frac{T_c[s,a,s^\prime]}{\sum_i T_c[s,a,i]}$$
- The numerator is the count of times $(s,a)$ led to that particular $s^\prime$.
- The denominator is the sum over $T_c[s,a,:]$, which normalizes each count into a probability.

(In the course, these steps appear in the "Learning T" and "Learning R" lessons.)

Dyna-Q then updates the expected reward of $(s,a)$ as the sum of the rewards in each $s^\prime$, weighted by their relative probabilities.
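The probability-weighted backup described above can be sketched as follows. This is an illustrative interpretation, not the course's exact code: `R` holds a learned expected immediate reward per $(s,a)$, and `model_probs`, `expected_update`, `alpha`, and `gamma` are hypothetical names and parameters:

```python
import numpy as np

n_states, n_actions = 100, 4              # assumed discretization sizes
T_c = np.full((n_states, n_actions, n_states), 1e-5)  # T-count table
R = np.zeros((n_states, n_actions))       # learned expected immediate reward
Q = np.zeros((n_states, n_actions))       # action-value estimates
alpha, gamma = 0.2, 0.9                   # assumed learning rate and discount

def model_probs(s, a):
    """Normalize the T-count row for (s, a) into a probability distribution."""
    counts = T_c[s, a]
    return counts / counts.sum()

def expected_update(s, a):
    """One planning step: back up the probability-weighted value of successors."""
    p = model_probs(s, a)
    # expectation over s' of max_a' Q(s', a'), weighted by the learned model
    target = R[s, a] + gamma * (p * Q.max(axis=1)).sum()
    Q[s, a] += alpha * (target - Q[s, a])
```

An alternative, closer to classic Dyna-Q planning, is to sample a single $s^\prime$ from `model_probs(s, a)` per hallucinated step rather than taking the full expectation; both converge to the same model-based backup.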
Context
StackExchange Computer Science Q#56614, answer score: 2