HiveBrain v1.2.0
Get Started
← Back to all entries
snippetMinor

How to implement the regret matching algorithm?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
theimplementalgorithmhowregretmatching

Problem

My question is the following: How to calculate the regret in practice?

I am trying to implement the regret matching algorithm but I do not understand how to do it.

  • First, I have $n$ players with the joint action space $\mathcal{A}=\{a_0, a_1,\cdots,a_m\}^n.$



  • Then, I fix some period $T$. The action set $A^t\in\mathcal{A}$ is the action set chosen by players at time $t$. After the period $T$ (every player has chosen an action). So I get $u_i(A^t)$.



-
Now the regret of player $i$ of not playing action $a_i$ in the past is: (here $A^t\oplus a_i$ denotes the strategy set obtained if player $i$ changed its strategy from $a'_i$ to $a_i$)
$$\max\limits_{a_i\in A_i}\left\{\dfrac{1}{T}\sum_{t\leqslant T}\left(u_i(A^t\oplus a_i )-u_i(A^t)\right)\right\}.$$
I do not understand how to calculate this summation. Why there is a max over the action $a_i\in A_i$? Should I calculate the regret of all actions in $A_i$ and calculate the maximum? Also, In Hart's paper, the maximum is $\max\{R, 0\}$. Why is there such a difference?

I mean if the regret was: $\dfrac{1}{T}\sum_{t\leqslant T}\left(u_i(A^t\oplus a_i )-u_i(A^t)\right),$
the calculation would be easy for me.

The regret is defined in the following two papers [1] (see page 4, equation (2.1c)) and [2] (see page 3, section I, subsection B).

  • A simple adaptive procedure leading to correlated equilibrium by S. Hart et al (2000)



  • Distributed algorithms for approximating wireless network capacity by Michael Dinitz (2010)



I would like to get some helps from you. Any suggestions step by step how to implement such an algorithm please?

Solution

The index set of the max operation is $A_i$, the actions of player $i$. The formula says: take each such action $a_i \in A_i$ and compute its regret (with the sub-formula you say you can implement easily), and then take the maximum of those regrets.

The reason for the $\max(R,0)$ is that actions with negative regrets are performing worse than the action currently chosen.

To implement this in code, just set a temporary variable $t$ to be 0. Now loop through the actions one by one, and for each action $a$, compute its regret $r$, and set $t$ as $\max(r,t)$. Note that this approach includes the $\max(R,0)$ operation; to do this without that, set $t$ initially to $-\infty$.

Context

StackExchange Computer Science Q#27915, answer score: 5

Revisions (0)

No revisions yet.