patternMinor

Efficient encoding of sudoku puzzles

Submitted by: @import:stackexchange-cs·Mar 10, 2026·

Viewed 0 times

combinatorics information-theory cs stackoverflow sudoku modelling

sudokuefficientpuzzlesencoding

Problem

Specifying any arbitrary 9x9 grid requires giving the position and value of each square. A naïve encoding for this might give 81 (x, y, value) triplets, requiring 4 bits for each x, y, and value (1-9 = 9 values = 4 bits) for a total of 81x4x3 = 972 bits. By numbering each square, one can reduce the positional information to 7 bits, dropping a bit for each square and a total of 891 bits. By specifying a predetermined order, one can reduce this more drastically to just the 4 bits for each value for a total of 324 bits. However, a sudoku can have missing numbers. This provides the potential for reducing the number of numbers that have to be specified, but may require additional bits for indicating positions. Using our 11-bit encoding of (position, value), we can specify a puzzle with $n$ clues with $11n$ bits, e.g. a minimal (17) puzzle requires 187 bits. The best encoding I've thought of so far is to use one bit for each space to indicate whether it's filled and, if so, the following 4 bits encode the number. This requires $81+4n$ bits, 149 for a minimal puzzle ($n=17$). Is there a more efficient encoding, preferably without a database of each valid sudoku setup? (Bonus points for addressing a general $n$ from $N \times N$ puzzle)

It just occurred to me that many puzzles will be a rotation of another, or have a simple permutation of digits. Perhaps that could help reduce the bits required.

According to Wikipedia,

The number of classic 9×9 Sudoku solution grids is 6,670,903,752,021,072,936,960 (sequence A107739 in OEIS), or approximately $6.67×10^{21}$.

If I did my math right ($\frac{ln{(6,670,903,752,021,072,936,960)}}{ln{(2)}}$), that comes out to 73 (72.498) bits of information for a lookup table.

But:

The number of essentially different solutions, when symmetries such as rotation, reflection, permutation and relabelling are taken into account, was shown to be just 5,472,730,538[15] (sequence A109741 in OEIS).

That gives 33 (32.35) bits, so it's p

Solution

Is there a more efficient encoding, preferably without a database of each valid sudoku setup?

Yes. I can think of an encoding improving your 149-bit encoding of a minimal $9\times 9$ puzzle in 6 or 9 bits, depending on a condition. This is without a database or any register of other solutions or partial boards. Here it goes:

First, you use $4$ bits to encode a number $m$ with a minimal number of appearances in the board. The next $4$ bits encode the actual number $\ell$ of times $m$ appears. The next $7\ell$ bits encode each of the positions in which $m$ appears.

The following $81-\ell$ bits are flags indicating whether the remaining positions have a number or not (you just skip the positions in which $m$ is). Whenever one of these bits is 1, then the next 3 bits indicate which number it is (in the ordered set $\{1,\ldots,9\}$ without $m$). For example, if $m=4$ and the 3 bits are 101, then the number in the corresponding position on the board is the the 5th (counting from 0) in the set $\{1,2,3,5,6,7,8,9\}$, so it is $6$. Numbers $jm$ will be encoded as $j-2$. Since we had already written $\ell$ positions, only $3(n-\ell)$ bits will be added to encode the rest of the board in this step.

Thus, the total number of bits required to encode a board using this procedure is $$B=4+4+7\ell+(81-\ell)+3(n-\ell)=89+3\ell+3n.$$

For $n=17$, we note that $\ell$ can be 0 or 1 (in general, $\ell\leq\lfloor n/9\rfloor$). Thus, $B$ can be 140 or 143 depending on whether there's a number not appearing on the board.

It's worth pointing out that Kevin's solution is way better in the general case. This encoding uses at most 149 bits only for $n\in\{17,18,19\}$, or for $n=20$ provided that $\ell=0$. At least it shows a general idea on how to take advantage of the fact that $N=9$ is very close to $2^{\lfloor\log_2N\rfloor}$ (which means we tend to "lose memory" by using 4 bits per value, since 4 bits allow us to express $N=16$ numbers as well.

Example. Consider the following board with $n=17$ clues.

.  .  .   .  .  .   .  1  .
4  .  .   .  .  .   .  .  .
.  2  .   .  .  .   .  .  .

.  .  .   .  5  .   4  .  7
.  .  8   .  .  .   3  .  .
.  .  1   .  9  .   .  .  .

3  .  .   4  .  .   2  .  .
.  5  .   1  .  .   .  .  .
.  .  .   8  .  6   .  .  .

Here, no number does not appear on the board, and numbers 6, 7 and 9 appear only once. We take $m=7$ (0111) and $\ell=1$ (0001). Reading the positions from left to right and then from top to bottom, $m$ appears in the position $36$ (0100100). Thus, our encoding begins with 011100010100100.

Next, we need seven 0s, one 1 and the 3-bit encoding of the number $1$, then a 0 followed by a 1 and the 3-bit encoding of $4$, etc. (0000000100101100). Eventually, we will skip the position where $m=7$ is, and we will encode 8 as 110 (the 6th number counting from 0 on the list ${1,2,3,4,5,6,8,9}$) and 9 as 111. The full encoding goes as follows:

// m=7, l=1 and its position on the board.
011100010100100
// Numbers 1 and 4 at the beginning. Note that 1 is encoded 000, and 4 is 011.
0000000100001011
// Numbers 2 and 5.
0000000001001000000000001100
// Numbers 4 and 8. We skip the appearance of 7 and encode 8 as 110.
010110001110
// 3, 1 and 9. 9 is encoded as 111.
00010100000100001111
// 3, 4, 2, 5, 1, 8, 6 and the last empty cells.
0000101000101100100100011000100000000000111001101000

The complete encoding is

01110001010010000000001001010110000000001001000000000001100010110001110000101000001000011110000101000101100100100011000100000000000111001101000

, and the reader can check the length of that string is indeed 143 :-)

Code Snippets

.  .  .   .  .  .   .  1  .
4  .  .   .  .  .   .  .  .
.  2  .   .  .  .   .  .  .

.  .  .   .  5  .   4  .  7
.  .  8   .  .  .   3  .  .
.  .  1   .  9  .   .  .  .

3  .  .   4  .  .   2  .  .
.  5  .   1  .  .   .  .  .
.  .  .   8  .  6   .  .  .

// m=7, l=1 and its position on the board.
011100010100100
// Numbers 1 and 4 at the beginning. Note that 1 is encoded 000, and 4 is 011.
0000000100001011
// Numbers 2 and 5.
0000000001001000000000001100
// Numbers 4 and 8. We skip the appearance of 7 and encode 8 as 110.
010110001110
// 3, 1 and 9. 9 is encoded as 111.
00010100000100001111
// 3, 4, 2, 5, 1, 8, 6 and the last empty cells.
0000101000101100100100011000100000000000111001101000

Context

StackExchange Computer Science Q#165, answer score: 6

Revisions (0)

No revisions yet.