patternMinor

I want a fast rank only shuffling algorithm for a full 52 card deck

Submitted by: @import:stackexchange-cs·Mar 10, 2026·

Viewed 0 times

wantfastfullrankdeckcardshufflingalgorithmforonly

Problem

My goal is to do some large iteration Monte Carlo simulation of different card games that require on the order of 1 billion random decks to get a reliably accurate answer. The bulk of the computer time is actually shuffling/preparing the decks, not searching for the "hit" criteria I am looking for.

I am using an string function rich interpreted language. I have bit functions available that work up to 32 bit at a time. I am only interested in the ranks of the 52 card deck, not the suits. So I am looking for a fast algorithm that will give me 4 of each of the 13 ranks for a full 52 card deck. It doesn't matter which order the suits of a particular rank come in since we are ignoring the suits (other than we are enforcing that there only be 4 of each suit for each rank).

My first attempt was to just generate a normal deck with the ranks and suits "encoded" using cards numbered 1 thru 52 with an array of flags telling which cards were already seen. Each random number generated, (from 1 to 52), was for determining the next card (if there was no collision). This algorithm worked but was kinda slow, only producing about 6000 full decks per second on a 3.06 Ghz dual core computer (using only about 50% of the CPU).

My 2nd attempt was just to tweak the first attempt by only picking the first 48 cards (since the "seen" ratio is very high for the last 4 cards and likely many collisions will occur, slowing it down), then choosing a random number from 1 to 24 to handle the permutations of the last 4 "missing" cards which are known from simply scanning the flag array and finding the 4 that are marked "false" meaning they haven't been seen yet. This was roughly 50% faster as I was able to generate about 9000 full decks per second vs. about 6000 previously.

For my 3rd attempt, I am using a 32 bit random number to place 4 or 8 cards at a time from the unshuffled deck into a built up shuffled deck.

Any help/ideas would be greatly appreciated.

Solution

I'm pretty sure there's no way to take any significant advantage of the fact that you don't care about the suits. The most effective solution is likely to just generate random permutations of the integers from 0 to 51, and map these to card ranks by dividing by 4 and rounding down (which can be done with a simple bit shift, since 4 is a power of 2).

As Apass.Jack notes in their answer, for generating the permutations you almost certainly want to use some form of the Fisher–Yates–Durstenfeld–Knuth shuffle, since it avoids the inefficiency of the sample-and-reject-duplicates method when only a few valid choices remain. Since you're generating permutations of consecutive integers, you may want to consider using the "inside-out" variant of the shuffle (probably first described by Knuth in 1994):

for i from 0 to n − 1 do
    j ← random integer such that 0 ≤ j ≤ i
    if j ≠ i
        a[i] ← a[j]
    a[j] ← i

The main advantage of this algorithm over the "standard" FYDK shuffle is that you don't need to initialize the results array a before shuffling it, since it gets dynamically initialized during the shuffle.

Nonetheless, you may want to implement both a standard FYDK shuffle and the inside-out variant and benchmark them, since their different memory access patterns can affect real-world performance in ways that are hard to predict in advance.

In any case, at this point the biggest remaining opportunities for optimization are likely to be:

using random numbers more efficiently,

using a faster random number generator, and

switching to a lower-level language.

Of these three, I'd actually recommend starting with #3; this is exactly the kind of problem where using a low-level compiled language like C or C++ (or even a mid-level language with a good JIT compiler, like Java) can provide significant performance benefits with relatively little cost in development effort.

Once you've ported your code to a low-level language, you may want to consider optimizing your choice of RNG. There are some pretty fast PRNGs out there (my personal favorite being Bob Jenkins' "flea" RNG, as reviewed e.g. here), but to properly realize their benefits, you need to also minimize the overhead cost of calling the generator (ideally, letting your compiler inline the PRNG code directly into your shuffling loop).

Finally, it's worth noting that most PRNGs generate something like 32 or more bits of (pseudo)randomness per call, whereas a Fisher-Yates shuffle of a 52 card deck only uses less than 6 bits per iteration. Thus, there could certainly be room for optimization by splitting RNG outputs into smaller pieces and reusing them for multiple iterations of the shuffling loop.

That said, I'd personally rather avoid this option, at least until I had exhausted all other possible avenues for optimization. One problem is that, even if your RNG provides perfectly random outputs, it's easy to introduce subtle bugs in the splitting process that can bias your results in ways that may not be so easy to detect. Another problem is that trying to squeeze every last drop of entropy out of your (P)RNG output is a good way to magnify any non-random biases that output may have. Personally, I would feel safer using a fast but possibly not quite perfect PRNG to generate lots of extra random bits, and converting them to random numbers in the required range in a simple and relatively foolproof way, than trying to use up every last bit of my RNG output and relying on that RNG and my conversion code having no subtle weaknesses.

Still, there are some simple optimizations you can safely make here. For example, you can trivially save one RNG call (and some other unnecessary work) per shuffle by modifying the code above to look like this:

a[0] ← 0
for i from 1 to n − 1 do
    j ← random integer such that 0 ≤ j ≤ i
    if j ≠ i
        a[i] ← a[j]
    a[j] ← i

The only difference between this code and the version I copied from Wikipedia above is that I skipped the first iteration, where we always had i = j = 0, and instead just initialized a[0] before the loop. While such little optimizations won't have much effect alone, they're pretty easy and safe and can add up.

In fact, since you said don't care about the card suits, I guess we can save a few more RNG calls:

for i from 0 to 3 do
    a[i] ← 0
for i from 4 to n − 1 do
    j ← random integer such that 0 ≤ j ≤ i
    if j ≠ i
        a[i] ← a[j]
    a[j] ← ⌊i / 4⌋

This generates the card ranks directly, by doing the division by 4 inside the initialize-and-shuffle loop, and splits the first four iterations of that loop (where all elements of the array are still identical) into a separate loop. This way, you'll only need 48 RNG calls to shuffle 52 cards. That's almost an 8% speedup (assuming that random number generation dominates the time cost), which might be significant just by itself.

Ps. FWIW, while precalculating the shuffled decks and storing them on

Code Snippets

for i from 0 to n − 1 do
    j ← random integer such that 0 ≤ j ≤ i
    if j ≠ i
        a[i] ← a[j]
    a[j] ← i

a[0] ← 0
for i from 1 to n − 1 do
    j ← random integer such that 0 ≤ j ≤ i
    if j ≠ i
        a[i] ← a[j]
    a[j] ← i

for i from 0 to 3 do
    a[i] ← 0
for i from 4 to n − 1 do
    j ← random integer such that 0 ≤ j ≤ i
    if j ≠ i
        a[i] ← a[j]
    a[j] ← ⌊i / 4⌋

// standard FYDK shuffle (descending order)
for i from n - 1 down to 1 do
    j ← random integer such that 0 ≤ j ≤ i
    if j ≠ i: swap a[i] and a[j]

// "inside-out" version (Knuth 1994)
for i from 1 up to n − 1 do
    j ← random integer such that 0 ≤ j ≤ i
    if j ≠ i: swap a[i] and a[j]

r ← random 256-bit integer (i.e. 0 ≤ r < 2^256)

// optional bias elimination via rejection sampling
while r < 2^256 mod 52!
    r ← random 256-bit integer (i.e. 0 ≤ r < 2^256)

// inside-out FYDK shuffle-and-initialize
a[0] ← 0
for i from 1 to 51 do
    j ← r mod (i + 1)
    r ← ⌊r / (i + 1)⌋
    if j ≠ i: a[i] ← a[j]
    a[j] ← i

Context

StackExchange Computer Science Q#98273, answer score: 3

Revisions (0)

No revisions yet.