patternMinor

Large Power Mod - Extreme Efficiency

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

powerlargemodefficiencyextreme

Problem

I've implemented my own 'power-mod' procedure which will compute a^b 'mod' n for some large b and even larger n. I think the way I've done it is a common procedure, and it does seem well optimised, but I really need to squeeze the maximum performance out of it as it is currently the bottleneck in my program (~75% of the CPU time).

First I will explain in words the procedure (ironically, in an imperative style), then display the code (which is quite short).

In Words

Calculating a^b mod n:

Obtain the binary expansion of b in Big Endian form. Call it bE.

For example, b = 10 becomes b = [1,0,1,0] which is the reverse of its conventional binary representation 0101. This part is well optimised (~5% of the CPU time).

-
Drop the first element of bE (essentially because this will always be 1-- I don't care about b=0 as this is trivial).

-
Start with result = 1. (i.e. a^0)

-
If bE is empty, then return result.

-
Take the next element from bE to be k.

If k is 0, then do result = result^2 'mod' n

If k is 1, then do result = result^2 * a 'mod' n

-
Go to step 4.

The Code

-- This part computes the Big Endian binary representation list of an Integer `k`. It's not a bottleneck, so I'm not so concerned about this.
binarizeE k = binarizeETab k []
  where
    binarizeETab 0 xs  = xs
    binarizeETab k xs = binarizeETab (fst kdivmod) ((snd kdivmod):xs)
      where
        kdivmod = k `divMod` 2

-- The part that I want to optimise to oblivion!
-- Doing `rem` and then ending with a final `mod` may make an incredibly minute improvement for extremely large numbers as compared with using `mod` each and every step.
largepowmod num pow modbase = (operate num opList) `mod` modbase
  where
    opList = drop 1 $ binarizeE pow
    operate k    []   = k
    operate k (0:ops) = operate (k^2         `rem` modbase) ops
    operate k (1:ops) = operate ((k^2 * num) `rem` modbase) ops

My Optimistic Goal

The background

Solution

TL;DR

Use strictness annotations at the right point to get rid of thunks:

largepowmod :: Integer -> Integer -> Integer -> Integer
largepowmod num pow modbase = (operate num opList) `mod` modbase
  where
    opList = drop 1 $ binarizeE pow
    operate !k    []   = k
    operate !k (0:ops) = operate (k^2         `rem` modbase) ops
    operate !k (1:ops) = operate ((k^2 * num) `rem` modbase) ops
--          ^

Note that this is equivalent to using foldl'.

Proper benchmarks

First of all, let's get reliable benchmarks. We can use Criterion for this:

import Criterion.Main
import Text.Printf (printf)

main :: IO()
main = defaultMain [
  bgroup "largepowmod" $ 
    flip map ([1,3..5] ++ [9999]) $ \i ->
      bench (printf "10^1000 + %d" i) $   
        let n = 10^1000 + i :: Integer
        in whnf (largepowmod 2 ((n - 1) `div` 2)) n
  ]

On my machine, I get about 16.4ms per call with -O2:

benchmarking largepowmod/10^1000 + 1
time                 16.14 ms   (16.07 ms .. 16.22 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 16.18 ms   (16.13 ms .. 16.23 ms)
std dev              124.6 μs   (99.83 μs .. 164.8 μs)

benchmarking largepowmod/10^1000 + 3
time                 16.44 ms   (16.26 ms .. 16.63 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 16.35 ms   (16.29 ms .. 16.42 ms)
std dev              168.3 μs   (135.2 μs .. 223.2 μs)

benchmarking largepowmod/10^1000 + 5
time                 16.24 ms   (16.19 ms .. 16.30 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 16.35 ms   (16.30 ms .. 16.41 ms)
std dev              138.4 μs   (89.59 μs .. 182.9 μs)

benchmarking largepowmod/10^1000 + 9999
time                 16.45 ms   (16.34 ms .. 16.58 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 16.37 ms   (16.30 ms .. 16.43 ms)
std dev              165.2 μs   (129.3 μs .. 235.4 μs)

But let's not stop here. Let's have a look at the profile:

$ ghc --make Test.hs -O2 -prof -auto-all && ./Test +RTS -s -p && head -n20 Test.prof

        Tue Sep 27 13:32 2016 Time and Allocation Profiling Report  (Final)

           Test +RTS -s -p -RTS --output benchmark.html

        total time  =       19.65 secs   (19652 ticks @ 1000 us, 1 processor)
        total alloc = 8,760,057,056 bytes  (excludes profiling overheads)

COST CENTRE         MODULE                %time %alloc

largepowmod.operate Main                   91.9   83.5
getGCStats          Criterion.Measurement   5.1   10.6
getOverhead         Criterion.Monad         2.2    1.5
applyGCStats        Criterion.Measurement   0.1    1.5

                                                                                                        individual     inherited
COST CENTRE                                  MODULE                                   no.     entries  %time %alloc   %time %alloc

MAIN                                         MAIN                                     273           0    0.0    0.0   100.0  100.0
 main                                        Main                                     550           0    0.0    0.0   100.0   99.9
  main.\                                     Main                                     745           0    0.0    0.0    91.9   83.5
   largepowmod                               Main                                     746        1208    0.0    0.0    91.9   83.5
    largepowmod.operate                      Main                                     752     4011768   91.9   83.5    91.9   83.5
    largepowmod.opList                       Main                                     751        1208    0.0    0.0     0.0    0.0

So most of the time is used in operate. It's allocating most of the data. However, that's a little bit of a red herring, since binarizeE is inlined at this point, and poor operate is merely allocating the list that will be generated lazily.

Getting stricter

However, we're not strict enough. In operate we create a large chunk of intermediate data:

operate k    []   = k
operate k (0:ops) = operate (k^2         `rem` modbase) ops
operate k (1:ops) = operate ((k^2 * num) `rem` modbase) ops
--                           ^^^^^^^^^^^^^^^^^^^^^^^^^

Haskell is lazy, therefore (k^2 * num) rem modbase isn't evaluated until it's actually needed. Therefore we allocate \$\mathcal O(\log_2 p)\$ additional terms. This isn't necessary. We can provide a small fix to make sure that k has been fully evaluated:

largepowmod' :: Integer -> Integer -> Integer -> Integer
largepowmod' num pow modbase = (operate num opList) `mod` modbase
  where
    opList = drop 1 $ binarizeE pow
    operate !k    []   = k
    operate !k (0:ops) = operate (k^2         `rem` modbase) ops
    operate !k (1:ops) = operate ((k^2 * num) `rem` modbase) ops

This needs -XBangPatterns. How does this fare? We change our main:

```
main :: IO()
main = defaultMain

Code Snippets

largepowmod :: Integer -> Integer -> Integer -> Integer
largepowmod num pow modbase = (operate num opList) `mod` modbase
  where
    opList = drop 1 $ binarizeE pow
    operate !k    []   = k
    operate !k (0:ops) = operate (k^2         `rem` modbase) ops
    operate !k (1:ops) = operate ((k^2 * num) `rem` modbase) ops
--          ^

import Criterion.Main
import Text.Printf (printf)

main :: IO()
main = defaultMain [
  bgroup "largepowmod" $ 
    flip map ([1,3..5] ++ [9999]) $ \i ->
      bench (printf "10^1000 + %d" i) $   
        let n = 10^1000 + i :: Integer
        in whnf (largepowmod 2 ((n - 1) `div` 2)) n
  ]

benchmarking largepowmod/10^1000 + 1
time                 16.14 ms   (16.07 ms .. 16.22 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 16.18 ms   (16.13 ms .. 16.23 ms)
std dev              124.6 μs   (99.83 μs .. 164.8 μs)

benchmarking largepowmod/10^1000 + 3
time                 16.44 ms   (16.26 ms .. 16.63 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 16.35 ms   (16.29 ms .. 16.42 ms)
std dev              168.3 μs   (135.2 μs .. 223.2 μs)

benchmarking largepowmod/10^1000 + 5
time                 16.24 ms   (16.19 ms .. 16.30 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 16.35 ms   (16.30 ms .. 16.41 ms)
std dev              138.4 μs   (89.59 μs .. 182.9 μs)

benchmarking largepowmod/10^1000 + 9999
time                 16.45 ms   (16.34 ms .. 16.58 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 16.37 ms   (16.30 ms .. 16.43 ms)
std dev              165.2 μs   (129.3 μs .. 235.4 μs)

$ ghc --make Test.hs -O2 -prof -auto-all && ./Test +RTS -s -p && head -n20 Test.prof

        Tue Sep 27 13:32 2016 Time and Allocation Profiling Report  (Final)

           Test +RTS -s -p -RTS --output benchmark.html

        total time  =       19.65 secs   (19652 ticks @ 1000 us, 1 processor)
        total alloc = 8,760,057,056 bytes  (excludes profiling overheads)

COST CENTRE         MODULE                %time %alloc

largepowmod.operate Main                   91.9   83.5
getGCStats          Criterion.Measurement   5.1   10.6
getOverhead         Criterion.Monad         2.2    1.5
applyGCStats        Criterion.Measurement   0.1    1.5


                                                                                                        individual     inherited
COST CENTRE                                  MODULE                                   no.     entries  %time %alloc   %time %alloc

MAIN                                         MAIN                                     273           0    0.0    0.0   100.0  100.0
 main                                        Main                                     550           0    0.0    0.0   100.0   99.9
  main.\                                     Main                                     745           0    0.0    0.0    91.9   83.5
   largepowmod                               Main                                     746        1208    0.0    0.0    91.9   83.5
    largepowmod.operate                      Main                                     752     4011768   91.9   83.5    91.9   83.5
    largepowmod.opList                       Main                                     751        1208    0.0    0.0     0.0    0.0

operate k    []   = k
operate k (0:ops) = operate (k^2         `rem` modbase) ops
operate k (1:ops) = operate ((k^2 * num) `rem` modbase) ops
--                           ^^^^^^^^^^^^^^^^^^^^^^^^^

Context

StackExchange Code Review Q#142388, answer score: 4

Revisions (0)

No revisions yet.