patternMinor
Large Power Mod - Extreme Efficiency
Viewed 0 times
powerlargemodefficiencyextreme
Problem
I've implemented my own 'power-mod' procedure which will compute
First I will explain in words the procedure (ironically, in an imperative style), then display the code (which is quite short).
In Words
Calculating
For example,
-
Drop the first element of
-
Start with
-
If
-
Take the next element from
-
Go to step 4.
The Code
My Optimistic Goal
The background
a^b 'mod' n for some large b and even larger n. I think the way I've done it is a common procedure, and it does seem well optimised, but I really need to squeeze the maximum performance out of it as it is currently the bottleneck in my program (~75% of the CPU time).First I will explain in words the procedure (ironically, in an imperative style), then display the code (which is quite short).
In Words
Calculating
a^b mod n:- Obtain the binary expansion of
bin Big Endian form. Call itbE.
For example,
b = 10 becomes b = [1,0,1,0] which is the reverse of its conventional binary representation 0101. This part is well optimised (~5% of the CPU time).-
Drop the first element of
bE (essentially because this will always be 1-- I don't care about b=0 as this is trivial).-
Start with
result = 1. (i.e. a^0)-
If
bE is empty, then return result. -
Take the next element from
bE to be k. - If
kis 0, then doresult = result^2 'mod' n
- If
kis 1, then doresult = result^2 * a 'mod' n
-
Go to step 4.
The Code
-- This part computes the Big Endian binary representation list of an Integer `k`. It's not a bottleneck, so I'm not so concerned about this.
binarizeE k = binarizeETab k []
where
binarizeETab 0 xs = xs
binarizeETab k xs = binarizeETab (fst kdivmod) ((snd kdivmod):xs)
where
kdivmod = k `divMod` 2
-- The part that I want to optimise to oblivion!
-- Doing `rem` and then ending with a final `mod` may make an incredibly minute improvement for extremely large numbers as compared with using `mod` each and every step.
largepowmod num pow modbase = (operate num opList) `mod` modbase
where
opList = drop 1 $ binarizeE pow
operate k [] = k
operate k (0:ops) = operate (k^2 `rem` modbase) ops
operate k (1:ops) = operate ((k^2 * num) `rem` modbase) opsMy Optimistic Goal
The background
Solution
TL;DR
Use strictness annotations at the right point to get rid of thunks:
Note that this is equivalent to using
Proper benchmarks
First of all, let's get reliable benchmarks. We can use Criterion for this:
On my machine, I get about 16.4ms per call with
But let's not stop here. Let's have a look at the profile:
So most of the time is used in
Getting stricter
However, we're not strict enough. In
Haskell is lazy, therefore
This needs
```
main :: IO()
main = defaultMain
Use strictness annotations at the right point to get rid of thunks:
largepowmod :: Integer -> Integer -> Integer -> Integer
largepowmod num pow modbase = (operate num opList) `mod` modbase
where
opList = drop 1 $ binarizeE pow
operate !k [] = k
operate !k (0:ops) = operate (k^2 `rem` modbase) ops
operate !k (1:ops) = operate ((k^2 * num) `rem` modbase) ops
-- ^Note that this is equivalent to using
foldl'.Proper benchmarks
First of all, let's get reliable benchmarks. We can use Criterion for this:
import Criterion.Main
import Text.Printf (printf)
main :: IO()
main = defaultMain [
bgroup "largepowmod" $
flip map ([1,3..5] ++ [9999]) $ \i ->
bench (printf "10^1000 + %d" i) $
let n = 10^1000 + i :: Integer
in whnf (largepowmod 2 ((n - 1) `div` 2)) n
]On my machine, I get about 16.4ms per call with
-O2:benchmarking largepowmod/10^1000 + 1
time 16.14 ms (16.07 ms .. 16.22 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 16.18 ms (16.13 ms .. 16.23 ms)
std dev 124.6 μs (99.83 μs .. 164.8 μs)
benchmarking largepowmod/10^1000 + 3
time 16.44 ms (16.26 ms .. 16.63 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 16.35 ms (16.29 ms .. 16.42 ms)
std dev 168.3 μs (135.2 μs .. 223.2 μs)
benchmarking largepowmod/10^1000 + 5
time 16.24 ms (16.19 ms .. 16.30 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 16.35 ms (16.30 ms .. 16.41 ms)
std dev 138.4 μs (89.59 μs .. 182.9 μs)
benchmarking largepowmod/10^1000 + 9999
time 16.45 ms (16.34 ms .. 16.58 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 16.37 ms (16.30 ms .. 16.43 ms)
std dev 165.2 μs (129.3 μs .. 235.4 μs)But let's not stop here. Let's have a look at the profile:
$ ghc --make Test.hs -O2 -prof -auto-all && ./Test +RTS -s -p && head -n20 Test.prof
Tue Sep 27 13:32 2016 Time and Allocation Profiling Report (Final)
Test +RTS -s -p -RTS --output benchmark.html
total time = 19.65 secs (19652 ticks @ 1000 us, 1 processor)
total alloc = 8,760,057,056 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
largepowmod.operate Main 91.9 83.5
getGCStats Criterion.Measurement 5.1 10.6
getOverhead Criterion.Monad 2.2 1.5
applyGCStats Criterion.Measurement 0.1 1.5
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 273 0 0.0 0.0 100.0 100.0
main Main 550 0 0.0 0.0 100.0 99.9
main.\ Main 745 0 0.0 0.0 91.9 83.5
largepowmod Main 746 1208 0.0 0.0 91.9 83.5
largepowmod.operate Main 752 4011768 91.9 83.5 91.9 83.5
largepowmod.opList Main 751 1208 0.0 0.0 0.0 0.0So most of the time is used in
operate. It's allocating most of the data. However, that's a little bit of a red herring, since binarizeE is inlined at this point, and poor operate is merely allocating the list that will be generated lazily.Getting stricter
However, we're not strict enough. In
operate we create a large chunk of intermediate data:operate k [] = k
operate k (0:ops) = operate (k^2 `rem` modbase) ops
operate k (1:ops) = operate ((k^2 * num) `rem` modbase) ops
-- ^^^^^^^^^^^^^^^^^^^^^^^^^Haskell is lazy, therefore
(k^2 * num) rem modbase isn't evaluated until it's actually needed. Therefore we allocate \$\mathcal O(\log_2 p)\$ additional terms. This isn't necessary. We can provide a small fix to make sure that k has been fully evaluated:largepowmod' :: Integer -> Integer -> Integer -> Integer
largepowmod' num pow modbase = (operate num opList) `mod` modbase
where
opList = drop 1 $ binarizeE pow
operate !k [] = k
operate !k (0:ops) = operate (k^2 `rem` modbase) ops
operate !k (1:ops) = operate ((k^2 * num) `rem` modbase) opsThis needs
-XBangPatterns. How does this fare? We change our main:```
main :: IO()
main = defaultMain
Code Snippets
largepowmod :: Integer -> Integer -> Integer -> Integer
largepowmod num pow modbase = (operate num opList) `mod` modbase
where
opList = drop 1 $ binarizeE pow
operate !k [] = k
operate !k (0:ops) = operate (k^2 `rem` modbase) ops
operate !k (1:ops) = operate ((k^2 * num) `rem` modbase) ops
-- ^import Criterion.Main
import Text.Printf (printf)
main :: IO()
main = defaultMain [
bgroup "largepowmod" $
flip map ([1,3..5] ++ [9999]) $ \i ->
bench (printf "10^1000 + %d" i) $
let n = 10^1000 + i :: Integer
in whnf (largepowmod 2 ((n - 1) `div` 2)) n
]benchmarking largepowmod/10^1000 + 1
time 16.14 ms (16.07 ms .. 16.22 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 16.18 ms (16.13 ms .. 16.23 ms)
std dev 124.6 μs (99.83 μs .. 164.8 μs)
benchmarking largepowmod/10^1000 + 3
time 16.44 ms (16.26 ms .. 16.63 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 16.35 ms (16.29 ms .. 16.42 ms)
std dev 168.3 μs (135.2 μs .. 223.2 μs)
benchmarking largepowmod/10^1000 + 5
time 16.24 ms (16.19 ms .. 16.30 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 16.35 ms (16.30 ms .. 16.41 ms)
std dev 138.4 μs (89.59 μs .. 182.9 μs)
benchmarking largepowmod/10^1000 + 9999
time 16.45 ms (16.34 ms .. 16.58 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 16.37 ms (16.30 ms .. 16.43 ms)
std dev 165.2 μs (129.3 μs .. 235.4 μs)$ ghc --make Test.hs -O2 -prof -auto-all && ./Test +RTS -s -p && head -n20 Test.prof
Tue Sep 27 13:32 2016 Time and Allocation Profiling Report (Final)
Test +RTS -s -p -RTS --output benchmark.html
total time = 19.65 secs (19652 ticks @ 1000 us, 1 processor)
total alloc = 8,760,057,056 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
largepowmod.operate Main 91.9 83.5
getGCStats Criterion.Measurement 5.1 10.6
getOverhead Criterion.Monad 2.2 1.5
applyGCStats Criterion.Measurement 0.1 1.5
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 273 0 0.0 0.0 100.0 100.0
main Main 550 0 0.0 0.0 100.0 99.9
main.\ Main 745 0 0.0 0.0 91.9 83.5
largepowmod Main 746 1208 0.0 0.0 91.9 83.5
largepowmod.operate Main 752 4011768 91.9 83.5 91.9 83.5
largepowmod.opList Main 751 1208 0.0 0.0 0.0 0.0operate k [] = k
operate k (0:ops) = operate (k^2 `rem` modbase) ops
operate k (1:ops) = operate ((k^2 * num) `rem` modbase) ops
-- ^^^^^^^^^^^^^^^^^^^^^^^^^Context
StackExchange Code Review Q#142388, answer score: 4
Revisions (0)
No revisions yet.