patternMinor
Optimizing unboxed array operations in Haskell
Viewed 0 times
operationsarrayunboxedoptimizinghaskell
Problem
Consider this simplified code which successively permutes array elements:
On my machine, with
import Data.Word
import Data.Bits
import Data.Array.Unboxed
import Data.Array.Base
import Data.Array.ST
test3 :: UArray Int Word32 -> UArray Int Word32
test3 arr = arr `seq` runSTUArray (change arr) where
change a' = do
a (Word32, Word32)
test4 (x, y) = x `seq` y `seq` (y + 1, x - 1)
apply :: Int -> (Int -> a -> a) -> a -> a
apply n f v0 = n `seq` v0 `seq` applyLoop 0 v0 where
applyLoop i v
| i == n - 1 = res
| otherwise = applyLoop (i + 1) res
where res = f i v
main :: IO ()
main = let arr = listArray (0, 1) [0, 1] :: UArray Int Word32
in print $ apply 10000000 (\_ x -> x `seq` test3 x) arrOn my machine, with
ghc 7.4.2 and -O3 flag the version with test4 (plain tuples) shows ~10 times better performance then the version with UArray (also, according to the profiler, UArray version has total allocation of ~1.5 Gb). Are there any other optimizations I could do? I'd prefer to use the array version, because it will be more convenient in the real application.Solution
Okay, first let's establish a baseline. I'm on a 32-bit system now, so the allocation figures and also the timings are somewhat different than on a 64-bit system, but the trends are mostly the same. To see what we deal with, let's not compile for profiling, but just plain with optimisations (
I have created a repository with all versions of the code I ran. I took the liberty and removed all superfluous
which changes the result if
That gives
in contrast to
from
First, the core for using
You can't beat that without short-cutting. You get a tight loop using only unboxed values in registers, it does no allocation. But note that
So, we have a lean and mean unboxed loop to aim for, and a sluggish alternative allocating tons. What is allocating so much, and taking so much time there?
The
allocating a new `
-O2).I have created a repository with all versions of the code I ran. I took the liberty and removed all superfluous
seqs, and slightly changed applyLoop, toapplyLoop i v
| i == n = v
| otherwise = applyLoop (i + 1) $ f i vwhich changes the result if
apply is initially called with a zero argument (the new one immediately returns the original argument, while the old looped until i reaches -1 after wrap-around of the counter), but for positive arguments the semantics remain unchanged. For very small positive n, the one extra comparison the new version does might make a measurable difference if f is inlined and trivial, but for nontrivial n or f, the extra comparison is negligible, and the new version avoids code duplication that happens with the old.That gives
$ ./test3 +RTS -s
array (0,1) [(0,0),(1,1)]
560,049,112 bytes allocated in the heap
32,160 bytes copied during GC
42,632 bytes maximum residency (2 sample(s))
22,904 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1083 colls, 0 par 0.02s 0.02s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.35s ( 1.35s elapsed)
GC time 0.02s ( 0.02s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.37s ( 1.37s elapsed)
%GC time 1.3% (1.3% elapsed)
Alloc rate 415,944,041 bytes per MUT second
Productivity 98.6% of total user, 98.4% of total elapsedin contrast to
$ ./test4 +RTS -s
(0,1)
47,492 bytes allocated in the heap
1,756 bytes copied during GC
42,632 bytes maximum residency (1 sample(s))
18,808 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 0 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.04s ( 0.04s elapsed)
GC time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.04s ( 0.04s elapsed)
%GC time 0.5% (0.6% elapsed)
Alloc rate 1,182,455 bytes per MUT second
Productivity 98.9% of total user, 103.1% of total elapsedfrom
test4. The difference is enormous. Well, let's look at the core to find out why, and how we can reduce it.First, the core for using
test4:Rec {
Main.main_$s$wapplyLoop [Occ=LoopBreaker]
:: GHC.Prim.Int#
-> GHC.Prim.Word#
-> GHC.Prim.Word#
-> (# GHC.Word.Word32, GHC.Word.Word32 #)
[GblId, Arity=3, Caf=NoCafRefs, Str=DmdType LLL]
Main.main_$s$wapplyLoop =
\ (sc :: GHC.Prim.Int#)
(sc1 :: GHC.Prim.Word#)
(sc2 :: GHC.Prim.Word#) ->
case sc of wild {
__DEFAULT ->
Main.main_$s$wapplyLoop
(GHC.Prim.+# wild 1)
(GHC.Prim.plusWord# sc2 (__word 1))
(GHC.Prim.minusWord# sc1 (__word 1));
10000000 -> (# GHC.Word.W32# sc1, GHC.Word.W32# sc2 #)
}
end Rec }You can't beat that without short-cutting. You get a tight loop using only unboxed values in registers, it does no allocation. But note that
seqing x and y in test4 is necessary here to aid the strictness analyser. Without it, you get a loop using boxed Word32s building up huge thunks. The seqs that originally were in apply resp. in the function (\_ x -> x seq testN x) passed to it make no difference whatsoever.So, we have a lean and mean unboxed loop to aim for, and a sluggish alternative allocating tons. What is allocating so much, and taking so much time there?
The
applyLoop worker in this case begins withcase GHC.ST.runSTRep
@ (Data.Array.Base.UArray GHC.Types.Int GHC.Word.Word32)
(\ (@ s) (s :: GHC.Prim.State# s) ->
let {
n# [Dmd=Just L] :: GHC.Prim.Int#
[LclId, Str=DmdType]
n# = GHC.Prim.sizeofByteArray# ww4 } in
case GHC.Prim.newByteArray# @ s n# s of _ { (# ipv, ipv1 #) ->
case {__pkg_ccall array-0.4.0.1 memcpy forall s.
GHC.Prim.MutableByteArray# s
-> GHC.Prim.ByteArray#
-> GHC.Prim.Word#
-> GHC.Prim.State# GHC.Prim.RealWorld
-> (# GHC.Prim.State# GHC.Prim.RealWorld, GHC.Prim.Addr# #)}allocating a new `
Code Snippets
applyLoop i v
| i == n = v
| otherwise = applyLoop (i + 1) $ f i v$ ./test3 +RTS -s
array (0,1) [(0,0),(1,1)]
560,049,112 bytes allocated in the heap
32,160 bytes copied during GC
42,632 bytes maximum residency (2 sample(s))
22,904 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1083 colls, 0 par 0.02s 0.02s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.35s ( 1.35s elapsed)
GC time 0.02s ( 0.02s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.37s ( 1.37s elapsed)
%GC time 1.3% (1.3% elapsed)
Alloc rate 415,944,041 bytes per MUT second
Productivity 98.6% of total user, 98.4% of total elapsed$ ./test4 +RTS -s
(0,1)
47,492 bytes allocated in the heap
1,756 bytes copied during GC
42,632 bytes maximum residency (1 sample(s))
18,808 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 0 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.04s ( 0.04s elapsed)
GC time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.04s ( 0.04s elapsed)
%GC time 0.5% (0.6% elapsed)
Alloc rate 1,182,455 bytes per MUT second
Productivity 98.9% of total user, 103.1% of total elapsedRec {
Main.main_$s$wapplyLoop [Occ=LoopBreaker]
:: GHC.Prim.Int#
-> GHC.Prim.Word#
-> GHC.Prim.Word#
-> (# GHC.Word.Word32, GHC.Word.Word32 #)
[GblId, Arity=3, Caf=NoCafRefs, Str=DmdType LLL]
Main.main_$s$wapplyLoop =
\ (sc :: GHC.Prim.Int#)
(sc1 :: GHC.Prim.Word#)
(sc2 :: GHC.Prim.Word#) ->
case sc of wild {
__DEFAULT ->
Main.main_$s$wapplyLoop
(GHC.Prim.+# wild 1)
(GHC.Prim.plusWord# sc2 (__word 1))
(GHC.Prim.minusWord# sc1 (__word 1));
10000000 -> (# GHC.Word.W32# sc1, GHC.Word.W32# sc2 #)
}
end Rec }case GHC.ST.runSTRep
@ (Data.Array.Base.UArray GHC.Types.Int GHC.Word.Word32)
(\ (@ s) (s :: GHC.Prim.State# s) ->
let {
n# [Dmd=Just L] :: GHC.Prim.Int#
[LclId, Str=DmdType]
n# = GHC.Prim.sizeofByteArray# ww4 } in
case GHC.Prim.newByteArray# @ s n# s of _ { (# ipv, ipv1 #) ->
case {__pkg_ccall array-0.4.0.1 memcpy forall s.
GHC.Prim.MutableByteArray# s
-> GHC.Prim.ByteArray#
-> GHC.Prim.Word#
-> GHC.Prim.State# GHC.Prim.RealWorld
-> (# GHC.Prim.State# GHC.Prim.RealWorld, GHC.Prim.Addr# #)}Context
StackExchange Code Review Q#24811, answer score: 6
Revisions (0)
No revisions yet.