HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Optimizing unboxed array operations in Haskell

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
operationsarrayunboxedoptimizinghaskell

Problem

Consider this simplified code which successively permutes array elements:

import Data.Word
import Data.Bits
import Data.Array.Unboxed
import Data.Array.Base
import Data.Array.ST

test3 :: UArray Int Word32 -> UArray Int Word32
test3 arr = arr `seq` runSTUArray (change arr) where
    change a' = do
        a  (Word32, Word32)
test4 (x, y) = x `seq` y `seq` (y + 1, x - 1)

apply :: Int -> (Int -> a -> a) -> a -> a
apply n f v0 = n `seq` v0 `seq` applyLoop 0 v0 where
    applyLoop i v
        | i == n - 1 = res
        | otherwise = applyLoop (i + 1) res
        where res = f i v

main :: IO ()
main = let arr = listArray (0, 1) [0, 1] :: UArray Int Word32
    in print $ apply 10000000 (\_ x -> x `seq` test3 x) arr


On my machine, with ghc 7.4.2 and -O3 flag the version with test4 (plain tuples) shows ~10 times better performance then the version with UArray (also, according to the profiler, UArray version has total allocation of ~1.5 Gb). Are there any other optimizations I could do? I'd prefer to use the array version, because it will be more convenient in the real application.

Solution

Okay, first let's establish a baseline. I'm on a 32-bit system now, so the allocation figures and also the timings are somewhat different than on a 64-bit system, but the trends are mostly the same. To see what we deal with, let's not compile for profiling, but just plain with optimisations (-O2).

I have created a repository with all versions of the code I ran. I took the liberty and removed all superfluous seqs, and slightly changed applyLoop, to

applyLoop i v
    | i == n    = v
    | otherwise = applyLoop (i + 1) $ f i v


which changes the result if apply is initially called with a zero argument (the new one immediately returns the original argument, while the old looped until i reaches -1 after wrap-around of the counter), but for positive arguments the semantics remain unchanged. For very small positive n, the one extra comparison the new version does might make a measurable difference if f is inlined and trivial, but for nontrivial n or f, the extra comparison is negligible, and the new version avoids code duplication that happens with the old.

That gives

$ ./test3 +RTS -s
array (0,1) [(0,0),(1,1)]
    560,049,112 bytes allocated in the heap
        32,160 bytes copied during GC
        42,632 bytes maximum residency (2 sample(s))
        22,904 bytes maximum slop
            1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
Gen  0      1083 colls,     0 par    0.02s    0.02s     0.0000s    0.0001s
Gen  1         2 colls,     0 par    0.00s    0.00s     0.0002s    0.0002s

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    1.35s  (  1.35s elapsed)
GC      time    0.02s  (  0.02s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time    1.37s  (  1.37s elapsed)

%GC     time       1.3%  (1.3% elapsed)

Alloc rate    415,944,041 bytes per MUT second

Productivity  98.6% of total user, 98.4% of total elapsed


in contrast to

$ ./test4 +RTS -s
(0,1)
        47,492 bytes allocated in the heap
        1,756 bytes copied during GC
        42,632 bytes maximum residency (1 sample(s))
        18,808 bytes maximum slop
            1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
Gen  0         0 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
Gen  1         1 colls,     0 par    0.00s    0.00s     0.0002s    0.0002s

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    0.04s  (  0.04s elapsed)
GC      time    0.00s  (  0.00s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time    0.04s  (  0.04s elapsed)

%GC     time       0.5%  (0.6% elapsed)

Alloc rate    1,182,455 bytes per MUT second

Productivity  98.9% of total user, 103.1% of total elapsed


from test4. The difference is enormous. Well, let's look at the core to find out why, and how we can reduce it.

First, the core for using test4:

Rec {
Main.main_$s$wapplyLoop [Occ=LoopBreaker]
  :: GHC.Prim.Int#
     -> GHC.Prim.Word#
     -> GHC.Prim.Word#
     -> (# GHC.Word.Word32, GHC.Word.Word32 #)
[GblId, Arity=3, Caf=NoCafRefs, Str=DmdType LLL]
Main.main_$s$wapplyLoop =
  \ (sc :: GHC.Prim.Int#)
    (sc1 :: GHC.Prim.Word#)
    (sc2 :: GHC.Prim.Word#) ->
    case sc of wild {
      __DEFAULT ->
        Main.main_$s$wapplyLoop
          (GHC.Prim.+# wild 1)
          (GHC.Prim.plusWord# sc2 (__word 1))
          (GHC.Prim.minusWord# sc1 (__word 1));
      10000000 -> (# GHC.Word.W32# sc1, GHC.Word.W32# sc2 #)
    }
end Rec }


You can't beat that without short-cutting. You get a tight loop using only unboxed values in registers, it does no allocation. But note that seqing x and y in test4 is necessary here to aid the strictness analyser. Without it, you get a loop using boxed Word32s building up huge thunks. The seqs that originally were in apply resp. in the function (\_ x -> x seq testN x) passed to it make no difference whatsoever.

So, we have a lean and mean unboxed loop to aim for, and a sluggish alternative allocating tons. What is allocating so much, and taking so much time there?

The applyLoop worker in this case begins with

case GHC.ST.runSTRep
       @ (Data.Array.Base.UArray GHC.Types.Int GHC.Word.Word32)
       (\ (@ s) (s :: GHC.Prim.State# s) ->
          let {
            n# [Dmd=Just L] :: GHC.Prim.Int#
            [LclId, Str=DmdType]
            n# = GHC.Prim.sizeofByteArray# ww4 } in
          case GHC.Prim.newByteArray# @ s n# s of _ { (# ipv, ipv1 #) ->
          case {__pkg_ccall array-0.4.0.1 memcpy forall s.
                              GHC.Prim.MutableByteArray# s
                              -> GHC.Prim.ByteArray#
                              -> GHC.Prim.Word#
                              -> GHC.Prim.State# GHC.Prim.RealWorld
                              -> (# GHC.Prim.State# GHC.Prim.RealWorld, GHC.Prim.Addr# #)}


allocating a new `

Code Snippets

applyLoop i v
    | i == n    = v
    | otherwise = applyLoop (i + 1) $ f i v
$ ./test3 +RTS -s
array (0,1) [(0,0),(1,1)]
    560,049,112 bytes allocated in the heap
        32,160 bytes copied during GC
        42,632 bytes maximum residency (2 sample(s))
        22,904 bytes maximum slop
            1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
Gen  0      1083 colls,     0 par    0.02s    0.02s     0.0000s    0.0001s
Gen  1         2 colls,     0 par    0.00s    0.00s     0.0002s    0.0002s

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    1.35s  (  1.35s elapsed)
GC      time    0.02s  (  0.02s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time    1.37s  (  1.37s elapsed)

%GC     time       1.3%  (1.3% elapsed)

Alloc rate    415,944,041 bytes per MUT second

Productivity  98.6% of total user, 98.4% of total elapsed
$ ./test4 +RTS -s
(0,1)
        47,492 bytes allocated in the heap
        1,756 bytes copied during GC
        42,632 bytes maximum residency (1 sample(s))
        18,808 bytes maximum slop
            1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
Gen  0         0 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
Gen  1         1 colls,     0 par    0.00s    0.00s     0.0002s    0.0002s

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    0.04s  (  0.04s elapsed)
GC      time    0.00s  (  0.00s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time    0.04s  (  0.04s elapsed)

%GC     time       0.5%  (0.6% elapsed)

Alloc rate    1,182,455 bytes per MUT second

Productivity  98.9% of total user, 103.1% of total elapsed
Rec {
Main.main_$s$wapplyLoop [Occ=LoopBreaker]
  :: GHC.Prim.Int#
     -> GHC.Prim.Word#
     -> GHC.Prim.Word#
     -> (# GHC.Word.Word32, GHC.Word.Word32 #)
[GblId, Arity=3, Caf=NoCafRefs, Str=DmdType LLL]
Main.main_$s$wapplyLoop =
  \ (sc :: GHC.Prim.Int#)
    (sc1 :: GHC.Prim.Word#)
    (sc2 :: GHC.Prim.Word#) ->
    case sc of wild {
      __DEFAULT ->
        Main.main_$s$wapplyLoop
          (GHC.Prim.+# wild 1)
          (GHC.Prim.plusWord# sc2 (__word 1))
          (GHC.Prim.minusWord# sc1 (__word 1));
      10000000 -> (# GHC.Word.W32# sc1, GHC.Word.W32# sc2 #)
    }
end Rec }
case GHC.ST.runSTRep
       @ (Data.Array.Base.UArray GHC.Types.Int GHC.Word.Word32)
       (\ (@ s) (s :: GHC.Prim.State# s) ->
          let {
            n# [Dmd=Just L] :: GHC.Prim.Int#
            [LclId, Str=DmdType]
            n# = GHC.Prim.sizeofByteArray# ww4 } in
          case GHC.Prim.newByteArray# @ s n# s of _ { (# ipv, ipv1 #) ->
          case {__pkg_ccall array-0.4.0.1 memcpy forall s.
                              GHC.Prim.MutableByteArray# s
                              -> GHC.Prim.ByteArray#
                              -> GHC.Prim.Word#
                              -> GHC.Prim.State# GHC.Prim.RealWorld
                              -> (# GHC.Prim.State# GHC.Prim.RealWorld, GHC.Prim.Addr# #)}

Context

StackExchange Code Review Q#24811, answer score: 6

Revisions (0)

No revisions yet.