Optimizing ByteString escaping
Problem
I wrote a string escaping function in C, and I'm trying to rewrite it in Haskell. The C version, along with comments explaining why it does what it does, can be found on GitHub.
Here's a naïve implementation based on Data.ByteString.concatMap:

{-# LANGUAGE ViewPatterns #-}

import Data.Bits
import Data.ByteString (ByteString)
import Data.ByteString.Char8 ()
import Data.ByteString.Internal (c2w, w2c)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L

escapeCopyBytea :: ByteString -> ByteString
escapeCopyBytea = B.concatMap f
  where
    f (w2c -> '\\') = B.replicate 4 (c2w '\\')
    f c | c >= 32 && c <= 126 = B.singleton c
        | otherwise
        -- non-printable bytes become "\\" plus three octal digits
        = B.pack [ c2w '\\', c2w '\\'
                 , 48 + ((c `shiftR` 6) .&. 0x7)
                 , 48 + ((c `shiftR` 3) .&. 0x7)
                 , 48 + (c .&. 0x7) ]

mapChunks :: (ByteString -> ByteString) -> L.ByteString -> L.ByteString
mapChunks f = L.fromChunks . map f . L.toChunks

main :: IO ()
main = L.getContents >>= L.putStr . mapChunks escapeCopyBytea

I'd expect this to be a few times slower than the C version. Nope, it is 125 times slower than the C version.
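For reference, the per-byte rule can be checked in isolation. This is a sketch of the same three escaping cases, assuming PostgreSQL's COPY bytea escape format; the helper name escapeByte is introduced here for illustration and is not part of the original code:

```haskell
-- Standalone sketch of the per-byte escaping rule, assuming the COPY
-- bytea escape format: a backslash becomes four backslashes, printable
-- ASCII passes through, and every other byte becomes "\\" followed by
-- three octal digits.  `escapeByte` is a hypothetical helper name.
import Data.Bits (shiftR, (.&.))
import Data.Word (Word8)
import qualified Data.ByteString as B

escapeByte :: Word8 -> B.ByteString
escapeByte 92 = B.replicate 4 92            -- '\\' doubles twice
escapeByte c
  | c >= 32 && c <= 126 = B.singleton c     -- printable ASCII as-is
  | otherwise           = B.pack [92, 92, oct 6, oct 3, oct 0]
  where
    oct n = 48 + ((c `shiftR` n) .&. 0x7)   -- one octal digit of c
```

For example, a newline (byte 10) escapes to the five characters \\\\012.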
Then, I tried using blaze-builder:
import Blaze.ByteString.Builder
import Data.Bits
import Data.Monoid (mappend, mconcat, mempty)
import Data.ByteString (ByteString)
import Data.Word (Word8)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L

writeBackslash :: Write
writeBackslash = writeWord8 92

escape1 :: Word8 -> Builder
escape1 92 = fromWrite $ writeBackslash
               `mappend` writeBackslash
               `mappend` writeBackslash
               `mappend` writeBackslash
escape1 c
  | c >= 32 && c <= 126 = fromWrite (writeWord8 c)
  | otherwise = fromWrite $ writeBackslash
                  `mappend` writeBackslash
                  `mappend` writeWord8 (48 + ((c `shiftR` 6) .&. 0x7))
                  `mappend` writeWord8 (48 + ((c `shiftR` 3) .&. 0x7))
                  `mappend` writeWord8 (48 + (c .&. 0x7))

escapeCopyBytea2 :: ByteString -> Builder
escapeCopyBytea2 = B.foldl' f mempty
  where
    f b c = b `mappend` escape1 c

main :: IO ()
main = L.getContents >>= L.putStr . toLazyByteString . mconcat
         . map escapeCopyBytea2 . L.toChunks

This made a difference, just not the one I hoped for: it's even slower, 300 times slower than the C version. I thought blaze-builder was supposed to be really fast!
Simply summing bytes, by folding over the input in a similar fashion, does not suffer from this slowdown.
Solution
Whew, this is quite a tricky one. The main problem here is that you really, really need to keep Haskell from allocating anything in the per-byte inner loop. Otherwise, you are looking at a dozen bytes going through the garbage collector for every byte you escape.
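The closure build-up can be seen in miniature with an ordinary left fold whose accumulator is itself a function: each step allocates a fresh closure wrapping the previous accumulator, and no output is produced until the whole input has been walked. This is a simplified stand-in for the Builder case (plain lists and functions, not the actual blaze-builder types):

```haskell
import Data.List (foldl')
import Data.Word (Word8)

-- The accumulator is a function [Word8] -> [Word8] (a difference
-- list), playing the role of Builder: folding from the left wraps it
-- in one new closure per input byte before anything can run.
escapeAll :: (Word8 -> [Word8] -> [Word8]) -> [Word8] -> [Word8]
escapeAll f = ($ []) . foldl' (\acc w -> acc . f w) id
```

Even with foldl', only the outermost composition is forced to weak head normal form, so the chain of closures survives until the result is finally applied to []; for instance, escapeAll (\w -> (w:) . (w:)) [1,2] yields [1,1,2,2].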
As you note, blaze-builder has a reputation for being fast, mainly for being pretty good at this. Let's start off with an explanation why that is. Here's roughly how a Builder is defined (see Blaze.ByteString.Builder.Internal.Types):

newtype Builder = Builder (forall r. BuildStep r -> BuildStep r)
newtype BuildStep a = BuildStep { runBuildStep :: BufRange -> IO (BuildSignal a) }
data BufRange = BufRange !(Ptr Word8) !(Ptr Word8)

The compiler will replace all newtype wrappers with simple "casts", and un-box BufRange where possible. So this means that, say, your escape1 function would actually have an after-optimization type closer to:

escape1 :: Word8 -> (BufRange -> IO (BuildSignal a)) -> BufRange -> IO (BuildSignal a)

So this actually just takes the pointers from BufRange, writes data using the IO monad, then finally calls the continuation. Ideally, we even know the continuation in question and can jump directly to its code. At no point does this require heap allocation; everything gets nicely optimized out.

So, what's the problem in your case? The trouble is the "glue code". Let us look at the source of ByteString's foldl', with some additional type annotations:

lgo :: Builder -> Addr# -> Addr# -> IO Builder
lgo !z !p !q | p == q    = return z
             | otherwise = do c <- peek p
                              lgo (f z c) (p `plusPtr` 1) q

For every byte, this worker loop forces f in order to get the new "accumulated value". Unfortunately, that's a pretty bad idea here, as our accumulator is a Builder - a closure (as we saw above). Therefore this will actually allocate a huge list of closures, but not start producing anything of value until the very end.

So, how to improve this? Remember again that our f has type ... -> BuildStep r -> BuildStep r somewhere deep down. We would actually like to pass lgo as the BuildStep continuation, so after inlining it becomes a simple jump. We also don't really need the accumulation parameter, as the Builder is taking care of accumulating its result by itself. Going through with this, we actually get a lazy right fold:

lgo :: Addr# -> Addr# -> BuildStep r -> BuildStep r
lgo !p !q | p == q    = v
          | otherwise = f (inlinePerformIO (peek p)) (lgo (p `plusPtr` 1) q)

(Note I also removed the IO by using inlinePerformIO, which might be unsafe in other usage scenarios, as any addresses escaping due to lazy evaluation might point to a garbage-collected buffer later!)

The type signature looks a lot better now - and from my testing this actually performs roughly twice as fast as the original version. Where does the rest of the performance get lost? Well, a look at the Core shows code like

lgo = \p q -> case p == q of
    True  -> \r -> let ... in \stp -> ...
    False -> ...

This means that GHC is generating a lot of partially-applied functions - primarily because the optimizer doesn't want to lose sharing in case we want to call lgo with the same p and q, but different BuildSteps (which we never will, but GHC can't prove that).

So we need to force GHC to use higher arity for this function's implementation. Unfortunately, I am somewhat out of ideas how to accomplish that in an elegant way. So here's the manual solution -- writing our lambdas out at the top of the expression:

foldBuilder :: (Word8 -> Builder) -> B.ByteString -> Builder
foldBuilder f (PS x s l) =
    fromBuildStepCont $ \cont range ->
        withForeignPtr x $ \ptr -> do
            let lgo !p !q !range'
                    | p == q    = cont range'
                    | otherwise = do
                        c <- peek p
                        let p'   = p `plusPtr` 1
                            step = unBuilder (f c) (BuildStep $ lgo p' q)
                        runBuildStep step range'
            lgo (ptr `plusPtr` s) (ptr `plusPtr` (s + l)) range

I get decent performance out of this version here, and I suppose it's slightly better than implementing it completely by hand. It also happens to not require inlinePerformIO at all any more, which is a nice plus.

Code can be found at GitHub. I hope this helps in some way :)
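To make the continuation-passing shape concrete without depending on blaze-builder, here is a self-contained miniature of the same idea. All the names here are illustrative, and unlike blaze-builder there is no bounds checking or buffer growth, so the caller must supply a large enough scratch buffer:

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (Ptr, castPtr, minusPtr, plusPtr)
import Foreign.Storable (poke)
import qualified Data.ByteString as B

-- A build step receives the current write pointer and returns the
-- pointer past whatever it wrote; a "builder" transforms its
-- continuation, just like blaze-builder's BuildStep r -> BuildStep r.
type Step = Ptr Word8 -> IO (Ptr Word8)

-- Emit one byte, then jump to the continuation.
word8 :: Word8 -> Step -> Step
word8 w k p = poke p w >> k (p `plusPtr` 1)

-- The right fold from above, in miniature: each byte's builder gets
-- the rest of the loop as its continuation, so there is no
-- accumulator to force.
foldSteps :: (Word8 -> Step -> Step) -> B.ByteString -> Step -> Step
foldSteps f bs k = B.foldr f k bs

-- Run a builder into a fixed-size scratch buffer (no bounds checks!).
runSteps :: Int -> (Step -> Step) -> IO B.ByteString
runSteps cap b = allocaBytes cap $ \buf -> do
    end <- b pure buf                  -- 'pure' is the final step
    B.packCStringLen (castPtr buf, end `minusPtr` buf)
```

For instance, runSteps 16 (foldSteps word8 (B.pack [104, 105])) copies the input through unchanged; an escaping function would simply pattern match in the Word8 -> Step -> Step argument, the way escape1 does on Builders.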
Context
StackExchange Code Review Q#9998, answer score: 4