Optimizing ByteString escaping
Problem
I wrote a string escaping function in C, and I'm trying to rewrite it in Haskell. The C version, along with comments explaining why it does what it does, can be found on GitHub.
Here's a naïve implementation based on Data.ByteString.concatMap:

{-# LANGUAGE ViewPatterns #-}

import Data.Bits
import Data.ByteString (ByteString)
import Data.ByteString.Char8 ()
import Data.ByteString.Internal (c2w, w2c)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L

escapeCopyBytea :: ByteString -> ByteString
escapeCopyBytea = B.concatMap f
  where
    f (w2c -> '\\') = B.replicate 4 (c2w '\\')
    f c | c >= 32 && c <= 126 = B.singleton c
        | otherwise
        -- non-printable bytes become "\\" plus three octal digits
        = B.pack [ c2w '\\', c2w '\\'
                 , 48 + ((c `shiftR` 6) .&. 0x7)
                 , 48 + ((c `shiftR` 3) .&. 0x7)
                 , 48 + (c .&. 0x7) ]

mapChunks :: (ByteString -> ByteString) -> L.ByteString -> L.ByteString
mapChunks f = L.fromChunks . map f . L.toChunks

main :: IO ()
main = L.getContents >>= L.putStr . mapChunks escapeCopyBytea

I'd expect this to be a few times slower than the C version. Nope, it is 125 times slower than the C version.
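For reference, the per-byte rule can be checked in isolation. This is a sketch of the same three escaping cases, assuming PostgreSQL's COPY bytea escape format; the helper name escapeByte is introduced here for illustration and is not part of the original code:

```haskell
-- Standalone sketch of the per-byte escaping rule, assuming the COPY
-- bytea escape format: a backslash becomes four backslashes, printable
-- ASCII passes through, and every other byte becomes "\\" followed by
-- three octal digits.  `escapeByte` is a hypothetical helper name.
import Data.Bits (shiftR, (.&.))
import Data.Word (Word8)
import qualified Data.ByteString as B

escapeByte :: Word8 -> B.ByteString
escapeByte 92 = B.replicate 4 92            -- '\\' doubles twice
escapeByte c
  | c >= 32 && c <= 126 = B.singleton c     -- printable ASCII as-is
  | otherwise           = B.pack [92, 92, oct 6, oct 3, oct 0]
  where
    oct n = 48 + ((c `shiftR` n) .&. 0x7)   -- one octal digit of c
```

For example, a newline (byte 10) escapes to the five characters \\\\012.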
Then, I tried using blaze-builder:
import Blaze.ByteString.Builder
import Data.Bits
import Data.Monoid (mappend, mconcat, mempty)
import Data.ByteString (ByteString)
import Data.Word (Word8)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L

writeBackslash :: Write
writeBackslash = writeWord8 92

escape1 :: Word8 -> Builder
escape1 92 = fromWrite $ writeBackslash
               `mappend` writeBackslash
               `mappend` writeBackslash
               `mappend` writeBackslash
escape1 c
  | c >= 32 && c <= 126 = fromWrite (writeWord8 c)
  | otherwise = fromWrite $ writeBackslash
                  `mappend` writeBackslash
                  `mappend` writeWord8 (48 + ((c `shiftR` 6) .&. 0x7))
                  `mappend` writeWord8 (48 + ((c `shiftR` 3) .&. 0x7))
                  `mappend` writeWord8 (48 + (c .&. 0x7))

escapeCopyBytea2 :: ByteString -> Builder
escapeCopyBytea2 = B.foldl' f mempty
  where
    f b c = b `mappend` escape1 c

main :: IO ()
main = L.getContents >>= L.putStr . toLazyByteString . mconcat
         . map escapeCopyBytea2 . L.toChunks

This made a difference, just not the one I hoped for: it's even slower, 300 times slower than the C version. I thought blaze-builder was supposed to be really fast!
Simply summing bytes, by folding over the input in a similar fashion, does not suffer from this slowdown.
Solution
Whew, this is quite a tricky one. The main problem here is that you really, really need to keep Haskell from allocating anything in the per-byte inner loop. Otherwise, you are looking at a dozen bytes going through the garbage collector for every byte you escape.
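The closure build-up can be seen in miniature with an ordinary left fold whose accumulator is itself a function: each step allocates a fresh closure wrapping the previous accumulator, and no output is produced until the whole input has been walked. This is a simplified stand-in for the Builder case (plain lists and functions, not the actual blaze-builder types):

```haskell
import Data.List (foldl')
import Data.Word (Word8)

-- The accumulator is a function [Word8] -> [Word8] (a difference
-- list), playing the role of Builder: folding from the left wraps it
-- in one new closure per input byte before anything can run.
escapeAll :: (Word8 -> [Word8] -> [Word8]) -> [Word8] -> [Word8]
escapeAll f = ($ []) . foldl' (\acc w -> acc . f w) id
```

Even with foldl', only the outermost composition is forced to weak head normal form, so the chain of closures survives until the result is finally applied to []; for instance, escapeAll (\w -> (w:) . (w:)) [1,2] yields [1,1,2,2].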
As you note, blaze-builder has a reputation for being fast, mainly for being pretty good at this. Let's start off with an explanation why that is. Here's roughly how a Builder is defined (see Blaze.ByteString.Builder.Internal.Types):

newtype Builder = Builder (forall r. BuildStep r -> BuildStep r)
newtype BuildStep a = BuildStep { runBuildStep :: BufRange -> IO (BuildSignal a) }
data BufRange = BufRange !(Ptr Word8) !(Ptr Word8)

The compiler will replace all newtype wrappers with simple "casts", and un-box BufRange where possible. So this means that, say, your escape1 function would actually have an after-optimization type closer to:

escape1 :: Word8 -> (BufRange -> IO (BuildSignal a)) -> BufRange -> IO (BuildSignal a)

So this actually just takes the pointers from BufRange, writes data using the IO monad, then finally calls the continuation. Ideally, we even know the continuation in question and can jump directly to its code. At no point does this require heap allocation; everything gets nicely optimized out.

So, what's the problem in your case? The trouble is the "glue code". Let us look at the source of ByteString's foldl', with some additional type annotations:

lgo :: Builder -> Addr# -> Addr# -> IO Builder
lgo !z !p !q | p == q    = return z
             | otherwise = do c <- peek p
                              lgo (f z c) (p `plusPtr` 1) q

For every byte, this worker loop forces f in order to get the new "accumulated value". Unfortunately, that's a pretty bad idea here, as our accumulator is a Builder - a closure (as we saw above). Therefore this will actually allocate a huge list of closures, but not start producing anything of value until the very end.

So, how to improve this? Remember again that our f has type ... -> BuildStep r -> BuildStep r somewhere deep down. We would actually like to pass lgo as the BuildStep continuation, so after inlining it becomes a simple jump. We also don't really need the accumulation parameter, as the Builder is taking care of accumulating its result by itself. Going through with this, we actually get a lazy right fold:

lgo :: Addr# -> Addr# -> BuildStep r -> BuildStep r
lgo !p !q | p == q    = v
          | otherwise = f (inlinePerformIO (peek p)) (lgo (p `plusPtr` 1) q)

(Note I also removed the IO by using inlinePerformIO, which might be unsafe in other usage scenarios, as any addresses escaping due to lazy evaluation might point to a garbage-collected buffer later!)

The type signature looks a lot better now - and from my testing this actually performs roughly twice as fast as the original version. Where does the rest of the performance get lost? Well, a look at the Core shows code like

lgo = \p q -> case p == q of
    True  -> \r -> let ... in \stp -> ...
    False -> ...

This means that GHC is generating a lot of partially-applied functions - primarily because the optimizer doesn't want to lose sharing in case we want to call lgo with the same p and q, but different BuildSteps (which we never will, but GHC can't prove that).

So we need to force GHC to use higher arity for this function's implementation. Unfortunately, I am somewhat out of ideas how to accomplish that in an elegant way. So here's the manual solution -- writing our lambdas out at the top of the expression:

foldBuilder :: (Word8 -> Builder) -> B.ByteString -> Builder
foldBuilder f (PS x s l) =
    fromBuildStepCont $ \cont range ->
        withForeignPtr x $ \ptr -> do
            let lgo !p !q !range'
                    | p == q    = cont range'
                    | otherwise = do
                        c <- peek p
                        let p'   = p `plusPtr` 1
                            step = unBuilder (f c) (BuildStep $ lgo p' q)
                        runBuildStep step range'
            lgo (ptr `plusPtr` s) (ptr `plusPtr` (s + l)) range

I get decent performance out of this version here, and I suppose it's slightly better than implementing it completely by hand. It also happens to not require inlinePerformIO at all any more, which is a nice plus.

Code can be found at GitHub. I hope this helps in some way :)
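To make the continuation-passing shape concrete without depending on blaze-builder, here is a self-contained miniature of the same idea. All the names here are illustrative, and unlike blaze-builder there is no bounds checking or buffer growth, so the caller must supply a large enough scratch buffer:

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (Ptr, castPtr, minusPtr, plusPtr)
import Foreign.Storable (poke)
import qualified Data.ByteString as B

-- A build step receives the current write pointer and returns the
-- pointer past whatever it wrote; a "builder" transforms its
-- continuation, just like blaze-builder's BuildStep r -> BuildStep r.
type Step = Ptr Word8 -> IO (Ptr Word8)

-- Emit one byte, then jump to the continuation.
word8 :: Word8 -> Step -> Step
word8 w k p = poke p w >> k (p `plusPtr` 1)

-- The right fold from above, in miniature: each byte's builder gets
-- the rest of the loop as its continuation, so there is no
-- accumulator to force.
foldSteps :: (Word8 -> Step -> Step) -> B.ByteString -> Step -> Step
foldSteps f bs k = B.foldr f k bs

-- Run a builder into a fixed-size scratch buffer (no bounds checks!).
runSteps :: Int -> (Step -> Step) -> IO B.ByteString
runSteps cap b = allocaBytes cap $ \buf -> do
    end <- b pure buf                  -- 'pure' is the final step
    B.packCStringLen (castPtr buf, end `minusPtr` buf)
```

For instance, runSteps 16 (foldSteps word8 (B.pack [104, 105])) copies the input through unchanged; an escaping function would simply pattern match in the Word8 -> Step -> Step argument, the way escape1 does on Builders.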
Context
StackExchange Code Review Q#9998, answer score: 4