HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Parsing and plotting complex numbers

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
plottingnumbersparsingandcomplex

Problem

I am trying to write a Haskell application that parses a bunch of complex numbers from a Mathematica output and then produces a PPM image.

The end product looks like this:

The code to create this is as follows:

```
module Main where

import PPM
import System.IO
import Control.Applicative as A
import Control.Monad.ST
import qualified Data.Vector as V
import qualified Data.Vector.Mutable as M
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as LB
import qualified Data.Attoparsec.ByteString.Char8 as P
import qualified Data.Attoparsec.ByteString.Lazy as LP

apply :: a -> (a -> b) -> b
apply a = \f -> f a

---------- PARSING ----------

iT = B.pack "*I"

data Complex = Complex Double Double
deriving (Show)

cAdd:: Complex -> [Complex] -> Complex
cAdd (Complex a b) [(Complex c d)] = Complex (a + c) (b + d)
cAdd c _ = c
cConv :: Complex -> Complex
cConv (Complex a b) = Complex b a

parseTerm :: P.Parser Complex
parseTerm = parseDouble parseConv
where
parseDouble = liftA (\x -> apply (Complex x 0)) $
liftA2 (\x y -> read ((x:y) ++ "0")) P.anyChar $ many $ P.satisfy $ \x -> P.isDigit x || x == '.'
parseConv = P.option id $ liftA (const cConv) $ P.string iT

test :: P.Parser String
test = liftA2 (\x y -> (x:y)) P.anyChar $ many $ P.satisfy $ \x -> P.isDigit x || x == '.'

parseComplex :: P.Parser Complex
parseComplex = liftA2 cAdd parseTerm $ many $ many (P.char '+') A.*> parseTerm

---------- IMAGE PROCESSING ----------

bounds = 2.0 :: Double
imgSize = 400 :: Int

intensity = 300.0 :: Double
falloff = 0.8 :: Double
(r, g, b) = (255.0, 76.5, 25.5) :: (Double, Double, Double)

computeIndex :: Double -> Int
computeIndex x = (+) 1 $ floor $ (x + bounds) / (2 * bounds / fromIntegral imgSize)

convCoord ::(Int, Int) -> Int
convCoord (h, v) = v * imgSize + h

computeCIndex :: Complex -> Int
computeCIndex (Complex r i) = convCoord (computeIndex r, computeIndex i)

genVec :: [Complex] -> V.Vector Int
gen

Solution

The program consists of a pipeline of operations:

  • breaking up the input file into lines



  • parsing each line as a complex number



  • updating the histogram vector



  • writing out the PPM file



The first step to optimizing the program is to measure how long each of these steps is taking.

criterion is a great tool, but because it runs the test function multiple times it isn't well suited for operations which take many seconds. Also, criterion is designed to get precise timings, and we just need ballpark figures. For our timings we're just going to use the shell's time command.

I first created a test file containing a million random complex numbers, and ran the entire pipeline to get a base line reading. Creating the the PPM from the million complex numbers took 33 secs. To understand if that was a reasonable number or not I wrote a quick perl script to just read in and add up the real parts of the complex numbers:

my $sum = 0;
while (<>) {
  if (m/^(.*?)[+]/) { $sum += $1; }
}
print $sum, "\n";


and it only took 1.7 seconds. Clearly there is a big problem somewhere with the Haskell program.

The next step it to write alternative main functions which cut off the pipeline at various stages, e.g.:

A main which just prints out the number of lines read:

mainLength path =  do
    rawData <- liftA LB.words (LB.readFile path)
    let formatedData = map (extr.LP.parse parseComplex) rawData
    print $ length formatedData


A main which just adds up the real parts of the parsed numbers:

mainSum path = do
    rawData <- liftA LB.words (LB.readFile path)
    let formatedData = map (extr.LP.parse parseComplex) rawData
        rpart (Complex r _) = r
    print $ sum $ map rpart $ formatedData


and the timings I got are as follows:

mainLength   0.18 secs
mainSum     17.37 secs


So what work is being done in mainSum which is not being done in mainLength? Due to laziness it is the conversion of the parsed bytestrings to Double values.

A quick scan of the code reveals that you are using read to perform this conversion. read is notoriously slow, and should be avoided for performance critical code.

A replacement for read :: ByteString -> Double

If you search Google you can find this Stackoverflow article:

https://stackoverflow.com/questions/4489518/efficient-number-reading-in-haskell

Mostly because the answer was submitted by Don Stewart I decided on using the bytestring-lexing module, and came up with this version of parseComplex:

parseComplex' = do
  r  0
                       Just (d, _) -> d


The timing for mainSum using parseComplex' is:

mainSum'  0.6 secs


So now the first two stages of the pipeline are very performant. The next step is to figure out why updating the histogram vector and writing out the PPM file is taking so long.

Using Ubboxed Vectors

I've found another important optimization - you want to use unboxed vectors instead of the regular vectors. Here is an alternate version of genVec:

import qualified Data.Vector.Unboxed as UnboxedV
import qualified Data.Vector.Unboxed.Mutable as UnboxedM

genVec :: [Complex] -> UnboxedV.Vector Int
genVec xs = runST $ do
  mv  do
    let x = computeCIndex c
    count <- UnboxedM.unsafeRead mv x
    UnboxedM.unsafeWrite mv x (count+1)
  UnboxedV.freeze mv


This will cut down the run time by another couple of seconds (which is now significant since the whole pipeline takes now only takes about 4 secs to run.)

Update:

I think you'll find that improving the Double parsing is about the best you can do for a single threaded program. To scale to 600M points you are going to have to use multiple threads / machines. Fortunately this is a classic map-reduce problem, so there's a lot of tools and libraries (not necessarily in Haskell) that you can draw upon.

If you just want to scale on a single machine using multiple threads, you can put together your own solution using a module like Control.Monad.Par. For a clustering solution you'll probably have to use a third-party framework like Hadoop in which case you might be interested in the hadron package - there is also a video describing it here: https://vimeo.com/90189610

Code Snippets

my $sum = 0;
while (<>) {
  if (m/^(.*?)[+]/) { $sum += $1; }
}
print $sum, "\n";
mainLength path =  do
    rawData <- liftA LB.words (LB.readFile path)
    let formatedData = map (extr.LP.parse parseComplex) rawData
    print $ length formatedData
mainSum path = do
    rawData <- liftA LB.words (LB.readFile path)
    let formatedData = map (extr.LP.parse parseComplex) rawData
        rpart (Complex r _) = r
    print $ sum $ map rpart $ formatedData
mainLength   0.18 secs
mainSum     17.37 secs
parseComplex' = do
  r <- P.takeTill (== '+')
  P.char '+'
  i <- P.takeTill (== '*')
  P.string iT  -- or just skip this
  return $ Complex (toDouble r) (toDouble i)
  where toDouble s = case (readSigned readDecimal) s of
                       Nothing -> 0
                       Just (d, _) -> d

Context

StackExchange Code Review Q#98730, answer score: 5

Revisions (0)

No revisions yet.