HiveBrain v1.2.0
Get Started
← Back to all entries
patternModerate

Graph generator for undirected graphs

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
graphgraphsgeneratorundirectedfor

Problem

I am still pretty new to Haskell and am working on a graph generator for undirected unlabeled graphs containing loops and I have a bottleneck in the following functions. So far, I have not given any thought to performance and just looked for correctness.

I know this kind of question is not really popular, but I'd be interested in general guidelines for improving performance given some naive but correct implementation like the following.

For instance:

  • Where to enforce strictness?



  • Should I use vector/array instead of lists? When, where and why?



  • Should I improve single functions by replacing recursion using folds/maps/etc. or similar?



Additional info:

  • I'm using -O2 with ghc



  • profiling shows (when normalizing runtime of connections to 100%)



  • arqSeq 40% (with almost all time spend in boundsequences)



  • connectionCombinations 60% with a quarter of the time spend in occurences



I don't think you need to understand all the details. I'm looking more for micro-improvements (the following determines arcs from a given node to equivalence classes of vertices grouped by their degree).

```
import Data.List
import Control.Monad

type Arcs = Int
type Count = Int
type Vertex = Int
type MSequence = [Int]
data EquivClass = EC { order :: Int, verts :: [Vertex] } deriving (Eq, Show)
type ECPartition = [EquivClass]
type NodeSelection = [(Arcs,Count)]
type ECNodeSelection = (EquivClass, NodeSelection)

-- number of occurences of unique elements in a list
occurences :: (Ord a) => [a] -> [(a, Int)]
occurences = map (\xs@(x:_) -> (x, length xs)) . group . reverse . sort

-- number of vertices in an equivalence class
ecLength :: EquivClass -> Int
ecLength = length . verts

-- for a given y = (y_1,...,y_n) and a bound m, find all vectors
-- x = (x_1,...,x_n) such that |x| = m and x_i a -> [a] -> [[a]]
boundSequences m x | m sum x == m)
ranges = map (\x -> [0..x])

-- return m-sequences (combinations of number of arcs) for the given
-- order

Solution

-- for a given y = (y_1,...,y_n) and a bound m, find all vectors
-- x = (x_1,...,x_n) such that |x| = m and x_i  a -> [a] -> [[a]]
boundSequences m x | m  sum x == m)
          ranges = map (\x -> [0..x])


Excepting some fortunate cases, you are wasting a lot of work here. sequence . ranges produces (y_1+1) ... (y_n+1) lists you have to check. If you write a function to produce only the successful lists, you can gain a lot of performance for larger input (it won't do too much for short lists with small elements, but it will help for those too).

Shouldn't the case sum x < m return [] rather than [[]]?

-- return all the possible arc combinations  to an equivalence
-- class for a given number of arcs
connectionCombinations :: Int -> EquivClass -> [NodeSelection]
connectionCombinations arcs = map groupOcc . prune arcs . sequence . orderRep
    where orderRep (EC o v) = replicate (length v) [0..o]
          prune a = nub . map (reverse . sort) . filter ((== a) . sum)
          groupOcc = filter ((/= 0) . fst) . occurences


Instead of reverse . sort, use sortBy (flip compare). That won't make much difference for short lists, but it's cleaner, IMO. prune arcs . sequence contains another occurrence of the boundSequences problem, creating a lot of lists and filtering out most of them immediately. nub is not good, it is quadratic (worst case, O(total*distinct) in general). The things you're nubbing have an Ord instance, if the order doesn't matter, map head . group . sort is a much faster nub than nub (still better Data.Set.toList . Data.Set.fromList), if order matters, keep a Set of seen elements and for each new, output that and add it to the set of seen.

I've not yet profiled, but it may be that the profile is misleading because it attributes costs to the functions where results are forced rather than where they are calculated, that needs some investigation, check back later.

Okay, the given data doesn't run long enough to produce a meaningful profile, so I changed it to

let cons = connections [EC 4 [1 .. 6], EC 6 [2 .. 9]] 5


That runs long enough to collect a handful of samples. Unfortunately, the profile wasn't very informative:

total time  =        7.90 secs   (395 ticks @ 20 ms)
    total alloc = 8,098,386,128 bytes  (excludes profiling overheads)
COST CENTRE                    MODULE               %time %alloc
connectionCombinations         Main                 100.0  100.0


So I inserted a couple of {-# SCC #-} pragmas. The biggo in that programme is the sequence . orderRep, unsurprisingly. A rewrite along the lines mentioned above,

boundSequences :: (Num a, Ord a, Enum a) => a -> [a] -> [[a]]
boundSequences m xs
    | sm < m    = []
    | sm == m   = [xs]
    | otherwise = go sm m xs
      where
        sm = sum xs
        go _ r []
            | r == 0 = [[]]
            | otherwise = []
        go _ r [y]
            | y < r     = []
            | otherwise = [[r]]
        go s r (y:ys) = do
            let mny | s < r+y   = r+y-s
                    | otherwise = 0
                mxy = min y r
            c <- [mny .. mxy]
            map (c:) (go (s-y) (r-c) ys)


and using that in connectionCombinations instead of prune arcs . sequence (requires a small change in orderRep),

connectionCombinations :: Int -> EquivClass -> [NodeSelection]
connectionCombinations arcs = map groupOcc . nub . map (sortBy (flip compare)) . 
                                     boundSequences arcs . orderRep
    where orderRep (EC o v) = replicate (length v) o
          groupOcc = filter ((/= 0) . fst) . occurences


brought the running time (ghc -O2, no profiling) down significantly:

dafis@schwartz:~/Cairo> time ./orArcs > /dev/null

real    0m5.836s
user    0m5.593s
sys     0m0.231s
dafis@schwartz:~/Cairo> time ./arcs > /dev/null

real    0m0.008s
user    0m0.005s
sys     0m0.003s


Whether any further optimisations (e.g. the abovementioned nub) are necessary, and where most of the time in the actual programme is spent, so that one could identify the most important points, I cannot tell without more realistic data.

Code Snippets

-- for a given y = (y_1,...,y_n) and a bound m, find all vectors
-- x = (x_1,...,x_n) such that |x| = m and x_i <= y_i
boundSequences :: (Num a, Ord a, Enum a) => a -> [a] -> [[a]]
boundSequences m x | m <= sum x = (fByM . sequence . ranges) x
                   | otherwise = [[]]
    where fByM = filter (\x -> sum x == m)
          ranges = map (\x -> [0..x])
-- return all the possible arc combinations  to an equivalence
-- class for a given number of arcs
connectionCombinations :: Int -> EquivClass -> [NodeSelection]
connectionCombinations arcs = map groupOcc . prune arcs . sequence . orderRep
    where orderRep (EC o v) = replicate (length v) [0..o]
          prune a = nub . map (reverse . sort) . filter ((== a) . sum)
          groupOcc = filter ((/= 0) . fst) . occurences
let cons = connections [EC 4 [1 .. 6], EC 6 [2 .. 9]] 5
total time  =        7.90 secs   (395 ticks @ 20 ms)
    total alloc = 8,098,386,128 bytes  (excludes profiling overheads)
COST CENTRE                    MODULE               %time %alloc
connectionCombinations         Main                 100.0  100.0
boundSequences :: (Num a, Ord a, Enum a) => a -> [a] -> [[a]]
boundSequences m xs
    | sm < m    = []
    | sm == m   = [xs]
    | otherwise = go sm m xs
      where
        sm = sum xs
        go _ r []
            | r == 0 = [[]]
            | otherwise = []
        go _ r [y]
            | y < r     = []
            | otherwise = [[r]]
        go s r (y:ys) = do
            let mny | s < r+y   = r+y-s
                    | otherwise = 0
                mxy = min y r
            c <- [mny .. mxy]
            map (c:) (go (s-y) (r-c) ys)

Context

StackExchange Code Review Q#6488, answer score: 11

Revisions (0)

No revisions yet.