HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Split a string on spaces, accounting for backslash as escape character, in Haskell

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
backslashcharacterspacessplitforhaskellstringaccountingescape

Problem

This code works, but since I'm just learning Haskell I'm wondering if it could have been done in a more idiomatic way.

The main function is called tokenize, and it uses helper functions called tokenizeOne and tokenizeSeveral.

Here is the code:

tokenizeOne "" (' ':rest) = tokenizeOne "" rest 
tokenizeOne str ('\\':ch:rest) = tokenizeOne (str ++ [ch]) rest
tokenizeOne x (' ':rest) = (x, rest)
tokenizeOne x "" = (x, "")
tokenizeOne x (ch:rest) = tokenizeOne (x ++ [ch]) rest

tokenizeSeveral x "" = x
tokenizeSeveral x str = let (token, rest) = tokenizeOne "" str in
    tokenizeSeveral (x ++ [token]) rest

tokenize str = tokenizeSeveral [] str

Solution

Apart from general code layout I don't see much to improve - the only issue worth talking about is that you should never ever actually use something like ++ [ch]. Your function is needlessly O(n²) for that reason alone. That might not sound like much here, but it is a good idea to shake this habit early.

So let's talk about that a bit. Your options generally are:

-
Find a library function that frees you from actually constructing the list yourself. But to be fair, for this examples I can't actually think of anything helpful myself.

-
Use a data structure that copes better with it, such as Seq from Data.Sequence.

-
First construct the list using just (:), then reverse it at the end. That might seem inefficient, but given that reverse is only O(n), it will never actually increase algorithmic complexity.

-
Rebuild your function so it constructs the list in the right order right away.

Here is what things look like if we go for the last option:

import Control.Arrow ( first )
token :: String -> (String, String)
token ""          = ("", "")
token (' ' :  cs) = ("", cs)
token ('\\':c:cs) = first (c:) $ token cs
token (c   :  cs) = first (c:) $ token cs


The key here is that we add the character c using first (c:) after the recursive call returns, and therefore to the front of the string.

A notable advantage is that this version has better laziness properties:

*Main> head $ fst $ token ('t':undefined)
't'


We know that the first character is 't' even without looking at the rest of the string, whereas your function always needs to finish a complete token before it can return. And as we are programming in Haskell, we often like our functions as lazy as possible.

Finally, a matching tokens implementation:

tokens :: String -> [String]
tokens str
  | null str'  = []
  | otherwise  = tok : tokens rest
  where str' = dropWhile (==' ') str
        (tok, rest) = token str'


This is again basically your version using direct list construction. Only difference is that we now deal with whitespace once per token, which I feel is slightly more elegant.

Code Snippets

import Control.Arrow ( first )
token :: String -> (String, String)
token ""          = ("", "")
token (' ' :  cs) = ("", cs)
token ('\\':c:cs) = first (c:) $ token cs
token (c   :  cs) = first (c:) $ token cs
*Main> head $ fst $ token ('t':undefined)
't'
tokens :: String -> [String]
tokens str
  | null str'  = []
  | otherwise  = tok : tokens rest
  where str' = dropWhile (==' ') str
        (tok, rest) = token str'

Context

StackExchange Code Review Q#40968, answer score: 2

Revisions (0)

No revisions yet.