snippetMinor

Text tokenizer example

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

exampletexttokenizer

Problem

Here is a partial example of a text tokenizer. I'm looking for ways to improve one particular line in this code:

implicit class textFile(val fileName: String)  {

    def toDict() = {
      io.Source.fromFile(fileName).getLines.flatMap(_.split("\\\\r?\\\\n")).toList
    }
  }

  def filterComments(m: List[String]) : List[String] = m match {
    case Nil => List()
    case x :: xs => x.takeWhile(c => c != '#').trim :: filterComments(xs)
  }

  def filterEmptyLines(m: List[String]) : List[String] = m match {

    case Nil => List()
    case x :: xs => if(x.isEmpty) filterEmptyLines(xs) else x :: filterEmptyLines(xs)
  }

  def splitParameters(m: List[String]) : List[String] = m match {

    case Nil => List()
    case x :: xs => x.split("\\s*(=>|,|\\s)\\s*").map(_.trim).toList ++ splitParameters(xs)

  }

  val dict = "test.txt".toDict()
  println(splitParameters(filterEmptyLines(filterComments(dict))))

The line I would like to break up is the last one:

println(splitParameters(filterEmptyLines(filterComments(dict))))

Is there any idiomatic (Scala) way to make this more readable but keeping it compact, on the same line preferable?

Solution

I know you've found a solution you like for the final line of your code, but I'm a bit concerned with the overall performance and characteristics.of your code.

Oh, I do have an alternative to your compactness solution but I will keep that for a different, shorter answer.

Confusion of metaphors

filterEmptyLines filters the lines from dict. filterComments does not. It transforms them from possibly-commented lines into lines with no comments. This may seem like a minor or pedantic point, but calling two different things by the same name can cause errors. stripComments would be a better name, I think.

Overengineered and unsafe functions

None of your three functions are stack safe. Each one of them is recursive but none of them is tail-recursive. This means that processing a large file could blow the stack.

Each one of them can be rewritten safely using combinator functions, which are usually more efficient than pattern matching and explicit recursion (and gives an extra bonus that I'll explain later).

filterComments

This is not tail recursive because it prepends the transformed x to the beginning of the list returned by xs.filterComments. It could be rewritten with a tail-recursive inner helper function, but this function can be done much more simply as a map...

def filterComments(m: List[String]): List[String] =
  m map (_.takeWhile(c => c != '#').trim)

filterEmptyLines

Unsafe for the same reason as filterComments. Can be written safely as

def filterEmptyLines(m: List[String]): List[String] =
  m filter (! _.isEmpty)

splitParameters

Same as the other two, but multiple recursive applications of ++ is even more expensive. This can be rewritten safely as

def splitParameters(m: List[String]): List[String] =
  m flatMap (_.split("\\s*(=>|,|\\s)\\s*").map(_.trim))

Oh, what happened to .toList? Why does that work without it? The answer is that Scala's flatMap does an implicit conversion of the array to a list. Google canBuildFrom if you want to learn something about Scala collection internals.

Over-specific types

Once those three functions are rewritten to use combinators rather than pattern matching, they don't need to take or return List[String]. They could take and return Seq[String]. This gives you more freedom abut what you pass in (could be List[String], could be some other Seq-based collection). No performance penalty (the appropriate filter/map/flatmap of the actual type will be called) and much more flexibility. OK, at the end you would have to convert the sequence back into a list (or whatever you want the final form to be), but this allows you delay that decision till it is important. This is that extra bonus I mentioned before.

And I'm about to explain why using Seq - or possibly Iterator could give a big performance boost.

Multiple traversals and intermediate collections

getLines returns an iterator (a lazy, one-pass collection which only processes each element on demand). But you immediately convert it to a list, reading the whole (potentially large) file into memory. Then each transformation in turn creates an entirely new collection. So you actually create 4 collections in a row, traversing the entire contents of the file 3 times (possibly 4 if there are no empty or comment-only lines). But I'm pretty sure you only care about the final one.

Even if you always want to process the entire file, that's expensive (the bigger the file, the worse it gets). And what if you only want to process the first n lines or process the file in chunks, not wasting memory on processed and not-yet-processed chunks?

There's a pretty simple solution which will give you all those options (but which doesn't force you to overcomplicate things just because you might want those options later).

Be lazy (use views, iterators or streams)

Don't keep a reference to any of the intermediate transformations.

Being Lazy

Option 1: List View

If you're sure you always want to read the whole file into a list, turn that list into a view. When you apply map, filter' orflatMap` to a view, it doesn't process the whole collection yet. It returns the original collection wrapped in a delayed transformation and only applies the transformation when you ask for the elements and only to the elements you ask for. If you just apply another transformation to the new view, again, no processing is done, you just get a new view. So if you apply all three functions to the view and only then ask for the results as a list, the original collection will only be traversed once, applying all three transformations to one element before proceeding to the next.

If you only take the first 10 lines from the final view, it will only process lines until it has 10 non-empty, non-comment-only lines. The rest of the collection inside the view will remain untouched till you ask for it.

And none of the functions needs to know that laziness is happenin

Code Snippets

def filterComments(m: List[String]): List[String] =
  m map (_.takeWhile(c => c != '#').trim)

def filterEmptyLines(m: List[String]): List[String] =
  m filter (! _.isEmpty)

def splitParameters(m: List[String]): List[String] =
  m flatMap (_.split("\\s*(=>|,|\\s)\\s*").map(_.trim))

"test.txt".toDict().filterComments.filterEmptyLines.splitParameters

val dict = "test.txt".toDict()
dict.filterComments.filterEmptyLines.splitParameters

Context

StackExchange Code Review Q#104071, answer score: 2

Revisions (0)

No revisions yet.