Reading lines from a file in random order

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Problem

I originally wrote this as an answer to a question on Stack Overflow, but it turned out so nicely that I decided to post it here to see if I can make it even better.

(defn char-seq
  "Returns a lazy sequence of characters from rdr. rdr must implement
  java.io.Reader."
  [rdr]
  (let [c (.read rdr)]
    (if-not (neg? c)
      (cons (char c) (lazy-seq (char-seq rdr))))))

(defn line-offsets
  "Returns a lazy sequence of offsets of all lines in s."
  [s]
  (if (seq s)
    (->> (partition-all 3 1 s)
         (map-indexed
          (fn [i [a b c]]
            (cond
              (= b \newline) (if c (+ 2 i))
              (= a \return) (if b (inc i)))))
         (filter identity)
         (cons 0))))

(defn ordered-line-seq
  "Returns the lines of text from raf at each offset in offsets as a lazy
  sequence of strings. raf must implement java.io.RandomAccessFile."
  [raf offsets]
  (map (fn [i]
         (.seek raf i)
         (.readLine raf))
       offsets))

Example usage:

(let [filename "data.txt"
      offsets (with-open [rdr (clojure.java.io/reader filename)]
                (shuffle (line-offsets (char-seq rdr))))]
  (with-open [raf (java.io.RandomAccessFile. filename "r")]
    (run! println (ordered-line-seq raf offsets))))

I'm interested in any way to improve this code, but here are some specific things I'm looking for:

Since RandomAccessFile obviously uses byte offsets, and line-offsets returns character offsets, this code won't work for Unicode files. Is there a way to fix that problem without adding a lot of complexity?

Are a, b, and c good names for the parameters of the anonymous function in line-offsets?

What exactly are the guidelines for parameter ordering in named functions like ordered-line-seq?
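
The mismatch behind the first question is easy to reproduce. A minimal sketch, where second-line-offsets is a hypothetical helper (not part of the code above) that reports where the second line of a string starts, counted once in characters and once in UTF-8 bytes:

```clojure
(defn second-line-offsets
  "Hypothetical helper: returns the offset at which the second line of
  text starts, as a character count and as a UTF-8 byte count."
  [^String text]
  (let [prefix (subs text 0 (inc (.indexOf text "\n")))]
    {:chars (count prefix)
     :bytes (count (.getBytes prefix "UTF-8"))}))

;; \u00e9 (e with acute accent) is one character but two bytes in UTF-8.
(second-line-offsets "\u00e9\nb")
;; => {:chars 2, :bytes 3}
```

Seeking to offset 2 in such a file would land on the newline byte rather than on the start of the second line.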

Solution

Regarding your questions:

- RandomAccessFile works on bytes. Fortunately, the three common
line-ending conventions (LF, CR, and CRLF) involve only Newline and
Return, each of which is a single byte in ASCII and UTF-8, and those
byte values never occur inside a multi-byte UTF-8 sequence. So if you
compute the offsets from bytes rather than from decoded characters,
there will be no problem with UTF-8 or with single-byte encodings such
as Latin-1. For other Unicode encodings, such as UTF-16, you'd have to
read more than one byte per character and also do some decoding.

  Frankly, even without encodings, line endings are a hairy topic;
  cf. "Newline" on Wikipedia.
  That said, line-offsets uses the same convention as readLine, so I
  guess that's just FYI.

- I'd at least avoid short or one-letter names for function
parameters, because they also serve as documentation. E.g. r
doesn't tell the reader anything, rdr is barely English, but
reader is clear and doesn't cost much more in terms of
characters. The same goes for s versus sequence. For a, b, c in
line-offsets, I'd even say that c1 to c3 would be more helpful.

- It'd be best to model your functions on the standard ones. It's
probably better to put the most important "subject" first and the
rest after it, unless there's a good reason not to. Parameters at the
ends can also very easily be curried (partially applied); maybe that's
another hint worth thinking about.
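
To make the first point concrete, here is a minimal sketch that computes byte offsets instead of character offsets, using descriptive parameter names along the way. The names byte-seq, line-byte-offsets, read-utf8-line-at, and print-shuffled-lines are hypothetical and not from the question. It assumes the file is UTF-8. Because RandomAccessFile.readLine turns each byte into one character without decoding, the sketch converts the line back to bytes via ISO-8859-1 and decodes those as UTF-8.

```clojure
(require '[clojure.java.io :as io])

(defn byte-seq
  "Returns a lazy sequence of byte values (0-255) read from input-stream."
  [^java.io.InputStream input-stream]
  (lazy-seq
    (let [b (.read input-stream)]
      (when-not (neg? b)
        (cons b (byte-seq input-stream))))))

(defn line-byte-offsets
  "Returns the byte offsets at which lines start in byte-values,
  treating LF (10), CR (13), and CRLF as line terminators."
  [byte-values]
  (when (seq byte-values)
    (cons 0
          (keep-indexed
            (fn [index [current following]]
              ;; A new line starts right after a terminator, provided
              ;; more data follows. The CR of a CRLF pair is skipped;
              ;; its LF produces the offset.
              (when (and following
                         (or (= current 10)
                             (and (= current 13) (not= following 10))))
                (inc index)))
            (partition-all 2 1 byte-values)))))

(defn read-utf8-line-at
  "Seeks raf to the byte offset and returns the line there, decoded as UTF-8."
  [^java.io.RandomAccessFile raf offset]
  (.seek raf (long offset))
  (when-let [raw (.readLine raf)]
    ;; readLine maps each byte to one char, so ISO-8859-1 recovers the bytes.
    (String. (.getBytes raw "ISO-8859-1") "UTF-8")))

(defn print-shuffled-lines
  "Prints the lines of the UTF-8 file named filename in random order."
  [^String filename]
  (let [offsets (with-open [in (io/input-stream filename)]
                  (shuffle (vec (line-byte-offsets (byte-seq in)))))]
    (with-open [raf (java.io.RandomAccessFile. filename "r")]
      (doseq [offset offsets]
        (println (read-utf8-line-at raf offset))))))
```

Calling (print-shuffled-lines "data.txt") would then play the role of the example usage in the question. Reading one byte at a time through a lazy sequence is slow for large files; scanning a buffer in a loop would be a reasonable alternative.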

Other than that, yeah, looks good. I don't quite see why the laziness
is necessary, since it's calling shuffle anyway, but perhaps it's
useful in some situations?
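
If laziness turns out not to be needed, an eager variant is one option. A rough sketch, where lines-at is a hypothetical name and the function opens the file itself instead of taking an open RandomAccessFile:

```clojure
(defn lines-at
  "Eagerly reads the line starting at each byte offset in offsets from
  the file at path and returns the lines as a vector of strings."
  [path offsets]
  (with-open [raf (java.io.RandomAccessFile. (str path) "r")]
    ;; mapv realizes every line before with-open closes the file.
    (mapv (fn [offset]
            (.seek raf (long offset))
            (.readLine raf))
          offsets)))
```

Since the result is fully realized before the file is closed, it can safely be returned from the function, which a lazy sequence over a closed file could not.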

Context

StackExchange Code Review Q#109948, answer score: 2