Reading lines from a file in random order

Problem

I originally wrote this as an answer to a question on Stack Overflow, but it turned out so nicely that I decided to post it here to see if I can make it even better.

(defn char-seq
  "Returns a lazy sequence of characters from rdr. rdr must implement
  java.io.Reader."
  [rdr]
  (let [c (.read rdr)]
    (if-not (neg? c)
      (cons (char c) (lazy-seq (char-seq rdr))))))

(defn line-offsets
  "Returns a lazy sequence of offsets of all lines in s."
  [s]
  (if (seq s)
    (->> (partition-all 3 1 s)
         (map-indexed
          (fn [i [a b c]]
            (cond
              (= b \newline) (if c (+ 2 i))
              (= a \return) (if b (inc i)))))
         (filter identity)
         (cons 0))))

(defn ordered-line-seq
  "Returns the lines of text from raf at each offset in offsets as a lazy
  sequence of strings. raf must implement java.io.RandomAccessFile."
  [raf offsets]
  (map (fn [i]
         (.seek raf i)
         (.readLine raf))
       offsets))


Example usage:

(let [filename "data.txt"
      offsets (with-open [rdr (clojure.java.io/reader filename)]
                (shuffle (line-offsets (char-seq rdr))))]
  (with-open [raf (java.io.RandomAccessFile. filename "r")]
    (run! println (ordered-line-seq raf offsets))))


I'm interested in any way to improve this code, but here are some specific things I'm looking for:

  • Since RandomAccessFile obviously uses byte offsets, and line-offsets returns character offsets, this code won't work for Unicode files. Is there a way to fix that problem without adding a lot of complexity?

  • Are a, b, and c good names for the parameters of the anonymous function in line-offsets?

  • What exactly are the guidelines for parameter ordering in named functions like ordered-line-seq?

Solution

Regarding your questions:

- Scan the file as bytes rather than as decoded characters.
  Fortunately, the three common line-ending conventions (\n, \r and
  \r\n) all consist of Newline and Return, each a single byte in any
  ASCII-compatible encoding. In UTF-8 those two byte values never
  occur inside a multi-byte character, so byte offsets found this way
  are correct and there will be no problem. For other Unicode
  encodings, such as UTF-16, you'd have to read more than one byte
  per character and also do some decoding.

  Frankly, even without encodings this is a hairy topic; cf. the
  Newline article on Wikipedia. line-offsets uses the same convention
  as readLine, though, so I guess that's just FYI.

- I'd at least avoid short or one-letter names for function
  parameters, because parameter names also serve as documentation.
  E.g. r doesn't tell the reader anything, rdr is barely English, but
  reader is clear and doesn't cost much more in terms of characters.
  The same goes for s versus sequence. For a, b, c in line-offsets
  I'd even say that c1 to c3 would be more helpful.

- It'd be best to model this on the standard functions. It's probably
  better to put the most important "subject" first, followed by the
  rest, unless there's a good reason not to. Parameter position also
  affects how easily the function can be partially applied (partial
  fixes the leading arguments), so maybe that's another hint to think
  about it.
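
To make the first point concrete, here is a minimal sketch of scanning raw bytes instead of characters, assuming the file is UTF-8 (or another ASCII-compatible encoding). The names byte-line-offsets and read-utf8-line-at are hypothetical and not part of the original code. The second helper is there because RandomAccessFile.readLine turns each byte into one char without decoding, so the line has to be re-decoded to get proper UTF-8 text.

```clojure
(require '[clojure.java.io :as io])

(defn byte-line-offsets
  "Returns a vector of the byte offsets at which lines start in the file
  at path. Treats \\n, \\r and \\r\\n as line terminators. Hypothetical
  helper; assumes an ASCII-compatible encoding such as UTF-8."
  [path]
  (with-open [^java.io.InputStream in (io/input-stream path)]
    (loop [pos 0, prev -1, offsets []]
      (let [b (long (.read in))]
        (if (neg? b)
          offsets                                   ; end of file
          (let [starts-line? (or (zero? pos)        ; first byte of the file
                                 (= prev 10)        ; byte after \n
                                 (and (= prev 13)   ; byte after a lone \r
                                      (not= b 10)))]
            (recur (inc pos)
                   b
                   (if starts-line? (conj offsets pos) offsets))))))))

(defn read-utf8-line-at
  "Seeks raf to the byte offset and returns the line there, decoded as
  UTF-8. readLine maps each byte to one char, so the bytes are recovered
  via ISO-8859-1 and then decoded again."
  [^java.io.RandomAccessFile raf offset]
  (.seek raf (long offset))
  (when-let [line (.readLine raf)]
    (String. (.getBytes line "ISO-8859-1") "UTF-8")))
```

For a file containing "é\nb" in UTF-8, byte-line-offsets returns [0 3], whereas counting characters would give 2 for the second line. The offsets can be shuffled and passed to read-utf8-line-at one at a time inside a with-open, just like in the original example usage.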

Other than that, yeah, looks good. I don't quite see why the laziness
is necessary, since shuffle realizes the whole sequence of offsets
anyway, but perhaps it's useful in some situations?
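
One place where the laziness does matter is ordered-line-seq: map is lazy, so its result has to be consumed before with-open closes the file, which the run! in the example usage takes care of. A minimal sketch of an eager alternative, under the hypothetical name read-lines-at, could look like this:

```clojure
(defn read-lines-at
  "Eagerly reads the line starting at each byte offset in offsets from raf
  and returns them as a vector. Hypothetical eager counterpart to
  ordered-line-seq; mapv forces every read immediately."
  [^java.io.RandomAccessFile raf offsets]
  (mapv (fn [offset]
          (.seek raf (long offset))
          (.readLine raf))
        offsets))
```

Because the vector is fully built before the function returns, it can safely be handed out of the with-open block, at the cost of holding all requested lines in memory at once.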

Context

StackExchange Code Review Q#109948, answer score: 2
