Reading lines from a file in random order
Problem
I originally wrote this as an answer to a question on Stack Overflow, but it turned out so nicely that I decided to post it here to see if I can make it even better.
(defn char-seq
  "Returns a lazy sequence of characters from rdr. rdr must implement
  java.io.Reader."
  [rdr]
  (let [c (.read rdr)]
    (if-not (neg? c)
      (cons (char c) (lazy-seq (char-seq rdr))))))

(defn line-offsets
  "Returns a lazy sequence of offsets of all lines in s."
  [s]
  (if (seq s)
    (->> (partition-all 3 1 s)
         (map-indexed
          (fn [i [a b c]]
            (cond
              (= b \newline) (if c (+ 2 i))
              (= a \return) (if b (inc i)))))
         (filter identity)
         (cons 0))))

(defn ordered-line-seq
  "Returns the lines of text from raf at each offset in offsets as a lazy
  sequence of strings. raf must implement java.io.RandomAccessFile."
  [raf offsets]
  (map (fn [i]
         (.seek raf i)
         (.readLine raf))
       offsets))
Example usage:
(let [filename "data.txt"
      offsets (with-open [rdr (clojure.java.io/reader filename)]
                (shuffle (line-offsets (char-seq rdr))))]
  (with-open [raf (java.io.RandomAccessFile. filename "r")]
    (run! println (ordered-line-seq raf offsets))))
I'm interested in any way to improve this code, but here are some specific things I'm looking for:
- Since RandomAccessFile obviously uses byte offsets, and line-offsets returns character offsets, this code won't work for Unicode files. Is there a way to fix that problem without adding a lot of complexity?
- Are a, b, and c good names for the parameters of the anonymous function in line-offsets?
- What exactly are the guidelines for parameter ordering in named functions like ordered-line-seq?
Solution
Regarding your questions:
- RandomAccessFile reads bytes. Fortunately, the three common line-ending
  conventions (LF, CR LF, and CR) involve only Newline and Return, and each
  of those is a single byte. There's nothing more to it. So if you stay with
  an encoding that uses single bytes for (most of) its characters, such as
  UTF-8, and count the offsets in bytes rather than in decoded characters,
  there will be no problem. For other Unicode encodings you'd have to read
  more than one byte per character and also do some decoding.

  Frankly, even without encodings this is a hairy topic; cf. Newline on
  Wikipedia. Though line-offsets uses the same convention as readLine, so I
  guess that's just FYI.
- I'd at least not use short or one-letter variable names for function
  parameters, because they also serve as documentation. E.g. r doesn't tell
  the reader anything, rdr is barely English, but reader is clear and doesn't
  cost that much more in terms of characters. Same with s and sequence. For
  a, b, c in line-offsets I'd even say that c1 to c3 would be more helpful.
- It'd be best to orient yourself by the standard functions. It's probably
  better to have the most important "subject" first, followed by the rest,
  unless there's a good reason not to. Parameters at the end can also be
  curried very easily, which may be another hint worth thinking about.
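As a rough illustration of the first point, the sketch below counts line offsets in bytes instead of characters, so that they match what RandomAccessFile expects. The names byte-line-offsets and read-line-at are hypothetical and not part of the original code, and UTF-8 is assumed as the file encoding. The second helper is there because RandomAccessFile.readLine turns each byte into one character and does not decode UTF-8 itself.

```clojure
(require '[clojure.java.io :as io])

(defn byte-line-offsets
  "Hypothetical helper: returns a vector of the byte offsets at which lines
  start in the file at path. LF, CR, and CR LF all count as line endings."
  [path]
  (with-open [^java.io.InputStream in (io/input-stream (io/file path))]
    (loop [pos 0        ; offset of the byte about to be read
           prev -1      ; previously read byte, -1 before the first read
           offsets []]
      (let [b (.read in)]
        (if (neg? b)
          offsets       ; end of file
          (let [starts-line? (or (zero? pos)
                                 (= prev 10)                      ; after LF
                                 (and (= prev 13) (not= b 10)))]  ; after a lone CR
            (recur (inc pos) b (if starts-line? (conj offsets pos) offsets))))))))

(defn read-line-at
  "Hypothetical helper: reads the line starting at byte offset from raf and
  decodes it as UTF-8 (an assumption about the file's encoding)."
  [^java.io.RandomAccessFile raf offset]
  (.seek raf (long offset))
  (let [buf (java.io.ByteArrayOutputStream.)]
    (loop []
      (let [b (.read raf)]
        ;; stop at end of file or at the first line-ending byte
        (when-not (or (neg? b) (= b 10) (= b 13))
          (.write buf b)
          (recur))))
    (.toString buf "UTF-8")))
```

With these, something like (map #(read-line-at raf %) (shuffle (byte-line-offsets "data.txt"))) inside a with-open should print lines with non-ASCII characters intact. Reading one byte at a time from a RandomAccessFile is slow, so this is meant to show the idea rather than to be fast.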
Other than that, yeah, looks good. I don't quite see why the laziness
is necessary, since it's calling shuffle anyway, but perhaps it's
useful in some situations?
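To make the remark about laziness concrete, here is a minimal sketch of an eager variant. shuffled-lines is a hypothetical name, not something from the original code. It opens the file, shuffles the offsets, and reads every line with mapv before the file is closed, so no lazy sequence can outlive the open RandomAccessFile.

```clojure
(defn shuffled-lines
  "Hypothetical eager variant: returns a vector of the lines found at offsets
  in the file at path, in random order."
  [path offsets]
  (with-open [raf (java.io.RandomAccessFile. (str path) "r")]
    ;; mapv is eager, so every read happens while raf is still open
    (mapv (fn [offset]
            (.seek raf (long offset))
            (.readLine raf))
          (shuffle offsets))))
```

The parameter order follows the "subject first" suggestion, with the file path ahead of the offsets. Whether an eager vector beats the lazy version depends on the file size: for very large files the lazy sequence avoids holding every line in memory at once.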
Context
StackExchange Code Review Q#109948, answer score: 2