
Implementation of Python's re.split in Clojure (with capturing parentheses)


Problem

If you use capturing parentheses in the regular expression pattern passed to Python's re.split() function, the text of the matched groups is included in the result (see Python's documentation).

I need this in my Clojure code and I didn't find an implementation of this, nor a Java method for achieving the same result.

(use '[clojure.string :as string :only [blank?]])

(defn re-tokenize [re text]
  (let [matcher (re-matcher re text)]
    (defn inner [last-index result]
      (if (.find matcher)
        (let [start-index (.start matcher)
              end-index (.end matcher)
              match (.group matcher)
              insert (subs text last-index start-index)]
          (if (string/blank? insert)
            (recur end-index (conj result match))
            (recur end-index (conj result insert match))))
        (conj result (subs text last-index))))
    (inner 0 [])))


Example:

(re-tokenize #"(\W+)" "...words, words...")
  => ["..." "words" ", " "words" "..." ""]


How could I make this simpler and / or more efficient (maybe also more Clojure-ish)?

Solution

You can adjust your implementation to return a lazy seq, which can improve performance when only part of the result is consumed:

(use '[clojure.string :as string :only [blank?]])

(defn re-tokenizer [re text]
  (let [matcher (re-matcher re text)]
    ;; Each call to step advances the matcher by one match.
    ;; Note: nothing is emitted for the text after the last match.
    ((fn step [last-index]
       (when (re-find matcher)
         (let [start-index (.start matcher)
               end-index (.end matcher)
               match (.group matcher)
               insert (subs text last-index start-index)]
           (if (string/blank? insert)
             (cons match (lazy-seq (step end-index)))
             (cons insert (cons match (lazy-seq (step end-index))))))))
     0)))


This implementation can be more efficient because results are computed only as they are consumed. For instance, if you only need the first 10 results from a very long string, you can use:

(take 10 (re-tokenizer #"(\W+)" really-long-string))


and only the first 10 elements will be computed.
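
One thing to watch: the lazy version stops at the last match, so the text after it (the trailing "" in the question's example) is not emitted. A minimal sketch of a lazy variant that keeps that remainder might look like the following. The name re-tokenize-lazy is hypothetical, and wrapping the whole step body in lazy-seq is one possible design choice rather than the only way to write it:

```clojure
(require '[clojure.string :as string])

;; Hypothetical name, not part of the original question or answer.
(defn re-tokenize-lazy [re text]
  (let [matcher (re-matcher re text)]
    ((fn step [last-index]
       ;; Wrapping the body in lazy-seq defers even the first .find call
       ;; until an element is actually requested.
       (lazy-seq
         (if (.find matcher)
           (let [start-index (.start matcher)
                 end-index   (.end matcher)
                 match       (.group matcher)
                 insert      (subs text last-index start-index)
                 more        (cons match (step end-index))]
             ;; As in the question's code, blank text between matches is skipped.
             (if (string/blank? insert)
               more
               (cons insert more)))
           ;; No further match: emit the remainder, as the eager version does.
           (list (subs text last-index)))))
     0)))

(re-tokenize-lazy #"(\W+)" "...words, words...")
;; => ("..." "words" ", " "words" "..." "")

(take 3 (re-tokenize-lazy #"(\W+)" "...words, words..."))
;; => ("..." "words" ", ")
```

Because the matcher is mutable, this sketch assumes the returned seq is consumed from a single thread; the matcher only advances when an unrealized part of the seq is first requested.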

Context

StackExchange Code Review Q#14113, answer score: 3
