Implementation of Python's re.split in Clojure (with capturing parentheses)
Problem
If you use capturing parentheses in the regular expression pattern passed to Python's re.split() function, the text matched by the groups is included in the result (see Python's documentation). I need this in my Clojure code, and I found neither an existing implementation nor a Java method that achieves the same result.
(use '[clojure.string :as string :only [blank?]])

(defn re-tokenize [re text]
  (let [matcher (re-matcher re text)]
    (defn inner [last-index result]
      (if (.find matcher)
        (let [start-index (.start matcher)
              end-index (.end matcher)
              match (.group matcher)
              insert (subs text last-index start-index)]
          (if (string/blank? insert)
            (recur end-index (conj result match))
            (recur end-index (conj result insert match))))
        (conj result (subs text last-index))))
    (inner 0 [])))

Example:

(re-tokenize #"(\W+)" "...words, words...")
=> ["..." "words" ", " "words" "..." ""]

How could I make this simpler and / or more efficient (maybe also more Clojure-ish)?
Solution
You can adjust your implementation to return a lazy sequence, which can improve performance:
(use '[clojure.string :as string :only [blank?]])

(defn re-tokenizer [re text]
  (let [matcher (re-matcher re text)]
    ((fn step [last-index]
       (if (re-find matcher)
         (let [start-index (.start matcher)
               end-index (.end matcher)
               match (.group matcher)
               insert (subs text last-index start-index)]
           ;; Skip blank text between matches, as re-tokenize does.
           (if (string/blank? insert)
             (cons match (lazy-seq (step end-index)))
             (cons insert (cons match (lazy-seq (step end-index))))))
         ;; No more matches: emit the remaining text, as re-tokenize does.
         (list (subs text last-index))))
     0)))
This implementation can be more efficient because results are computed only as they are consumed. For instance, if you only need the first 10 results from a really long string, you can use:

(take 10 (re-tokenizer #"(\W+)" really-long-string))

and only the first 10 elements will be computed.
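Both versions above skip blank text between matches, so the output is not quite what Python returns: for the same input, Python's re.split gives ['', '...', 'words', ', ', 'words', '...', ''], including the leading empty string. If you want that shape, here is a minimal sketch of a lazy variant that keeps every piece. The name re-split-keep is made up for this example, and it assumes you want the whole match kept (as if a single capturing group wrapped the entire pattern). I have not checked how patterns that can match the empty string compare with Python.

```clojure
(defn re-split-keep
  "Hypothetical helper: lazily splits text on re and keeps each whole match
  in the output, alternating text-between-matches and match."
  [re text]
  (let [^java.util.regex.Matcher matcher (re-matcher re text)]
    ((fn step [last-index]
       (lazy-seq
         (if (.find matcher)
           ;; Emit the text before the match (possibly ""), then the match.
           ;; (.end matcher) is read here, before the next .find call.
           (cons (subs text last-index (.start matcher))
                 (cons (.group matcher)
                       (step (.end matcher))))
           ;; No more matches: emit whatever is left (possibly "").
           (list (subs text last-index)))))
     0)))

(re-split-keep #"\W+" "...words, words...")
;; => ("" "..." "words" ", " "words" "..." "")
```

Because the whole step body is wrapped in lazy-seq, nothing is matched until the first element is requested, so (take 10 (re-split-keep re really-long-string)) stops scanning early in the same way. The sequence is backed by a mutable Matcher, so it should be consumed from a single thread.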
Context
StackExchange Code Review Q#14113, answer score: 3