Implementation of Python's re.split in Clojure (with capturing parentheses)

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Problem

If you use capturing parentheses in the regular expression pattern passed to Python's re.split() function, the text matched by the groups is included in the result (see Python's documentation).
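For reference, a small Python snippet showing the behavior in question. The variable names are only for illustration; the sample text is the one used in the example further down.

```python
import re

text = "...words, words..."

# Without a capturing group, the separators are dropped.
without_group = re.split(r"\W+", text)

# With a capturing group, the matched separators are kept in the result.
with_group = re.split(r"(\W+)", text)

print(without_group)  # ['', 'words', 'words', '']
print(with_group)     # ['', '...', 'words', ', ', 'words', '...', '']
```

Note that Python keeps the empty strings at both ends of the result when the text starts or ends with a separator.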

I need this in my Clojure code and I didn't find an implementation of this, nor a Java method for achieving the same result.

(use '[clojure.string :as string :only [blank?]])

(defn re-tokenize [re text]
  (let [matcher (re-matcher re text)]
    (defn inner [last-index result]
      (if (.find matcher)
        (let [start-index (.start matcher)
              end-index (.end matcher)
              match (.group matcher)
              insert (subs text last-index start-index)]
          (if (string/blank? insert)
            (recur end-index (conj result match))
            (recur end-index (conj result insert match))))
        (conj result (subs text last-index))))
    (inner 0 [])))

Example:

(re-tokenize #"(\W+)" "...words, words...")
  => ["..." "words" ", " "words" "..." ""]

How could I make this simpler and / or more efficient (maybe also more Clojure-ish)?

Solution

You can adjust your implementation to return a lazy sequence, so that tokens are only computed when they are actually consumed:

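(require '[clojure.string :as string])

(defn re-tokenizer [re text]
  ;; The matcher is stateful: each re-find advances it to the next match.
  (let [matcher (re-matcher re text)]
    ((fn step [last-index]
       ;; Wrapping the whole step in lazy-seq defers even the first match.
       (lazy-seq
         (if (re-find matcher)
           (let [start-index (.start matcher)
                 end-index (.end matcher)
                 match (.group matcher)
                 insert (subs text last-index start-index)
                 more (cons match (step end-index))]
             ;; Skip blank text between matches, as the original does.
             (if (string/blank? insert)
               more
               (cons insert more)))
           ;; No more matches: emit the remaining text, like the eager version.
           (list (subs text last-index)))))
     0)))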

This implementation can be more efficient when you do not need every token, because results are only calculated as they are needed. For instance, if you only need the first 10 results from a really long string, you can use:

(take 10 (re-tokenizer #"(\W+)" really-long-string))

and only the matches needed to produce those first 10 elements will be computed.
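For comparison, the same lazy idea can be sketched in Python with a generator. This is only an illustration, not part of the original answer: re_tokenize_lazy is a hypothetical name, and the blank check via str.strip() only roughly mirrors string/blank?.

```python
import re
from itertools import islice

def re_tokenize_lazy(pattern, text):
    """Lazily yield non-blank text between matches, plus the matches themselves."""
    last_index = 0
    # re.finditer scans on demand, so matches are found only as they are consumed.
    for m in re.finditer(pattern, text):
        insert = text[last_index:m.start()]
        if insert.strip():  # skip empty or whitespace-only gaps
            yield insert
        yield m.group()
        last_index = m.end()
    # Trailing remainder after the last match (may be an empty string).
    yield text[last_index:]

sample = "...words, words..."
first_three = list(islice(re_tokenize_lazy(r"(\W+)", sample), 3))
everything = list(re_tokenize_lazy(r"(\W+)", sample))

print(first_three)  # ['...', 'words', ', ']
print(everything)   # ['...', 'words', ', ', 'words', '...', '']
```

Here islice plays the role of take: the generator stops scanning as soon as the requested number of tokens has been produced.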

Context

StackExchange Code Review Q#14113, answer score: 3
