patternrustMinor

Port of NLTK tokenizing code from Python to Rust

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

tokenizingnltkpythonportcodefromrust

Problem

I'm working on a port of NLTK to Rust.

I am fairly new to Rust, so I wanted to post a small file to check if I was writing idiomatic Rust. I've included the original Python. The Python has docstrings that describes usage, so I didn't include comments for brevity in the Rust.

This file passes all of the tests we've written, so we're certain that it mostly works.

util.py

``


from re import finditer

def string_span_tokenize(s, sep):
    r"""
    Return the offsets of the tokens in s, as a sequence of

(start, end)


    tuples, by splitting the string at each occurrence of sep.

        >>> from nltk.tokenize.util import string_span_tokenize
        >>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
        ... two of them.\n\nThanks.'''
        >>> list(string_span_tokenize(s, " "))
        [(0, 4), (5, 12), (13, 17), (18, 26), (27, 30), (31, 36), (37, 37),
        (38, 44), (45, 48), (49, 55), (56, 58), (59, 73)]

    :param s: the string to be tokenized
    :type s: str
    :param sep: the token separator
    :type sep: str
    :rtype: iter(tuple(int, int))
    """
    if len(sep) == 0:
        raise ValueError("Token delimiter must not be empty")
    left = 0
    while True:
        try:
            right = s.index(sep, left)
            if right != 0:
                yield left, right
        except ValueError:
            if left != len(s):
                yield left, len(s)
            break

        left = right + len(sep)

def regexp_span_tokenize(s, regexp):
    r"""
    Return the offsets of the tokens in s, as a sequence of

(start, end)`
tuples, by splitting the string at each successive match of regexp.

>>> from nltk.tokenize import WhitespaceTokenizer
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me
... two of them.\n\nThanks.'''
>>> list(WhitespaceTokenizer().span_tokenize(s))
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36),
(38, 44), (45,

Solution

string_span_tokenize

if sep.len() == 0 {
    Err(String::from("Error! Separator has a length of 0!"))
} else {
    // ...
    return Ok(result);
}

You use an implicit and an explicit return, but both could be expressed as implicit returns. Alternatively, change the Err path to use return, then you don't need an else block and you can reduce indentation:

if sep.len() == 0 {
    return Err(String::from("Error! Separator has a length of 0!"));
}

// ...
Ok(result)

// TODO: we'll likely want to do some error checking
// to ensure s.len() and str.len() don't exceed usize::MAX

You don't need to do that, since len() returns a usize, and usize::MAX is the maximum value. A condition such as s.len() > usize::MAX would always be false.

let mut result: Vec = Vec::new();

The type annotation here is not strictly necessary.

let right = s[left..].find(sep); // TODO: Will this work on unicode?

Strings in Rust are encoded in UTF-8. Standard library functions operate on Unicode code points, so both the string and the separator can contain fancy characters.

match right {
    Some(right_idx) => {
        if right_idx != 0 {
            result.push((left, right_idx));
        }
        r_idx = right_idx;
    },
    None => {
        if left != strlen {
            result.push((left, strlen));
        }
        break;
    }
}

Three things:

-
I would call the right_idx variable right, shadowing the outer right variable. Inside the Some branch, there's no reason to access the outer right variable, so why not reuse the name?

-
Instead of a match expression, you could use an if let expression. For example:

if let Some(right) = right {
    if right != 0 {
        result.push((left, right));
    }
    r_idx = right;
} else {
    if left != strlen {
        result.push((left, strlen));
    }
    break;
}

-
You don't need to define the r_idx variable as mutable if you move it inside the loop and take advantage of the fact that match and if let are expressions.

let r_idx =
    if let Some(right) = right {
        if right != 0 {
            result.push((left, right));
        }
        right
    } else {
        if left != strlen {
            result.push((left, strlen));
        }
        break;
    };

Here, the if branch evaluates to right and the else branch diverges. When the else branch is taken, the break statement breaks out of the loop, so the expression never produces a value. Rust handles this gracefully without us having to put a dummy value on the else branch just to get the types to match.

regexp_span_tokenize

Nothing to say about this function.

spans_to_relative

pub fn spans_to_relative(spans: Vec) -> Vec {
    // ...
}

The function takes a Vec by value, but doesn't take advantage of its storage — it's forcing a move for no reason. To make the function more general, it should take a slice instead:

pub fn spans_to_relative(spans: &[(usize, usize)]) -> Vec {
    // ...
}

When you call the function, you'll need to write spans_to_relative(&x) instead of spans_to_relative(x).

let mut result: Vec = Vec::new();

Here, you know exactly how many elements the Vec will have before you return it, so you could allocate it with Vec::with_capacity(spans.len()) to avoid reallocations while filling it.

Code Snippets

if sep.len() == 0 {
    Err(String::from("Error! Separator has a length of 0!"))
} else {
    // ...
    return Ok(result);
}

if sep.len() == 0 {
    return Err(String::from("Error! Separator has a length of 0!"));
}

// ...
Ok(result)

// TODO: we'll likely want to do some error checking
// to ensure s.len() and str.len() don't exceed usize::MAX

let mut result: Vec<(usize, usize)> = Vec::new();

let right = s[left..].find(sep); // TODO: Will this work on unicode?

Context

StackExchange Code Review Q#112354, answer score: 5

Revisions (0)

No revisions yet.