patternrustMinor
Port of NLTK tokenizing code from Python to Rust
Viewed 0 times
tokenizingnltkpythonportcodefromrust
Problem
I'm working on a port of NLTK to Rust.
I am fairly new to Rust, so I wanted to post a small file to check if I was writing idiomatic Rust. I've included the original Python. The Python has docstrings that describes usage, so I didn't include comments for brevity in the Rust.
This file passes all of the tests we've written, so we're certain that it mostly works.
util.py
``
tuples, by splitting the string at each successive match of regexp.
>>> from nltk.tokenize import WhitespaceTokenizer
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me
... two of them.\n\nThanks.'''
>>> list(WhitespaceTokenizer().span_tokenize(s))
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36),
(38, 44), (45,
I am fairly new to Rust, so I wanted to post a small file to check if I was writing idiomatic Rust. I've included the original Python. The Python has docstrings that describes usage, so I didn't include comments for brevity in the Rust.
This file passes all of the tests we've written, so we're certain that it mostly works.
util.py
``
from re import finditer
def string_span_tokenize(s, sep):
r"""
Return the offsets of the tokens in s, as a sequence of (start, end)
tuples, by splitting the string at each occurrence of sep.
>>> from nltk.tokenize.util import string_span_tokenize
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me
... two of them.\n\nThanks.'''
>>> list(string_span_tokenize(s, " "))
[(0, 4), (5, 12), (13, 17), (18, 26), (27, 30), (31, 36), (37, 37),
(38, 44), (45, 48), (49, 55), (56, 58), (59, 73)]
:param s: the string to be tokenized
:type s: str
:param sep: the token separator
:type sep: str
:rtype: iter(tuple(int, int))
"""
if len(sep) == 0:
raise ValueError("Token delimiter must not be empty")
left = 0
while True:
try:
right = s.index(sep, left)
if right != 0:
yield left, right
except ValueError:
if left != len(s):
yield left, len(s)
break
left = right + len(sep)
def regexp_span_tokenize(s, regexp):
r"""
Return the offsets of the tokens in s, as a sequence of (start, end)`tuples, by splitting the string at each successive match of regexp.
>>> from nltk.tokenize import WhitespaceTokenizer
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me
... two of them.\n\nThanks.'''
>>> list(WhitespaceTokenizer().span_tokenize(s))
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36),
(38, 44), (45,
Solution
string_span_tokenizeif sep.len() == 0 {
Err(String::from("Error! Separator has a length of 0!"))
} else {
// ...
return Ok(result);
}You use an implicit and an explicit return, but both could be expressed as implicit returns. Alternatively, change the
Err path to use return, then you don't need an else block and you can reduce indentation:if sep.len() == 0 {
return Err(String::from("Error! Separator has a length of 0!"));
}
// ...
Ok(result)// TODO: we'll likely want to do some error checking
// to ensure s.len() and str.len() don't exceed usize::MAXYou don't need to do that, since
len() returns a usize, and usize::MAX is the maximum value. A condition such as s.len() > usize::MAX would always be false.let mut result: Vec = Vec::new();The type annotation here is not strictly necessary.
let right = s[left..].find(sep); // TODO: Will this work on unicode?Strings in Rust are encoded in UTF-8. Standard library functions operate on Unicode code points, so both the string and the separator can contain fancy characters.
match right {
Some(right_idx) => {
if right_idx != 0 {
result.push((left, right_idx));
}
r_idx = right_idx;
},
None => {
if left != strlen {
result.push((left, strlen));
}
break;
}
}Three things:
-
I would call the
right_idx variable right, shadowing the outer right variable. Inside the Some branch, there's no reason to access the outer right variable, so why not reuse the name?-
Instead of a
match expression, you could use an if let expression. For example:if let Some(right) = right {
if right != 0 {
result.push((left, right));
}
r_idx = right;
} else {
if left != strlen {
result.push((left, strlen));
}
break;
}-
You don't need to define the
r_idx variable as mutable if you move it inside the loop and take advantage of the fact that match and if let are expressions.let r_idx =
if let Some(right) = right {
if right != 0 {
result.push((left, right));
}
right
} else {
if left != strlen {
result.push((left, strlen));
}
break;
};Here, the
if branch evaluates to right and the else branch diverges. When the else branch is taken, the break statement breaks out of the loop, so the expression never produces a value. Rust handles this gracefully without us having to put a dummy value on the else branch just to get the types to match.regexp_span_tokenizeNothing to say about this function.
spans_to_relativepub fn spans_to_relative(spans: Vec) -> Vec {
// ...
}The function takes a
Vec by value, but doesn't take advantage of its storage — it's forcing a move for no reason. To make the function more general, it should take a slice instead:pub fn spans_to_relative(spans: &[(usize, usize)]) -> Vec {
// ...
}When you call the function, you'll need to write
spans_to_relative(&x) instead of spans_to_relative(x).let mut result: Vec = Vec::new();Here, you know exactly how many elements the
Vec will have before you return it, so you could allocate it with Vec::with_capacity(spans.len()) to avoid reallocations while filling it.Code Snippets
if sep.len() == 0 {
Err(String::from("Error! Separator has a length of 0!"))
} else {
// ...
return Ok(result);
}if sep.len() == 0 {
return Err(String::from("Error! Separator has a length of 0!"));
}
// ...
Ok(result)// TODO: we'll likely want to do some error checking
// to ensure s.len() and str.len() don't exceed usize::MAXlet mut result: Vec<(usize, usize)> = Vec::new();let right = s[left..].find(sep); // TODO: Will this work on unicode?Context
StackExchange Code Review Q#112354, answer score: 5
Revisions (0)
No revisions yet.