HiveBrain v1.2.0
Get Started
← Back to all entries
patternrustMinor

Tokenize s-expressions in Rust

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
rusttokenizeexpressions

Problem

I'm writing an s-expression tokenizer in Rust.

I have the following code written, but it is not idiomatic Rust - it is quite ugly. TokenizerI is simply an interface with a body that is really of no import here other than to note that _s is unused (hence the underscore).

extern crate regex;

use regex::Regex;
use std::cmp;

use tokenize::api::TokenizerI;

pub struct SExprTokenizer {
    strict: bool,
    open_paren: &'static str,
    close_paren: &'static str,
    paren_regexp: regex::Regex
}

impl TokenizerI for SExprTokenizer {

    fn tokenize(&'a self, _s: &'a str) -> Result, String> {
        let mut result = Vec::new();
        let mut pos = 0;
        let mut depth = 0;

        for cap in self.paren_regexp.captures_iter(_s) {
            let paren = cap.at(0).unwrap();

            let (start, end) = cap.pos(0).unwrap();

            if depth == 0 {
                for token in _s[pos..start].split(|c: char| c.is_whitespace())
                                           .filter(|s| !s.is_empty()) {
                        result.push(token);
                }
                pos = start;
            }
            if paren == self.open_paren {
                depth = depth + 1;
            }
            if paren == self.close_paren {
                if self.strict && depth == 0 {
                    return Err(String::from(format!("Unmatched open token at {}", pos)));
                }
                depth = cmp::max(0, depth-1);
                if depth == 0 {
                    result.push(&_s[pos..end]);
                    pos = end;
                }
            }
        }
        Ok(result)
    }
}


It is based on the following Python:

```
import re

from nltk.tokenize.api import TokenizerI

class SExprTokenizer(TokenizerI):
"""
A tokenizer that divides strings into s-expressions.
An s-expresion can be either:

- a parenthesized expression, including any nested parenthesized
expressions, or
- a sequence of

Solution

-
The biggest thing I can say is listen to your tests. The very first thing I noticed as I was skimming is how repetitive they were. Even worse, the repeated parts drowned out the interesting aspects of the tests.

Remember that your tests are your first chance to see how a potential end-user might use your code. Use them as an opportunity to reflect on your API and how ergonomic it is.

Here, I've chosen to introduce the builder pattern. This allows you to reduce the boilerplate duplication and consolidate more complicated logic (like the construction of the regex), as well as provide a sane set of defaults.

As a side-benefit, I can see that you haven't tested with non-strict tokenization because I didn't need to add that to the builder.

-
Avoid abbreviations in globally-visible items such as types. These items are read way more then they will be written. You can use your editors autocomplete functionality to avoid RSI if that's a concern for you.

-
I'd strongly encourage you to avoid Hungarian notation. Hungarian notation was useful in languages like C where the type system was a bit looser, although many people misused it to simply repeat the type of the variable. It should have been used like distance_meters, but people misused it as distance_int. It's especially bad if you use initials that don't even correspond! Rust doesn't have interfaces, it has traits. While the two things help solve similar problems, things have different names for a reason.

-
I don't know why you are using &'a self. That's a very unusual pattern as it ties the lifetime of the result to the lifetime of the tokenizer. Said another way, this code doesn't work:

let text = "{a b {c d}} e f {g} }";

let result = {
    let tokenizer = make_tokenizer();
    tokenizer.tokenize(text)
};


-
The s argument is used in your tokenizer, so it should not have an underscore. I'm unclear as to why you bring up the trait definition in your original post, but it seems to be a non sequitur. Maybe you are unaware that the argument names don't have to match between the trait definition and the implementation(s)?

trait Foo {
    fn foo(&self, a: u8);
}

impl Foo for bool {
    fn foo(&self, _: u8) { }
}

impl Foo for String {
    fn foo(&self, value: u8) { println!("{}", value) }
}


-
format! already returns a String — there's no need to use String::from.

-
I avoid having long values in a for-loop expression. Break it out into a separate line and give it a name.

-
You can sometimes pass a function directly to items that take a closure. It's a little tricky when a lifetime is involved, but is_whitespace works just fine.

-
Instead of iterating over the iterator and pushing, just use extend.

-
Use the += operator and friends instead of a = a + b.

-
Use else if to show that you expect that the two if statements can never be both true. As a small benefit, you might get a tiny performance benefit.

-
There's no need to use max when subtracting the depth as is it never possible to subtract more than one level (at least your tests don't fail with this change). If you did need it for some reason, I'd look into saturating_sub.

-
I'd avoid unwraps in your tests, especially for things that you expect to fail. Instead, I'd just assert that your value is wrapped in an Ok or Err, as appropriate. I believe this to give better test errors.

-
Since the test module is nested within your production code, which presumably would already be namespaced appropriated, there's no need to further namespace it by adding _sexpr.

```
extern crate regex;

use regex::Regex;

trait Tokenizer {
fn tokenize(&self, s: &'a str) -> Result, String>;
}

struct SExpressionTokenizerBuilder {
strict: bool,
open_paren: &'static str,
close_paren: &'static str,
}

impl SExpressionTokenizerBuilder {
fn new() -> SExpressionTokenizerBuilder {
SExpressionTokenizerBuilder {
strict: true,
open_paren: "(",
close_paren: ")",
}
}

fn open_close(self, open: &'static str, close: &'static str) -> SExpressionTokenizerBuilder {
SExpressionTokenizerBuilder {
open_paren: open,
close_paren: close,
..self
}
}

fn build(self) -> SExpressionTokenizer {
let paren_regexp = Regex::new(
&format!("\\{}|\\{}", self.open_paren, self.close_paren)
).unwrap();

SExpressionTokenizer {
strict: self.strict,
open_paren: self.open_paren,
close_paren: self.close_paren,
paren_regexp: paren_regexp,
}
}
}

struct SExpressionTokenizer {
strict: bool,
open_paren: &'static str,
close_paren: &'static str,
paren_regexp: regex::Regex
}

impl Tokenizer for SExpressionTokenizer {
fn tokenize(&self, s: &'a str) -> Result, String> {
let mut result = Vec::new();
let mut pos = 0;
let mut

Code Snippets

let text = "{a b {c d}} e f {g} }";

let result = {
    let tokenizer = make_tokenizer();
    tokenizer.tokenize(text)
};
trait Foo {
    fn foo(&self, a: u8);
}

impl Foo for bool {
    fn foo(&self, _: u8) { }
}

impl Foo for String {
    fn foo(&self, value: u8) { println!("{}", value) }
}
extern crate regex;

use regex::Regex;

trait Tokenizer {
    fn tokenize<'a>(&self, s: &'a str) -> Result<Vec<&'a str>, String>;
}

struct SExpressionTokenizerBuilder {
    strict: bool,
    open_paren: &'static str,
    close_paren: &'static str,
}

impl SExpressionTokenizerBuilder {
    fn new() -> SExpressionTokenizerBuilder {
        SExpressionTokenizerBuilder {
            strict: true,
            open_paren: "(",
            close_paren: ")",
        }
    }

    fn open_close(self, open: &'static str, close: &'static str) -> SExpressionTokenizerBuilder {
        SExpressionTokenizerBuilder {
            open_paren: open,
            close_paren: close,
            ..self
        }
    }

    fn build(self) -> SExpressionTokenizer {
        let paren_regexp = Regex::new(
            &format!("\\{}|\\{}", self.open_paren, self.close_paren)
        ).unwrap();

        SExpressionTokenizer {
            strict: self.strict,
            open_paren: self.open_paren,
            close_paren: self.close_paren,
            paren_regexp: paren_regexp,
        }
    }
}

struct SExpressionTokenizer {
    strict: bool,
    open_paren: &'static str,
    close_paren: &'static str,
    paren_regexp: regex::Regex
}

impl Tokenizer for SExpressionTokenizer {
    fn tokenize<'a>(&self, s: &'a str) -> Result<Vec<&'a str>, String> {
        let mut result = Vec::new();
        let mut pos = 0;
        let mut depth = 0;

        for cap in self.paren_regexp.captures_iter(s) {
            let paren = cap.at(0).unwrap();

            let (start, end) = cap.pos(0).unwrap();

            if depth == 0 {
                let tokens =
                    s[pos..start]
                    .split(char::is_whitespace)
                    .filter(|s| !s.is_empty());

                result.extend(tokens);

                pos = start;
            }

            if paren == self.open_paren {
                depth += 1;
            } else if paren == self.close_paren {
                if self.strict && depth == 0 {
                    return Err(format!("Unmatched open token at {}", pos));
                }
                depth -= 1;
                if depth == 0 {
                    result.push(&s[pos..end]);
                    pos = end;
                }
            }
        }
        Ok(result)
    }
}

#[cfg(test)]
mod test {
    use super::{Tokenizer, SExpressionTokenizerBuilder};

    #[test]
    fn passing_strict_parens_test() {
        let tokenizer = SExpressionTokenizerBuilder::new().build();

        let text = "(a b (c d)) e f (g)";
        let expected = vec!["(a b (c d))", "e", "f", "(g)"];
        let result = tokenizer.tokenize(text);
        assert_eq!(Ok(expected), result);
    }

    #[test]
    fn passing_strict_braces_test() {
        let tokenizer = SExpressionTokenizerBuilder::new().open_close("{", "}").build();

        let text = "{a b {c d}} e f {g}";
        let expected = vec!["{a b {c d}}", "e", "f", "{g}"];
        let result = token

Context

StackExchange Code Review Q#112990, answer score: 9

Revisions (0)

No revisions yet.