HiveBrain v1.2.0
Get Started
← Back to all entries
patternrustMinor

Idiomatic word counting in Rust

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
rustcountingwordidiomatic

Problem

My goal is to read from standard input, break up the input into words, case-insensitively, and produce a report to standard output, where each line is a word followed by a space followed by its count. The output should be sorted by words.

This is my code:

use std::collections::BTreeMap;
use std::io;
use std::io::BufRead;

fn main() {
    let mut counts: BTreeMap = BTreeMap::new();
    let stdin = io::stdin();
    for line_result in stdin.lock().lines() {
        match line_result {
            Ok(line) => {
                let lowercase_line = line.to_lowercase();
                let words = lowercase_line.split(|c: char| {
                     !(c.is_alphabetic() || c == '\'')
                 }).filter(|s| !s.is_empty());
                 for word in words {
                    *(counts.entry(word.to_string()).or_insert(0)) += 1;
                 }
            },
            Err(e) => {
                panic!("Error parsing stdin: {:?}", e);
            }
        }
    }
    for (key, value) in counts.iter() {
        println!("{} {}", key, value);
    }
}


My questions are:

  • Is BTree the proper dictionary?



  • I know that there is a regex crate, but I would like to stay with things in standard Rust. That said, splitting is a terrible way to break up lines because you have to filter empties. Is there a way to just match the words, rather than splitting on non-word sequences?



  • Is matching on the Err part of the result proper? Or should we let the script crash? Is panicking okay?



  • I noticed one is not allowed to say let words = line.to_lowercase().split(...) because of the infamous"borrowed reference does not live long enough" but is there a cleaner way?



  • Is there a nicer way to count words in a map? I don't like the asterisk.



  • I wish I didn't have to do an explicit lock on stdin.



Rust has a lot of things going for it, but when I compare what I got to the much prettier Julia version of this script, namely...

``
counts = Dict{AbstractStri

Solution

There is no "proper" dictionary, there are just different trade-offs. In this case, we have that HashMap gives us better asymptotic random access whereas BTreeMap gives us sortedness.

Sorting the a HashMap after-the-fact is well and good, but BTreeMap is already sorted so it seems like the better choice.

Rust very heavily gives tasks out to crates. This is guided by RFC 1242, and you might notice regex is in rust-lang-nursery. This means it is official and it is "standard rust"; it's just not in the standard library.

Plus, it's as easy as adding regex = "0.1" to your Cargo.toml, so that's no reason to avoid it.

Ignoring that, that

let words = line.to_lowercase().split(...)


doesn't work is just a fact of life right now, although that will eventually get fixed with non-lexical lifetimes.

These imports

use std::io;
use std::io::BufRead;


are nicer as

use std::io::{self, BufRead};


The large match

match line_result {
    Ok(line) => {
        ...
    },
    Err(e) => {
        panic!("Error parsing stdin: {:?}", e);
    }
}


would be nicer as

let line = match line_result {
    Ok(line) => line,
    Err(e) => panic!("Error parsing stdin: {:?}", e),
};

...


or even

let line = line_result.unwrap_or_else(
    |e| panic!("Error parsing stdin: {:?}", e)
);


But in this case the error handling doesn't add anything, so I'd just unwrap. All of these examples panic; a method that doesn't would involve error handling by printing and then returning from the function. Printing is normally worse to debug (no tracebacks) but nicer for end-users.

Instead of

*(counts.entry(word.to_string()).or_insert(0)) += 1;


which always allocates a new String, one can index with a borrowed &str. Sadly this isn't supported by the entry API right now, but you can hack around it:

if let Some(count) = counts.get_mut(word) {
    *count += 1;
    continue;
}
counts.insert(word.into(), 1);


Note the use of if let/continue instead of match is because of lexical lifetimes.

Since this is currently pretty ugly and the speedup probably doesn't matter, I'll leave this as a hypothetical. You say you "don't like the asterisk", but that's kind'a how it's meant to be done. I guess you could go the Julia route (get + unwrap_or + insert), but that's not really better.

After a few miscellaneous changes, the code for me looks like

extern crate regex;

use std::collections::BTreeMap;
use std::io::{self, BufRead};

use regex::Regex;

fn main() {
    let word_re = Regex::new(r"[a-z']+").unwrap();

    let mut counts: BTreeMap = BTreeMap::new();

    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let line = line.unwrap().to_lowercase();
        let matches = word_re.find_iter(&line);
        let words = matches.map(|(x, y)| &line[x..y]);

        for word in words {
            *counts.entry(word.into()).or_insert(0) += 1;
        }
    }
    for (key, value) in counts.iter() {
        println!("{} {}", key, value);
    }
}


This isn't as nice as the Julia code, but it's giving you a lot of opportunities to be a lot more efficient, and it's catching a lot more errors. It's true that locking stdin feels like a chore on 50-line examples, but it fits Rust's macro-goals of faster, safer APIs.

Code Snippets

let words = line.to_lowercase().split(...)
use std::io;
use std::io::BufRead;
use std::io::{self, BufRead};
match line_result {
    Ok(line) => {
        ...
    },
    Err(e) => {
        panic!("Error parsing stdin: {:?}", e);
    }
}
let line = match line_result {
    Ok(line) => line,
    Err(e) => panic!("Error parsing stdin: {:?}", e),
};

...

Context

StackExchange Code Review Q#127350, answer score: 5

Revisions (0)

No revisions yet.