patternrustMinor
Idiomatic word counting in Rust
Viewed 0 times
rustcountingwordidiomatic
Problem
My goal is to read from standard input, break up the input into words, case-insensitively, and produce a report to standard output, where each line is a word followed by a space followed by its count. The output should be sorted by words.
This is my code:
My questions are:
Rust has a lot of things going for it, but when I compare what I got to the much prettier Julia version of this script, namely...
``
counts = Dict{AbstractStri
This is my code:
use std::collections::BTreeMap;
use std::io;
use std::io::BufRead;
fn main() {
let mut counts: BTreeMap = BTreeMap::new();
let stdin = io::stdin();
for line_result in stdin.lock().lines() {
match line_result {
Ok(line) => {
let lowercase_line = line.to_lowercase();
let words = lowercase_line.split(|c: char| {
!(c.is_alphabetic() || c == '\'')
}).filter(|s| !s.is_empty());
for word in words {
*(counts.entry(word.to_string()).or_insert(0)) += 1;
}
},
Err(e) => {
panic!("Error parsing stdin: {:?}", e);
}
}
}
for (key, value) in counts.iter() {
println!("{} {}", key, value);
}
}My questions are:
- Is
BTreethe proper dictionary?
- I know that there is a regex crate, but I would like to stay with things in standard Rust. That said, splitting is a terrible way to break up lines because you have to filter empties. Is there a way to just match the words, rather than splitting on non-word sequences?
- Is matching on the
Errpart of the result proper? Or should we let the script crash? Is panicking okay?
- I noticed one is not allowed to say
let words = line.to_lowercase().split(...) because of the infamous"borrowed reference does not live long enough"but is there a cleaner way?
- Is there a nicer way to count words in a map? I don't like the asterisk.
- I wish I didn't have to do an explicit lock on stdin.
Rust has a lot of things going for it, but when I compare what I got to the much prettier Julia version of this script, namely...
``
counts = Dict{AbstractStri
Solution
There is no "proper" dictionary, there are just different trade-offs. In this case, we have that
Sorting the a
Rust very heavily gives tasks out to crates. This is guided by RFC 1242, and you might notice
Plus, it's as easy as adding
Ignoring that, that
doesn't work is just a fact of life right now, although that will eventually get fixed with non-lexical lifetimes.
These imports
are nicer as
The large
would be nicer as
or even
But in this case the error handling doesn't add anything, so I'd just
Instead of
which always allocates a new
Note the use of
Since this is currently pretty ugly and the speedup probably doesn't matter, I'll leave this as a hypothetical. You say you "don't like the asterisk", but that's kind'a how it's meant to be done. I guess you could go the Julia route (
After a few miscellaneous changes, the code for me looks like
This isn't as nice as the Julia code, but it's giving you a lot of opportunities to be a lot more efficient, and it's catching a lot more errors. It's true that locking
HashMap gives us better asymptotic random access whereas BTreeMap gives us sortedness.Sorting the a
HashMap after-the-fact is well and good, but BTreeMap is already sorted so it seems like the better choice.Rust very heavily gives tasks out to crates. This is guided by RFC 1242, and you might notice
regex is in rust-lang-nursery. This means it is official and it is "standard rust"; it's just not in the standard library.Plus, it's as easy as adding
regex = "0.1" to your Cargo.toml, so that's no reason to avoid it.Ignoring that, that
let words = line.to_lowercase().split(...)doesn't work is just a fact of life right now, although that will eventually get fixed with non-lexical lifetimes.
These imports
use std::io;
use std::io::BufRead;are nicer as
use std::io::{self, BufRead};The large
matchmatch line_result {
Ok(line) => {
...
},
Err(e) => {
panic!("Error parsing stdin: {:?}", e);
}
}would be nicer as
let line = match line_result {
Ok(line) => line,
Err(e) => panic!("Error parsing stdin: {:?}", e),
};
...or even
let line = line_result.unwrap_or_else(
|e| panic!("Error parsing stdin: {:?}", e)
);But in this case the error handling doesn't add anything, so I'd just
unwrap. All of these examples panic; a method that doesn't would involve error handling by printing and then returning from the function. Printing is normally worse to debug (no tracebacks) but nicer for end-users.Instead of
*(counts.entry(word.to_string()).or_insert(0)) += 1;which always allocates a new
String, one can index with a borrowed &str. Sadly this isn't supported by the entry API right now, but you can hack around it:if let Some(count) = counts.get_mut(word) {
*count += 1;
continue;
}
counts.insert(word.into(), 1);Note the use of
if let/continue instead of match is because of lexical lifetimes.Since this is currently pretty ugly and the speedup probably doesn't matter, I'll leave this as a hypothetical. You say you "don't like the asterisk", but that's kind'a how it's meant to be done. I guess you could go the Julia route (
get + unwrap_or + insert), but that's not really better.After a few miscellaneous changes, the code for me looks like
extern crate regex;
use std::collections::BTreeMap;
use std::io::{self, BufRead};
use regex::Regex;
fn main() {
let word_re = Regex::new(r"[a-z']+").unwrap();
let mut counts: BTreeMap = BTreeMap::new();
let stdin = io::stdin();
for line in stdin.lock().lines() {
let line = line.unwrap().to_lowercase();
let matches = word_re.find_iter(&line);
let words = matches.map(|(x, y)| &line[x..y]);
for word in words {
*counts.entry(word.into()).or_insert(0) += 1;
}
}
for (key, value) in counts.iter() {
println!("{} {}", key, value);
}
}This isn't as nice as the Julia code, but it's giving you a lot of opportunities to be a lot more efficient, and it's catching a lot more errors. It's true that locking
stdin feels like a chore on 50-line examples, but it fits Rust's macro-goals of faster, safer APIs.Code Snippets
let words = line.to_lowercase().split(...)use std::io;
use std::io::BufRead;use std::io::{self, BufRead};match line_result {
Ok(line) => {
...
},
Err(e) => {
panic!("Error parsing stdin: {:?}", e);
}
}let line = match line_result {
Ok(line) => line,
Err(e) => panic!("Error parsing stdin: {:?}", e),
};
...Context
StackExchange Code Review Q#127350, answer score: 5
Revisions (0)
No revisions yet.