patternrustMinor
A Rust iterator to match and replace subsequences
Viewed 0 times
iteratorandmatchreplacesubsequencesrust
Problem
I needed to do transformations on a data stream, not on an element-by-element basis, but on subsequences. This is what I've come up with:
https://github.com/peterjoel/rust-iter-replace/blob/master/src/lib.rs
I am relatively new to Rust, and would love some feedback on any areas of the code: style, design, memory/performance no-nos. But performance (throughput) is particularly important in this application, as I'll be using it to "compress" very large (multi GB) pieces of data, in memory and on disc.
Design Overview
The central struct
The
The section of code where I remove elements from the
https://github.com/peterjoel/rust-iter-replace/blob/master/src/lib.rs
I am relatively new to Rust, and would love some feedback on any areas of the code: style, design, memory/performance no-nos. But performance (throughput) is particularly important in this application, as I'll be using it to "compress" very large (multi GB) pieces of data, in memory and on disc.
Design Overview
The central struct
Replace, keeps track of two buffers: buffer_out and buffer_in, a set of partial match candidates and some other members for keeping track of state between invocations of next().buffer_out is data that is fully processed and ready to pass to the next iterator adapter - this will either contain unmatched data, or the full replacement sequence. buffer_in contains data that may or may not match, and gets copied to buffer_out as soon as it can be shown that it doesn't match and gets erased when it does. I chose a VecDeque for buffer_out because it generally gets written to at the back and read from at the start. As I write this, I realise that buffer_in could have been just a Vec. Maybe I'll change that.The
BTreeSet, candidates, keeps track of the index when the first item of a partial match occurred. As soon as a partial match no longer matches then it is discarded. I chose a BTreeSet because I needed to access the smallest element to know when I can flush any part of buffer_in. But actually its elements are added in size order - which I'm not taking advantage of here - so there could be a better data structure which can exploit that invariant. The section of code where I remove elements from the
candidates set isn't great. I originally wrote a trait and had it as a method of BTreeSet, but I had some errors (I forgot now) to do with the type signature, where I couldn't get it to match the expecteSolution
The
This is a bit ugly:
You can simplify this with
The
part is ugly; using
The
Then
is just
Later
is just
Idiomatic styling seems to be no spaces around
should be
That said, all of
is better as
...except for the fact that
Though at this point it might make sense to just do this at the start of the function. This adds a bit of length but simplifies things.
Later, we have
which is nicer as
```
pub trait ReplaceIter {
'consume label is a bit strange; just return instead.This is a bit ugly:
match self.candidates.iter().cloned()
.next()
.into_iter()
.find(|x| self.index - x + 1 == self.replace_from.len()) {
None => {
...
},
Some(_) => {
...
}
}You can simplify this with
any and splitting this onto a new line:let complete_match = self.candidates.iter().cloned().next().into_iter()
.any(|x| self.index - x + 1 == self.replace_from.len());
if complete_match {
...
}
else {
...
}The
.iter().cloned().next().into_iter().any(|x| self.index - x + 1 == self.replace_from.len())part is ugly; using
.into_iter().any() on an Option to test it like this is odd, and the cloned is unneeded. How about.first().map_or(false, |x| *index - x == replace_from.len() - 1);The
complete_match case always returns, so we can drop the else:let full_match = self.candidates.first().map_or(false, |x|
*index - x == replace_from.len() - 1
);
if full_match {
// A match! So replace it and clear all the partial matches
self.candidates.clear();
for &x in self.replace_with.iter() {
self.buffer_out.push_back(x);
}
self.buffer_in.clear();
self.flushed_index = self.index;
return;
}
// We can flush the inbound buffer up to the first partial match
// (or the full buffer if there are no partial matches)
let flush_index = self.candidates.iter().next().map(|x| x - 1).unwrap_or(self.index);
if flush_index > self.flushed_index {
let mut flush: VecDeque = self.buffer_in.drain(0 .. flush_index - self.flushed_index).collect();
self.buffer_out.append(&mut flush);
self.flushed_index = flush_index;
return;
}Then
for &x in self.replace_with.iter() {
self.buffer_out.push_back(x);
}is just
self.buffer_out.extend(self.replace_with);Later
self.candidates.iter().next().map(|x| x - 1).unwrap_or(self.index)is just
self.candidates.iter().next().map_or(self.index, |x| x - 1)Idiomatic styling seems to be no spaces around
... Soself.buffer_in.drain(0 .. flush_index - self.flushed_index).collect()should be
self.buffer_in.drain(..flush_index - self.flushed_index).collect()That said, all of
let mut flush = self.buffer_in.drain(..flush_index - self.flushed_index).collect();
self.buffer_out.append(&mut flush);is better as
let flush = self.buffer_in.drain(..flush_index - self.flushed_index);
self.buffer_out.extend(flush);candidates should be a Vec, since you don't really use the fact it's a BTreeSet usefully. The removal then looks likeself.candidates.retain(|start|
self.replace_from[self.index - *start] == item
);...except for the fact that
retain borrows self mutably so self.replace_from doesn't work. This is fixable with a destructuring assignment.{
let Replace { ref mut candidates, ref replace_from, index, .. } = *self;
candidates.retain(|start|
replace_from[index - *start] == item
);
}Though at this point it might make sense to just do this at the start of the function. This adds a bit of length but simplifies things.
candidates should really be inserted into before index is incremented. This lets us work with more natural half-open ranges. We can also skip the conditional for candidates.push by just reusing the if in candidates.retain.let Replace {
ref mut iter,
ref mut buffer_out,
ref mut buffer_in,
replace_from,
replace_with,
ref mut candidates,
ref mut index,
ref mut flushed_index
} = *self;
for item in iter {
buffer_in.push_back(item);
candidates.push(*index);
candidates.retain(|&start| replace_from[*index - start] == item);
*index += 1;
let full_match = candidates.first().map_or(false, |start|
*index - start == replace_from.len()
);
if full_match {
// Replace it and clear all the partial matches
candidates.clear();
buffer_out.extend(replace_with);
buffer_in.clear();
*flushed_index = *index;
return;
}
// We can flush the inbound buffer up to the first partial match
// (or the full buffer if there are no partial matches)
let flush_index = candidates.iter().next().unwrap_or(index);
if flush_index > flushed_index {
let flush = buffer_in.drain(..flush_index - *flushed_index);
buffer_out.extend(flush);
*flushed_index = *flush_index;
return;
}
}Later, we have
pub trait ReplaceIter where
I: Iterator,
T: Ord {
fn replace(self, from: &'a [T], to: &'a [T]) -> Replace;
}
impl ReplaceIter for I where
I: Iterator,
T: Eq + Ord + Copy {
fn replace(self, from: &'a [T], to: &'a [T]) -> Replace {
Replace::adapt(self, from, to)
}
}which is nicer as
```
pub trait ReplaceIter {
Code Snippets
match self.candidates.iter().cloned()
.next()
.into_iter()
.find(|x| self.index - x + 1 == self.replace_from.len()) {
None => {
...
},
Some(_) => {
...
}
}let complete_match = self.candidates.iter().cloned().next().into_iter()
.any(|x| self.index - x + 1 == self.replace_from.len());
if complete_match {
...
}
else {
...
}.iter().cloned().next().into_iter().any(|x| self.index - x + 1 == self.replace_from.len()).first().map_or(false, |x| *index - x == replace_from.len() - 1);let full_match = self.candidates.first().map_or(false, |x|
*index - x == replace_from.len() - 1
);
if full_match {
// A match! So replace it and clear all the partial matches
self.candidates.clear();
for &x in self.replace_with.iter() {
self.buffer_out.push_back(x);
}
self.buffer_in.clear();
self.flushed_index = self.index;
return;
}
// We can flush the inbound buffer up to the first partial match
// (or the full buffer if there are no partial matches)
let flush_index = self.candidates.iter().next().map(|x| x - 1).unwrap_or(self.index);
if flush_index > self.flushed_index {
let mut flush: VecDeque<_> = self.buffer_in.drain(0 .. flush_index - self.flushed_index).collect();
self.buffer_out.append(&mut flush);
self.flushed_index = flush_index;
return;
}Context
StackExchange Code Review Q#124653, answer score: 5
Revisions (0)
No revisions yet.