HiveBrain v1.2.0
Get Started
← Back to all entries
patternrustMinor

A Rust iterator to match and replace subsequences

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
iteratorandmatchreplacesubsequencesrust

Problem

I needed to do transformations on a data stream, not on an element-by-element basis, but on subsequences. This is what I've come up with:
https://github.com/peterjoel/rust-iter-replace/blob/master/src/lib.rs

I am relatively new to Rust, and would love some feedback on any areas of the code: style, design, memory/performance no-nos. But performance (throughput) is particularly important in this application, as I'll be using it to "compress" very large (multi GB) pieces of data, in memory and on disc.

Design Overview

The central struct Replace, keeps track of two buffers: buffer_out and buffer_in, a set of partial match candidates and some other members for keeping track of state between invocations of next().

buffer_out is data that is fully processed and ready to pass to the next iterator adapter - this will either contain unmatched data, or the full replacement sequence. buffer_in contains data that may or may not match, and gets copied to buffer_out as soon as it can be shown that it doesn't match and gets erased when it does. I chose a VecDeque for buffer_out because it generally gets written to at the back and read from at the start. As I write this, I realise that buffer_in could have been just a Vec. Maybe I'll change that.

The BTreeSet, candidates, keeps track of the index when the first item of a partial match occurred. As soon as a partial match no longer matches then it is discarded. I chose a BTreeSet because I needed to access the smallest element to know when I can flush any part of buffer_in. But actually its elements are added in size order - which I'm not taking advantage of here - so there could be a better data structure which can exploit that invariant.

The section of code where I remove elements from the candidates set isn't great. I originally wrote a trait and had it as a method of BTreeSet, but I had some errors (I forgot now) to do with the type signature, where I couldn't get it to match the expecte

Solution

The 'consume label is a bit strange; just return instead.

This is a bit ugly:

match self.candidates.iter().cloned()
    .next()
    .into_iter()
    .find(|x| self.index - x + 1 == self.replace_from.len()) {
        None => {
            ...
        },
        Some(_) => {
            ...
        }
    }


You can simplify this with any and splitting this onto a new line:

let complete_match = self.candidates.iter().cloned().next().into_iter()
                     .any(|x| self.index - x + 1 == self.replace_from.len());
if complete_match {
    ...
}
else {
    ...
}


The

.iter().cloned().next().into_iter().any(|x| self.index - x + 1 == self.replace_from.len())


part is ugly; using .into_iter().any() on an Option to test it like this is odd, and the cloned is unneeded. How about

.first().map_or(false, |x| *index - x == replace_from.len() - 1);


The complete_match case always returns, so we can drop the else:

let full_match = self.candidates.first().map_or(false, |x|
    *index - x == replace_from.len() - 1
);
if full_match {
    // A match! So replace it and clear all the partial matches
    self.candidates.clear();
    for &x in self.replace_with.iter() {
        self.buffer_out.push_back(x);
    }
    self.buffer_in.clear();
    self.flushed_index = self.index;
    return;
}
// We can flush the inbound buffer up to the first partial match
// (or the full buffer if there are no partial matches)
let flush_index = self.candidates.iter().next().map(|x| x - 1).unwrap_or(self.index);
if flush_index > self.flushed_index {
    let mut flush: VecDeque = self.buffer_in.drain(0 .. flush_index - self.flushed_index).collect();
    self.buffer_out.append(&mut flush);
    self.flushed_index = flush_index;
    return;
}


Then

for &x in self.replace_with.iter() {
    self.buffer_out.push_back(x);
}


is just

self.buffer_out.extend(self.replace_with);


Later

self.candidates.iter().next().map(|x| x - 1).unwrap_or(self.index)


is just

self.candidates.iter().next().map_or(self.index, |x| x - 1)


Idiomatic styling seems to be no spaces around ... So

self.buffer_in.drain(0 .. flush_index - self.flushed_index).collect()


should be

self.buffer_in.drain(..flush_index - self.flushed_index).collect()


That said, all of

let mut flush = self.buffer_in.drain(..flush_index - self.flushed_index).collect();
self.buffer_out.append(&mut flush);


is better as

let flush = self.buffer_in.drain(..flush_index - self.flushed_index);
self.buffer_out.extend(flush);


candidates should be a Vec, since you don't really use the fact it's a BTreeSet usefully. The removal then looks like

self.candidates.retain(|start|
    self.replace_from[self.index - *start] == item
);


...except for the fact that retain borrows self mutably so self.replace_from doesn't work. This is fixable with a destructuring assignment.

{
    let Replace { ref mut candidates, ref replace_from, index, .. } = *self;
    candidates.retain(|start|
        replace_from[index - *start] == item
    );
}


Though at this point it might make sense to just do this at the start of the function. This adds a bit of length but simplifies things.

candidates should really be inserted into before index is incremented. This lets us work with more natural half-open ranges. We can also skip the conditional for candidates.push by just reusing the if in candidates.retain.

let Replace {
    ref mut iter,
    ref mut buffer_out,
    ref mut buffer_in,
    replace_from,
    replace_with,
    ref mut candidates,
    ref mut index,
    ref mut flushed_index
} = *self;

for item in iter {
    buffer_in.push_back(item);
    candidates.push(*index);
    candidates.retain(|&start| replace_from[*index - start] == item);
    *index += 1;

    let full_match = candidates.first().map_or(false, |start|
        *index - start == replace_from.len()
    );
    if full_match {
        // Replace it and clear all the partial matches
        candidates.clear();
        buffer_out.extend(replace_with);
        buffer_in.clear();
        *flushed_index = *index;
        return;
    }
    // We can flush the inbound buffer up to the first partial match
    // (or the full buffer if there are no partial matches)
    let flush_index = candidates.iter().next().unwrap_or(index);
    if flush_index > flushed_index {
        let flush = buffer_in.drain(..flush_index - *flushed_index);
        buffer_out.extend(flush);
        *flushed_index = *flush_index;
        return;
    }
}


Later, we have

pub trait ReplaceIter where
    I: Iterator,
    T: Ord {

    fn replace(self, from: &'a [T], to: &'a [T]) -> Replace;
}

impl  ReplaceIter for I where
    I: Iterator,
    T: Eq + Ord + Copy {

    fn replace(self, from: &'a [T], to: &'a [T]) -> Replace {
        Replace::adapt(self, from, to)
    }
}


which is nicer as

```
pub trait ReplaceIter {

Code Snippets

match self.candidates.iter().cloned()
    .next()
    .into_iter()
    .find(|x| self.index - x + 1 == self.replace_from.len()) {
        None => {
            ...
        },
        Some(_) => {
            ...
        }
    }
let complete_match = self.candidates.iter().cloned().next().into_iter()
                     .any(|x| self.index - x + 1 == self.replace_from.len());
if complete_match {
    ...
}
else {
    ...
}
.iter().cloned().next().into_iter().any(|x| self.index - x + 1 == self.replace_from.len())
.first().map_or(false, |x| *index - x == replace_from.len() - 1);
let full_match = self.candidates.first().map_or(false, |x|
    *index - x == replace_from.len() - 1
);
if full_match {
    // A match! So replace it and clear all the partial matches
    self.candidates.clear();
    for &x in self.replace_with.iter() {
        self.buffer_out.push_back(x);
    }
    self.buffer_in.clear();
    self.flushed_index = self.index;
    return;
}
// We can flush the inbound buffer up to the first partial match
// (or the full buffer if there are no partial matches)
let flush_index = self.candidates.iter().next().map(|x| x - 1).unwrap_or(self.index);
if flush_index > self.flushed_index {
    let mut flush: VecDeque<_> = self.buffer_in.drain(0 .. flush_index - self.flushed_index).collect();
    self.buffer_out.append(&mut flush);
    self.flushed_index = flush_index;
    return;
}

Context

StackExchange Code Review Q#124653, answer score: 5

Revisions (0)

No revisions yet.