HiveBrain v1.2.0
Get Started
← Back to all entries
patterncppMinor

C++14-ish code for creating q-grams of a string

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
gramscreatingishforcodestring

Problem

I want to return an iterable collection of all q-grams of a provided string. This means all sub-words of length Q. Let Q be 3 for example:

normalized as an already whitespace-stripped and upcased word or string. PREFIX3 and SUFFIX3 are constant strings ^^ and$$ that are put surrounding the string before splitting. That is a common trick to get more q-words.

For the word INDEX the resulting q-grams are therefore:

^^I  ^IN  IND  NDE  DEX  EX$  X$


My version uses several C++14 constructs. This is intended, because the code is part of a lesson for teaching about C++14.

Is there anything I could do better, especially with respect to C++14-ish code?

vector qwordify(const string& normalized) const
{
    auto word = PREFIX3 + normalized + SUFFIX3; 
    vector result {};
    auto left = word.cbegin();
    auto right = word.cbegin() + Q; // valid, because |"^^"|+|"$"| => 3.
    for( ; right <= word.end(); ++left, ++right) {
        result.emplace_back(left, right);
    }
    return result;
}


Here are the things I considered especially:

  • Despite of returning a value, result is not copied, because of RVO.



  • Lots of auto, which I like.



  • Use of cbegin() and cend(), because I declared auto word and not const auto word. I could have done, but for this demonstration I liked to show the c-variants.



  • Use of emplace_back, which really saves a copy here, correct?



I feel like some places could be improved, maybe. The iterators maybe? Getting rid of the loop in favor of an algorithm? Maybe even not returning a full vector at all, but only an iterable proxy-object, but that would too long a shot for this small example, I think.

Solution

Things I like:

  • (Mostly) a pretty neatly packaged, self-contained component.



  • Written primarily to emphasize simplicity



  • and let the compiler deal with making it fast



Things I don't like so well:

  • Q, PREFIX3, and SUFFIX3 are apparently globals (or maybe class variables?) so I can't see what they really are.



  • Subtle interaction: PREFIX3 and SUFFIX3 both (seem to) implicitly assume that Q = 3.



  • qword has a well-known meaning in Intel assembly language. Unless you're sure this will only be used by an audience that immediately knows what you intend, another name might be better.



  • Written non-generically, for no particularly good reason.



  • It almost seems like qwordify is really two separate things shoved together: a generic "generate all the results in a sliding window", and a specialized "generate all the qwords from a string".



  • If you can use the new string_view, it would be a good choice here.



Ignoring point 6 (since it's not really C++14), we could end up with code on this order:

// the generic algorithm:
template 
void window(InIt b, InIt e, OutIt d, size_type len) {
    for (auto s = std::next(b, len); s != e; ++s, ++b, ++d)
        *d = { b, s };
}

// The piece specifically for q-grams
std::vector generate_qgram(std::string const &in, size_t q) {
    std::string input = std::string(q-1, '^') + in + std::string(q, ');

    std::vector words;

    window(input.begin(), input.end(), std::back_inserter(words), q);
    return words;
}

// and a little demo code:
int main() {
    auto ret = generate_qgram("INDEX", 3);

    std::copy(ret.begin(), ret.end(), std::ostream_iterator(std::cout, "\n"));
}

Code Snippets

// the generic algorithm:
template <class InIt, class OutIt, class size_type>
void window(InIt b, InIt e, OutIt d, size_type len) {
    for (auto s = std::next(b, len); s != e; ++s, ++b, ++d)
        *d = { b, s };
}

// The piece specifically for q-grams
std::vector<std::string> generate_qgram(std::string const &in, size_t q) {
    std::string input = std::string(q-1, '^') + in + std::string(q, '$');

    std::vector<std::string> words;

    window(input.begin(), input.end(), std::back_inserter(words), q);
    return words;
}

// and a little demo code:
int main() {
    auto ret = generate_qgram("INDEX", 3);

    std::copy(ret.begin(), ret.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
}

Context

StackExchange Code Review Q#139394, answer score: 3

Revisions (0)

No revisions yet.