HiveBrain v1.2.0
Get Started
← Back to all entries
patterncppMinor

C++ HyperLogLog Implementation

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
hyperloglogimplementationstackoverflow

Problem

I'm a scientific C programmer moving my way over to using Modern C++. I found myself needing a HyperLogLog implementation, and I wanted to use this for practice.

I plan to move a number of these functions over into an accompanying header file, but I do find it convenient at the moment to only need to include it (esp. in unit tests), and it was easier to post this to CodeReview as one chunk anyhow.

I know I'm not quite writing idiomatic C++, and that's why I've posted this here. I want to improve my code to be more modern, as well as just good code.

It does seem to work -- I have unit tests for random integers, all integers in given ranges (e.g., 1-100000), and real world examples, covering addition and set difference operations.

However, I'm particularly unsure about my copy and move constructors and my assignment operators. I have C++17 at my disposal, but I am currently only using features though C++14.

```
#ifndef _HLL_H_
#define _HLL_H_
#include
#include
#include
#include
#include
#include
#include
#include
#include "logutil.h" // for LOG_DEBUG

namespace mystuff {

constexpr double make_alpha(size_t m) {
switch(m) {
case 16: return .673;
case 32: return .697;
case 64: return .709;
default: return 0.7213 / (1 + 1.079/m);
}
}

class hll_t {
// HyperLogLog implementation.
// To make it general, the actual point of entry is a 64-bit integer hash function.
// Therefore, you have to perform a hash function to convert various types into a suitable query.
// We could also cut our memory requirements by switching to only using 6 bits per element,
// (up to 64 leading zeros), though the gains would be relatively small
// given how memory-efficient this structure is.

// Attributes
size_t np_;
size_t m_;
double alpha_;
double relative_error_;
std::vector core_;
double sum_;
int is_calculated_;

public:
// Constructor
hll_t(size_t np=20): np_(np), m_(1 > (64 - np_);
const uint3

Solution

I won't get into the details of the actual hyperloglog since I'm not familiar with, but I can certainly provide advice about modern C++. Here we go:

-
In your move assignment operator, core_ = other.core_ should be core_ = std::move(other.core_) if you really want to take advantage of move semantics.

-
You explicitly pass other.m_ to your copy constructor initialization, even though you implemented it in terms of your copy assignment operator, which already copies other.m_. This is a bit redundant.

-
I see that you implement your copy constructor in terms of your copy assignment operator. In general the assignment operator is implemented in term of the constructor, using the copy-and-swap idiom. It should look like this:

hll_t &operator=(hll_t other) {
    using std::swap;
    swap(*this, other);
    return *this;
}


The explanation is quite subtle, but basically it has to do with code factoring, correctness and exception safety. You should really look at the linked post, which explains the advantages of the idiom in great details.

-
That said, since your copy/move constructor/assignment operator just copies/moves everything, the best way to implement it is still to tell to the compiler to generate the correct implementations for you, avoiding errors altogether:

hll_t(const hll_t&) = default;
hll_t(hll_t&&) = default;
hll_t& operator=(const hll_t&) = default;
hll_t& operator=(hll_t&&) = default;


The default implementations will just copy or move every member of the class, which is exactly what your implementation does, so you can delete your manual implementation of these functions and let the compiler do the job :)

-
You should mark the one-parameter constructor explicit to avoid implicit conversions from double (such conversions are unwanted most of the time). That said, if you do so, the default constructor will also be explicit, which isn't desirable. The best solution would be to separate them and to use constructor delegation like this:

hll_t(std::size_t np): np_(np), m_(1 << np), alpha_(make_alpha(m_)),
                       relative_error_(1.03896 / std::sqrt(m_)),
                       core_(m_, 0),
                       sum_(0.), is_calculated_(0) {}

hll_t(): hll_t(20) {}


-
You should always use the type bool and the constants true and false for is_calculated_ since it's obviously a boolean variable (is_ready even returns a bool already). It will make things clearer for everyone at first glance.

-
When a method does not alter the instance of the class, you can mark it as const to make sure that you can call said method, even if the hll_t instance is const. The method is_ready can be const-qualified:

bool is_ready() const {
    return is_calculated_;
}


-
If you didn't have a two-phase initialization with is_calculted_ and computed everything at construction, then many more functions could be marked as const and hll_t could actually be usable as a const object.

You could make some variables mutable and const-qualify more methods, but I'd always think at least twice before making anything mutable. It's alays like opening a strange door. I know that it's sometimes used when you have expensive computations and caches like you do. Please read a lot about it if you ever decide to perform such a change.

-
While operators like operator@= can only be implemented directly into the class, it's idiomatic to implement operators like operator@ outside of the class, in terms of the operator@=. In your case operator^= wouldn't make sense, but you could reimplement operator^ as a free function:

double operator^(hll_t& lhs, hll_t rhs&) {
    hll_t tmp(lhs);
    lhs += rhs;
    tmp.sum();
    return lhs.report() + rhs.report() - tmp.report();
}


I guess that you can even drop the explicit call to tmp.sum() since tmp.report() already computes it.

-
It's good practice to always std::-qualify components of your library. You've done pretty well, but still forgot to fully qualify std::uin8_t and std::size_t.

-
As others have said in the comments, identifiers starting with an underscore followed by a capital letter are reserved for the implementers of the compiler and standard library. _HLL_H_ can easily be renamed HLL_H_ to solve that problem.

It's true that POSIX officially reserves *_t names in the global namespace, but that's probably highly disregarded by most. I'd have to check but I believe that many popular C and/or C++ libraries simply don't care.

That said you could rename your class hyperloglog, which isn't that long, is really descriptive, and doesn't use any reserved identifier.

-
It's excellent practice to (almost) always use braces after control statements like if and for, even if they only exist for a single statement. It will avoid stupid problems like Apple's goto fail; bug in the long run. It also helps to better understand scopes when visual ind

Code Snippets

hll_t &operator=(hll_t other) {
    using std::swap;
    swap(*this, other);
    return *this;
}
hll_t(const hll_t&) = default;
hll_t(hll_t&&) = default;
hll_t& operator=(const hll_t&) = default;
hll_t& operator=(hll_t&&) = default;
hll_t(std::size_t np): np_(np), m_(1 << np), alpha_(make_alpha(m_)),
                       relative_error_(1.03896 / std::sqrt(m_)),
                       core_(m_, 0),
                       sum_(0.), is_calculated_(0) {}

hll_t(): hll_t(20) {}
bool is_ready() const {
    return is_calculated_;
}
double operator^(hll_t& lhs, hll_t rhs&) {
    hll_t tmp(lhs);
    lhs += rhs;
    tmp.sum();
    return lhs.report() + rhs.report() - tmp.report();
}

Context

StackExchange Code Review Q#146281, answer score: 4

Revisions (0)

No revisions yet.