HiveBrain v1.2.0
Get Started
← Back to all entries
patternMajor

What use are groups, monoids, and rings in database computations?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
groupswhatarecomputationsdatabasemonoidsanduserings

Problem

Why would a company like Twitter be interested in algebraic concepts like groups, monoids and rings? See their repository at github:twitter/algebird.

All I could find is:


Implementations of Monoids for interesting approximation algorithms,
such as Bloom filter, HyperLogLog and CountMinSketch. These allow you
to think of these sophisticated operations like you might numbers, and
add them up in hadoop or online to produce powerful statistics and
analytics.

and in another part of the GitHub page:


It was originally developed as part of Scalding's Matrix API, where
Matrices had values which are elements of
Monoids, Groups, or Rings. Subsequently, it was clear that the code had broader application
within Scalding and on other projects within Twitter.

What could this broader application be? within Twitter and for general interest?

It seems like composition aggregations of databases have a monoid-like structure.

Same question on Quora: What is Twitter's interest in abstract algebra (with algebird)?

I have math background but I'm not computer scientist. It would be great to have "real-world" uses of monoids and semi-groups. These are normally considered useless theoretical constructs, and ignored in many abstract algebra courses (for lack of anything interesting to say).

Solution

The main answer is that by exploiting semi-group structure, we can build systems that parallelize correctly without knowing the underlying operation (the user is promising associativity).

By using Monoids, we can take advantage of sparsity (we deal with a lot of sparse matrices, where almost all values are a zero in some Monoid).

By using Rings, we can do matrix multiplication over things other than numbers (which on occasion we have done).

The algebird project itself (as well as the issue history) pretty clearly explains what is going on here: we are building a lot of algorithms for aggregation of large data sets, and leveraging the structure of the operations gives us a win on the systems side (which is usually the pain point when trying to productionize algorithms on 1000s of nodes).

Solve the systems problems once for any Semigroup/Monoid/Group/Ring, and then you can plug in any algorithm without having to think about Memcache, Hadoop, Storm, etc...

Context

StackExchange Computer Science Q#9648, answer score: 28

Revisions (0)

No revisions yet.