HiveBrain v1.2.0
Get Started
← Back to all entries
snippetMinor

How do I keep data in denormalized Cassandra tables in sync?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
tableskeepdenormalizedhowcassandradatasync

Problem

I have data that is spread across multiple denormalised Cassandra tables. Each table is optimised for fast retrieval by a corresponding application query but it means that data is duplicated across different tables.

How do we keep them synchronised? What is the recommended approach for updating denormalised tables?

Solution

Background

If you are new to Cassandra and have only just started to delve into data modelling, it’s understandable that you would be unfamiliar with the idea of having denormalized tables. In relational databases, we are used to normalizing entities into separate tables and using foreign keys to tie together related records.

In Cassandra, we do the exact opposite and we instead duplicate data in multiple tables to optimise reads. You’ll have situations where the same piece of data is duplicated in 2 or more tables. An example is where you have users commenting on videos uploaded to a video sharing site such as killrvideo.com. The application can display all the comments associated with a video or the user can list all the comments they have posted on the site. This means we have to store the comments in 2 separate tables -- one called comments_by_video and another called comments_by_user. Here is what the tables look like:

CREATE TABLE comments_by_video (
   videoid uuid,
   commentid timeuuid,
   userid uuid,
   comment text,
   PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);


CREATE TABLE comments_by_user (
   userid uuid,
   commentid timeuuid,
   videoid uuid,
   comment text,
   PRIMARY KEY (userid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);


Solution

So back to the question: How do we keep the tables in sync? The Cassandra Query Language (or CQL) has a BATCH feature that groups inserts and updates into one atomic transaction. When a user posts a new comment on the KillrVideo site, we need to do 2 insert statements -- an INSERT to the comments_by_video table, and another to the comments_by_user table:

INSERT INTO comments_by_video (videoid, commentid, userid, comment) VALUES (...);


INSERT INTO comments_by_user (userid, commentid, videoid, comment) VALUES (...);


In cqlsh, we enclose these 2 inserts in a BATCH so they both get executed as one transaction:

BEGIN BATCH
  INSERT INTO comments_by_video (videoid, commentid, userid, comment) VALUES (...);
  INSERT INTO comments_by_user (userid, commentid, videoid, comment) VALUES (...);
APPLY BATCH;


Here's an example for executing a batch in Java:

PreparedStatement psCommentsByVideo = session.prepare(
    "INSERT INTO comments_by_video (videoid, commentid, userid, comment) "
        + "VALUES (...)");
PreparedStatement psCommentsByUser = session.prepare(
    "INSERT INTO comments_by_user (userid, commentid, videoid, comment) "
        + "VALUES (...)");

BatchStatement batch = BatchStatement.newInstance(
    DefaultBatchType.LOGGED,
    psCommentsByVideo.bind(...),
    psCommentsByUser.bind(...);

session.execute(batch);


The atomic guarantee for batches is that either ALL the statements are done successfully or none of them happens. This ensures that if one table is updated we know that all of the tables are updated.
Caution

It is important to note that batches in Cassandra are not an optimisation in the same way that you use batches in relational databases. CQL batches are not intended to be used for bulk-loading data and will in fact be slower compared to individual write requests.

Only use batches when necessary or you risk overloading the coordinator nodes which can affect the overall performance of your cluster.
References

  • Tutorial - Application Development | BATCHes @ DataStax Academy



  • Docs - Batching data insertion and updates @ DataStax Docs

Code Snippets

CREATE TABLE comments_by_video (
   videoid uuid,
   commentid timeuuid,
   userid uuid,
   comment text,
   PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
CREATE TABLE comments_by_user (
   userid uuid,
   commentid timeuuid,
   videoid uuid,
   comment text,
   PRIMARY KEY (userid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
INSERT INTO comments_by_video (videoid, commentid, userid, comment) VALUES (...);
INSERT INTO comments_by_user (userid, commentid, videoid, comment) VALUES (...);
BEGIN BATCH
  INSERT INTO comments_by_video (videoid, commentid, userid, comment) VALUES (...);
  INSERT INTO comments_by_user (userid, commentid, videoid, comment) VALUES (...);
APPLY BATCH;

Context

StackExchange Database Administrators Q#320094, answer score: 3

Revisions (0)

No revisions yet.