HiveBrain v1.2.0
Tags: pattern · sql · Minor

Table just for grouping

Submitted by: @import:stackexchange-dba

Problem

Is it a common case to have a table with a single column whose only purpose is to group rows in another table?

I'm inserting data in batches, and I want an auto-increment key for each batch so that I can group the data by the generated id.

Concretely I want to get from this

A
id, x, y, b_id
id PRIMARY KEY
b_id FOREIGN KEY REFERENCES B.id

B
id, timestamp
id PRIMARY KEY

SELECT COUNT(*) AS number, B.timestamp
FROM A INNER JOIN B ON A.b_id = B.id
WHERE A.x = 'value' AND A.y = 'value'
GROUP BY B.id;


to

A
id, x, y, timestamp, b_id
id PRIMARY KEY
b_id FOREIGN KEY REFERENCES B.id

B
id
id PRIMARY KEY

SELECT COUNT(*) AS number, A.timestamp
FROM A
WHERE A.x = 'value' AND A.y = 'value'
GROUP BY A.b_id, A.timestamp;


So basically move the timestamp from B to A (denormalise it) and keep the foreign key only for grouping. I want to avoid a join whose only purpose is to fetch the timestamp stored in B. The tables are quite big (60M rows) and the join is very slow. If I can filter on A alone and use the foreign key only for grouping, that should speed things up a lot.

Concretely, I'm using MySQL.
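For illustration, the original schema and join-based query can be reproduced at a small scale. This is a sketch in SQLite rather than MySQL (auto-increment syntax differs, and the sample timestamps and row counts are invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Schema from the question: B holds one row per batch, A references it.
cur.execute("CREATE TABLE B (id INTEGER PRIMARY KEY AUTOINCREMENT, timestamp TEXT)")
cur.execute("""CREATE TABLE A (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    x TEXT, y TEXT,
    b_id INTEGER REFERENCES B(id))""")

# Two batches; each insert into B yields a fresh auto-increment batch id.
for ts, rows in [("2020-01-01", 3), ("2020-01-02", 2)]:
    cur.execute("INSERT INTO B (timestamp) VALUES (?)", (ts,))
    b_id = cur.lastrowid
    cur.executemany("INSERT INTO A (x, y, b_id) VALUES (?, ?, ?)",
                    [("value", "value", b_id)] * rows)

# The join-based grouping query from the question.
result = cur.execute("""
    SELECT COUNT(*) AS number, B.timestamp
    FROM A INNER JOIN B ON A.b_id = B.id
    WHERE A.x = 'value' AND A.y = 'value'
    GROUP BY B.id""").fetchall()
print(sorted(result))  # [(2, '2020-01-02'), (3, '2020-01-01')]
```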

Solution

Is it common to mark rows with a batch number? Yes it is. In accounting contexts I have seen this used widely for audit and reconciliation purposes.

Is having a separate table to store batch-related values a good idea? Yes it is. Good practice in relational DB design is to have one table per entity type. Since your problem domain recognises "batch" as a concept, with its own values, it should have a table.

Should this table use an auto-increment column? Well, this is very much an implementation detail. Does your problem domain allow for gaps, negatives, re-use after delete, out-of-order values, etc.? Does your DBMS guarantee the features you need? If so then, sure, go for it. If not, write your own generator with the needed guarantees.
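A home-grown generator of that kind is usually a counter table read and incremented inside one transaction, which (unlike native auto-increment in most DBMSs) yields gap-free ids. A minimal sketch in SQLite, where the table name `batch_seq` and the helper `next_batch_id` are invented for the example:

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# so we control the transaction boundaries explicitly below.
conn = sqlite3.connect(":memory:", isolation_level=None)
cur = conn.cursor()
cur.execute("CREATE TABLE batch_seq (name TEXT PRIMARY KEY, next_id INTEGER)")
cur.execute("INSERT INTO batch_seq VALUES ('batch', 1)")

def next_batch_id(cur, name="batch"):
    """Gap-free generator: read and increment inside a single transaction."""
    cur.execute("BEGIN IMMEDIATE")  # take the write lock up front
    cur.execute("SELECT next_id FROM batch_seq WHERE name = ?", (name,))
    nid = cur.fetchone()[0]
    cur.execute("UPDATE batch_seq SET next_id = next_id + 1 WHERE name = ?", (name,))
    cur.execute("COMMIT")
    return nid

first, second = next_batch_id(cur), next_batch_id(cur)
print(first, second)  # 1 2
```

The trade-off is that the counter row serialises concurrent batch inserts; that is exactly the price of the gap-free, in-order guarantee.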

Will denormalising the timestamp from the "batch" table into the "data" table improve performance? Can't say. It will make the data table wider, which should cause more IO. There may be enough spare space in each page that this doesn't happen, however. If all the data, or at least the working set, fits in memory, this consideration is moot.

If there are good indexes in place and they are being used, I'd guess denormalisation won't change much. You could try rewriting the query; sometimes if the optimizer starts from a different place it ends up with a different plan. Put the SELECT ... FROM A GROUP BY b_id in a subquery, then join that to B to get the timestamp.

You need to understand where the time is consumed and address that. And of course test, test, test.

Context

StackExchange Database Administrators Q#191070, answer score: 2
