HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Benchmarking of SQL data crunching using a common datum

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
sqlcommonusingdatumcrunchingdatabenchmarking

Problem

@nhgrif and I were curious about a couple different ways to aggregate a number of records based on a duplicate/common datum. Two different ideas will be shown. Any advice on all code but especially making it perform better is appreciated.

Note: @nhgrif did not write most of this, so if it's wrong it's my fault...

Make a bunch of data

Code:

set search_path = test;
-- create tables to seed a large number of records
drop table if exists contract_number_seed;
create table contract_number_seed(
    contract_number int
    );
insert into contract_number_seed 
    (contract_number)
    select * from generate_series(1, 100000)
    ;
drop table if exists serial_number_seed;
create table serial_number_seed(
    serial_number int
    );
insert into serial_number_seed
    (serial_number)
    select * from generate_series(1, 10)
    ;
drop table if exists status_seed;
create table status_seed(
    status text
    );
insert into status_seed (status)
values (NULL), ('one'), ('two'), ('three');
-- generate a whole bunch of records from seed tables
drop table if exists seeded;
select 
    contract_number_seed.contract_number,
    serial_number_seed.serial_number,
    status_seed.status
into seeded
from contract_number_seed
cross join serial_number_seed
cross join status_seed;
-- Query returned successfully: 4000000 rows affected, 11368 ms execution time.


Just for the record:

set search_path = test;
select * from seeded;
-- Total query runtime: 8056 ms.
-- 4000000 rows retrieved.


Using LEFT OUTER JOIN

PS: Capitalization is different in this; @nhgrif wrote it and I was lazy.

PPS: Table aliases are not that great but I'm just benchmarking, please forgive...

```
SET SEARCH_PATH = test;
SELECT
seeded.Serial_Number,
COUNT(one.Contract_Number) as One,
COUNT(two.Contract_Number) as Two,
COUNT(three.Contract_Number) as Three
FROM seeded
LEFT JOIN seeded as one
ON one.Contract_Number = seeded.Contract_Number
AND seeded.Status = 'one'

Solution

There is a bug in your second version. Here is one row from the result:

1;0;2;0


That is, Serial_Number 1 has zero ones, two twos, and zero threes.

But let's do a quick sanity check:

select count(*) from seeded
where serial_number = 1 and status = 'one';


This query returns 100000.

There are also rows with the same Serial_Number, which I think is not what you want.

I think you want a query like this:

select 
    seeded.Serial_Number, 
    sum(case when seeded.Status = 'one' then 1 else 0 end) as ones,
    sum(case when seeded.Status = 'two' then 1 else 0 end) as twos,
    sum(case when seeded.Status = 'three' then 1 else 0 end) as threes
from seeded
group by seeded.Serial_Number;


This takes ~2.1s on my machine.

Just for fun, let's look at what pgAdmin's EXPLAIN visualisation looks like for this query

versus the 30m query

Code Snippets

select count(*) from seeded
where serial_number = 1 and status = 'one';
select 
    seeded.Serial_Number, 
    sum(case when seeded.Status = 'one' then 1 else 0 end) as ones,
    sum(case when seeded.Status = 'two' then 1 else 0 end) as twos,
    sum(case when seeded.Status = 'three' then 1 else 0 end) as threes
from seeded
group by seeded.Serial_Number;

Context

StackExchange Code Review Q#60446, answer score: 8

Revisions (0)

No revisions yet.