HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Better database for "keep always the 5 latest entries per ID and delete older"?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
thelatestperdeletekeepalwaysbetterdatabaseforand

Problem

I have a PostgreSQL database with a history table where I store an fooID (not a primary key serial but a text), an attribute target and the current timestamp whenever the target for this fooID changes:

CREATE TABLE history (
  fooId       text not null,
  target      text not null,
  updated_at  timestamp not null default default now()
);


There are several million entries in this table and a couple of thousand changes per day. Every day I will scan the table and keep the last 5 entries per fooId and delete any older entries.

The DELETE ... WHERE id in ... rank() over (partiton by nr order by created_at... query is not my question, that works. Just takes awful long.

My question is rather: Is a standard PostgreSQL table the best approach for this problem?

Would PostgreSQL table partitioning help here? I know partitioning is used to easily throw away big chunks of data that are older than X days, but partitioning by fooId would seem to create too many partitions in my case.

Are there NoSQL databases that would be much faster because they store the data differently?

Are there other PostgreSQL tricks that help me to store the data differently and more optimized for the use case of the daily purging (SELECTs are not an issue, it gets rarely queried)?

An exclusive lock for an 1 hour per week would be acceptable. There are about 1 million different fooIDs so no chance of having a separate table per fooID.

Solution

One option I can think of is to delete the oldest row as soon as you insert a new one. That will only require a very quick lookup limited to at most 6 rows rather then going through all rows at a time.

To do that efficiently you need a unique key on the table:

create table history 
(
  id          serial primary key, -- to make a lookup on a single row efficient
  fooid       text not null,
  target      text not null,
  updated_at  timestamp not null default now()
);
-- to make finding the oldest row for one fooid efficient
create index on history(fooid, updated_at);


Then create a trigger that only keeps the 5 most recent rows for a fooid:

create or replace function remove_last() 
  returns trigger
as
$
begin
  with ranked as (
    select id, row_number() over (partition by fooid order by updated_at) as rn
    from history
    where id <> new.id
      and fooid = new.fooid 
  )
  delete from history
  where id in (select id 
               from ranked 
               where rn >= 5);
  return new;
end;
$
language plpgsql;

create trigger remove_last_trigger 
  after insert on history
  for each row execute procedure remove_last();


On my laptop, a simple test setup with 5 million rows (1 million different fooid values) showed a trigger overhead that was less then a millisecond:

insert into history (fooid, target) values ('1', 'new stuff');


QUERY PLAN
----------------------------------------------------------------------------------------------------
Insert on stuff.history (cost=0.00..0.01 rows=1 width=76) (actual time=0.062..0.062 rows=0 loops=1)
Buffers: shared hit=8
-> Result (cost=0.00..0.01 rows=1 width=76) (actual time=0.017..0.017 rows=1 loops=1)
Output: nextval('history_id_seq'::regclass), '1'::text, 'new stuff'::text, now()
Buffers: shared hit=1
Planning time: 0.024 ms
Trigger remove_last_trigger: time=0.438 calls=1
Execution time: 0.524 ms

How much overhead that is in your system and whether or not that is acceptable for you I don't know, but "thousand changes per day" doesn't sound like a very busy system.

Code Snippets

create table history 
(
  id          serial primary key, -- to make a lookup on a single row efficient
  fooid       text not null,
  target      text not null,
  updated_at  timestamp not null default now()
);
-- to make finding the oldest row for one fooid efficient
create index on history(fooid, updated_at);
create or replace function remove_last() 
  returns trigger
as
$$
begin
  with ranked as (
    select id, row_number() over (partition by fooid order by updated_at) as rn
    from history
    where id <> new.id
      and fooid = new.fooid 
  )
  delete from history
  where id in (select id 
               from ranked 
               where rn >= 5);
  return new;
end;
$$
language plpgsql;

create trigger remove_last_trigger 
  after insert on history
  for each row execute procedure remove_last();
insert into history (fooid, target) values ('1', 'new stuff');

Context

StackExchange Database Administrators Q#186041, answer score: 3

Revisions (0)

No revisions yet.