principlesqlModerate
Lookup performance of Numeric vs String
Viewed 0 times
lookupstringperformancenumeric
Problem
I'm working on a project that is using the instagram key format.
TL;DR 64-bit integer ids.
They'll be used for lookups, and we also like them for sorting and batching since they will naturally sort by creation time.
The values are between 2^63 and 2^64, so (just) too big to fit inside a
So it seems our options for storage are
TL;DR 64-bit integer ids.
They'll be used for lookups, and we also like them for sorting and batching since they will naturally sort by creation time.
The values are between 2^63 and 2^64, so (just) too big to fit inside a
BIGINT.So it seems our options for storage are
numeric(20) or varchar. varchar is not as ideal since we'd have to zero pad them for sorting to work, but would there be a performance hit to using a numeric for lookups?Solution
Hate to be captain obvious on this one, but Instagram generously provides a function that you linked to that stores the keys as
They're actually using PostgreSQL. From that function you can see that they're returning a
We know this because hardcoding a shard of
Quick test on their code,
Decomposing the ID
Now let's play. For teh extra sexy, we can create helper functions that get the internal components from the ID. In the event you want to know the shard Instagram is using or their internal timestamp.
Playing around, let's get a test id.
Returns the following,
Rolling our own Instagram ID Domain
You can even create an explicit
Let's drop what we had,
And start over,
Everything works as before, but now you can
This is likely the best solution you can obtain shy of a C implementation, and you probably
bigint.CREATE SCHEMA insta5;
CREATE SEQUENCE insta5.table_id_seq;
CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $
DECLARE
our_epoch bigint := 1314220021721;
seq_id bigint;
now_millis bigint;
shard_id int := 5;
BEGIN
-- The %1024, is just a way of saying they only want 10bit wraparound.
SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
result := (now_millis - our_epoch) << 23;
result := result | (shard_id << 10);
result := result | (seq_id);
END;
$ LANGUAGE plpgsql;They're actually using PostgreSQL. From that function you can see that they're returning a
bigint. So certainly you can store the result of that function in bigint. As a special note, this is likely not the function that they're using. That function probably has a signature more like this,insta5.next_id(smallint shard, OUT result bigint);We know this because hardcoding a shard of
5 is not all that useful, and they seem to indicate that they're using this functionality. So in that blog id they brag that their ID compromises of- 64bits total
- 64-23 = 41 bits for timestamp
- 64-41 = 23 bits for shard + sequence id
- 10 bits for the sequence id.
- 13 bits for the shard.
Quick test on their code,
test=# SELECT insta5.next_id();
next_id
---------------------
1671372309237077023
(1 row)Decomposing the ID
Now let's play. For teh extra sexy, we can create helper functions that get the internal components from the ID. In the event you want to know the shard Instagram is using or their internal timestamp.
-- 13 bits for shard
CREATE FUNCTION insta5.get_shard(id bigint)
RETURNS smallint
AS $
SELECT ((id>51)::smallint;
$ LANGUAGE sql;
-- 10 bits for sequence id
CREATE FUNCTION insta5.get_sequence(id bigint)
RETURNS smallint
AS $
SELECT ((id>54)::smallint;
$ LANGUAGE sql;
-- 41 bits for timestamp
CREATE OR REPLACE FUNCTION insta5.get_ts(id bigint)
RETURNS timestamp without time zone
AS $
SELECT to_timestamp(((id >> 23) + 1314220021721 ) / 1000 )::timestamp without time zone;
$ LANGUAGE sql;Playing around, let's get a test id.
SELECT insta5.next_id();
next_id
---------------------
1671390786412876801
(1 row)
SELECT
insta5id,
insta5.get_ts(insta5id),
insta5.get_shard(insta5id),
insta5.get_sequence(insta5id)
FROM (VALUES
(1671390786412876801::bigint),
(insta5.next_id())
) AS t(insta5id);Returns the following,
insta5id | get_ts | get_shard | get_sequence
---------------------+---------------------+-----------+--------------
1671390786412876801 | 2017-12-16 17:02:09 | 5 | 1
1671392537048257538 | 2017-12-16 17:05:38 | 5 | 2
(2 rows)Rolling our own Instagram ID Domain
You can even create an explicit
DOMAIN over the type if you wish to really clean this up.. This is how I would personally store this, note I made a few further modifications.- I added
COMMENTS-- always good practice.
- Made the functions
IMMUTABLE
- Added
insta5.next_idrequire an explicit shard.
Let's drop what we had,
DROP SCHEMA insta5 CASCADE;And start over,
CREATE SCHEMA insta5;
COMMENT ON SCHEMA insta5 IS 'Instagram';
CREATE DOMAIN insta5.id AS bigint;
COMMENT ON DOMAIN insta5.id IS $Instagram's internal ID type, based on example from "Sharding & IDs at Instagram"$;
CREATE SEQUENCE insta5.table_id_seq;
CREATE OR REPLACE FUNCTION insta5.next_id(shard_id smallint)
RETURNS insta5.id
AS $
DECLARE
our_epoch bigint := 1314220021721;
seq_id bigint;
result insta5.id;
now_millis bigint;
BEGIN
SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
result := (now_millis - our_epoch) >51)::smallint;
$ LANGUAGE sql
IMMUTABLE;
COMMENT ON FUNCTION insta5.get_shard(insta5.id)
IS '13 bits from insta5.id representing shard';
CREATE OR REPLACE FUNCTION insta5.get_sequence(id insta5.id)
RETURNS smallint
AS $
SELECT ((id>54)::smallint;
$ LANGUAGE sql
IMMUTABLE;
COMMENT ON FUNCTION insta5.get_sequence(insta5.id)
IS '10 bits from insta5.id representing sequence';
CREATE OR REPLACE FUNCTION insta5.get_ts(id insta5.id)
RETURNS timestamp without time zone
AS $
SELECT to_timestamp(((id >> 23) + 1314220021721 ) / 1000 )::timestamp without time zone;
$ LANGUAGE sql
IMMUTABLE;
COMMENT ON FUNCTION insta5.get_ts(insta5.id)
IS '41 bits from insta5.id representing timestamp';Everything works as before, but now you can
CREATE SCHEMA mySchema;
CREATE TABLE mySchema.mydata ( insta5id insta5.id ) ;This is likely the best solution you can obtain shy of a C implementation, and you probably
Code Snippets
CREATE SCHEMA insta5;
CREATE SEQUENCE insta5.table_id_seq;
CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
our_epoch bigint := 1314220021721;
seq_id bigint;
now_millis bigint;
shard_id int := 5;
BEGIN
-- The %1024, is just a way of saying they only want 10bit wraparound.
SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
result := (now_millis - our_epoch) << 23;
result := result | (shard_id << 10);
result := result | (seq_id);
END;
$$ LANGUAGE plpgsql;insta5.next_id(smallint shard, OUT result bigint);test=# SELECT insta5.next_id();
next_id
---------------------
1671372309237077023
(1 row)-- 13 bits for shard
CREATE FUNCTION insta5.get_shard(id bigint)
RETURNS smallint
AS $$
SELECT ((id<<41)>>51)::smallint;
$$ LANGUAGE sql;
-- 10 bits for sequence id
CREATE FUNCTION insta5.get_sequence(id bigint)
RETURNS smallint
AS $$
SELECT ((id<<54)>>54)::smallint;
$$ LANGUAGE sql;
-- 41 bits for timestamp
CREATE OR REPLACE FUNCTION insta5.get_ts(id bigint)
RETURNS timestamp without time zone
AS $$
SELECT to_timestamp(((id >> 23) + 1314220021721 ) / 1000 )::timestamp without time zone;
$$ LANGUAGE sql;SELECT insta5.next_id();
next_id
---------------------
1671390786412876801
(1 row)
SELECT
insta5id,
insta5.get_ts(insta5id),
insta5.get_shard(insta5id),
insta5.get_sequence(insta5id)
FROM (VALUES
(1671390786412876801::bigint),
(insta5.next_id())
) AS t(insta5id);Context
StackExchange Database Administrators Q#193287, answer score: 10
Revisions (0)
No revisions yet.