HiveBrain v1.2.0
Get Started
← Back to all entries
principlesqlModerate

Lookup performance of Numeric vs String

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
lookupstringperformancenumeric

Problem

I'm working on a project that is using the instagram key format.

TL;DR 64-bit integer ids.

They'll be used for lookups, and we also like them for sorting and batching since they will naturally sort by creation time.

The values are between 2^63 and 2^64, so (just) too big to fit inside a BIGINT.

So it seems our options for storage are numeric(20) or varchar. varchar is not as ideal since we'd have to zero pad them for sorting to work, but would there be a performance hit to using a numeric for lookups?

Solution

Hate to be captain obvious on this one, but Instagram generously provides a function that you linked to that stores the keys as bigint.

CREATE SCHEMA insta5;
CREATE SEQUENCE insta5.table_id_seq;

CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    -- The %1024, is just a way of saying they only want 10bit wraparound.
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;

    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$ LANGUAGE plpgsql;


They're actually using PostgreSQL. From that function you can see that they're returning a bigint. So certainly you can store the result of that function in bigint. As a special note, this is likely not the function that they're using. That function probably has a signature more like this,

insta5.next_id(smallint shard, OUT result bigint);


We know this because hardcoding a shard of 5 is not all that useful, and they seem to indicate that they're using this functionality. So in that blog id they brag that their ID compromises of

  • 64bits total



  • 64-23 = 41 bits for timestamp



  • 64-41 = 23 bits for shard + sequence id



  • 10 bits for the sequence id.



  • 13 bits for the shard.



Quick test on their code,

test=# SELECT insta5.next_id();
       next_id       
---------------------
 1671372309237077023
(1 row)


Decomposing the ID

Now let's play. For teh extra sexy, we can create helper functions that get the internal components from the ID. In the event you want to know the shard Instagram is using or their internal timestamp.

-- 13 bits for shard
CREATE FUNCTION insta5.get_shard(id bigint)
RETURNS smallint
AS $
  SELECT ((id>51)::smallint;
$ LANGUAGE sql;

-- 10 bits for sequence id
CREATE FUNCTION insta5.get_sequence(id bigint)
RETURNS smallint
AS $
  SELECT ((id>54)::smallint;
$ LANGUAGE sql;

-- 41 bits for timestamp
CREATE OR REPLACE FUNCTION insta5.get_ts(id bigint)
RETURNS timestamp without time zone
AS $
  SELECT to_timestamp(((id >> 23) + 1314220021721 ) / 1000 )::timestamp without time zone;
$ LANGUAGE sql;


Playing around, let's get a test id.

SELECT insta5.next_id();
       next_id       
---------------------
 1671390786412876801
(1 row)

SELECT
  insta5id,
  insta5.get_ts(insta5id),
  insta5.get_shard(insta5id),
  insta5.get_sequence(insta5id)
FROM (VALUES
  (1671390786412876801::bigint),
  (insta5.next_id())
) AS t(insta5id);


Returns the following,

insta5id       |       get_ts        | get_shard | get_sequence 
---------------------+---------------------+-----------+--------------
 1671390786412876801 | 2017-12-16 17:02:09 |         5 |            1
 1671392537048257538 | 2017-12-16 17:05:38 |         5 |            2
(2 rows)


Rolling our own Instagram ID Domain

You can even create an explicit DOMAIN over the type if you wish to really clean this up.. This is how I would personally store this, note I made a few further modifications.

  • I added COMMENTS -- always good practice.



  • Made the functions IMMUTABLE



  • Added insta5.next_id require an explicit shard.



Let's drop what we had,

DROP SCHEMA insta5 CASCADE;


And start over,

CREATE SCHEMA insta5;
COMMENT ON SCHEMA insta5 IS 'Instagram';

CREATE DOMAIN insta5.id AS bigint;
COMMENT ON DOMAIN insta5.id IS $Instagram's internal ID type, based on example from "Sharding & IDs at Instagram"$;

CREATE SEQUENCE insta5.table_id_seq;

CREATE OR REPLACE FUNCTION insta5.next_id(shard_id smallint)
RETURNS insta5.id
AS $
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    result insta5.id;
    now_millis bigint;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;

    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) >51)::smallint;
$ LANGUAGE sql
IMMUTABLE;

COMMENT ON FUNCTION insta5.get_shard(insta5.id)
  IS '13 bits from insta5.id representing shard';

CREATE OR REPLACE FUNCTION insta5.get_sequence(id insta5.id)
RETURNS smallint
AS $
  SELECT ((id>54)::smallint;
$ LANGUAGE sql
IMMUTABLE;

COMMENT ON FUNCTION insta5.get_sequence(insta5.id)
  IS '10 bits from insta5.id representing sequence';

CREATE OR REPLACE FUNCTION insta5.get_ts(id insta5.id)
RETURNS timestamp without time zone
AS $
  SELECT to_timestamp(((id >> 23) + 1314220021721 ) / 1000 )::timestamp without time zone;
$ LANGUAGE sql
IMMUTABLE;

COMMENT ON FUNCTION insta5.get_ts(insta5.id)
  IS '41 bits from insta5.id representing timestamp';


Everything works as before, but now you can

CREATE SCHEMA mySchema;
CREATE TABLE mySchema.mydata ( insta5id  insta5.id ) ;


This is likely the best solution you can obtain shy of a C implementation, and you probably

Code Snippets

CREATE SCHEMA insta5;
CREATE SEQUENCE insta5.table_id_seq;

CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    -- The %1024, is just a way of saying they only want 10bit wraparound.
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;

    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE plpgsql;
insta5.next_id(smallint shard, OUT result bigint);
test=# SELECT insta5.next_id();
       next_id       
---------------------
 1671372309237077023
(1 row)
-- 13 bits for shard
CREATE FUNCTION insta5.get_shard(id bigint)
RETURNS smallint
AS $$
  SELECT ((id<<41)>>51)::smallint;
$$ LANGUAGE sql;

-- 10 bits for sequence id
CREATE FUNCTION insta5.get_sequence(id bigint)
RETURNS smallint
AS $$
  SELECT ((id<<54)>>54)::smallint;
$$ LANGUAGE sql;

-- 41 bits for timestamp
CREATE OR REPLACE FUNCTION insta5.get_ts(id bigint)
RETURNS timestamp without time zone
AS $$
  SELECT to_timestamp(((id >> 23) + 1314220021721 ) / 1000 )::timestamp without time zone;
$$ LANGUAGE sql;
SELECT insta5.next_id();
       next_id       
---------------------
 1671390786412876801
(1 row)

SELECT
  insta5id,
  insta5.get_ts(insta5id),
  insta5.get_shard(insta5id),
  insta5.get_sequence(insta5id)
FROM (VALUES
  (1671390786412876801::bigint),
  (insta5.next_id())
) AS t(insta5id);

Context

StackExchange Database Administrators Q#193287, answer score: 10

Revisions (0)

No revisions yet.