HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Estimate query speed for +1.5 billion rows mysql table

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
estimaterowsbillionquerymysqlforspeedtable

Problem

I'm working on audio fingerprinting problem where I need to
query a very large table in terms of number of rows (at least 1.5 billion rows), but relatively OK in size (23G), and retrieve about 50K to 100K rows in total, using multiple queries (between 20 and 50 queries).

The table has 3 columns, a hash and two int values. No constraints whatsoever. The hash column has a lot of collisions/duplicates. Here is the output for show create table

CREATE TABLE `fingerprints` (
  `hash` binary(10) NOT NULL,
  `int1` mediumint(8) unsigned NOT NULL,
  `int2` mediumint(8) unsigned NOT NULL,
  KEY `hash` (`hash`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci


The query is simple, here is an example:

select int1 ,int2 from fingerprints 
  WHERE hash in 
    (UNHEX("1ff99335cce004f2765d"),UNHEX("14c4b93ed575982ed2e4")
     ,UNHEX("41044b0cf21dc8ac8f9b"),UNHEX("a791403ca116b4da53dd")
     ,UNHEX("d9f91514b900c25fa095"),UNHEX("3349f906deae6cd32883")
     ,UNHEX("221c0e3e2bc243fb0fe5") .... more here);


I've tried different hardware specs (using AWS with only one machine/instance). Different my.cnf configurations but no significant performance boost.

the target speed threshold for this operation (total queries time) is 5 sec. But the best I've got in average is 3 sec for only one single query (if I have 20 queries, the total operation time is 1 minute).

final note: when profiling the query, SHOW profile command shows that the slowest part was (SENDING DATA) state. The query becomes slower when the result set is larger (i.e retrieving 10k rows takes about 6 sec, while retrieving 1000 rows takes 2 sec)

Questions:

  • What is the speed estimation for such query scenario for an SSD machine with enough RAM to hold indices. I have no experience working


at this scale.

  • Do you have recommendation for specific db setup? should I try mysql Memory engine? partitioning here is a necessity with distributed


machines? should I switch to innodb?

my setup:

Solution

I was able to solve this slowness problem by executing the following query:

alter table fingerprints order by hash;


-
I have a lot of repetition on the hash column (there is 34m unique hashes only). If I understood the situation correctly, ordering made the reading a lot more sequential for my use case (select * from table where hash = ***).

-
If you check the output of SHOW INDEX, the cardinality value is NULL. After executing order by command, the cardinality now= the number of unique hashes = 34m. which makes sense. I guess this is the root problem. see: https://stackoverflow.com/questions/6521673/is-null-cardinality-in-an-index-a-problem-mysql-5-x

The job that took around 60 sec, now takes 350msec only.

Code Snippets

alter table fingerprints order by hash;

Context

StackExchange Database Administrators Q#234081, answer score: 4

Revisions (0)

No revisions yet.