patternsqlMinor
Estimate query speed for +1.5 billion rows mysql table
Viewed 0 times
estimaterowsbillionquerymysqlforspeedtable
Problem
I'm working on audio fingerprinting problem where I need to
query a very large table in terms of number of rows (at least 1.5 billion rows), but relatively OK in size (23G), and retrieve about 50K to 100K rows in total, using multiple queries (between 20 and 50 queries).
The table has 3 columns, a hash and two int values. No constraints whatsoever. The hash column has a lot of collisions/duplicates. Here is the output for show create table
The query is simple, here is an example:
I've tried different hardware specs (using AWS with only one machine/instance). Different my.cnf configurations but no significant performance boost.
the target speed threshold for this operation (total queries time) is 5 sec. But the best I've got in average is 3 sec for only one single query (if I have 20 queries, the total operation time is 1 minute).
final note: when profiling the query, SHOW profile command shows that the slowest part was (SENDING DATA) state. The query becomes slower when the result set is larger (i.e retrieving 10k rows takes about 6 sec, while retrieving 1000 rows takes 2 sec)
Questions:
at this scale.
machines? should I switch to innodb?
my setup:
query a very large table in terms of number of rows (at least 1.5 billion rows), but relatively OK in size (23G), and retrieve about 50K to 100K rows in total, using multiple queries (between 20 and 50 queries).
The table has 3 columns, a hash and two int values. No constraints whatsoever. The hash column has a lot of collisions/duplicates. Here is the output for show create table
CREATE TABLE `fingerprints` (
`hash` binary(10) NOT NULL,
`int1` mediumint(8) unsigned NOT NULL,
`int2` mediumint(8) unsigned NOT NULL,
KEY `hash` (`hash`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ciThe query is simple, here is an example:
select int1 ,int2 from fingerprints
WHERE hash in
(UNHEX("1ff99335cce004f2765d"),UNHEX("14c4b93ed575982ed2e4")
,UNHEX("41044b0cf21dc8ac8f9b"),UNHEX("a791403ca116b4da53dd")
,UNHEX("d9f91514b900c25fa095"),UNHEX("3349f906deae6cd32883")
,UNHEX("221c0e3e2bc243fb0fe5") .... more here);I've tried different hardware specs (using AWS with only one machine/instance). Different my.cnf configurations but no significant performance boost.
the target speed threshold for this operation (total queries time) is 5 sec. But the best I've got in average is 3 sec for only one single query (if I have 20 queries, the total operation time is 1 minute).
final note: when profiling the query, SHOW profile command shows that the slowest part was (SENDING DATA) state. The query becomes slower when the result set is larger (i.e retrieving 10k rows takes about 6 sec, while retrieving 1000 rows takes 2 sec)
Questions:
- What is the speed estimation for such query scenario for an SSD machine with enough RAM to hold indices. I have no experience working
at this scale.
- Do you have recommendation for specific db setup? should I try mysql Memory engine? partitioning here is a necessity with distributed
machines? should I switch to innodb?
my setup:
Solution
I was able to solve this slowness problem by executing the following query:
-
I have a lot of repetition on the hash column (there is 34m unique hashes only). If I understood the situation correctly, ordering made the reading a lot more sequential for my use case (select * from table where hash = ***).
-
If you check the output of SHOW INDEX, the cardinality value is NULL. After executing order by command, the cardinality now= the number of unique hashes = 34m. which makes sense. I guess this is the root problem. see: https://stackoverflow.com/questions/6521673/is-null-cardinality-in-an-index-a-problem-mysql-5-x
The job that took around 60 sec, now takes 350msec only.
alter table fingerprints order by hash;-
I have a lot of repetition on the hash column (there is 34m unique hashes only). If I understood the situation correctly, ordering made the reading a lot more sequential for my use case (select * from table where hash = ***).
-
If you check the output of SHOW INDEX, the cardinality value is NULL. After executing order by command, the cardinality now= the number of unique hashes = 34m. which makes sense. I guess this is the root problem. see: https://stackoverflow.com/questions/6521673/is-null-cardinality-in-an-index-a-problem-mysql-5-x
The job that took around 60 sec, now takes 350msec only.
Code Snippets
alter table fingerprints order by hash;Context
StackExchange Database Administrators Q#234081, answer score: 4
Revisions (0)
No revisions yet.