HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Database design to handle millions of rows in MySQL

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
rowsmillionshandledesigndatabasemysql

Problem

We are running an application that is collecting data much faster than we anticipated. Trying to addapt to that, we are doing a redesig of the database.
After reading this, this and this, I am not sure what the best approach for our design is... considering our HW is very humble.

There are three main tables that are causing problems:

  • SCANS



  • DOMAINS



  • DOCUMENTS



  • VALUES



Currently we have one single table to store data. The relation between them is:

  • 1 SCAN -> (avg 4x) DOMAINS -> (avg 3000) MANY DOCUMENTS -> (avg 51000) MANY VALUES



  • 1 SCAN points to average 4 entries on DOMAINS.



  • 4 entries on DOMAINS point to average 12.000 entries ON DOCUMENTS



  • 12000 entries ON DOCUMENTS point TO average 204000 entries on VALUES



We are currently performing around 100 scans/day. That is inserting around 20.400.000 items per day into VALUES.

We are considering to split VALUES table as one "VALUE_table_per_month":

  • VALUES_year_month with the intention to distribute the load between them. But if we multiply the number of scanners, this mechanism is not escalable.



  • VALUES_year_month_day then we will end up with so many tables into the same DB.



In both cases, if we increase the number of scans per day, none of the solutions seems scalable.

At this point, to keep all the data into a centralized DB does not seem the best option for scalability reasons... but at the same time, a distributed system will increase the load time significantly.

What would be a reasonable approach? I am sure we are not the first team to find this issue! :P

EDIT

How much data do we read per query?

That depends on the SCAN. Not all scans have the same amount of data. The range varies between:

  • 1 SCAN --> 200 VALUES



  • 1 SCAN --> 200.000 VALUES



The information is presented on a front end to the end user. So we have splitted how the queries are requested to the backend to avoid overload the server, but in some cases it is not enought due the high number of VALUES.

When is the data read

Solution

Do not split a table simply because it is big.

Consider PARTITIONing a big table if you need to delete "old" data.

Do consider 'sharding' when the number of writes become too big for a single machine.

250 rows inserted per second on SSD device does not, by itself, trigger any of the above reasons to split.

If you have a retention period of, say, 2 months, then PARTITION BY RANGE(TO_DAYS(...)) and doing a monthly DROP PARTITION + REORGANIZE PARTITION is advisable. More discussion: Partition

(weigth = 150 OR weigth = 100) is a rather strange test. Are there no values between 100 and 150, or do you deliberately filter them out? I ask because OR complicates optimization.

The query you presented needs

INDEX(id_analysis, is_idden, is_hidden_by_user, weight)


The query is improperly written because of ONLY_FULL_GROUP_BY. I doubt if url, filetype, and severity are "dependent" on value.

Please elaborate on value versus id_value. This sounds like another mistake in the query.

Please elaborate on why Documents and Values are separate. It smells like "over-normalization".

Or maybe I am really confused by the name VALUES because of it containing url, filetype, severity.

Please provide SHOW CREATE TABLE for each table.

In Data Warehouse situation, Summary Tables are often the answer to performance. Can you summarize the counts for each day, then sum up those subtotals?

Code Snippets

INDEX(id_analysis, is_idden, is_hidden_by_user, weight)

Context

StackExchange Database Administrators Q#315959, answer score: 8

Revisions (0)

No revisions yet.