Use MySQL to regularly do multi-way joins on 100+ GB tables?
Problem
Background:
I’ve created a web application that I would like to be able to scale reasonably well. I know I'm not Google or Twitter, but my app uses a fairly large amount of data for each user and thus has fairly high data requirements. I want to be ready to scale reasonably well without having to re-architect everything later.
I consider myself a software developer, not a database expert. That’s why I am posting here. Hopefully someone with a lot more database expertise can give me advice.
With a relatively large number of users, but nothing like Facebook numbers, I expect to have a DB that looks like this:
One "Big table":
- 250 million records
- 20 columns
- Approximately 100 GB of data
- Has an indexed bigint(20) foreign key
- Has an indexed varchar(500) string_id column
- Has an int(11) "value" column
4 other tables:
- 10 million records each
- Approximately 2 - 4 GB of data each
- each of these tables has 4 - 8 columns
- one column is datetime date_created
- one column is the varchar(500) string_id column
- one or two columns from each of these tables will be selected in a join
One of these tables is used for storing averages -- its schema is bigint(20) id, varchar(20) string_id, datetime date_created, float average_value
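The schema fragment above can be written out as DDL. This is a sketch only; the table name `averages` and the index are assumptions, while the column types are as stated in the text:

```sql
-- Hypothetical DDL for the averages table described above.
-- The table name and secondary index are assumptions; column types are from the text.
CREATE TABLE averages (
    id            BIGINT(20)  NOT NULL AUTO_INCREMENT,
    string_id     VARCHAR(20) NOT NULL,
    date_created  DATETIME    NOT NULL,
    average_value FLOAT       NOT NULL,
    PRIMARY KEY (id),
    KEY idx_string_id (string_id)   -- assumed: lookups by string_id
);
```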
What I want to do -- two relatively expensive queries:
- Calculate new average values:
- Using a foreign key, select up to several million separate records from the big table.
- Calculate a new average, grouping by the string_id column.
- Insert results into the averages table.
- As currently constructed, this query uses two joins.
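The steps above can be sketched as a single `INSERT ... SELECT`. All names except `string_id` and `value` are assumptions, and the two joins the question mentions are omitted since their details aren't given:

```sql
-- Sketch of the averaging step; big_table, user_id, and averages are hypothetical names.
-- The question's real query reportedly involves two joins, not shown here.
INSERT INTO averages (string_id, date_created, average_value)
SELECT b.string_id, NOW(), AVG(b.value)
FROM big_table b
WHERE b.user_id = ?          -- the indexed bigint(20) foreign key
GROUP BY b.string_id;
```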
- Create de-normalized, read-only records for serving users:
- Use a foreign key to select anywhere from 1,000-40,000 records from the big table.
- Join to each of the other four tables via the string_id column, matching only the newest record in each.
- Insert the results into a de-normalized table.
- These records are for use by the front-end to display information to users.
- As currently constructed,
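A sketch of the de-normalization step with one of the four side tables, using a correlated subquery to pick the newest matching record (a classic greatest-n-per-group pattern). Every name except `string_id` is an assumption; the same join pattern would repeat for each side table:

```sql
-- Sketch only: denormalized, side_table, user_id, value, and some_column are hypothetical.
INSERT INTO denormalized (user_id, string_id, value, side_col)
SELECT b.user_id, b.string_id, b.value, s.some_column
FROM big_table b
JOIN side_table s
  ON s.string_id = b.string_id
 AND s.date_created = (SELECT MAX(s2.date_created)      -- newest record for this string_id
                       FROM side_table s2
                       WHERE s2.string_id = b.string_id)
WHERE b.user_id = ?;
```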
Solution
Have you tried piling on more data and benchmarking it? 100K rows is inconsequential. Try the 250M or 500M rows you're expecting to need to handle and see where the bottlenecks are.
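One quick way to pile on data is a self-insert that roughly doubles the row count on each pass. A sketch, assuming a hypothetical `big_table` with the columns described in the question:

```sql
-- Roughly doubles big_table's row count each time it runs (names are assumptions).
-- Run repeatedly, re-timing your real queries between passes to find the knee.
INSERT INTO big_table (user_id, string_id, value)
SELECT user_id, string_id, value
FROM big_table;
```

Duplicated values won't exercise index selectivity realistically, so treat the resulting timings as a lower bound and vary the copied keys if you can.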
An RDBMS can do a lot of things if you pay careful attention to the limitations and try and work with the strengths of the system. They're exceptionally good at some things, and terrible at others, so you will need to experiment to be sure it's the right fit.
For some batch processing jobs, you really cannot beat flat files, loading the data into RAM, smashing it around using a series of loops and temporary variables, and dumping out the results. MySQL will never, ever be able to match that sort of speed, but if tuned properly and used correctly it can come within an order of magnitude.
What you'll want to do is investigate how your data can be partitioned. Do you have one big set of data with too much in the way of cross-links to be able to split it up, or are there natural places to partition it? If you can partition it you won't have one table with a whole pile of rows, but potentially many significantly smaller ones. Smaller tables, with much smaller indexes, tend to perform better.
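In MySQL, one natural place to start is native table partitioning. A sketch, assuming the big table's bigint foreign key (called `user_id` here, an assumed name) is a reasonable partitioning key:

```sql
-- Sketch: hash-partition the big table on its bigint foreign key.
-- Note MySQL requires the partitioning column to be part of every
-- unique key on the table, so the primary key may need adjusting.
ALTER TABLE big_table
    PARTITION BY HASH(user_id)
    PARTITIONS 16;
```

Since both expensive queries select by that foreign key, hash partitioning on it would let each query touch a single, much smaller partition with much smaller indexes.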
From a hardware perspective, you'll need to test to see how your platform performs. Sometimes memory is essential. Other times it's disk I/O. It really depends on what you're doing with the data. You'll need to pay close attention to your CPU usage and look for high levels of IO wait to know where the problem lies.
Whenever possible, split your data across multiple systems. You can use MySQL Cluster if you're feeling brave, or simply spin up many independent instances of MySQL where each stores an arbitrary portion of the complete data set using some partitioning scheme that makes sense.
Context
StackExchange Database Administrators Q#23328, answer score: 4