patternMinor
Slow JOIN on tables with millions of rows
Viewed 0 times
rowstablesmillionswithjoinslow
Problem
In my application I have to join tables with millions of rows. I have a query like this:
The table "files" has 10 million rows, and the table "value_text" has 40 million rows.
This query is too slow, it takes between 40s (15000 results) - 3 minutes (65000 results) to be executed.
I had thought about divide the two queries, but I can't because sometimes I need to order by the joined column (value)...
What can I do? I use SQL Server with Azure. Specifically, Azure SQL Database with pricing/model tier "PRS1 PremiumRS (125 DTUs)".
I'm receiving a lot of data but I think the internet connection is not a bottleneck, because in other queries I receive a lot of data too and they're faster.
I've tried using the client table as a subquery and removing
I have 1428 rows in client table.
Additional info
```
CREATE TABLE [dbo].clients NOT NULL,
[code] nvarchar NOT NULL,
[password] nchar NOT NULL,
[name] nvarchar NOT NULL DEFAULT (N''),
[email] nvarchar NULL DEFAULT (NULL),
[entity] [int] NOT NULL DEFAULT ((0)),
[users] [int] NOT NULL DEFAULT ((0)),
[status] varchar NOT NULL DEFAULT ('inactive'),
[created] datetime2 NULL DEFAULT (getdate()),
[activated] datetime2 NULL DEFAULT (getdate()),
[client_type] varchar NOT NULL DEFAULT ('normal'),
[current_size] [bigint] NOT NULL DEFAULT ((0)),
CONSTRAINT [PK_clients_id] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS =
SELECT DISTINCT "f"."id" AS "FileId"
, "f"."name" AS "FileName"
, "f"."year" AS "FileYear"
, "vt"."value" AS "value"
FROM files "f"
JOIN "clients" "cl" ON("f"."cid" = "cl"."id" AND "cl"."id" = 10)
LEFT JOIN "value_text" "vt" ON ("f"."id" = "vt"."id_file" AND "vt"."id_field" = 65739)
GROUP BY "f"."id", "f"."name", "f"."year", "vt"."value"The table "files" has 10 million rows, and the table "value_text" has 40 million rows.
This query is too slow, it takes between 40s (15000 results) - 3 minutes (65000 results) to be executed.
I had thought about divide the two queries, but I can't because sometimes I need to order by the joined column (value)...
What can I do? I use SQL Server with Azure. Specifically, Azure SQL Database with pricing/model tier "PRS1 PremiumRS (125 DTUs)".
I'm receiving a lot of data but I think the internet connection is not a bottleneck, because in other queries I receive a lot of data too and they're faster.
I've tried using the client table as a subquery and removing
DISTINCT with the same results.I have 1428 rows in client table.
Additional info
clients table:```
CREATE TABLE [dbo].clients NOT NULL,
[code] nvarchar NOT NULL,
[password] nchar NOT NULL,
[name] nvarchar NOT NULL DEFAULT (N''),
[email] nvarchar NULL DEFAULT (NULL),
[entity] [int] NOT NULL DEFAULT ((0)),
[users] [int] NOT NULL DEFAULT ((0)),
[status] varchar NOT NULL DEFAULT ('inactive'),
[created] datetime2 NULL DEFAULT (getdate()),
[activated] datetime2 NULL DEFAULT (getdate()),
[client_type] varchar NOT NULL DEFAULT ('normal'),
[current_size] [bigint] NOT NULL DEFAULT ((0)),
CONSTRAINT [PK_clients_id] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS =
Solution
I think you need this index (as Krismorte suggested):
The following index is probably not required as you appear to have a suitable existing index (not mentioned in the question) but I include it for completeness:
Express the query as:
This should give an execution plan like:
The
With the new index, you may also find the original query text produces a very similar plan, also with good performance.
You may also need to update the statistics on the
After updating the statistics on the files table the query is working very fast in all the cases. If in the future I add more fields in "files" table, should I update the index or something?
If you add more columns to the file table (and use/return them in your query) you will need to add them to the index (minimally as included columns) to keep the index "covering". Otherwise, the optimizer may choose to scan the files table rather than looking up the columns not present in the index. You might also choose to make
CREATE NONCLUSTERED INDEX [IX dbo.value_text id_file, id_field, value]
ON dbo.value_text (id_file, id_field, [value]);The following index is probably not required as you appear to have a suitable existing index (not mentioned in the question) but I include it for completeness:
CREATE NONCLUSTERED INDEX [IX dbo.files cid (id, year, name)]
ON dbo.files (cid)
INCLUDE
(
id,
[year],
[name]
);Express the query as:
SELECT
FileId = F.id,
[FileName] = F.[name],
FileYear = F.[year],
V.[value]
FROM dbo.files AS F
JOIN dbo.clients AS C
ON C.id = F.cid
OUTER APPLY
(
SELECT DISTINCT
VT.[value]
FROM dbo.value_text AS VT
WHERE
VT.id_file = F.id
AND VT.id_field = 65739
) AS V
WHERE
C.id = 10
OPTION (RECOMPILE);This should give an execution plan like:
The
OPTION (RECOMPILE) is optional. Only add if you find the ideal plan shape is different for different parameter values. There are other possible solutions to such "parameter-sniffing" issues.With the new index, you may also find the original query text produces a very similar plan, also with good performance.
You may also need to update the statistics on the
files table, since the estimate in the supplied plan for cid = 19 is not accurate:After updating the statistics on the files table the query is working very fast in all the cases. If in the future I add more fields in "files" table, should I update the index or something?
If you add more columns to the file table (and use/return them in your query) you will need to add them to the index (minimally as included columns) to keep the index "covering". Otherwise, the optimizer may choose to scan the files table rather than looking up the columns not present in the index. You might also choose to make
cid part of a clustering index on that table instead. It depends. Ask a new question if you want clarification on these points.Code Snippets
CREATE NONCLUSTERED INDEX [IX dbo.value_text id_file, id_field, value]
ON dbo.value_text (id_file, id_field, [value]);CREATE NONCLUSTERED INDEX [IX dbo.files cid (id, year, name)]
ON dbo.files (cid)
INCLUDE
(
id,
[year],
[name]
);SELECT
FileId = F.id,
[FileName] = F.[name],
FileYear = F.[year],
V.[value]
FROM dbo.files AS F
JOIN dbo.clients AS C
ON C.id = F.cid
OUTER APPLY
(
SELECT DISTINCT
VT.[value]
FROM dbo.value_text AS VT
WHERE
VT.id_file = F.id
AND VT.id_field = 65739
) AS V
WHERE
C.id = 10
OPTION (RECOMPILE);Context
StackExchange Database Administrators Q#196304, answer score: 7
Revisions (0)
No revisions yet.