HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Slow JOIN on tables with millions of rows

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
rowstablesmillionswithjoinslow

Problem

In my application I have to join tables with millions of rows. I have a query like this:

SELECT DISTINCT "f"."id" AS "FileId"  
, "f"."name" AS "FileName"  
, "f"."year" AS "FileYear"  
, "vt"."value" AS "value"  
FROM files "f"  
JOIN "clients" "cl" ON("f"."cid" = "cl"."id" AND "cl"."id" = 10)  
LEFT JOIN "value_text" "vt" ON ("f"."id" = "vt"."id_file" AND "vt"."id_field" = 65739)  
GROUP BY "f"."id", "f"."name", "f"."year", "vt"."value"


The table "files" has 10 million rows, and the table "value_text" has 40 million rows.

This query is too slow, it takes between 40s (15000 results) - 3 minutes (65000 results) to be executed.

I had thought about divide the two queries, but I can't because sometimes I need to order by the joined column (value)...

What can I do? I use SQL Server with Azure. Specifically, Azure SQL Database with pricing/model tier "PRS1 PremiumRS (125 DTUs)".

I'm receiving a lot of data but I think the internet connection is not a bottleneck, because in other queries I receive a lot of data too and they're faster.

I've tried using the client table as a subquery and removing DISTINCT with the same results.

I have 1428 rows in client table.
Additional info

clients table:

```
CREATE TABLE [dbo].clients NOT NULL,
[code] nvarchar NOT NULL,
[password] nchar NOT NULL,
[name] nvarchar NOT NULL DEFAULT (N''),
[email] nvarchar NULL DEFAULT (NULL),
[entity] [int] NOT NULL DEFAULT ((0)),
[users] [int] NOT NULL DEFAULT ((0)),
[status] varchar NOT NULL DEFAULT ('inactive'),
[created] datetime2 NULL DEFAULT (getdate()),
[activated] datetime2 NULL DEFAULT (getdate()),
[client_type] varchar NOT NULL DEFAULT ('normal'),
[current_size] [bigint] NOT NULL DEFAULT ((0)),
CONSTRAINT [PK_clients_id] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS =

Solution

I think you need this index (as Krismorte suggested):

CREATE NONCLUSTERED INDEX [IX dbo.value_text id_file, id_field, value]
ON dbo.value_text (id_file, id_field, [value]);


The following index is probably not required as you appear to have a suitable existing index (not mentioned in the question) but I include it for completeness:

CREATE NONCLUSTERED INDEX [IX dbo.files cid (id, year, name)]
ON dbo.files (cid)
INCLUDE
(
    id,
    [year],
    [name]
);


Express the query as:

SELECT
    FileId = F.id,
    [FileName] = F.[name],
    FileYear = F.[year],
    V.[value]
FROM dbo.files AS F
JOIN dbo.clients AS C
    ON C.id = F.cid
OUTER APPLY
(
        SELECT DISTINCT
            VT.[value]
        FROM dbo.value_text AS VT
        WHERE
            VT.id_file = F.id
            AND VT.id_field = 65739
) AS V
WHERE 
    C.id = 10
OPTION (RECOMPILE);


This should give an execution plan like:

The OPTION (RECOMPILE) is optional. Only add if you find the ideal plan shape is different for different parameter values. There are other possible solutions to such "parameter-sniffing" issues.

With the new index, you may also find the original query text produces a very similar plan, also with good performance.

You may also need to update the statistics on the files table, since the estimate in the supplied plan for cid = 19 is not accurate:


After updating the statistics on the files table the query is working very fast in all the cases. If in the future I add more fields in "files" table, should I update the index or something?

If you add more columns to the file table (and use/return them in your query) you will need to add them to the index (minimally as included columns) to keep the index "covering". Otherwise, the optimizer may choose to scan the files table rather than looking up the columns not present in the index. You might also choose to make cid part of a clustering index on that table instead. It depends. Ask a new question if you want clarification on these points.

Code Snippets

CREATE NONCLUSTERED INDEX [IX dbo.value_text id_file, id_field, value]
ON dbo.value_text (id_file, id_field, [value]);
CREATE NONCLUSTERED INDEX [IX dbo.files cid (id, year, name)]
ON dbo.files (cid)
INCLUDE
(
    id,
    [year],
    [name]
);
SELECT
    FileId = F.id,
    [FileName] = F.[name],
    FileYear = F.[year],
    V.[value]
FROM dbo.files AS F
JOIN dbo.clients AS C
    ON C.id = F.cid
OUTER APPLY
(
        SELECT DISTINCT
            VT.[value]
        FROM dbo.value_text AS VT
        WHERE
            VT.id_file = F.id
            AND VT.id_field = 65739
) AS V
WHERE 
    C.id = 10
OPTION (RECOMPILE);

Context

StackExchange Database Administrators Q#196304, answer score: 7

Revisions (0)

No revisions yet.