snippetsqlMinor
How can I speed up a query that orders by a calculated field?
Viewed 0 times
canfieldquerythatordershowspeedcalculated
Problem
I will try and give an example - this is not my table structure - I'm simply trying to outline the issue in order to find a solution...
Person
Id, Name
BrothersNames
Id, Name
SistersNames
Id, Name
PersonBrothers (join table)
PersonId, BrotherNameId
PersonSisters (join table)
PersonId, SisterNameId
OK - so imagine this database holds every person from a small country. The database holds a record of the names of everyone's brothers and sisters (it does not map a person to their brother or sister - just their names) so that we can find out statistics about names.
Obviously lots of names are shared so the join tables normalise this for us.
What I want to do is take one user and find out the number of matches of brother's names and number of matches of sister's names with every other user in the system, then add those two matches together and order by that descending. So this would give us a list of users who have the most number of brothers and sister's names in common.
I'm really only interested in the top ten matches but I think I have to get the whole result set to work out the top ten matches.
Please note that in my actual data a person can have a million brothers or a milllion sisters. This is where I'm getting performance issues.
This is how I'm calculating the matches for brothers and I do the same for sisters
Please let me know if I should specify more info...
I am using SQL Server 2008 but is probably platform agnostic..
Person
Id, Name
BrothersNames
Id, Name
SistersNames
Id, Name
PersonBrothers (join table)
PersonId, BrotherNameId
PersonSisters (join table)
PersonId, SisterNameId
OK - so imagine this database holds every person from a small country. The database holds a record of the names of everyone's brothers and sisters (it does not map a person to their brother or sister - just their names) so that we can find out statistics about names.
Obviously lots of names are shared so the join tables normalise this for us.
What I want to do is take one user and find out the number of matches of brother's names and number of matches of sister's names with every other user in the system, then add those two matches together and order by that descending. So this would give us a list of users who have the most number of brothers and sister's names in common.
I'm really only interested in the top ten matches but I think I have to get the whole result set to work out the top ten matches.
Please note that in my actual data a person can have a million brothers or a milllion sisters. This is where I'm getting performance issues.
This is how I'm calculating the matches for brothers and I do the same for sisters
select p.id, matches
FROM Person p
LEFT JOIN
(
SELECT
COUNT(*) AS Matches,
pbn.PersonId
FROM PersonBrothersNames pbn
INNER JOIN Brothersnames bn on pbn.BrothernameId =bn.Id
inner join PersonBrothersName otherpbn on otherpbn.BrothernameId = bn.Id
WHERE pbn.PersonId= @PersonId and pbn.PersonId <> otherpbn.personid
GROUP BY pbn.PersonId
) As BrothersNamesJoin ON BrothersNamesJoin.Person = p.IdPlease let me know if I should specify more info...
I am using SQL Server 2008 but is probably platform agnostic..
Solution
If you don't really need zero-second actuality, you could just run your query time to time and cache the results.
If you still need to have real-time data on this (sacrificing insert performance), I would do this:
Since self-joins are not allowed in indexed views, you need to create two copies of each table:
Create an indexed view on their join:
Same for sisters.
You should manually maintain these tables to have same data (write a trigger, insert/update/delete both etc).
Now we can easily get pairs with the most brothers and most sisters:
All we need now is to fetch a greatest sum. Unfortunately, we cannot index a join of these views (it's a pure implementation flaw, there's no theoretical limitation for this).
So we need to do the following: the top pair cannot have less brothers than the top sis pair. Same holds for the sisters. So we have this query:
If the numbers of common brothers and sisters are correlated, this query will be very fast.
This solution has two drawbacks:
-
-
Persons with 0 brothers or 0 sisters will not be indexed. If there's a chance that top pair will not have brothers or sisters, you should amend the last query a little.
If you still need to have real-time data on this (sacrificing insert performance), I would do this:
Since self-joins are not allowed in indexed views, you need to create two copies of each table:
CREATE TABLE personBrother
(
personId INT NOT NULL,
brotherName INT NOT NULL
)
CREATE TABLE personBrother2
(
personId INT NOT NULL,
brotherName INT NOT NULL
)Create an indexed view on their join:
CREATE VIEW
commonBrothers
WITH SCHEMABINDING
AS
SELECT p1.personId AS p1,
p2.personId AS p2,
COUNT_BIG(*) AS cnt
FROM dbo.personBrother p1
JOIN dbo.personBrother2 p2
ON p1.brotherName = p2.brotherName
WHERE p1.personId < p2.personId
GROUP BY
p1.personId, p2.personId
CREATE UNIQUE CLUSTERED INDEX
ux_commonBrothers_p1_p2
ON commonBrothers (p1, p2)
CREATE INDEX
ix_commonBrothers_cnt
ON commonBrothers (cnt)Same for sisters.
You should manually maintain these tables to have same data (write a trigger, insert/update/delete both etc).
Now we can easily get pairs with the most brothers and most sisters:
SELECT TOP 1 WITH TIES
*
FROM commonBrothers
ORDER BY
cnt DESCAll we need now is to fetch a greatest sum. Unfortunately, we cannot index a join of these views (it's a pure implementation flaw, there's no theoretical limitation for this).
So we need to do the following: the top pair cannot have less brothers than the top sis pair. Same holds for the sisters. So we have this query:
SELECT TOP 1 WITH TIES
cb.p1, cb.p2, cb.cnt + cs.cnt AS totalCnt
FROM commonBrothers cb
JOIN commonSisters cs
ON cs.p1 = cb.p1
AND cs.p2 = cb.p2
WHERE cs.cnt >=
(
SELECT MAX(cst.cnt)
FROM (
SELECT TOP 1 WITH TIES
p1, p2
FROM commonBrothers
ORDER BY
cnt DESC
) cbt
JOIN commonSisters cst
ON cst.p1 = cbt.p1
AND cst.p2 = cbt.p2
)
AND cb.cnt >=
(
SELECT MAX(cbt.cnt)
FROM (
SELECT TOP 1 WITH TIES
p1, p2
FROM commonSisters
ORDER BY
cnt DESC
) cst
JOIN commonBrothers cbt
ON cbt.p1 = cst.p1
AND cbt.p2 = cst.p2
)
ORDER BY
totalCnt DESCIf the numbers of common brothers and sisters are correlated, this query will be very fast.
This solution has two drawbacks:
-
DML performance: if you insert or delete a record for a name shared by million brothers, the indexed view will get 2M inserts or delete. This is the price you pay for real-time query: the kind of data you are asking for cannot be easily indexed.-
Persons with 0 brothers or 0 sisters will not be indexed. If there's a chance that top pair will not have brothers or sisters, you should amend the last query a little.
Code Snippets
CREATE TABLE personBrother
(
personId INT NOT NULL,
brotherName INT NOT NULL
)
CREATE TABLE personBrother2
(
personId INT NOT NULL,
brotherName INT NOT NULL
)CREATE VIEW
commonBrothers
WITH SCHEMABINDING
AS
SELECT p1.personId AS p1,
p2.personId AS p2,
COUNT_BIG(*) AS cnt
FROM dbo.personBrother p1
JOIN dbo.personBrother2 p2
ON p1.brotherName = p2.brotherName
WHERE p1.personId < p2.personId
GROUP BY
p1.personId, p2.personId
CREATE UNIQUE CLUSTERED INDEX
ux_commonBrothers_p1_p2
ON commonBrothers (p1, p2)
CREATE INDEX
ix_commonBrothers_cnt
ON commonBrothers (cnt)SELECT TOP 1 WITH TIES
*
FROM commonBrothers
ORDER BY
cnt DESCSELECT TOP 1 WITH TIES
cb.p1, cb.p2, cb.cnt + cs.cnt AS totalCnt
FROM commonBrothers cb
JOIN commonSisters cs
ON cs.p1 = cb.p1
AND cs.p2 = cb.p2
WHERE cs.cnt >=
(
SELECT MAX(cst.cnt)
FROM (
SELECT TOP 1 WITH TIES
p1, p2
FROM commonBrothers
ORDER BY
cnt DESC
) cbt
JOIN commonSisters cst
ON cst.p1 = cbt.p1
AND cst.p2 = cbt.p2
)
AND cb.cnt >=
(
SELECT MAX(cbt.cnt)
FROM (
SELECT TOP 1 WITH TIES
p1, p2
FROM commonSisters
ORDER BY
cnt DESC
) cst
JOIN commonBrothers cbt
ON cbt.p1 = cst.p1
AND cbt.p2 = cst.p2
)
ORDER BY
totalCnt DESCContext
StackExchange Database Administrators Q#11841, answer score: 5
Revisions (0)
No revisions yet.