patternsqlModerate
SQL Query challenge
Viewed 0 times
sqlquerychallenge
Problem
I have what seemed like an easy query to write, that really has me stumped - can anyone help me please?
So this is really easy on paper, but I can't translate it into SQL.
Lets say you have some graph data:
x | y
1 | 5.5
2 | 8
3 | 2
4 | 3
5 | 1
And you essentially want to draw a line sloping down on top of the data, touching at least 2 of the top-most points, like this...
To find when new data steps above this line-- So I am thinking this comes down to the basic formula for a graph y=A + Bx
So, if we know A and B, for a given value of x, we can predict y (and more importantly whether new values of y are above or below the value we predict for y based on this line).
Can anyone help please?
If it helps anyone, very quick create & data statement...
create table graph1 (x int, y decimal(4,2));
insert into graph1 values (1,5.5),(2,8),(3,2),(4,3),(5,1);
I can do a least squares regression to generate an equation for the linear regression (a line through the middle of the data) - but how would I generate one for the top of the channel the data sits in??
So this is really easy on paper, but I can't translate it into SQL.
Lets say you have some graph data:
x | y
1 | 5.5
2 | 8
3 | 2
4 | 3
5 | 1
And you essentially want to draw a line sloping down on top of the data, touching at least 2 of the top-most points, like this...
To find when new data steps above this line-- So I am thinking this comes down to the basic formula for a graph y=A + Bx
So, if we know A and B, for a given value of x, we can predict y (and more importantly whether new values of y are above or below the value we predict for y based on this line).
Can anyone help please?
If it helps anyone, very quick create & data statement...
create table graph1 (x int, y decimal(4,2));
insert into graph1 values (1,5.5),(2,8),(3,2),(4,3),(5,1);
I can do a least squares regression to generate an equation for the linear regression (a line through the middle of the data) - but how would I generate one for the top of the channel the data sits in??
Solution
The whole set of lines that satisfy the problem are known as a Pareto efficient front (or frontier or set) or the convex hull problem.
(See also my answer in the similar question at StackOverflow: How to write a query selecting reasonable trade-offs?).
So, one way to solve would be to identify first the points of the front, with a query like below. I assume there is a unique constraint on
This would result in:
The lines that pass through these points are our candidates. We have to eliminate however the points (like (4,3)) where another line passes above it. This would be much easier with a recursive CTE but MySQL has not implemented CTEs at all.
The query would not be hard to write - but you do need to experiment with adding some indexes on the temp table (you might even make it a normal InnoDB table), especially if the number of points is high:
Result:
If the number of points expected at this point is high (like 1000+) a SQL query for MySQL would probably be inefficient as it would require a 3-way cross join (of the temp table). Another option would be to have a query that uses mysql variables and runs through the (temp) table in order of the
Another option would be to write the whole query in one go. Good for testing with small data - but I wouldn't even try it if there were millions of rows:
Result:
Note for all the queries. I would prefer if both
An index on
(See also my answer in the similar question at StackOverflow: How to write a query selecting reasonable trade-offs?).
So, one way to solve would be to identify first the points of the front, with a query like below. I assume there is a unique constraint on
(x): CREATE TABLE pareto AS
SELECT a.x
, a.y
FROM graph1 AS a
WHERE NOT EXISTS
( SELECT 1
FROM graph1 AS b
WHERE b.x > a.x
AND b.y >= a.y
) ;
SELECT *
FROM pareto ;This would result in:
+------+------+
| x | y |
+------+------+
| 2.00 | 8.00 |
| 4.00 | 3.00 |
| 5.00 | 1.00 |
+------+------+The lines that pass through these points are our candidates. We have to eliminate however the points (like (4,3)) where another line passes above it. This would be much easier with a recursive CTE but MySQL has not implemented CTEs at all.
The query would not be hard to write - but you do need to experiment with adding some indexes on the temp table (you might even make it a normal InnoDB table), especially if the number of points is high:
SELECT p.x
, p.y
FROM pareto AS p
WHERE NOT EXISTS
( SELECT 1
FROM pareto AS st
CROSS JOIN pareto AS en
WHERE st.x < p.x
AND p.x < en.x
AND (p.x * (en.y-st.y) + p.y * (en.x-st.x) < 0)
) ;Result:
+------+------+
| x | y |
+------+------+
| 2.00 | 8.00 |
| 5.00 | 1.00 |
+------+------+If the number of points expected at this point is high (like 1000+) a SQL query for MySQL would probably be inefficient as it would require a 3-way cross join (of the temp table). Another option would be to have a query that uses mysql variables and runs through the (temp) table in order of the
x coordinate, eliminating the covered points - either in one or multiple batches.Another option would be to write the whole query in one go. Good for testing with small data - but I wouldn't even try it if there were millions of rows:
SELECT a.x
, a.y
FROM graph1 AS a
WHERE NOT EXISTS
( SELECT 1
FROM graph1 AS b
WHERE b.x > a.x
AND b.y >= a.y
)
AND NOT EXISTS
( SELECT 1
FROM graph1 AS st
CROSS JOIN graph1 AS en
WHERE st.x < a.x
AND a.x < en.x
AND (a.x * (en.y-st.y) + a.y * (en.x-st.x) < 0)
) ;Result:
+------+------+
| x | y |
+------+------+
| 2.00 | 8.00 |
| 5.00 | 1.00 |
+------+------+Note for all the queries. I would prefer if both
x and y where the same type (so make x decimal(4,2) as y).An index on
(x,y) would be used by both queries.Code Snippets
CREATE TABLE pareto AS
SELECT a.x
, a.y
FROM graph1 AS a
WHERE NOT EXISTS
( SELECT 1
FROM graph1 AS b
WHERE b.x > a.x
AND b.y >= a.y
) ;
SELECT *
FROM pareto ;+------+------+
| x | y |
+------+------+
| 2.00 | 8.00 |
| 4.00 | 3.00 |
| 5.00 | 1.00 |
+------+------+SELECT p.x
, p.y
FROM pareto AS p
WHERE NOT EXISTS
( SELECT 1
FROM pareto AS st
CROSS JOIN pareto AS en
WHERE st.x < p.x
AND p.x < en.x
AND (p.x * (en.y-st.y) + p.y * (en.x-st.x) < 0)
) ;+------+------+
| x | y |
+------+------+
| 2.00 | 8.00 |
| 5.00 | 1.00 |
+------+------+SELECT a.x
, a.y
FROM graph1 AS a
WHERE NOT EXISTS
( SELECT 1
FROM graph1 AS b
WHERE b.x > a.x
AND b.y >= a.y
)
AND NOT EXISTS
( SELECT 1
FROM graph1 AS st
CROSS JOIN graph1 AS en
WHERE st.x < a.x
AND a.x < en.x
AND (a.x * (en.y-st.y) + a.y * (en.x-st.x) < 0)
) ;Context
StackExchange Database Administrators Q#96548, answer score: 12
Revisions (0)
No revisions yet.