HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Azure SQL cross-database queries returns too many rows with joins

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
rowssqlcrosswithtoodatabaseazurereturnsmanyqueries

Problem

I have a problem with Azure SQL cross-database queries, it looks like if we join data across databases the join is not distributed as input to the external database.

Currently I tried to implement a simple example and therefore I used this starting point:
https://learn.microsoft.com/en-us/azure/sql-database/sql-database-elastic-query-getting-started-vertical.

As preperation I created 199999 CustomerInformation entries and 200 OrderInformation entries.

Following query works as expected the CustomerInformation Table returns just one row:

SELECT o.OrderId, c.CustomerID, c.CustomerName
  FROM OrderInformation o
  JOIN CustomerInformation c ON o.CustomerID = c.CustomerID
 WHERE c.CustomerID = 2


But any query with a join that does not specifiy the CustomerID returns all the rows from the Customers Database.

SELECT o.OrderId, c.CustomerID, c.CustomerName
  FROM OrderInformation o
  JOIN CustomerInformation c ON o.CustomerID = c.CustomerID
 WHERE o.OrderId = 1870


Therefore the queries are really slow because the servers transfer to many rows. Do I miss some setting or is it not possible to separate this data with the suggested queries.

Thanks for your help.

Solution

Recreating the issue

Test data below

SELECT o.OrderId, c.CustomerID, c.CustomerName
FROM OrderInformation o
JOIN CustomerInformation c ON o.CustomerID = c.CustomerID
WHERE o.OrderId = 155;


All records are returned by the remote query operator, to filter down to one when the inner join is applied.

Some more background on pushing down the predicate to a remote database:


Previously, elastic database query could not push parameterized
operations to remote databases. As a result, sometimes large row sets
had to be unnecessarily brought local to evaluate these operations.
With the recent improvements, parameterized operations can now be
pushed to remote databases and be evaluated remotely. For a query over
an external table and a local table like the following, this can now
avoid transferring millions of rows by evaluating the selective filter
in the WHERE clause on the remote database:

DECLARE @low int
DECLARE @high int
SET @low = 100
SET @high = 200

SELECT c.CustomerId, c.Name, count(OrderId) 
FROM remote_customers c
JOIN local_orders o
ON c.CustomerId = o.CustomerId 
WHERE c.CustomerId > @low and c.CustomerId < @high
GROUP BY c.CustomerId, c.Name;


Source

Above query shows an example of using parameters to define a range from where to read the data. The difference with your problem is that the filtering is happening on the local table.

If we know that this predicate can be pushed down to the remote table

SELECT o.OrderId,c.CustomerID, c.CustomerName
FROM CustomerInformation c 
JOIN OrderInformation o on o.CustomerId = c.CustomerID
WHERE o.customerID > 1 and o.Customerid < 10;


You could look for a range of customerid's and filter on these

MAX() and MIN() functions on o.Customerid are used to get the upper and lower boundaries for WHERE o.OrderId = 155

DECLARE @MAX INT, @MIN INT;

SELECT @MAX= max(o.Customerid) , @MIN = MIN(o.Customerid)
FROM OrderInformation o 
WHERE o.OrderId = 155;

SELECT o.OrderId,c.CustomerID, c.CustomerName
FROM CustomerInformation c 
WHERE c.customerID >= @MIN and c.Customerid <= @MAX;


The complete result

DECLARE @MAX INT, @MIN INT;

 SELECT @MAX= max(o.Customerid) , @MIN = MIN(o.Customerid)
 FROM
 OrderInformation o 
 WHERE o.OrderId = 155;

 ;WITH CTE as(
 SELECT c.CustomerID, c.CustomerName
 FROM CustomerInformation c 
 WHERE c.customerID >= @MIN 
 AND c.Customerid <= @MAX
 )

 SELECT o.OrderId, c.CustomerID, c.CustomerName
 FROM CTE c
 INNER JOIN  OrderInformation o
 on o.customerID = c.CustomerID
 where o.Orderid = 155;


Pushing down the predicates:

Execution plan of this query

RemoteQuery="SELECT [c].[CustomerID] [Col1005],[c].[CustomerName] [Col1007] FROM [CustomerInformation] [c] WHERE [c].[CustomerID]>=? AND [c].[CustomerID]<=?"


It depends on your data & parameters passed how much this could filter out on the remote table. YMMV

Test data

Customerinformation (remote db)

CREATE TABLE dbo.CustomerInformation (CustomerID int identity(1,1) , CustomerName nvarchar(255));

DECLARE @i int = 1;
WHILE @i < 20000
BEGIN
INSERT INTO CustomerInformation(CustomerName)
VALUES ('Frodo');

SET @i += 1;
END


On the database that will use the external table

CREATE TABLE OrderInformation(OrderId Int identity(1,1), CustomerId int,Ordervalue int);

SET NOCOUNT ON;
DECLARE @i int = 1;
WHILE @i < 200 
BEGIN
INSERT INTO OrderInformation(CustomerId,Ordervalue)
Values (@I,111);

set @i += 1;
END

CREATE EXTERNAL TABLE  [dbo].CustomerInformation(
    CustomerID [int],
CustomerName nvarchar(255))
    WITH  
(  
    DATA_SOURCE = Stackoverflow  
);

Code Snippets

SELECT o.OrderId, c.CustomerID, c.CustomerName
FROM OrderInformation o
JOIN CustomerInformation c ON o.CustomerID = c.CustomerID
WHERE o.OrderId = 155;
DECLARE @low int
DECLARE @high int
SET @low = 100
SET @high = 200

SELECT c.CustomerId, c.Name, count(OrderId) 
FROM remote_customers c
JOIN local_orders o
ON c.CustomerId = o.CustomerId 
WHERE c.CustomerId > @low and c.CustomerId < @high
GROUP BY c.CustomerId, c.Name;
SELECT o.OrderId,c.CustomerID, c.CustomerName
FROM CustomerInformation c 
JOIN OrderInformation o on o.CustomerId = c.CustomerID
WHERE o.customerID > 1 and o.Customerid < 10;
DECLARE @MAX INT, @MIN INT;

SELECT @MAX= max(o.Customerid) , @MIN = MIN(o.Customerid)
FROM OrderInformation o 
WHERE o.OrderId = 155;

SELECT o.OrderId,c.CustomerID, c.CustomerName
FROM CustomerInformation c 
WHERE c.customerID >= @MIN and c.Customerid <= @MAX;
DECLARE @MAX INT, @MIN INT;

 SELECT @MAX= max(o.Customerid) , @MIN = MIN(o.Customerid)
 FROM
 OrderInformation o 
 WHERE o.OrderId = 155;

 ;WITH CTE as(
 SELECT c.CustomerID, c.CustomerName
 FROM CustomerInformation c 
 WHERE c.customerID >= @MIN 
 AND c.Customerid <= @MAX
 )

 SELECT o.OrderId, c.CustomerID, c.CustomerName
 FROM CTE c
 INNER JOIN  OrderInformation o
 on o.customerID = c.CustomerID
 where o.Orderid = 155;

Context

StackExchange Database Administrators Q#226895, answer score: 2

Revisions (0)

No revisions yet.