HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Can I optimize this MERGE statement?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
thiscanstatementmergeoptimize

Problem

I am attempting a single column merge between two tables. The first table (VisitorSession) has 40,000,000 rows. The second (ShoppingCart) has 9,000,000 rows.

In my development environment, the query takes just under 8 minutes. But in the production environment, it should take significantly less (much more powerful machine). However, I anticipate the query taking at least 2 minutes to run in production. I know that this query was causing timeouts for other developers in the development environment, which means it could easily cause timeouts for customers. Is there a safer and/or faster way to perform this query?

declare @dt datetime = cast(dateadd(month, -6, getdate()) as date);

merge ShoppingCart as TargetTable  -- 07:55 to complete in Dev
using 
(
  select * from -- 04:55 to run select, resulting in 12,727,927 rows in Dev
  (
    select
      visitorid  -- int, not null, foreign key
      ,useripaddress  -- varchar(55), null
      ,row_number() over 
      (partition by visitorid order by createdate desc) as [row]
    from VisitorSession (nolock)
    where UserIPAddress is not null
    and CreateDate > @dt   -- createdate is a datetime, not null
  ) as subTbl
  where subTbl.[row] = 1
) as SourceTable
on (TargetTable.VisitorID = SourceTable.VisitorID)  -- visitorid is not a primary key
when matched
  then update set
  TargetTable.UserIpAddress = SourceTable.UserIpAddress;

Solution

Personally, I don't like MERGE because there are many unresolved bugs:

  • This blog post (scroll down to "Other MERGE issues")



  • This answer by AlexKuznetsov



  • These posts on Paul White's blog



  • I published a cautionary tip about MERGE here, including references to several bugs that haven't been addressed.



  • And this list may not even be exhaustive - given the number of bugs discovered by this small set of users, I have to wonder how many haven't been discovered yet?



MERGE also gives a false sense of security in terms of optimistic concurrency and race conditions. See Dan Guzman's blog post for more details.

I'm not trying to be a fear-mongerer here. But I also find the syntax unintuitive and daunting. So I would only use it in cases where it's actually needed and I can prove that I'm not affected by any of the above issues. I don't know what I would ever possibly gain from using it for an operation that could only ever end in an UPDATE anyway.

So here is how I would do it instead, using syntax that is much more familiar to me:

;WITH s AS 
(
  SELECT VisitorID, UserIpAddress FROM 
  (
    SELECT 
      VisitorID,
      UserIpAddress,
      rn = ROW_NUMBER() OVER (PARTITION BY VisitorID ORDER BY CreateDate DESC)
    FROM dbo.VisitorSession
    WHERE UserIpAddress IS NOT NULL
    AND CreateDate > @dt
  ) AS x
  WHERE rn = 1
)
UPDATE c
  SET c.UserIpAddress = s.UserIpAddress
  FROM dbo.ShoppingCart AS c
  INNER JOIN s
  ON c.VisitorID = s.VisitorID;


You can also break this operation into chunks to reduce the impact on the transaction log which may, in turn, reduce the overall duration. I blogged about this here.

Here is how I would handle that approach:

DECLARE 
  @dt DATE = DATEADD(MONTH, -6, SYSDATETIME()), 
  @rc INT = 1;

WHILE @rc > 0
BEGIN

  BEGIN TRANSACTION;

  ;WITH s AS 
  (
    SELECT TOP (100000) VisitorID, UserIpAddress FROM
    (
      SELECT 
        VisitorID,
        UserIpAddress,
        rn = ROW_NUMBER() OVER (PARTITION BY VisitorID ORDER BY CreateDate DESC)
      FROM dbo.VisitorSession AS s
      WHERE UserIpAddress IS NOT NULL
      AND CreateDate > @dt
      AND EXISTS
      ( 
        SELECT 1 FROM dbo.ShoppingCart AS c
          WHERE c.VisitorID = s.VisitorID
          AND (c.UserIpAddress <> s.UserIpAddress
          OR c.UserIpAddress IS NULL)
      )
    ) AS x
    WHERE rn = 1
  )
  UPDATE c
    SET c.UserIpAddress = s.UserIpAddress
    FROM dbo.ShoppingCart AS c
    INNER JOIN s
    ON c.VisitorID = s.VisitorID;

  SET @rc = @@ROWCOUNT;

  COMMIT TRANSACTION;
END


Of course, as the blog post illustrates, you can gain just about as much time by making sure your log is big enough to handle the entire transaction without having to grow - most of the delay is likely coming from many, many autogrow operations accommodating your large transaction. Sadly until you've done the operation once it can be very difficult to try and guess how much transaction log you'll need...

Code Snippets

;WITH s AS 
(
  SELECT VisitorID, UserIpAddress FROM 
  (
    SELECT 
      VisitorID,
      UserIpAddress,
      rn = ROW_NUMBER() OVER (PARTITION BY VisitorID ORDER BY CreateDate DESC)
    FROM dbo.VisitorSession
    WHERE UserIpAddress IS NOT NULL
    AND CreateDate > @dt
  ) AS x
  WHERE rn = 1
)
UPDATE c
  SET c.UserIpAddress = s.UserIpAddress
  FROM dbo.ShoppingCart AS c
  INNER JOIN s
  ON c.VisitorID = s.VisitorID;
DECLARE 
  @dt DATE = DATEADD(MONTH, -6, SYSDATETIME()), 
  @rc INT = 1;

WHILE @rc > 0
BEGIN

  BEGIN TRANSACTION;

  ;WITH s AS 
  (
    SELECT TOP (100000) VisitorID, UserIpAddress FROM
    (
      SELECT 
        VisitorID,
        UserIpAddress,
        rn = ROW_NUMBER() OVER (PARTITION BY VisitorID ORDER BY CreateDate DESC)
      FROM dbo.VisitorSession AS s
      WHERE UserIpAddress IS NOT NULL
      AND CreateDate > @dt
      AND EXISTS
      ( 
        SELECT 1 FROM dbo.ShoppingCart AS c
          WHERE c.VisitorID = s.VisitorID
          AND (c.UserIpAddress <> s.UserIpAddress
          OR c.UserIpAddress IS NULL)
      )
    ) AS x
    WHERE rn = 1
  )
  UPDATE c
    SET c.UserIpAddress = s.UserIpAddress
    FROM dbo.ShoppingCart AS c
    INNER JOIN s
    ON c.VisitorID = s.VisitorID;

  SET @rc = @@ROWCOUNT;

  COMMIT TRANSACTION;
END

Context

StackExchange Database Administrators Q#44067, answer score: 17

Revisions (0)

No revisions yet.