HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlMinor

Columnstore: terrible execution plan - filter instead of seek

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
terribleseekcolumnstoreplaninsteadfilterexecution

Problem

See the execution plan on https://www.brentozar.com/pastetheplan/?id=SyLQIPDtF (SQL 2016 Enterprise)

  • I have a datawarehouse table peak_reporting_data which tracks the activity per day and hour and contains about 4 billion rows per months with an clustered columnstoreindex partitioned by date_key (one partition per day)



  • in the table peak_reporting_monats_peaks I have aggregated this table and ordered / ranked it by the month peak. There are 3 types of activity (kpi_type), for each I have up to 24h * 31 days = 744 rows per month with [monats_peak] ranked from 1 to 744. It has an unique index over month_key, kpi_type, monats_peak.



  • For the most active hour (per kpi_type) I need some more details, so I wrote the following query / view:



SELECT prmp.month_key
         , prd.*
      FROM mba.peak_reporting_monats_peaks        AS prmp
      LEFT LOOP JOIN (SELECT prd.date_key
                           , prd.hour
                           , prd.kpi_type
                           , prd.is_dr_brand
                           , prd.type_id_usage
                           , prd.product_identifier
                           , SUM(prd.kb) / 1024.0 / 1024.0 AS gb
                           , SUM(CAST(prd.sek AS BIGINT))  AS sek
                           , SUM(prd.anzahl)               AS anzahl
                           , SUM(prd.kb) / 439453125.0     AS gbits
                        FROM db1.mba.peak_reporting_data AS prd 
                       GROUP BY prd.date_key
                              , prd.kpi_type
                              , prd.is_dr_brand
                              , prd.hour
                              , prd.type_id_usage
                              , prd.product_identifier
                      ) AS prd
        ON prd.date_key  = prmp.date_key
       AND prd.hour      = prmp.hour
     WHERE prmp.monats_peak = 1
       AND prmp.month_key = 202107


Since there are exactly 3 rows with monats_peak = 1 per month in peak_reporting_mona

Solution

You will pretty much never want a column store scan on the inner side of a nested loops join.

The engine doesn't support batch mode in that scenario (batch mode column store scans can't be rewound). Notice the plan you uploaded shows the column store scan ran in row mode.

The separate Filter isn't particularly interesting. Non-sargable predicates can't always be pushed down to a child scan or seek. In this case, the engine won't combine dynamic partition elimination with residual predicates. It's an inefficiency, but not the main issue here.

Lose the join hints and let the optimizer pick the plan it wants. You will likely get something similar to the hash-hinted plan you uploaded, which ran in 1600ms. Yes, the whole column store is scanned but the bitmaps created at the hash join are very effective - reducing the 35B rows to 37M. The entire process completes in 1.5s, which isn't too bad at all. Note the batch mode bitmaps allow rowgroup-level elimination (including for read-ahead) and other trickery, so you don't end up reading 35B rows.

As an aside, your original nested loops plan did include partition elimination:

If you really want to pursue the partition-elimination loops-style strategy - and it may well be worth doing so - you will need to do a bit of extra work to get an efficient batch mode column store scan on the inner side of a nested loops join.

As I said, it is not possible to get this arrangement naturally. You need to 'hide' the inner-side operation in a separate execution scope to get (potentially parallel) batch mode execution for the repeated column store scans.

This can be achieved by:

  • Rewriting the left join as an apply



  • Putting the apply side in a table-valued (not inline!) function



The left join can be converted to an apply quite easily. The correlated parameters will be date_key and [hour]. You would then use APPLY to call the function for each row from peak_reporting_monats_peaks.

If you do this right, you will get partition elimination, parallelism, and the batch mode column store scan.

Quick example from the supplied code:

TVF:

CREATE FUNCTION dbo.F
(
    @date_key integer,
    @hour tinyint
)
RETURNS @T table
(
    kpi_type char(1) NOT NULL,
    is_dr_brand bit NULL,
    type_id_usage bigint NOT NULL,
    product_identifier bigint NOT NULL,
    gb decimal(19,6) NOT NULL,
    sek bigint NOT NULL,
    anzahl integer NOT NULL,
    gbits decimal (19,6) NOT NULL
)
WITH SCHEMABINDING
AS
BEGIN
    INSERT @T
    SELECT 
        prd.kpi_type,
        prd.is_dr_brand,
        prd.type_id_usage,
        prd.product_identifier,
        gb = SUM(prd.kb) / 1024.0 / 1024.0,
        sek = SUM(CAST(prd.sek AS BIGINT)),
        anzahl = SUM(prd.anzahl),
        gbits = SUM(prd.kb) / 439453125.0
    FROM mba.peak_reporting_data AS prd
    WHERE
        prd.date_key = @date_key
        AND prd.[hour] = @hour
    GROUP BY
        prd.kpi_type,
        prd.is_dr_brand,
        prd.type_id_usage,
        prd.product_identifier;

    RETURN;
END;


Query:

SELECT
    PRMP.month_key,
    PRMP.date_key,
    [PRMP].[hour],
    F.kpi_type,
    F.is_dr_brand,
    F.type_id_usage,
    F.product_identifier,
    F.gb,
    F.sek,
    F.anzahl,
    F.gbits
FROM mba.peak_reporting_monats_peaks AS PRMP
OUTER APPLY dbo.F(PRMP.date_key, PRMP.[hour]) AS F
WHERE
    PRMP.monats_peak = 1
    AND PRMP.month_key = 202107;


Plan:

TVF scan properties (batch mode, partition elimination)

Code Snippets

CREATE FUNCTION dbo.F
(
    @date_key integer,
    @hour tinyint
)
RETURNS @T table
(
    kpi_type char(1) NOT NULL,
    is_dr_brand bit NULL,
    type_id_usage bigint NOT NULL,
    product_identifier bigint NOT NULL,
    gb decimal(19,6) NOT NULL,
    sek bigint NOT NULL,
    anzahl integer NOT NULL,
    gbits decimal (19,6) NOT NULL
)
WITH SCHEMABINDING
AS
BEGIN
    INSERT @T
    SELECT 
        prd.kpi_type,
        prd.is_dr_brand,
        prd.type_id_usage,
        prd.product_identifier,
        gb = SUM(prd.kb) / 1024.0 / 1024.0,
        sek = SUM(CAST(prd.sek AS BIGINT)),
        anzahl = SUM(prd.anzahl),
        gbits = SUM(prd.kb) / 439453125.0
    FROM mba.peak_reporting_data AS prd
    WHERE
        prd.date_key = @date_key
        AND prd.[hour] = @hour
    GROUP BY
        prd.kpi_type,
        prd.is_dr_brand,
        prd.type_id_usage,
        prd.product_identifier;

    RETURN;
END;
SELECT
    PRMP.month_key,
    PRMP.date_key,
    [PRMP].[hour],
    F.kpi_type,
    F.is_dr_brand,
    F.type_id_usage,
    F.product_identifier,
    F.gb,
    F.sek,
    F.anzahl,
    F.gbits
FROM mba.peak_reporting_monats_peaks AS PRMP
OUTER APPLY dbo.F(PRMP.date_key, PRMP.[hour]) AS F
WHERE
    PRMP.monats_peak = 1
    AND PRMP.month_key = 202107;

Context

StackExchange Database Administrators Q#303459, answer score: 7

Revisions (0)

No revisions yet.