
Bilinear scaling using SSE2 on Core 2 CPUs

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Problem

I am looking for help improving the performance of this SSE2 bilinear scaling code on Core 2 CPUs.

On my Atom N270 and on an i7, this code is about 2x faster than the MMX code. On Core 2 CPUs, however, it is only as fast as the MMX code.

void ConversionProcess::convert_SSE2(BBitmap *from, BBitmap *to)
{
    uint32 fromBPR, toBPR, fromBPRDIV4, x, y, yr, xr;

    ULLint start = rdtsc();
    ULLint stop;
    if (from && to) {
        uint32 width, height;
        width = from->Bounds().IntegerWidth() + 1;
        height = from->Bounds().IntegerHeight() + 1;

        uint32 toWidth, toHeight;
        toWidth = to->Bounds().IntegerWidth() + 1;
        toHeight = to->Bounds().IntegerHeight() + 1;

        fromBPR = from->BytesPerRow();
        fromBPRDIV4 = fromBPR >> 2;
        toBPR = to->BytesPerRow();

        uint32 x_ratio = ((width-1) Bits();
        uint8 *fromPtr1 = (uint8 *)from->Bits();
        uint8 *fromPtr2 = (uint8 *)from->Bits() + fromBPR;

        struct FilterInfo {
            uint16 one_minus_diff;      // one minus diff
            uint16 diff;                // diff value used to calculate the weights used to average the pixels
            uint16 one_minus_diff_rep;  // one minus diff repeated
            uint16 diff_rep;            // diff value used to calculate the weights used to average the pixels repeated
        };

        FilterInfo *xWeights = (FilterInfo *)memalign(16, toWidth * 8);
        FilterInfo *yWeights = (FilterInfo *)memalign(16, toHeight * 8);
        uint32 *xIndexes = (uint32 *)memalign(16, (toWidth+2) * 4);     // will overread by 2 indexes
        uint32 *yIndexes = (uint32 *)memalign(16, toHeight * 4);

        x = 0;
        for (uint32 j=0;j > 7;
            xWeights[j].diff = x - (xr > 7;
            yWeights[j].diff = y - (yr

;
; Copyright (C) 2011 David McPaul
;
; All rights reserved. Distributed under the terms of the MIT License.
;

; A rather unoptimised bilinear scaler

%macro cglobal 1
global _%1
%define %1 _%1

Solution

This is probably not what you wanted to hear, but my suggestion would be to rewrite the code using SSE2 intrinsics and see what GCC (or MSVC) can do with instruction scheduling and loop unrolling.
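To give an idea of what the intrinsics version might look like, here is a minimal sketch of blending one output pixel from its four neighbours. It is not a port of the posted assembly: the function name `bilinear_pixel_sse2`, its parameters, and the choice of 7-bit fixed-point weights in the range 0 to 128 (suggested by the `>> 7` shifts in the question) are all assumptions made for illustration.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Hypothetical helper, not taken from the original code.
// Blends four 32-bit pixels (4 x 8-bit channels each) with 7-bit
// fixed-point weights: fx and fy must lie in [0, 128].
static inline uint32_t bilinear_pixel_sse2(uint32_t p00, uint32_t p01,
                                           uint32_t p10, uint32_t p11,
                                           unsigned fx, unsigned fy)
{
    assert(fx <= 128 && fy <= 128);
    const __m128i zero = _mm_setzero_si128();

    // Widen each row to 16-bit lanes: lanes 0-3 = left pixel, 4-7 = right pixel.
    __m128i top = _mm_unpacklo_epi8(
        _mm_set_epi32(0, 0, static_cast<int>(p01), static_cast<int>(p00)), zero);
    __m128i bot = _mm_unpacklo_epi8(
        _mm_set_epi32(0, 0, static_cast<int>(p11), static_cast<int>(p10)), zero);

    // Vertical blend: (top * (128 - fy) + bot * fy) >> 7.
    // The sum never exceeds 255 * 128, so it fits in a 16-bit lane.
    const __m128i wy0 = _mm_set1_epi16(static_cast<short>(128 - fy));
    const __m128i wy1 = _mm_set1_epi16(static_cast<short>(fy));
    __m128i col = _mm_srli_epi16(
        _mm_add_epi16(_mm_mullo_epi16(top, wy0), _mm_mullo_epi16(bot, wy1)), 7);

    // Horizontal blend: weight the left half by (128 - fx), the right half by fx,
    // then add the upper 64 bits onto the lower 64 bits.
    const short wl = static_cast<short>(128 - fx);
    const short wr = static_cast<short>(fx);
    const __m128i wx = _mm_set_epi16(wr, wr, wr, wr, wl, wl, wl, wl);
    __m128i prod = _mm_mullo_epi16(col, wx);
    __m128i sum  = _mm_srli_epi16(_mm_add_epi16(prod, _mm_srli_si128(prod, 8)), 7);

    // Narrow back to 8-bit channels and return the low 32 bits.
    return static_cast<uint32_t>(_mm_cvtsi128_si32(_mm_packus_epi16(sum, zero)));
}
```

A real scaler would process several output pixels per iteration and read the weights from precomputed tables such as `xWeights` and `yWeights`, but even a simple form like this lets the compiler handle register allocation and scheduling per target CPU.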

That said, you do not appear to be doing any prefetching. Why is the prefetchnta line commented out? Did you try with different values?
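For experimenting with prefetching from intrinsics, something along these lines could serve as a starting point. This is only a sketch: `blend_rows_prefetch` and the distance `kPrefetchAhead` are hypothetical names, the 256-byte distance is a placeholder rather than a measured optimum, and the loop only shows the vertical blend of two source rows.

```cpp
#include <emmintrin.h>  // SSE2; also provides _mm_prefetch and _MM_HINT_NTA
#include <cstddef>
#include <cstdint>

// Hypothetical tuning knob: how far ahead (in bytes) to prefetch.
// The best value depends on the CPU and has to be found by measurement.
constexpr std::size_t kPrefetchAhead = 256;

// Hypothetical helper: dst[i] = (row0[i] * (128 - fy) + row1[i] * fy) >> 7
// for 8-bit samples, with fy in [0, 128]. Only whole groups of 8 samples
// are processed; any remainder is left untouched.
void blend_rows_prefetch(const uint8_t* row0, const uint8_t* row1,
                         uint8_t* dst, std::size_t count, unsigned fy)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i w0 = _mm_set1_epi16(static_cast<short>(128 - fy));
    const __m128i w1 = _mm_set1_epi16(static_cast<short>(fy));

    for (std::size_t x = 0; x + 8 <= count; x += 8) {
        // Issue one non-temporal prefetch hint per 64-byte chunk of each row,
        // and only for addresses that are still inside the rows.
        if ((x & 63) == 0 && x + kPrefetchAhead < count) {
            _mm_prefetch(reinterpret_cast<const char*>(row0 + x + kPrefetchAhead),
                         _MM_HINT_NTA);
            _mm_prefetch(reinterpret_cast<const char*>(row1 + x + kPrefetchAhead),
                         _MM_HINT_NTA);
        }

        // Load 8 samples from each row and widen them to 16-bit lanes.
        __m128i a = _mm_unpacklo_epi8(
            _mm_loadl_epi64(reinterpret_cast<const __m128i*>(row0 + x)), zero);
        __m128i b = _mm_unpacklo_epi8(
            _mm_loadl_epi64(reinterpret_cast<const __m128i*>(row1 + x)), zero);

        // Weighted sum stays below 2^15, so 16-bit arithmetic is sufficient.
        __m128i r = _mm_srli_epi16(
            _mm_add_epi16(_mm_mullo_epi16(a, w0), _mm_mullo_epi16(b, w1)), 7);

        // Narrow to 8 bits and store the 8 results.
        _mm_storel_epi64(reinterpret_cast<__m128i*>(dst + x),
                         _mm_packus_epi16(r, zero));
    }
}
```

Timing the loop with several values of `kPrefetchAhead`, and with the prefetch removed entirely, should show whether the hint helps or hurts on a Core 2.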

Context

StackExchange Code Review Q#1006, answer score: 2
