Bilinear scaling using SSE2 on Core 2 CPUs
Problem
I am looking for some help with improving this bilinear scaling SSE2 code on Core 2 CPUs.
On my Atom N270 and on an i7, this code is about 2x faster than the MMX code. On Core 2 CPUs, however, it is only as fast as the MMX code.
; Copyright (C) 2011 David McPaul
;
; All rights reserved. Distributed under the terms of the MIT License.
;
; A rather unoptimised bilinear scaler
%macro cglobal 1
global _%1
%define %1 _%1
void ConversionProcess::convert_SSE2(BBitmap *from, BBitmap *to)
{
uint32 fromBPR, toBPR, fromBPRDIV4, x, y, yr, xr;
ULLint start = rdtsc();
ULLint stop;
if (from && to) {
uint32 width, height;
width = from->Bounds().IntegerWidth() + 1;
height = from->Bounds().IntegerHeight() + 1;
uint32 toWidth, toHeight;
toWidth = to->Bounds().IntegerWidth() + 1;
toHeight = to->Bounds().IntegerHeight() + 1;
fromBPR = from->BytesPerRow();
fromBPRDIV4 = fromBPR >> 2;
toBPR = to->BytesPerRow();
uint32 x_ratio = ((width-1) Bits();
uint8 *fromPtr1 = (uint8 *)from->Bits();
uint8 *fromPtr2 = (uint8 *)from->Bits() + fromBPR;
struct FilterInfo {
uint16 one_minus_diff; // one minus diff
uint16 diff; // diff value used to calculate the weights used to average the pixels
uint16 one_minus_diff_rep; // one minus diff repeated
uint16 diff_rep; // diff value used to calculate the weights used to average the pixels repeated
};
FilterInfo *xWeights = (FilterInfo *)memalign(16, toWidth * 8);
FilterInfo *yWeights = (FilterInfo *)memalign(16, toHeight * 8);
uint32 *xIndexes = (uint32 *)memalign(16, (toWidth+2) * 4); // will overread by 2 indexes
uint32 *yIndexes = (uint32 *)memalign(16, toHeight * 4);
x = 0;
for (uint32 j=0;j > 7;
xWeights[j].diff = x - (xr > 7;
yWeights[j].diff = y - (yr
Solution
Probably not what you wished to hear, but honestly, my suggestion would be to rewrite the code using SSE2 intrinsics and see what GCC (or MSVC) is able to do here with scheduling and loop unrolling.
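As a rough illustration of what such a rewrite could look like, here is a minimal sketch that covers only the vertical blend of two source rows. The function name blend_rows_sse2, its parameters, and the 7-bit weight range (0 to 128) are assumptions made for this example and are not taken from the original assembly; the horizontal pass and the index tables are left out.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of the original code: blends two rows of
// 32-bit pixels channel by channel.
// Assumption: diff is a 7-bit fixed-point weight in [0, 128], where
// 0 selects row1 and 128 selects row2.
void blend_rows_sse2(const std::uint8_t* row1, const std::uint8_t* row2,
                     std::uint8_t* dst, std::size_t pixels, std::uint16_t diff)
{
    assert(diff <= 128);

    const __m128i zero = _mm_setzero_si128();
    const __m128i w1 = _mm_set1_epi16(static_cast<short>(128 - diff));
    const __m128i w2 = _mm_set1_epi16(static_cast<short>(diff));

    std::size_t i = 0;
    for (; i + 4 <= pixels; i += 4) {
        // Unaligned loads, so the sketch makes no assumption about row alignment.
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(row1 + i * 4));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(row2 + i * 4));

        // Widen the 8-bit channels to 16 bits.
        __m128i aLo = _mm_unpacklo_epi8(a, zero);
        __m128i aHi = _mm_unpackhi_epi8(a, zero);
        __m128i bLo = _mm_unpacklo_epi8(b, zero);
        __m128i bHi = _mm_unpackhi_epi8(b, zero);

        // a * (128 - diff) + b * diff is at most 255 * 128 = 32640,
        // so the 16-bit lanes cannot overflow.
        __m128i lo = _mm_add_epi16(_mm_mullo_epi16(aLo, w1), _mm_mullo_epi16(bLo, w2));
        __m128i hi = _mm_add_epi16(_mm_mullo_epi16(aHi, w1), _mm_mullo_epi16(bHi, w2));

        // Drop the 7 fractional bits and pack back down to 8-bit channels.
        lo = _mm_srli_epi16(lo, 7);
        hi = _mm_srli_epi16(hi, 7);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i * 4),
                         _mm_packus_epi16(lo, hi));
    }

    // Scalar tail for the remaining (pixels % 4) pixels.
    for (std::size_t k = i * 4; k < pixels * 4; ++k) {
        dst[k] = static_cast<std::uint8_t>((row1[k] * (128 - diff) + row2[k] * diff) >> 7);
    }
}
```

Written this way, register allocation, instruction scheduling and unrolling are left to the compiler, which makes it easier to compare what it generates for Core 2 against the hand-written assembly.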
That said, you do not appear to be doing any prefetching. Why is the prefetchnta line commented out? Did you try with different values?
Context
StackExchange Code Review Q#1006, answer score: 2