patterncppMinor

Solving a one-dimensional Euler equation for fluid dynamics

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

solvingoneeulerfordynamicsdimensionalequationfluid

Problem

I just wrote a program to solve one dimensional Euler equation for fluid dynamics. Attached is a snippet of the vtune profile result for one of the function.
There are a few things which looks weird to me. For example, difference in line 72 takes more time than division in line 70. Also line 70 takes more time than 69 which include a sqrt! Can anyone explain the discrepancy? Also suggestion to improve (in terms of performance) this piece of code is welcome.

This particular routine is called many times in the code so I want this to be as fast as possible.

Additional information:

-
This module is part of a multi-threaded program (using OpenMO).

-
I am using Intel C++ compiler, icpc, with -O3.

-
My ultimate aim is to run this code on Xeon-Phi.

Function

```
__declspec(vector) void roeflux(double rhol, double rhor, double ul, double ur, double pl, double pr,
double f_0, double f_1, double *f_2){
double el = pl/(GAMMA_M) + 0.5frholul*ul;
double er = pr/(GAMMA_M) + 0.5frhorur*ur;
double hl = (el + pl)/rhol;
double hr = (er + pr)/rhor;

double sqrtrhol = sqrt(rhol);
double sqrtrhor = sqrt(rhor);
double den_inverse = 1/(sqrtrhol + sqrtrhor);
double uavg = (sqrtrholul + sqrtrhorur)*den_inverse;
double havg = (sqrtrholhl + sqrtrhorhr)*den_inverse;
double cavg = sqrt(GAMMA_M(havg - 0.5fuavg*uavg));
double cavg_inverse = 1.0f/cavg;

double d1 = rhor - rhol;
double d2 = rhorur - rholul;
double d3 = er - el;

double alpha_2 = GAMMA_M((havg - uavguavg)d1 + uavgd2 - d3)cavg_inversecavg_inverse;
double alpha_3 = 0.5f(d2 + (cavg - uavg)d1 - cavgalpha_2)cavg_inverse;
double alpha_1 = d1 - alpha_2 - alpha_3;

double lambda_1 = fabs(uavg - cavg);
double lambda_2 = fabs(uavg);
double lambda_3 = fabs(uavg + cavg);

double f1 = lambda_1alpha_1 + lambda_2alpha_2 + lambda_3*alpha_3;
double f2 = lambda_1alpha_1(uavg-cavg) + lambda_2alpha_2uavg + lambda_3alpha_3(uavg+cavg);
double f3 = l

Solution

I'm going to assume your target machine is a super scalar CPU with some variation of Tomasulo architecture. I'm also going to assume that there is no significant cache miss ratio at cause for your performance (you should verify this).

When compiling optimised code, the compiler will rearrange operations to reduce stalls due to availability of input data to the instructions. Hence you cannot completely trust the line numbers in the profiler. As the CPU is super scalar some instructions may execute and complete out of order (but are committed in order).

I see that you are computing the reciprocal of cavg and den to save yourself two divisions. It is possible (but doubtful) that this could cause a delay stage as the two dependent operations could end up stalling until the result is available instead of executing in parallel on different execution units.

Contrived example:

double a_inv = 1.0/a;
double b = x*a_inv;
double c = y*a_inv;

If executed as is this would require two execution stages. The two multiplications would have to wait until the division is completed before they can start. The total delay is 1 div + 1 mul (the two mul can execute in parallel on the cpu). Compare this to:

double b = x/a;
double c = y/a;

Execution can begin immediately and the total delay is 1 div as the divs can run in parallel. This is assuming the compiler doesn't do this behind your back already (I'm not sure if it is allowed to by the IEEE rules) . It also depends on if there is enough instructions to fill the execution units while calculating the reciprocal. Any way it's worth trying.

Apart from that I can't think of anything that would make the function as it is any faster.

If you really want speed you need to use SIMD instructions in addition to using many thread. But that requires you to restructure your data from "array of structs" to "structs of arrays".

Edit: I've had an opportunity to take a closer look and there are some things that you can do to improve performance.

This calculates cavg* 1/cavg at two occasions:

double alpha_3 = 0.5f(d2 + (cavg - uavg)d1 - cavgalpha_2)cavg_inverse;

which can be re-written as:

double alpha_3 = 0.5f*((d2 - uavg*d1)*cavg_inverse + d1 - alpha_2);

This reduces from 4 mult + 3 add to 3 mult + 3 add.

Here you need to make sure that the compiler actually emits fabs instruction and not cmp+cmov:

double lambda_1 =  fabs(uavg - cavg);
double lambda_2 =  fabs(uavg);
double lambda_3 =  fabs(uavg + cavg);

Here, the compiler will perform CSE so lambda_x*alpha_x products will already be available and we can rewrite:

double f1 = lambda_1*alpha_1 + lambda_2*alpha_2 + lambda_3*alpha_3;
double f2 = lambda_1*alpha_1*(uavg-cavg) + lambda_2*alpha_2*uavg + lambda_3*alpha_3*(uavg+cavg);
double f3 = lambda_1*alpha_1*(havg-cavg*uavg) + 0.5f*lambda_2*alpha_2*uavg*uavg + lambda_3*alpha_3*(havg+cavg*uavg);

as:

double la1 = lambda_1*alpha_1; // For readability, compiler does this anyway
double la2 = lambda_2*alpha_2;
double la3 = lambda_3*alpha_3;
double f1 = la1 + la2 + la3;
double f2 = uavg*f1 - cavg*(la1-la3);
double f3 = la1*(havg-cavg*uavg) + 0.5f*la2*uavg*uavg + la3*(havg+cavg*uavg);

which reduces the expression of f2 from 3 mult + 4 add to 2 mult + 2 add as the results from f1 can be re-used. The compiler should put the calculation of uavg*f1 towards the end of the function and calculate f3 while f1 computes.

There could be more of the same type of optimizations you can do by rearranging operations but unfortunately I don't have time to give a more thorough look.

Disclaimer: The compiler may or may not do these optimizations depending on the FP precision setting. For example -ffast-math on GCC would probably do these, but only if -ffast-math as IEEE is not associative due to round off errors. There is an equivalent option for ICC but I don't know what it's called. But be advised this option can break stuff, like NAN handling/checking.

Code Snippets

double a_inv = 1.0/a;
double b = x*a_inv;
double c = y*a_inv;

double b = x/a;
double c = y/a;

double alpha_3 = 0.5f*((d2 - uavg*d1)*cavg_inverse + d1 - alpha_2);

double lambda_1 =  fabs(uavg - cavg);
double lambda_2 =  fabs(uavg);
double lambda_3 =  fabs(uavg + cavg);

double f1 = lambda_1*alpha_1 + lambda_2*alpha_2 + lambda_3*alpha_3;
double f2 = lambda_1*alpha_1*(uavg-cavg) + lambda_2*alpha_2*uavg + lambda_3*alpha_3*(uavg+cavg);
double f3 = lambda_1*alpha_1*(havg-cavg*uavg) + 0.5f*lambda_2*alpha_2*uavg*uavg + lambda_3*alpha_3*(havg+cavg*uavg);

Context

StackExchange Code Review Q#63469, answer score: 4

Revisions (0)

No revisions yet.