patterncMinor
FIR filters in C
Viewed 0 times
firfiltersstackoverflow
Problem
I wrote 2 filters in C for the Altera DE2 Nios II FPGA, one floating-point and one fixed-point. I've verified that they perform correctly and now I wonder if you can give examples for improvement or optimization? I'll reduce the C library to a small library and turn on optimization, and perhaps you can suggest how to do other improvements?
The floating-point program:
The fixed-point program:
```
#include
#include "system.h"
#include "alt_types.h"
#include
#include
#include
#define TIME 1
signed char input[4]; / The 4 most recent input values /
char get_q7( void );
void put_q7( char );
void firFixed(signed char input[4]);
const int c0 = (0.0299 128 + 0.5); / Converting from float to Q7 by multiplying by 2^n i.e. 128 = 2^7 since we use Q7 and round to the nearest integer by multiplying with 0.5. The fraction will be truncated. */
const int c1 = (0.4701 * 128 + 0
The floating-point program:
#include
#include "system.h"
#include "alt_types.h"
#include
#include
#include
float microseconds(int ticks)
{
return (float) 1000000 * (float) ticks / (float) alt_timestamp_freq();
}
void start_measurement()
{
/* Flush caches */
alt_dcache_flush_all();
alt_icache_flush_all();
/* Measure */
alt_timestamp_start();
time_1 = alt_timestamp();
}
void stop_measurement()
{
time_2 = alt_timestamp();
ticks = time_2 - time_1;
}
float floatFIR(float inVal, float* x, float* coef, int len)
{
float y = 0.0;
int i;
start_measurement();
for (i = (len-1) ; i > 0 ; i--)
{
x[i] = x[i-1];
y = y + (coef[i] * x[i]);
}
x[0] = inVal;
y = y + (coef[0] * x[0]);
stop_measurement();
printf("%5.2f us", (float) microseconds(ticks - timer_overhead));
printf("(%d ticks)\n", (int) (ticks - timer_overhead));
printf("Sum: %f\n", y);
return y;
}
int main(int argc, char** argv)
{
/* Calculate Timer Overhead */
// Average of 10 measurements */
int i;
timer_overhead = 0;
for (i = 0; i 0)
{
y = floatFIR(inVal, x, coef, 4);
}
return 0;
}The fixed-point program:
```
#include
#include "system.h"
#include "alt_types.h"
#include
#include
#include
#define TIME 1
signed char input[4]; / The 4 most recent input values /
char get_q7( void );
void put_q7( char );
void firFixed(signed char input[4]);
const int c0 = (0.0299 128 + 0.5); / Converting from float to Q7 by multiplying by 2^n i.e. 128 = 2^7 since we use Q7 and round to the nearest integer by multiplying with 0.5. The fraction will be truncated. */
const int c1 = (0.4701 * 128 + 0
Solution
A good first approach is always to look for a library call that already does what you need and that was optimized for your platform. For a FIR filter, that might e.g. be
For a hand written approach, the key issues are picking the right data types (as discussed by @WilliamMorris) and exploiting the parallelism of the target platform. Since you’re targeting an FPGA, you even get to pick the level of parallelism. On the other hand, FPGAs are not necessarily great with arbitrary loops, so I would take a very close look at whether you can get away with using a constant number of coefficients.
Once you’ve decided on an appropriate level of parallelism, break up the data dependency in your loop. Right now, every iteration needs to wait for the previous iteration to complete. If you, e.g., want to have 4-way parallelism, something like this might work (assuming the # of coefficients is divisible by 4):
cblas_sdot in the BLAS library.For a hand written approach, the key issues are picking the right data types (as discussed by @WilliamMorris) and exploiting the parallelism of the target platform. Since you’re targeting an FPGA, you even get to pick the level of parallelism. On the other hand, FPGAs are not necessarily great with arbitrary loops, so I would take a very close look at whether you can get away with using a constant number of coefficients.
Once you’ve decided on an appropriate level of parallelism, break up the data dependency in your loop. Right now, every iteration needs to wait for the previous iteration to complete. If you, e.g., want to have 4-way parallelism, something like this might work (assuming the # of coefficients is divisible by 4):
float y0=0.0f, y1=0.0f, y2=0.0f, y3=0.0f;
memmov(&x[1], &x[0], (len-1)*sizeof(x[0]));
x[0] = inVal;
for (int i=len; i>0; i-=4) {
y0 += x[i-4]*coeff[i-4];
y1 += x[i-3]*coeff[i-3];
y2 += x[i-2]*coeff[i-2];
y3 += x[i-1]*coeff[i-1];
}
y0 += y2;
y1 += y3;
return y0+y1;Code Snippets
float y0=0.0f, y1=0.0f, y2=0.0f, y3=0.0f;
memmov(&x[1], &x[0], (len-1)*sizeof(x[0]));
x[0] = inVal;
for (int i=len; i>0; i-=4) {
y0 += x[i-4]*coeff[i-4];
y1 += x[i-3]*coeff[i-3];
y2 += x[i-2]*coeff[i-2];
y3 += x[i-1]*coeff[i-1];
}
y0 += y2;
y1 += y3;
return y0+y1;Context
StackExchange Code Review Q#32444, answer score: 3
Revisions (0)
No revisions yet.