patternpythonModerate

Possible optimizations for calculating squared euclidean distance

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

codereview numpy c stackoverflow performance python clustering

euclideandistancecalculatingpossiblesquaredforoptimizations

Problem

I need to do a few hundred million euclidean distance calculations every day in a Python project.

Here is what I started out with:

#!/usr/bin/python
import numpy as np

def euclidean_dist_square(x, y):
    diff = np.array(x) - np.array(y)
    return np.dot(diff, diff)

This is quite fast and I already dropped the sqrt calculation since I need to rank items only (nearest-neighbor search). It is still the bottleneck of the script though. Therefore I have written a C extension, which calculates the distance. The calculation is always done with 128-dimensional vectors.

#include "euclidean.h"
#include 

double euclidean(double x[128], double y[128])
{
    double Sum;
    for(int i=0;i<128;i++)
    {
        Sum = Sum + pow((x[i]-y[i]),2.0);
    }
    return Sum;
}

This is the relevant part of the extension, the complete code for the extension is here.
Now this gives a nice speedup in comparison to the numpy version.

But is there any way to speed this up further (this is my first C extension ever so I assume there is)? With the number of times this function is used every day, every microsecond would actually provide a benefit.

Some of you might suggest porting this completely from Python to another language, unfortunately this is a larger project and not an option.

Solution

You can decrease Python's overhead by parsing the args tuple and creating the return value manually.

Try changing these portions of your code:

/* Parse the input tuple */
if (!PyArg_ParseTuple(args, "OO", &x_obj, &y_obj))
    return NULL;

>

/* Build the output tuple */
PyObject *ret = Py_BuildValue("d", value);
return ret;

to this (untested) code:

/* Parse the input tuple manually */
if (PyTuple_GET_SIZE(args) != 2) {
    PyErr_SetString(PyExc_TypeError, "requires 2 arguments");
    return NULL;
}

x_obj = PyTuple_GET_ITEM(args, 0);
y_obj = PyTuple_GET_ITEM(args, 1);

>

/* Return double as Python float */
return PyFloat_FromDouble(value);

Updates

Without any changes, your code takes ~340ns on my computer. Changing the argument processing and how the return value is created decreases the running time to ~295ns. Replacing pow() with multiplication didn't change the performance with my compiler (GCC 4.8.2). Changing the exponent to a non-integral value significantly increased the running time so I assume constant integer exponents are optimized by the compiler.

Next, I simplified the conversion of the numpy arrays by checking that the objects passed are numpy arrays. This avoids creating new objects. I also assumed that the data type is a "float64". Extra checks could be added if needed. With these changes, the running time is ~220ns. Code is below:

static PyObject *euclidean_euclidean(PyObject *self, PyObject *args)
{
    PyObject *x_obj, *y_obj;

    if (PyTuple_GET_SIZE(args) != 2) {
        PyErr_SetString(PyExc_TypeError, "requires 2 arguments");
        return NULL;
    }

    x_obj = PyTuple_GET_ITEM(args, 0);
    y_obj = PyTuple_GET_ITEM(args, 1);

    if (!PyArray_Check(x_obj) || !PyArray_Check(y_obj)) {
        PyErr_SetString(PyExc_TypeError, "requires array arguments");
        return NULL;
    }

    /* Get pointers to the data as C-types. */
    double *x    = (double*)PyArray_DATA(x_obj);
    double *y    = (double*)PyArray_DATA(y_obj);

    double value = euclidean(x, y);

    if (value < 0.0) {
        PyErr_SetString(PyExc_RuntimeError,
                    "Euclidean returned an impossible value.");
        return NULL;
    }

    /* Build the output tuple */
    return PyFloat_FromDouble(value);
}

If I remove the call to euclidean(), the running time is ~75ns. If I remove all the the argument parsing and just return the value 0.0, the running time is ~72ns. I did a few more tests to confirm running times and Python's overhead is consistently ~75ns and the euclidean() function has running time of ~150ns.

Code Snippets

/* Parse the input tuple */
if (!PyArg_ParseTuple(args, "OO", &x_obj, &y_obj))
    return NULL;

<< code snipped >>


/* Build the output tuple */
PyObject *ret = Py_BuildValue("d", value);
return ret;

/* Parse the input tuple manually */
if (PyTuple_GET_SIZE(args) != 2) {
    PyErr_SetString(PyExc_TypeError, "requires 2 arguments");
    return NULL;
}

x_obj = PyTuple_GET_ITEM(args, 0);
y_obj = PyTuple_GET_ITEM(args, 1);

<< code snipped >>

/* Return double as Python float */
return PyFloat_FromDouble(value);

static PyObject *euclidean_euclidean(PyObject *self, PyObject *args)
{
    PyObject *x_obj, *y_obj;

    if (PyTuple_GET_SIZE(args) != 2) {
        PyErr_SetString(PyExc_TypeError, "requires 2 arguments");
        return NULL;
    }

    x_obj = PyTuple_GET_ITEM(args, 0);
    y_obj = PyTuple_GET_ITEM(args, 1);

    if (!PyArray_Check(x_obj) || !PyArray_Check(y_obj)) {
        PyErr_SetString(PyExc_TypeError, "requires array arguments");
        return NULL;
    }

    /* Get pointers to the data as C-types. */
    double *x    = (double*)PyArray_DATA(x_obj);
    double *y    = (double*)PyArray_DATA(y_obj);

    double value = euclidean(x, y);

    if (value < 0.0) {
        PyErr_SetString(PyExc_RuntimeError,
                    "Euclidean returned an impossible value.");
        return NULL;
    }

    /* Build the output tuple */
    return PyFloat_FromDouble(value);
}

Context

StackExchange Code Review Q#52218, answer score: 14

Revisions (0)

No revisions yet.