HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaModerate

Algorithms to find statistical information of an array

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
arrayalgorithmsstatisticalfindinformation

Problem

I was working on a program that creates an array, then gives statistical information about that array such as std dev, median, mode average etc.

I have written a method for each statistical value, however I am not too sure that they are efficient, particularly the getMode() method.

I would be really happy if you helped me to make these methods memory efficient and fast as possible. Thanks, here are my methods:

private static int getMode(int[] x) {

    int max_mode = 0;
    int count = 0, count_max = 0;
    for (int i = 0; i = count_max) {
            count_max = count;
            max_mode = x[i];
        }
        count = 0;
    }
    return max_mode;
}

private static double getMedian(int[] x) {
    int[] sorted = x;
    sort(sorted);
    double median;
    if (sorted.length % 2 == 0) {
        median = sorted[(sorted.length / 2 - 1 + sorted.length / 2) / 2];
        median += 0.5;
    } else {
        median = sorted[sorted.length / 2];
    }

    return median;

}

private static int[] sort(int[] x) {
    int temp;
    int[] sorted = x;
    for (int i = 1; i  x[j]) {
                tempmax = x[i];
            }

        }
        if (tempmax > realmax) {
            realmax = tempmax;
        }
        tempmax = 0;
    }
    return realmax;
}

private static int getMin(int[] x) {
    int templow = 0, reallow = 0;
    for (int i = 0; i < x.length; i++) {
        for (int j = 0; j < x.length; j++) {
            if (x[i] < x[j]) {
                templow = x[i];
            }
        }
        if (templow <= reallow) {
            reallow = templow;
        }
        templow = 0;
    }
    return reallow;
}

private static double getStdDev(int[] x) {
    double sum = 0;
    double avg = getAverage(x);
    for (int i = 0; i < x.length; i++) {
        sum += (x[i] - avg) * (x[i] - avg);
    }
    return Math.sqrt(sum / (x.length - 1));
}

Solution

Glaring Deficiencies

-
Let me venture here that your getMedian method is slower than your getMode method. Thanks to your sort. You use a naive bubble-sort, which is \$O(n^2)\$. It can be done using \$O(n\log(n))\$ using quicksort (for practical input), mergesort, heapsort, etc. However, you can get the best of all worlds by just using the static method sort() from java.util.Arrays. It will get you an implementation of mergesort or Timsort, both very fast compared to bubble sort.

-
About getAverage. You recalculate the sum in a loop where you already have a getSum method. Use that.

-
You have really inconsistent return types in some places, especially getSum and getMedian.

-
In getMedian, int[] sorted = x; probably doesn't do what you want it to do, which is create a copy of the argument array which you can sort. Try using int[] sorted = Arrays.copyOf(x, x.length); instead, which will do the required deep copy instead of a reference assignment.

I disagree with @coderodde's approach. You should get what you ask for when computing statistics. You should not incur costs which you do not deserve. That is, if you want to get the average of all the elements in an array, you should not have to pay the overhead of sorting it and calculating the median.

Algorithmic improvements

-
In getMin and getMax, set the default value of the accumulator variable to the first element of the array, i.e., arr[0], and start the loop at 1 instead of 0. It makes more sense. Or, better, use Streams, they have these methods in-built.

-
Use @coderodde's idea for calculating mode, as that is \$O(n+n\log(n)) = O(n\log(n))\$, whereas yours is currently \$O(n^2)\$ in the worst case. That is, his way is faster.

Note, however, that your mode calculation algorithm is a bit problematic; for an input set {1, 1, 2, 4, 4}, your code will return 4, though for a multimodal set it would be mathematically appropriate to return the first mode, i.e., 1 here.

I have managed to fashion a faster approach to the problem, involving only partially sorting and partitioning the array, however, this works correctly only for unimodal sets (and works in a mathematically correct fashion for multimodal sets). I use the naive approach to get all the modes of a multimodal set.

In my code it appears that I need \$O(n)\$ additional space due to the Set used for maintaining multimodal data, but for unimodal data, this is unnecessary and it can be made to work in \$O(n\log(n))\$ time and \$O(1)\$ additional space.

I make both pieces of code return a double[] (in the former case, a single-element array even for multimodal sets), for API consistency.

-
Use a selection algorithm, such as the Floyd-Rivest Algorithm for calculating the median in linear time, instead of the best-case \$O(n\log(n))\$ when using sorting.

-
Use a numerically stable one-pass algorithm like Welford's Method to get the variance / standard deviation.

Ideas

-
You should name getAverage to getArithmeticMean, as that is the more appropriate statistical term here.

-
As regards the above, you could try implementing a getGeometricMean, which would be calculated as follows:

\$GM = \sqrt[n]{\prod\limits_{i=1}^n x_i}\$

Or, the \$n\$th root of the product of \$n\$ numbers.

Suggestions

-
Java 8's Streams would help a lot here.

-
Declare your method parameters final. That way, the caller will not have to worry about you changing their data out from under their feet (it's not especially helpful with arrays, but even so).

-
Try using the for-each loop where you don't need explicit indexing, rather just the elements in sequence, e.g. in getMode.

-
Try to fail-fast: throw exceptions when you encounter exceptional conditions, like being supplied an empty array (java.util.NoSuchElementException seems to be a good fit for this use case).

New Code:

Note that this is not one of the best examples when it comes to coding in Java, but take a look at it anyway (there are comments when the going gets tough). There are usually 2 versions of each method, one utilizing your original idea, the other utilizing a faster algorithm. The naive versions have UnOpt suffixed to their names. Except in the case of getMode as noted above, both versions are semantically equivalent. In case of the Floyd-Rivest algorithm used in select, there are some magic numbers, you can look at the linked Wikipedia article for details.

```
import java.util.*;
import java.util.stream.*;
import static java.lang.Math.*;
public class Stats{
private static double[] unbox(Collection collection){
return collection.stream().mapToDouble(Double::doubleValue).toArray();
}

public static double[] getModesUnOpt(final double[] x) {
Set modes = new HashSet<>(x.length);
long count = 0, countMax = 0;
for (final double v : x) {
count = Arrays.stream(x).filter(y -> y == v).count();
if (count >= countMax) {

Code Snippets

import java.util.*;
import java.util.stream.*;
import static java.lang.Math.*;
public class Stats{
    private static double[] unbox(Collection<Double> collection){
        return collection.stream().mapToDouble(Double::doubleValue).toArray();
    }

    public static double[] getModesUnOpt(final double[] x) {
        Set<Double> modes = new HashSet<>(x.length);
        long count = 0, countMax = 0;
        for (final double v : x) {
            count = Arrays.stream(x).filter(y -> y == v).count();
            if (count >= countMax) {
                if(count > countMax){
                    countMax = count;
                    modes.clear();
                }
                modes.add(v);
            }
            count = 0;
        }
        return unbox(modes);
    }

    public static double getModeUnOpt(final double[] x) {
        return getModes(x)[0];
    }

    private static class Mode{
        public double Value = 0.0;
        public int Count = 0;
    }

    public static class Pair<L, R>{
        public final L left;
        public final R right;
        public Pair(L left, R right){
            this.left = left;
            this.right = right;
        }
    }

    public static double getMode(final double[] x){
        return getModes(x)[0];
    }

    public static double[] getModes(final double[] x){
        return findModePartialSortRecursive(x, 0, x.length, new Mode(), new HashSet<>()).right.stream().mapToDouble(y -> y.Value).toArray();
    }

    private static Pair<Mode, Set<Mode>> findModePartialSortRecursive(final double[] array, int begin, int end, Mode best, Set<Mode> modes){
        int pivotIndex = (int) (begin + (random() * (end - begin)));
        Mode mode = best;
        double pivot = array[pivotIndex];
        int left = begin, right = end - 1, pos = left;
        Pair<Mode, Set<Mode>> data = new Pair<>(best, modes);
        while (pos <= right){
            if (array[pos] < pivot){
                array[left++] = array[pos++];
            }
            else if (array[pos] > pivot){
                swap(array, right, pos);
                --right;
            }
            else{
                ++pos;
            }
        }
        int pivotCount = right - left + 1;
        for (int i = left; i <= right; i++){
            array[i] = pivot;
        }
        if (pivotCount >= mode.Count){
            if(pivotCount > mode.Count){
                mode.Value = pivot;
                mode.Count = pivotCount;
            }
            modes.add(mode);
            data.left = mode;
            data.right = modes;
        }      
        int leftCount = left - begin;
        if (leftCount > mode.Count){
            data = findModePartialSortRecursive(array, begin, left, mode, modes);
        }
        int rightCount = end - right - 1;
        if (rightCount > mode.Count){
            data = findModePartialSortRecursive(array, right + 1, end, mode, modes);
        }
        return data;
    }

    public static void swap(

Context

StackExchange Code Review Q#149979, answer score: 12

Revisions (0)

No revisions yet.