HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Generating a histogram efficiently from the datasets in a Map

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
efficientlythemapdatasetsgeneratinghistogramfrom

Problem

I have a map in which I store total bytes as the key and count of users as the value:

Map userHistogramInfo = new HashMap();


Now I need to make a histogram as mentioned below:

  • How many users have data with 200,000 bytes (200 kilo bytes)



  • How many users have data with 500 bytes



  • How many users have data with 200 bytes



  • How many users have data with 100 bytes



As an example it might be possible that 100 million users have data of 500 bytes. I need to make a histogram of current data usage of the users.

We can define a certain bucket size under which I need to find out how many users have data within that bucket.

public void generateHistogram(Map userHistogramInfo) {
    int[] definitionInBytes = { 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120, 10240, 20480, 40960, 81920 };
    long[] buckets = new long[definitionInBytes.length];
    System.out.println("Below is the Histogram List sorted on the bytes: \n");
    System.out.println("BytesStored           Number");
    SortedSet keys = new TreeSet(userHistogramInfo.keySet());
    for (Integer key : keys) {
        Integer value = userHistogramInfo.get(key);
        System.out.println(key + "                      " + value);
    }

    System.out.println();
    for (Integer time : userHistogramInfo.keySet()) {
        for (int i = definitionInBytes.length - 1; i >= 0; i--) {
            if (time >= definitionInBytes[i]) {
                buckets[i] += userHistogramInfo.get(time);
                break;
            }
        }
    }
    for (int i = 0; i < definitionInBytes.length; i++) {
        String period = "";
        if (i == definitionInBytes.length - 1) {
            period = "greater than " + definitionInBytes[i] + " bytes";
        } else {
            period = "between " + (definitionInBytes[i] + 1) + " and " + definitionInBytes[i + 1] + " bytes";
        }
        System.out.println(buckets[i] + " came back " + period);
    }
}


Is there any better way of doing this and making the

Solution

Summary

  • Fix the bugs.



  • Make sure your source code can be understood.



  • Further improve your symbol names.



  • Write more, shorter methods.



  • Use Formatter.



  • Use smarter algorithms.



  • Use Stream.collect() to create the second histogram.



  • Unit test.



Fix the bugs

When you have a user which consumes less than 10 bytes of data, you don't count that. I think that's a bug. I cannot conclude from the way the source code is written that this behavior is intentional. You probably want to

  • start definitionInBytes with 0



  • assert that the key in the map is >= 0



or

  • assert that the key in the map is >= 10



This would also be useful for testing. You would want that the sum of values (total number of users) in the map and the buckets is same, that's an invariant.

You could assert that invariant:

assert userHistogramInfo.values().mapToInt(Integer::intValue).sum() == Arrays.stream(buckets).sum();


If it's actually intentional that it doesn't start with zero, you might consider documenting that. Right now, it too much looks like programming by coincidence and as if it might fall over sooner or later.

Besides, depending on your user base, you actually might want to go for Long instead of Integer. With Integer you can use only 2 Billion users. Given a world population of soon 8 Billion, of which soon everyone has internet access, if you manage to get everyone as user, your code might overflow. Not yet very likely, but think of the COBOL (and other) code that was written with 2-digit dates (sometimes for good reason like space) and then had to be changed later to prevent Y2K trouble. Our code often lives quite long. Why not prepare it for the future if the cost for that in the present is almost nothing.

Understandable source code

This is mostly met. I had no trouble whatsoever understanding your source code. A few things how I could have understood your source code even faster are mentioned in different points. Just there's one thing in the source code which completely puzzles me. It's this comment:

//This is for database select


I do not see any database or query in context, and I also do not see how the loop that follows this comment would be related to a database query. I'm left puzzled by that comment. Puzzlement without resolution.

Good symbol names

I find the name userHistogramInfo ambiguous and therefore confusing, because the method actually deals with two histograms.

  • Avoid info, data and such stuff in symbol names, it does not communicate intent, it just adds noise.



  • For Maps communicate the key and value in the symbol name.



Maybe usersPerSize could be a good name for the input histogram. And usersPerSizeRange could be a good name for the second histogram which is generated from the first.

Use smaller methods

Your method generateHistogram() is too long. Ideally, methods do just one thing, they do it well, and they do it only (Robert C. Martin, Clean Code). It's also known as the SRP - Single Responsibility Principle applied to methods, or "extract 'till you drop" (Robert C. Martin, Clean Code).

Let's see what this method does:

  • It formats the original histogram.



  • It prints the original histogram.



  • It generates a new histogram, based on ranges instead of individual sizes.



  • It formats the new histogram.



  • It prints the new histogram.



That's far too much for just one method.
Each of these things should go in a separate method.

Don't be afraid of small methods in Java. The JIT of the JVM is taking care of optimization very very well.

Use Formatter

Formatter is a very useful class for formatting Strings. You don't need to care about the platform's line ending, you use "%n" and Java takes care of the rest. And you have all the usual formatting options, like auto-filling with spaces or leading zero. Hosch250 already explained this nicely in his answer.

Use smarter algorithms

Your array definitionInBytes is sorted. The expectation of most programmers would probably be that if you have a sorted array, you use binary search to identify the location, however, you use linear search.

You could use Arrays.binarySearch() for your use case.

Here's a method which gets the bucket index for a value:

public static int getBucketIndex(final int value) {
    // assert isSorted(definitionInBytes);
    final int searchResult = Arrays.binarySearch(definitionInBytes, value);
    if (searchResult >= 0)
        return searchResult;
    // Warning: returns -1 in case value is less than definitionInBytes[0];
    return -searchResult - 2;
}


Use Stream.collect() to create the second histogram.

This is not really a must. It just might be interesting for you to see what Java can do for you.

Actually, your two arrays definitionInBytes and buckets are a histogram, a Map, just like the input data.
You're grouping the input keys by ranges, summing the values. You could let Java do that for you.

Here's a method which creates

Code Snippets

assert userHistogramInfo.values().mapToInt(Integer::intValue).sum() == Arrays.stream(buckets).sum();
//This is for database select
public static int getBucketIndex(final int value) {
    // assert isSorted(definitionInBytes);
    final int searchResult = Arrays.binarySearch(definitionInBytes, value);
    if (searchResult >= 0)
        return searchResult;
    // Warning: returns -1 in case value is less than definitionInBytes[0];
    return -searchResult - 2;
}
import static java.util.stream.Collectors.*;

public class Histogram {
    // ...
    private static final int[] definitionInBytes = { 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120, 10240, 20480, 40960, 81920 };

    public static Map<Integer, Integer> groupByRange(final Map<Integer, Integer> usersPerSize) {
        return usersPerSize
            .entrySet()
            .parallelStream()
            .collect(groupingByConcurrent(Histogram::getBucketFromEntry, summingInt(Entry::getValue)));
    }

    public static Integer getBucketFromEntry(final Entry<Integer, Integer> entry) {
        return getFloor(entry.getKey(), definitionInBytes);
    }

    public static int getFloor(final int value, final int[] floors) {
        final int searchResult = Arrays.binarySearch(floors, value);
        if (searchResult >= 0) return floors[searchResult];
        if (-searchResult - 2 < 0) return Integer.MIN_VALUE;
        return floors[-searchResult - 2];
    }
}
private static final int[] floors = { 10, 100, 1000 };

@Test
public void testGetFloor() {
    assertEquals(Integer.MIN_VALUE, getFloor(Integer.MIN_VALUE, floors));
    assertEquals(Integer.MIN_VALUE, getFloor(0, floors));
    assertEquals(Integer.MIN_VALUE, getFloor(9, floors));
    assertEquals(10, getFloor(10, floors));
    assertEquals(10, getFloor(99, floors));
    assertEquals(100, getFloor(100, floors));
    assertEquals(100, getFloor(999, floors));
    assertEquals(1000, getFloor(1000, floors));
    assertEquals(1000, getFloor(Integer.MAX_VALUE, floors));
}

Context

StackExchange Code Review Q#82719, answer score: 6

Revisions (0)

No revisions yet.