patternjavaMinor
Generating a histogram efficiently from the datasets in a Map
Viewed 0 times
efficientlythemapdatasetsgeneratinghistogramfrom
Problem
I have a map in which I store total bytes as the key and count of users as the value:
Now I need to make a histogram as mentioned below:
As an example it might be possible that 100 million users have data of 500 bytes. I need to make a histogram of current data usage of the users.
We can define a certain bucket size under which I need to find out how many users have data within that bucket.
Is there any better way of doing this and making the
Map userHistogramInfo = new HashMap();Now I need to make a histogram as mentioned below:
- How many users have data with 200,000 bytes (200 kilo bytes)
- How many users have data with 500 bytes
- How many users have data with 200 bytes
- How many users have data with 100 bytes
As an example it might be possible that 100 million users have data of 500 bytes. I need to make a histogram of current data usage of the users.
We can define a certain bucket size under which I need to find out how many users have data within that bucket.
public void generateHistogram(Map userHistogramInfo) {
int[] definitionInBytes = { 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120, 10240, 20480, 40960, 81920 };
long[] buckets = new long[definitionInBytes.length];
System.out.println("Below is the Histogram List sorted on the bytes: \n");
System.out.println("BytesStored Number");
SortedSet keys = new TreeSet(userHistogramInfo.keySet());
for (Integer key : keys) {
Integer value = userHistogramInfo.get(key);
System.out.println(key + " " + value);
}
System.out.println();
for (Integer time : userHistogramInfo.keySet()) {
for (int i = definitionInBytes.length - 1; i >= 0; i--) {
if (time >= definitionInBytes[i]) {
buckets[i] += userHistogramInfo.get(time);
break;
}
}
}
for (int i = 0; i < definitionInBytes.length; i++) {
String period = "";
if (i == definitionInBytes.length - 1) {
period = "greater than " + definitionInBytes[i] + " bytes";
} else {
period = "between " + (definitionInBytes[i] + 1) + " and " + definitionInBytes[i + 1] + " bytes";
}
System.out.println(buckets[i] + " came back " + period);
}
}Is there any better way of doing this and making the
Solution
Summary
Fix the bugs
When you have a user which consumes less than 10 bytes of data, you don't count that. I think that's a bug. I cannot conclude from the way the source code is written that this behavior is intentional. You probably want to
or
This would also be useful for testing. You would want that the sum of values (total number of users) in the map and the buckets is same, that's an invariant.
You could assert that invariant:
If it's actually intentional that it doesn't start with zero, you might consider documenting that. Right now, it too much looks like programming by coincidence and as if it might fall over sooner or later.
Besides, depending on your user base, you actually might want to go for
Understandable source code
This is mostly met. I had no trouble whatsoever understanding your source code. A few things how I could have understood your source code even faster are mentioned in different points. Just there's one thing in the source code which completely puzzles me. It's this comment:
I do not see any database or query in context, and I also do not see how the loop that follows this comment would be related to a database query. I'm left puzzled by that comment. Puzzlement without resolution.
Good symbol names
I find the name
Maybe
Use smaller methods
Your method
Let's see what this method does:
That's far too much for just one method.
Each of these things should go in a separate method.
Don't be afraid of small methods in Java. The JIT of the JVM is taking care of optimization very very well.
Use
Use smarter algorithms
Your array
You could use
Here's a method which gets the bucket index for a value:
Use
This is not really a must. It just might be interesting for you to see what Java can do for you.
Actually, your two arrays
You're grouping the input keys by ranges, summing the values. You could let Java do that for you.
Here's a method which creates
- Fix the bugs.
- Make sure your source code can be understood.
- Further improve your symbol names.
- Write more, shorter methods.
- Use
Formatter.
- Use smarter algorithms.
- Use
Stream.collect()to create the second histogram.
- Unit test.
Fix the bugs
When you have a user which consumes less than 10 bytes of data, you don't count that. I think that's a bug. I cannot conclude from the way the source code is written that this behavior is intentional. You probably want to
- start
definitionInByteswith 0
assertthat the key in the map is>= 0
or
assertthat the key in the map is>= 10
This would also be useful for testing. You would want that the sum of values (total number of users) in the map and the buckets is same, that's an invariant.
You could assert that invariant:
assert userHistogramInfo.values().mapToInt(Integer::intValue).sum() == Arrays.stream(buckets).sum();If it's actually intentional that it doesn't start with zero, you might consider documenting that. Right now, it too much looks like programming by coincidence and as if it might fall over sooner or later.
Besides, depending on your user base, you actually might want to go for
Long instead of Integer. With Integer you can use only 2 Billion users. Given a world population of soon 8 Billion, of which soon everyone has internet access, if you manage to get everyone as user, your code might overflow. Not yet very likely, but think of the COBOL (and other) code that was written with 2-digit dates (sometimes for good reason like space) and then had to be changed later to prevent Y2K trouble. Our code often lives quite long. Why not prepare it for the future if the cost for that in the present is almost nothing.Understandable source code
This is mostly met. I had no trouble whatsoever understanding your source code. A few things how I could have understood your source code even faster are mentioned in different points. Just there's one thing in the source code which completely puzzles me. It's this comment:
//This is for database selectI do not see any database or query in context, and I also do not see how the loop that follows this comment would be related to a database query. I'm left puzzled by that comment. Puzzlement without resolution.
Good symbol names
I find the name
userHistogramInfo ambiguous and therefore confusing, because the method actually deals with two histograms.- Avoid
info,dataand such stuff in symbol names, it does not communicate intent, it just adds noise.
- For
Maps communicate the key and value in the symbol name.
Maybe
usersPerSize could be a good name for the input histogram. And usersPerSizeRange could be a good name for the second histogram which is generated from the first.Use smaller methods
Your method
generateHistogram() is too long. Ideally, methods do just one thing, they do it well, and they do it only (Robert C. Martin, Clean Code). It's also known as the SRP - Single Responsibility Principle applied to methods, or "extract 'till you drop" (Robert C. Martin, Clean Code).Let's see what this method does:
- It formats the original histogram.
- It prints the original histogram.
- It generates a new histogram, based on ranges instead of individual sizes.
- It formats the new histogram.
- It prints the new histogram.
That's far too much for just one method.
Each of these things should go in a separate method.
Don't be afraid of small methods in Java. The JIT of the JVM is taking care of optimization very very well.
Use
FormatterFormatter is a very useful class for formatting Strings. You don't need to care about the platform's line ending, you use "%n" and Java takes care of the rest. And you have all the usual formatting options, like auto-filling with spaces or leading zero. Hosch250 already explained this nicely in his answer.Use smarter algorithms
Your array
definitionInBytes is sorted. The expectation of most programmers would probably be that if you have a sorted array, you use binary search to identify the location, however, you use linear search.You could use
Arrays.binarySearch() for your use case.Here's a method which gets the bucket index for a value:
public static int getBucketIndex(final int value) {
// assert isSorted(definitionInBytes);
final int searchResult = Arrays.binarySearch(definitionInBytes, value);
if (searchResult >= 0)
return searchResult;
// Warning: returns -1 in case value is less than definitionInBytes[0];
return -searchResult - 2;
}Use
Stream.collect() to create the second histogram.This is not really a must. It just might be interesting for you to see what Java can do for you.
Actually, your two arrays
definitionInBytes and buckets are a histogram, a Map, just like the input data.You're grouping the input keys by ranges, summing the values. You could let Java do that for you.
Here's a method which creates
Code Snippets
assert userHistogramInfo.values().mapToInt(Integer::intValue).sum() == Arrays.stream(buckets).sum();//This is for database selectpublic static int getBucketIndex(final int value) {
// assert isSorted(definitionInBytes);
final int searchResult = Arrays.binarySearch(definitionInBytes, value);
if (searchResult >= 0)
return searchResult;
// Warning: returns -1 in case value is less than definitionInBytes[0];
return -searchResult - 2;
}import static java.util.stream.Collectors.*;
public class Histogram {
// ...
private static final int[] definitionInBytes = { 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120, 10240, 20480, 40960, 81920 };
public static Map<Integer, Integer> groupByRange(final Map<Integer, Integer> usersPerSize) {
return usersPerSize
.entrySet()
.parallelStream()
.collect(groupingByConcurrent(Histogram::getBucketFromEntry, summingInt(Entry::getValue)));
}
public static Integer getBucketFromEntry(final Entry<Integer, Integer> entry) {
return getFloor(entry.getKey(), definitionInBytes);
}
public static int getFloor(final int value, final int[] floors) {
final int searchResult = Arrays.binarySearch(floors, value);
if (searchResult >= 0) return floors[searchResult];
if (-searchResult - 2 < 0) return Integer.MIN_VALUE;
return floors[-searchResult - 2];
}
}private static final int[] floors = { 10, 100, 1000 };
@Test
public void testGetFloor() {
assertEquals(Integer.MIN_VALUE, getFloor(Integer.MIN_VALUE, floors));
assertEquals(Integer.MIN_VALUE, getFloor(0, floors));
assertEquals(Integer.MIN_VALUE, getFloor(9, floors));
assertEquals(10, getFloor(10, floors));
assertEquals(10, getFloor(99, floors));
assertEquals(100, getFloor(100, floors));
assertEquals(100, getFloor(999, floors));
assertEquals(1000, getFloor(1000, floors));
assertEquals(1000, getFloor(Integer.MAX_VALUE, floors));
}Context
StackExchange Code Review Q#82719, answer score: 6
Revisions (0)
No revisions yet.