Calculate min, max, average, and variance on a large dataset

Submitted by: @import:stackexchange-codereview

Problem

I have a piece of Java code that uses Hadoop to calculate the min, max, average and variance of a large dataset made of (index, value) pairs, one pair per line:

0  11839923.64831265
1  5710431.90800272


It's compiled locally and run on a remote distributed HDFS instance by a shell script.

My main concerns are:

-
if the output is collected in a different order, the whole thing stops working: instead of returning one result for each key, it prints the same key over and over, filling the terminal

-
it creates a new Text instance every time, which looks really inefficient, but when I tried to use a single shared constant, it stopped working. Maybe using an Enum would do fine, but I don't feel like changing it until I've fixed the previous point.

-
it's using a Scanner inside the mapper to handle multi-line inputs properly, but only single-line input is showing up. Does Hadoop guarantee that each mapper call only receives one line of input, or is it the remote setup that makes it so?

-
it uses the "old" approach of extending the MapReduceBase class and implementing the Mapper/Reducer interfaces. I've read that with the new 2.0 APIs it's sufficient to extend a single Mapper or Reducer class. However, I can't find a simple migration guide, and the official WordCount tutorial is stuck at r1.2.1. EDIT: found a reference for this. And another one here.

Here's the code:

```
package org.myorg;

import java.io.IOException;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.util.*;

import org.apache.hadoop.fs.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class calcAll {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, DoubleWritable> {

        public void map(LongWritable key, Text value, OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
```

Solution

I'm not familiar with Hadoop, so take my advice with care.

-
Are you sure that the following is precise enough?

double count = 0d; // should be an int, but anyway...


Consider the following:

final double original = 123456789123456789.0;
double d = original;
d++;
System.out.println(d == original); // true
System.out.printf("%f%n", original); // 123456789123456784,000000
System.out.printf("%f%n", d);        // 123456789123456784,000000


I'd use a long there. I guess it's not a problem in the mapper, but if I'm right, the reducer could suffer from this once the count gets large enough.
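As a small sketch of where the problem starts (the class and method names are made up for illustration): a double can represent every integer only up to 2^53, so a double counter stops advancing at that point, whereas a long keeps counting exactly.

```java
class CounterPrecision {
    // 2^53: above this value a double can no longer represent every integer.
    static final double LIMIT = 9007199254740992.0;

    // Returns true if incrementing a double counter at 2^53 has no effect.
    static boolean doubleIncrementIsLost() {
        double count = LIMIT;
        count++; // 2^53 + 1 is not representable and rounds back to 2^53
        return count == LIMIT;
    }

    // The same increment on a long is exact.
    static long longIncrement() {
        long count = 1L << 53;
        count++;
        return count;
    }

    public static void main(String[] args) {
        System.out.println(doubleIncrementIsLost());
        System.out.println(longIncrement());
    }
}
```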

See also:

  • Why not use Double or Float to represent currency?
  • Effective Java, 2nd Edition, Item 48: Avoid float and double if exact answers are required

-

tokens = line.split("\\s+");


The implementation of split was simply this in Java 6:

public String[] split(String regex, int limit) {
    return Pattern.compile(regex).split(this, limit);
}


Java 7 contains some fast paths, but in the end it still calls:

return Pattern.compile(regex).split(this, limit);


In your case (with the \\s+ pattern) it does not use any fast path, so it might be worth storing the compiled Pattern instance and calling split on that. (I guess it would be faster, but the JVM might cache that for you.)
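A minimal sketch of that idea, assuming each input line looks like the sample in the question; ObservationParser and parseObservation are hypothetical names, not part of the original code. Pattern instances are immutable and safe to share, so a static final field works:

```java
import java.util.regex.Pattern;

class ObservationParser {
    // Compiled once and reused for every line instead of once per split() call.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Parses the value column of a line such as "0  11839923.64831265".
    static double parseObservation(String line) {
        String[] tokens = WHITESPACE.split(line.trim());
        return Double.parseDouble(tokens[1]);
    }

    public static void main(String[] args) {
        System.out.println(parseObservation("0  11839923.64831265"));
    }
}
```

Whether this is measurably faster depends on the workload; it would need profiling on the real dataset.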

-

double count = 0d, max = 0d, min = 0d, sum = 0d, sumSquared = 0d;


I'd put the variable declarations on separate lines. From Code Complete, 2nd Edition, p. 759:


With statements on their own lines, the code reads from top to bottom, instead
of top to bottom and left to right. When you’re looking for a specific line of code,
your eye should be able to follow the left margin of the code. It shouldn’t have to
dip into each and every line just because a single line might contain two statements.

Additionally, max and min are never read (they are only written), so you could safely remove them (or you might want to print them to the output).

-
@Override
public void reduce(Text key, Iterator<DoubleWritable> values,
        OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
    if (key.equals(new Text("count"))) {


If Text is thread-safe or immutable, you could store new Text("count") (and the other keys) in fields instead of constructing them on every call, both for equals and for output.collect():

output.collect(new Text("sumSquared"), new DoubleWritable(sumSquared));


-
It looks like sum and sumSquared have the same implementation in the reducer. Is that a bug? If not, you could create a method to eliminate the duplicated logic.
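If the duplication is intentional (both keys simply add up partial totals emitted by the mappers), one shared helper would cover them. The sketch below uses plain Double values instead of Hadoop's DoubleWritable so that it runs on its own. PartialSums and its methods are hypothetical names, and the population-variance formula is an assumption about what the final step computes, since that part of the code isn't shown.

```java
import java.util.Arrays;
import java.util.Iterator;

class PartialSums {
    // Shared by "sum" and "sumSquared": adds up the partial totals for one key.
    static double total(Iterator<Double> partials) {
        double total = 0.0;
        while (partials.hasNext()) {
            total += partials.next();
        }
        return total;
    }

    // Assumed final step: population variance as E[x^2] - (E[x])^2.
    static double variance(double sum, double sumSquared, long count) {
        double mean = sum / count;
        return sumSquared / count - mean * mean;
    }

    public static void main(String[] args) {
        double sum = total(Arrays.asList(1.0, 2.0, 3.0, 4.0).iterator());
        double sumSquared = total(Arrays.asList(1.0, 4.0, 9.0, 16.0).iterator());
        System.out.println(variance(sum, sumSquared, 4L));
    }
}
```

Note that this formula subtracts two potentially large, nearly equal numbers, so it can lose precision when the mean is large compared to the spread of the data.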

-
The strings "sum", "sumSquared", "min" and "max" should be named constants instead of repeated literals. They are used multiple times.

-
Declaring variables before their usage, with a bigger scope than necessary, looks like a micro-optimization:

String line;
String[] tokens;
double observation;
while (scanner.hasNext()) {
    line = scanner.nextLine();
    tokens = line.split("\\s+");
    observation = Double.parseDouble(tokens[1]);


It would be more readable to declare them inside the loop. (Effective Java, Second Edition, Item 45: Minimize the scope of local variables)

String line;
String[] words;

line = br.readLine();
while (line != null) {
    words = line.split("\\s+");


could also be

String line = br.readLine();
while (line != null) {
    final String[] words = line.split("\\s+");
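A complete version of such a loop, as a sketch only: ScopedLoop, readObservations and parseText are hypothetical names, and the input format is assumed to match the sample in the question. A for loop keeps line scoped to the loop as well:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ScopedLoop {
    // Reads "index value" lines; every variable lives in the smallest possible scope.
    static List<Double> readObservations(BufferedReader reader) throws IOException {
        List<Double> observations = new ArrayList<>();
        for (String line = reader.readLine(); line != null; line = reader.readLine()) {
            final String[] words = line.trim().split("\\s+");
            observations.add(Double.parseDouble(words[1]));
        }
        return observations;
    }

    // Convenience wrapper for in-memory text, so callers need no checked exception.
    static List<Double> parseText(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return readObservations(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseText("0  1.5\n1  2.5\n"));
    }
}
```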


-
The reduce method contains a lot of similar structures. I'd consider using a function interface with some implementations:

public interface AggregateFunction {
    void addValue(double value);
    double getResult();
}

public class MaxFunction implements AggregateFunction {
    private double max = Double.NEGATIVE_INFINITY; // starting at Double.MAX_VALUE would make every result MAX_VALUE

    @Override
    public void addValue(final double value) {
        max = Math.max(value, max);
    }

    @Override
    public double getResult() {
        return max;
    }
}

public class SumFunction implements AggregateFunction {
    private double sum = 0.0;

    @Override
    public void addValue(final double value) {
        sum += value;
    }

    @Override
    public double getResult() {
        return sum;
    }
}
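The reducer refers to a MinFunction that isn't shown. A minimal sketch, assuming it mirrors MaxFunction with the comparison reversed; the interface is repeated so the example compiles on its own, and AggregateDemo is a hypothetical helper for trying it out:

```java
interface AggregateFunction {
    void addValue(double value);
    double getResult();
}

class MinFunction implements AggregateFunction {
    // Start above every possible input so the first value always replaces it.
    private double min = Double.POSITIVE_INFINITY;

    @Override
    public void addValue(final double value) {
        min = Math.min(value, min);
    }

    @Override
    public double getResult() {
        return min;
    }
}

class AggregateDemo {
    // Feeds all values to the function and returns its result.
    static double apply(AggregateFunction function, double... values) {
        for (double value : values) {
            function.addValue(value);
        }
        return function.getResult();
    }

    public static void main(String[] args) {
        System.out.println(apply(new MinFunction(), 3.0, -1.5, 7.0));
    }
}
```

With no input at all, this sketch returns positive infinity; whether that is acceptable depends on how empty keys should be reported.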


This will reduce the Reducer to:

```
public class Reduce extends MapReduceBase
        implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    // assuming that Text is thread-safe/immutable
    private final Text countKey = new Text("count");
    private final Text maxKey = new Text("max");
    private final Text minKey = new Text("min");
    private final Text sumKey = new Text("sum");
    private final Text sumSquaredKey = new Text("sumSquared");

    @Override
    public void reduce(Text key, Iterator<DoubleWritable> values,
            OutputCollector<Text, DoubleWritable> output,
            Reporter reporter) throws IOException {
        aggregate("count", countKey, values, output, new SumFunction());
        aggregate("max", maxKey, values, output, new MaxFunction());
        aggregate("min", minKey, values, output, new MinFunction());
        aggregate("sum", sumKey, values, output, new SumFunction());
        aggregate("sumSquared"
```

Context

StackExchange Code Review Q#42885, answer score: 5
