HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Shortest Path in a Unit Cost DAG BFS in Hadoop

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
pathshortestbfshadoopcostdagunit

Problem

Just got done implementing a shortest path algo of a unit cost DAG BFS in Hadoop and wanting to get anyone's opinion on how I could have done it better. The main issue I see is that I lose the parallel nature of hadoop by forcing the mappers to wait for the output of the previous mapper before beginning.

The thing works and finds the shortest path given two input files: nodes.txt and edges.txt.

nodes.txt represents all the nodes in the DAG and is of the form

1 "A"
2 "B"
3 "C"
4 "D"
5 "E"
6 "F"
7 "G"
8 "Z"
9 "X"


edges.txt represents all the edges of the DAG and is of the form

1 2
1 3
1 4
2 5
2 6
3 5
3 6
4 7
5 8
6 8
7 8


```
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map.Entry;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class ShortestPathDAGBFS
{

public static class Map extends Mapper
{
HashMap nodes = new HashMap<>();
HashMap> edges = new HashMap<>();
String sourceNodeId;
String destNodeId;
Pattern pattern = Pattern.compile("[^\\S\\n]{1,4}");
boolean found = false;

@Override
public void setup(Context context) throws IOException
{
nodes = GlobalFunctions.parseNodesFile(context.getConfiguration().get("Node

Solution

I'm not familiar with Hadoop, take my advice with care.

-
Are you sure that you want to use the default charset?

try (BufferedReader reader = new BufferedReader(new InputStreamReader(is, Charset.defaultCharset()))) {


It could vary from system to system. Usually it's a good idea using UTF-8 explicitly.

-
Since both newPath and nextNodeId are Strings,

context.write(new Text(newPath.toString()), new Text(nextNodeId.toString()));


could be

context.write(new Text(newPath), new Text(nextNodeId));


It's the same, since String.toString has the following implementation:

public String toString() {
    return this;
}


(There are other similar occurrences in the code, find them.)

-
As far as I see Text has a constructor which accepts Text parameters. You might want to use that instead of converting it back to String:

context.write(new Text(prevPath.toString()), new Text(newestNode.toString()));


-
The setup method also queries Dest. You might want to store that in a field or move it outside the loop:

for (Text newestNode: values) {
    if (newestNode.toString().trim().equals(destNodeId) && !written) {
        String destNode = context.getConfiguration().get("Dest");


-
You could invert the conditon in the map method to make the code flatten:

@Override
public void map(LongWritable offset, Text line, Context context) 
        throws IOException, InterruptedException {
    String[] edge = pattern.split(line.toString());
    edge[0] = edge[0].trim();
    edge[1] = edge[1].trim();
    List outNodes = edges.get(edge[1]);
    String newPath = edge[0] + "->" + nodes.get(edge[1]);
    if (outNodes == null || found) {
        return;
    }
    for (String nextNodeId: outNodes) {
        context.write(new Text(newPath), new Text(nextNodeId));
        if (nextNodeId.toString().equalsIgnoreCase(destNodeId)) {
            context.getCounter("FoundDest", 
                "FoundDest").increment(1);
            found = true;
            context.write(new Text(edge[0].toString()), 
                new Text(edge[1].toString()));
            return;
        }
    }
}


-
Actually, I guess the found check could be right at the beginning:

@Override
public void map(LongWritable offset, Text line, Context context) 
        throws IOException, InterruptedException {
    if (found) {
        return;
    }
    ...
}


-
The reduce method also could use a guard clause. If I'm right it could even break the loop since if written is true both nested conditions will be false.

@Override
public void reduce(Text prevPath, 
        Iterable values, Context context) 
        throws IOException, InterruptedException {
    for (Text newestNode: values) {
        if (written) {
            break;
        }
        ...
    }
}


-

sourceNodeId = GlobalFunctions.findNodeId(context.getConfiguration().get("Source"), nodes).trim();


findNodeId could be a little bit smarter. All of the four uses of it trim its output and pass an object which is get
from context.getConfiguration().get("somekey"). A wrapper like this could eliminate some duplication:

public static String findTrimmedNodeId(
        Configuration configuration, 
        String key, Map nodes) {
    return findNodeId(configuration.get(key), nodes).trim();
}


-
The fs variable is unused here:

FileSystem fs = FileSystem.get(new Configuration());


Are you sure that these statements are required?

-
I'd change ShortestPathDAGBFS to ShortestPathDagBfs

From Effective Java, 2nd edition, Item 56: Adhere to generally accepted naming conventions:


While uppercase may be more common,
a strong argument can made in favor of capitalizing only the first
letter: even if multiple acronyms occur back-to-back, you can still
tell where one word starts and the next word ends.
Which class name would you rather see, HTTPURL or HttpUrl?

-
Calling the mapper Map is confusing. I'd choose DagBfsMapper or something descriptive. The same is true for Reduce.

-
HashMap reference types should be simply Map, as well as LinkedList could be List (Effective Java, 2nd edition, Item 52: Refer to objects by their interfaces)

-
Consider using LineIterator, it would simplify the code a little bit.

-
Try to use more descriptive variable names:

Pattern pattern = Pattern.compile("[^\\S\\n]{1,4}");


What is this pattern supposed to match?

Code Snippets

try (BufferedReader reader = new BufferedReader(new InputStreamReader(is, Charset.defaultCharset()))) {
context.write(new Text(newPath.toString()), new Text(nextNodeId.toString()));
context.write(new Text(newPath), new Text(nextNodeId));
public String toString() {
    return this;
}
context.write(new Text(prevPath.toString()), new Text(newestNode.toString()));

Context

StackExchange Code Review Q#32243, answer score: 2

Revisions (0)

No revisions yet.