HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Parsing huge data coming from a URL

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
hugeparsingfromdatacomingurl

Problem

I need to parse the data coming from the URL which looks like this:

hasProcess=true
version=1
DATACENTER=abc
    TotalNumberOfServers:4
    primary:{0=1, 1=2, 2=1, 3=2, 4=1, 5=2, 6=1, 7=2, 8=1, 9=2, 10=1, 11=2, 12=1, 13=2}
    secondary:{0=0, 1=0, 2=0, 3=1, 4=0, 5=0, 6=0, 7=1, 8=0, 9=0, 10=0, 11=1, 12=0, 13=0}
    hosttomachine:{3=machineA, 2=machineB, 1=machineC, 4=machineD}
DATACENTER=pqr
    TotalNumberOfServers:2
    primary:{0=1, 1=2, 2=1, 3=2, 4=1, 5=2, 6=1, 7=2, 8=1, 9=2, 10=1, 11=2, 12=1, 13=2, 14=1}
    secondary:{0=0, 1=0, 2=0, 3=1, 4=0, 5=0, 6=0, 7=1, 8=0, 9=0, 10=0, 11=1, 12=0, 13=0, 14=0}
    hosttomachine:{1=machineP, 4=machineQ}
DATACENTER=tuv
    TotalNumberOfServers:0
    primary:{}
    secondary:{}
    hosttomachine:{}


After parsing the data I need to store each datacenter data in a Map like this:

HashMap> primaryData


For example, the Key of primaryData is abc and value is:

{0=1, 1=2, 2=1, 3=2, 4=1, 5=2, 6=1, 7=2, 8=1, 9=2, 10=1, 11=2, 12=1, 13=2}


which is for primary.

Similarly another Map for secondary for each datacenter:

HashMap> secondaryData


For example, the Key of secondaryData is abc and value is:

{0=0, 1=0, 2=0, 3=1, 4=0, 5=0, 6=0, 7=1, 8=0, 9=0, 10=0, 11=1, 12=0, 13=0}


which is for secondary.

And lastly, one more map for hosttomachine mapping for each datacenter:

HashMap> hostMachineMapping -


For example, the Key of hostMachineMapping is abc and value is:

{3=machineA, 2=machineB, 1=machineC, 4=machineD}


which is for hosttomachine.

And all the above map will have data for its datacenter as I have three datacenter in the above example. So each each map will have three data. And also I will parse the above response only when hasProcess is equal to true. If it is not true, then I won't parse anything.

This code takes more than 200 ms to parse the data and store it in its corresponding HashMap. Is there any way to parse the above data efficient

Solution

I started to write a comment, but it became too long.
Questions

  • 200 ms is pretty close to eternity, how long is your input?



  • Can you measure how long parseResponse alone takes?



  • Could you post your data? Someone may try harder to optimize.



Potential slowness causes

I guess your String.split could be the culprit (with more than one character a regex gets created; you should use

private static final Pattern COMMA_BLANK_PATTERN = Pattern.compile(", ");


for splitting and use split limit of 2. Probably faster could be using Guava's

private static final Splitter COMMA_BLANK_SPLITTER = Splitter.on(", ");


Even better would be piece-wise processing instead of splitting.

Note also that Integer.parseInt is internationalized and therefore slightly slower than necessary. Probably unimportant.
Review

if (response != null) {


As Mat's Mug said, this is wrong. I'm using Guava's Preconditions with a static import writing simply

checkNotNull(response);


This makes it fail-fast rather than hiding the problem to bite you later.

String tableString = map.split(":")[1];
Map table = new HashMap();
tableString = tableString.substring(1, tableString.length() - 1);


This is pretty confusing by squeezing something else between the two tableString defining lines.
Summary

Overall, it's not bad. When speed is important, I'd go for something like

MyParser parser = new MyParser(response).skip("hasProcess=");
boolean hasProcess = parser.readBoolean();
parser.skip("\nversion=");
int version = parser.readInt();
while (parser.skipIfLookingAt("\nDATACENTER=")) {
     parser.parseDatacenter(parser, whatever...);
}


It's just an idea (damn similar to java.util.Scanner, which isn't exactly known for speed), but I guess that string splitting is the problem (simply as I can't see anything else).

An example method could look like

MyParser skip(String prefix) {
    for (int i=0; i<prefix.length(); ++i) {
        if (content.charAt(index++) != prefix.charAt(i)) {
             throw ...
        }
    }
}


where content and index are instance variables. This part doesn't need any substring creation at all (and gets a bit unreadable because of this). Warning: Doing something like content = content.substring(index) (so you could use startsWith) would make it clearer but terribly slow as you'd copy a big part of the string a lot of times.
Filling the maps

Now, I'd probably move all the functionality into MyParser, so that the parsing would be just

private void parseResponse(String response) {
     new MyParser(response).parse();
 }


The parser would have fields like

private final Map> primaryData;


so that I could write

void parseDatacenter() {
    String datacenter = readTill('\n');
    skip("\n\tTotalNumberOfServers:");
    int totalNumberOfServers = readInt();
    skip("\n\tprimary{");        
    primaryData.put(datacenter, readMap());
    skip("\n\t");
}

Map readMap() {
    Map result = new HashMap<>();
    while (true) {
         String key = readTillAndSkip('=');
         String value = readTillOneOf("},");
         map.put(key, value);
         if (lookingAt('}') {
             break;
         }
         skip(", ");
    }
    return result;
}


As an example see this simple method

String readTillOneOf(String terminators) {
    int start = index;
    while (terminators.indexOf(content.charAt(index)) > -1) {
        ++index;
    }
    return content.substring(start, index);
}


I ignored quite some details like parsing numbers and I can't tell how many parsing methods you'll need, but basically it's all pretty simple. Just skip what's fixed and process what you want while looking for a terminator.

It could get a bit more complicated if you needed to be error-tolerant, like allowing multiple spaces, but then you could let some of your methods skip over them. Or, if it gets difficult, use regexes. But I don't think it gets complicated.

Note that I'm unsure why your parser is slow. There's a lot of splitting and this is the probable cause, but it's just a guess.

Code Snippets

private static final Pattern COMMA_BLANK_PATTERN = Pattern.compile(", ");
private static final Splitter COMMA_BLANK_SPLITTER = Splitter.on(", ");
if (response != null) {
checkNotNull(response);
String tableString = map.split(":")[1];
Map<Integer, String> table = new HashMap<Integer, String>();
tableString = tableString.substring(1, tableString.length() - 1);

Context

StackExchange Code Review Q#94390, answer score: 6

Revisions (0)

No revisions yet.