HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Generating Signatures of Tree Structures using an Order-Independent Hash

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
signaturesorderindependenthashgeneratingusingstructurestree

Problem

I'm working with a file format which stores data as a tree. Each node in the tree is known as a Tag, and every tag has a name and a value. The order of tags is not guaranteed, but the general structure is. For example, in the diagram below, the CompoundTag "John Smith" and "Albert Einstein" have the same structure even though their child tags are in a different order:

Formatted as: [TagType] TagName : TagValue

[ListTag] Employees :
  [CompoundTag] :
    [StringTag] FirstName : John
    [StringTag] LastName : Smith
    [IntTag] EmployeeId : 123
  [CompoundTag] :
    [StringTag] FirstName : Albert
    [IntTag] EmployeeId : 456
    [StringTag] LastName : Einstein


I'm creating signatures of certain tags so that I don't have to parse the whole tree structure, as I'm only interested in specific tags in the tree. To do that, I've created an order-independent hash function which hashes the name of each tag and the type. The function then combines those hashes into one signature which I use to identify other tags with the same structure.

The function hashes tag names using the standard String.hashCode() method, then adds the tag Id (signifying the type) to the upper bits of the hash. It then combines the hash with two previous values which keep track of hashes combined using two commutative operations: addition and exclusive or. I keep track of two combined hashes because I found that addition and xor on their own are easily susceptible to collision. Finally, I combine the two hashes as a long and return that as the tag signature.

I have an optional parameter to include the parent tag in the signature. The parent tag is hashed the same way as child tags are, but I also add a small constant to the parent tag hash to make it slightly different from child tags. I add the constant in-case a child tag has the same name and type as the parent, which would generate the same signature in some scenarios.

The code targets Java 7, though I'm considering switchi

Solution

Overall, this is very good code. Maybe there is another hashing strategy to have, but I'll focus on the current code in this answer. There is one point that can be greatly improved here: the comments. There are certainly a lot of them, but...

sigA += hash; // ADD
sigB ^= hash; // XOR


or

// Iterate over all child tags
while (!stack.isEmpty()) {


are not really useful. The method is actually quite complicated, with magic numbers, additions and bit-shifting left and right, which lots of people struggle with alone. Instead of those small one-line comments that consider just a single line of code, I would suggest having a long comment on top of it explaining what that particular algorithm, as a whole, is doing and why it is done this way. It could even be part of the Javadoc.

There are two cases: either the person reading those comments (like the ones above) already know what the line is doing, and the comment is not really useful, or they don't, and the comment doesn't really explain why that line of code is there, what its purpose is, and ends up being not really useful as well.

Still, some of those are useful, like

// Parent tag should be hashed different from child tags
hash = 0xD0EF + ((long) parentTag.getName().hashCode()) & 0xffffffffL;


It does more than saying it hashes something. It explains why there is a constant added to the hash: to make it different from hashes of child tags. It could be even better by explaining why that particular constant was chosen; otherwise the reader is thinking: "Was that chosen randomly? Or is it one of those magical constants that make everything work?"

A couple of other notes:

-
That part of the code looks like it could use some OOP:

switch (tag.getType()) {
    case COMPOUND:
        // ArrayList of tags
        for (Tag child : ((CompoundTag) tag).getValue()) {
            stack.add(child);
        }
        break;
    case LIST:
        // Array of a single tag type
        for (Tag child : ((ListTag) tag).getValue()) {
            stack.add(child);
        }
}


Perhaps it doesn't make sense with regard to how the rest of the code is structured, but you could consider adding a getValue() abstract method to Tag, perhaps defaulting to return an empty list or without defaults. The idea is that it could make the code be refactored to

for (Tag child : tag.getValue()) {
    stack.add(child);
}


without any switch statement: some tags would return an empty list, and compound or list tags would return their children. If Tag cannot have a getValue() method because it doesn't make sense for all tags to have a value, perhaps another possibility would be to make a specialized subclass/subinterface of Tag for which having a value would make sense, and let the method take that instead. Of course, it could be the case that it would be too much for that lone piece of code.

-
I wouldn't let a temporary variable have a broader scope than it should have because it could be reused somewhere else.

long hash; // temporary variable which stores computed hashes


Generally speaking, variables should have the tighter scope possible. In this case, I would just declare two hash variables inside the two for loops. I don't think it would really hurt performance, and it prevents that variable from leaking to other parts of the code.

Code Snippets

sigA += hash; // ADD
sigB ^= hash; // XOR
// Iterate over all child tags
while (!stack.isEmpty()) {
// Parent tag should be hashed different from child tags
hash = 0xD0EF + ((long) parentTag.getName().hashCode()) & 0xffffffffL;
switch (tag.getType()) {
    case COMPOUND:
        // ArrayList of tags
        for (Tag child : ((CompoundTag) tag).getValue()) {
            stack.add(child);
        }
        break;
    case LIST:
        // Array of a single tag type
        for (Tag child : ((ListTag) tag).getValue()) {
            stack.add(child);
        }
}
for (Tag child : tag.getValue()) {
    stack.add(child);
}

Context

StackExchange Code Review Q#154107, answer score: 3

Revisions (0)

No revisions yet.