HiveBrain v1.2.0
Get Started
← Back to all entries
patterncppMinor

Find change in XML and print node

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
nodexmlprintfindandchange

Problem

This code has been developed with help on Stack Overflow. It is designed to compare two XML files and print the node for any differences (i.e if a entry appears in A.xml and not B.xml and vice versa).

#include "pugixml.hpp"

#include 
#include 

struct string_hasher
{
    unsigned int operator()(const char* str) const
    {
        // Jenkins one-at-a-time hash (http://en.wikipedia.org/wiki/Jenkins_hash_function#one-at-a-time)
        unsigned int result = 0;

        while (*str)
        {
            result += static_cast(*str++);
            result += result > 6;
        }

        result += result > 11;
        result += result  xml_node_map;

int main()
{
    pugi::xml_document doca, docb;
    xml_node_map mapa, mapb;

    if (!doca.load_file("a.xml") || !docb.load_file("b.xml"))
        return 1;

    for (auto& node: doca.child("site_entries").children("entry"))
        mapa[node.child_value("id")] = node;

    for (auto& node: docb.child("site_entries").children("entry"))
        mapb[node.child_value("id")] = node;

    for (auto& ea: mapa)
        if (mapb.count(ea.first) == 0)
        {
            std::cout << "Removed:" << std::endl;
            ea.second.print(std::cout);
        }

    for (auto& eb: mapb)
        if (mapa.count(eb.first) == 0)
        {
            std::cout << "Added:" << std::endl;
            eb.second.print(std::cout);
        }
}


I was wondering if I could get some feedback on how it is designed, most notably if this is the optimal approach in terms of performance, since this has to sort hundreds of thousands of entries.

Example XML file:


Solution

Your code is somewhat simplistic in that all it is doing is ensuring that the same id values appear in each document. XML is a lot more complicated than that, though, and you may have problems when:

  • you have multiple entries with the same id



  • you have entries without an id



  • the content in entries with the same ID in different documents is wildly different.



In terms of the algorithm, the code is clear, and so on. The string_hasher logic concerns me a bit simply because I am not that familiar with C++ (I am more familiar with XML)..... but, I have the following concerns there:

  • Why return int for the hash? It is specified to return size_t.



  • why have the struct at all?



If you convert the const char* ID to be a std::string then you can just make it a map of std::string and use the default values for the hashing and equalities in the template.

Funally, as a small optimization, you can probably save half your memory if you us a 'subtraction' type algorithm for comparing the document structures. Consider )note, I always use braces for code blocks, even if they have just one line):

for (auto& node: doca.child("site_entries").children("entry")) {
    mapa[node.child_value("id")] = node;
}


Then, in the next loop, do the following instead:

for (auto& node: docb.child("site_entries").children("entry")) {
    const char* nid = node.child_value("id");
    if (!mapa.erase(nid)) {
        mapb[nid] = node;
    }
}


Now, at this point, there is no need to perform an intersection on the maps... the values in mapa we know ae not in mapb, and visa versa. You can just print the contents of the two maps without any further processing...

Putting all these suggestions together, your code would look something like:

int main() {
    pugi::xml_document doca, docb;
    std::map mapa, mapb;

    if (!doca.load_file("a.xml") || !docb.load_file("b.xml"))
        return 1;

    for (auto& node: doca.child("site_entries").children("entry")) {
        const char* id = node.child_value("id");
        mapa[new std::string(id, strlen(id))] = node;
    }

    for (auto& node: docb.child("site_entries").children("entry"))
        const char* idcs = node.child_value("id");
        std::string id = new std::string(idcs, strlen(idcs));
        if (!mapa.erase(id)) {
            mapb[id] = node;
        }
    }

    for (auto& ea: mapa) {
        std::cout << "Removed:" << std::endl;
        ea.second.print(std::cout);
    }

    for (auto& eb: mapb) {
        std::cout << "Added:" << std::endl;
        eb.second.print(std::cout);
    }

}

Code Snippets

for (auto& node: doca.child("site_entries").children("entry")) {
    mapa[node.child_value("id")] = node;
}
for (auto& node: docb.child("site_entries").children("entry")) {
    const char* nid = node.child_value("id");
    if (!mapa.erase(nid)) {
        mapb[nid] = node;
    }
}
int main() {
    pugi::xml_document doca, docb;
    std::map<string, pugi::xml_node> mapa, mapb;

    if (!doca.load_file("a.xml") || !docb.load_file("b.xml"))
        return 1;

    for (auto& node: doca.child("site_entries").children("entry")) {
        const char* id = node.child_value("id");
        mapa[new std::string(id, strlen(id))] = node;
    }

    for (auto& node: docb.child("site_entries").children("entry"))
        const char* idcs = node.child_value("id");
        std::string id = new std::string(idcs, strlen(idcs));
        if (!mapa.erase(id)) {
            mapb[id] = node;
        }
    }

    for (auto& ea: mapa) {
        std::cout << "Removed:" << std::endl;
        ea.second.print(std::cout);
    }

    for (auto& eb: mapb) {
        std::cout << "Added:" << std::endl;
        eb.second.print(std::cout);
    }

}

Context

StackExchange Code Review Q#86958, answer score: 4

Revisions (0)

No revisions yet.