patterncppMinor
Find change in XML and print node
Viewed 0 times
nodexmlprintfindandchange
Problem
This code has been developed with help on Stack Overflow. It is designed to compare two XML files and print the node for any differences (i.e if a entry appears in A.xml and not B.xml and vice versa).
I was wondering if I could get some feedback on how it is designed, most notably if this is the optimal approach in terms of performance, since this has to sort hundreds of thousands of entries.
Example XML file:
#include "pugixml.hpp"
#include
#include
struct string_hasher
{
unsigned int operator()(const char* str) const
{
// Jenkins one-at-a-time hash (http://en.wikipedia.org/wiki/Jenkins_hash_function#one-at-a-time)
unsigned int result = 0;
while (*str)
{
result += static_cast(*str++);
result += result > 6;
}
result += result > 11;
result += result xml_node_map;
int main()
{
pugi::xml_document doca, docb;
xml_node_map mapa, mapb;
if (!doca.load_file("a.xml") || !docb.load_file("b.xml"))
return 1;
for (auto& node: doca.child("site_entries").children("entry"))
mapa[node.child_value("id")] = node;
for (auto& node: docb.child("site_entries").children("entry"))
mapb[node.child_value("id")] = node;
for (auto& ea: mapa)
if (mapb.count(ea.first) == 0)
{
std::cout << "Removed:" << std::endl;
ea.second.print(std::cout);
}
for (auto& eb: mapb)
if (mapa.count(eb.first) == 0)
{
std::cout << "Added:" << std::endl;
eb.second.print(std::cout);
}
}I was wondering if I could get some feedback on how it is designed, most notably if this is the optimal approach in terms of performance, since this has to sort hundreds of thousands of entries.
Example XML file:
Solution
Your code is somewhat simplistic in that all it is doing is ensuring that the same
In terms of the algorithm, the code is clear, and so on. The
If you convert the
Funally, as a small optimization, you can probably save half your memory if you us a 'subtraction' type algorithm for comparing the document structures. Consider )note, I always use braces for code blocks, even if they have just one line):
Then, in the next loop, do the following instead:
Now, at this point, there is no need to perform an intersection on the maps... the values in
Putting all these suggestions together, your code would look something like:
id values appear in each document. XML is a lot more complicated than that, though, and you may have problems when:- you have multiple entries with the same
id
- you have entries without an
id
- the content in entries with the same ID in different documents is wildly different.
In terms of the algorithm, the code is clear, and so on. The
string_hasher logic concerns me a bit simply because I am not that familiar with C++ (I am more familiar with XML)..... but, I have the following concerns there:- Why return
intfor the hash? It is specified to returnsize_t.
- why have the struct at all?
If you convert the
const char* ID to be a std::string then you can just make it a map of std::string and use the default values for the hashing and equalities in the template.Funally, as a small optimization, you can probably save half your memory if you us a 'subtraction' type algorithm for comparing the document structures. Consider )note, I always use braces for code blocks, even if they have just one line):
for (auto& node: doca.child("site_entries").children("entry")) {
mapa[node.child_value("id")] = node;
}Then, in the next loop, do the following instead:
for (auto& node: docb.child("site_entries").children("entry")) {
const char* nid = node.child_value("id");
if (!mapa.erase(nid)) {
mapb[nid] = node;
}
}Now, at this point, there is no need to perform an intersection on the maps... the values in
mapa we know ae not in mapb, and visa versa. You can just print the contents of the two maps without any further processing... Putting all these suggestions together, your code would look something like:
int main() {
pugi::xml_document doca, docb;
std::map mapa, mapb;
if (!doca.load_file("a.xml") || !docb.load_file("b.xml"))
return 1;
for (auto& node: doca.child("site_entries").children("entry")) {
const char* id = node.child_value("id");
mapa[new std::string(id, strlen(id))] = node;
}
for (auto& node: docb.child("site_entries").children("entry"))
const char* idcs = node.child_value("id");
std::string id = new std::string(idcs, strlen(idcs));
if (!mapa.erase(id)) {
mapb[id] = node;
}
}
for (auto& ea: mapa) {
std::cout << "Removed:" << std::endl;
ea.second.print(std::cout);
}
for (auto& eb: mapb) {
std::cout << "Added:" << std::endl;
eb.second.print(std::cout);
}
}Code Snippets
for (auto& node: doca.child("site_entries").children("entry")) {
mapa[node.child_value("id")] = node;
}for (auto& node: docb.child("site_entries").children("entry")) {
const char* nid = node.child_value("id");
if (!mapa.erase(nid)) {
mapb[nid] = node;
}
}int main() {
pugi::xml_document doca, docb;
std::map<string, pugi::xml_node> mapa, mapb;
if (!doca.load_file("a.xml") || !docb.load_file("b.xml"))
return 1;
for (auto& node: doca.child("site_entries").children("entry")) {
const char* id = node.child_value("id");
mapa[new std::string(id, strlen(id))] = node;
}
for (auto& node: docb.child("site_entries").children("entry"))
const char* idcs = node.child_value("id");
std::string id = new std::string(idcs, strlen(idcs));
if (!mapa.erase(id)) {
mapb[id] = node;
}
}
for (auto& ea: mapa) {
std::cout << "Removed:" << std::endl;
ea.second.print(std::cout);
}
for (auto& eb: mapb) {
std::cout << "Added:" << std::endl;
eb.second.print(std::cout);
}
}Context
StackExchange Code Review Q#86958, answer score: 4
Revisions (0)
No revisions yet.