HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavascriptMinor

Matrix of articles and authors from large JSON dataset

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
authorsjsondatasetarticleslargeandfrommatrix

Problem

Given the following JSON data containing 4 articles by 4 different authors:

const data = {
  "DataSet": {
    "Article": [
      {
        "Title": 'A',
        "AuthorList": {
          "Author": [{
            "FirstName": "Al",
            "LastName": "Ab",
            "Initials": "A"
          }, {
            "FirstName": "Bi",
            "LastName": "By",
            "Initials": "B"
          }] // Author
        } // AuthorList
      }, // Article
      {
        "Title": 'B',
        "AuthorList": {
          "Author": [{
            "FirstName": "Bi",
            "LastName": "By",
            "Initials": "B"
          }, {
            "FirstName": "Ch",
            "LastName": "Ch",
            "Initials": "C"
          }] // Author
        } // AuthorList
      }, // Article
      {
        "Title": 'C',
        "AuthorList": {
          "Author": [{
            "FirstName": "Ch",
            "LastName": "Ch",
            "Initials": "C"
          }, {
            "FirstName": "Al",
            "LastName": "Ab",
            "Initials": "A"
          }] // Author
        } // AuthorList
      }, // Article
      {
        "Title": 'D',
        "AuthorList": {
          "Author": [{
            "FirstName": "Da",
            "LastName": "Do",
            "Initials": "D"
          }, {
            "FirstName": "Bi",
            "LastName": "By",
            "Initials": "B"
          }] // Author
        } // AuthorList
      } // Article
    ] // Articles
  } // DataSet
};


I need to answer the following query:

Generate a matrix such that each cell (i, j) contains say N the number of articles authored by the author in column and co-authored by author in row. A paper may have more than 2 authors. It may be assumed that JSON is present as a file.

For above example the output would simply look like:

[2, 1, 1, 0]
[1, 3, 1, 1]
[1, 1, 2, 0]
[0, 1, 0, 1]


I have devised the solution below without thinking too much about design and performanc

Solution

You seem to be taking a very non-optimal approach from a performance standpoint, as you iterate over the data set unnecessarily trying to extra only derive certain pieces from it at a single time. I would think your goal would be to map the data into your data structure with a single iteration over the data.dataSet.article array, resulting in \$O(n)\$ operational complexity to read the data in).

Right now you do:

  • Build an array of authors - \$O(n)\$, where n is number of articles



  • Cast authors in to a Set - \$O(n)\$, where n is number of authors from previous step



  • Get articles into array - \$O(n)\$, where n is # of articles. Note you gain nothing from mapping articles in place into there own array, but double memory usage and add this \$O(n)\$ performance hit. At a minimum you should kill this step.



  • You map authors into an object with index position as value - \$O(n)\$ where n i # of authors. Another unnecessary step it seems, as you could iterate the authors set just as you have done here later if needed.



  • You initialize the matrix to hold empty 0's - \$O(n)\$ where n is number of authors. This could have been done at any other point where you already iterated author set.



  • You iterate the articles array - \$O(n)\$ where n is # of articles.



As you can see that is a lot of unnecessary complexity (and memory consumption).

You should be able to build your matrix in a single pass.

  • Initialize your matrix as an empty object which will later be filled in .



The target data structure may look like this:

{
    'ch ch': {
        'ch ch':  *,
        'other author': *,
        ...
    },
    'other author': { ... },
    ...
}


  • iterate through data.dataSet.article. On each element populate counters on properties in proposed data structure above.



  • populate all authors into a set for use in rendering display.



This all could be as simple as:

const matrix = {};
const authors = new Set();
const articleLen = data.dataSet.article.length;
for (let artIdx = 0; artIdx < articleLen; artIdx++) {
    const len = element[artIdx].authorList.length;
    for (let i = 0; i < len; i++) {
        let author = fullname(element[artIdx].authorList[i]);
        authors.add(author);
        for (let j = 0; j < len; j++) {
            let coauthor = fullname(element[artIdx].authorList[j]);
            matrix[coauthor][author] = (matrix[coauthor][author]++ || 0) + 1;
        }
    }
}


Note: I used for loops for performance optimization rather than array methods. You really could use either. If writing code for real-world system where I did not expect millions of rows and the performance micro-optimization was not necessary I would probably use array methods for cleaner code.

-
Fill 0's in matrix on display (or as secondary process if you really need the data structure). You do this by iterating the author set in nested loop.

-
This would result in complexity that is \$O(n)\$ (n = articles)for article iteration + \$O(m^2)\$ (m = authors) for matrix display. So, if # of articles are assumed to be >> # of authors, this would likely be a good approach.

Code Snippets

{
    'ch ch': {
        'ch ch':  *,
        'other author': *,
        ...
    },
    'other author': { ... },
    ...
}
const matrix = {};
const authors = new Set();
const articleLen = data.dataSet.article.length;
for (let artIdx = 0; artIdx < articleLen; artIdx++) {
    const len = element[artIdx].authorList.length;
    for (let i = 0; i < len; i++) {
        let author = fullname(element[artIdx].authorList[i]);
        authors.add(author);
        for (let j = 0; j < len; j++) {
            let coauthor = fullname(element[artIdx].authorList[j]);
            matrix[coauthor][author] = (matrix[coauthor][author]++ || 0) + 1;
        }
    }
}

Context

StackExchange Code Review Q#160616, answer score: 3

Revisions (0)

No revisions yet.