HiveBrain v1.2.0
Get Started
← Back to all entries
principlejavascriptModerate

O(N) vs O(N * N) for Word Frequency Algorithm

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
frequencyalgorithmwordfor

Problem

I am trying to come up with a better solution to the following problem:

Statement


You are supposed to write a function which takes the contents of a document in string format and also a whole number, n, providing how many results to return. (So that's two arguments for the function.) The function should return the n most frequently occurring words in descending order. The algorithm should be \$O(N)\$, not \$O(N \cdot N)\$.

My Current Solution

function freq(str, n) {
  var results = [];

  var strArray = str.split(/\s+/), countObj = {}, wordArray = [], wordObj = {};

  strArray.forEach(function(el) {
    countObj[el] ? countObj[el] += 1 : countObj[el] = 1;
  });

  for(var el in countObj) { 
    wordObj[countObj[el]] ? wordObj[countObj[el]].push(el) : wordObj[countObj[el]] = [el];
  }

  for(var el in wordObj){
    wordArray.push(wordObj[el].join(' '));
  }

  for(var i = 0; i < n; i++) {
    results.push(wordArray[wordArray.length - 1 - i]);
  }

  return results.join('\n');
};


Time Complexity

The solution seems to be \$O(N)\$. Are there any improvements I could make?

Best Practices

I am using ternary operators a lot for the if-else construct. Am I overusing them?

General

Any suggestions regarding how I can make this solution 'production grade' quality?

Solution

First of all, there's a bug: You're relying on the properties in wordObj being iterated in ascending order when creating wordArray. I.e. wordObj['1'] is handled before wordObj['3'] and so on. This behavior is not guaranteed!

Objects are unordered in JS. Some engines do order them, but you cannot rely on it. Even engines that do order properties are under no obligation to sort them in any particular way. They might be in descending order, or sorted lexicographically (which would make sense since keys are strings, but it'd sort 10 before 2), or something else entirely.

Secondly, it just seems like you're working too hard here.

If there's a "tie" between to words (say "foo" and "bar" both appear 3 times), you return all of them on one line. The problem description does not explicitly require you to handle ties in any particular way. However, your solution basically means that you're not actually returning N words. A line of words separated by spaces is not "a word", but if it was, there'd be more than N words in total.

So I'd simply leave that out.

Besides, it's simpler to use reduce, sort and slice. You can also use a || "failover" instead of ternaries.

function freq(str, n) {
  var words = str.split(/\s+/);

  // Count occurrences
  var counts = words.reduce(function (memo, word) {
    memo[word] = (memo[word] || 0) + 1;
    return memo;
  }, {});

  // Sort by count (descending)
  var sorted = Object.keys(counts).sort(function (a, b) {
    return counts[b] - counts[a];
  });

  // Return the first n words, separated by linebreaks
  return sorted.slice(0, n).join("\n");
}


Edit: As rolfl points out, the input should probably be trimmed before anything else, or you might get empty strings in your words array. However, to avoid empty strings being counted as words, without changing the code above too much, you can do this:

var counts = words.reduce(function (memo, word) {
  if(word) { // ignore empty strings
    memo[word] = (memo[word] || 0) + 1;
  }
  return memo;
}, {});


Edit: As was pointed out, the above isn't actually O(n) due to its use of sort. Bummer. This one should be closer, although not as elegant. It's closer to the original code in some ways, but still avoids assuming that object properties are ordered.

function freq(str, n) {
  var words = str.split(/\s+/);

  // Count words (same as before)
  var wordCounts = words.reduce(function (memo, word) {
    if(word) {
      memo[word] = (memo[word] || 0) + 1;
    }
    return memo;
  }, {});

  // Group words into nested arrays that are, in turn,
  // indexed by the word count. I.e. wordsIndexedByCount[3]
  // will be a nested array of all the words that occurred
  // 3 times.
  // Also keep track of the highest index we use (maxCount).
  var wordsIndexedByCount = [],
      maxCount = 0;

  for(var word in wordCounts) {
    var count = wordCounts[word];
    maxCount = count > maxCount ? count : maxCount;
    wordsIndexedByCount[count] = wordsIndexedByCount[count] || [];
    wordsIndexedByCount[count].push(word);
  }

  // Flatten the array from above, starting with the most
  // often occurring word(s) - i.e. the highest index.
  var sorted = [];
  for(var i = maxCount; i > 0; i--) {
    // Skip "holes" in the array (not that it's necessary; the
    // code works without this check, since push.apply ignores
    // an undefined argument)
    if(!wordsIndexedByCount[i]) {
      continue;
    }
    [].push.apply(sorted, wordsIndexedByCount[i]);
    if(sorted.length >= n) {
      break; // if we've got enough to return, there's no reason to keep going
    }
  }

  return sorted.slice(0, n).join("\n");
}

Code Snippets

function freq(str, n) {
  var words = str.split(/\s+/);

  // Count occurrences
  var counts = words.reduce(function (memo, word) {
    memo[word] = (memo[word] || 0) + 1;
    return memo;
  }, {});

  // Sort by count (descending)
  var sorted = Object.keys(counts).sort(function (a, b) {
    return counts[b] - counts[a];
  });

  // Return the first n words, separated by linebreaks
  return sorted.slice(0, n).join("\n");
}
var counts = words.reduce(function (memo, word) {
  if(word) { // ignore empty strings
    memo[word] = (memo[word] || 0) + 1;
  }
  return memo;
}, {});
function freq(str, n) {
  var words = str.split(/\s+/);

  // Count words (same as before)
  var wordCounts = words.reduce(function (memo, word) {
    if(word) {
      memo[word] = (memo[word] || 0) + 1;
    }
    return memo;
  }, {});

  // Group words into nested arrays that are, in turn,
  // indexed by the word count. I.e. wordsIndexedByCount[3]
  // will be a nested array of all the words that occurred
  // 3 times.
  // Also keep track of the highest index we use (maxCount).
  var wordsIndexedByCount = [],
      maxCount = 0;

  for(var word in wordCounts) {
    var count = wordCounts[word];
    maxCount = count > maxCount ? count : maxCount;
    wordsIndexedByCount[count] = wordsIndexedByCount[count] || [];
    wordsIndexedByCount[count].push(word);
  }

  // Flatten the array from above, starting with the most
  // often occurring word(s) - i.e. the highest index.
  var sorted = [];
  for(var i = maxCount; i > 0; i--) {
    // Skip "holes" in the array (not that it's necessary; the
    // code works without this check, since push.apply ignores
    // an undefined argument)
    if(!wordsIndexedByCount[i]) {
      continue;
    }
    [].push.apply(sorted, wordsIndexedByCount[i]);
    if(sorted.length >= n) {
      break; // if we've got enough to return, there's no reason to keep going
    }
  }

  return sorted.slice(0, n).join("\n");
}

Context

StackExchange Code Review Q#68655, answer score: 15

Revisions (0)

No revisions yet.