patterncsharpMinor
Inverse Document Frequency (IDF) implementation
Viewed 0 times
inversefrequencyidfdocumentimplementation
Problem
I would like to share my class IDF. Any suggestions/modification?
public class IDFMeasure
{
private readonly string[] _docs;
private readonly int _numDocs;
private int _numTerms;
private List _terms;
private float[][] _InverseDocFreq;
private int[] _docFreq;
private readonly Dictionary _wordsIndex = new Dictionary();
public IDFMeasure(string[] documents)
{
_docs = documents;
_numDocs = documents.Length;
MyInit();
}
private List GenerateTerms(string[] docs)
{
return docs.SelectMany(doc => ProcessDocument(doc)).Distinct().ToList();
}
private IEnumerable ProcessDocument(string doc)
{
return doc.Split(' ')
.GroupBy(word => word)
.OrderByDescending(g => g.Count())
.Select(g => g.Key)
}
private int GetTermIndex(string term)
{
return _wordsIndex[term];
}
private void MyInit()
{
_terms = GenerateTerms(_docs);
_numTerms = _terms.Count;
_docFreq = new int[_numTerms];
for (var i = 0; i GetWordFrequency(string input)
{
return input.Split(' ').GroupBy(x => x)
.OrderByDescending(g => g.Count())
.ToDictionary(g => g.Key, g => g.Count());
}
}Solution
The layout and naming conventions look good. Thank you for that.
_InverseDocFreq should be renamed _inverseDocFreq to keep with conventions and also stay consistent in your code.
I also think _terms should be an IList. This just eliminates a bunch of the unneeded operations.
you are missing a ; after this statement in
I'm not sure why you are using
I would rename
The for loop where you are filling the terms into the _wordIndex dictionary could be moved to its own method, I called it FillWordIndexWithTerms(), and it looks like
I renamed i to termIndex to give it a bit better meaning in your code, and make it easier to read.
The constructor now looks like
In the
I would move the inner ForEach in
The ForEach can also be simplified using Linq.
In
Overall, nice code to read, the suggestions are just minor clean-up. I really liked the way you used methods to separate different operations.
_InverseDocFreq should be renamed _inverseDocFreq to keep with conventions and also stay consistent in your code.
I also think _terms should be an IList. This just eliminates a bunch of the unneeded operations.
you are missing a ; after this statement in
ProcessDocumentreturn doc.Split(' ')
.GroupBy(word => word)
.OrderByDescending(g => g.Count())
.Select(g => g.Key)I'm not sure why you are using
MyInit, the constructor is the only place that is called. I would move the code into the constructor, then you can make _numTerms, _terms, and _docFreq readonly.I would rename
_wordsIndex to _wordIndex because it is really only and index for each word. I would also move the new into the constructor.The for loop where you are filling the terms into the _wordIndex dictionary could be moved to its own method, I called it FillWordIndexWithTerms(), and it looks like
private void FillWordIndexWithTerms()
{
for (var termIndex = 0; termIndex < _terms.Count; termIndex++)
{
_wordIndex.Add(_terms[termIndex], termIndex);
}
}I renamed i to termIndex to give it a bit better meaning in your code, and make it easier to read.
The constructor now looks like
public IDFMeasure(string[] documents)
{
_docs = documents;
_numDocs = documents.Length;
_terms = GenerateTerms(_docs);
_numTerms = _terms.Count;
_docFreq = new int[_numTerms];
_wordIndex = new Dictionary();
FillWordIndexWithTerms();
GenerateDocumentFrequency();
GenerateInverseDocfrequency();
}In the
GenerateTerms method, I would change the input parameter to an IEnumerable. You could also use a method group in the linq statement, you don't need the lamba. The function can also be made static.private static IList GenerateTerms(IEnumerable docs)
{
return docs.SelectMany(ProcessDocument).Distinct().ToList();
}ProcessDocument looks good.GetTermIndex looks good.Log could be made static. Other than that it looks good.I would move the inner ForEach in
GenerateDocumentFrequency to its own method. This will reduce nesting.private void GenerateDocumentFrequency()
{
_inverseDocFreq = new float[_numDocs][];
for (var i = 0; i < _numDocs; i++)
{
_inverseDocFreq[i] = new float[_numTerms];
var curDoc = _docs[i];
var freq = GetWordFrequency(curDoc);
CalculateTermFrequency(freq);
}
}The ForEach can also be simplified using Linq.
private void CalculateTermFrequency(Dictionary freq)
{
foreach (var termIndex in freq.Select(entry => GetTermIndex(entry.Key)))
{
_docFreq[termIndex]++;
}
}In
GenerateInverseDocfrequency I would rename i to currentDocument. The nested ForEach should be moved to its own method, CalculateInverseDocumentFrequency.private void GenerateInverseDocfrequency()
{
for (var currentDocument = 0; currentDocument freq)
{
foreach (var termIndex in freq.Select(entry => entry.Key).Select(GetTermIndex))
{
_inverseDocFreq[currentDocument][termIndex] = Log((_numDocs)/
(float) _docFreq[termIndex]);
}
}GetWordFrequency could be made static. Other than that it looks good.Overall, nice code to read, the suggestions are just minor clean-up. I really liked the way you used methods to separate different operations.
Code Snippets
return doc.Split(' ')
.GroupBy(word => word)
.OrderByDescending(g => g.Count())
.Select(g => g.Key)private void FillWordIndexWithTerms()
{
for (var termIndex = 0; termIndex < _terms.Count; termIndex++)
{
_wordIndex.Add(_terms[termIndex], termIndex);
}
}public IDFMeasure(string[] documents)
{
_docs = documents;
_numDocs = documents.Length;
_terms = GenerateTerms(_docs);
_numTerms = _terms.Count;
_docFreq = new int[_numTerms];
_wordIndex = new Dictionary<string, int>();
FillWordIndexWithTerms();
GenerateDocumentFrequency();
GenerateInverseDocfrequency();
}private static IList<string> GenerateTerms(IEnumerable<string> docs)
{
return docs.SelectMany(ProcessDocument).Distinct().ToList();
}private void GenerateDocumentFrequency()
{
_inverseDocFreq = new float[_numDocs][];
for (var i = 0; i < _numDocs; i++)
{
_inverseDocFreq[i] = new float[_numTerms];
var curDoc = _docs[i];
var freq = GetWordFrequency(curDoc);
CalculateTermFrequency(freq);
}
}Context
StackExchange Code Review Q#19143, answer score: 4
Revisions (0)
No revisions yet.