HiveBrain v1.2.0
Get Started
← Back to all entries
patterncsharpMinor

Inverse Document Frequency (IDF) implementation

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
inversefrequencyidfdocumentimplementation

Problem

I would like to share my class IDF. Any suggestions/modification?

public class IDFMeasure
{
    private readonly string[] _docs;
    private readonly int _numDocs;
    private int _numTerms;
    private List _terms;
    private float[][] _InverseDocFreq;
    private int[] _docFreq;

    private readonly Dictionary _wordsIndex = new Dictionary();

    public IDFMeasure(string[] documents)
    {
        _docs = documents;
        _numDocs = documents.Length;
        MyInit();
    }

    private List GenerateTerms(string[] docs)
    {
        return docs.SelectMany(doc => ProcessDocument(doc)).Distinct().ToList();
    }

    private IEnumerable ProcessDocument(string doc)
    {
        return doc.Split(' ')
            .GroupBy(word => word)
            .OrderByDescending(g => g.Count())
            .Select(g => g.Key)
    }

    private int GetTermIndex(string term)
    {
        return _wordsIndex[term];
    }

    private void MyInit()
    {
        _terms = GenerateTerms(_docs);
        _numTerms = _terms.Count;

        _docFreq = new int[_numTerms];

        for (var i = 0; i  GetWordFrequency(string input)
    {
        return input.Split(' ').GroupBy(x => x)
            .OrderByDescending(g => g.Count())
            .ToDictionary(g => g.Key, g => g.Count());
    }
}

Solution

The layout and naming conventions look good. Thank you for that.

_InverseDocFreq should be renamed _inverseDocFreq to keep with conventions and also stay consistent in your code.

I also think _terms should be an IList. This just eliminates a bunch of the unneeded operations.

you are missing a ; after this statement in ProcessDocument

return doc.Split(' ')
            .GroupBy(word => word)
            .OrderByDescending(g => g.Count())
            .Select(g => g.Key)


I'm not sure why you are using MyInit, the constructor is the only place that is called. I would move the code into the constructor, then you can make _numTerms, _terms, and _docFreq readonly.

I would rename _wordsIndex to _wordIndex because it is really only and index for each word. I would also move the new into the constructor.

The for loop where you are filling the terms into the _wordIndex dictionary could be moved to its own method, I called it FillWordIndexWithTerms(), and it looks like

private void FillWordIndexWithTerms()
{
    for (var termIndex = 0; termIndex < _terms.Count; termIndex++)
    {
        _wordIndex.Add(_terms[termIndex], termIndex);
    }
}


I renamed i to termIndex to give it a bit better meaning in your code, and make it easier to read.

The constructor now looks like

public IDFMeasure(string[] documents)
{
    _docs = documents;
    _numDocs = documents.Length;

    _terms = GenerateTerms(_docs);
    _numTerms = _terms.Count;

    _docFreq = new int[_numTerms];
    _wordIndex = new Dictionary();

    FillWordIndexWithTerms();
    GenerateDocumentFrequency();
    GenerateInverseDocfrequency();
}


In the GenerateTerms method, I would change the input parameter to an IEnumerable. You could also use a method group in the linq statement, you don't need the lamba. The function can also be made static.

private static IList GenerateTerms(IEnumerable docs)
{
    return docs.SelectMany(ProcessDocument).Distinct().ToList();
}


ProcessDocument looks good.

GetTermIndex looks good.

Log could be made static. Other than that it looks good.

I would move the inner ForEach in GenerateDocumentFrequency to its own method. This will reduce nesting.

private void GenerateDocumentFrequency()
{
    _inverseDocFreq = new float[_numDocs][];

    for (var i = 0; i < _numDocs; i++)
    {
        _inverseDocFreq[i] = new float[_numTerms];

        var curDoc = _docs[i];
        var freq = GetWordFrequency(curDoc);

        CalculateTermFrequency(freq);
    }
}


The ForEach can also be simplified using Linq.

private void CalculateTermFrequency(Dictionary freq)
{
    foreach (var termIndex in freq.Select(entry => GetTermIndex(entry.Key)))
    {
        _docFreq[termIndex]++;
    }
}


In GenerateInverseDocfrequency I would rename i to currentDocument. The nested ForEach should be moved to its own method, CalculateInverseDocumentFrequency.

private void GenerateInverseDocfrequency()
{
    for (var currentDocument = 0; currentDocument  freq)
{
    foreach (var termIndex in freq.Select(entry => entry.Key).Select(GetTermIndex))
    {
        _inverseDocFreq[currentDocument][termIndex] = Log((_numDocs)/
                                                            (float) _docFreq[termIndex]);
    }
}


GetWordFrequency could be made static. Other than that it looks good.

Overall, nice code to read, the suggestions are just minor clean-up. I really liked the way you used methods to separate different operations.

Code Snippets

return doc.Split(' ')
            .GroupBy(word => word)
            .OrderByDescending(g => g.Count())
            .Select(g => g.Key)
private void FillWordIndexWithTerms()
{
    for (var termIndex = 0; termIndex < _terms.Count; termIndex++)
    {
        _wordIndex.Add(_terms[termIndex], termIndex);
    }
}
public IDFMeasure(string[] documents)
{
    _docs = documents;
    _numDocs = documents.Length;

    _terms = GenerateTerms(_docs);
    _numTerms = _terms.Count;

    _docFreq = new int[_numTerms];
    _wordIndex = new Dictionary<string, int>();

    FillWordIndexWithTerms();
    GenerateDocumentFrequency();
    GenerateInverseDocfrequency();
}
private static IList<string> GenerateTerms(IEnumerable<string> docs)
{
    return docs.SelectMany(ProcessDocument).Distinct().ToList();
}
private void GenerateDocumentFrequency()
{
    _inverseDocFreq = new float[_numDocs][];

    for (var i = 0; i < _numDocs; i++)
    {
        _inverseDocFreq[i] = new float[_numTerms];

        var curDoc = _docs[i];
        var freq = GetWordFrequency(curDoc);

        CalculateTermFrequency(freq);
    }
}

Context

StackExchange Code Review Q#19143, answer score: 4

Revisions (0)

No revisions yet.