Tokenizing each document in a large document of documents

Problem

I've been working on a project where I import a large file of documents and tokenize each document. I then build a HashSet of the tokens, and for each of these unique tokens I want to find its frequency in each of the documents. There are about 130,000 documents in total.

I've run the code and, unfortunately, it's taking 150 hours to run. Do you have any suggestions for improving my code?

This is my Tokenizer() function, which surprisingly works okay:

```
static private IEnumerable<string> Tokenizer(ref StreamReader sr, ref List<string> tokenz, ref List<string> stopwords)//function for tokenizing
{
    string line;
    var comparer = StringComparer.InvariantCultureIgnoreCase;
    var StopWordSet = new HashSet<string>(stopwords, comparer);
    List<string> tokens = new List<string>();//list of strings called tokens
    while ((line = sr.ReadLine()) != null)//as long as the streamreader has something
    {
        foreach (string item in line.Split(' '))//split amongst strings
        {
            if (item.StartsWith("<"))
            {
                item.Trim();//trims the item of spaces
                if (item == "</DOC>")
                {
                    //return item;
                    tokenz.Add(item);//adds the doc tags for later separation use
                }
            }
            else
            {
                string newitem;
                item.Trim();//trims the item of spaces
                if (item != "")//ensures item is not blank
                {
                    newitem = Regex.Replace(item, @"[^A-Za-z0-9]+", "", RegexOptions.IgnoreCase);//regex allows us to ignore case and remove any special characters
                    string newitem2 = newitem.ToLower();
                    if (StopWordSet.Contains(newitem2))
                    {
                    }
                    else
                    {
                        tokenz.Add(newitem2);
                    }
                    //token
                }
            }
        }
    }
    return tokens;
}
```

Solution

Tokenizer method:

  • You use ref incorrectly: all reference types are already passed by reference, so ref is only needed when you want to reassign the caller's variable. Please read MSDN or any other online resource to understand ref (see the sketch after this list).

  • Your code doesn't follow naming conventions: local variables and parameters should be camelCased (e.g. StopWordSet should be stopWordSet).

  • Use the string.Split(new char[0], StringSplitOptions.RemoveEmptyEntries) overload to split on all whitespace characters and remove empty strings during tokenizing, rather than trimming each result.

  • Strings are immutable, so the statement item.Trim() is useless unless you assign its result to something.

  • The Tokenizer method always returns an empty list, since tokens (not tokenz) is only initialized and never used. It looks like the tokenz parameter is redundant and the tokens variable should be used instead.

  • To improve performance, it would be better to change the type of the stopwords parameter to ISet<string> and use it for matching directly, rather than creating a new StopWordSet on every call.
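
A minimal sketch (not from the original answer) illustrating the ref, Split, and immutability points above; the identifiers are made up for the demo:

```csharp
using System;
using System.Collections.Generic;

class RefDemo
{
    // No ref needed: the reference is copied, but both copies point at
    // the same List, so Add is visible to the caller.
    static void AddToken(List<string> tokens)
    {
        tokens.Add("hello");
    }

    // ref is only needed when the method must reassign the caller's variable.
    static void Reassign(ref List<string> tokens)
    {
        tokens = new List<string> { "replaced" };
    }

    static void Main()
    {
        var tokens = new List<string>();
        AddToken(tokens);
        Console.WriteLine(tokens.Count);             // 1 - mutation visible without ref

        Reassign(ref tokens);
        Console.WriteLine(tokens[0]);                // "replaced" - only possible with ref

        // Strings are immutable: Trim() returns a new string.
        string item = "  word  ";
        item.Trim();                                 // result discarded; item is unchanged
        item = item.Trim();                          // assign the result to keep it
        Console.WriteLine("[" + item + "]");         // [word]

        // Split on all whitespace and drop empty entries in one call.
        string[] parts = "a  b\tc".Split(new char[0], StringSplitOptions.RemoveEmptyEntries);
        Console.WriteLine(string.Join(",", parts));  // a,b,c
    }
}
```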
As a result you'll get a cleaner version of Tokenizer:

private static IEnumerable<string> Tokenizer(StreamReader sr, ISet<string> stopwords)//function for tokenizing
{
    string line;
    List<string> tokens = new List<string>();//list of strings called tokens

    while ((line = sr.ReadLine()) != null)//as long as the streamreader has something
    {
        foreach (string item in line.Split(new char[0], StringSplitOptions.RemoveEmptyEntries))//split amongst strings
        {
            if (item.StartsWith("<") && item.EndsWith(">"))
            {
                if (item == "</DOC>")
                    tokens.Add(item); //adds the doc tags for later separation use
            }
            else
            {
                string newitem = Regex.Replace(item, @"[^A-Za-z0-9]+", "", RegexOptions.IgnoreCase).ToLower();
                if (!stopwords.Contains(newitem))
                    tokens.Add(newitem);
            }
        }
    }
    return tokens;
}
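
For completeness, here is a hypothetical call site (the file names are placeholders, and it assumes the caller lives in the same class, since Tokenizer is private). Building the stop-word HashSet once, with the same case-insensitive comparer the original code recreated on every call, is exactly the ISet<string> point above:

```csharp
// elsewhere in the same class as Tokenizer
// (needs using System; System.Collections.Generic; System.IO; System.Linq;)
static void Run()
{
    ISet<string> stopwords = new HashSet<string>(
        File.ReadAllLines("stopwords.txt"),          // hypothetical path
        StringComparer.InvariantCultureIgnoreCase);

    using (var sr = new StreamReader("corpus.txt"))  // hypothetical path
    {
        List<string> tokens = Tokenizer(sr, stopwords).ToList();
        Console.WriteLine(tokens.Count + " tokens (including </DOC> markers)");
    }
}
```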


Now let's look at the AddToDictionaryAndCount method:

  • Naming conventions: do name your variables and parameters properly. It's not at all clear what the difference is between the doccounter2 parameter and the doccounter and secondcounter variables; rename them to something like documentIndex, numberOfTermOccurrences, etc.

  • while (counter <= tokenz.LastIndexOf("</DOC>")) is the killer: you're searching the whole list for the last "</DOC>" entry on every single iteration. Cache the calculated value instead (see the sketch after this list).

  • Since you know the number of iterations and you always increment the index, it's better to use for instead of while.

  • It's not clear why you pass docFreqCounter as a parameter; it looks like it is being used as a local variable.
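
Since the original method body isn't reproduced here, a schematic before/after for the two loop points (the counter and tokenz names are taken from the review; the body is elided):

```csharp
// Before: LastIndexOf rescans the whole list on every iteration.
while (counter <= tokenz.LastIndexOf("</DOC>"))
{
    // ...
    counter++;
}

// After: compute the boundary once, then use a for loop.
int lastDocumentIndex = tokenz.LastIndexOf("</DOC>");
for (int counter = 0; counter <= lastDocumentIndex; counter++)
{
    // ...
}
```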



As to performance improvements (other than caching the LastIndexOf result): currently you scan through the tokens list once for each and every item in myLexicon. Assuming the lexicon is built from the tokens, it would be much better to scan the tokens list only once, tracking which items you've already counted and where the document boundaries are. Since you haven't provided the meaning of all the parameters participating in this method, it's hard to suggest a proper solution, but here is a first approximation:

public static Dictionary<string, Dictionary<int, int>> AddToDictionaryAndCount(List<string> tokens)
{
    var result = new Dictionary<string, Dictionary<int, int>>();
    var documentIndex = 0; //tracks the current document index

    var lastDocumentIndex = tokens.LastIndexOf("</DOC>");
    for (int i = 0; i <= lastDocumentIndex; i++)
    {
        var token = tokens[i];
        if (token == "</DOC>")
        {
            //finalize stats for the document. not sure what goes here.
            //add the logic corresponding to "doccounter2 < doccounter"
            documentIndex++;
            continue;
        }

        Dictionary<int, int> documentStats;
        if (!result.TryGetValue(token, out documentStats))
            documentStats = result[token] = new Dictionary<int, int>();

        //TryGetValue leaves count at 0 on a first occurrence, avoiding the
        //KeyNotFoundException that documentStats[documentIndex]++ would throw
        int count;
        documentStats.TryGetValue(documentIndex, out count);
        documentStats[documentIndex] = count + 1;
    }
    return result;
}
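
A small hand-made check (not from the original answer) of the result's shape; the outer key is the term, the inner key is the document index:

```csharp
var tokens = new List<string>
{
    "apple", "banana", "apple", "</DOC>",   // document 0
    "banana", "</DOC>"                      // document 1
};

var frequencies = AddToDictionaryAndCount(tokens);

Console.WriteLine(frequencies["apple"][0]);  // 2 - "apple" occurs twice in document 0
Console.WriteLine(frequencies["banana"][1]); // 1 - "banana" occurs once in document 1
```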


Context

StackExchange Code Review Q#23002, answer score: 9
