Tokenizing each document in a large document of documents
Problem
I've been working on a project where I import a large file containing many documents and tokenize each document. I then build a HashSet of the tokens I have, and for each of these unique tokens I want to find its frequency in each of the documents. There are about 130,000 documents in total.
I've run the code and unfortunately it's taking 150 hours to run. Do you have any suggestions for improving my code?
This is my `Tokenizer()` function, which surprisingly works okay:
```
static private IEnumerable<string> Tokenizer(ref StreamReader sr, ref List<string> tokenz, ref List<string> stopwords)//function for tokenizing
{
    string line;
    var comparer = StringComparer.InvariantCultureIgnoreCase;
    var StopWordSet = new HashSet<string>(stopwords, comparer);
    List<string> tokens = new List<string>();//list of strings called tokens
    while ((line = sr.ReadLine()) != null)//as long as the streamreader has something
    {
        foreach (string item in line.Split(' '))//split amongst strings
        {
            if (item.StartsWith("<") & item.EndsWith(">"))
            {
                item.Trim();//trims the item of spaces
                if (item == "</DOC>")
                {
                    //return item;
                    tokenz.Add(item);//adds the doc tags for later separation use
                }
            }
            else
            {
                string newitem;
                item.Trim();//trims the item of spaces
                if (item != "")//ensures item is not blank
                {
                    newitem = Regex.Replace(item, @"[^A-Za-z0-9]+", "", RegexOptions.IgnoreCase);//regex allows us to ignore case and remove any special characters
                    string newitem2 = newitem.ToLower();
                    if (StopWordSet.Contains(newitem2))
                    {
                    }
                    else
                    {
                        tokenz.Add(newitem2);
                    }
                    //token
                }
            }
        }
    }
    return tokens;
}
```
Solution
`Tokenizer` method:
- You use `ref` incorrectly; all reference types are always passed by reference. Please read MSDN or any other online resource to understand the `ref` keyword.
- Your code doesn't follow naming conventions: local variables and parameters should be camelCased (e.g. `StopWordSet`).
- Use the `string.Split(new char[0], StringSplitOptions.RemoveEmptyEntries)` overload to split on all whitespace characters and remove empty entries during tokenizing, rather than trimming each result.
- Strings are immutable, so the statement `item.Trim()` does nothing unless you assign its result to something.
- The `Tokenizer` method always returns an empty list, since `tokens` (not `tokenz`) is only initialized and never used. It looks like the `tokenz` parameter is redundant and the `tokens` variable should be used instead.
- To improve performance it might be better to change the type of the `stopwords` parameter to `ISet<string>` and use it for matching directly, rather than creating a new `StopWordSet` on every call.

As a result you'll get a cleaner version of `Tokenizer`:
```
private static IEnumerable<string> Tokenizer(StreamReader sr, ISet<string> stopwords)//function for tokenizing
{
    string line;
    List<string> tokens = new List<string>();//list of strings called tokens
    while ((line = sr.ReadLine()) != null)//as long as the streamreader has something
    {
        foreach (string item in line.Split(new char[0], StringSplitOptions.RemoveEmptyEntries))//split amongst strings
        {
            if (item.StartsWith("<") & item.EndsWith(">"))
            {
                if (item == "</DOC>")
                    tokens.Add(item); //adds the doc tags for later separation use
            }
            else
            {
                string newitem = Regex.Replace(item, @"[^A-Za-z0-9]+", "", RegexOptions.IgnoreCase).ToLower();
                if (!stopwords.Contains(newitem))
                    tokens.Add(newitem);
            }
        }
    }
    return tokens;
}
```
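On the `ref` point above: since `List<string>` is a reference type, the method already receives a reference to the caller's list, and mutations made through that reference are visible to the caller without `ref`. A minimal sketch (the `RefDemo` and `AddToken` names are invented for illustration):

```csharp
using System;
using System.Collections.Generic;

public static class RefDemo
{
    // No ref needed: the parameter already holds a reference to the caller's list,
    // so elements added here are visible to the caller.
    public static void AddToken(List<string> tokens, string token)
    {
        tokens.Add(token);
    }

    public static void Main()
    {
        var tokens = new List<string>();
        AddToken(tokens, "hello");
        Console.WriteLine(tokens.Count); // prints 1 - the caller sees the mutation
    }
}
```

`ref` would only matter if `AddToken` reassigned the `tokens` parameter itself (e.g. `tokens = new List<string>()`) and the caller needed to observe that reassignment.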
`AddToDictionaryAndCount` method:
- Naming conventions: do name your variables and parameters properly; it is not at all clear what the difference is between the `doccounter2` parameter and the `doccounter` and `secondcounter` variables. Rename them to something like `documentIndex`, `numberOfTermOccurrences`, etc.
- `while (counter < tokens.LastIndexOf("</DOC>"))` is the killer: you search for the `"</DOC>"` entry on every iteration. Cache the calculated value instead.
- Since you know the number of iterations and you always increment the index, it's better to use `for` instead of `while`.
- It is not clear why you pass `docFreqCounter` as a parameter; it looks like it is being used as a local variable.

As to performance improvements (other than caching the `LastIndexOf` result): currently you scan through the tokens list once for each and every item in `myLexicon`. Assuming the lexicon is built from the tokens, it would be much better to scan the tokens list only once, tracking which items you have already counted and where the document boundaries are. Since you haven't provided the meaning of all the parameters participating in this method, it's hard to suggest a complete solution, but here is a first approximation:
```
public static Dictionary<string, Dictionary<int, int>> AddToDictionaryAndCount(List<string> tokens)
{
    var result = new Dictionary<string, Dictionary<int, int>>();
    var documentIndex = 0; //tracks the current document index
    var lastDocumentIndex = tokens.LastIndexOf("</DOC>"); //cached so it is not recomputed on every iteration
    for (int i = 0; i <= lastDocumentIndex; i++)
    {
        var token = tokens[i];
        if (token == "</DOC>")
        {
            //finalize stats for the document. not sure what goes here.
            //add the logic corresponding to "doccounter2 < doccounter"
            documentIndex++;
            continue;
        }
        Dictionary<int, int> documentStats;
        if (!result.TryGetValue(token, out documentStats))
            documentStats = result[token] = new Dictionary<int, int>();
        int count;
        documentStats.TryGetValue(documentIndex, out count);
        documentStats[documentIndex] = count + 1; //incrementing via the indexer alone would throw for a key not yet present
    }
    return result;
}
```
Code Snippets
```
private static IEnumerable<string> Tokenizer(StreamReader sr, ISet<string> stopwords)//function for tokenizing
{
    string line;
    List<string> tokens = new List<string>();//list of strings called tokens
    while ((line = sr.ReadLine()) != null)//as long as the streamreader has something
    {
        foreach (string item in line.Split(new char[0], StringSplitOptions.RemoveEmptyEntries))//split amongst strings
        {
            if (item.StartsWith("<") & item.EndsWith(">"))
            {
                if (item == "</DOC>")
                    tokens.Add(item); //adds the doc tags for later separation use
            }
            else
            {
                string newitem = Regex.Replace(item, @"[^A-Za-z0-9]+", "", RegexOptions.IgnoreCase).ToLower();
                if (!stopwords.Contains(newitem))
                    tokens.Add(newitem);
            }
        }
    }
    return tokens;
}
```

```
public static Dictionary<string, Dictionary<int, int>> AddToDictionaryAndCount(List<string> tokens)
{
    var result = new Dictionary<string, Dictionary<int, int>>();
    var documentIndex = 0; //tracks the current document index
    var lastDocumentIndex = tokens.LastIndexOf("</DOC>"); //cached so it is not recomputed on every iteration
    for (int i = 0; i <= lastDocumentIndex; i++)
    {
        var token = tokens[i];
        if (token == "</DOC>")
        {
            //finalize stats for the document. not sure what goes here.
            //add the logic corresponding to "doccounter2 < doccounter"
            documentIndex++;
            continue;
        }
        Dictionary<int, int> documentStats;
        if (!result.TryGetValue(token, out documentStats))
            documentStats = result[token] = new Dictionary<int, int>();
        int count;
        documentStats.TryGetValue(documentIndex, out count);
        documentStats[documentIndex] = count + 1; //incrementing via the indexer alone would throw for a key not yet present
    }
    return result;
}
```
Context
StackExchange Code Review Q#23002, answer score: 9
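To see the single-pass counting idea working end to end, here is a hedged, self-contained sketch; the `TermFrequencyDemo` class name and the tiny two-document corpus are invented for illustration, and the method bodies are simplified copies of the cleaned-up versions from the answer:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class TermFrequencyDemo
{
    // Simplified copy of the cleaned-up Tokenizer.
    public static IEnumerable<string> Tokenizer(StreamReader sr, ISet<string> stopwords)
    {
        string line;
        var tokens = new List<string>();
        while ((line = sr.ReadLine()) != null)
        {
            foreach (string item in line.Split(new char[0], StringSplitOptions.RemoveEmptyEntries))
            {
                if (item.StartsWith("<") & item.EndsWith(">"))
                {
                    if (item == "</DOC>")
                        tokens.Add(item); // keep document boundary markers
                }
                else
                {
                    string newitem = Regex.Replace(item, @"[^A-Za-z0-9]+", "").ToLower();
                    if (!stopwords.Contains(newitem))
                        tokens.Add(newitem);
                }
            }
        }
        return tokens;
    }

    // Simplified copy of the single-pass counting method.
    public static Dictionary<string, Dictionary<int, int>> AddToDictionaryAndCount(List<string> tokens)
    {
        var result = new Dictionary<string, Dictionary<int, int>>();
        var documentIndex = 0;
        var lastDocumentIndex = tokens.LastIndexOf("</DOC>"); // cached boundary
        for (int i = 0; i <= lastDocumentIndex; i++)
        {
            var token = tokens[i];
            if (token == "</DOC>")
            {
                documentIndex++; // crossed a document boundary
                continue;
            }
            Dictionary<int, int> documentStats;
            if (!result.TryGetValue(token, out documentStats))
                documentStats = result[token] = new Dictionary<int, int>();
            int count;
            documentStats.TryGetValue(documentIndex, out count);
            documentStats[documentIndex] = count + 1;
        }
        return result;
    }

    public static void Main()
    {
        // Two tiny documents separated by </DOC> tags; "the" is the only stop word.
        var corpus = "the cat sat\n</DOC>\nthe cat ran far\n</DOC>\n";
        var stopwords = new HashSet<string>(new[] { "the" }, StringComparer.InvariantCultureIgnoreCase);
        using (var sr = new StreamReader(new MemoryStream(Encoding.UTF8.GetBytes(corpus))))
        {
            var tokens = Tokenizer(sr, stopwords).ToList();
            var frequencies = AddToDictionaryAndCount(tokens);
            Console.WriteLine(frequencies["cat"][0]); // prints 1
            Console.WriteLine(frequencies["cat"][1]); // prints 1
        }
    }
}
```

The result maps each term to a per-document frequency table, so a term's frequency in document *d* is a single dictionary lookup after one pass over the tokens.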