Slow data-processing and inefficient memory usage in .NET containers
Problem
I am writing a text classifier, and to do so I need the TF-IDF value of every word in each individual text.
Then I need to use the cosine similarity:
$$\text{similarity} = \cos(\theta) = \dfrac{A \cdot B}{\lVert A \rVert \lVert B \rVert} = \dfrac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2} \times \sqrt{\sum_{i=1}^{n} (B_i)^2}}$$
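The formula above could be computed as in the following minimal sketch, which assumes each text is stored as a sparse map from word to TF-IDF weight. The `CosineSketch` class and `CosineSimilarity` method are hypothetical names for illustration and do not appear in the code that follows.

```csharp
using System;
using System.Collections.Generic;

public static class CosineSketch
{
    // Cosine similarity between two sparse vectors stored as word -> weight maps.
    // Words missing from a map are treated as having weight 0.
    public static double CosineSimilarity(
        IReadOnlyDictionary<string, double> a,
        IReadOnlyDictionary<string, double> b)
    {
        double dot = 0.0, normA = 0.0, normB = 0.0;

        foreach (var pair in a)
        {
            normA += pair.Value * pair.Value;

            // Only words present in both vectors contribute to the dot product.
            if (b.TryGetValue(pair.Key, out double weightB))
            {
                dot += pair.Value * weightB;
            }
        }

        foreach (var weight in b.Values)
        {
            normB += weight * weight;
        }

        // A zero vector has no direction; return 0 rather than dividing by zero.
        if (normA == 0.0 || normB == 0.0)
        {
            return 0.0;
        }

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

Iterating over one map and probing the other keeps the dot product proportional to the number of stored entries rather than to the size of the whole vocabulary.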
This requires processing a large data store (all of the texts that already exist in my database). The problem is that my code runs for about 2 hours (far too long) and then crashes with an out-of-memory error. I think that the main method might be very poorly optimised.
```
public static void CreateCategoryClasses()
{
deserializeClasses = Deserialize();
howManyClasses = deserializeClasses.Count;
ewhClass = new EventWaitHandle[howManyClasses];
for (var i = 0; i ());
result.Enqueue(new ConcurrentDictionary());
}
WaitCallback threadMethod = ParseCategories;
ThreadPool.SetMaxThreads(howManyStudents, howManyClasses);
for (var i = 0; i > efekt = new List>(5);
for (int i = 0; i >));
using (var sw = new StreamWriter(@"categoryClasses.xml"))
{
xmls.Serialize(sw, efekt);
}
}
private static void AddIDF(object index)
{
Console.WriteLine("Thread started with id:" + index + " and number: " + Thread.CurrentThread.ManagedThreadId);
var i = index as int?;
double sum;
foreach (var word in categoryClasses.ElementAt(i.Value))
{
sum =
deserializeClasses.Count(
clas =>
clas.Bag.Where(x => clas.Category == ((Categories)i.Value).ToString())
.Contains(new Words(word.Key, 0, 0)));
var temp = Convert.ToDouble(sum) /
Convert.ToDouble(
deserializeClasses.Count(x => x.Category == ((Categories)i.Value).ToString()));
result.E
```
Solution
Edit in response to OP comment above:
`CreateCategoryClasses` may appear to be slow because of the `WaitOne` calls. You're essentially waiting on either `ParseCategories` or `AddIDF` to complete before moving on. It would help if you could also post the `ParseCategories` method. Otherwise, you're just adding items to collections, which generally shouldn't perform too badly, except for some of my comments below.
Original answer:
Some of this is performance suggestions, some is just general advice...
- If you have one, run this through a performance profiler. It can quickly point you directly to the lines that use the most CPU time. Personally, I use ANTS from Red Gate. Another good one is dotTrace from JetBrains.
- Why is `categoryClasses` a `ConcurrentQueue`? As a general rule, a `Queue` is used when you remove items as you process them. If you just need to loop over a collection (possibly multiple times), a `List` would be a better choice. This would also avoid the call to `ToList()` each time through your `secondWord` loop. `ToList()` iterates through the entire collection every time you call it (granted, it's only 5 elements, but still...). This may apply to `result` as well, depending on where else it is used.
- You call `ElementAt()` every time through your `foreach` loop in `AddIDF`, even though `i` isn't changing. If you must use a `ConcurrentQueue` for `categoryClasses`, then `ElementAt()` is \$O(i)\$. If you can use a `List`, it will be \$O(1)\$. You could save the current `ConcurrentDictionary` in a local variable before the loop to avoid the repeated indexing.
- String comparisons are slow compared to integer and enum comparisons (your `Where()` calls in `AddIDF`). If you changed `Category` to an enum or integer (or added a second property for comparison purposes), that would improve the count performance.
- You pass an `int` into your `AddIDF` method, not an `int?`, so you can cast directly to `int`. This avoids the `.Value` property everywhere you access the index in `AddIDF`.
- `sum` is already a `double`, so there's no need to call `Convert.ToDouble`. Because `sum` is a `double`, `temp` will be a `double`, even if the result of `Count` is an `int`.
- What is this `5` I keep seeing? Typically it's good to assign this to a `const` value for clarity.
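Several of the points above (a `List` indexed once before the loop, an enum comparison instead of a string comparison, a direct `int` cast, and a named constant) could be combined roughly as in the sketch below. This is only an illustration under assumptions: `CategoryKind`, `Document`, `IdfSketch` and `CategoryCount` are hypothetical stand-ins for the original types, the bag is assumed to be a `HashSet<string>`, and the sketch uses two categories for brevity. It also filters the documents by category once before the loop, which goes a step beyond the list above. The ratio computed is the one in `AddIDF`: documents of the category containing the word, divided by documents in that category.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the original types (assumptions for illustration).
public enum CategoryKind { Sport, Politics }

public sealed class Document
{
    public CategoryKind Category { get; }
    public HashSet<string> Bag { get; }

    public Document(CategoryKind category, IEnumerable<string> words)
    {
        Category = category;
        Bag = new HashSet<string>(words);
    }
}

public static class IdfSketch
{
    // Named constant instead of a bare literal scattered through the code.
    public const int CategoryCount = 2;

    // For each word of one category, compute the fraction of that category's
    // documents that contain the word.
    public static Dictionary<string, double> DocumentFractions(
        object index,
        List<Dictionary<string, int>> categoryClasses,
        List<Document> documents)
    {
        // Direct cast: the boxed value is an int, so no int? is needed.
        int i = (int)index;
        if (i < 0 || i >= CategoryCount)
        {
            throw new ArgumentOutOfRangeException(nameof(index));
        }
        var category = (CategoryKind)i;

        // Index the list once, outside the loop (O(1) for a List).
        Dictionary<string, int> words = categoryClasses[i];

        // Compare enums, and filter once instead of once per word.
        List<Document> inCategory =
            documents.Where(d => d.Category == category).ToList();

        var result = new Dictionary<string, double>(words.Count);
        foreach (var word in words.Keys)
        {
            int containing = inCategory.Count(d => d.Bag.Contains(word));
            result[word] = inCategory.Count == 0
                ? 0.0
                : (double)containing / inCategory.Count;
        }
        return result;
    }
}
```

Whether the `HashSet<string>` lookup is appropriate depends on how `Words` defines equality in the original code, which is not shown here.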
Context
StackExchange Code Review Q#52202, answer score: 7