patternMinor
Performance of using objects with scripting dictionary
Viewed 0 times
scriptingobjectswithusingperformancedictionary
Problem
The code below is a working excerpt from a larger application that I am re-factoring. My question focuses on the performance aspect of using (or not using) objects in this case.
I have a medium-ish size data set on a worksheet (6500 rows by 12 columns) and am building an application to merge (mostly) new data records with the existing data.
My overall approach is to:
So far, so good. I have working code that accomplishes these steps, though without the architecture of classes/objects and good functional design.
My question concerns performance issues I'm seeing in Step 2. (A fully functional example is listed at the bottom of this post.)
I'm seeing wildly different execution performance times when using objects and the dictionary as opposed to when I side-step the objects completely. Searches online for questions related to object performance turned up this reference, but nothing directly related to my experience.
Giving you the answers up front, my performance is measured as:
Clearly, when I'm using my full dataset at 6500+ rows, my execution time grows to an unacceptable level.
I would appreciate comments and feedback on
I have a medium-ish size data set on a worksheet (6500 rows by 12 columns) and am building an application to merge (mostly) new data records with the existing data.
My overall approach is to:
- Ingest the new data into a single object. The new dataset is held internally in both an array and a dictionary.
- Ingest the existing data into a single object. The existing data is held internally in both an array and a dictionary.
- Determine if a record from the new dataset exists in the existing dataset by creating a "key" and using it to search the dictionary. This unique key is created by concatenating the first four fields of a data record.
- If the data record does not exist in the existing dataset, add it to the existing data object (to both the dictionary and the array).
So far, so good. I have working code that accomplishes these steps, though without the architecture of classes/objects and good functional design.
My question concerns performance issues I'm seeing in Step 2. (A fully functional example is listed at the bottom of this post.)
I'm seeing wildly different execution performance times when using objects and the dictionary as opposed to when I side-step the objects completely. Searches online for questions related to object performance turned up this reference, but nothing directly related to my experience.
Giving you the answers up front, my performance is measured as:
Array Used is (1000 by 12)
set up array = 7.046 ms
populate dict with objects = 4775.396 ms
populate dict without objects (store key) = 11.222 ms
populate dict without objects (store array) = 7502.135 ms
Clearly, when I'm using my full dataset at 6500+ rows, my execution time grows to an unacceptable level.
I would appreciate comments and feedback on
Solution
As far as the relative performance goes:
When you instantiate an object and add it to the
In your "populate dict without objects (store array)" case, you are actually cramming a significant amount of data into the
...you are not only storing the entire array in the
In your final case, storing the unique row identifier as both the key and the value, you are simply storing a
In short, the above make the benchmarking results almolst entirely meaningless.
A couple other things:
When you instantiate an object and add it to the
Dictionary in DictWithObjects, the only thing that actually gets stored in the Dictionary is a reference to the object. This is basically just a pointer. So in your "populate dict with objects" case, you are making a copy of the object reference and storing that. In this case, you are adding the overhead of instantiating the object and initializing it. However, this case also does something that none of the other test cases do - it uses Application.WorksheetFunction.Index to slice the input array. This alone invalidates your benchmarks because none of the other subs do this.In your "populate dict without objects (store array)" case, you are actually cramming a significant amount of data into the
Dictionary. Not only that, but in this line...thisDict.Add key, arr 'the Item is the array slice...you are not only storing the entire array in the
Dictionary, you're making a copy of the entire array and storing that in the Dictionary. I would guess that a very large part of the poor performance of DictWithoutObjectsArr is due solely to this bug in the benchmarking code. That said, the missing call to Application.WorksheetFunction.Index(arr, i, 0) makes the performance of this method even worse - when I add the array slicing, it seems like the large memory copies are actually cheaper from a performance standpoint than an array slice and a much smaller memory copy. In your final case, storing the unique row identifier as both the key and the value, you are simply storing a
String. This case is sacrificing long term performance in exchange for not collecting the same information that the other cases do. The trade-off will come later in a finished product when it is time to get information back out. Obviously this isn't possible (since you only have a key-key combination), and you aren't benchmarking everything that is relevant to the overall performance. Assuming that you need to get the row information at some point, this benchmark should simply either be ignored or taken only as a proxy estimation of how long it takes to add a generic key value pair to a Dictionary. Otherwise, at very least the value should be the row - at least in that case you can get some useful information out of it later.In short, the above make the benchmarking results almolst entirely meaningless.
A couple other things:
- Your object shouldn't be exposing all of it's members as public class variables. These should be encapsulated into properties. As currently written, the class is little more than a heavy-weight
Typewith a initializer property (more on this below).
- The objects should be responsible for calculating their own keys - not the calling code.
- Your "constructor" seems really misplaced in the write-only property
Record. Better would be to simply make it aSub. Best would be to add the VB_PredeclaredId attribute and create an object factory. However, taking an array as a parameter seems suspect. If the class is intended to mirror what you would find on a row, aRangeparameter would probably be clearer and more performant than slicing a single dimension out of a 2D array. If you're going to pull your data range into an array (which makes sense), it's probably much faster to just set the individual properties. If you aren't pulling the dataset into an array, just pass it aRange. There's nothing that says you can only have one class factory. In fact, in a comparison benchmark, setting properties directly cut the object creation overhead by %75 over slicing and passing an array usingWorksheetFunction.Index.
- The line
duplicate = FalseinClass_Initializeraises red flags about the class design also. It's hard to tell what it is there for (due mainly to the stripped down version of the code submitted), but regardless, it probably shouldn't. An individual object shouldn't have any concept of being a "duplicate" outside of possibly anEquals(other as LaborRecord)utility function. If your class structure is designed properly, aLaborRecordshouldn't know or care about otherLaborRecords.
Code Snippets
thisDict.Add key, arr 'the Item is the array sliceContext
StackExchange Code Review Q#141182, answer score: 4
Revisions (0)
No revisions yet.