patterncsharpMinor
Generating histograms from "chunked" large data streams
Viewed 0 times
chunkedhistogramsgeneratinglargestreamsfromdata
Problem
I've been working in some tools lately that analize long streams of data and extract frequent patterns, histograms, etc. for personal use. Lately I've been coding a histogram generator; that is, given a long chunked data stream, generate a histogram with all distinct chunks; typical case, given a text, generate a word histogram.
The algorithm and code I've come up with is the following:
```
public static class HistogramGenerator
{
public static IEnumerable> GetHistogram(IEnumerable> stream)
=> BuildHistogram(BuildTree(stream));
private static Tree BuildTree(IEnumerable> stream)
{
var tree = new Tree();
foreach (var chunk in stream)
{
var currentTree = tree;
foreach (var item in chunk)
{
currentTree = currentTree.Add(item);
}
currentTree.Count++;
}
return tree;
}
private static IEnumerable> BuildHistogram(Tree tree) =>
getAllCurrentTreeReversedPaths(tree).Select(histogram => new HistogramItem(histogram.Count, histogram.Path.Reverse()))
.OrderByDescending(h => h.Count);
private static IEnumerable> getAllCurrentTreeReversedPaths(Tree tree)
{
foreach (var branch in tree.Branches)
{
if (branch.Value.Count > 0)
{
//Make sure we return a valid path included in a longer one. For example: "the" inside "therefore"
yield return new HistogramItem(branch.Value.Count, new[] { branch.Key });
}
foreach (var subBranch in getAllCurrentTreeReversedPaths(branch.Value))
{
yield return new HistogramItem(subBranch.Count, subBranch.Path.Concat(new[] { branch.Key }));
}
}
}
//TODO: Check if this is should be kept as a struct.
public struct HistogramItem
{
public int Count { get; }
public IEnumerable Path { g
The algorithm and code I've come up with is the following:
```
public static class HistogramGenerator
{
public static IEnumerable> GetHistogram(IEnumerable> stream)
=> BuildHistogram(BuildTree(stream));
private static Tree BuildTree(IEnumerable> stream)
{
var tree = new Tree();
foreach (var chunk in stream)
{
var currentTree = tree;
foreach (var item in chunk)
{
currentTree = currentTree.Add(item);
}
currentTree.Count++;
}
return tree;
}
private static IEnumerable> BuildHistogram(Tree tree) =>
getAllCurrentTreeReversedPaths(tree).Select(histogram => new HistogramItem(histogram.Count, histogram.Path.Reverse()))
.OrderByDescending(h => h.Count);
private static IEnumerable> getAllCurrentTreeReversedPaths(Tree tree)
{
foreach (var branch in tree.Branches)
{
if (branch.Value.Count > 0)
{
//Make sure we return a valid path included in a longer one. For example: "the" inside "therefore"
yield return new HistogramItem(branch.Value.Count, new[] { branch.Key });
}
foreach (var subBranch in getAllCurrentTreeReversedPaths(branch.Value))
{
yield return new HistogramItem(subBranch.Count, subBranch.Path.Concat(new[] { branch.Key }));
}
}
}
//TODO: Check if this is should be kept as a struct.
public struct HistogramItem
{
public int Count { get; }
public IEnumerable Path { g
Solution
var currentTree = tree;
foreach (var item in chunk)
{
currentTree = currentTree.Add(item);
}
currentTree.Count++;That function shouldn't need to know how to traverse a tree. That functionality belongs to
Tree, not to the invoking site.A simple
tree.Add(chunk); is much cleaner in this context. The complexity of building and traversing the tree should be hidden entirely hidden inside the Tree class:public void Add(IEnumerable path)
{
var node = this;
foreach (var item in path)
{
node = node.GetChild(item);
}
node.Count++;
}There is no longer a need to expose the original
Tree::Add publicly, as the only legit caller is now inside the same class:private Tree GetChild(T value)
{
Tree node;
if (!Branches.TryGetValue(value, out node))
{
node = new Tree();
Branches.Add(value, node);
}
return node;
}The second occurrence where the functionality should have been encapsulated inside
Tree is getAllCurrentTreeReversedPaths. This function requires specific knowledge about how the storage is implemented, so it's inseparable from it.This also raises concerns regarding the visbility of
Tree::Count and Tree::Branches. These properties don't need to be exposed, if the methods belonging to Tree are actually moved into that class.For the invoking side, only the capabilities of storing, counting, and reciting stored paths are of interest. Nothing else, especially not how the tree is implemented. Not even the fact that it is implemented as a tree at all.
foreach (var subBranch in getAllCurrentTreeReversedPaths(branch.Value))
{
yield return new HistogramItem(
subBranch.Count,
subBranch.Path.Concat(new[] { branch.Key })
);
}This doesn't look right. You are essentially recreating and discarding
HistogramItem instances over and over again, while you are technically only appending to the Path property.HistogramItem::Path should either be mutable, or simply add the current path suffix as a parameter to getAllCurrentTreeReversedPaths, so you can create HistogramItem instances with the correct properties right away.Code Snippets
var currentTree = tree;
foreach (var item in chunk)
{
currentTree = currentTree.Add(item);
}
currentTree.Count++;public void Add(IEnumerable<T> path)
{
var node = this;
foreach (var item in path)
{
node = node.GetChild(item);
}
node.Count++;
}private Tree<T> GetChild(T value)
{
Tree<T> node;
if (!Branches.TryGetValue(value, out node))
{
node = new Tree<T>();
Branches.Add(value, node);
}
return node;
}foreach (var subBranch in getAllCurrentTreeReversedPaths(branch.Value))
{
yield return new HistogramItem<T>(
subBranch.Count,
subBranch.Path.Concat(new[] { branch.Key })
);
}Context
StackExchange Code Review Q#139950, answer score: 2
Revisions (0)
No revisions yet.