HiveBrain v1.2.0
Get Started
← Back to all entries
patterncsharpMinor

Generating histograms from "chunked" large data streams

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
chunkedhistogramsgeneratinglargestreamsfromdata

Problem

I've been working in some tools lately that analize long streams of data and extract frequent patterns, histograms, etc. for personal use. Lately I've been coding a histogram generator; that is, given a long chunked data stream, generate a histogram with all distinct chunks; typical case, given a text, generate a word histogram.

The algorithm and code I've come up with is the following:

```
public static class HistogramGenerator
{
public static IEnumerable> GetHistogram(IEnumerable> stream)
=> BuildHistogram(BuildTree(stream));

private static Tree BuildTree(IEnumerable> stream)
{
var tree = new Tree();

foreach (var chunk in stream)
{
var currentTree = tree;

foreach (var item in chunk)
{
currentTree = currentTree.Add(item);
}

currentTree.Count++;
}

return tree;
}

private static IEnumerable> BuildHistogram(Tree tree) =>
getAllCurrentTreeReversedPaths(tree).Select(histogram => new HistogramItem(histogram.Count, histogram.Path.Reverse()))
.OrderByDescending(h => h.Count);

private static IEnumerable> getAllCurrentTreeReversedPaths(Tree tree)
{
foreach (var branch in tree.Branches)
{
if (branch.Value.Count > 0)
{
//Make sure we return a valid path included in a longer one. For example: "the" inside "therefore"
yield return new HistogramItem(branch.Value.Count, new[] { branch.Key });
}

foreach (var subBranch in getAllCurrentTreeReversedPaths(branch.Value))
{
yield return new HistogramItem(subBranch.Count, subBranch.Path.Concat(new[] { branch.Key }));
}
}
}

//TODO: Check if this is should be kept as a struct.
public struct HistogramItem
{
public int Count { get; }
public IEnumerable Path { g

Solution

var currentTree = tree;
foreach (var item in chunk)
{
     currentTree = currentTree.Add(item);
}
currentTree.Count++;


That function shouldn't need to know how to traverse a tree. That functionality belongs to Tree, not to the invoking site.

A simple tree.Add(chunk); is much cleaner in this context. The complexity of building and traversing the tree should be hidden entirely hidden inside the Tree class:

public void Add(IEnumerable path)
{
    var node = this;
    foreach (var item in path)
    {
        node = node.GetChild(item);
    }
    node.Count++;
}


There is no longer a need to expose the original Tree::Add publicly, as the only legit caller is now inside the same class:

private Tree GetChild(T value)
{
    Tree node;

    if (!Branches.TryGetValue(value, out node))
    {
        node = new Tree();
        Branches.Add(value, node);
    }
    return node;
}


The second occurrence where the functionality should have been encapsulated inside Tree is getAllCurrentTreeReversedPaths. This function requires specific knowledge about how the storage is implemented, so it's inseparable from it.

This also raises concerns regarding the visbility of Tree::Count and Tree::Branches. These properties don't need to be exposed, if the methods belonging to Tree are actually moved into that class.

For the invoking side, only the capabilities of storing, counting, and reciting stored paths are of interest. Nothing else, especially not how the tree is implemented. Not even the fact that it is implemented as a tree at all.

foreach (var subBranch in getAllCurrentTreeReversedPaths(branch.Value))
{
    yield return new HistogramItem(
      subBranch.Count,
      subBranch.Path.Concat(new[] { branch.Key })
    );
}


This doesn't look right. You are essentially recreating and discarding HistogramItem instances over and over again, while you are technically only appending to the Path property.

HistogramItem::Path should either be mutable, or simply add the current path suffix as a parameter to getAllCurrentTreeReversedPaths, so you can create HistogramItem instances with the correct properties right away.

Code Snippets

var currentTree = tree;
foreach (var item in chunk)
{
     currentTree = currentTree.Add(item);
}
currentTree.Count++;
public void Add(IEnumerable<T> path)
{
    var node = this;
    foreach (var item in path)
    {
        node = node.GetChild(item);
    }
    node.Count++;
}
private Tree<T> GetChild(T value)
{
    Tree<T> node;

    if (!Branches.TryGetValue(value, out node))
    {
        node = new Tree<T>();
        Branches.Add(value, node);
    }
    return node;
}
foreach (var subBranch in getAllCurrentTreeReversedPaths(branch.Value))
{
    yield return new HistogramItem<T>(
      subBranch.Count,
      subBranch.Path.Concat(new[] { branch.Key })
    );
}

Context

StackExchange Code Review Q#139950, answer score: 2

Revisions (0)

No revisions yet.