HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Have I invented a new data structure?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
newinventedstructuredatahave

Problem

EDIT: Seems like what this structure is nothing more than a 26-bit bitwise trie, as KWillets suggested in the comments! You can think of the words we insert the trie as 26-bit Bitsets where the i-th bit indicates whether the word contains the i-the letter in the alphabet.

As I was tackling an interview problem I came up with this tree-like data structure. Actually it IS a binary tree that is maximum 26(English alphabet size) nodes deep.
Basically the problem is:


There are a set of dictionary words and a set of license plate numbers. Write a code/algorithm to find the shortest dictionary word which contains all the letters in the license plate, irrespective of the order of characters

This data structure is specifically designed for the problem of finding words from a given list containing all letters from a set of letters.

I call this operation getSupersetWords().

It can also be used for finding words containing ONLY a set of letters.(In fact my empirical tests show that this is the use case giving the best performance gain over linear search (over 16 times faster on average for a list of 73k English words)).

I call this operation getSubsetWords().

To give you an example:

Lets say we have list of words = [ “cat” ,”dog” “cow”, “case”].

On query getSupersetWords([‘a’, ‘c’]) we get [“cat”, “case”].

On query getSubsetWords(['c', 'o', 'w', 'a', 't']) we get ["cat", "cow"].

How to add a string to this data structure ?

As I said earlier the data structure is a binary tree.
At every level of the tree we make a decision. If the word contains the next letter of the alphabet(we start from 'a' at level 1) we go right, else we go left.

If we have made all the 'right' possible decisions for the given word we place it in the current node.(A node contains a reference list of strings).

How to answer a query from this data structure ?

Now to answer a getSubsetWords() query we traverse the tree recursively in DFS making both decisions if the query cont

Solution

I've never seen this data structure before. However, it doesn't seem like a good choice for storing a set of words, for most purposes. I see three significant disadvantages:

-
Performance. Looking up a word in this data structure can potentially be quite slow. In particular, when looking up a word of length $n$, checking whether it contains a particular letter involves looking at all $n$ characters of the word. Therefore, traversing the entire tree (all 26 levels) might require reading the entire word 26 times. In contrast, a trie reads each character of the word only once, so might be about 26x times faster. Same for a hashtable.

-
Space usage. This might take a lot of space. I am assuming you won't construct a node unless it is needed (i.e., if it has only one child, you'll omit the node). With that optimization, the depth of the tree is at most 26 if you're using words with letters a-z. If you are storing strings in Unicode, the depth of the tree might be enormous, and in the worst case this might devolve to a data structure where the depth of the tree is proportional to the number of words in the set, and then the space usage is bad and the time for a lookup is very bad (linear in the size of the set). Without that optimization, the number of the nodes in the tree can be as large as $2^{26}$ for letters a-z, or enormous for Unicode strings.

-
Incomplete. This isn't a complete data structure. It cannot distinguish between two words with the same set of letters, e.g., between words that are permutations of each other (anagrams). Therefore, you'll need another data structure for storing the second-level sets (your circle shapes). If we had a good data structure for that purpose, why wouldn't we use it for all the words, and skip your data structure? Conversely, if you use a poor data structure for that, then your data structure inherits the limitations of that data structure: for instance, if you store all the words in the circle in a linked list, then lookups will be slow as they might need to traverse the entire list.

Context

StackExchange Computer Science Q#75596, answer score: 7

Revisions (0)

No revisions yet.