patternMinor
Help With Writing An Algorithm Formally
Viewed 0 times
formallywithwritinghelpalgorithm
Problem
First off - I'm not a Computer Scientist, I'm a Software Developer - so when it comes to presenting an idea in a formal manner to a Computer Scientist, I have no idea how to do so. As such, I'm wondering if someone would be good enough to show me how to write the algorithm/idea I've outlined below in some form of formal alogrithmic notation, please?
Part One:
Say I have a list of 'words' made up from characters of the English alphabet. Essentially, I want to split this list of 'words' up into twenty-six sub-lists, where each sub-list is associated with one letter of the alphabet - a, b, c, etc. Each 'word' should be moved to the sub-list associated with the character that the 'word' begins with - so 'apple' would go in the 'a' sub-list, 'banana' would go in the 'b' sublist, etc. BUT, I only want to divide my original list up into sub-lists provided that there are at least X 'words' in the list that begin with each letter of the alphabet (so if X was 2, there would need to be at least two words beginning with 'a', at least two words beginning with 'b', ..., at least two words beginning with 'z', etc.). In essence, it's either one list with all 'words' in it or 26 sub-lists with at least X 'words' in it.
Part Two:
Assuming I was able to split the list of 'words' up into sub-lists as described in Step One, I then want to further divide each sub-list based on the value of the second character in each 'word'. So there would be an 'aa' sublist, an 'ab' sublist, ..., a 'zz' sublist, etc. Again, I only want to do any further division of the sub-lists provided there are at least X 'words' that begin with every possible two-character combination of English alphabet letters - so at least two 'words' beginning with 'aa', two 'words' beginning with 'ab', ..., two words beginning with 'zz', etc. In essence, it's either 26 sub-lists or 676 sub-lists.
Part Three:
I want this dividing process to continue (character three, character four, etc.) until it is no longer possible to
Part One:
Say I have a list of 'words' made up from characters of the English alphabet. Essentially, I want to split this list of 'words' up into twenty-six sub-lists, where each sub-list is associated with one letter of the alphabet - a, b, c, etc. Each 'word' should be moved to the sub-list associated with the character that the 'word' begins with - so 'apple' would go in the 'a' sub-list, 'banana' would go in the 'b' sublist, etc. BUT, I only want to divide my original list up into sub-lists provided that there are at least X 'words' in the list that begin with each letter of the alphabet (so if X was 2, there would need to be at least two words beginning with 'a', at least two words beginning with 'b', ..., at least two words beginning with 'z', etc.). In essence, it's either one list with all 'words' in it or 26 sub-lists with at least X 'words' in it.
Part Two:
Assuming I was able to split the list of 'words' up into sub-lists as described in Step One, I then want to further divide each sub-list based on the value of the second character in each 'word'. So there would be an 'aa' sublist, an 'ab' sublist, ..., a 'zz' sublist, etc. Again, I only want to do any further division of the sub-lists provided there are at least X 'words' that begin with every possible two-character combination of English alphabet letters - so at least two 'words' beginning with 'aa', two 'words' beginning with 'ab', ..., two words beginning with 'zz', etc. In essence, it's either 26 sub-lists or 676 sub-lists.
Part Three:
I want this dividing process to continue (character three, character four, etc.) until it is no longer possible to
Solution
Computer scientists are almost people
I think your explanation is pretty good. There is not a single formal way to specify an algorithm, except perhaps for pseudocode but as a software developer you are likely already familiar with that. Apart from that, algorithms that don't rely on intricate math are often best explained through text, as you've done.
Perhaps it could do with a bit more formal language however. First, note that what you're describing is more accurately called a data structure than an algorithm. Indeed, you give no method for creating this structure, but rather give a description of what it looks like.
Furthermore, you could (though I argue this is not mandatory) describe the structure more mathematically if you want, defining the various objects used. For example, say we have some set of words $W$, each of which is a string made up of some set of characters $C = \{a, b, c, \dots\}$. We'll also define $W[s]$ to mean the set of words in $W$ that begin with string $s$. In our case these are English words and characters, but that does not matter for the general case. Now, we can define our data structure recursively as follows (we'll call the structure a "slick" for now, in reference to your username).
A slick $S_s$ for some set of words $W[s]$ denoted $S_s(W[s])$ is defined as a list $[S_{s+a}(W[s+a]), S_{s+b}(W[s + b]), \dots]$. A slick on an entire corpus $W$ is defined as $S_\epsilon(W[\epsilon])$ where $\epsilon$ denotes the empty string.
However, if one of the subsets $W[s+c]$ for $c \in C$ is not of at least some constant size $k$, $S_s$ is instead defined as $S_s(W[s]) = W[s]$.
Note here that we implicitly define every "subslick" $S_s$. That is, the part of the slick where every word already starts with some string $s$. We define the whole slick as a special case of the subslick, namely one where every words starts with nothing, which is true for every word. Finally, we denote the stopping condition that if not every prefix is frequent enough, we stop making slicks.
We might be able to make it shorter and more formal by introducing more notation, but this would in my opinion come at the cost of readability. For example, "if one of the subslicks is not of some constant size $k$, then define it this other way" can be turned into $(\exists_c \, [|W[s+c]| < k]) \rightarrow S_s(W[s]) = W[s]$, but I argue that this does not help most readers.
Of course, this all depends on the audience. If you're submitting a paper in computer science with as its primary purpose the establishment of an algorithm or data structure, I would expect an informal explanation as well as a formal one, with absolutely no ambiguity. I would also expect not just a description of the data structure, but an explanation of its usefulness, and mathematical arguments for its properties.
This is almost exactly a trie
As for the data structure itself, except for the stopping condition this is exactly a trie, a structure that's been around since this paper in 1959. To look for other ideas for formalization, you might want to look for other explanations of the trie on the web.
The difference between the trie (also called prefix tree for obvious reasons) and the slick is that the trie just keeps making tries until there are no more words with a certain prefix. The slick stops significantly earlier, which I'm not sure is a great idea. Indeed, if we're encoding a moderately sized set of English words, we might not find $k$ words that start with an x, and then the data structure collapses entirely. Instead, I'd recommend not collapsing $S_s$ when $W[s+c]$ is small, but rather collapsing $S_s$ only if $W[s]$ is small.
I think your explanation is pretty good. There is not a single formal way to specify an algorithm, except perhaps for pseudocode but as a software developer you are likely already familiar with that. Apart from that, algorithms that don't rely on intricate math are often best explained through text, as you've done.
Perhaps it could do with a bit more formal language however. First, note that what you're describing is more accurately called a data structure than an algorithm. Indeed, you give no method for creating this structure, but rather give a description of what it looks like.
Furthermore, you could (though I argue this is not mandatory) describe the structure more mathematically if you want, defining the various objects used. For example, say we have some set of words $W$, each of which is a string made up of some set of characters $C = \{a, b, c, \dots\}$. We'll also define $W[s]$ to mean the set of words in $W$ that begin with string $s$. In our case these are English words and characters, but that does not matter for the general case. Now, we can define our data structure recursively as follows (we'll call the structure a "slick" for now, in reference to your username).
A slick $S_s$ for some set of words $W[s]$ denoted $S_s(W[s])$ is defined as a list $[S_{s+a}(W[s+a]), S_{s+b}(W[s + b]), \dots]$. A slick on an entire corpus $W$ is defined as $S_\epsilon(W[\epsilon])$ where $\epsilon$ denotes the empty string.
However, if one of the subsets $W[s+c]$ for $c \in C$ is not of at least some constant size $k$, $S_s$ is instead defined as $S_s(W[s]) = W[s]$.
Note here that we implicitly define every "subslick" $S_s$. That is, the part of the slick where every word already starts with some string $s$. We define the whole slick as a special case of the subslick, namely one where every words starts with nothing, which is true for every word. Finally, we denote the stopping condition that if not every prefix is frequent enough, we stop making slicks.
We might be able to make it shorter and more formal by introducing more notation, but this would in my opinion come at the cost of readability. For example, "if one of the subslicks is not of some constant size $k$, then define it this other way" can be turned into $(\exists_c \, [|W[s+c]| < k]) \rightarrow S_s(W[s]) = W[s]$, but I argue that this does not help most readers.
Of course, this all depends on the audience. If you're submitting a paper in computer science with as its primary purpose the establishment of an algorithm or data structure, I would expect an informal explanation as well as a formal one, with absolutely no ambiguity. I would also expect not just a description of the data structure, but an explanation of its usefulness, and mathematical arguments for its properties.
This is almost exactly a trie
As for the data structure itself, except for the stopping condition this is exactly a trie, a structure that's been around since this paper in 1959. To look for other ideas for formalization, you might want to look for other explanations of the trie on the web.
The difference between the trie (also called prefix tree for obvious reasons) and the slick is that the trie just keeps making tries until there are no more words with a certain prefix. The slick stops significantly earlier, which I'm not sure is a great idea. Indeed, if we're encoding a moderately sized set of English words, we might not find $k$ words that start with an x, and then the data structure collapses entirely. Instead, I'd recommend not collapsing $S_s$ when $W[s+c]$ is small, but rather collapsing $S_s$ only if $W[s]$ is small.
Context
StackExchange Computer Science Q#131166, answer score: 4
Revisions (0)
No revisions yet.