debugModerate
Algorithm to convert a fixed-length string to the smallest possible collision-free representation?
Viewed 0 times
smallestthecollisionfreeconvertlengthalgorithmpossiblefixedrepresentation
Problem
I have a US-based telephone number (in the format 000-000-0000) and need to convert it to a "shorter" representation. Using base32/64 produces too long of a string. I've discovered CRC-16, which produces a string of length 4, however, I'm not sure it's a suitable tool for the job (I want to prevent collisions).
Basically, if the input is 000-000-0000 (without the dashes), I want the output to be something like 4xg4. In this use case, the advantage is that I have 10 digits, thus I could "convert" them to something smaller that uses letters as well.
Are you aware of any algorithm implementation for this?
Basically, if the input is 000-000-0000 (without the dashes), I want the output to be something like 4xg4. In this use case, the advantage is that I have 10 digits, thus I could "convert" them to something smaller that uses letters as well.
Are you aware of any algorithm implementation for this?
Solution
As ratchet freak says, you have ten decimal digits, which should give $10^{10}$ possible values. But in practice, there are a few more restrictions. The format of a North American telephone number looks something like this:
This gives 8910 81010 101010*10 values. In addition, the fifth and sixth characters can't both be 1, which removes 1% of these. This means there are 5,702,400,000 possible telephone numbers. (Let's call this number $T$.)
If you want to encode these using an alphabet with $b$ different characters, you'll need a code $\lceil \log_b T \rceil$ characters long. So if you want a code four characters long, you'll need $b \geq 275$. If you want a code five characters long, you'll need $b \geq 90$. And if you want a code six characters long, you'll need $b \geq 43$.
Unfortunately, this is the best you can do, unless you have additional information about what numbers you'll be getting. This comes down to the pigeonhole principle, a remarkably straightforward theorem that's quite useful in CS. It's mathematically impossible to use fewer characters than this without ever having any collisions. You can make collisions extremely unlikely, but never fully prevent them.
(You can also make some of the codes be shorter without collisions, but as a consequence some of them will have to be longer. Again, pigeonhole principle.)
P.S. Note that, while base64 also produces a six-character string (since 64 is between 43 and 90), you can take advantage of the fact that you only need 43 to choose a "nicer" character set. For example, you can get rid of
[2-9][0-8]\d - [2-9]\d\d - \d\d\d\dThis gives 8910 81010 101010*10 values. In addition, the fifth and sixth characters can't both be 1, which removes 1% of these. This means there are 5,702,400,000 possible telephone numbers. (Let's call this number $T$.)
If you want to encode these using an alphabet with $b$ different characters, you'll need a code $\lceil \log_b T \rceil$ characters long. So if you want a code four characters long, you'll need $b \geq 275$. If you want a code five characters long, you'll need $b \geq 90$. And if you want a code six characters long, you'll need $b \geq 43$.
Unfortunately, this is the best you can do, unless you have additional information about what numbers you'll be getting. This comes down to the pigeonhole principle, a remarkably straightforward theorem that's quite useful in CS. It's mathematically impossible to use fewer characters than this without ever having any collisions. You can make collisions extremely unlikely, but never fully prevent them.
(You can also make some of the codes be shorter without collisions, but as a consequence some of them will have to be longer. Again, pigeonhole principle.)
P.S. Note that, while base64 also produces a six-character string (since 64 is between 43 and 90), you can take advantage of the fact that you only need 43 to choose a "nicer" character set. For example, you can get rid of
o, O, 0, Q, I, l, and 1 to cut down on confusion in writing. My suggestion would be 23456789ABCDEFGHJKLMNPRSTVXYZabdeghknpqrt-=.Code Snippets
[2-9][0-8]\d - [2-9]\d\d - \d\d\d\dContext
StackExchange Computer Science Q#105488, answer score: 15
Revisions (0)
No revisions yet.