patternMinor
Hashing algorithm for millions of variable length strings (URLs)
Viewed 0 times
millionslengthstringsalgorithmforhashingvariableurls
Problem
The goal is to distribute approximately 100 million variable length strings, average length 100 characters, uniformly among 100 million buckets. Perfection not required, just no egregious clumping. The strings are URLs. They tend to begin much the same way and end much the same way (e.g. h t t p:// or w w w. and ".c o m", ".e d u", "a s p x" and so forth) and show their greatest variation in the latter half of the string, except for the final few characters.
What's a good algorithm that would accept the string and the number of slots as inputs, and return a number between 0 and SlotCount-1, and satisfy the uniform-distribution requirement?
What's a good algorithm that would accept the string and the number of slots as inputs, and return a number between 0 and SlotCount-1, and satisfy the uniform-distribution requirement?
Solution
Any string hashing algorithm designed for ASCII text should work fine. Take a look at the Wikipedia article, consider if the set could be manipulated by an adversary to create pile-ups (algoritmic vulnerabilities). If that isn't an issue, at strchr and partow several "classical" hash algorithms are given and compared. Get a subset of your data and use the parameters suggested there for your own comparison. And please tell us the results.
Context
StackExchange Computer Science Q#10554, answer score: 2
Revisions (0)
No revisions yet.