HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Counting different words in text using hashing

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
countingwordstextdifferentusinghashing

Problem

I am still fighting with hashing and I am ask myself: what is the most efficient way to count the number of different words in a text using a hash table?

My intuition says that applying the hashcode function to every word in the text, as result we will have words with different hash values in different buckets and the same words will have the same bucket and therefore we will have a collision problem which we can resolve using the chaining method.

Does it work like that?

Solution

If you really just want to count the number of distinct words in the document, you don't need to save each instance of the word to the hash table.

So, if you find a words that's already in the table, just don't add it there. This means you don't have to deal with chaining as often, which will speed things up.

But you still have to deal with collisions, because two different words can have the same hash code.

This way, the expected time complexity is $\mathcal{O}(n)$, where $n$ is the total count of characters in the text, assuming a good hash function.

Context

StackExchange Computer Science Q#1838, answer score: 6

Revisions (0)

No revisions yet.