HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Subsampling of Frequent Words in Word2Vec

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
word2vecwordsfrequentsubsampling

Problem

I am reading through the following paper: https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

Under section 2.3 on page 4 the authors discuss subsampling of frequent words.

Now based on my understanding of subsampling of frequent words, the goal is to reduce the number of times we sample frequent words. But here is what the paper states:

"To counter the imbalance between the rare and frequent words, we use a simple subsampling approach: each word $w_i$ in the training set is discarded with probability computed by the formula

$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

where $f(w_i)$ is the frequency of word $w_i$ and $t$ is a chosen threshold typically around $10^{-5}$."

Now, I am interpreting $f(w_i)$ as an integer value that essentially counts the number of times $w_i$ occurs in the corpus. But this seems to not make any sense because all words then discarded with probability $.99$ or greater. Should I instead be interpreting $f(w_i)$ as $\text{count}(w_i)/\text{total_count}$ where $\text{total_count}$ is the number of words in the corpus?

Solution

My guess is that we should regard $f(w_i)$ as a number between 0 and 1, i.e., the number of times word $w_i$ appears divided by the number of words. This makes it an estimate of the probability of occurrence of word $w_i$.

Then the formula seems to lead to reasonable results. If you have a very frequent word (appears as a $10^{-3}$ fraction of the words), we have $P(w_i) = 1 - \sqrt{10^{-2}} = 0.9$, so it's very likely to be discarded. If you have a moderately infrequent word (probability of occurrence $10^{-5}$), then we have $P(w_i) = 1 - \sqrt{1} = 0$, so it won't be discarded. If you have a word that is even rare than that, it won't be discarded either. So this discards very frequent words with high probability and keeps almost all words that are rare or moderately rare.

Context

StackExchange Computer Science Q#95266, answer score: 3

Revisions (0)

No revisions yet.