patternMinor
Subsampling of Frequent Words in Word2Vec
Viewed 0 times
word2vecwordsfrequentsubsampling
Problem
I am reading through the following paper: https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
Under section 2.3 on page 4 the authors discuss subsampling of frequent words.
Now based on my understanding of subsampling of frequent words, the goal is to reduce the number of times we sample frequent words. But here is what the paper states:
"To counter the imbalance between the rare and frequent words, we use a simple subsampling approach: each word $w_i$ in the training set is discarded with probability computed by the formula
$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$
where $f(w_i)$ is the frequency of word $w_i$ and $t$ is a chosen threshold typically around $10^{-5}$."
Now, I am interpreting $f(w_i)$ as an integer value that essentially counts the number of times $w_i$ occurs in the corpus. But this seems to not make any sense because all words then discarded with probability $.99$ or greater. Should I instead be interpreting $f(w_i)$ as $\text{count}(w_i)/\text{total_count}$ where $\text{total_count}$ is the number of words in the corpus?
Under section 2.3 on page 4 the authors discuss subsampling of frequent words.
Now based on my understanding of subsampling of frequent words, the goal is to reduce the number of times we sample frequent words. But here is what the paper states:
"To counter the imbalance between the rare and frequent words, we use a simple subsampling approach: each word $w_i$ in the training set is discarded with probability computed by the formula
$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$
where $f(w_i)$ is the frequency of word $w_i$ and $t$ is a chosen threshold typically around $10^{-5}$."
Now, I am interpreting $f(w_i)$ as an integer value that essentially counts the number of times $w_i$ occurs in the corpus. But this seems to not make any sense because all words then discarded with probability $.99$ or greater. Should I instead be interpreting $f(w_i)$ as $\text{count}(w_i)/\text{total_count}$ where $\text{total_count}$ is the number of words in the corpus?
Solution
My guess is that we should regard $f(w_i)$ as a number between 0 and 1, i.e., the number of times word $w_i$ appears divided by the number of words. This makes it an estimate of the probability of occurrence of word $w_i$.
Then the formula seems to lead to reasonable results. If you have a very frequent word (appears as a $10^{-3}$ fraction of the words), we have $P(w_i) = 1 - \sqrt{10^{-2}} = 0.9$, so it's very likely to be discarded. If you have a moderately infrequent word (probability of occurrence $10^{-5}$), then we have $P(w_i) = 1 - \sqrt{1} = 0$, so it won't be discarded. If you have a word that is even rare than that, it won't be discarded either. So this discards very frequent words with high probability and keeps almost all words that are rare or moderately rare.
Then the formula seems to lead to reasonable results. If you have a very frequent word (appears as a $10^{-3}$ fraction of the words), we have $P(w_i) = 1 - \sqrt{10^{-2}} = 0.9$, so it's very likely to be discarded. If you have a moderately infrequent word (probability of occurrence $10^{-5}$), then we have $P(w_i) = 1 - \sqrt{1} = 0$, so it won't be discarded. If you have a word that is even rare than that, it won't be discarded either. So this discards very frequent words with high probability and keeps almost all words that are rare or moderately rare.
Context
StackExchange Computer Science Q#95266, answer score: 3
Revisions (0)
No revisions yet.