HiveBrain v1.2.0

How to compute the TF-IDF scores for a handful of documents, but without my own corpus?

Submitted by: @import:stackexchange-cs

Problem

I have a small number of documents (I could probably get 4 or 5 documents) and want to assign the terms in these documents a score based on their importance to the document. I want to find which terms are simultaneously important to all of these documents.

These documents have different authors but are written about the same topic, and I want to filter out keywords that are artifacts of the authors' writing styles.

TF-IDF seems like a well-established scoring function. However, the documents I have do not belong to a large corpus, which prevents me from computing the IDF values.

Q: How can I compute the TF-IDF score of terms within a handful of documents, but without my own corpus?

I'm open to alternative suggestions that do not involve TF-IDF.

Solution

What do you value in a ranking function? If you don't value TF or IDF, then it'll be difficult for you to achieve what you want to achieve.

Without a corpus, however, you could use a word frequency chart as a substitute for TF:

http://www.wordfrequency.info/free.asp?s=y

I had trouble finding a similar list for IDF online, and finding a good substitute for IDF is difficult. However, you could simply drop the IDF term from the TF-IDF equation and rank by TF alone.
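A minimal sketch of the TF-only fallback described above, in plain Python; the whitespace tokenizer and length normalization are my own assumptions, not part of the original answer:

```python
from collections import Counter

def tf_scores(text):
    # Naive whitespace tokenizer -- a real pipeline would also strip
    # punctuation and stopwords.
    tokens = text.lower().split()
    total = len(tokens)
    # Normalized term frequency: count / document length (IDF term dropped).
    return {term: count / total for term, count in Counter(tokens).items()}

doc = "cats chase mice and cats chase birds"
scores = tf_scores(doc)
# "cats" and "chase" each appear 2 of 7 times, so they rank highest
```

A published frequency list like the one linked above could then be used to down-weight terms that are common in general English, which approximates what IDF would have done.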

Your Situation

4-5 documents isn't much, but it's still a corpus. It will likely have more error than a larger corpus, but I see no reason why you can't use it. Also, why can't you use IDF? If a term appears in at least one document, its IDF can be computed; if it appears in none, it shouldn't be in your controlled vocabulary anyway. If you're worried about error, you could simply discard the IDF term at that point.

If you just use your documents as the corpus, you can take advantage of their context, which seems to be what you want here. Basically, many words will be more or less common in your subset than in English in general, due to the specific domain and small sample size.
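A sketch of this approach, treating the handful of documents as the whole corpus; the tokenization and the standard tf * log(N/df) weighting are my assumptions:

```python
import math
from collections import Counter

def tfidf(docs):
    # docs: list of token lists; the handful of documents *is* the corpus.
    n = len(docs)
    df = Counter()            # document frequency: how many docs contain t
    for doc in docs:
        df.update(set(doc))
    results = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Standard weighting: tf * log(N / df). A term appearing in
        # every document gets idf = log(1) = 0, i.e. it is discarded.
        results.append({t: (c / total) * math.log(n / df[t])
                        for t, c in counts.items()})
    return results

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "a bird sat in a tree".split(),
]
scores = tfidf(docs)
# "sat" appears in all three documents, so its score is 0 in each
```

One caveat: with plain log(N/df), terms present in every document score exactly zero, which is the opposite of what the asker wants ("important to all documents"). A smoothed variant such as log(1 + N/df) keeps those terms while still down-weighting them.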

Context

StackExchange Computer Science Q#43975, answer score: 2
