patternMinor

Semantic similarity in text

Submitted by: @import:stackexchange-cs
Tags: similarity, semantic, text

Problem

Is there a relatively simple way of telling if two pieces of text are semantically similar?

Some assumptions that are valid:

  • It is all English
  • I have a list of all the important nouns

Are there any strategies that I should pursue? I am looking for something relatively cheap computationally, though an approach that can trade extra computation for better accuracy would be a bonus.

Note:

Assume that there are not enough posts for some type of probabilistic analysis, but some type of neural network (NN) might be feasible (I think; I just don't know enough about it).

Solution

Here's a simple technique.

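Train an LDA (latent Dirichlet allocation) topic model over your collection of texts, using something like MALLET. For each pair of documents you want to compare, obtain their topic distributions and compute the Hellinger distance between them; the smaller the distance, the more similar the documents.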
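
As a rough illustration of the comparison step, here is a minimal sketch in plain Python. It assumes you have already exported each document's topic distribution from your LDA tool as a list of probabilities that sums to 1. The function name `hellinger` and the example vectors are made up for illustration and are not part of MALLET. For discrete distributions P and Q, the Hellinger distance is sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2) / sqrt(2), which lies between 0 (identical) and 1 (no overlap).

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    p and q are equal-length sequences of non-negative numbers that each
    sum to 1 (for example, per-document topic proportions from LDA).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


# Hypothetical topic distributions over three topics (illustrative values only).
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.05, 0.05, 0.9]

# doc_a should come out closer to doc_b than to doc_c.
closer_to_b = hellinger(doc_a, doc_b) < hellinger(doc_a, doc_c)
```

If you need a yes/no answer rather than a score, you would pick a distance threshold by inspecting a few pairs you already know to be similar or dissimilar.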

Things you can tweak include the term weighting, the LDA hyperparameters, and the metric for comparing distributions. Term weighting would obviate both the need for a list of important words and the restriction to English.
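
To show what changing the metric might look like, here is a sketch of one possible alternative, the Jensen-Shannon distance. This metric is my choice for illustration and is not named in the answer above. The function names are hypothetical, and the inputs are again assumed to be topic distributions that sum to 1. With base-2 logarithms the result lies between 0 and 1.

```python
import math


def kl_divergence(p, m):
    """Kullback-Leibler divergence KL(p || m) in bits.

    Terms where p is zero contribute nothing and are skipped.
    """
    return sum(a * math.log2(a / b) for a, b in zip(p, m) if a > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence between p and q."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    # The mixture m is positive wherever p or q is, so the KL terms are finite.
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(jsd, 0.0))
```

Both this and the Hellinger distance are symmetric and bounded, so either can be dropped into the same pairwise comparison loop; which one ranks your documents better is something to check empirically.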

Context

StackExchange Computer Science Q#2955, answer score: 4
