Semantic similarity in text
similarity, semantic, text
Problem
Is there a relatively simple way of telling if two pieces of text are semantically similar?
Some assumptions that are valid:
- It is all English
- I have a list of all the important nouns
Are there any strategies that I should pursue? Looking for something that is relatively computationally cheap, though something that could be scaled to improve accuracy at the expense of computational power would be a bonus.
Note:
Assume that there are not enough posts for some type of probabilistic analysis, but some type of neural network (NN) might be feasible (I think; I just don't know enough about it).
Solution
Here's a simple technique.
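Train an LDA (latent Dirichlet allocation) topic model over your collection of texts, using something like MALLET. For each pair of documents you want to compare, obtain their topic distributions and compute the Hellinger distance between them; the smaller the distance, the more similar the documents.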
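As a rough illustration of the comparison step, here is a minimal sketch of the Hellinger distance in plain Python. It assumes each document has already been reduced to a topic distribution (a list of non-negative probabilities summing to 1, one entry per topic). The three distributions below are made up for the example and are not the output of any real model.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length.

    Returns a value in [0, 1]: 0 for identical distributions,
    1 for distributions with no overlapping mass.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same topics")
    squared_diff = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared_diff / 2.0)


# Hypothetical topic distributions over four topics (illustrative values only).
doc_a = [0.70, 0.20, 0.05, 0.05]
doc_b = [0.65, 0.25, 0.05, 0.05]
doc_c = [0.05, 0.05, 0.20, 0.70]

print("a vs b:", hellinger(doc_a, doc_b))  # near-identical topic mix
print("a vs c:", hellinger(doc_a, doc_c))  # very different topic mix
```

Because the distance is bounded between 0 and 1, a fixed threshold for "similar enough" can be chosen by inspecting a handful of pairs you already know to be related or unrelated.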
Things you can tweak include term weighting, the LDA hyperparameters, and the metric for comparing distributions. Term weighting would obviate both the need for a list of important words and the restriction to English.
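To make the term-weighting idea concrete, here is a small sketch of one common scheme, TF-IDF, in plain Python. The answer does not say which weighting to use, so TF-IDF is an assumption, as are the toy documents and the whitespace tokenization. How the weights are fed into LDA (for example, by dropping low-weight terms before training) is likewise left open here.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Weight = (term frequency in the document) * log(N / document frequency),
    so terms that occur in every document get weight 0.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (illustrative only), tokenized by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the stock market fell sharply".split(),
]
weights = tfidf(docs)

# Show the terms of the first document, highest weight first.
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

Nothing in this computation depends on a hand-built word list or on the language of the text: ubiquitous words are down-weighted purely from their document frequency.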
Context
StackExchange Computer Science Q#2955, answer score: 4