Pragmatic approach to k-means clusters with PostgreSQL
Problem
I am looking for a super simple and very pragmatic approach to k-means clustering of questions using a PostgreSQL database alone.
While I am fully aware that this method may not yield meaningful results if my assumptions are off, it may be a good first attempt at categorizing my data.
Imagine a small online forum, where users are free to ask brief questions about ANY topic they want but without the need to categorize them, and other users need to be notified about a new question if it matches the topics of interest to which they have previously subscribed.
My plan is to first break down each incoming question into lexemes using to_tsvector, but in all honesty I am a bit lost about what to do afterwards.
Even assuming I have correctly identified the k categories to which the question may be matched, how would I go about deciding whether a question should fall into one (or more) categories?
Solution
You could use the rank returned by the text search (for example, the value of ts_rank) as a cut-off to decide whether the question matches a topic.
For each category you could hold a document with relevant terms and search the query text against these terms.
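A minimal sketch of the cut-off idea, under stated assumptions: the `topic` table, its columns, the sample terms, and the `0.01` threshold are all hypothetical and not from the original answer. It stores each category's terms as an OR-list in `to_tsquery` syntax and ranks the question's `tsvector` against it with `ts_rank`, which is one possible reading of "search the query text against these terms".

```sql
-- Hypothetical schema: one row per topic with a hand-curated OR-list of terms.
CREATE TEMP TABLE topic (
    topic_id int  PRIMARY KEY,
    name     text NOT NULL,
    terms    text NOT NULL   -- to_tsquery syntax, terms separated by |
);

INSERT INTO topic (topic_id, name, terms) VALUES
    (1, 'databases', 'index | vacuum | query | replication'),
    (2, 'cooking',   'recipe | oven | bake | flour');

-- Rank one incoming question against every topic and keep those above a cut-off.
WITH question AS (
    SELECT to_tsvector('english',
                       'How do I rebuild a bloated index after a vacuum?') AS doc
)
SELECT t.name,
       ts_rank(q.doc, to_tsquery('english', t.terms)) AS score
FROM   topic AS t
CROSS JOIN question AS q
WHERE  q.doc @@ to_tsquery('english', t.terms)                  -- at least one term matches
  AND  ts_rank(q.doc, to_tsquery('english', t.terms)) >= 0.01   -- placeholder cut-off, tune on real data
ORDER BY score DESC;
```

Because every topic is scored independently, a question can land in zero, one, or several categories. The threshold would need tuning against real questions, since the scale of `ts_rank` values depends on the query and the document.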
Btw, there's a nice extension called MADlib (it's more of a utility package) - it contains many useful features/algorithms, including topic analysis & clustering.
Take a look at the MADlib documentation.
Context
StackExchange Database Administrators Q#155759, answer score: 2