
Pragmatic approach to k-means clusters with PostgreSQL

Submitted by: @import:stackexchange-dba

Problem

I am looking for a very simple, pragmatic approach to k-means clustering of questions using a PostgreSQL database alone.

While I am fully aware that this method may not yield meaningful results if my assumptions are off, it may be a good first attempt at categorizing my data.

Imagine a small online forum where users are free to ask brief questions about any topic they want without having to categorize them, and where other users need to be notified about a new question if it matches the topics of interest to which they have previously subscribed.

My plan is to first break down each incoming question into lexemes using to_tsvector, but in all honesty I am a bit lost about what to do afterwards.
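As a rough illustration of that first step, the sketch below turns a made-up sample question into lexemes. The `english` text search configuration and the sample text are assumptions, and `unnest` on a `tsvector` requires PostgreSQL 9.6 or later:

```sql
-- Normalize a sample question into a tsvector
-- (stop words are dropped and the remaining words are stemmed).
SELECT to_tsvector('english', 'How do I speed up a slow SQL query?');

-- Expand the same tsvector into one row per lexeme.
SELECT lexeme
FROM unnest(to_tsvector('english', 'How do I speed up a slow SQL query?'));
```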

Even assuming I have correctly identified the k categories to which a question may be matched, how would I go about deciding whether a question should fall into one or more categories?

Solution

You could use the rank returned by the full-text search (for example, from ts_rank) as a cut-off to decide whether the question matches a topic.
For each category, you could hold a document of relevant terms and search the question text against those terms.
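A minimal sketch of this idea follows. It assumes a hypothetical `category` table whose `terms` column stores each category's relevant terms as a `tsquery` of OR-ed words. The table, the sample terms, the sample question, and the 0.01 cut-off are all illustrative assumptions, not part of the original answer:

```sql
-- Hypothetical schema: one row per category, with its relevant terms
-- stored as an OR-ed tsquery.
CREATE TABLE category (
    id    serial PRIMARY KEY,
    name  text    NOT NULL,
    terms tsquery NOT NULL
);

INSERT INTO category (name, terms) VALUES
    ('databases', to_tsquery('english', 'postgresql | sql | index | query | table | database')),
    ('cooking',   to_tsquery('english', 'recipe | oven | bake | flour | sauce | ingredient'));

-- Rank an incoming question against every category and keep the
-- categories whose rank reaches the cut-off.
-- The 0.01 cut-off is an arbitrary placeholder that needs tuning on real data.
SELECT c.name,
       ts_rank(q.doc, c.terms) AS rank
FROM category AS c
CROSS JOIN (
    SELECT to_tsvector('english',
                       'How do I speed up a slow SQL query on a large table?') AS doc
) AS q
WHERE q.doc @@ c.terms                   -- at least one category term occurs in the question
  AND ts_rank(q.doc, c.terms) >= 0.01    -- rank cut-off
ORDER BY rank DESC;
```

Because the query returns every category at or above the cut-off, a question can land in more than one category, or in none if nothing ranks high enough.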

Btw, there's a nice extension called MADlib (it's more of a utility package) that contains many useful features and algorithms, including topic analysis and clustering.
Take a look at the MADlib documentation.

Context

StackExchange Database Administrators Q#155759, answer score: 2
