patternsqlMinor
Return top 100 words from strings
Viewed 0 times
topreturnwords100fromstrings
Problem
Not sure this is even possible but I was wondering if there is any way of using PostgreSQL to return the top 100 most occurring words from a database column containing strings of varying lengths?
The table contains ~ 250k rows, with each row containing a
If it is easier, the metadata can be dropped to leave only the column containing the Tweets.
Ideally, I would like to be able to perform what this website does to give the same output, only with many more records.
The table contains ~ 250k rows, with each row containing a
message among much other metadata. So, is there a way of sifting through every word in each row, taking the total number of times it appears and comparing it to a count of all the other words in the same column to return the top 100?If it is easier, the metadata can be dropped to leave only the column containing the Tweets.
Ideally, I would like to be able to perform what this website does to give the same output, only with many more records.
Solution
Assuming that multiple "words" are separated by a space character.
It's not clear whether you want a compare total counts across the whole table or counts per row. for the whole table:
Or, faster, but less clean, with the set-returning function in the
For natural language texts you could do a lot more: Remove punctuation characters, compare word stems etc. Full Text Search provides various tools. The debugging function
The query returns the most common words in the column across the whole table - just like the first query, but now punctuation and stop words are trimmed and words are reduced to their English word stem.
To find the words contained in the most rows, use instead:
One would think it should be possible to employ a functional GIN index on
It's not clear whether you want a compare total counts across the whole table or counts per row. for the whole table:
SELECT word, count(*) AS ct
FROM tbl, unnest(string_to_array(message, ' ')) word -- implicit LATERAL join
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100; -- top 100Or, faster, but less clean, with the set-returning function in the
SELECT list:SELECT unnest(string_to_array(message, ' ')) AS word, count(*) AS ct
FROM tbl
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100;For natural language texts you could do a lot more: Remove punctuation characters, compare word stems etc. Full Text Search provides various tools. The debugging function
ts_stat() seems particularly handy for this. For an English text:SELECT word, ndoc, nentry
FROM ts_stat($SELECT to_tsvector('english', message) FROM tbl$)
ORDER BY nentry DESC
LIMIT 100;ndoc .. the number of rowsnentry .. the number of occurrences (can be bigger with multiple instances in a single text.The query returns the most common words in the column across the whole table - just like the first query, but now punctuation and stop words are trimmed and words are reduced to their English word stem.
To find the words contained in the most rows, use instead:
ORDER BY ndoc DESCOne would think it should be possible to employ a functional GIN index on
to_tsvector('english', message) to make this very fast. But I did not find a way in a quick test.Code Snippets
SELECT word, count(*) AS ct
FROM tbl, unnest(string_to_array(message, ' ')) word -- implicit LATERAL join
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100; -- top 100SELECT unnest(string_to_array(message, ' ')) AS word, count(*) AS ct
FROM tbl
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100;SELECT word, ndoc, nentry
FROM ts_stat($$SELECT to_tsvector('english', message) FROM tbl$$)
ORDER BY nentry DESC
LIMIT 100;ORDER BY ndoc DESCContext
StackExchange Database Administrators Q#127540, answer score: 7
Revisions (0)
No revisions yet.