patternsqlMinor
Finding the most commonly used non-stop words in a column
Viewed 0 times
thenoncolumnusedwordsstopfindingmostcommonly
Problem
Simply put, I have a Postgres column containing ordinary prose and would like to determine the
I've found two solutions that nearly hit the mark:
This is great, except it returns word stems.
This one returns the full words, but includes stop words.
For the sake of simplicity: stop words are supposed to be found on
Can anyone help me over the line?
x most commonly used words ("word" being a group of characters delimited by a space, but not being stop word) across all rows.I've found two solutions that nearly hit the mark:
SELECT *
FROM ts_stat($SELECT to_tsvector('english', title) FROM item$)
ORDER BY ndoc DESC
LIMIT 50;This is great, except it returns word stems.
SELECT UNNEST(string_to_array(title, ' ')) AS word, COUNT(*) AS ct
FROM item
GROUP BY 1
ORDER BY 2 DESC
LIMIT 50;This one returns the full words, but includes stop words.
For the sake of simplicity: stop words are supposed to be found on
TABLE stop_words (lowercase_stopword text PRIMARY KEY).Can anyone help me over the line?
Solution
Your first query was pretty close. To remove the unwanted stemming, create a text search configuration with a simple dictionary that does not do it.
I suggest to use a separate schema for text search objects, but that's totally optional:
Then your query works, and very fast, too:
dbfiddle here
This operates with lower case words without stemming and doesn't break for non-ASCII letters.
Backgroud
Read the chapter Simple Dictionary in the manual.
The exact definition of a "word" is a tricky matter. The default text search parser (currently it's the only one) identifies 23 different types of tokens. See:
Built-in text search configurations map most of those to (built-in) dictionaries. Mappings for the
The demo above creates a new config based on the
I suggest to use a separate schema for text search objects, but that's totally optional:
CREATE SCHEMA ts;
GRANT USAGE ON SCHEMA ts TO public;
COMMENT ON SCHEMA ts IS 'text search objects';
CREATE TEXT SEARCH DICTIONARY ts.english_simple_dict (
TEMPLATE = pg_catalog.simple
, STOPWORDS = english
);
CREATE TEXT SEARCH CONFIGURATION ts.english_simple (COPY = simple);
ALTER TEXT SEARCH CONFIGURATION ts.english_simple
ALTER MAPPING FOR asciiword WITH ts.english_simple_dict; -- 1, 'Word, all ASCII'Then your query works, and very fast, too:
SELECT *
FROM ts_stat($SELECT to_tsvector('ts.english_simple', title) FROM item$)
ORDER BY ndoc DESC
LIMIT 50;dbfiddle here
This operates with lower case words without stemming and doesn't break for non-ASCII letters.
Backgroud
Read the chapter Simple Dictionary in the manual.
The exact definition of a "word" is a tricky matter. The default text search parser (currently it's the only one) identifies 23 different types of tokens. See:
SELECT * FROM ts_token_type('default');Built-in text search configurations map most of those to (built-in) dictionaries. Mappings for the
english config:SELECT tt.*, m.mapdict::regdictionary AS dictionary
FROM pg_ts_config_map m
LEFT JOIN ts_token_type(3722) tt ON tt.tokid = m.maptokentype
WHERE mapcfg = 'english'::regconfig -- 'ts.english_simple'::regconfig
ORDER BY tt.tokid;The demo above creates a new config based on the
simple config, and since all English stop words are of type 'asciiword', we only need to map this type to remove stop words, no stemming or anything else.Code Snippets
CREATE SCHEMA ts;
GRANT USAGE ON SCHEMA ts TO public;
COMMENT ON SCHEMA ts IS 'text search objects';
CREATE TEXT SEARCH DICTIONARY ts.english_simple_dict (
TEMPLATE = pg_catalog.simple
, STOPWORDS = english
);
CREATE TEXT SEARCH CONFIGURATION ts.english_simple (COPY = simple);
ALTER TEXT SEARCH CONFIGURATION ts.english_simple
ALTER MAPPING FOR asciiword WITH ts.english_simple_dict; -- 1, 'Word, all ASCII'SELECT *
FROM ts_stat($$SELECT to_tsvector('ts.english_simple', title) FROM item$$)
ORDER BY ndoc DESC
LIMIT 50;SELECT * FROM ts_token_type('default');SELECT tt.*, m.mapdict::regdictionary AS dictionary
FROM pg_ts_config_map m
LEFT JOIN ts_token_type(3722) tt ON tt.tokid = m.maptokentype
WHERE mapcfg = 'english'::regconfig -- 'ts.english_simple'::regconfig
ORDER BY tt.tokid;Context
StackExchange Database Administrators Q#145016, answer score: 6
Revisions (0)
No revisions yet.