patternsqlMinor
Wordcount in a field (all and unique) - is there a more elegant/optimal way?
Viewed 0 times
uniquefieldoptimalallandmorewayelegantwordcountthere
Problem
Answering this question,
Given this table (constructed from the question):
produce this result:
I gave the (correct) answer, which you can find here.
However in a comment, the OP then asked a further question - what if the string in question was
So, the question now is, how does one deal with strings like this - i.e. with apostrophes, square brackets &c. I'm going to give my own answer and also offer a bonus for a more elegant solution.
Solutions with multiple options will be highly prized as will those which show evidence of "thinking outside the box" (sorry for the cliché - but it fits here! :-) ). I'm also going to give a detailed explanation of my reasoning - that will earn kudos also! Options which mention other servers will also gain merit. Obviously, I can only award the bonus to one person, but I will upvote all decent answers.
I can only offer a bonus in two days - so I'll post my answer and will offer the bonus (+100) when I'm allowed to. Also, any solutions which deal with strings which my own can't deal with - haven't exhaustively tested yet.
Given this table (constructed from the question):
CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);
INSERT INTO wordcount (description) VALUES ('What a great day');
INSERT INTO wordcount (description) VALUES ('This is a product. It is useful');produce this result:
id | word_count | unique_word_count | Description
---------+------------+-------------------+---------------
1 | 4 | 4 | What a great day
2 | 7 | 6 | This is a product. It is usefulI gave the (correct) answer, which you can find here.
However in a comment, the OP then asked a further question - what if the string in question was
['a', ' ', ' ', 'b'] and my solution broke down completely - for starters, the string wouldn't even INSERT into the table.So, the question now is, how does one deal with strings like this - i.e. with apostrophes, square brackets &c. I'm going to give my own answer and also offer a bonus for a more elegant solution.
Solutions with multiple options will be highly prized as will those which show evidence of "thinking outside the box" (sorry for the cliché - but it fits here! :-) ). I'm also going to give a detailed explanation of my reasoning - that will earn kudos also! Options which mention other servers will also gain merit. Obviously, I can only award the bonus to one person, but I will upvote all decent answers.
I can only offer a bonus in two days - so I'll post my answer and will offer the bonus (+100) when I'm allowed to. Also, any solutions which deal with strings which my own can't deal with - haven't exhaustively tested yet.
Solution
First step obviously is to create the table and data (as per the question mentioned):
The first "lifesaver" was the dollar quoting (
PostgreSQL feature. I was really floundering before I came across
this - couldn't even get the data into the table (trying
backslashes, double quotes &c.)
My final SQL looks like this (fiddle here):
Result:
Logic explained:
I decided not to bother with capitalisation - i.e. "It" and "it" are different words in this case - if this is an issue, the simple addition of an
Step 1:
Result:
Step 2 (remove all non-space, non-alpha)
Result:
Step 3 (put the strings into an array):
Result:
Finally, the answer itself -
i.e.
CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);
INSERT INTO wordcount (description) VALUES ($What a great day$);
INSERT INTO wordcount (description) VALUES ($This is a product. It is useful$);
INSERT INTO wordcount (description) VALUES ($['a', ' ', ' ', 'b']$);The first "lifesaver" was the dollar quoting (
$$) - a really neatPostgreSQL feature. I was really floundering before I came across
this - couldn't even get the data into the table (trying
backslashes, double quotes &c.)
My final SQL looks like this (fiddle here):
WITH cte1 AS
(
SELECT id,
UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g'), ' ')) as "word",
description
FROM wordcount
)
SELECT id,
COUNT(word),
COUNT(DISTINCT(word)),
description
FROM cte1
WHERE LENGTH(word) > 0
GROUP BY id, description
ORDER BY id;Result:
id Word_count Distinct_count description
1 4 4 What a great day
2 7 6 This is a product. It is useful
3 2 2 ['a', ' ', ' ', 'b']Logic explained:
I decided not to bother with capitalisation - i.e. "It" and "it" are different words in this case - if this is an issue, the simple addition of an
UPPER() function would solve that - it's not core to the question.Step 1:
SELECT id, REGEXP_SPLIT_TO_TABLE(description, ',') FROM wordcount;
-- Keeping the id field helps clarity, even if superfluous.Result:
id regexp_split_to_table
1 What a great day
2 This is a product. It is useful
3 ['a'
3 ' '
3 ' '
3 'b']Step 2 (remove all non-space, non-alpha)
SELECT id, REGEXP_REPLACE(REGEXP_SPLIT_TO_TABLE(description, ','), '[^a-zA-Z\s]', '', 'g')
FROM wordcount;
-- Remove all non-alpha, non-spaces. Otherwise the words "product" and "product." would
-- be counted as different! Again, keeping the id field makes things clearer,
-- even if not strictly necessary for puristsResult:
id regexp_replace
1 What a great day
2 This is a product It is useful
3 a
3
3
3 bStep 3 (put the strings into an array):
SELECT id, STRING_TO_ARRAY(REGEXP_REPLACE(
REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g'), ' ')
FROM wordcount;
-- id again - not strictly necessary at this step.Result:
id string_to_array
1 {What,a,great,day}
2 {This,is,a,product,It,is,useful}
3 {a}
3 {"","",""}
3 {"","",""}
3 {"",b}Finally, the answer itself -
UNNEST and then select those words LENGTH > 0 grouping by id and description.i.e.
SELECT the necessary from the following cte (Common Table Expression) - the cte isn't strictly necessary - I could have used the UNNEST... throughout my final query, but that would have been horrible to read and debug. It's the reason Common Table Expressions were invented!WITH cte1 AS
(
SELECT id,
UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g'), ' ')) as "word",
description
FROM wordcount
)
SELECT blah... (see above)Code Snippets
CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);
INSERT INTO wordcount (description) VALUES ($$What a great day$$);
INSERT INTO wordcount (description) VALUES ($$This is a product. It is useful$$);
INSERT INTO wordcount (description) VALUES ($$['a', ' ', ' ', 'b']$$);WITH cte1 AS
(
SELECT id,
UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g'), ' ')) as "word",
description
FROM wordcount
)
SELECT id,
COUNT(word),
COUNT(DISTINCT(word)),
description
FROM cte1
WHERE LENGTH(word) > 0
GROUP BY id, description
ORDER BY id;id Word_count Distinct_count description
1 4 4 What a great day
2 7 6 This is a product. It is useful
3 2 2 ['a', ' ', ' ', 'b']SELECT id, REGEXP_SPLIT_TO_TABLE(description, ',') FROM wordcount;
-- Keeping the id field helps clarity, even if superfluous.id regexp_split_to_table
1 What a great day
2 This is a product. It is useful
3 ['a'
3 ' '
3 ' '
3 'b']Context
StackExchange Database Administrators Q#239958, answer score: 4
Revisions (0)
No revisions yet.