HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Wordcount in a field (all and unique) - is there a more elegant/optimal way?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
uniquefieldoptimalallandmorewayelegantwordcountthere

Problem

Answering this question,

Given this table (constructed from the question):

CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);

INSERT INTO wordcount (description) VALUES ('What a great day');
INSERT INTO wordcount (description) VALUES ('This is a product. It is useful');


produce this result:

id  | word_count | unique_word_count |  Description                        
---------+------------+-------------------+---------------
     1   |  4         | 4                 | What a great day
     2   |  7         | 6                 | This is a product. It is useful


I gave the (correct) answer, which you can find here.

However in a comment, the OP then asked a further question - what if the string in question was ['a', ' ', ' ', 'b'] and my solution broke down completely - for starters, the string wouldn't even INSERT into the table.

So, the question now is, how does one deal with strings like this - i.e. with apostrophes, square brackets &c. I'm going to give my own answer and also offer a bonus for a more elegant solution.

Solutions with multiple options will be highly prized as will those which show evidence of "thinking outside the box" (sorry for the cliché - but it fits here! :-) ). I'm also going to give a detailed explanation of my reasoning - that will earn kudos also! Options which mention other servers will also gain merit. Obviously, I can only award the bonus to one person, but I will upvote all decent answers.

I can only offer a bonus in two days - so I'll post my answer and will offer the bonus (+100) when I'm allowed to. Also, any solutions which deal with strings which my own can't deal with - haven't exhaustively tested yet.

Solution

First step obviously is to create the table and data (as per the question mentioned):

CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);

INSERT INTO wordcount (description) VALUES ($What a great day$);
INSERT INTO wordcount (description) VALUES ($This is a product. It is useful$);
INSERT INTO wordcount (description) VALUES ($['a', ' ', ' ', 'b']$);


The first "lifesaver" was the dollar quoting ($$) - a really neat
PostgreSQL feature. I was really floundering before I came across
this - couldn't even get the data into the table (trying
backslashes, double quotes &c.)

My final SQL looks like this (fiddle here):

WITH cte1 AS
(
  SELECT id,
    UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
    REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g'), ' ')) as "word",
    description
  FROM wordcount
)
SELECT id,
       COUNT(word),
       COUNT(DISTINCT(word)),
       description
FROM cte1
WHERE LENGTH(word) > 0
GROUP BY id, description
ORDER BY id;


Result:

id  Word_count  Distinct_count  description
 1           4               4    What a great day
 2           7               6    This is a product. It is useful
 3           2               2    ['a', ' ', ' ', 'b']


Logic explained:

I decided not to bother with capitalisation - i.e. "It" and "it" are different words in this case - if this is an issue, the simple addition of an UPPER() function would solve that - it's not core to the question.

Step 1:

SELECT id, REGEXP_SPLIT_TO_TABLE(description, ',') FROM wordcount;
-- Keeping the id field helps clarity, even if superfluous.


Result:

id  regexp_split_to_table
1   What a great day
2   This is a product. It is useful
3   ['a'
3    ' '
3    ' '
3    'b']


Step 2 (remove all non-space, non-alpha)

SELECT id, REGEXP_REPLACE(REGEXP_SPLIT_TO_TABLE(description, ','), '[^a-zA-Z\s]', '', 'g')
FROM wordcount;

-- Remove all non-alpha, non-spaces. Otherwise the words "product" and "product." would
-- be counted as different! Again, keeping the id field makes things clearer, 
-- even if not strictly necessary for purists


Result:

id  regexp_replace
1   What a great day
2   This is a product It is useful
3   a
3     
3     
3    b


Step 3 (put the strings into an array):

SELECT id, STRING_TO_ARRAY(REGEXP_REPLACE(
          REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g'), ' ')
FROM wordcount;  
--              id again - not strictly necessary at this step.


Result:

id  string_to_array
1   {What,a,great,day}
2   {This,is,a,product,It,is,useful}
3   {a}
3   {"","",""}
3   {"","",""}
3   {"",b}


Finally, the answer itself - UNNEST and then select those words LENGTH > 0 grouping by id and description.

i.e. SELECT the necessary from the following cte (Common Table Expression) - the cte isn't strictly necessary - I could have used the UNNEST... throughout my final query, but that would have been horrible to read and debug. It's the reason Common Table Expressions were invented!

WITH cte1 AS
(
  SELECT id, 
    UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
    REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g'), ' ')) as "word",
    description
  FROM wordcount
)
SELECT blah... (see above)

Code Snippets

CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);

INSERT INTO wordcount (description) VALUES ($$What a great day$$);
INSERT INTO wordcount (description) VALUES ($$This is a product. It is useful$$);
INSERT INTO wordcount (description) VALUES ($$['a', ' ', ' ', 'b']$$);
WITH cte1 AS
(
  SELECT id,
    UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
    REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g'), ' ')) as "word",
    description
  FROM wordcount
)
SELECT id,
       COUNT(word),
       COUNT(DISTINCT(word)),
       description
FROM cte1
WHERE LENGTH(word) > 0
GROUP BY id, description
ORDER BY id;
id  Word_count  Distinct_count  description
 1           4               4    What a great day
 2           7               6    This is a product. It is useful
 3           2               2    ['a', ' ', ' ', 'b']
SELECT id, REGEXP_SPLIT_TO_TABLE(description, ',') FROM wordcount;
-- Keeping the id field helps clarity, even if superfluous.
id  regexp_split_to_table
1   What a great day
2   This is a product. It is useful
3   ['a'
3    ' '
3    ' '
3    'b']

Context

StackExchange Database Administrators Q#239958, answer score: 4

Revisions (0)

No revisions yet.