HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Use utl_match.jaro_winkler_similarity to select distinct rows using Oracle

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
distinctjaro_winkler_similarityrowsoracleutl_matchusingselectuse

Problem

I have a table with log messages that many times contain repeated information with slightly different values.
For example:

Unable to make thread 19043 a realtime process
Unable to make thread 20763 a realtime process
Unable to make thread 22179 a realtime process
FEED_XYZ Secondary Instrument Lines not configured
FEED_ZZZ Secondary Instrument Lines not configured
(...)

I'd like to have just one line from each group above:

Unable to make thread 19043 a realtime process
FEED_XYZ Secondary Instrument Lines not configured

Is there a way to use a select query to retrieve distinct rows with utl_match.jaro_winkler_similarity?

Something like:

select log_message
from logs_table
where "utl_match.jaro_winkler_similarity between rows < 80"


I know that I can make a pipelined function or some other PL/SQL routine, but I just would like to double-check if there's some easier way of doing this.

Solution

I'm not overly familiar with the UTL_MATCH.JARO_WINKLER_SIMILARITY function, but using it in a UNION query, in conjunction with the ROW_NUMBER analytic function, will give you the result you desire:

WITH
    temp AS
    (
    SELECT log_message, ROW_NUMBER() OVER (ORDER BY log_message) rn
    FROM logs_table
    WHERE UTL_MATCH.JARO_WINKLER_SIMILARITY(log_message, 'Unable to make thread 19043 a realtime process') > 80
    UNION
    SELECT log_message, ROW_NUMBER() OVER (ORDER BY log_message) rn
    FROM logs_table
    WHERE UTL_MATCH.JARO_WINKLER_SIMILARITY(log_message, 'FEED_XYZ Secondary Instrument Lines not configured') > 80
    )
SELECT log_message
FROM temp
WHERE rn = 1
;


Result (screenshot from test DB at apex.oracle.com):

If you just need any value from each group (as opposed to the "first" value in each sorted group), you could use ROWNUM criteria instead of the ROW_NUMBER functions, which would return a similar result:

WITH
    temp AS
    (
    SELECT log_message
    FROM logs_table
    WHERE UTL_MATCH.JARO_WINKLER_SIMILARITY(log_message, 'Unable to make thread 19043 a realtime process') > 80
      AND ROWNUM = 1
    UNION
    SELECT log_message
    FROM logs_table
    WHERE UTL_MATCH.JARO_WINKLER_SIMILARITY(log_message, 'FEED_XYZ Secondary Instrument Lines not configured') > 80
      AND ROWNUM = 1
    )
SELECT log_message
FROM temp
;


However, unless there is a specific reason why you need to use UTL_MATCH.JARO_WINKLER_SIMILARITY, there's a chance you may be falling victim to the XY Problem. You could achieve identical results by replacing the two individual WHERE-clause lines (in either query above) with these:

WHERE log_message LIKE 'Unable to make thread %'
WHERE log_message LIKE 'FEED\_%' ESCAPE '\'


Or these (e.g. if there is a need for more fine-grained filtering):

WHERE REGEXP_LIKE(log_message, 'Unable\sto\smake\sthread\s\d+\sa\srealtime\sprocess')
WHERE REGEXP_LIKE(log_message, 'FEED_[A-Z]{3}\sSecondary\sInstrument\sLines\snot\sconfigured')

Code Snippets

WITH
    temp AS
    (
    SELECT log_message, ROW_NUMBER() OVER (ORDER BY log_message) rn
    FROM logs_table
    WHERE UTL_MATCH.JARO_WINKLER_SIMILARITY(log_message, 'Unable to make thread 19043 a realtime process') > 80
    UNION
    SELECT log_message, ROW_NUMBER() OVER (ORDER BY log_message) rn
    FROM logs_table
    WHERE UTL_MATCH.JARO_WINKLER_SIMILARITY(log_message, 'FEED_XYZ Secondary Instrument Lines not configured') > 80
    )
SELECT log_message
FROM temp
WHERE rn = 1
;
WITH
    temp AS
    (
    SELECT log_message
    FROM logs_table
    WHERE UTL_MATCH.JARO_WINKLER_SIMILARITY(log_message, 'Unable to make thread 19043 a realtime process') > 80
      AND ROWNUM = 1
    UNION
    SELECT log_message
    FROM logs_table
    WHERE UTL_MATCH.JARO_WINKLER_SIMILARITY(log_message, 'FEED_XYZ Secondary Instrument Lines not configured') > 80
      AND ROWNUM = 1
    )
SELECT log_message
FROM temp
;
WHERE log_message LIKE 'Unable to make thread %'
WHERE log_message LIKE 'FEED\_%' ESCAPE '\'
WHERE REGEXP_LIKE(log_message, 'Unable\sto\smake\sthread\s\d+\sa\srealtime\sprocess')
WHERE REGEXP_LIKE(log_message, 'FEED_[A-Z]{3}\sSecondary\sInstrument\sLines\snot\sconfigured')

Context

StackExchange Database Administrators Q#158190, answer score: 2

Revisions (0)

No revisions yet.