gotchasqlMinor
Why does CONTAINS not find matches at the end of the string?
Viewed 0 times
matcheswhythecontainsdoesfindendnotstring
Problem
I have a stored procedure in Microsoft SQL Server 2008 R2 (SP3) which searches a table for anything matching a
Adding the double quotes and asterisk worked well until someone discovered they had no results when searching for a string they knew definitely existed. It seems that the
Analysis with very simple t-SQL:
I note that
My question:
How should I modify the incoming keywords to make sure than
@Keywords parameter that I pass in. It was too slow using LIKE as the table has 155,000 rows, so I created a fulltext catalog and switched to using CONTAINS instead. I am not very familiar with this feature, but after researching it seemed all I needed to do was process the user's choice of words with:SET @Keywords = '"' + @Keywords + '*"'Adding the double quotes and asterisk worked well until someone discovered they had no results when searching for a string they knew definitely existed. It seems that the
* is stopping CONTAINS from finding rows where the @Keywords are right at the very end of the string.Analysis with very simple t-SQL:
--first I prove that I can find the one row which the user was searching for
--this column is a NVARCHAR(200)
SELECT * FROM News
WHERE Headline = '1120 days at sea and still sailing' --1 row returned (OK)
SELECT * FROM News
WHERE Headline LIKE '1120 days at sea and still sailing%' --1 row returned (OK)
--using the full headline
SELECT * FROM News
WHERE CONTAINS(Headline, '"1120 days at sea and still sailing*"') --no rows, just like my proc (WRONG)
SELECT * FROM News
WHERE CONTAINS(Headline, '"1120 days at sea and still sailing"') --1 row returned if asterisk omitted (OK)
--using part of the headline so we don't touch the end of the
SELECT * FROM News
WHERE CONTAINS(Headline, '"days at sea and still"') --lots of rows, since it is anything with 'days' and then 'sea' present (NOT USEFUL)
SELECT * FROM News
WHERE CONTAINS(Headline, '"days at sea and still*"') --no rows (WRONG)I note that
CONTAINS considers at/and/still to be 'noise' words and ignores them. Fair enough. The word sailing is the troublesome one at the end of the string.My question:
How should I modify the incoming keywords to make sure than
CONTAINS can find results regardless of the position of the string of keySolution
This has to do with how stoplist (or noise words) is/are treated when adding the asterix (*) in combination with
A longer explanation on that here.
In short: Default <> Prefix searching
By default, when using the
When adding the * +
But when using the default
Question
Are you saying that the intended behaviour, when an asterisk is
present, is to return zero matches for a user who includes a noise
word? This seems surprising, when you consider what most humans are
likely to type!
Prefix matching in combination with contains from microsoft docs
Specifies a match of words or phrases beginning with the specified
text. Enclose a prefix term in double quotation marks ("") and add an
asterisk (*) before the ending quotation mark, so that all text
starting with the simple term specified before the asterisk is matched
Source
Keywords here are:
so that all text
starting with the simple term specified before the asterisk is matched
Meaning that we have to have precise matches when using prefix matching and
Workaround #1 Disabling the stoplist
One workaround for this would be disabling the 'stoplist'
You do have to know that this will increase the words stored in the fulltext index, as the words 'and', 'or', 'still', ... are added to the index.
Performance might suffer from disabling the stoplist.
To reenable the system stoplist
Alternatively, you could create your own stoplist
Testing this on your query:
Result
For other workarounds you could look into using
In your case, enbling
Again due to how prefix matching + contains works.
-- No results
Inside the fulltext index (with stoplist = system).
result
'Still','at','and' are filtered out, when disabling the stoplist they will be added to the index
Testing without stopwords
More on the asterix
The presence of the asterisk enables the prefix-matching mode...
If a phrase is provided, matches are detected if the column contains
all the specified words with zero or more other characters following
the final word
Source
CONTAINS().A longer explanation on that here.
In short: Default <> Prefix searching
By default, when using the
SYSTEM stoplist some words are not indexed and classified as stoplist words. When adding the * +
CONTAINS()they are not seen as stoplist words, and the exact phrase has to match.But when using the default
SYSTEM stoplist these 'stoplist words' are not added to the index, resulting in no matches found.Question
Are you saying that the intended behaviour, when an asterisk is
present, is to return zero matches for a user who includes a noise
word? This seems surprising, when you consider what most humans are
likely to type!
Prefix matching in combination with contains from microsoft docs
Specifies a match of words or phrases beginning with the specified
text. Enclose a prefix term in double quotation marks ("") and add an
asterisk (*) before the ending quotation mark, so that all text
starting with the simple term specified before the asterisk is matched
Source
Keywords here are:
so that all text
starting with the simple term specified before the asterisk is matched
Meaning that we have to have precise matches when using prefix matching and
CONTAINS().Workaround #1 Disabling the stoplist
One workaround for this would be disabling the 'stoplist'
ALTER FULLTEXT INDEX ON News
SET STOPLIST OFF;You do have to know that this will increase the words stored in the fulltext index, as the words 'and', 'or', 'still', ... are added to the index.
Performance might suffer from disabling the stoplist.
To reenable the system stoplist
ALTER FULLTEXT INDEX ON News
SET STOPLIST SYSTEM;Alternatively, you could create your own stoplist
Testing this on your query:
ALTER FULLTEXT INDEX ON News
SET STOPLIST OFF;
INSERT INTO News(Headline)
VALUES('1120 days at sea and still sailing')
INSERT INTO News(Headline)
VALUES('zzz 1120 days at sea and still sailing')
SELECT * FROM News
WHERE CONTAINS(Headline, '"days at sea and still*"')Result
id Headline
1 1120 days at sea and still sailing
2 zzz 1120 days at sea and still sailingFor other workarounds you could look into using
CONTAINS() or FREETEXT() without prefix matching, or prefix matching + CONTAINS() with one (non noise) word to get more accurate resultsIn your case, enbling
transform noise words does not seem to work with prefix-matchingsp_configure 'show advanced options', 1;
RECONFIGURE;
GO
sp_configure 'transform noise words', 1;
RECONFIGURE;
GOAgain due to how prefix matching + contains works.
SELECT * FROM News
WHERE CONTAINS(Headline, '"1120 days at sea and still sailing*"')-- No results
Inside the fulltext index (with stoplist = system).
SELECT display_term,column_id,document_count
FROM sys.dm_fts_index_keywords(DB_ID('MNGDB'), OBJECT_ID('dbo.News'));result
display_term column_id document_count
1120 2 2
days 2 2
nn1120 2 2
sailing 2 2
sea 2 2
zzz 2 1
END OF FILE 2 2'Still','at','and' are filtered out, when disabling the stoplist they will be added to the index
Testing without stopwords
ALTER FULLTEXT INDEX ON News
SET STOPLIST SYSTEM;
INSERT INTO dbo.News
VALUES('days sea sailing')
SELECT * FROM dbo.News
WHERE CONTAINS(Headline, '"days sea sailing*"')
id Headline
4 days sea sailingMore on the asterix
The presence of the asterisk enables the prefix-matching mode...
If a phrase is provided, matches are detected if the column contains
all the specified words with zero or more other characters following
the final word
Source
Code Snippets
ALTER FULLTEXT INDEX ON News
SET STOPLIST OFF;ALTER FULLTEXT INDEX ON News
SET STOPLIST SYSTEM;ALTER FULLTEXT INDEX ON News
SET STOPLIST OFF;
INSERT INTO News(Headline)
VALUES('1120 days at sea and still sailing')
INSERT INTO News(Headline)
VALUES('zzz 1120 days at sea and still sailing')
SELECT * FROM News
WHERE CONTAINS(Headline, '"days at sea and still*"')id Headline
1 1120 days at sea and still sailing
2 zzz 1120 days at sea and still sailingsp_configure 'show advanced options', 1;
RECONFIGURE;
GO
sp_configure 'transform noise words', 1;
RECONFIGURE;
GOContext
StackExchange Database Administrators Q#234123, answer score: 5
Revisions (0)
No revisions yet.