patternsqlMinor
BM25 (full-text search) implementation in SQL Server
Viewed 0 times
fullsearchsqltextserverimplementationbm25
Problem
I've hit a stump, while trying to implement the BM25 algorithm in SQL Server 2008 R2. I know that SQL Server includes the Full-Text Search option, which already implements a variant of BM25, but I would like to do some parameter tuning tests and since the FTS procedures are non-editable (as far as I know), I've decided to implement it myself.
I have two tables, TF (term frequency) and DF (document frequency) with the following structures:
TF
*Note: the weight column denotes importance of the word (it is usually 1)
DF
The TF table contains the relationship between a term and a document; that is, the frequency of the term in a document. The DF table contains information about how many documents contain a term. The two tables can be linked using DF.ID and TF.TermID. Using these two tables I would now like to compute BM25 similarity values between two documents (one document acts as a query) according to the formula in the Wikipedia article. The tables TF and DF translate to functions f(q, D) and n(q) respectively:
I would like the result to be in this format:
Here is some code that I have up to now:
I'm having trouble co
I have two tables, TF (term frequency) and DF (document frequency) with the following structures:
TF
*Note: the weight column denotes importance of the word (it is usually 1)
ID | Term | DocumentID | Count | TermID | Weight*DF
ID | Term | CountThe TF table contains the relationship between a term and a document; that is, the frequency of the term in a document. The DF table contains information about how many documents contain a term. The two tables can be linked using DF.ID and TF.TermID. Using these two tables I would now like to compute BM25 similarity values between two documents (one document acts as a query) according to the formula in the Wikipedia article. The tables TF and DF translate to functions f(q, D) and n(q) respectively:
I would like the result to be in this format:
DocumentA_ID | DocumentB_ID | BM25_ValueHere is some code that I have up to now:
DECLARE @N FLOAT;
DECLARE @AVGDL FLOAT;
DECLARE @K1 FLOAT;
DECLARE @B FLOAT;
SET @K1 = 1.2;
SET @B = 0.75;
-- number of all documents
SELECT @N = COUNT(DISTINCT DocumentID) FROM TF;
-- average document length (in words)
SELECT @AVGDL = AVG(DocumentLength) FROM (SELECT DocumentID, SUM(TF.Count) AS DocumentLength FROM TF WHERE Weight = 3 GROUP BY DocumentID) A;
-- BM25 implementation
SELECT D.DocumentID AS DocumentA,
Q.DocumentID AS DocumentB,
*,
-- need help here (SUM or something ...)
LOG((@N - DF.Count + 0.5)/(DF.Count + 0.5)) AS IDF
FROM TF AS D
INNER JOIN TF AS Q ON D.Term = Q.Term
INNER JOIN DF ON D.TermID = DF.ID
WHERE D.DocumentID <> Q.DocumentIDI'm having trouble co
Solution
After a few days of fiddling around, I finally managed to get this done. Here is the code I ended up with, packed into a stored procedure. For this to work, I added another table, which just holds a document ID and its length in words.
I hope this answer helps anyone who wanted to implement a full-text search algorithm by themselves. The default algorithm in SQL Server 2008 R2 has only minor differences to this implementation.
ALTER PROCEDURE [dbo].[BuildIndexBM25]
-- default parameters K1=> [1.2 - 2.0], B => [0.0 - 1.0], Weight => 3 (e.g. titles only)
@K1 FLOAT = 1.2,
@B FLOAT = 0.75,
@Weight FLOAT = 3
AS
BEGIN
SET NOCOUNT ON;
DECLARE @N FLOAT;
DECLARE @AVGDL FLOAT;
-- number of all documents
SELECT @N = COUNT(DISTINCT DocumentID) FROM TF;
-- average document length (in words)
SELECT @AVGDL = AVG(Length) FROM DocumentLength;
-- BM25 implementation
-- result set: | DocumentA | DocumentB | BM25 |
SELECT
DL.ID AS DocumentA,
BM.DocumentID AS DocumentB,
BM.BM25 AS BM25
FROM (
SELECT ID FROM DocumentLength
) DL
CROSS APPLY (
SELECT
Q.DocumentID,
SUM(LOG((@N - X.Count + 0.5) / (X.Count + 0.5)) * (Q.Count * (@K1 + 1)) / (Q.Count + @K1 * (1 - @B + (@B * L.Length / @AVGDL)))) AS BM25
FROM TF D, TF Q, DF X, DocumentLength L
WHERE D.TermID = Q.TermID
AND D.DocumentID = DL.ID
AND Q.TermID = X.ID
AND D.DocumentID = L.ID
AND Q.Weight = @Weight
AND D.Weight = @Weight
GROUP BY Q.DocumentID
) BM
WHERE DL.ID != BM.DocumentID
ENDI hope this answer helps anyone who wanted to implement a full-text search algorithm by themselves. The default algorithm in SQL Server 2008 R2 has only minor differences to this implementation.
Code Snippets
ALTER PROCEDURE [dbo].[BuildIndexBM25]
-- default parameters K1=> [1.2 - 2.0], B => [0.0 - 1.0], Weight => 3 (e.g. titles only)
@K1 FLOAT = 1.2,
@B FLOAT = 0.75,
@Weight FLOAT = 3
AS
BEGIN
SET NOCOUNT ON;
DECLARE @N FLOAT;
DECLARE @AVGDL FLOAT;
-- number of all documents
SELECT @N = COUNT(DISTINCT DocumentID) FROM TF;
-- average document length (in words)
SELECT @AVGDL = AVG(Length) FROM DocumentLength;
-- BM25 implementation
-- result set: | DocumentA | DocumentB | BM25 |
SELECT
DL.ID AS DocumentA,
BM.DocumentID AS DocumentB,
BM.BM25 AS BM25
FROM (
SELECT ID FROM DocumentLength
) DL
CROSS APPLY (
SELECT
Q.DocumentID,
SUM(LOG((@N - X.Count + 0.5) / (X.Count + 0.5)) * (Q.Count * (@K1 + 1)) / (Q.Count + @K1 * (1 - @B + (@B * L.Length / @AVGDL)))) AS BM25
FROM TF D, TF Q, DF X, DocumentLength L
WHERE D.TermID = Q.TermID
AND D.DocumentID = DL.ID
AND Q.TermID = X.ID
AND D.DocumentID = L.ID
AND Q.Weight = @Weight
AND D.Weight = @Weight
GROUP BY Q.DocumentID
) BM
WHERE DL.ID != BM.DocumentID
ENDContext
StackExchange Database Administrators Q#42023, answer score: 7
Revisions (0)
No revisions yet.