patternsqlMinor
Using UPSERT to maintain word counts for subtitles
Viewed 0 times
upsertmaintainwordusingforsubtitlescounts
Problem
I will first explain the context. I want to read serie subtitles contained in .srt files. I have already done the part of extracting the words from the files (and how many times the word appears for a serie) but when it comes to inserting the data in my SQL Server database, it takes more and more time to do the updates or inserts because i am checking if the word exists or not. I call a stored procedure from C# with the following parameters : the word, how many times it appears in my files, the serie name.
I have inserted only two series and I can already see that it is taking more and more time to insert or update one word.
Can you see any improvements that can be done or different way of doing what I want more efficiently ?
ALTER PROCEDURE [dbo].[INSERT_WORD_FOR_SERIE]
@word nvarchar(50),
@counter int,
@serie nvarchar(50)
AS
BEGIN
DECLARE @idSerie AS int
DECLARE @idWord AS int
SELECT @idWord = IDWORD FROM WORD WHERE WORD=@word;
IF @idWord IS NOT NULL
BEGIN
SELECT @idSerie = IDSERIE FROM SERIE WHERE SERIE=@serie;
IF EXISTS (SELECT counter FROM APPEARS WHERE IDWORD=@idWord AND IDSERIE=@idSerie)
BEGIN
UPDATE APPEARS SET counter = counter + @counter WHERE IDWORD=@idWord AND IDSERIE=@idSerie;
END
ELSE
BEGIN
INSERT INTO APPEARS VALUES (@idWord, @idSerie, @counter);
END
END
ELSE
BEGIN
INSERT INTO WORD VALUES (@word);
SELECT @idWord = IDWORD FROM WORD WHERE WORD=@word;
SELECT @idSerie = IDSERIE FROM SERIE WHERE SERIE=@serie;
INSERT INTO APPEARS VALUES (@idWord, @idSerie, @counter);
END
ENDI have inserted only two series and I can already see that it is taking more and more time to insert or update one word.
Can you see any improvements that can be done or different way of doing what I want more efficiently ?
Solution
Like Pankaj Gupta already mentions, you could move the query for
Secondly, I find the code a bit hard to read because of the casing of the table and field names. This is, of course, a matter of style, but unless you are not "in charge" of the naming style, I would suggest to case them differently. E.g. using camelCase for field names as you do for variable names (and the
Separation of concerns is probably a bit harder in T-SQL, because writing beautiful T-SQL is not always the same as writing well-performing T-SQL. But you could split up your stored procedure into three parts: the first part would get the
Those three parts have generally the same form: 1) select
You didn't mention which version of SQL Server you use, but since version 2008, you could look into the MERGE statement, which combines tho logic of inserting and updating into a single statement.
About performance (of which I don't know whether it is in the scope of code review): I suspect that the
Finally, it's good practice to specify the columns you are inserting into, for readability and maintenance purposes. So I'd change
To:
Likewise for the other inserts.
By the way, I like that you are ending all statements with a semicolon, which is not mandatory, but a good practice, and can help prevent a kind of errors with changing code, using CTE's or query hints for example.
@idSerie outside the IF-statement, as it is exactly the same query on exactly the same table with exactly the same data (it does not change anywhere in this stored procedure).Secondly, I find the code a bit hard to read because of the casing of the table and field names. This is, of course, a matter of style, but unless you are not "in charge" of the naming style, I would suggest to case them differently. E.g. using camelCase for field names as you do for variable names (and the
APPEARS.counter column), and use PascalCase for the table names. This way, using UPPERCASE for reserved SQL keywords makes them stand out, instead of having them look just the same as the rest.Separation of concerns is probably a bit harder in T-SQL, because writing beautiful T-SQL is not always the same as writing well-performing T-SQL. But you could split up your stored procedure into three parts: the first part would get the
@idSerie value from the SERIE table. The second part would fetch the @idWord from the WORD table, inserting it if it's not already in there. The third part then would upsert the correct values into the VALUES table. Not only will this make the code more readable (you only have to remember one table and one or two variables in each part), but also more maintainable, because the INSERT INTO APPEARS command would be only in one place.Those three parts have generally the same form: 1) select
@id, 2) insert or update that row, and 3) get the real @id. 2 and 3 don't apply to each part, so they are optional (you don't need to insert a SERIE row, and you don't need to update a WORD row).You didn't mention which version of SQL Server you use, but since version 2008, you could look into the MERGE statement, which combines tho logic of inserting and updating into a single statement.
About performance (of which I don't know whether it is in the scope of code review): I suspect that the
WORD table contains a unique entry for each word. As you mention that the query takes more and more time, the more words are in there, you might want to add a (unique) index on the WORD.WORD column, but I don't really know much about performance, so you might want to experiment.Finally, it's good practice to specify the columns you are inserting into, for readability and maintenance purposes. So I'd change
INSERT INTO APPEARS VALUES (@idWord, @idSerie, @counter);To:
INSERT INTO APPEARS (IDWORD, IDSERIE, counter) VALUES (@idWord, @idSerie, @counter);Likewise for the other inserts.
By the way, I like that you are ending all statements with a semicolon, which is not mandatory, but a good practice, and can help prevent a kind of errors with changing code, using CTE's or query hints for example.
Code Snippets
INSERT INTO APPEARS VALUES (@idWord, @idSerie, @counter);INSERT INTO APPEARS (IDWORD, IDSERIE, counter) VALUES (@idWord, @idSerie, @counter);Context
StackExchange Code Review Q#113716, answer score: 2
Revisions (0)
No revisions yet.