HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Split string in chunks preferable at spaces

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
chunkspreferablespacessplitstring

Problem

I'm working on a tool to import data from one database to another. One requirement is that I have to split a string from one source field into three (shorter) fields at the target. If possible the string should be split at a space character. If the string doesn't fit completely into the target fields, the rest can be omitted.

Usually I would solve this using an UDF but unfortunately neither UDF's nor Stored Procedures are allowed in my scenario.

My source database has the following table:

CREATE TABLE dbo.Organisations
(
OrganisationID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
OrganisationName nvarchar(180) NOT NULL
/ More columns omitted for brevity /
)


This table contains company names, for example:

OrganisationID | OrganisationName
---------------+--------------------------------------------------------------------
1 | Microsoft Corporation
2 | S&T System Integration & Technology Distribution Aktiengesellschaft


During the import the records of this table should be inserted into a stating table in the target database. The staging table looks like this:

CREATE TABLE dbo.OrgStaging
(
OrganisationID int NOT NULL,
Name1 nvarchar(50) NOT NULL,
Name2 nvarchar(50) NOT NULL,
Name3 nvarchar(50) NOT NULL
/ More columns omitted for brevity /
)


If I would simply use SUBSTRING to split the name I would end up in the staging table like this:

OrganisationID | Name1 | Name2
---------------+---------------------------------------------------+-----------------
1 | Microsoft Corporation |
2 | S&T System Integration & Technology Distribution A|ktiengesellschaft


But I don't want to split in the middle of a word so I would like to have the result like this:

`OrganisationID | Name1 | Name2
---------------+---------------------------------------------------+-------

Solution

A nitpick

This first point will not really change the query, and it is probably
only added for testing purposes, but an ORDER BY with an INSERT INTO statement does not really do anything useful (unless you insert
into a table with an IDENTITY column).

Us a more compact style

It looks like you either have a very strong policy on style and
layout, or used an auto-formatter, because in some places, SELECT *
is written in two lines. That is good, because the query is at least
well formatted and readable. If you want it to be more compact,
though, you might want to skimp a bit on the newlines.

When formatting code more compactly, there is still "breathing space",
but the scroll-factor is tuned down a bit, so you have a bit higher
view of the code. I find that it can help.

I like to use a combination of indentation to recognize the query
parts (my indentation is by no means the default in SQL) and a thing I
call "one concept per line", where each line tells me something that
can stand on its own for logic.

Not indenting UNION s helps to see the "equal level" of both sides of
the UNION, and prevents the lines from getting too long.

```
DECLARE @MaxLen int = 50; -- Maximum length of a target column

WITH SpacePositions AS
( SELECT O.OrganisationID
, CHARINDEX(' ', O.OrganisationName, 0) AS Position
FROM dbo.Organisations O
UNION ALL
SELECT O.OrganisationID
, CHARINDEX(' ', O.OrganisationName, S.Position + 1) AS Position
FROM dbo.Organisations O
INNER JOIN SpacePositions S
ON CHARINDEX(' ', O.OrganisationName, S.Position + 1) > S.Position
AND S.OrganisationID = O.OrganisationID
)
, SplitPositions AS
( SELECT S.OrganisationID
, S.Position - 1 AS Position
FROM SpacePositions S
WHERE S.Position != 0
UNION
SELECT O.OrganisationID
, LEN(O.OrganisationName) AS Position
FROM dbo.Organisations O
)
, FirstChunk AS
( SELECT D.OrganisationID
, 1 AS ChunkStart
, MAX(D.Position) AS ChunkEnd
FROM ( SELECT S.OrganisationID
, S.Position + 1 AS Position
FROM SplitPositions S
WHERE Position BETWEEN 1 AND @MaxLen
UNION
SELECT S.OrganisationID
, @MaxLen
FROM SplitPositions S
WHERE NOT EXISTS ( SELECT *
FROM SplitPositions SI
WHERE SI.Position BETWEEN 1 AND @MaxLen
AND SI.OrganisationID = S.OrganisationID
)
) D
GROUP BY D.OrganisationID
)
, SecondChunk AS
( SELECT C.OrganisationID
, C.ChunkEnd + 1 AS ChunkStart
, MAX(D.Position) AS ChunkEnd
FROM FirstChunk C
INNER JOIN ( SELECT S.OrganisationID
, S.Position + 1 AS Position
FROM SplitPositions S
INNER JOIN FirstChunk C ON C.OrganisationID = S.OrganisationID
WHERE S.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
UNION
SELECT S.OrganisationID
, C.ChunkEnd + @MaxLen AS Position
FROM SplitPositions S
INNER JOIN FirstChunk C ON C.OrganisationID = S.OrganisationID
WHERE NOT EXISTS ( SELECT *
FROM SplitPositions SI
WHERE SI.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
AND OrganisationID = C.OrganisationID
)
) D ON D.OrganisationID = C.OrganisationID
GROUP BY C.OrganisationID, C.ChunkEnd
)
, ThirdChunk AS
( SELECT C.OrganisationID
, C.ChunkEnd + 1 AS ChunkStart
, MAX(D.Position) AS ChunkEnd
FROM SecondChunk C
INNER JOIN ( SELECT S.OrganisationID
, S.Position + 1 AS Position
FROM SplitPositions S
INNER JOIN SecondChunk C ON C.OrganisationID = S.OrganisationID
WHERE S.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
UNION
SELECT S.OrganisationID
, C.ChunkEnd + @MaxLen AS Position
FROM SplitPositions S
INNER JOIN SecondChunk C ON C.OrganisationID = S.OrganisationID
WHERE NOT EXISTS ( SELECT *
FROM SplitPositions SI
WHERE SI.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd +

Code Snippets

DECLARE @MaxLen int = 50;     -- Maximum length of a target column

  WITH SpacePositions AS
     ( SELECT O.OrganisationID
            , CHARINDEX(' ', O.OrganisationName, 0) AS Position
         FROM dbo.Organisations O
        UNION ALL
       SELECT O.OrganisationID
            , CHARINDEX(' ', O.OrganisationName, S.Position + 1) AS Position
         FROM dbo.Organisations O
        INNER JOIN SpacePositions S
                ON CHARINDEX(' ', O.OrganisationName, S.Position + 1) > S.Position
               AND S.OrganisationID = O.OrganisationID
     )
     , SplitPositions AS
     ( SELECT S.OrganisationID
            , S.Position - 1 AS Position
         FROM SpacePositions S
        WHERE S.Position != 0
        UNION
       SELECT O.OrganisationID
            , LEN(O.OrganisationName) AS Position
         FROM dbo.Organisations O
     )
     , FirstChunk AS
     ( SELECT D.OrganisationID
            , 1 AS ChunkStart
            , MAX(D.Position) AS ChunkEnd
         FROM ( SELECT S.OrganisationID
                     , S.Position + 1 AS Position
                  FROM SplitPositions S
                 WHERE Position BETWEEN 1 AND @MaxLen
                 UNION
                SELECT S.OrganisationID
                     , @MaxLen
                  FROM SplitPositions S
                 WHERE NOT EXISTS ( SELECT *
                                      FROM SplitPositions SI
                                     WHERE SI.Position BETWEEN 1 AND @MaxLen
                                       AND SI.OrganisationID = S.OrganisationID
                                  )
              ) D
          GROUP BY D.OrganisationID
     )
     , SecondChunk AS
     ( SELECT C.OrganisationID
            , C.ChunkEnd + 1 AS ChunkStart
            , MAX(D.Position) AS ChunkEnd
         FROM FirstChunk C
        INNER JOIN ( SELECT S.OrganisationID
                          , S.Position + 1 AS Position
                       FROM SplitPositions S
                      INNER JOIN FirstChunk C ON C.OrganisationID = S.OrganisationID
                      WHERE S.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
                      UNION
                     SELECT S.OrganisationID
                          , C.ChunkEnd + @MaxLen AS Position
                       FROM SplitPositions S
                      INNER JOIN FirstChunk C ON C.OrganisationID = S.OrganisationID
                      WHERE NOT EXISTS ( SELECT *
                                           FROM SplitPositions SI
                                          WHERE SI.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
                                            AND OrganisationID = C.OrganisationID
                                       )
                   ) D ON D.OrganisationID = C.OrganisationID
        GROUP BY C.OrganisationID, C.ChunkEnd
     )
     , ThirdChunk AS
     ( SELECT C.OrganisationID
            , C.ChunkEnd + 1 AS ChunkStart
            , MAX(D.Position) AS Chunk
, FirstChunk AS
( SELECT O.OrganisationID
       , 1 AS ChunkStart
       , COALESCE(MAX(D.Position), @MaxLen) AS ChunkEnd
    FROM dbo.Organisations O
    LEFT JOIN ( SELECT S.OrganisationID
                     , S.Position + 1 AS Position
                  FROM SplitPositions S
                 WHERE Position BETWEEN 1 AND @MaxLen
              ) D ON D.OrganisationID = O.OrganisationID
     GROUP BY O.OrganisationID
)
, SecondChunk AS
( SELECT C.OrganisationID
       , C.ChunkEnd + 1 AS ChunkStart
       , COALESCE(MAX(D.Position), C.ChunkEnd + @MaxLen) AS ChunkEnd
    FROM FirstChunk C
    LEFT JOIN ( SELECT S.OrganisationID
                     , S.Position + 1 AS Position
                  FROM SplitPositions S
                 INNER JOIN FirstChunk C ON C.OrganisationID = S.OrganisationID
                 WHERE S.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
              ) D ON D.OrganisationID = C.OrganisationID
   GROUP BY C.OrganisationID, C.ChunkEnd
)
, ThirdChunk AS
( SELECT C.OrganisationID
       , C.ChunkEnd + 1 AS ChunkStart
       , COALESCE(MAX(D.Position), C.ChunkEnd + @MaxLen) AS ChunkEnd
    FROM SecondChunk C
    LEFT JOIN ( SELECT S.OrganisationID
                     , S.Position + 1 AS Position
                  FROM SplitPositions S
                 INNER JOIN SecondChunk C ON C.OrganisationID = S.OrganisationID
                 WHERE S.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
              ) D ON D.OrganisationID = C.OrganisationID
   GROUP BY C.OrganisationID, C.ChunkEnd
)
INSERT INTO Organisations ([OrganisationName])
SELECT SUBSTRING(OrganisationName, 1, 180)
  FROM ( SELECT 'Microsoft Corporation'
         UNION ALL SELECT 'S&T System Integration & Technology Distribution Aktiengesellschaft'
         UNION ALL SELECT 'VeryLongOrganisationNameThatWillHaveToBeSplitWithoutASpace Because It Really Is A Long Name, But In The Second Column We Can Split It'
         UNION ALL SELECT 'Another VeryLongOrganisationNameThatWillHaveToBeSplitWithoutASpaceButOnlyInTheSecondColumn, Because It Really Is A Long Name'
         UNION ALL SELECT 'AnotherVeryLongOrganisationNameThatWillHaveToBeSplitWithoutASpaceBecauseItReallyIsALongNameButNowItEvenExceedsTheLimitOfAllThreeColumnsWithAMaximumLenghtOf50Characters(WhichIsACombinedTotalOf150Characters)AndNowWeDon''tHaveAnythingToPutInTheLastBox'
         UNION ALL SELECT 'OneWordOnly'
         UNION ALL SELECT 'A' -- Single letter edge case
         UNION ALL SELECT '' -- Empty string edge case
       ) Data(OrganisationName);
SELECT *, LEN(Name1), LEN(Name2), LEN(Name3) FROM dbo.OrgStaging;

Context

StackExchange Code Review Q#82468, answer score: 2

Revisions (0)

No revisions yet.