snippetsqlMinor
SSIS - How to parse extensions from path-filename strings
Viewed 0 times
pathfilenameextensionsssisparsehowfromstrings
Problem
I'm importing a large set (126M rows) of file data from CSVs. Import works fine, except I'm trying to derive a column as I import rather than trying to script it out afterward (due to the number of rows).
SSIS is not terribly helpful with string manipulation functions, and the possibilities I have are pretty complicated as well.
The derived column needs to reflect the file extensions. File extensions range from 1-10 characters (anything longer can be truncated at 10), and are (of course) separated from the file name and path by a period. They don't correspond to a concise list of file types (something like "docx, xlsx, accdb, msg" are Office types). The file path may also have periods in it. And in some cases it will not have files at the end of the path, just a folder name itself.
Some examples:
So I need to be able to extract the extensions up to 10 characters.
I tried TOKEN() but with folder names possibly having periods, that didn't work well - adding
I can't find a combination of SUBSTRING(), RIGHT(), FINDSTRING(), and/or TOKEN() that will meet the rules and the derived column definition doesn't allow logic like IF or IIF.
Some false positives are expected and I plan to sort them out after importing. If it helps here, I have a second column in the CSVs which is extracted by SSIS, it's the size of the file (for folders it populates a 0). I haven't gotten this to matter either, because of the lack of IF or IIF in derived column definition.
SSIS is not terribly helpful with string manipulation functions, and the possibilities I have are pretty complicated as well.
The derived column needs to reflect the file extensions. File extensions range from 1-10 characters (anything longer can be truncated at 10), and are (of course) separated from the file name and path by a period. They don't correspond to a concise list of file types (something like "docx, xlsx, accdb, msg" are Office types). The file path may also have periods in it. And in some cases it will not have files at the end of the path, just a folder name itself.
Some examples:
\\Server\Share\named resource
\\Server\Share\this.looks.like.a.file_but.it.isnt\its_a_directory
\\Server\Share\Group\Year.Month.Day.Subfolder\File.ext
\\Server\Share\Team Folder 1.404\another.file.here.extensionSo I need to be able to extract the extensions up to 10 characters.
I tried TOKEN() but with folder names possibly having periods, that didn't work well - adding
\ as a token delimiter helped some, but I still got odd extensions from folder names.I can't find a combination of SUBSTRING(), RIGHT(), FINDSTRING(), and/or TOKEN() that will meet the rules and the derived column definition doesn't allow logic like IF or IIF.
Some false positives are expected and I plan to sort them out after importing. If it helps here, I have a second column in the CSVs which is extracted by SSIS, it's the size of the file (for folders it populates a 0). I haven't gotten this to matter either, because of the lack of IF or IIF in derived column definition.
Solution
Firstly, the IF/IIF you are looking for is the "?" Conditional Operator
? Conditional
One way to implement the Derived Column would be to parse the string like this
In this case, "name" is the column that holds the string to be parsed.
What we are doing here is:
? Conditional
One way to implement the Derived Column would be to parse the string like this
TOKENCOUNT( TOKEN( REVERSE( RTRIM( name ) ), "\\", 1), "." ) == 1
? "" : REVERSE( TOKEN( TOKEN( REVERSE( RTRIM( name ) ), "\\", 1), ".", 1) )In this case, "name" is the column that holds the string to be parsed.
What we are doing here is:
- Remove any trailing white-space with
RTRIM( name )
- Reverse the string so we are dealing with the end as the beginning with
REVERSE()
- Grab the first "token" delimited by a backslash using
TOKEN()
- Count how many "." delimiters are in the result using
TOKENCOUNT()
- Test the count of "." delimited tokens is equal to 1 using
?
- If the count is 1, return an empty string - no extension.
- Otherwise return the first token before the "." using
TOKEN()
- Reverse the extension to get back to "normal" using
REVERSE()
Code Snippets
TOKENCOUNT( TOKEN( REVERSE( RTRIM( name ) ), "\\", 1), "." ) == 1
? "" : REVERSE( TOKEN( TOKEN( REVERSE( RTRIM( name ) ), "\\", 1), ".", 1) )Context
StackExchange Database Administrators Q#128663, answer score: 2
Revisions (0)
No revisions yet.