HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlMinor

SSIS - How to parse extensions from path-filename strings

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
pathfilenameextensionsssisparsehowfromstrings

Problem

I'm importing a large set (126M rows) of file data from CSVs. Import works fine, except I'm trying to derive a column as I import rather than trying to script it out afterward (due to the number of rows).
SSIS is not terribly helpful with string manipulation functions, and the possibilities I have are pretty complicated as well.

The derived column needs to reflect the file extensions. File extensions range from 1-10 characters (anything longer can be truncated at 10), and are (of course) separated from the file name and path by a period. They don't correspond to a concise list of file types (something like "docx, xlsx, accdb, msg" are Office types). The file path may also have periods in it. And in some cases it will not have files at the end of the path, just a folder name itself.
Some examples:

\\Server\Share\named resource
\\Server\Share\this.looks.like.a.file_but.it.isnt\its_a_directory
\\Server\Share\Group\Year.Month.Day.Subfolder\File.ext
\\Server\Share\Team Folder 1.404\another.file.here.extension


So I need to be able to extract the extensions up to 10 characters.
I tried TOKEN() but with folder names possibly having periods, that didn't work well - adding \ as a token delimiter helped some, but I still got odd extensions from folder names.
I can't find a combination of SUBSTRING(), RIGHT(), FINDSTRING(), and/or TOKEN() that will meet the rules and the derived column definition doesn't allow logic like IF or IIF.

Some false positives are expected and I plan to sort them out after importing. If it helps here, I have a second column in the CSVs which is extracted by SSIS, it's the size of the file (for folders it populates a 0). I haven't gotten this to matter either, because of the lack of IF or IIF in derived column definition.

Solution

Firstly, the IF/IIF you are looking for is the "?" Conditional Operator

? Conditional

One way to implement the Derived Column would be to parse the string like this

TOKENCOUNT( TOKEN( REVERSE( RTRIM( name ) ), "\\", 1), "." ) == 1
 ? "" : REVERSE( TOKEN( TOKEN( REVERSE( RTRIM( name ) ), "\\", 1), ".", 1) )


In this case, "name" is the column that holds the string to be parsed.

What we are doing here is:

  • Remove any trailing white-space with RTRIM( name )



  • Reverse the string so we are dealing with the end as the beginning with REVERSE()



  • Grab the first "token" delimited by a backslash using TOKEN()



  • Count how many "." delimiters are in the result using TOKENCOUNT()



  • Test the count of "." delimited tokens is equal to 1 using ?



  • If the count is 1, return an empty string - no extension.



  • Otherwise return the first token before the "." using TOKEN()



  • Reverse the extension to get back to "normal" using REVERSE()

Code Snippets

TOKENCOUNT( TOKEN( REVERSE( RTRIM( name ) ), "\\", 1), "." ) == 1
 ? "" : REVERSE( TOKEN( TOKEN( REVERSE( RTRIM( name ) ), "\\", 1), ".", 1) )

Context

StackExchange Database Administrators Q#128663, answer score: 2

Revisions (0)

No revisions yet.