HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Why is SSIS slow to enumerate over many files in a directory and import them?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
whydirectoryssisimportslowfilesmanyandthemover

Problem

I have a terribly slow SSIS package. It's quite fast with one file, and reasonably fast with 100 files or fewer. (About one second per file)

However, if my directory has thousands of (very small) files, the process drags on terribly slowly. My preference is to run this process only after business hours, but by waiting until then, the number of flat files to import is in the thousands.

The package is very simple:

  • Outer Loop is For Every (file enumeration, read file path into variable)



  • Inside, simply import without any transformation to the data



That is it.

Performance with thousands of files is running 15 seconds or more per each file. The UI (status) is drawing/scrolling so slowly that I can't even see where it's at -- the stamped time is more than 15 hours old on an execution that was started 18 hours ago.

Version: MSSQL 2012

Solution

I think you're running into a limitation of the UI/debugger.

I created two packages: MakeAllTheFiles and ReadAllTheFiles

MakeAllTheFiles accepts as input the number of files to be created. It will make use of pseudo-random function to distribute the data across a number (7) of sub folder.

MakeAllTheFiles

public void Main()
    {
        int NumberOfFilesToGenerate = (Int32)Dts.Variables["User::FilesToGenerate"].Value;
        string baseFolder = Dts.Variables["User::FolderInput"].Value.ToString();
        System.Random rand = null;
        int fileRows = 0;
        DateTime current = DateTime.Now;
        int currentRandom = -1;
        int seed = 0;
        string folder = string.Empty;
        string currentFile = string.Empty;

        for (int i = 0; i < NumberOfFilesToGenerate; i++)
        {
            seed = i * current.Month * current.Day * current.Hour * current.Minute * current.Second;
            rand = new Random(seed);
            currentRandom = rand.Next();

            // Create files in sub folders
            folder = System.IO.Path.Combine(baseFolder, string.Format("f_{0}", currentRandom % 7));

            // Create the folder if it does not exist
            if (!System.IO.Directory.Exists(folder))
            {
                System.IO.Directory.CreateDirectory(folder);
            }

            currentFile = System.IO.Path.Combine(folder, string.Format("input_{0}.txt", currentRandom));

            System.IO.FileInfo f = new FileInfo(currentFile);
            using (System.IO.StreamWriter writer = f.CreateText())
            {
                int upperBound = rand.Next(50);
                for (int row = 0; row < upperBound; row++)
                {
                    if (row == 0)
                    {
                        writer.WriteLine(string.Format("{0}|{1}", "Col1", "Col2"));                        }

                    writer.WriteLine(string.Format("{0}|{1}", row, seed));
                }
            }
            ;
        }
        Dts.TaskResult = (int)ScriptResults.Success;
    }


ReadAllTheFiles

The general appearance of the package is thus

I have two Connection Managers defined: One is to my database and the other is to a Flat File with an Expression on the ConnectionString property such that it uses my Variable @[User::CurrentFileName]

Variables, I like lots of Variables so there are plenty.

My Execute SQL Task simply stands up a table for me to write to, knocking it down if it already exists.

IF EXISTS
(
    SELECT * FROM sys.tables AS T WHERE T.name = 'dbase_54462' AND T.schema_id = SCHEMA_ID('dbo')
)
BEGIN
    DROP TABLE dbo.dbase_54462;
END

CREATE TABLE
    dbo.dbase_54462
(
    CurrentFile varchar(256) NOT NULL
,   Col1 int NOT NULL
,   Col2 varchar(50) NOT NULL
,   InsertDate datetime NOT NULL DEFAULT(CURRENT_TIMESTAMP)
);


My Foreach Enumerator simply looks at everything in my Input folder based on the file mask of *.txt and traverses subfolders. The current file name is assigned to my variable @[User::CurrentFileName]`

The Data Flow is bog standard. The Derived Column Transformation there simply adds in the Current File Name variable into the data flow so I can record it in my table.

Analysis

I'm lazy and didn't want to do anything special to record processing times so I deployed my packages into the SSISDB catalog and ran them from there.

This query looks at the catalog data to find out how long the package ran, how many files it processed and then generates a running average for file count. Run 10047 was bad and was excluded from analysis.

SELECT
    E.execution_id
,   DATEDIFF(s, E.start_time, E.end_time) As duration_s
,   ES.rc AS FilesProcessed
,   AVG(ES.rc / (1.0 * DATEDIFF(s, E.start_time, E.end_time))) OVER (PARTITION BY ES.rc ORDER BY E.execution_id) AS running_average
FROM
    catalog.executions As E
    INNER JOIN
    (
        SELECT
            MIN(ES.start_time) As start_time
        ,   MAX(ES.end_time) AS end_time
        ,   count(1) As rc
        ,   ES.execution_id
        FROm
            catalog.executable_statistics AS ES
        GROUP BY
            ES.execution_id
    ) AS ES 
    ON ES.execution_id = E.execution_id
WHERE
    E.package_name = 'ReadAllTheFiles.dtsx'
    AND E.execution_id <> 10047
ORDER BY 1,2


The resulting data (gratuitous SQLFiddle)

execution_id    duration_s  FilesProcessed  running_average
10043   15  104 6.93333333333333
10044   13  104 7.46666666666666
10045   13  104 7.64444444444444
10050   102 1004    9.84313725490196
10051   101 1004    9.89186565715395
10052   102 1004    9.87562285640328
10053   106 1004    9.77464167060435
10055   1103    10004   9.06980961015412
10056   1065    10004   9.23161842010053
10057   1033    10004   9.38255038913446
10058   957 10004   9.65028792246735
10059   945 10004   9.83747901522255


Based on this sampling size, I see no appreciable difference between processing 100, 1000 or 10,000 files with SSIS as described her

Code Snippets

public void Main()
    {
        int NumberOfFilesToGenerate = (Int32)Dts.Variables["User::FilesToGenerate"].Value;
        string baseFolder = Dts.Variables["User::FolderInput"].Value.ToString();
        System.Random rand = null;
        int fileRows = 0;
        DateTime current = DateTime.Now;
        int currentRandom = -1;
        int seed = 0;
        string folder = string.Empty;
        string currentFile = string.Empty;

        for (int i = 0; i < NumberOfFilesToGenerate; i++)
        {
            seed = i * current.Month * current.Day * current.Hour * current.Minute * current.Second;
            rand = new Random(seed);
            currentRandom = rand.Next();

            // Create files in sub folders
            folder = System.IO.Path.Combine(baseFolder, string.Format("f_{0}", currentRandom % 7));

            // Create the folder if it does not exist
            if (!System.IO.Directory.Exists(folder))
            {
                System.IO.Directory.CreateDirectory(folder);
            }

            currentFile = System.IO.Path.Combine(folder, string.Format("input_{0}.txt", currentRandom));

            System.IO.FileInfo f = new FileInfo(currentFile);
            using (System.IO.StreamWriter writer = f.CreateText())
            {
                int upperBound = rand.Next(50);
                for (int row = 0; row < upperBound; row++)
                {
                    if (row == 0)
                    {
                        writer.WriteLine(string.Format("{0}|{1}", "Col1", "Col2"));                        }

                    writer.WriteLine(string.Format("{0}|{1}", row, seed));
                }
            }
            ;
        }
        Dts.TaskResult = (int)ScriptResults.Success;
    }
IF EXISTS
(
    SELECT * FROM sys.tables AS T WHERE T.name = 'dbase_54462' AND T.schema_id = SCHEMA_ID('dbo')
)
BEGIN
    DROP TABLE dbo.dbase_54462;
END

CREATE TABLE
    dbo.dbase_54462
(
    CurrentFile varchar(256) NOT NULL
,   Col1 int NOT NULL
,   Col2 varchar(50) NOT NULL
,   InsertDate datetime NOT NULL DEFAULT(CURRENT_TIMESTAMP)
);
SELECT
    E.execution_id
,   DATEDIFF(s, E.start_time, E.end_time) As duration_s
,   ES.rc AS FilesProcessed
,   AVG(ES.rc / (1.0 * DATEDIFF(s, E.start_time, E.end_time))) OVER (PARTITION BY ES.rc ORDER BY E.execution_id) AS running_average
FROM
    catalog.executions As E
    INNER JOIN
    (
        SELECT
            MIN(ES.start_time) As start_time
        ,   MAX(ES.end_time) AS end_time
        ,   count(1) As rc
        ,   ES.execution_id
        FROm
            catalog.executable_statistics AS ES
        GROUP BY
            ES.execution_id
    ) AS ES 
    ON ES.execution_id = E.execution_id
WHERE
    E.package_name = 'ReadAllTheFiles.dtsx'
    AND E.execution_id <> 10047
ORDER BY 1,2
execution_id    duration_s  FilesProcessed  running_average
10043   15  104 6.93333333333333
10044   13  104 7.46666666666666
10045   13  104 7.64444444444444
10050   102 1004    9.84313725490196
10051   101 1004    9.89186565715395
10052   102 1004    9.87562285640328
10053   106 1004    9.77464167060435
10055   1103    10004   9.06980961015412
10056   1065    10004   9.23161842010053
10057   1033    10004   9.38255038913446
10058   957 10004   9.65028792246735
10059   945 10004   9.83747901522255

Context

StackExchange Database Administrators Q#54462, answer score: 10

Revisions (0)

No revisions yet.