patternsqlModerate
Why is SSIS slow to enumerate over many files in a directory and import them?
Viewed 0 times
whydirectoryssisimportslowfilesmanyandthemover
Problem
I have a terribly slow SSIS package. It's quite fast with one file, and reasonably fast with 100 files or fewer. (About one second per file)
However, if my directory has thousands of (very small) files, the process drags on terribly slowly. My preference is to run this process only after business hours, but by waiting until then, the number of flat files to import is in the thousands.
The package is very simple:
That is it.
Performance with thousands of files is running 15 seconds or more per each file. The UI (status) is drawing/scrolling so slowly that I can't even see where it's at -- the stamped time is more than 15 hours old on an execution that was started 18 hours ago.
Version: MSSQL 2012
However, if my directory has thousands of (very small) files, the process drags on terribly slowly. My preference is to run this process only after business hours, but by waiting until then, the number of flat files to import is in the thousands.
The package is very simple:
- Outer Loop is For Every (file enumeration, read file path into variable)
- Inside, simply import without any transformation to the data
That is it.
Performance with thousands of files is running 15 seconds or more per each file. The UI (status) is drawing/scrolling so slowly that I can't even see where it's at -- the stamped time is more than 15 hours old on an execution that was started 18 hours ago.
Version: MSSQL 2012
Solution
I think you're running into a limitation of the UI/debugger.
I created two packages: MakeAllTheFiles and ReadAllTheFiles
MakeAllTheFiles accepts as input the number of files to be created. It will make use of pseudo-random function to distribute the data across a number (7) of sub folder.
MakeAllTheFiles
ReadAllTheFiles
The general appearance of the package is thus
I have two Connection Managers defined: One is to my database and the other is to a Flat File with an Expression on the ConnectionString property such that it uses my Variable
Variables, I like lots of Variables so there are plenty.
My Execute SQL Task simply stands up a table for me to write to, knocking it down if it already exists.
My Foreach Enumerator simply looks at everything in my Input folder based on the file mask of *.txt and traverses subfolders. The current file name is assigned to my variable @[User::CurrentFileName]`
The Data Flow is bog standard. The Derived Column Transformation there simply adds in the Current File Name variable into the data flow so I can record it in my table.
Analysis
I'm lazy and didn't want to do anything special to record processing times so I deployed my packages into the SSISDB catalog and ran them from there.
This query looks at the catalog data to find out how long the package ran, how many files it processed and then generates a running average for file count. Run 10047 was bad and was excluded from analysis.
The resulting data (gratuitous SQLFiddle)
Based on this sampling size, I see no appreciable difference between processing 100, 1000 or 10,000 files with SSIS as described her
I created two packages: MakeAllTheFiles and ReadAllTheFiles
MakeAllTheFiles accepts as input the number of files to be created. It will make use of pseudo-random function to distribute the data across a number (7) of sub folder.
MakeAllTheFiles
public void Main()
{
int NumberOfFilesToGenerate = (Int32)Dts.Variables["User::FilesToGenerate"].Value;
string baseFolder = Dts.Variables["User::FolderInput"].Value.ToString();
System.Random rand = null;
int fileRows = 0;
DateTime current = DateTime.Now;
int currentRandom = -1;
int seed = 0;
string folder = string.Empty;
string currentFile = string.Empty;
for (int i = 0; i < NumberOfFilesToGenerate; i++)
{
seed = i * current.Month * current.Day * current.Hour * current.Minute * current.Second;
rand = new Random(seed);
currentRandom = rand.Next();
// Create files in sub folders
folder = System.IO.Path.Combine(baseFolder, string.Format("f_{0}", currentRandom % 7));
// Create the folder if it does not exist
if (!System.IO.Directory.Exists(folder))
{
System.IO.Directory.CreateDirectory(folder);
}
currentFile = System.IO.Path.Combine(folder, string.Format("input_{0}.txt", currentRandom));
System.IO.FileInfo f = new FileInfo(currentFile);
using (System.IO.StreamWriter writer = f.CreateText())
{
int upperBound = rand.Next(50);
for (int row = 0; row < upperBound; row++)
{
if (row == 0)
{
writer.WriteLine(string.Format("{0}|{1}", "Col1", "Col2")); }
writer.WriteLine(string.Format("{0}|{1}", row, seed));
}
}
;
}
Dts.TaskResult = (int)ScriptResults.Success;
}ReadAllTheFiles
The general appearance of the package is thus
I have two Connection Managers defined: One is to my database and the other is to a Flat File with an Expression on the ConnectionString property such that it uses my Variable
@[User::CurrentFileName] Variables, I like lots of Variables so there are plenty.
My Execute SQL Task simply stands up a table for me to write to, knocking it down if it already exists.
IF EXISTS
(
SELECT * FROM sys.tables AS T WHERE T.name = 'dbase_54462' AND T.schema_id = SCHEMA_ID('dbo')
)
BEGIN
DROP TABLE dbo.dbase_54462;
END
CREATE TABLE
dbo.dbase_54462
(
CurrentFile varchar(256) NOT NULL
, Col1 int NOT NULL
, Col2 varchar(50) NOT NULL
, InsertDate datetime NOT NULL DEFAULT(CURRENT_TIMESTAMP)
);My Foreach Enumerator simply looks at everything in my Input folder based on the file mask of *.txt and traverses subfolders. The current file name is assigned to my variable @[User::CurrentFileName]`
The Data Flow is bog standard. The Derived Column Transformation there simply adds in the Current File Name variable into the data flow so I can record it in my table.
Analysis
I'm lazy and didn't want to do anything special to record processing times so I deployed my packages into the SSISDB catalog and ran them from there.
This query looks at the catalog data to find out how long the package ran, how many files it processed and then generates a running average for file count. Run 10047 was bad and was excluded from analysis.
SELECT
E.execution_id
, DATEDIFF(s, E.start_time, E.end_time) As duration_s
, ES.rc AS FilesProcessed
, AVG(ES.rc / (1.0 * DATEDIFF(s, E.start_time, E.end_time))) OVER (PARTITION BY ES.rc ORDER BY E.execution_id) AS running_average
FROM
catalog.executions As E
INNER JOIN
(
SELECT
MIN(ES.start_time) As start_time
, MAX(ES.end_time) AS end_time
, count(1) As rc
, ES.execution_id
FROm
catalog.executable_statistics AS ES
GROUP BY
ES.execution_id
) AS ES
ON ES.execution_id = E.execution_id
WHERE
E.package_name = 'ReadAllTheFiles.dtsx'
AND E.execution_id <> 10047
ORDER BY 1,2The resulting data (gratuitous SQLFiddle)
execution_id duration_s FilesProcessed running_average
10043 15 104 6.93333333333333
10044 13 104 7.46666666666666
10045 13 104 7.64444444444444
10050 102 1004 9.84313725490196
10051 101 1004 9.89186565715395
10052 102 1004 9.87562285640328
10053 106 1004 9.77464167060435
10055 1103 10004 9.06980961015412
10056 1065 10004 9.23161842010053
10057 1033 10004 9.38255038913446
10058 957 10004 9.65028792246735
10059 945 10004 9.83747901522255Based on this sampling size, I see no appreciable difference between processing 100, 1000 or 10,000 files with SSIS as described her
Code Snippets
public void Main()
{
int NumberOfFilesToGenerate = (Int32)Dts.Variables["User::FilesToGenerate"].Value;
string baseFolder = Dts.Variables["User::FolderInput"].Value.ToString();
System.Random rand = null;
int fileRows = 0;
DateTime current = DateTime.Now;
int currentRandom = -1;
int seed = 0;
string folder = string.Empty;
string currentFile = string.Empty;
for (int i = 0; i < NumberOfFilesToGenerate; i++)
{
seed = i * current.Month * current.Day * current.Hour * current.Minute * current.Second;
rand = new Random(seed);
currentRandom = rand.Next();
// Create files in sub folders
folder = System.IO.Path.Combine(baseFolder, string.Format("f_{0}", currentRandom % 7));
// Create the folder if it does not exist
if (!System.IO.Directory.Exists(folder))
{
System.IO.Directory.CreateDirectory(folder);
}
currentFile = System.IO.Path.Combine(folder, string.Format("input_{0}.txt", currentRandom));
System.IO.FileInfo f = new FileInfo(currentFile);
using (System.IO.StreamWriter writer = f.CreateText())
{
int upperBound = rand.Next(50);
for (int row = 0; row < upperBound; row++)
{
if (row == 0)
{
writer.WriteLine(string.Format("{0}|{1}", "Col1", "Col2")); }
writer.WriteLine(string.Format("{0}|{1}", row, seed));
}
}
;
}
Dts.TaskResult = (int)ScriptResults.Success;
}IF EXISTS
(
SELECT * FROM sys.tables AS T WHERE T.name = 'dbase_54462' AND T.schema_id = SCHEMA_ID('dbo')
)
BEGIN
DROP TABLE dbo.dbase_54462;
END
CREATE TABLE
dbo.dbase_54462
(
CurrentFile varchar(256) NOT NULL
, Col1 int NOT NULL
, Col2 varchar(50) NOT NULL
, InsertDate datetime NOT NULL DEFAULT(CURRENT_TIMESTAMP)
);SELECT
E.execution_id
, DATEDIFF(s, E.start_time, E.end_time) As duration_s
, ES.rc AS FilesProcessed
, AVG(ES.rc / (1.0 * DATEDIFF(s, E.start_time, E.end_time))) OVER (PARTITION BY ES.rc ORDER BY E.execution_id) AS running_average
FROM
catalog.executions As E
INNER JOIN
(
SELECT
MIN(ES.start_time) As start_time
, MAX(ES.end_time) AS end_time
, count(1) As rc
, ES.execution_id
FROm
catalog.executable_statistics AS ES
GROUP BY
ES.execution_id
) AS ES
ON ES.execution_id = E.execution_id
WHERE
E.package_name = 'ReadAllTheFiles.dtsx'
AND E.execution_id <> 10047
ORDER BY 1,2execution_id duration_s FilesProcessed running_average
10043 15 104 6.93333333333333
10044 13 104 7.46666666666666
10045 13 104 7.64444444444444
10050 102 1004 9.84313725490196
10051 101 1004 9.89186565715395
10052 102 1004 9.87562285640328
10053 106 1004 9.77464167060435
10055 1103 10004 9.06980961015412
10056 1065 10004 9.23161842010053
10057 1033 10004 9.38255038913446
10058 957 10004 9.65028792246735
10059 945 10004 9.83747901522255Context
StackExchange Database Administrators Q#54462, answer score: 10
Revisions (0)
No revisions yet.