patternMinor
Program falls over with larger input files
Viewed 0 times
largerwithprograminputfilesfallsover
Problem
I have been translating a solution from C# into VB.Net, because I know more VB and wanted to understand it better. I have a version up and running that replicates the C# performance with small input files; however with an input file over ~8MB the memory usage skyrockets and it will keep chugging until it consumes all memory.
This is a problem, because the expected use case includes input files over 10MB, up to maybe 20MB.
I've run the code through the Visual Studio profiler on smaller files, and the below function is doing almost all the gruntwork (80%ish). The rest of the class is on paste-bin here if it helps.
The key question, am I doing anything obviously idiotic, that scales really badly, or is substantially different between VB.Net and C#? (C# code, for comparison, is here)
```
Public Function BuildSchedule() As Schedule
Dim cifSchedule = New Schedule()
Try
Dim tiploc_1 As Tiploc = Tiploc.GetInstance()
For i As Integer = 0 To _filelines.Count - 1
Dim stopLocation As String
Select Case _filelines(i).Substring(0, 2)
Case "BS"
If True Then
Dim passStops As String() = {"1", "2"}
If passStops.Contains(_filelines(i).Substring(32, 1)) = False OrElse _filelines(i).Substring(2, 1) = "D" Then
Dim [stop] As Integer = _filelines.Count()
Dim newi As Integer = 0
For j As Integer = i + 1 To [stop] - 1
If _filelines(j).Substring(0, 2) = "BS" Then
[stop] = j
newi = j - 1
End If
Next
i = newi
Exit Select
End If
BuildTrain(cifSchedule, i)
Exit Select
End If
This is a problem, because the expected use case includes input files over 10MB, up to maybe 20MB.
I've run the code through the Visual Studio profiler on smaller files, and the below function is doing almost all the gruntwork (80%ish). The rest of the class is on paste-bin here if it helps.
The key question, am I doing anything obviously idiotic, that scales really badly, or is substantially different between VB.Net and C#? (C# code, for comparison, is here)
```
Public Function BuildSchedule() As Schedule
Dim cifSchedule = New Schedule()
Try
Dim tiploc_1 As Tiploc = Tiploc.GetInstance()
For i As Integer = 0 To _filelines.Count - 1
Dim stopLocation As String
Select Case _filelines(i).Substring(0, 2)
Case "BS"
If True Then
Dim passStops As String() = {"1", "2"}
If passStops.Contains(_filelines(i).Substring(32, 1)) = False OrElse _filelines(i).Substring(2, 1) = "D" Then
Dim [stop] As Integer = _filelines.Count()
Dim newi As Integer = 0
For j As Integer = i + 1 To [stop] - 1
If _filelines(j).Substring(0, 2) = "BS" Then
[stop] = j
newi = j - 1
End If
Next
i = newi
Exit Select
End If
BuildTrain(cifSchedule, i)
Exit Select
End If
Solution
The first thing that strikes me about the code (linked not present) is that the path is passed into the constructor and all the lines read - checking them for validity - and then we make a second call (
Is the input file a text file with a single record on each line?
If so, we can simply read until line is null. Why the Peek?
That however is small, the large problem is reading all the lines and then processing all the lines. Is there a specific reason why the lines are read in the constructor and then processed in
As well as processing the file twice we also store the information twice, once as a string and then once as a schedule entry. There seems to be no need to store the string. Perhaps this shape might help.
It should use less memory - we are not storing the strings from the file, just using them one at a time - and less time - we only iterate through the file once.
BuildSchedule()) to process them. Which means we need to pass through our large file twice. I am also a bit dubious of the end-of-file check being used. Class CifData
Private ReadOnly _filelines As List(Of String)
Public Sub New(filepath As String)
_filelines = New List(Of String)()
Try
Dim validTypes As String() = {"BS", "BX", "LO", "LI", "LT"}
Using CIFReader = New StreamReader(filepath)
While CIFReader.Peek() <> -1
Dim line As String = CIFReader.ReadLine()
If line IsNot Nothing AndAlso validTypes.Contains(line.Substring(0, 2)) Then
_filelines.Add(line)
End If
End While
End Using
Catch ex As Exception
Throw New LoadException("Error: " & ex.Message & Environment.NewLine &
"Please ensure a valid CIF is selected")
End Try
End SubIs the input file a text file with a single record on each line?
If so, we can simply read until line is null. Why the Peek?
That however is small, the large problem is reading all the lines and then processing all the lines. Is there a specific reason why the lines are read in the constructor and then processed in
BuildSchedule() rather than reading them in BuildSchedule()?As well as processing the file twice we also store the information twice, once as a string and then once as a schedule entry. There seems to be no need to store the string. Perhaps this shape might help.
Public Class CifData
Public Function BuildSchedule(fileName As String) As Schedule
Dim ret As New Schedule
For Each line In ReadAllLines(fileName)
ProcessLine(line, ret)
Next
Return ret
End Function
Private Sub ProcessLine(line As String, sch As Schedule)
' Add to schedule based on codes.
' ignore lines with invalid codes
End Sub
Public Iterator Function ReadAllLines(fileName As String) As IEnumerable(Of String)
Using reader = New StreamReader(fileName)
While (True)
Dim line As String = reader.ReadLine()
If line Is Nothing Then
Exit While
End If
Yield line
End While
End Using
End Function
End ClassIt should use less memory - we are not storing the strings from the file, just using them one at a time - and less time - we only iterate through the file once.
Code Snippets
Class CifData
Private ReadOnly _filelines As List(Of String)
Public Sub New(filepath As String)
_filelines = New List(Of String)()
Try
Dim validTypes As String() = {"BS", "BX", "LO", "LI", "LT"}
Using CIFReader = New StreamReader(filepath)
While CIFReader.Peek() <> -1
Dim line As String = CIFReader.ReadLine()
If line IsNot Nothing AndAlso validTypes.Contains(line.Substring(0, 2)) Then
_filelines.Add(line)
End If
End While
End Using
Catch ex As Exception
Throw New LoadException("Error: " & ex.Message & Environment.NewLine &
"Please ensure a valid CIF is selected")
End Try
End SubPublic Class CifData
Public Function BuildSchedule(fileName As String) As Schedule
Dim ret As New Schedule
For Each line In ReadAllLines(fileName)
ProcessLine(line, ret)
Next
Return ret
End Function
Private Sub ProcessLine(line As String, sch As Schedule)
' Add to schedule based on codes.
' ignore lines with invalid codes
End Sub
Public Iterator Function ReadAllLines(fileName As String) As IEnumerable(Of String)
Using reader = New StreamReader(fileName)
While (True)
Dim line As String = reader.ReadLine()
If line Is Nothing Then
Exit While
End If
Yield line
End While
End Using
End Function
End ClassContext
StackExchange Code Review Q#54520, answer score: 4
Revisions (0)
No revisions yet.