HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Program falls over with larger input files

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
largerwithprograminputfilesfallsover

Problem

I have been translating a solution from C# into VB.Net, because I know more VB and wanted to understand it better. I have a version up and running that replicates the C# performance with small input files; however with an input file over ~8MB the memory usage skyrockets and it will keep chugging until it consumes all memory.
This is a problem, because the expected use case includes input files over 10MB, up to maybe 20MB.

I've run the code through the Visual Studio profiler on smaller files, and the below function is doing almost all the gruntwork (80%ish). The rest of the class is on paste-bin here if it helps.

The key question, am I doing anything obviously idiotic, that scales really badly, or is substantially different between VB.Net and C#? (C# code, for comparison, is here)

```
Public Function BuildSchedule() As Schedule
Dim cifSchedule = New Schedule()
Try
Dim tiploc_1 As Tiploc = Tiploc.GetInstance()
For i As Integer = 0 To _filelines.Count - 1
Dim stopLocation As String
Select Case _filelines(i).Substring(0, 2)
Case "BS"
If True Then
Dim passStops As String() = {"1", "2"}
If passStops.Contains(_filelines(i).Substring(32, 1)) = False OrElse _filelines(i).Substring(2, 1) = "D" Then
Dim [stop] As Integer = _filelines.Count()
Dim newi As Integer = 0
For j As Integer = i + 1 To [stop] - 1
If _filelines(j).Substring(0, 2) = "BS" Then
[stop] = j
newi = j - 1
End If
Next

i = newi
Exit Select
End If
BuildTrain(cifSchedule, i)
Exit Select
End If

Solution

The first thing that strikes me about the code (linked not present) is that the path is passed into the constructor and all the lines read - checking them for validity - and then we make a second call (BuildSchedule()) to process them. Which means we need to pass through our large file twice. I am also a bit dubious of the end-of-file check being used.

Class CifData
    Private ReadOnly _filelines As List(Of String)
    Public Sub New(filepath As String)
        _filelines = New List(Of String)()

        Try
            Dim validTypes As String() = {"BS", "BX", "LO", "LI", "LT"}

            Using CIFReader = New StreamReader(filepath)
                While CIFReader.Peek() <> -1
                    Dim line As String = CIFReader.ReadLine()

                    If line IsNot Nothing AndAlso validTypes.Contains(line.Substring(0, 2)) Then
                        _filelines.Add(line)
                    End If

                End While
            End Using
        Catch ex As Exception
            Throw New LoadException("Error: " & ex.Message & Environment.NewLine &
                                    "Please ensure a valid CIF is selected")
        End Try
    End Sub


Is the input file a text file with a single record on each line?

If so, we can simply read until line is null. Why the Peek?

That however is small, the large problem is reading all the lines and then processing all the lines. Is there a specific reason why the lines are read in the constructor and then processed in BuildSchedule() rather than reading them in BuildSchedule()?

As well as processing the file twice we also store the information twice, once as a string and then once as a schedule entry. There seems to be no need to store the string. Perhaps this shape might help.

Public Class CifData

    Public Function BuildSchedule(fileName As String) As Schedule

        Dim ret As New Schedule

        For Each line In ReadAllLines(fileName)
            ProcessLine(line, ret)
        Next

        Return ret

    End Function

    Private Sub ProcessLine(line As String, sch As Schedule)
        ' Add to schedule based on codes.
        ' ignore lines with invalid codes
    End Sub

    Public Iterator Function ReadAllLines(fileName As String) As IEnumerable(Of String)
        Using reader = New StreamReader(fileName)
            While (True)
                Dim line As String = reader.ReadLine()
                If line Is Nothing Then
                    Exit While
                End If
                Yield line
            End While
        End Using
    End Function
End Class


It should use less memory - we are not storing the strings from the file, just using them one at a time - and less time - we only iterate through the file once.

Code Snippets

Class CifData
    Private ReadOnly _filelines As List(Of String)
    Public Sub New(filepath As String)
        _filelines = New List(Of String)()

        Try
            Dim validTypes As String() = {"BS", "BX", "LO", "LI", "LT"}

            Using CIFReader = New StreamReader(filepath)
                While CIFReader.Peek() <> -1
                    Dim line As String = CIFReader.ReadLine()

                    If line IsNot Nothing AndAlso validTypes.Contains(line.Substring(0, 2)) Then
                        _filelines.Add(line)
                    End If

                End While
            End Using
        Catch ex As Exception
            Throw New LoadException("Error: " & ex.Message & Environment.NewLine &
                                    "Please ensure a valid CIF is selected")
        End Try
    End Sub
Public Class CifData

    Public Function BuildSchedule(fileName As String) As Schedule

        Dim ret As New Schedule

        For Each line In ReadAllLines(fileName)
            ProcessLine(line, ret)
        Next

        Return ret

    End Function

    Private Sub ProcessLine(line As String, sch As Schedule)
        ' Add to schedule based on codes.
        ' ignore lines with invalid codes
    End Sub


    Public Iterator Function ReadAllLines(fileName As String) As IEnumerable(Of String)
        Using reader = New StreamReader(fileName)
            While (True)
                Dim line As String = reader.ReadLine()
                If line Is Nothing Then
                    Exit While
                End If
                Yield line
            End While
        End Using
    End Function
End Class

Context

StackExchange Code Review Q#54520, answer score: 4

Revisions (0)

No revisions yet.