Delimited File Reader
Problem
UPDATE: I have refactored the code into a Gist using @Dmitry's answer as a guide. The update is much simpler to grok, implements IDisposable, and is roughly thirty lines shorter.
I wrote this over the weekend for fun and am looking for critique. Style and readability comments are welcome but what I truly need to know is:
- Does it function as advertised?
- Are there any lingering bugs that I've missed?
- Can you come up with a way to make it faster?
When I ask these of myself I get 1 = yes, 2 = no, and 3 = maaaaaybe. I'd like to add other features like skipping the header row, inferring data types, validating field counts, etc. but I'll be tackling that kind of thing via derivation or extension since such logic will be simpler to implement if based on an existing IEnumerable> like this one.
FLAME ON;
Usage:
```
foreach (var row in DelimitedReader.Create(fileName)) {
    foreach (var field in row) {
        // do stuff
    }
}
```
Features:
- Accurate: RFC4180 Compliant
- Efficient: memory usage is (roughly) equal to the size of the largest row
- Fast: average throughput of ~25 megabytes per second
- Flexible: the default encoding and separator/escape characters can be user-defined
- Lightweight: single 160 line class with no external dependencies
Code:
```
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Text;
namespace ByteTerrace
{
    public class DelimitedReader : IEnumerable>
    {
        private const int DEFAULT_CHUNK_SIZE = 128;
        private const char DEFAULT_ESCAPE_CHAR = '"';
        private const char DEFAULT_SEPARATOR_CHAR = ',';
        private readonly char[] m_buffer;
        private readonly Encoding m_encoding;
        private readonly char m_escapeChar;
        private readonly string m_fileName;
        private readonly char m_separatorChar;
        public char[] Buffer {
            get {
                return m_buffer;
            }
        }
        public Enc
```
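The "derivation or extension" idea mentioned above could look roughly like the sketch below. It assumes rows are exposed as `string[]` (that is, a reader implementing `IEnumerable<string[]>`). The class `DelimitedReaderExtensions` and the methods `SkipHeader` and `ValidateFieldCount` are hypothetical names, not part of the posted code.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helpers; none of these names appear in the original post.
public static class DelimitedReaderExtensions
{
    // Drops the first row, treating it as a header.
    public static IEnumerable<string[]> SkipHeader(this IEnumerable<string[]> rows)
    {
        return rows.Skip(1);
    }

    // Lazily checks that every row has the expected number of fields.
    // rowNumber counts rows as seen by this method, so it is relative to
    // whatever earlier steps (such as SkipHeader) have already removed.
    public static IEnumerable<string[]> ValidateFieldCount(this IEnumerable<string[]> rows, int expected)
    {
        int rowNumber = 0;
        foreach (var row in rows)
        {
            rowNumber++;
            if (row.Length != expected)
            {
                throw new InvalidDataException(
                    "Row " + rowNumber + " has " + row.Length + " fields; expected " + expected + ".");
            }
            yield return row;
        }
    }
}
```

With any `IEnumerable<string[]>` named `reader`, a call such as `reader.SkipHeader().ValidateFieldCount(3)` can be used directly in a `foreach`. Both steps are lazy, so they should not change the one-row-at-a-time memory profile.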
Solution
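I'd prefer to rely on built-in functionality as much as possible. I want to believe that using the built-in stuff makes my code more readable and probably faster.
So my proposal is:
```
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Text;

public class DelimitedReader : IEnumerable<string[]>, IDisposable
{
    private readonly StreamReader reader;
    public DelimitedReader(string fileName, Encoding encoding = null)
        : this(new StreamReader(new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite),
            encoding ?? Encoding.UTF8, encoding == null))
    {
    }
    public DelimitedReader(StreamReader reader)
    {
        this.reader = reader;
    }
    public void Dispose()
    {
        reader.Dispose();
    }
    public char EscapeChar { get; set; } = '"';
    public char SeparatorChar { get; set; } = ',';
    private string[] ParseLine(string line)
    {
        List<string> fields = new List<string>();
        char[] charsToSeek = { EscapeChar, SeparatorChar };
        bool isEscaped = false;
        int prevPos = 0;
        while (prevPos < line.Length)
        {
            // If in the escaped mode, seek for the escape char only.
            // Otherwise, seek for the both chars.
            int nextPos = isEscaped
                ? line.IndexOf(EscapeChar, prevPos)
                : line.IndexOfAny(charsToSeek, prevPos);
            if (nextPos == -1)
            {
                // We reached the end of the line
                if (!isEscaped)
                {
                    // Add the rest of the line
                    fields.Add(line.Substring(prevPos, line.Length - prevPos).Trim());
                    break;
                }
                // If there is no closing escape char
                throw new InvalidDataException("The following line has invalid format: " + line);
            }
            char nextChar = line[nextPos];
            if (nextChar == EscapeChar)
            {
                // The next char is the escape char
                if (isEscaped)
                {
                    // If already in the escaped mode
                    fields.Add(line.Substring(prevPos, nextPos - prevPos)); // No Trim
                }
                isEscaped = !isEscaped; // Toggle mode
            }
            else
            {
                // The next char is the delimiter
                fields.Add(line.Substring(prevPos, nextPos - prevPos).Trim()); // Trim
            }
            prevPos = nextPos + 1;
        }
        return fields.ToArray();
    }
    public IEnumerator<string[]> GetEnumerator()
    {
        while (!reader.EndOfStream)
        {
            yield return ParseLine(reader.ReadLine());
        }
    }
    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```
In the class above I use the StreamReader.ReadLine method to read a file line by line, and the String.IndexOf/String.IndexOfAny methods to move within the line.
According to my test runs, this approach is a bit faster.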
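RFC 4180 represents a quote inside a quoted field by doubling it, and it allows empty fields, including a trailing one. From reading ParseLine, inputs such as `"a""b"`, `"a",b` and `a,b,` look like they would not come out the way the RFC describes, so here is a hedged sketch of a single-line splitter that covers those cases. `CsvLineSplitter` and `Split` are hypothetical names. The sketch assumes a record never spans lines, and unlike ParseLine it does not trim unquoted fields.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the answer's class.
public static class CsvLineSplitter
{
    public static string[] Split(string line, char separator = ',', char quote = '"')
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;
        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != quote)
                {
                    current.Append(c);
                }
                else if (i + 1 < line.Length && line[i + 1] == quote)
                {
                    current.Append(quote); // doubled quote becomes one literal quote
                    i++;                   // skip the second quote of the pair
                }
                else
                {
                    inQuotes = false;      // closing quote
                }
            }
            else if (c == quote)
            {
                inQuotes = true;           // any quote outside quoted mode opens it
            }
            else if (c == separator)
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        if (inQuotes)
        {
            throw new InvalidDataException("Unterminated quoted field: " + line);
        }
        fields.Add(current.ToString());    // last field, possibly empty
        return fields.ToArray();
    }
}
```

For example, `CsvLineSplitter.Split("\"a\"\"b\",c,")` yields three fields: `a"b`, `c`, and an empty string. Dropping something like this in place of ParseLine would leave the ReadLine-based loop unchanged, though quoted fields that contain line breaks would still need the reader to join lines before splitting.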
Code Snippets
```
public class DelimitedReader : IEnumerable<string[]>, IDisposable
{
    private readonly StreamReader reader;
    public DelimitedReader(string fileName, Encoding encoding = null)
        : this(new StreamReader(new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite),
            encoding ?? Encoding.UTF8, encoding == null))
    {
    }
    public DelimitedReader(StreamReader reader)
    {
        this.reader = reader;
    }
    public void Dispose()
    {
        reader.Dispose();
    }
    public char EscapeChar { get; set; } = '"';
    public char SeparatorChar { get; set; } = ',';
    private string[] ParseLine(string line)
    {
        List<string> fields = new List<string>();
        char[] charsToSeek = { EscapeChar, SeparatorChar };
        bool isEscaped = false;
        int prevPos = 0;
        while (prevPos < line.Length)
        {
            // If in the escaped mode, seek for the escape char only.
            // Otherwise, seek for the both chars.
            int nextPos = isEscaped
                ? line.IndexOf(EscapeChar, prevPos)
                : line.IndexOfAny(charsToSeek, prevPos);
            if (nextPos == -1)
            {
                // We reached the end of the line
                if (!isEscaped)
                {
                    // Add the rest of the line
                    fields.Add(line.Substring(prevPos, line.Length - prevPos).Trim());
                    break;
                }
                // If there is no closing escape char
                throw new InvalidDataException("The following line has invalid format: " + line);
            }
            char nextChar = line[nextPos];
            if (nextChar == EscapeChar)
            {
                // The next char is the escape char
                if (isEscaped)
                {
                    // If already in the escaped mode
                    fields.Add(line.Substring(prevPos, nextPos - prevPos)); // No Trim
                }
                isEscaped = !isEscaped; // Toggle mode
            }
            else
            {
                // The next char is the delimiter
                fields.Add(line.Substring(prevPos, nextPos - prevPos).Trim()); // Trim
            }
            prevPos = nextPos + 1;
        }
        return fields.ToArray();
    }
    public IEnumerator<string[]> GetEnumerator()
    {
        while (!reader.EndOfStream)
        {
            yield return ParseLine(reader.ReadLine());
        }
    }
    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```
Context
StackExchange Code Review Q#145860, answer score: 2