Parse text, break into sentences, break into words, output as XML/CSV
Problem
The task is to:
- parse the text and break it into sentences
- break the sentences into words
- output the sorted words as XML or CSV
- allow some whitespace around words and delimiters
This is a task I was given for a job interview. I'm more concerned with maintainability, readability, and general advice than with performance, but any suggestions are welcome (maybe I've made some serious and obvious performance mistakes).
Besides implementing the parser and writer, I had to expose it in an ASP.NET MVC web application (using Web API).
The whole code is available here: https://github.com/inwenis/NorParser
Input:
Mary had a little lamb. Aesop and.
Expected XML format:
a
had
lamb
little
Mary
Aesop
and
Expected CSV format:
, Word 1, Word 2, Word 3, Word 4, Word 5, Word 6, Word 7, Word 8
Sentence 1, a, had, lamb, little, Mary
Sentence 2, Aesop, and
Sentence.cs:
```
using System.Collections.Generic;

namespace NorParser
{
    public class Sentence
    {
        public List<string> Words { get; set; }
    }
}
```
Parser.cs:
```
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

namespace NorParser
{
    public class Parser
    {
        private readonly char[] _sentenceSeparators = { '.' };

        public List<Sentence> Parse(string input)
        {
            var parsedSentences = new List<Sentence>();
            var sentences = input.Split(_sentenceSeparators, StringSplitOptions.RemoveEmptyEntries);
            foreach (var sentence in sentences)
            {
                var words = ReplaceCharactersNotAllowedInWordsWithSpaces(sentence)
                    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                    .Select(RemoveLeadingHyphen)
                    .Select(RemoveTrailingHyphen)
                    .Select(RemoveLeadingApostrophe)
                    .Where(w => !string.IsNullOrWhiteSpace(w))
                    .Where(w
```
Solution
```
private readonly char[] _sentenceSeparators = { '.' };
```
What happened to the `?` and `!`? Those are sentence separators too.
```
private string RemoveLeadingHyphen(string o)
{
    return Regex.Replace(o, "^-+|-+$", "");
}
```
This is supposed to remove only leading hyphens, but it removes trailing ones too.
I would use only one method, like `TrimSpecialCharacters`, and only one regex:
```
^[-']+|[-']+$
```
As a matter of fact, you can make the entire `Parse` method a single LINQ expression:
```
public IEnumerable<Sentence> Parse(string input)
{
    return
        (input ?? throw new ArgumentNullException(nameof(input)))
        .Split(_sentenceSeparators, StringSplitOptions.RemoveEmptyEntries)
        .Select(sentence => new Sentence
        {
            Words =
                ReplaceCharactersNotAllowedInWordsWithSpaces(sentence)
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(RemoveLeadingHyphen)
                .Select(RemoveTrailingHyphen)
                .Select(RemoveLeadingApostrophe)
                .Where(w => !string.IsNullOrWhiteSpace(w))
                .Where(w => !w.All(char.IsPunctuation))
                .OrderBy(s => s)
                .ToList()
        }).Where(sentence => sentence.Words.Any());
}
```
The `CsvWriter` is not a real writer yet; it cannot write to files. I'd call it `CsvGenerator` or `CsvCreator`, because that is what it does.
You don't handle the `,` anywhere, but you use it for CSV generation. If any sentence contains a `,`, you won't be able to read the output back later. The cleaning methods do not trim it. I suggest adjusting the regex for this case:
```
^[-']+|[-',]+$
```
or adding it to the split list, in case someone did not put a space after it.
As a final word: I like your code because you separated all responsibilities and you can test it.
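To make the suggestions above concrete, here is a minimal sketch that combines them: `?` and `!` are added as sentence separators, and a single regex trims leading hyphens and apostrophes as well as trailing hyphens, apostrophes and commas. The class name `ParserSketch`, the whitespace-only word split (used in place of the original `ReplaceCharactersNotAllowedInWordsWithSpaces`, which is not shown here) and the case-insensitive ordering are assumptions made for illustration, not the original author's code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical class, for illustration only.
public static class ParserSketch
{
    // '?' and '!' sit next to '.', as the answer suggests.
    private static readonly char[] SentenceSeparators = { '.', '?', '!' };

    // One pattern: leading hyphens/apostrophes, or trailing hyphens/apostrophes/commas.
    private static readonly Regex EdgeCharacters = new Regex("^[-']+|[-',]+$");

    // Replaces the three separate Remove... helpers with a single method.
    public static string TrimSpecialCharacters(string word)
    {
        return EdgeCharacters.Replace(word, "");
    }

    // Returns one sorted word list per non-empty sentence.
    public static List<List<string>> Parse(string input)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));

        return input
            .Split(SentenceSeparators, StringSplitOptions.RemoveEmptyEntries)
            .Select(sentence => sentence
                // Assumption: words are separated by whitespace only.
                .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(TrimSpecialCharacters)
                .Where(w => w.Length > 0)
                // Assumption: case-insensitive ordering, so "little" sorts before "Mary".
                .OrderBy(w => w, StringComparer.OrdinalIgnoreCase)
                .ToList())
            .Where(words => words.Count > 0)
            .ToList();
    }
}
```

With the sample input from the question, `ParserSketch.Parse("Mary had a little lamb. Aesop and.")` should yield two word lists: `a, had, lamb, little, Mary` and `Aesop, and`. Trimming the comma at the word edges only helps when a space follows it; a comma inside a token such as `lamb,and` would still need to be added to the split list.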
Code Snippets
```
private readonly char[] _sentenceSeparators = { '.' };
```
```
private string RemoveLeadingHyphen(string o)
{
    return Regex.Replace(o, "^-+|-+$", "");
}
```
```
^[-']+|[-']+$
```
```
public IEnumerable<Sentence> Parse(string input)
{
    return
        (input ?? throw new ArgumentNullException(nameof(input)))
        .Split(_sentenceSeparators, StringSplitOptions.RemoveEmptyEntries)
        .Select(sentence => new Sentence
        {
            Words =
                ReplaceCharactersNotAllowedInWordsWithSpaces(sentence)
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(RemoveLeadingHyphen)
                .Select(RemoveTrailingHyphen)
                .Select(RemoveLeadingApostrophe)
                .Where(w => !string.IsNullOrWhiteSpace(w))
                .Where(w => !w.All(char.IsPunctuation))
                .OrderBy(s => s)
                .ToList()
        }).Where(sentence => sentence.Words.Any());
}
```
```
^[-']+|[-',]+$
```
Context
StackExchange Code Review Q#157782, answer score: 2