Parse text, break into sentences, break into words, output as XML/CSV

Submitted by: @import:stackexchange-codereview

Problem

The task is to:

  • parse text and break it into sentences
  • break sentences into words
  • output the words, sorted, in XML or CSV
  • allow some whitespace around words and delimiters

This is a task I was given for a job interview. I'm more concerned with maintainability, readability, and general advice than with performance, but any suggestions are welcome (maybe I've made some serious and obvious performance mistakes).

Besides implementing the parser and writer, I had to expose it in an ASP.NET MVC web application (using Web API).
The whole code is available here: https://github.com/inwenis/NorParser

Input:

Mary had a little lamb. Aesop and.


Expected XML format:




a
had
lamb
little
Mary


Aesop
and




Expected CSV format

, Word 1, Word 2, Word 3, Word 4, Word 5, Word 6, Word 7, Word 8
Sentence 1, a, had, lamb, little, Mary
Sentence 2, Aesop, and


Sentence.cs:

using System.Collections.Generic;

namespace NorParser
{
    public class Sentence
    {
        public List<string> Words { get; set; }
    }
}


Parser.cs:

```
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

namespace NorParser
{
    public class Parser
    {
        private readonly char[] _sentenceSeparators = { '.' };

        public List<Sentence> Parse(string input)
        {
            var parsedSentences = new List<Sentence>();
            var sentences = input.Split(_sentenceSeparators, StringSplitOptions.RemoveEmptyEntries);
            foreach (var sentence in sentences)
            {
                var words = ReplaceCharactersNotAllowedInWordsWithSpaces(sentence)
                    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                    .Select(RemoveLeadingHyphen)
                    .Select(RemoveTrailingHyphen)
                    .Select(RemoveLeadingApostrophe)
                    .Where(w => !string.IsNullOrWhiteSpace(w))
                    .Where(w
                    // ... (the listing is cut off at this point in the source)
```

Solution

private readonly char[] _sentenceSeparators = { '.' };


What happened to the ? and !? Those are sentence separators too.
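
One way to cover the missing terminators is to extend the separator array. The following is a minimal sketch, not the author's code: the class name SentenceSplitter and the method SplitIntoSentences are hypothetical, and it assumes that '.', '?' and '!' always end a sentence (so an abbreviation such as "e.g." would be split as well).

```csharp
using System;
using System.Linq;

public static class SentenceSplitter
{
    // Assumption: these three characters are the only sentence terminators.
    private static readonly char[] SentenceSeparators = { '.', '?', '!' };

    public static string[] SplitIntoSentences(string input)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));

        return input
            .Split(SentenceSeparators, StringSplitOptions.RemoveEmptyEntries)
            .Select(s => s.Trim())        // drop whitespace around each sentence
            .Where(s => s.Length > 0)     // skip fragments that were only whitespace
            .ToArray();
    }
}
```

With this sketch, SentenceSplitter.SplitIntoSentences("Mary had a little lamb. Really? Yes!") yields the three sentences "Mary had a little lamb", "Really" and "Yes".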

private string RemoveLeadingHyphen(string o)
{
    return Regex.Replace(o, "^-+|-+$", "");
}


This is supposed to remove only leading hyphens, but it removes trailing ones too.

I would use a single method, such as TrimSpecialCharacters, with a single regex:

^[-']+|[-']+$
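
A possible shape for such a method is sketched below. The class name WordCleaner is hypothetical; only the method name and the pattern come from the suggestion above. Note one behavioral difference this sketch assumes is acceptable: unlike the separate RemoveLeadingApostrophe step, it also strips trailing apostrophes.

```csharp
using System.Text.RegularExpressions;

public static class WordCleaner
{
    // Matches a run of hyphens/apostrophes at the start OR at the end of the word.
    private static readonly Regex EdgeCharacters = new Regex(@"^[-']+|[-']+$");

    public static string TrimSpecialCharacters(string word)
    {
        // Regex.Replace substitutes every match, so both edges are cleaned in one call.
        // Hyphens and apostrophes inside the word are left untouched.
        return EdgeCharacters.Replace(word, "");
    }
}
```

For example, TrimSpecialCharacters("--well-known'") returns "well-known", while "don't" is returned unchanged and "---" becomes an empty string, which the later Where filters would drop.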


As a matter of fact you can make the entire Parse method a single LINQ expression:

public IEnumerable<Sentence> Parse(string input)
{
    return
        (input ?? throw new ArgumentNullException(nameof(input)))
        .Split(_sentenceSeparators, StringSplitOptions.RemoveEmptyEntries)
        .Select(sentence => new Sentence
        {
            Words =
                 ReplaceCharactersNotAllowedInWordsWithSpaces(sentence)
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(RemoveLeadingHyphen)
                .Select(RemoveTrailingHyphen)
                .Select(RemoveLeadingApostrophe)
                .Where(w => !string.IsNullOrWhiteSpace(w))
                .Where(w => !w.All(char.IsPunctuation))
                .OrderBy(s => s)
                .ToList()
        }).Where(sentence => sentence.Words.Any());                   
}


The CsvWriter is not a real writer yet: it cannot write to files. I'd call it CsvGenerator or CsvCreator, because that is what it does.
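
To illustrate what a "real" writer could look like, here is a rough sketch that writes to any TextWriter (a file, a response stream, or a StringWriter in tests). The original CsvWriter is not shown here, so everything in this block is an assumption: the class name CsvSketch, the Write signature, and the choice to size the header by the longest sentence. The row layout follows the expected CSV format from the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class CsvSketch
{
    // Hypothetical API: each inner sequence holds the (already sorted) words of one sentence.
    public static void Write(TextWriter writer, IEnumerable<IEnumerable<string>> sentences)
    {
        if (writer == null) throw new ArgumentNullException(nameof(writer));
        if (sentences == null) throw new ArgumentNullException(nameof(sentences));

        var rows = sentences.Select(s => s.ToList()).ToList();
        int maxWords = rows.Count == 0 ? 0 : rows.Max(r => r.Count);

        // Header: an empty first cell, then "Word 1" .. "Word N".
        var header = new[] { "" }.Concat(Enumerable.Range(1, maxWords).Select(i => "Word " + i));
        writer.WriteLine(string.Join(", ", header));

        // One row per sentence: "Sentence k" followed by its words.
        for (int i = 0; i < rows.Count; i++)
        {
            var cells = new[] { "Sentence " + (i + 1) }.Concat(rows[i]);
            writer.WriteLine(string.Join(", ", cells));
        }
    }
}
```

Writing to a file then becomes a matter of passing a StreamWriter, for example one obtained from File.CreateText, while a string-producing generator can wrap the same method around a StringWriter.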

You don't handle the comma (,) anywhere, yet you use it as the delimiter for CSV generation. If any sentence contains a comma, you won't be able to read the output back correctly later. The cleaning methods do not trim it.

I suggest adjusting the regex for this case:

^[-']+|[-',]+$


or adding it to the split list in case someone did not put a space after it.
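
The two options behave differently, which the following sketch tries to show. The names CommaAwareCleaner, TrimWord and SplitWords are hypothetical; the pattern is the one suggested above, and the separator list is an assumed extension of the original single-space split.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommaAwareCleaner
{
    // Option 1: also strip commas from the END of a word ("wolf," -> "wolf").
    private static readonly Regex EdgeCharacters = new Regex(@"^[-']+|[-',]+$");

    // Option 2: treat the comma as a word separator, like the space.
    private static readonly char[] WordSeparators = { ' ', ',' };

    public static string TrimWord(string word)
    {
        return EdgeCharacters.Replace(word, "");
    }

    public static string[] SplitWords(string sentence)
    {
        return sentence.Split(WordSeparators, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

TrimWord("wolf,") returns "wolf", but TrimWord("wolf,and") leaves the inner comma in place, so the regex alone does not help when the space after the comma is missing. SplitWords("wolf,and Aesop") returns "wolf", "and" and "Aesop", which covers that case.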

As a final word: I like your code because you separated all responsibilities and you can test it.

Context

StackExchange Code Review Q#157782, answer score: 2
