HiveBrain v1.2.0
Get Started
← Back to all entries
patterncsharpMinor

word break using iteration

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
iterationwordbreakusing

Problem

Here is my word break problem in c#

static void breakSentenceIntoWords(string input,List validWords)
    {
        if (String.IsNullOrEmpty(input))
            throw new Exception("String cannot be empty");

        int indicator = 0;
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i  validwords = new List() { "I", "like", "had", "play","to" };
        breakSentenceIntoWords("Ihadplayliketo", validwords);


I think the complexity here is O(n2) because of the for-loop and the contains method. How can I improve its efficiency ?

Solution

Here you're using the name word but from your example you're actually working with substrings. Word is a vague concept but think about this:

var validWords = new List() { "all", "to", "get", "her" };
BreakSentenceIntoWords("alltogether", validWords);


In this case you have each word from validWords but you also have another word together which is an English word but it's not in your list. Is it correct? If you're talking about words you need to consider some kind of delimiter (spaces, for example) but don't forget that it's not true in every language (Chinese, just to mention one). In your current code you're not even handling spaces.

Also note that in your current code order matters, for the above string you will get different results with { "all", "to", "together" } and this is (probably) unwanted. There are also some other corner cases you didn't handle (see answer below for other details.) I'd suggest that even before you start writing the first line of code you put in-place a huge test set (some of them also adding some randomness, I like my tests to be deterministic but working with text I always fear I forgot something then 10/20 iterations shuffling dictionaries and strings may help me to improve my test suite.)

Let's start with few general points, regardless what I said above about words.

Method name isn't best one you can choose because you're not breaking a sentence into words but you're checking if all words in that sentence are in a list of valid words. Two methods SplitStringIntoWords() and AreAllWordsKnown() might be more appropriate.

You're throwing Exception but you can do better than that, appropriate specific exception type is more informative and will help caller to handle such errors (you may have, for example, different policies for exception handling.) In this case I'd use ArgumentException. I'd also differentiate between a null input and an empty (or blank) input (using ArgumentNullException and ArgumentException respectively.) You should also check validWords to be not-null and not empty. Something like this:

if (input == null)
    throw new ArgumentNullException(nameof(input));

if (String.IsNullOrBlank(input))
    throw new ArgumentException("Input string cannot be blank", nameof(input));

if (validWords == null)
    throw new ArgumentNullException(nameof(validWords));

if (validWords.Count == 0)
    throw new ArgumentException("Known words list is empty.", nameof(validWords));

if (validWords.Any(x => x == null))
    throw new ArgumentException("Known words can't be null.", nameof(validWords));


Second step is to make this method more generic, you do not need caller to use List when any IEnumerable is enough to complete your task. Just change method signature.

You're also writing to console making this method not easy to reuse. Split logic and presentation, you may use a callback parameter if you need to output something or change return type to hold all the information you need. We will see more about this later.

First performance problem comes from iteration through the input string, for each character you search validWords for a match, this can be pretty inefficient when validWords or input grow to larger numbers. A small note about word searching. Linear search is not efficient for very large dictionaries (but it may be OK for 10~20 words), ordered lists and hashtables offer a huge performace gain in search (because you won't actually need to go through the entire list performing an expensive culture aware string comparison to find a match.) Later there is more about this...

Let's split your sentence into words! .NET Framework already has a function ready to use: String.Split(). If you do not specify any delimiter then it will use whitespace characters (which are many more than space and tab). For simplicity in these examples I will omit error checking (and I won't consider punctuation) but you need to do it in your code.

static string[] SplitIntoWords(string sentence)
{
    return sentence.Split(null);
}


You might want to compare with regex implementation to determine which one performs better:

static string[] SplitIntoWords(string sentence)
{
    return Regex.Split(sentence, "\\s+");;
}


Now that you have a list of words you can determine if all of them are valid. First naive implementation (don't forget what we said above about string searching in a dictionary) may be:

static bool AreAllWordsKnown(IEnumerable words, IEnumerable knownWords)
{
    return words.All(x => knownWords.Contains(x));
}


Note that to extract the list of not valid words you can use:

IEnumerable unkownWords = words.Except(knownWords);


Above code can then be rewritten as:

return words.Except(knownWords).Any() == false;


To determine the list of valid words is straightforward (it's just the intersection between words and unknownWords.)

Things are different if you're not searching for words but for

Code Snippets

var validWords = new List<string>() { "all", "to", "get", "her" };
BreakSentenceIntoWords("alltogether", validWords);
if (input == null)
    throw new ArgumentNullException(nameof(input));

if (String.IsNullOrBlank(input))
    throw new ArgumentException("Input string cannot be blank", nameof(input));

if (validWords == null)
    throw new ArgumentNullException(nameof(validWords));

if (validWords.Count == 0)
    throw new ArgumentException("Known words list is empty.", nameof(validWords));

if (validWords.Any(x => x == null))
    throw new ArgumentException("Known words can't be null.", nameof(validWords));
static string[] SplitIntoWords(string sentence)
{
    return sentence.Split(null);
}
static string[] SplitIntoWords(string sentence)
{
    return Regex.Split(sentence, "\\s+");;
}
static bool AreAllWordsKnown(IEnumerable<string> words, IEnumerable<string> knownWords)
{
    return words.All(x => knownWords.Contains(x));
}

Context

StackExchange Code Review Q#154542, answer score: 5

Revisions (0)

No revisions yet.