patterncsharpModerate
Word-counting script in C#
Viewed 0 times
scriptwordcounting
Problem
My goal is to read from standard input, break up the input into words (Unicode letters and apostrophes), case-insensitively, and produce a report to standard output, where each line is a word followed by a space followed by its count. The output should be sorted by words.
This is my code:
My concerns:
This is my code:
using System;
using System.Text.RegularExpressions;
using System.Collections.Generic;
class TraditionalWordCountApp
{
public static void Main(string[] args)
{
SortedDictionary counts = new SortedDictionary();
Regex wordRegex = new Regex(@"[\p{L}']+");
string line;
while ((line = Console.ReadLine()) != null)
{
line = line.ToLower();
foreach (Match m in wordRegex.Matches(line))
{
String word = m.Value;
int count;
counts.TryGetValue(word, out count);
counts[word] = ++count;
}
}
foreach (KeyValuePair pair in counts)
{
Console.WriteLine("{0} {1}", pair.Key, pair.Value);
}
}
}My concerns:
- Should things be wrapped in a namespace?
- Should either the class or the
mainfunction or both bepublic?
- Is there a nicer way to declare the sorted dictionary? Java has
<>, does C#?
- I used the C-style idiom of assigning to a variable and then checking for null, all inside the while-condition! Surely there is a nicer way to do this in C#. Is there? I like neither the embedded assignment nor the billion-dollar-mistake
null. :)
- I searched StackOverflow a bit for a nicer way to update a word frequency dictionary. Most languages seem to have a built-in get-with-default method on dictionaries, but the answers I saw said no, and they recommended
TryGetValue. These answers are old, though. Is there something nicer in the most modern C#?
- Any nicer way to do the last loop?
- Would a LINQ-based solution be nicer? More idiomatic? What would it look like? I think (though
Solution
- It's not mandatory but it's a widely accepted guideline. Break it only if you have a good reason to do so.
- No, they don't. Class should be
internal(default) andstatic(because it has not instance methods).Main()method should be private to make clear that it won't be called outside this class but it's not mandatory (someone preferspublicbecause it's logically called from outside).
For the other points things are more complex that a paragraph...
You can create a generator to read from standard input (see very last paragraph for a discussion about this):
private static IEnumerable ReadAllInputs()
{
while (true)
{
string line = Console.ReadLine();
if (line == null)
break;
yield return line;
}
}You do not need
ToLower() on input, I do not even start to mention all the problems of case insensitive comparison if not done properly: a lower case string may be different from an upper case string (even in number of characters) but they may be compared equal. Don't put yourself in trouble and let the framework do the work for you:var counts = new SortedDictionary(StringComparer.CurrentCultureIgnoreCase);Note that you do not need a
SortedDictionary here, ordering may be done (in this case) when printing output. If performance matters then you should profile by yourself.When you increment word count you also increment local variable
count, not a big performance hit (and compiler may be smart enough to elide local variable) but it's about code clarity. You can do:counts[word] = count + 1;Or simply:
++counts[word];Note that this innocent code
++counts[word] (as well as original counts[word] = ++count) when word doesn't exist in dictionary (TryGetValue returns false) will perform another lookup vanishing part of the benefit of TryGetValue for the first insertion! You have to explicitly call Add().If you want you can refactor
foreach loop to print results with LINQ, not a big gain (IMO), let's see how:Console.WriteLine(String.Join(Environment.NewLine,
counts.Select(x => $"{x.Key} {x.Value}")));If we put everything together with then have something like this:
namespace Test {
static class TraditionalWordCountApp {
static void Main(string[] args) {
var counts = new Dictionary(StringComparer.CurrentCultureIgnoreCase);
var wordRegex = new Regex(@"(?i)[\p{L}']+");
foreach (var match in ReadAllInputs().SelectMany(x => wordRegex.Match(x))) {
if (counts.TryGetValue(match.Value, out count))
counts[word] = count + 1;
else
counts.Add(match.Value, 1);
}
Console.WriteLine(String.Join(Environment.NewLine,
counts.OrderBy(x => x.Key).Select(x => $"{x.Key} {x.Value}")));
}
static IEnumerable ReadAllInputs() {
while (true) { // Yes, in C# this is "safe"
string line = Console.ReadLine();
if (line == null)
break;
// No need to match any regex against an empty line
if (line.Length == 0)
continue;
yield return line;
}
}
}
}All these done you may think that...there should be another way to do it. You can use
GroupBy(). Profile both versions to see which one has better performance (and the one you feel more easy to understand):var counts = ReadAllInputs()
.SelectMany(x => regex.Match(x))
.GroupBy(x => x.Value, StringComparer.CurrentCultureIgnoreCase)
.Select(x => new { Word = x.Key, Count = x.Count() })
.OrderBy(x => x.Word);Then you print it with:
Console.WriteLine(String.Join(Environment.NewLine,
counts.Select(x => $"{x.Word} {x.Count}")));One final note about word counting. It may apply or not to your case but don't forget that word is a concept with different meanings in different languages. Topic is vast but try to paste a - for example - Japanese text in Microsoft Word and check what the word count is: number of characters (another vague concept). If you have to deal with international text you may need to write specific code for different cultures.
Comments. Few notes collected from comments.
Sometimes you see this kind of code:
while (Console.In.Peek() != -1)
yield return Console.ReadLine();It works only if console input is redirected, when input is line buffered then it'll end immediately after first line. Nice and short but simply it doesn't work. If you really really love LINQ and you can't live without one-line functions then you may do this:
static IEnumerable ReadAllInputs() {
return Enumerable.Range(0, Int32.MaxValue)
.Select(_ => Console.ReadLine())
.TakeWhile(x => x != null);
}Can we make our reader more "generic"? The obvious idea is to us
Code Snippets
private static IEnumerable<string> ReadAllInputs()
{
while (true)
{
string line = Console.ReadLine();
if (line == null)
break;
yield return line;
}
}var counts = new SortedDictionary<string, int>(StringComparer.CurrentCultureIgnoreCase);counts[word] = count + 1;++counts[word];Console.WriteLine(String.Join(Environment.NewLine,
counts.Select(x => $"{x.Key} {x.Value}")));Context
StackExchange Code Review Q#135251, answer score: 10
Revisions (0)
No revisions yet.