HiveBrain v1.2.0
Get Started
← Back to all entries
patterncsharpMinor

Extracting and cleaning up duplicate emails

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
duplicateemailsextractingandcleaning

Problem

Right now I am using a regex to extract emails from a text document, and filter the duplicates out of this using the Distinct function from linq. After which I run the output file through a clean up of duplicates as well, as the program can be run several times over several different files.

Now my question, would there be a more optimal way of checking the output file for duplicates? Because as it is right now, the bigger the file, the longer it will take.

And I have the feeling there should be a less intense way to check this.

// Extractor
public void Mail(string file)
{
    string strRegex = @"[A-Za-z0-9_\-\+]+@[A-Za-z0-9\-]+\.([A-Za-z]{2,3})(?:\.[a-z]{2})?";
    var myRegex = new Regex(strRegex, RegexOptions.None);

    var matches = new List();
    foreach (Match myMatch in myRegex.Matches(file))
    {
        if (myMatch.Success)
        {
            matches.Add(myMatch.Value);
        }
    }

    var cleanMatch = matches.Distinct().ToList();

    for (var i = 0; i (System.IO.File.ReadAllLines(System.Environment.GetFolderPath(System.Environment.SpecialFolder.DesktopDirectory) + "\\EmailList.txt"));

    lines = lines.Distinct().ToList();
    string target = System.Environment.GetFolderPath(System.Environment.SpecialFolder.DesktopDirectory) + "\\EmailList.txt";
    using (System.IO.StreamWriter newTask = new System.IO.StreamWriter(target, false))
    {
        for (var i = 0; i < lines.Count; i++)
        {
            newTask.WriteLine(lines[i]);
        }
    }
}

Solution

You'll be able to simplify this a lot if you use a better in memory datastructure than a List. You only want distinct items, HashSet will generally be better here as it has O(1) lookups.

public void Mail(string file)
{
    var strRegex = @"[A-Za-z0-9_\-\+]+@[A-Za-z0-9\-]+\.([A-Za-z]{2,3})(?:\.[a-z]{2})?";
    var myRegex = new Regex(strRegex, RegexOptions.None);
    var matches = new HashSet();
    foreach (Match matchedValue in myRegex.Matches(file))
    {
        // Changed per Mjolka's comment
        if (matches.Add(matchedValue.Value))
        {
            Log.Mail(matchedValue.Value);
        }
    }
    CleanDuplicates();
}


Note that I've removed the uneccessary check for Success on the Match object.

Your naming could do with a bit of work myRegex doesn't really mean anything.

You could also store the Regex as a field on the class to declutter this method.

You can leverage a HashSet in your CleanDuplicates method too:

private void CleanDuplicates()
{
    var fileLocation = System.Environment.GetFolderPath(System.Environment.SpecialFolder.DesktopDirectory) + @"\EmailList.txt";

    var lines = new HashSet(File.ReadAllLines(fileLocation));

    File.WriteAllLines(fileLocation, lines.ToArray());
}


Definitely add a using System.IO it removes a lot of clutter.

Prefer built in methods e.g. File.WriteAllLines overwrites the file or creates a new one - no need to whip out a StreamWriter.

Also note that if your file becomes really large this will start getting pretty slow.

Final(?) Edit

Can't use var in the foreach loop as MatchCollection implements IEnumerable only - not the generic version. Sorry about that.

In order to use a Linq method (such as Select) we have to supply the type by calling Cast:

foreach (var matchedValue in myRegex.Matches(file).Cast().Select(m => m.Value))
{
    if (matches.Add(matchedValue))
    {
        Log.Mail(matchedValue);
    }
}

Code Snippets

public void Mail(string file)
{
    var strRegex = @"[A-Za-z0-9_\-\+]+@[A-Za-z0-9\-]+\.([A-Za-z]{2,3})(?:\.[a-z]{2})?";
    var myRegex = new Regex(strRegex, RegexOptions.None);
    var matches = new HashSet<string>();
    foreach (Match matchedValue in myRegex.Matches(file))
    {
        // Changed per Mjolka's comment
        if (matches.Add(matchedValue.Value))
        {
            Log.Mail(matchedValue.Value);
        }
    }
    CleanDuplicates();
}
private void CleanDuplicates()
{
    var fileLocation = System.Environment.GetFolderPath(System.Environment.SpecialFolder.DesktopDirectory) + @"\EmailList.txt";

    var lines = new HashSet<string>(File.ReadAllLines(fileLocation));

    File.WriteAllLines(fileLocation, lines.ToArray());
}
foreach (var matchedValue in myRegex.Matches(file).Cast<Match>().Select(m => m.Value))
{
    if (matches.Add(matchedValue))
    {
        Log.Mail(matchedValue);
    }
}

Context

StackExchange Code Review Q#90714, answer score: 9

Revisions (0)

No revisions yet.