patterncsharpMinor
Extracting and cleaning up duplicate emails
Viewed 0 times
duplicateemailsextractingandcleaning
Problem
Right now I am using a regex to extract emails from a text document, and filter the duplicates out of this using the Distinct function from linq. After which I run the output file through a clean up of duplicates as well, as the program can be run several times over several different files.
Now my question, would there be a more optimal way of checking the output file for duplicates? Because as it is right now, the bigger the file, the longer it will take.
And I have the feeling there should be a less intense way to check this.
Now my question, would there be a more optimal way of checking the output file for duplicates? Because as it is right now, the bigger the file, the longer it will take.
And I have the feeling there should be a less intense way to check this.
// Extractor
public void Mail(string file)
{
string strRegex = @"[A-Za-z0-9_\-\+]+@[A-Za-z0-9\-]+\.([A-Za-z]{2,3})(?:\.[a-z]{2})?";
var myRegex = new Regex(strRegex, RegexOptions.None);
var matches = new List();
foreach (Match myMatch in myRegex.Matches(file))
{
if (myMatch.Success)
{
matches.Add(myMatch.Value);
}
}
var cleanMatch = matches.Distinct().ToList();
for (var i = 0; i (System.IO.File.ReadAllLines(System.Environment.GetFolderPath(System.Environment.SpecialFolder.DesktopDirectory) + "\\EmailList.txt"));
lines = lines.Distinct().ToList();
string target = System.Environment.GetFolderPath(System.Environment.SpecialFolder.DesktopDirectory) + "\\EmailList.txt";
using (System.IO.StreamWriter newTask = new System.IO.StreamWriter(target, false))
{
for (var i = 0; i < lines.Count; i++)
{
newTask.WriteLine(lines[i]);
}
}
}Solution
You'll be able to simplify this a lot if you use a better in memory datastructure than a
Note that I've removed the uneccessary check for
Your naming could do with a bit of work
You could also store the Regex as a field on the class to declutter this method.
You can leverage a
Definitely add a
Prefer built in methods e.g.
Also note that if your file becomes really large this will start getting pretty slow.
Final(?) Edit
Can't use var in the foreach loop as MatchCollection implements
In order to use a Linq method (such as
List. You only want distinct items, HashSet will generally be better here as it has O(1) lookups.public void Mail(string file)
{
var strRegex = @"[A-Za-z0-9_\-\+]+@[A-Za-z0-9\-]+\.([A-Za-z]{2,3})(?:\.[a-z]{2})?";
var myRegex = new Regex(strRegex, RegexOptions.None);
var matches = new HashSet();
foreach (Match matchedValue in myRegex.Matches(file))
{
// Changed per Mjolka's comment
if (matches.Add(matchedValue.Value))
{
Log.Mail(matchedValue.Value);
}
}
CleanDuplicates();
}Note that I've removed the uneccessary check for
Success on the Match object. Your naming could do with a bit of work
myRegex doesn't really mean anything.You could also store the Regex as a field on the class to declutter this method.
You can leverage a
HashSet in your CleanDuplicates method too:private void CleanDuplicates()
{
var fileLocation = System.Environment.GetFolderPath(System.Environment.SpecialFolder.DesktopDirectory) + @"\EmailList.txt";
var lines = new HashSet(File.ReadAllLines(fileLocation));
File.WriteAllLines(fileLocation, lines.ToArray());
}Definitely add a
using System.IO it removes a lot of clutter.Prefer built in methods e.g.
File.WriteAllLines overwrites the file or creates a new one - no need to whip out a StreamWriter.Also note that if your file becomes really large this will start getting pretty slow.
Final(?) Edit
Can't use var in the foreach loop as MatchCollection implements
IEnumerable only - not the generic version. Sorry about that.In order to use a Linq method (such as
Select) we have to supply the type by calling Cast:foreach (var matchedValue in myRegex.Matches(file).Cast().Select(m => m.Value))
{
if (matches.Add(matchedValue))
{
Log.Mail(matchedValue);
}
}Code Snippets
public void Mail(string file)
{
var strRegex = @"[A-Za-z0-9_\-\+]+@[A-Za-z0-9\-]+\.([A-Za-z]{2,3})(?:\.[a-z]{2})?";
var myRegex = new Regex(strRegex, RegexOptions.None);
var matches = new HashSet<string>();
foreach (Match matchedValue in myRegex.Matches(file))
{
// Changed per Mjolka's comment
if (matches.Add(matchedValue.Value))
{
Log.Mail(matchedValue.Value);
}
}
CleanDuplicates();
}private void CleanDuplicates()
{
var fileLocation = System.Environment.GetFolderPath(System.Environment.SpecialFolder.DesktopDirectory) + @"\EmailList.txt";
var lines = new HashSet<string>(File.ReadAllLines(fileLocation));
File.WriteAllLines(fileLocation, lines.ToArray());
}foreach (var matchedValue in myRegex.Matches(file).Cast<Match>().Select(m => m.Value))
{
if (matches.Add(matchedValue))
{
Log.Mail(matchedValue);
}
}Context
StackExchange Code Review Q#90714, answer score: 9
Revisions (0)
No revisions yet.