pattern, csharp, Minor
Algorithm to categorize episode-file names efficiently
efficiently, file, episode, categorize, names, algorithm
Problem
I'm trying to categorize file names of anime episodes (for now) into appropriate title-based categories. The show titles are parsed from an XML file that I got from Anime News Network.
EDIT: The `backgroundWorker` object passed in is only there to facilitate GUI updates, as this algorithm runs on a secondary thread.

DESCRIPTION: The algorithm itself is pretty straightforward: it looks for matches of each keyword in the file name against a set list of show titles and assigns a score to each matching show title. After all the keywords have been checked, the show title with the highest score is assigned to the file as its category.

There are a few tweaks to boost the score in certain cases as well, such as when the directory name matches the same show title, and when the score and the show-title length are almost the same number.

ADDITION: Class-level variable declarations for context

```
OpenDialogView openDialog;
List<DirectoryInfo> dirList;
BackgroundWorker AnalyzerThread;

// Analyzer Components
public static char[] removablesNum = new char[] { '.', '_', '-', ' ', '^', '!', '@', '#', '$', '%', '&', '*', '~', '`', '?', '(', ')', '[', ']', '{', '}', '+', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' };
public static char[] removables = new char[] { '.', '_', '-', ' ', '^', '!', '@', '#', '$', '%', '&', '*', '~', '`', '?', '(', ')', '[', ']', '{', '}', '+' };
public static string animeDBPath = "ANN_AnimeDB_20-12-2015.xml";
public string parentPath, outputPath;
public List<string> titles;
public List<Children> notSortedFiles;
public List<Category> sortedFiles;
```

CODE:

```
#region Analyzer
private void RunAnalysis(object backgroundWorker)
{
    titles = LoadXML(animeDBPath, "item", "name");
    List<DirectoryInfo> dirs = new List<DirectoryInfo>();
    List<FileInfo> allFiles = new List<FileInfo>();

    // Find all directories
    foreach (DirectoryInfo d in dirList)
    {
        dirs.AddRange(d.GetDirectories("*", SearchOption.AllDirectories));
    }

    // Add the parent directory as well
    dirs.AddRange(dirList);

    // Find all the files
    // ... (the rest of the listing is not available)
```
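To make the description above concrete, here is a minimal sketch of the keyword-scoring idea. It is not the original implementation: the class `TitleMatcher`, the method `BestTitle`, the shortened `Separators` set, and the sample titles are all made up for illustration, and the directory and title-length boosts are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TitleMatcher
{
    // Hypothetical, shortened stand-in for the `removables` array.
    private static readonly char[] Separators = { '.', '_', '-', ' ', '[', ']', '(', ')' };

    private static string[] Split(string value)
    {
        return value.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }

    // Counts how many file-name keywords appear as whole words in the title.
    private static int Score(string title, string[] keywords)
    {
        var titleWords = new HashSet<string>(Split(title), StringComparer.OrdinalIgnoreCase);
        return keywords.Count(k => titleWords.Contains(k));
    }

    // Returns the highest-scoring title, or null when nothing matches at all.
    public static string BestTitle(string fileName, IEnumerable<string> titles)
    {
        string[] keywords = Split(fileName);
        return titles
            .Select(t => new { Title = t, Score = Score(t, keywords) })
            .Where(x => x.Score > 0)
            .OrderByDescending(x => x.Score)
            .Select(x => x.Title)
            .FirstOrDefault();
    }
}
```

With the two sample titles "Cowboy Bebop" and "Space Dandy", a file named `[Group]_Cowboy_Bebop_-_05.mkv` scores 2 against the first title and 0 against the second, so the first is chosen. A file with no matching keyword yields `null`, which corresponds to the "not sorted" case.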
Solution
Don't have time to look at `SortFiles` more closely right now, so just some general remarks:

- I found it useful for readability to prefix private class members with `_` to make them easily distinguishable from local variables and parameters. I was stumped a few times about where a particular variable came from: it was neither declared locally nor a parameter, so I guessed that it must be a class member. Having a clear visual cue helps. However, that's a personal style preference, so YMMV.

- Not sure why you have prefixed `filePath` with an `@` when you use it. This is only necessary when you want to give a variable the same name as a C# keyword (so you could have a local variable called `@this`, for example).

- You shouldn't use `as` like this:

```
(backgroundWorker as BackgroundWorker).ReportProgress(progressPercentage, fileCount);
```

  Either you're sure it's a `BackgroundWorker`, in which case use a direct cast, or you're not sure, in which case use `as` plus a null check. The way it stands you might get a `NullReferenceException`, which is usually not very helpful (since technically it could be thrown in a lot of different places). With a direct cast you at least get an `InvalidCastException`, which tells you a lot more about what may have caused it.

- `DeAccentTitles` should not have to deal with a collection; it should just deal with an individual title and not concern itself with the requirement that you want to do this for a whole collection of strings. It can also be condensed with the use of LINQ:

```
private string DeAccentTitle(string title)
{
    // Decompose accented characters, drop the combining marks, then recompose
    var chars = title.Normalize(NormalizationForm.FormD)
        .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        .ToArray();
    return new string(chars).Normalize(NormalizationForm.FormC);
}
```

- Based on the above point, `LoadXML` can be condensed as well. In particular, the null handling should just disappear:

```
private List<string> LoadXML(string filePath, string descendant, string element)
{
    return XDocument.Load(filePath)
        .Root
        .Descendants(descendant)
        .Where(c => c.Element("type").Value == "TV")
        .Select(c => c.Element(element).Value)
        .OrderBy(v => v)
        .Select(DeAccentTitle)
        .ToList();
}
```

- `RunAnalysis` can be improved as well. I would also make that the point where the `backgroundWorker` is cast to its actual type.

```
private void RunAnalysis(object backgroundWorker)
{
    titles = LoadXML(animeDBPath, "item", "name");
    var allFiles = dirList.SelectMany(d => d.GetDirectories("*", SearchOption.AllDirectories))
        .SelectMany(d => d.EnumerateFiles())
        .ToList();
    sortedFiles = SortFiles(allFiles, (BackgroundWorker)backgroundWorker);
}
```

Update for the `SortFiles` method

A few things to improve here:

1. You split strings in the same way three times. I find it cleaner to extract this into a common method that performs that action.

2. I would extract the scoring of a single title against a specific file name into its own method. This encapsulates the core scoring logic and then lets you deal with the scoring of all file names in a more condensed way.

3. After you've done 1 and 2 you can apply some LINQ magic again to make it more succinct.

4. I don't have the code of the `Category` class, but since `Category.Name` seems to be linked to the title name, it seems weird that you'd have to pass the title in for every file you add. The code would become a bit cleaner if this were tidied up. I haven't done that in the code below yet, though.

The refactored code for `SortFiles` looks like this:

```
private string[] SplitByRemovables(string value)
{
    return value.Split(removables, StringSplitOptions.RemoveEmptyEntries);
}

private int ScoreTitle(string title, string[] filenameParts, string[] directoryParts)
{
    var score = filenameParts.Count(p => Regex.IsMatch(title, @"\b" + p + @"\b", RegexOptions.IgnoreCase));
    if (score > 0)
    {
        score += directoryParts.Count(p => Regex.IsMatch(title, @"\b" + p + @"\b", RegexOptions.IgnoreCase));
    }

    // if the percentage of word matches and total words in the title is > 80% (arbitrary)
    // To avoid false matches with longer titles boost the score
    int titleWordCount = SplitByRemovables(title).Length;
    if ((100 * score / (2 * titleWordCount)) > 80)
    {
        score += 2;
    }
    return score;
}

private List<Category> SortFiles(List<FileInfo> allFiles, BackgroundWorker backgroundWorker)
{
    List<Category> categories = new List<Category>();
    int fileCount = 0;
    foreach (FileInfo file in allFiles)
    {
        fileCount++;
        var filenameParts = SplitByRemovables(Path.GetFileNameWithoutExtension(file.Name));
        var directoryParts = SplitByRemovables(file.Directory.Name);
        var topTitle = titles.Select(t => new { Title = t, Score = ScoreTitle(t, filenameParts, directoryParts) })
            .OrderByDescending(x => x.Score)
            .First();
        var childFile = new Children(file);
        if (topTitle.Score > 0)
        {
            var category = categories.FirstOrDefault(c => c.Name == topTitle.Title);
            if (category == null)
            {
                category = new Category(childFile, topTitle.Title);
                categories.Add(category);
            }
            else
            {
                category.AddChildren(childFile, topTitle.Title);
            }
        }
        else
        {
            // Files without a score were not matched with any existing category
            notSortedFiles.Add(childFile);
        }

        // Update Progress
        // Send percentComplete to the backgroundWorker and the current file number
        int progressPercentage = 100 * fileCount / allFiles.Count;
        // Only the ReportProgress method can update the UI
        backgroundWorker.ReportProgress(progressPercentage, fileCount);
    }
    return categories;
}
```
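To illustrate the remark about `as` versus a direct cast, here is a small self-contained sketch. The class `CastDemo` and its two methods are hypothetical names used only for this example. Each method reports which exception surfaces when the object is not actually a `BackgroundWorker`.

```csharp
using System;
using System.ComponentModel;

public static class CastDemo
{
    // Direct cast: a wrong type fails immediately with InvalidCastException,
    // which points straight at the cast as the cause.
    public static string ViaDirectCast(object worker)
    {
        try
        {
            _ = (BackgroundWorker)worker;
            return "ok";
        }
        catch (InvalidCastException)
        {
            return "InvalidCastException";
        }
    }

    // `as` without a null check: a wrong type silently becomes null and only
    // fails later, on first use, with a much less specific exception.
    public static string ViaAs(object worker)
    {
        try
        {
            _ = (worker as BackgroundWorker).WorkerReportsProgress;
            return "ok";
        }
        catch (NullReferenceException)
        {
            return "NullReferenceException";
        }
    }
}
```

Passing a string to both methods shows the difference: the direct cast reports an `InvalidCastException`, while the unchecked `as` reports a `NullReferenceException`. Passing a real `BackgroundWorker` returns "ok" from both.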
Code Snippets
```
(backgroundWorker as BackgroundWorker).ReportProgress(progressPercentage, fileCount);
```

```
private string DeAccentTitle(string title)
{
    // Decompose accented characters, drop the combining marks, then recompose
    var chars = title.Normalize(NormalizationForm.FormD)
        .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        .ToArray();
    return new string(chars).Normalize(NormalizationForm.FormC);
}
```

```
private List<string> LoadXML(string filePath, string descendant, string element)
{
    return XDocument.Load(filePath)
        .Root
        .Descendants(descendant)
        .Where(c => c.Element("type").Value == "TV")
        .Select(c => c.Element(element).Value)
        .OrderBy(v => v)
        .Select(DeAccentTitle)
        .ToList();
}
```

```
private void RunAnalysis(object backgroundWorker)
{
    titles = LoadXML(animeDBPath, "item", "name");
    var allFiles = dirList.SelectMany(d => d.GetDirectories("*", SearchOption.AllDirectories))
        .SelectMany(d => d.EnumerateFiles())
        .ToList();
    sortedFiles = SortFiles(allFiles, (BackgroundWorker)backgroundWorker);
}
```

```
private string[] SplitByRemovables(string value)
{
    return value.Split(removables, StringSplitOptions.RemoveEmptyEntries);
}

private int ScoreTitle(string title, string[] filenameParts, string[] directoryParts)
{
    var score = filenameParts.Count(p => Regex.IsMatch(title, @"\b" + p + @"\b", RegexOptions.IgnoreCase));
    if (score > 0)
    {
        score += directoryParts.Count(p => Regex.IsMatch(title, @"\b" + p + @"\b", RegexOptions.IgnoreCase));
    }

    // if the percentage of word matches and total words in the title is > 80% (arbitrary)
    // To avoid false matches with longer titles boost the score
    int titleWordCount = SplitByRemovables(title).Length;
    if ((100 * score / (2 * titleWordCount)) > 80)
    {
        score += 2;
    }
    return score;
}

private List<Category> SortFiles(List<FileInfo> allFiles, BackgroundWorker backgroundWorker)
{
    List<Category> categories = new List<Category>();
    int fileCount = 0;
    foreach (FileInfo file in allFiles)
    {
        fileCount++;
        var filenameParts = SplitByRemovables(Path.GetFileNameWithoutExtension(file.Name));
        var directoryParts = SplitByRemovables(file.Directory.Name);
        var topTitle = titles.Select(t => new { Title = t, Score = ScoreTitle(t, filenameParts, directoryParts) })
            .OrderByDescending(x => x.Score)
            .First();
        var childFile = new Children(file);
        if (topTitle.Score > 0)
        {
            var category = categories.FirstOrDefault(c => c.Name == topTitle.Title);
            if (category == null)
            {
                category = new Category(childFile, topTitle.Title);
                categories.Add(category);
            }
            else
            {
                category.AddChildren(childFile, topTitle.Title);
            }
        }
        else
        {
            // Files without a score were not matched with any existing category
            notSortedFiles.Add(childFile);
        }

        // Update Progress
        // Send percentComplete to the backgroundWorker and the current file number
        int progressPercentage = 100 * fileCount / allFiles.Count;
        // Only the ReportProgress method can update the UI
        backgroundWorker.ReportProgress(progressPercentage, fileCount);
    }
    return categories;
}
```

Context
StackExchange Code Review Q#115876, answer score: 2