Recent Entries 3
- principle minor 112d ago
  Console app to compare all directory names for similarity

  I just got the urge to write a small console app that compares all directory names for similarity. I have more than 3000 directories, and over time some of them have become really similar, e.g. an update: Test Case ver 1 vs. Test Case ver 2.

  Everything works, but it is really slow. It would probably be faster for me to sort the directories by name and go through them manually...

  The code is 200 lines. I understand that this is a lot more than usual, but I could not find anything about that in the help section, and as is often mentioned, the code should be complete, so here goes:

  ```
  using System;
  using System.Collections.Generic;
  using System.Linq;
  using System.Text;
  using System.Threading.Tasks;
  using System.IO;
  using System.Text.RegularExpressions;

  namespace Similarity
  {
      /// <summary>
      /// Credit http://www.dotnetperls.com/levenshtein
      /// Contains approximate string matching
      /// </summary>
      static class LevenshteinDistance
      {
          /// <summary>
          /// Compute the distance between two strings.
          /// </summary>
          public static int Compute(string s, string t)
          {
              int n = s.Length;
              int m = t.Length;
              int[,] d = new int[n + 1, m + 1];

              // Step 1
              if (n == 0)
              {
                  return m;
              }

              if (m == 0)
              {
                  return n;
              }

              // Step 2
              for (int i = 0; i [...]
          _blackList = new List<string>();

          public List<string> blackList
          {
              get { return this._blackList; }
          }

          public void AddBlackListEntry(string line)
          {
              blackList.Add(line);
          }
          #endregion

          static void Main(string[] args)
          {
              var directories = Directory.EnumerateDirectories(Directory.GetCurrentDirectory(), "*", SearchOption.TopDirectoryOnly)
                  .Select(x => new DirectoryInfo(x).Name)
                  .OrderBy(y => new DirectoryInfo(y).Name)
                  .ToList();
  ```
- pattern minor 112d ago
  Levenshtein Distance with Haskell Vectors and Memoization

  Is the following an effective way to implement the Levenshtein Distance with Haskell vectors?

  ```
  import qualified Data.Vector as V

  levenshtein s1 s2 = levenshteinV (V.fromList s1) (V.fromList s2)

  levenshteinV p1 p2 = lev V.! l1 V.! l2
    where
      lev = V.map levi (V.enumFromN 0 (l1 + 1))
      levi i = V.map (levij i) (V.enumFromN 0 (l2 + 1))
      levij i j
        | i == 0 = j
        | j == 0 = i
        | otherwise = ((lev V.! (i - 1) V.! j) + 1)
                      `min` ((lev V.! i V.! (j - 1)) + 1)
                      `min` ((lev V.! (i - 1) V.! (j - 1)) + ind (i - 1) (j - 1))
      ind i j = if p1 V.! i == p2 V.! j then 0 else 1
      l1 = V.length p1
      l2 = V.length p2
  ```

  In particular, should I be using `V.map` to construct the vectors or is there a better approach? Perhaps `V.generate`? Or does it not make a difference because of lazy evaluation?
- pattern minor 112d ago
  Parsing URLs in Pandas DataFrame

  My client needs their Google AdWords destination URL query parsed and the values spell checked to eliminate any typos ("use" instead of "us", etc.). I'm pulling the data using the AdWords API and putting it into a `DataFrame` for manipulation. Everything works, but there are over 100,000 records in every pull, and sometimes the code takes hours to run. Is there a way to optimize the following code blocks?

  ```
  def parse_url(df):
      for index, row in df.iterrows():
          parsed = urlparse(str(row['Destination URL'])).query
          parsed = parse_qs(parsed)
          for k, v in parsed.iteritems():
              df.loc[index, k.strip()] = v[0].strip().lower()
      return df

  def typo_correct(urlparams, df, dictionary):
      for index, row in df.iterrows():
          for w in urlparams:
              if df.loc[index,w] == None or len(df.loc[index,w]) [...] high:
                          high = prob
                          word = item+"*"
                      else:
                          pass
                  if high != 1.0:
                      df.loc[index,w] = word
                      df.loc[index, 'Fix'] = "X"
      return df
  ```

  Basically, the script parses out the query parameters and puts them into a dictionary. It takes the keys and creates headers in the dataframe; the first function above then iterates through the rows and puts the values in the correct location. The second function goes through each value, checks whether it is in a dictionary text file, and uses the Levenshtein edit distance to find the right word in the case of a typo.

  I'm not sure whether this can be done using `map` or `apply`, as I haven't been working with Pandas for long. Does anyone have any suggestions?
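
The workflow described in the URL-parsing entry (parse the query string, then snap each value to the nearest dictionary word by Levenshtein distance) can be sketched with the standard library alone. This is a minimal illustration, not the poster's code: the names `parse_query`, `levenshtein` and `make_corrector` are hypothetical, the code targets Python 3, and it leaves out the `"*"` marker and the `Fix` column. It assumes that building one dict per URL and caching corrections per distinct value is cheaper than writing cells one at a time with `df.loc`.

```python
from urllib.parse import urlparse, parse_qs


def parse_query(url):
    """Return the first value of each query parameter, stripped and lower-cased."""
    query = urlparse(str(url)).query
    return {k.strip(): v[0].strip().lower() for k, v in parse_qs(query).items()}


def levenshtein(a, b):
    """Edit distance that keeps only two rows of the table in memory."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner loop over the shorter string
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def make_corrector(dictionary):
    """Build a function mapping a value to its closest dictionary word.

    Results are cached, so each distinct misspelling is searched only once.
    """
    words = set(dictionary)
    ordered = sorted(words)  # fixed order makes tie-breaking deterministic
    cache = {}

    def correct(value):
        if value in words:
            return value
        if value not in cache:
            cache[value] = min(ordered, key=lambda w: levenshtein(value, w))
        return cache[value]

    return correct


if __name__ == "__main__":
    urls = ["http://example.com/?country=US&lang=EN",
            "http://example.com/?country=use"]
    records = [parse_query(u) for u in urls]
    correct = make_corrector(["us", "uk", "de", "en"])
    print([{k: correct(v) for k, v in r.items()} for r in records])
```

The resulting list of dicts can be handed to the `DataFrame` constructor in a single step, which avoids growing the frame column by column inside `iterrows()`. Whether that removes the bottleneck depends on how many distinct values the data contains, so it is worth profiling on a real pull.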