Speed up script that calculates distribution of every character from user input
Problem
I have a data set with close to 6 million rows of user input. Specifically, users were supposed to type in their email addresses, but because no pattern validation was put in place, we have a few months' worth of interesting input.
I've come up with a script that counts the characters in every row, then combines the counts so I can see the distribution of all characters. This enables me to do further analysis and get a sense of the most common mistakes so I can begin to clean the data. My question is: how would you optimize the following for speed?
I've run this over ~1/3 of my dataset, and it takes a while; it's still tolerable, I'm just curious if anyone could make it faster.
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
from collections import Counter

df = pd.DataFrame({'input': ['Captain Jean-Luc Picard ','deanna.troi@starfleet.com','geordi @starfleet.com','data@starfleet.com','rik#er@starfleet.com'],
                   'metric1': np.random.randn(5).cumsum(),
                   'metric2': np.random.randn(5)})

# Count the characters of each row, one dict per row
l = []
for i in range(len(df.index.values)):
    # .loc replaces .ix, which was removed in pandas 1.0
    l.append(dict(Counter(df.loc[i, 'input'])))

# One column per character; summing the columns gives the overall totals
dist = pd.DataFrame(l).fillna(0)
dist = dist.sum(axis=0)
print(dist)
Solution
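Since you are already using Counter, it should be faster to do the whole job with it, updating a single Counter row by row instead of building a DataFrame of per-row dicts:

c = Counter()
for i in range(len(df.index.values)):
    # .loc replaces .ix, which was removed in pandas 1.0
    c.update(df.loc[i, 'input'])
for k, v in c.items():
    print(k, v)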
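A possible further step, offered as a sketch rather than a measured result: the per-row index lookup can be dropped by iterating over the column directly and feeding every character into one Counter. The helper name `char_distribution` is hypothetical, the sample data mirrors the question, and the presence of missing or non-string cells in the real data is an assumption. Whether this is actually faster on the full data set would need to be timed.

```python
from collections import Counter
from itertools import chain

import pandas as pd


def char_distribution(values):
    """Count every character across an iterable of strings.

    `char_distribution` is a hypothetical helper name, not a pandas API.
    """
    # chain.from_iterable yields the characters of each string in turn,
    # so Counter consumes one flat stream with no per-row index lookups.
    return Counter(chain.from_iterable(values))


df = pd.DataFrame({'input': ['Captain Jean-Luc Picard ',
                             'deanna.troi@starfleet.com',
                             'geordi @starfleet.com',
                             'data@starfleet.com',
                             'rik#er@starfleet.com']})

# Assumption: real data may hold missing or non-string cells.
# dropna() skips missing cells; astype(str) converts anything else to text.
dist = char_distribution(df['input'].dropna().astype(str))

# most_common() lists characters from most to least frequent.
for char, count in dist.most_common():
    print(repr(char), count)
```

Printing with `repr` makes whitespace characters visible, which may help when looking for stray spaces in the input.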
Context
StackExchange Code Review Q#70804, answer score: 2