
Speed up script that calculates distribution of every character from user input

Submitted by: @import:stackexchange-codereview

Problem

I have a data set with close to 6 million rows of user input. Specifically, users were supposed to type in their email addresses, but because no pattern validation was in place, we have a few months' worth of interesting input.

I've come up with a script that counts every character in each row, then combines the per-row counts so I can see the distribution of all characters. This lets me do further analysis and get a sense of the most common mistakes so I can begin to clean the data. My question is: how would you optimize the following for speed?

import pandas as pd
import numpy as np
from pandas import Series, DataFrame
from collections import Counter

df = pd.DataFrame({'input': ['Captain Jean-Luc Picard ','deanna.troi@starfleet.com','geordi @starfleet.com','data@starfleet.com','rik#er@starfleet.com'],
'metric1': np.random.randn(5).cumsum(),
'metric2': np.random.randn(5)})

l = []
for i in range(len(df.index.values)):
    # .iloc replaces the original .ix, which was removed in pandas 1.0
    l.append(dict(Counter(df['input'].iloc[i])))
dist = pd.DataFrame(l).fillna(0)
dist = dist.sum(axis=0)
print(dist)


I've run this over about a third of my dataset, and it takes a while. It's still tolerable; I'm just curious whether anyone could make it faster.

Solution

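Since you are already using Counter, it should be faster to do the whole job with it: update a single Counter with each row's string, rather than building one dict per row and summing them in a DataFrame.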

c = Counter()
for i in range(len(df.index.values)):
    # .iloc replaces the original .ix, which was removed in pandas 1.0
    c.update(df['input'].iloc[i])

for k, v in c.items():
    print(k, v)
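
A possible further variation, offered as a sketch rather than as part of the original answer: iterate over the column's values directly instead of indexing row by row, and feed all characters to Counter as one stream. The use of itertools.chain, the dropna() call, and the final conversion to a sorted Series are assumptions added here for illustration. Whether this is actually faster on the full data set would need to be measured.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Sample data copied from the question; the real column would hold ~6 million rows.
df = pd.DataFrame({'input': ['Captain Jean-Luc Picard ',
                             'deanna.troi@starfleet.com',
                             'geordi @starfleet.com',
                             'data@starfleet.com',
                             'rik#er@starfleet.com']})

# Assumption: missing entries should simply be skipped.
values = df['input'].dropna()

# chain.from_iterable lazily yields every character of every string,
# so a single Counter consumes one long stream of characters.
c = Counter(chain.from_iterable(values))

# Convert to a Series sorted by frequency to view the distribution in pandas.
dist = pd.Series(dict(c)).sort_values(ascending=False)
print(dist.head(10))
```

Sorting by count puts the most frequent characters first, which makes unusual characters such as '#' or spaces easy to find at the tail of the distribution.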

Context

StackExchange Code Review Q#70804, answer score: 2
