patternpythonMinor
Parsing URLs in Pandas DataFrame
Viewed 0 times
dataframeparsingpandasurls
Problem
My client needs their Google AdWords destination URL query parsed and the values spell checked to eliminate any typos ("use" instead of "us", etc).
I'm pulling the data using the AdWords API and putting it into a
Basically it parses out the query parameters, and puts them into a dictionary. The script takes the keys and creates headers in the dataframe, then the first function above iterates through and puts the values in the correct location.
The second one then goes through each value and checks if it's in a dictionary text file and uses the Levenshtein edit distance to find the right word in the case of a typo.
I'm not sure if this is something that can be done using map or apply as I haven't been working with Pandas long. Does anyone have any suggestions?
I'm pulling the data using the AdWords API and putting it into a
dateframe for manipulation. Everything works, but there are over 100,000 records every pull and sometimes the code takes hours and hours to run. Is there a way to optimize the following code blocks?def parse_url(df):
for index, row in df.iterrows():
parsed = urlparse(str(row['Destination URL'])).query
parsed = parse_qs(parsed)
for k, v in parsed.iteritems():
df.loc[index, k.strip()] = v[0].strip().lower()
return df
def typo_correct(urlparams, df, dictionary):
for index, row in df.iterrows():
for w in urlparams:
if df.loc[index,w] == None or len(df.loc[index,w]) high:
high = prob
word = item+"*"
else:
pass
if high != 1.0:
df.loc[index,w] = word
df.loc[index, 'Fix'] = "X"
return dfBasically it parses out the query parameters, and puts them into a dictionary. The script takes the keys and creates headers in the dataframe, then the first function above iterates through and puts the values in the correct location.
The second one then goes through each value and checks if it's in a dictionary text file and uses the Levenshtein edit distance to find the right word in the case of a typo.
I'm not sure if this is something that can be done using map or apply as I haven't been working with Pandas long. Does anyone have any suggestions?
Solution
I don't know if it's still relevant for you, but I can see some optimization that you could do in your code.
As I see, you are passing all dataframe objects and then parsing specific columns in that dataframe:
It would be faster if you pass only the column that you need to parse.
E.g.
Then first your function would have less work to do and performance would increase - at least slightly.
As I see, you are passing all dataframe objects and then parsing specific columns in that dataframe:
def parse_url(df):
for index, row in df.iterrows():
parsed = urlparse(str(row['Destination URL'])).query #<==
parsed = parse_qs(parsed)
for k, v in parsed.iteritems():
df.loc[index, k.strip()] = v[0].strip().lower()
return dfIt would be faster if you pass only the column that you need to parse.
E.g.
def parse_url(df):
for index, row in df.iterrows():
parsed = urlparse(str(row)).query # <- Notice the change here
parsed = parse_qs(parsed)
for k, v in parsed.items(): #use items() in Python3 and iteritems() in Python2
df.loc[index, k.strip()] = v[0].strip().lower()
return df
parse_url(df['columnName'])Then first your function would have less work to do and performance would increase - at least slightly.
Code Snippets
def parse_url(df):
for index, row in df.iterrows():
parsed = urlparse(str(row['Destination URL'])).query #<==
parsed = parse_qs(parsed)
for k, v in parsed.iteritems():
df.loc[index, k.strip()] = v[0].strip().lower()
return dfdef parse_url(df):
for index, row in df.iterrows():
parsed = urlparse(str(row)).query # <- Notice the change here
parsed = parse_qs(parsed)
for k, v in parsed.items(): #use items() in Python3 and iteritems() in Python2
df.loc[index, k.strip()] = v[0].strip().lower()
return df
parse_url(df['columnName'])Context
StackExchange Code Review Q#70538, answer score: 3
Revisions (0)
No revisions yet.