HiveBrain v1.2.0

Parsing URLs in Pandas DataFrame

Submitted by: @import:stackexchange-codereview
Tags: dataframe, parsing, pandas, urls

Problem

My client needs their Google AdWords destination URL query strings parsed and the values spell-checked to eliminate any typos ("use" instead of "us", etc.).

I'm pulling the data using the AdWords API and putting it into a dataframe for manipulation. Everything works, but there are over 100,000 records in every pull and the code sometimes takes hours to run. Is there a way to optimize the following code blocks?

from urllib.parse import urlparse, parse_qs

def parse_url(df):
    for index, row in df.iterrows():
        parsed = urlparse(str(row['Destination URL'])).query
        parsed = parse_qs(parsed)
        for k, v in parsed.items():  # items() on Python 3; the original iteritems() is Python 2 only
            df.loc[index, k.strip()] = v[0].strip().lower()
    return df

import Levenshtein  # python-Levenshtein package

def typo_correct(urlparams, df, dictionary):
    for index, row in df.iterrows():
        for w in urlparams:
            # skip missing or very short values
            if df.loc[index, w] is None or len(df.loc[index, w]) < 3:
                continue
            high = 0.0
            word = ""
            for item in dictionary:
                # similarity ratio between the value and each dictionary word
                prob = Levenshtein.ratio(df.loc[index, w], item)
                if prob > high:
                    high = prob
                    word = item + "*"
            if high != 1.0:  # anything short of an exact match gets flagged
                df.loc[index, w] = word
                df.loc[index, 'Fix'] = "X"
    return df


Basically it parses out the query parameters, and puts them into a dictionary. The script takes the keys and creates headers in the dataframe, then the first function above iterates through and puts the values in the correct location.
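To make that concrete, here is what urlparse and parse_qs produce for a single made-up destination URL, with the same strip/lower cleanup the function applies:

```python
from urllib.parse import urlparse, parse_qs

url = "http://example.com/landing?kw= Shoes &loc=use"  # made-up destination URL
parsed = parse_qs(urlparse(url).query)
row = {k.strip(): v[0].strip().lower() for k, v in parsed.items()}
print(row)  # {'kw': 'shoes', 'loc': 'use'}
```

Each key of that dictionary becomes a column header, and the cleaned value lands in the row's cell for that column.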

The second one then goes through each value and checks if it's in a dictionary text file and uses the Levenshtein edit distance to find the right word in the case of a typo.
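As a minimal sketch of that lookup, here is the same best-match idea using the standard library's difflib similarity ratio as a stand-in for the Levenshtein ratio (the word list and input are made up):

```python
from difflib import SequenceMatcher

def best_match(value, dictionary):
    """Return the dictionary word most similar to value, plus its ratio."""
    high, word = 0.0, value
    for item in dictionary:
        # ratio() is 1.0 for an exact match, lower for near misses
        prob = SequenceMatcher(None, value, item).ratio()
        if prob > high:
            high, word = prob, item
    return word, high

word, score = best_match("use", ["us", "uk", "de"])
print(word, score)  # us 0.8
```

Since the score is below 1.0, the value would be treated as a typo and replaced, mirroring the `high != 1.0` check in the function above.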

I'm not sure if this is something that can be done using map or apply as I haven't been working with Pandas long. Does anyone have any suggestions?

Solution

I don't know if this is still relevant for you, but I can see some optimizations you could make in your code.

As I see it, you are passing the whole dataframe object and then parsing a specific column of it:

def parse_url(df):
    for index, row in df.iterrows():
        parsed = urlparse(str(row['Destination URL'])).query #<==
        parsed = parse_qs(parsed)
        for k, v in parsed.iteritems():
            df.loc[index, k.strip()] = v[0].strip().lower()
    return df


It would be faster if you pass only the column that you need to parse.

E.g.

def parse_url(urls):
    # a Series has items(), not iterrows() -- iterate the values directly
    parsed_rows = {}
    for index, url in urls.items():
        parsed = parse_qs(urlparse(str(url)).query)
        parsed_rows[index] = {k.strip(): v[0].strip().lower()
                              for k, v in parsed.items()}
    # return the parsed parameters as new columns, ready to join back onto df
    return pd.DataFrame.from_dict(parsed_rows, orient='index')

parse_url(df['columnName'])


Then your function would have less work to do and performance would improve, at least slightly.
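On the map/apply question the asker raises: one way to avoid the per-cell `.loc` writes entirely is to parse the column in a single `apply` pass and join the resulting columns back in one operation. A sketch with made-up data (the column name is taken from the question):

```python
import pandas as pd
from urllib.parse import urlparse, parse_qs

# made-up sample frame; the real one comes from the AdWords pull
df = pd.DataFrame({"Destination URL": [
    "http://example.com/?loc=use&kw=shoes",
    "http://example.com/?loc=uk",
]})

def parse_one(url):
    # parse one URL's query string into a cleaned {param: value} dict
    parsed = parse_qs(urlparse(str(url)).query)
    return {k.strip(): v[0].strip().lower() for k, v in parsed.items()}

# one apply pass, then a single join -- no row-by-row .loc assignments
params = df["Destination URL"].apply(parse_one).apply(pd.Series)
df = df.join(params)
```

Rows whose URLs lack a given parameter simply get NaN in that column, which also makes the short/missing-value check in the typo pass straightforward.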

Context

StackExchange Code Review Q#70538, answer score: 3
