Extract unique terms from a pandas Series

Submitted by: @import:stackexchange-codereview

Problem

Background

I have to process tons of DataFrames with shapes of ~230 columns x ~2000-50000+ rows. Here is an extremely simplified example:

   numbers             colors
0  0.03620894806802    1xYellow ; 2xRed
1  0.7641262315308163  2xYellow ; 1xOrange
2  0.5607449770945651  3xYellow ; 2xGreen
3  0.6714547913365702  1xYellow ; 1xRed
4  0.8646309438322237  2xYellow ; 1xRed


Problem

I need to break the colors column down to a set that looks like this:
{'Green', 'Orange', 'Red', 'Yellow'}. The example code below does this, but it is painfully slow on huge DataFrames.

import re

import numpy as np
import pandas as pd

# Generate example data
color = ["1xYellow ; 2xRed ",
         "2xYellow ; 1xOrange ",
         "3xYellow ; 2xGreen ",
         "1xYellow ; 1xRed ",
         "2xYellow ; 1xRed "]
numbers = np.random.rand(len(color))
ex_df = pd.DataFrame(np.array([numbers, color]).T,
                     columns=["numbers", "colors"])

# Compile the regex to apply with findall
# (raw string, so \w and \s reach the regex engine unchanged)
rx = re.compile(r"x(\w+)\s")
just_colors = ex_df.colors.apply(rx.findall)

# Below is the painfully slow operation that needs optimization.
present_colors = set(sum(just_colors, []))


Question

Is there a better method out there for pulling unique terms out of a pandas series?

Solution

It doesn't look like you really need regular expressions. This version, which uses only basic string operations, is about 10x faster than the regex version:

present_colors = set()
for value in ex_df['colors'].values:
    for color in [x.strip() for x in value.split(';')]:
        present_colors.add(color.split('x')[-1])


Slightly faster still is the same logic written as a generator expression and flattened with itertools.chain.from_iterable:

import itertools as it
present_colors = set(it.chain.from_iterable(
    ([color.split('x')[-1].strip() for color in value.split(';')]
     for value in ex_df['colors'].values)))
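
If the regex is worth keeping, the slow part of the original is likely the flattening step rather than the matching: sum(just_colors, []) builds a new list at every addition, so its cost grows roughly quadratically with the number of rows. The sketch below is not part of the answer above and has not been benchmarked here. It shows two ways to flatten in a single pass, one with itertools and one with pandas string methods. The pattern \d+x(\w+) is an assumption about the data format (a count, a literal "x", then the term), and the names term_rx, via_chain, and via_pandas are illustrative.

```python
import re
from itertools import chain

import pandas as pd

# Sample data copied from the question; the real frames are far larger.
colors = pd.Series([
    "1xYellow ; 2xRed ",
    "2xYellow ; 1xOrange ",
    "3xYellow ; 2xGreen ",
    "1xYellow ; 1xRed ",
    "2xYellow ; 1xRed ",
])

# Assumed pattern: a count, a literal "x", then the term to capture.
# Unlike x(\w+)\s it does not require trailing whitespace after the term.
term_rx = re.compile(r"\d+x(\w+)")

# Option 1: keep the regex, but flatten lazily instead of using sum(..., []).
via_chain = set(
    chain.from_iterable(term_rx.findall(value) for value in colors.to_numpy())
)

# Option 2: stay inside pandas. str.findall yields one list per row and
# explode puts each list element on its own row (empty lists become NaN).
via_pandas = set(colors.str.findall(term_rx.pattern).explode().dropna())

print(sorted(via_chain))
print(sorted(via_pandas))
```

Both options should produce the same set as the string-splitting versions on this sample. Which one is fastest on real data would need to be measured, for example with timeit on a representative frame.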


Context

StackExchange Code Review Q#156147, answer score: 8
