Extract unique terms from a pandas Series
Problem
Background
I have to process tons of DataFrames, each roughly 230 columns by 2,000 to 50,000+ rows. Here is an extremely simplified example:

   numbers              colors
0  0.03620894806802     1xYellow ; 2xRed
1  0.7641262315308163   2xYellow ; 1xOrange
2  0.5607449770945651   3xYellow ; 2xGreen
3  0.6714547913365702   1xYellow ; 1xRed
4  0.8646309438322237   2xYellow ; 1xRed

Problem
I need to break the colors column down to a set that looks like this: {'Green', 'Orange', 'Red', 'Yellow'}. The example code below can do this, but it is painfully slow on huge DataFrames.

import re
import pandas as pd
import numpy as np
# Generating example data
color = ["1xYellow ; 2xRed ",
         "2xYellow ; 1xOrange ",
         "3xYellow ; 2xGreen ",
         "1xYellow ; 1xRed ",
         "2xYellow ; 1xRed "]
numbers = np.random.rand(len(color))
ex_df = pd.DataFrame(np.array([numbers, color]).T,
                     columns=["numbers", "colors"])
# Compile the regex to apply with findall
rx = re.compile(r"x(\w+)\s")
just_colors = ex_df.colors.apply(rx.findall)
# Below is the painfully slow operation that needs optimization.
present_colors = set(sum(just_colors,[]))
Question
Is there a better method out there for pulling unique terms out of a pandas series?
Solution
It doesn't look like you really need regular expressions. Using only basic string operations is about 10x faster than the regular-expression version:

present_colors = set()
for value in ex_df['colors'].values:
    for color in [x.strip() for x in value.split(';')]:
        present_colors.add(color.split('x')[-1])

A bit faster still is the same logic written as a generator and flattened with itertools:

import itertools as it
present_colors = set(it.chain.from_iterable(
    ([color.split('x')[-1].strip() for color in value.split(';')]
     for value in ex_df['colors'].values)))
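
One caveat with both versions above: color.split('x')[-1] returns an empty string for a term that itself ends in 'x' (for example '2xOnyx'). If that can happen in your data, a variant that splits only on the first 'x' might look like the sketch below. The function name unique_terms and its sep and marker parameters are hypothetical, made up for this illustration. It takes any iterable of strings, so ex_df['colors'].values should work as input.

```python
def unique_terms(values, sep=";", marker="x"):
    """Collect the unique terms from strings such as '1xYellow ; 2xRed '.

    `unique_terms`, `sep`, and `marker` are hypothetical names used for
    illustration; they are not part of pandas or of the code above.
    """
    terms = set()
    for value in values:
        for item in value.split(sep):
            # partition() splits on the first marker only, so a term that
            # itself contains 'x' (e.g. 'Onyx') is kept intact.
            _count, found, term = item.partition(marker)
            term = term.strip()
            # Skip items with no marker or with nothing after it.
            if found and term:
                terms.add(term)
    return terms


# Plain-list stand-in for ex_df['colors'].values, with one tricky term added.
colors = ["1xYellow ; 2xRed ", "2xYellow ; 1xOrange ", "3xYellow ; 2xOnyx "]
present_colors = unique_terms(colors)
print(sorted(present_colors))  # ['Onyx', 'Orange', 'Red', 'Yellow']
```

I have not timed this against the two versions above, so treat it as a correctness tweak rather than a speedup.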

Context
StackExchange Code Review Q#156147, answer score: 8