Categorization algorithm for discrete variables
Problem
I am trying to categorize some data. To do that, I look at the distribution of the data and then bin each value according to the number of times it appears.
The algorithm I have works, but it is really slow, and I am looking to improve its speed.
Speed matters here because I process many different datasets with the same structure, and the data is fairly large (140k rows).
```
def RamsesIdCategory(data):
    # handling Ramses Id:
    print('Starting Ramses Id')
    valueRamses = data['Ramses Trade Id'].unique()
    countRamses = data['Ramses Trade Id'].value_counts()
    for value in valueRamses:
        if countRamses.get(value) < 2:
            data['Ramses Trade Id'].loc[data['Ramses Trade Id'] == value] = 1
        elif 2 <= countRamses.get(value) < 5:
            data['Ramses Trade Id'].loc[data['Ramses Trade Id'] == value] = 2
        elif 5 <= countRamses.get(value) < 10:
            data['Ramses Trade Id'].loc[data['Ramses Trade Id'] == value] = 3
        elif 10 <= countRamses.get(value) < 20:
            data['Ramses Trade Id'].loc[data['Ramses Trade Id'] == value] = 4
        elif 20 <= countRamses.get(value) < 32:
            data['Ramses Trade Id'].loc[data['Ramses Trade Id'] == value] = 5
        else:
            data['Ramses Trade Id'].loc[data['Ramses Trade Id'] == value] = 6
    print('finished Ramses Id')
    return data
```
EDIT: I reworked my code, as I knew the problem was the loop doing too many iterations over my rows. Here is the new version:
```
def RamsesIdCategory(data):
    # handling Ramses Id:
    print('Starting Ramses Id')
    valueRamses = data['Ramses Trade Id'].value_counts()
    for i in data.index:
        if valueRamses.get(data.get_value(i, 'Ramses Trade Id')) < 2:
            data.set_value(i, 'Ramses Trade Id', 1)
        elif 2 <= valueRamses.get(data.get_value(i, 'Ramses Trade Id')) < 5:
            data.set_value(i, 'Ramses Trade Id', 2)
        elif 5 <= valueRamses.get(data.get_value(i, 'Ramses Trade Id')) < 10:
            # ...
```
Solution
The code can be greatly simplified using `apply`. But first, you need a better way to test your values and assign them an id:
```
def convert_count_to_id(count, limits=(2, 5, 10, 20, 32)):
    # Return the 1-based position of the first limit that count falls under
    for id, limit in enumerate(limits, 1):
        if count < limit:
            return id
    # count is at or above every limit: use the last bucket
    return id + 1
```
This is equivalent to your chain of `elif`s but harder to get wrong.
Now your function can become:
```
def ramses_id_category(data):
    serie_name = 'Ramses Trade Id'
    # Number of occurrences of each distinct value
    value_ramses = data[serie_name].value_counts()
    # Bucket id for each distinct value
    id_ramses = value_ramses.apply(convert_count_to_id)
    # Replace every row's value with its bucket id
    data[serie_name] = data[serie_name].apply(id_ramses.get)
```
Note that I removed the `return data` at the end. Since you are mutating the parameter in place, there is no need to return it: the caller will already see the changes through the reference they hold when calling this function.
Code Snippets
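As a possible variation on `convert_count_to_id` (a sketch, not part of the original answer): because the limits are sorted, the standard library's `bisect_right` can find the bucket without an explicit loop. The example below assumes plain Python lists and uses `collections.Counter` in place of `value_counts` so that it runs without pandas; the names `count_to_id` and `categorize_by_frequency` are made up for illustration.

```python
from bisect import bisect_right
from collections import Counter

# Same thresholds as in the question; must be sorted in ascending order.
LIMITS = (2, 5, 10, 20, 32)


def count_to_id(count, limits=LIMITS):
    """Return the 1-based bucket id for an occurrence count.

    bisect_right gives the number of limits that are <= count,
    so count < 2 maps to 1, 2 <= count < 5 maps to 2, and so on.
    """
    return bisect_right(limits, count) + 1


def categorize_by_frequency(values, limits=LIMITS):
    """Replace each value with the bucket id of its number of occurrences."""
    counts = Counter(values)  # occurrences of each distinct value
    return [count_to_id(counts[value], limits) for value in values]


# 'a' appears once, 'b' twice, 'c' five times.
example = categorize_by_frequency(list("abbccccc"))
# -> [1, 2, 2, 3, 3, 3, 3, 3]
```

With a DataFrame, `count_to_id` should be usable wherever the answer's function calls `convert_count_to_id`, for instance `value_ramses.apply(count_to_id)`.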
```
def convert_count_to_id(count, limits=(2, 5, 10, 20, 32)):
    for id, limit in enumerate(limits, 1):
        if count < limit:
            return id
    return id + 1
```
```
def ramses_id_category(data):
    serie_name = 'Ramses Trade Id'
    value_ramses = data[serie_name].value_counts()
    id_ramses = value_ramses.apply(convert_count_to_id)
    data[serie_name] = data[serie_name].apply(id_ramses.get)
```
Context
StackExchange Code Review Q#157790, answer score: 2