patternpythonMinor
Excel's SUMIFS implemented using PANDAS, the Python Data Analysis Library
Viewed 0 times
thepandasexcelimplementedanalysisusingpythonlibrarydatasumifs
Problem
I've implemented Excel's SUMIFS function in Pandas using the following code. Is there a better — more Pythonic — implementation?
One concern I have with this implementation is that I'm not explicitly specifying the column to be summed.
Data File
Here's an example CSV data file (
Equivalent Excel SUMIFS Function
If I were to open
Where the SUMIFS function syntax is:
from pandas import Series, DataFrame
import pandas as pd
df = pd.read_csv('data.csv')
# pandas equivalent of Excel's SUMIFS function
df.groupby('PROJECT').sum().ix['A001']One concern I have with this implementation is that I'm not explicitly specifying the column to be summed.
Data File
Here's an example CSV data file (
data.csv), although I'm displaying | instead of commas to improve the visual appearance.DATE | EMPLOYEE | PROJECT | HOURS
02/01/14 | Smith, John | A001 | 4.0
02/01/14 | Smith, John | B002 | 4.0
02/01/14 | Doe, Jane | A001 | 3.0
02/01/14 | Doe, Jane | C003 | 5.0
02/02/14 | Smith, John | B002 | 2.0
02/02/14 | Smith, John | C003 | 6.0
02/02/14 | Doe, Jane | A001 | 8.0
Equivalent Excel SUMIFS Function
If I were to open
data.csv in Excel and wanted to determine how many hours were worked on project A001, I would use the SUMIFS formula as follows:=SUMIFS($D2:$D8, $C2:$C8, "A001")
Where the SUMIFS function syntax is:
=SUMIFS(sum_range, criteria_range1, criteria1, [criteria_range2,
criteria2], …)
Solution
The usual approach -- if you want all the projects -- would be
This only applies the
If you're only interested in the total hours of a particular project, then I suppose you could do
or if you dislike the repetition of
but I find that I almost always want to be able to access more than one sum, so these are pretty rare patterns in my code.
Aside:
>>> df.groupby("PROJECT")["HOURS"].sum()
PROJECT
A001 15
B002 6
C003 11
Name: HOURS, dtype: float64This only applies the
sum on the desired column, as this constructs an intermediate SeriesGroupBy object:>>> df.groupby("PROJECT")["HOURS"]
If you're only interested in the total hours of a particular project, then I suppose you could do
>>> df.loc[df.PROJECT == "A001", "HOURS"].sum()
15.0or if you dislike the repetition of
df:>>> df.query("PROJECT == 'A001'")["HOURS"].sum()
15.0but I find that I almost always want to be able to access more than one sum, so these are pretty rare patterns in my code.
Aside:
.ix has fallen out of favour as it has some confusing behaviour. These days it's recommended to use .loc or .iloc to be explicit.Code Snippets
>>> df.groupby("PROJECT")["HOURS"].sum()
PROJECT
A001 15
B002 6
C003 11
Name: HOURS, dtype: float64>>> df.groupby("PROJECT")["HOURS"]
<pandas.core.groupby.SeriesGroupBy object at 0xa94f8cc>>>> df.loc[df.PROJECT == "A001", "HOURS"].sum()
15.0>>> df.query("PROJECT == 'A001'")["HOURS"].sum()
15.0Context
StackExchange Code Review Q#42695, answer score: 5
Revisions (0)
No revisions yet.