HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Excel's SUMIFS implemented using PANDAS, the Python Data Analysis Library

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
thepandasexcelimplementedanalysisusingpythonlibrarydatasumifs

Problem

I've implemented Excel's SUMIFS function in Pandas using the following code. Is there a better — more Pythonic — implementation?

from pandas import Series, DataFrame
import pandas as pd
df = pd.read_csv('data.csv')
# pandas equivalent of Excel's SUMIFS function
df.groupby('PROJECT').sum().ix['A001']


One concern I have with this implementation is that I'm not explicitly specifying the column to be summed.

Data File

Here's an example CSV data file (data.csv), although I'm displaying | instead of commas to improve the visual appearance.

DATE | EMPLOYEE | PROJECT | HOURS
02/01/14 | Smith, John | A001 | 4.0
02/01/14 | Smith, John | B002 | 4.0
02/01/14 | Doe, Jane | A001 | 3.0
02/01/14 | Doe, Jane | C003 | 5.0
02/02/14 | Smith, John | B002 | 2.0
02/02/14 | Smith, John | C003 | 6.0
02/02/14 | Doe, Jane | A001 | 8.0


Equivalent Excel SUMIFS Function

If I were to open data.csv in Excel and wanted to determine how many hours were worked on project A001, I would use the SUMIFS formula as follows:

=SUMIFS($D2:$D8, $C2:$C8, "A001")


Where the SUMIFS function syntax is:

=SUMIFS(sum_range, criteria_range1, criteria1, [criteria_range2,
criteria2], …)

Solution

The usual approach -- if you want all the projects -- would be

>>> df.groupby("PROJECT")["HOURS"].sum()
PROJECT
A001       15
B002        6
C003       11
Name: HOURS, dtype: float64


This only applies the sum on the desired column, as this constructs an intermediate SeriesGroupBy object:

>>> df.groupby("PROJECT")["HOURS"]


If you're only interested in the total hours of a particular project, then I suppose you could do

>>> df.loc[df.PROJECT == "A001", "HOURS"].sum()
15.0


or if you dislike the repetition of df:

>>> df.query("PROJECT == 'A001'")["HOURS"].sum()
15.0


but I find that I almost always want to be able to access more than one sum, so these are pretty rare patterns in my code.

Aside: .ix has fallen out of favour as it has some confusing behaviour. These days it's recommended to use .loc or .iloc to be explicit.

Code Snippets

>>> df.groupby("PROJECT")["HOURS"].sum()
PROJECT
A001       15
B002        6
C003       11
Name: HOURS, dtype: float64
>>> df.groupby("PROJECT")["HOURS"]
<pandas.core.groupby.SeriesGroupBy object at 0xa94f8cc>
>>> df.loc[df.PROJECT == "A001", "HOURS"].sum()
15.0
>>> df.query("PROJECT == 'A001'")["HOURS"].sum()
15.0

Context

StackExchange Code Review Q#42695, answer score: 5

Revisions (0)

No revisions yet.