How do I create a sum row and sum column in pandas?

Posted on

Problem :

I’m going through the Khan Academy course on Statistics as a bit of a refresher from my college days, and as a way to get me up to speed on pandas & other scientific Python.

I’ve got a table that looks like this from Khan Academy:

             | Undergraduate | Graduate | Total
-------------+---------------+----------+------
Straight A's |           240 |       60 |   300
-------------+---------------+----------+------
Not          |         3,760 |      440 | 4,200
-------------+---------------+----------+------
Total        |         4,000 |      500 | 4,500

I would like to recreate this table using pandas. Of course I could create a DataFrame using something like

"Graduate": {...},
"Undergraduate": {...},
"Total": {...},

But that seems like a naive approach that would both fall over quickly and just not really be extensible.

I’ve got the non-totals part of the table like this:

df = pd.DataFrame(
    {
        "Undergraduate": {"Straight A's": 240, "Not": 3_760},
        "Graduate": {"Straight A's": 60, "Not": 440},
    }
)
df

I’ve been looking and found a couple of promising things, like:

df['Total'] = df.sum(axis=1)

But I didn’t find anything terribly elegant.

I did find the crosstab function that looks like it should do what I want, but it seems like in order to do that I’d have to create a dataframe consisting of 1/0 for all of these values, which seems silly because I’ve already got an aggregate.

I have found some approaches that seem to manually build a new totals row, but it seems like there should be a better way, something like:

totals(df, rows=True, columns=True)

or something.

Does this exist in pandas, or do I have to just cobble together my own approach?

Solution :

Or in two steps, using the .sum() function as you suggested (which might be a bit more readable as well):

import pandas as pd

df = pd.DataFrame( {"Undergraduate": {"Straight A's": 240, "Not": 3_760},"Graduate": {"Straight A's": 60, "Not": 440},})

#Total sum per column: 
df.loc['Total',:]= df.sum(axis=0)

#Total sum per row: 
df.loc[:,'Total'] = df.sum(axis=1)

Output:

              Graduate  Undergraduate  Total
Not                440           3760   4200
Straight A's        60            240    300
Total              500           4000   4500

append and assign

The point of this answer is to provide an in line and not an in place solution.

append

I use append to stack a Series or DataFrame vertically. It also creates a copy so that I can continue to chain.

assign

I use assign to add a column. However, the DataFrame I’m working on is in the in between nether space. So I use a lambda in the assign argument which tells Pandas to apply it to the calling DataFrame.


df.append(df.sum().rename('Total')).assign(Total=lambda d: d.sum(1))

              Graduate  Undergraduate  Total
Not                440           3760   4200
Straight A's        60            240    300
Total              500           4000   4500

Fun alternative

Uses drop with errors='ignore' to get rid of potentially pre-existing Total rows and columns.

Also, still in line.

def tc(d):
  return d.assign(Total=d.drop('Total', errors='ignore', axis=1).sum(1))

df.pipe(tc).T.pipe(tc).T

              Graduate  Undergraduate  Total
Not                440           3760   4200
Straight A's        60            240    300
Total              500           4000   4500

From the original data using crosstab, if just base on your input, you just need melt before crosstab

s=df.reset_index().melt('index')
pd.crosstab(index=s['index'],columns=s.variable,values=s.value,aggfunc='sum',margins=True)
Out[33]: 
variable      Graduate  Undergraduate   All
index                                      
Not                440           3760  4200
Straight A's        60            240   300
All                500           4000  4500

Toy data

df=pd.DataFrame({'c1':[1,2,2,3,4],'c2':[2,2,3,3,3],'c3':[1,2,3,4,5]}) 
# before `agg`, I think your input is the result after `groupby` 
df
Out[37]: 
   c1  c2  c3
0   1   2   1
1   2   2   2
2   2   3   3
3   3   3   4
4   4   3   5


pd.crosstab(df.c1,df.c2,df.c3,aggfunc='sum',margins
=True)
Out[38]: 
c2     2     3  All
c1                 
1    1.0   NaN    1
2    2.0   3.0    5
3    NaN   4.0    4
4    NaN   5.0    5
All  3.0  12.0   15

The original data is:

>>> df = pd.DataFrame(dict(Undergraduate=[240, 3760], Graduate=[60, 440]), index=["Straight A's", "Not"])
>>> df
Out: 
              Graduate  Undergraduate
Straight A's        60            240
Not                440           3760

You can only use df.T to achieve recreating this table:

>>> df_new = df.T
>>> df_new
Out: 
               Straight A's   Not
Graduate                 60   440
Undergraduate           240  3760

After computing the Total by row and columns:

>>> df_new.loc['Total',:]= df_new.sum(axis=0)
>>> df_new.loc[:,'Total'] = df_new.sum(axis=1)
>>> df_new
Out: 
               Straight A's     Not   Total
Graduate               60.0   440.0   500.0
Undergraduate         240.0  3760.0  4000.0
Total                 300.0  4200.0  4500.0

Leave a Reply

Your email address will not be published. Required fields are marked *