In today’s recipe we’ll touch on the basics of adding numeric values in a pandas DataFrame.
We’ll cover the following cases:
- Sum all rows of one or multiple columns
- Sum by column name/label into a new column
- Adding values by index
- Dealing with nan values
- Sum values that meet a certain condition
Creating the dataset
We’ll start by creating a simple dataset
# Python3
# import pandas into your Python environment.
import pandas as pd
# Now, let's create the dataframe
budget = pd.DataFrame({"person": ["John", "Kim", "Bob"],
"quarter": [1, 1, 1] ,
"consumer_budg": [15000, 35000, 45000],
"enterprise_budg": [20000, 30000, 40000] })
budget.head()
person | quarter | consumer_budg | enterprise_budg | |
---|---|---|---|---|
0 | John | 1 | 15000 | 20000 |
1 | Kim | 1 | 35000 | 30000 |
2 | Bob | 1 | 45000 | 40000 |
How to sum a column? (or more)
For a single column we’ll simply use the Series Sum() method.
# one column
budget['consumer_budg'].sum()
95000
Also the DataFrame has a Sum() method, which we’ll use to add multiple columns:
#addingmultiple columns
cols = ['consumer_budg', 'enterprise_budg']
budget[cols].sum()
We’ll receive a Series objects with the results:
consumer_budg 95000 enterprise_budg 90000 dtype: int64
Sum row values into a new column
More interesting is the case that we want to compute the values by adding multiple column values in a specific row. See this simple example below
# using the column label names
budget['total_budget'] = budget['consumer_budg'] + budget['enterprise_budg']
We have created a new column as shown below:
person | quarter | consumer_budg | enterprise_budg | total_budget | |
---|---|---|---|---|---|
0 | John | 1 | 15000 | 20000 | 35000 |
1 | Kim | 1 | 35000 | 30000 | 65000 |
2 | Bob | 1 | 45000 | 40000 | 85000 |
Note: We could have also used the loc method to subset by label.
Adding columns by index
We can also refer to the columns to sum by index, using the iloc method.
# by index
budget['total_budget'] = budget.iloc[:,2]+ budget.iloc[:,3]
Result will be similar as above
Sum with conditions
In this example, we would like to define a column named high_budget and populate it only if the total_budget is over the 80K threshold.
budget['high_budget'] = budget.query('consumer_budg + enterprise_budg > 80000')['total_budget']
Adding columns with null values
Here we might need a bit of pre-processing to get rid of the null values using fillna().
Let’s quickly create a sample dataset containing null values (see last row).
# with nan
import numpy as np
budget_nan = pd.DataFrame({"person": ["John", "Kim", "Bob", 'Court'],
"quarter": [1, 1, 1,1] ,
"consumer_budg": [15000, 35000, 45000, 50000],
"enterprise_budg": [20000, 30000, 40000, np.nan ] })
person | quarter | consumer_budg | enterprise_budg | high_budget | |
---|---|---|---|---|---|
0 | John | 1 | 15000 | 20000.0 | 35000.0 |
1 | Kim | 1 | 35000 | 30000.0 | 65000.0 |
2 | Bob | 1 | 45000 | 40000.0 | 85000.0 |
3 | Court | 1 | 50000 | NaN | NaN |
Now lets use the DataFrame fillna() method to mass override the null values with Zeros so that we can sum the column values.
budget_nan.fillna(0, inplace=True)
budget_nan['high_budget'] = budget_nan['consumer_budg'] + budget_nan['enterprise_budg']
budget_nan
Voi’la
person | quarter | consumer_budg | enterprise_budg | high_budget | |
---|---|---|---|---|---|
0 | John | 1 | 15000 | 20000.0 | 35000.0 |
1 | Kim | 1 | 35000 | 30000.0 | 65000.0 |
2 | Bob | 1 | 45000 | 40000.0 | 85000.0 |
3 | Court | 1 | 50000 | 0.0 | 50000.0 |