When working with spreadsheets and tabular data, being imported from CSV files or database table, you might need to clean up rows from your Pandas dataframe based on a row condition.
This short tutorial provides a few Python snippets that you’ll be able to use in the following situations:
- You would like to delete a row that contains one or more column values
- You would like to remove a row based on its index
- Here will specifically look into dropping your first and last dataframe rows.
As always, we’ll create our example Pandas dataframe first. Will do it by passing a dictionary with some raw data organized in lists to the pd.DataFrame method.
import pandas as pd
data = ({"Year": [ 2020, 2019, 2020, 2019, 2019, 2020] ,
"Sales Person": ["John", "Xiao", "Evelyn", "Maria", "Samantha", "Maria"],
"Sales Budget": [20000, 32000, 30000, 32000, 54000, 25000]})
df = pd.DataFrame(data=data)
Let’s take a look at the data:
df.head(6)
Year | Sales Person | Sales Budget | |
---|---|---|---|
0 | 2020 | John | 20000 |
1 | 2019 | Xiao | 32000 |
2 | 2020 | Evelyn | 30000 |
3 | 2019 | Maria | 32000 |
4 | 2019 | Samantha | 54000 |
5 | 2020 | Maria | 25000 |
Drop rows based on value or condition
Let’s assume that we want to filter the dataframe based on the Sales Budget. We’ll go ahead and first remove all rows with Sales budget greater or equal to 30K. The easiest way to do that is to create a new dataframe which represents a subset of the original dataframe according to the required values/conditions.
Let’s do that step by step.
df["Sales Budget"]>30000
We get the vector of booleans
0 False 1 True 2 False 3 True 4 True 5 False Name: Sales Budget, dtype: bool
Now we’ll use the vector in order to filter the dataframe as shown below:
subset = df[(df["Sales Budget"]>30000)]
subset.head()
Here’s the result:
Year | Sales Person | Sales Budget | |
---|---|---|---|
1 | 2019 | Xiao | 32000 |
3 | 2019 | Maria | 32000 |
4 | 2019 | Samantha | 54000 |
There is an alternative way to accomplish that: we can use the unary operand (~) to negate a complete statement in Python. Our code could also look as following and produce the same result:
subset = df[~(df["Sales Budget"]<=30000)]
subset.head()
Filter rows according to a complex condition
Let’s assume that we want to filter the rows that has a budget > 30K AND are assigned to Maria.
# the "&" sign represents a boolean AND
subset_AND = df[~(df["Sales Budget"]<=30000) & (df["Sales Person"] =="Maria") ]
subset_AND.head()
Year | Sales Person | Sales Budget | |
---|---|---|---|
3 | 2019 | Maria | 32000 |
Now we’ll that we want to filter the rows that has a budget > 30K OR are assigned to Maria.
# the "|" sign represents a boolean OR
subset_OR = df[~(df["Sales Budget"]<=30000) | (df["Sales Person"] =="Maria") ]
subset_OR.head()
Year | Sales Person | Sales Budget | |
---|---|---|---|
1 | 2019 | Xiao | 32000 |
3 | 2019 | Maria | 32000 |
4 | 2019 | Samantha | 54000 |
5 | 2020 | Maria | 25000 |
Drop a row by index
Let’s assume we want to drop the observation with index 1 and 3. We’ll just go ahead and pass a list of index numbers to the Pandas drop method.
df.drop(index=[1,3])
Year | Sales Person | Sales Budget | |
---|---|---|---|
0 | 2020 | John | 20000 |
2 | 2020 | Evelyn | 30000 |
4 | 2019 | Samantha | 54000 |
5 | 2020 | Maria | 25000 |
Persisting your changes
If you are interested in removing the rows and persist the change in your dataframe, instead or creating a new one, use the inplace parameter as shown below.
df.drop(index=[1,3], inplace=True)
Removing the first row
Index [0] represents the first row in your dataframe, so we’ll pass it to the drop method.
df.drop(index=[0])
Deleting the first couple of rows
df.drop(index = df.index[0:2])
Removing the last row
df.drop(len(df)-1)
len(<dataframe>) returns the dataframe number of rows; we’ll substruct 1 as the index of the first row is 0.