How to filter and subset a Python dataframe by specific values?

Quite often as part of your data wrangling, you’ll invariably need to subset or filter your dataset and select for specific rows by one or multiple values. In today’s quick tutorial i’ll show how to do exactly that.

Creating your data

We’ll start by creating a very simple data set which we’ll use for learning purposes. As we typically do, we’ll first go ahead and import the Pandas library to our work environment and then create a dataframe from a dictionary as shown below.

# import Python Pandas
import pandas as pd

# Now, let's create the dataframe 
df = pd.DataFrame({"Year": [ 2018, 2019, 2020] ,
                     "Person": ["John", "Kim", "Bob"],
                     "Budget": [20000, 30000, 40000]})
'Show the dataframe header
df.head()
YearPersonBudget
02018John20000
12019Kim30000
22020Bob40000

Find df entries by single value

Let’s assume that we would like to show all entries with budget greater than 30K. We’ll first go ahead and define a variable containing the filtering criteria. Then we’ll apply it on the dataframe and pull the selected records

# single condition
mask = df["Budget"]> 30000
df[mask]

Here’s the result:

YearPersonBudget
22020Bob40000

Select rows by multiple conditions

Next we’ll run through a bit more complex example that has a complex logical select criteria.

# Complex select criteria with boolean AND (&)
mask = (df["Budget"]>= 30000) & (df["Year"] == 2020)
df[mask]
YearPersonBudget
12019Kim30000

We can also use boolean OR as well.

# Using logical boolean OR
df[(df["Budget"]>= 30000) | (df["Year"] == 2020)]

The result will be:

YearPersonBudget
12019Kim30000
22020Bob40000

Subset with label (Loc)

Here’s a good example on filtering with boolean conditions with loc.

Find rows by index

You can also select specific rows or values in your dataframe by index as shown below. As a simple example, the code below will subset the first two rows according to row index.

df.iloc[[0,1],:]

The following subset will be returned

YearPersonBudget
02018John20000
12019Kim30000

Bonus: Create a quick bar plot of your data

Besides its very powerful data manipulation capabilities, Pandas has built in data visualization capabilities based on other Python library named MatplotLib. Here’s a simple bar chart that you can create with one line of code, the parameters are pretty much self explanatory.

df.plot(x="Year", y="Budget", kind="bar", title="Budget by year");

Leave a Comment