Quite often as part of your data wrangling, you’ll invariably need to subset or filter your dataset and select for specific rows by one or multiple values. In today’s quick tutorial i’ll show how to do exactly that.
Creating your data
We’ll start by creating a very simple data set which we’ll use for learning purposes. As we typically do, we’ll first go ahead and import the Pandas library to our work environment and then create a dataframe from a dictionary as shown below.
# import Python Pandas
import pandas as pd
# Now, let's create the dataframe
df = pd.DataFrame({"Year": [ 2018, 2019, 2020] ,
"Person": ["John", "Kim", "Bob"],
"Budget": [20000, 30000, 40000]})
'Show the dataframe header
df.head()
Year | Person | Budget | |
---|---|---|---|
0 | 2018 | John | 20000 |
1 | 2019 | Kim | 30000 |
2 | 2020 | Bob | 40000 |
Find df entries by single value
Let’s assume that we would like to show all entries with budget greater than 30K. We’ll first go ahead and define a variable containing the filtering criteria. Then we’ll apply it on the dataframe and pull the selected records
# single condition
mask = df["Budget"]> 30000
df[mask]
Here’s the result:
Year | Person | Budget | |
---|---|---|---|
2 | 2020 | Bob | 40000 |
Select rows by multiple conditions
Next we’ll run through a bit more complex example that has a complex logical select criteria.
# Complex select criteria with boolean AND (&)
mask = (df["Budget"]>= 30000) & (df["Year"] == 2020)
df[mask]
Year | Person | Budget | |
---|---|---|---|
1 | 2019 | Kim | 30000 |
We can also use boolean OR as well.
# Using logical boolean OR
df[(df["Budget"]>= 30000) | (df["Year"] == 2020)]
The result will be:
Year | Person | Budget | |
---|---|---|---|
1 | 2019 | Kim | 30000 |
2 | 2020 | Bob | 40000 |
Subset with label (Loc)
Here’s a good example on filtering with boolean conditions with loc.
Find rows by index
You can also select specific rows or values in your dataframe by index as shown below. As a simple example, the code below will subset the first two rows according to row index.
df.iloc[[0,1],:]
The following subset will be returned
Year | Person | Budget | |
---|---|---|---|
0 | 2018 | John | 20000 |
1 | 2019 | Kim | 30000 |
Bonus: Create a quick bar plot of your data
Besides its very powerful data manipulation capabilities, Pandas has built in data visualization capabilities based on other Python library named MatplotLib. Here’s a simple bar chart that you can create with one line of code, the parameters are pretty much self explanatory.
df.plot(x="Year", y="Budget", kind="bar", title="Budget by year");