In today’s recipe we’ll learn how you can subset a DataFrame by selecting multiple columns.
Preparations
We’ll first create a simple DataFrame that we’ll use throughout the recipe:
# Import Pandas
import pandas as pd
# Create Dataframe from a dictionary
target = pd.DataFrame({"quarter": [ 1, 2, 3, 4,4,3,2,1],
"person": ["Debbie", "Kim", "Dorothy", "Tim"] * 2,
"area":['North', 'South', 'West','Midwest']* 2,
"budget": [43654, 35795, 20860, 25390, 41575, 31964, 31284, 42118]})
target.head()
Here’s our DataFrame
quarter | person | area | budget | |
---|---|---|---|---|
0 | 1 | Debbie | North | 43654 |
1 | 2 | Kim | South | 35795 |
2 | 3 | Dorothy | West | 20860 |
3 | 4 | Tim | Midwest | 25390 |
4 | 4 | Debbie | North | 4157 |
Selecting multiple columns in a Pandas Dataframe
We’ll look into the following cases:
- Subset by column names / label
- Subset based on a list of columns
- Select by column position / index
- Select by condition / specific value
Select by label
We’ll pass the column names as a list as shown below:
# by name
target[['area', 'budget']]
Using the loc indexer
# using the loc indexer - by label
target.loc[:,['area', 'budget']]
Subset multiple cols by list of names
Pretty similar than above, but for better readibility we’ll define the list as a variable upfront.
# predefined Python list
my_cols = ['area', 'budget']
target[my_cols]
Select columns by index / position
In this case we’ll use the iloc indexer to make selections according to the position of the column which is expressed as an integer.
# using the column index
target.iloc[:,2:4]
All the methods discussed above will render the same result
area | budget | |
---|---|---|
0 | North | 43654 |
1 | South | 35795 |
2 | West | 20860 |
3 | Midwest | 25390 |
4 | North | 41575 |
5 | South | 31964 |
6 | West | 31284 |
7 | Midwest | 42118 |
Select by condition / value
Let’s now assume that we want to subset the df by conditions on the column names. This would be useful if you have a very wide data set and you would like to select only columns matching a specific pattern in their name.
In our example we would like to select our columns which name contains the pattern ‘a’.
target.loc[:,target.columns.str.contains('a')]
If you would like to filter the DataFrame rows by one or more conditions, look into this tutorial.