As part of the data scrubbing process for machine learning and data visualization, you might want to get rid of duplicated occurrences in your data set. In today’s post, we’ll learn how to use Python to quickly identify duplicate values in your data so that you can scrub them later as needed. If you’ve been struggling to remove repeated entries or values, read on.
Preparing the data
We’ll get started by creating a fictitious sales dataframe to practice on. Use the following code to create it:
# First off, we'll import the Pandas library
import pandas as pd
# Next, we'll create our test dataframe
sales = pd.DataFrame({"Year": [2018, 2019, 2019, 2020, 2020],
                      "Quarter": ["Q2", "Q4", "Q3", "Q4", "Q4"],
                      "Person": ["John", "Kim", "Kim", "John", "John"],
                      "Sales": [4800, 10000, 7000, 10000, 10000]})
The dataframe purposely contains duplicated values in the Person column, as well as one fully duplicated row, which we will identify and drop later on.
Check for duplicated values in one or more columns
We’ll start by identifying duplicates in a specific column (in our case, Person).
# Find any duplicated values in the Person column.
sales[sales.duplicated("Person", keep=False)]
The result will be:
| | Year | Quarter | Person | Sales |
|---|---|---|---|---|
| 0 | 2018 | Q2 | John | 4800 |
| 1 | 2019 | Q4 | Kim | 10000 |
| 2 | 2019 | Q3 | Kim | 7000 |
| 3 | 2020 | Q4 | John | 10000 |
| 4 | 2020 | Q4 | John | 10000 |
But what if we want to keep the first occurrence and flag only the 2nd and subsequent occurrences?
sales[sales.duplicated("Person", keep="first")]
| | Year | Quarter | Person | Sales |
|---|---|---|---|---|
| 2 | 2019 | Q3 | Kim | 7000 |
| 3 | 2020 | Q4 | John | 10000 |
| 4 | 2020 | Q4 | John | 10000 |
Note that you can also look for repeated values in more than one column. Pass a list of column labels as shown below:
sales[sales.duplicated(["Person", "Year"], keep="first")]
| | Year | Quarter | Person | Sales |
|---|---|---|---|---|
| 2 | 2019 | Q3 | Kim | 7000 |
| 4 | 2020 | Q4 | John | 10000 |
Find duplicated rows
So far we have only identified repeated values in one or more columns. What if we want to check for complete rows that are duplicated? Here we go:
# Show entire rows
sales[sales.duplicated(keep=False)]
| | Year | Quarter | Person | Sales |
|---|---|---|---|---|
| 3 | 2020 | Q4 | John | 10000 |
| 4 | 2020 | Q4 | John | 10000 |
As shown above, we can also identify duplicates from the 2nd occurrence onward:
# Keep the first occurrence
sales[sales.duplicated(keep="first")]
# Note that you can also keep the last occurrence and flag the others:
# sales[sales.duplicated(keep="last")]
| | Year | Quarter | Person | Sales |
|---|---|---|---|---|
| 4 | 2020 | Q4 | John | 10000 |
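For completeness, here is a quick sketch of the keep="last" variant mentioned in the comment above. It flags every duplicate except the final occurrence, so in our sample data only row 3 is marked (row 4 survives as the kept copy):

```python
import pandas as pd

sales = pd.DataFrame({"Year": [2018, 2019, 2019, 2020, 2020],
                      "Quarter": ["Q2", "Q4", "Q3", "Q4", "Q4"],
                      "Person": ["John", "Kim", "Kim", "John", "John"],
                      "Sales": [4800, 10000, 7000, 10000, 10000]})

# keep="last" marks all duplicated rows except the last occurrence
flagged = sales[sales.duplicated(keep="last")]
print(flagged)  # only row 3 is flagged
```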
Count duplicated rows
Use this snippet if you want to quickly calculate the number of duplicated rows:
# Count number of duplicated rows
len(sales[sales.duplicated(keep=False)])
# Here's an alternative
sales[sales.duplicated(keep=False)].shape[0]
Both will return the same value: 2.
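Since duplicated() returns a boolean Series, another common idiom is to sum it directly. Note the difference the keep argument makes: keep=False counts every row involved in a duplicate group, while the default keep="first" counts only the redundant extra copies:

```python
import pandas as pd

sales = pd.DataFrame({"Year": [2018, 2019, 2019, 2020, 2020],
                      "Quarter": ["Q2", "Q4", "Q3", "Q4", "Q4"],
                      "Person": ["John", "Kim", "Kim", "John", "John"],
                      "Sales": [4800, 10000, 7000, 10000, 10000]})

# Count all rows that belong to a duplicate group
all_dups = sales.duplicated(keep=False).sum()  # 2 in our sample data

# Count only the redundant copies (first occurrence not counted)
extra_copies = sales.duplicated().sum()  # 1 in our sample data

print(all_dups, extra_copies)
```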
Drop duplicates from your dataframe
Finally, pandas offers a very easy way to get rid of redundant repeated rows:
# Drop the repeated rows
sales.drop_duplicates()
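Keep in mind that drop_duplicates() returns a new dataframe rather than modifying sales in place, so assign the result to a variable (or pass inplace=True). A quick sketch of the related options: the subset parameter restricts the duplicate check to specific columns, and keep controls which copy survives:

```python
import pandas as pd

sales = pd.DataFrame({"Year": [2018, 2019, 2019, 2020, 2020],
                      "Quarter": ["Q2", "Q4", "Q3", "Q4", "Q4"],
                      "Person": ["John", "Kim", "Kim", "John", "John"],
                      "Sales": [4800, 10000, 7000, 10000, 10000]})

# drop_duplicates returns a new dataframe; reassign to keep the result
deduped = sales.drop_duplicates()  # drops row 4, the exact copy of row 3

# Restrict the check to the Person column and keep each person's last row
by_person = sales.drop_duplicates(subset=["Person"], keep="last")

print(len(deduped), len(by_person))
```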