As part of the data scrubbing process for machine learning and data visualization, you might want to get rid of duplicated occurrences in your data set. In today’s post, we’ll learn how to use Python to quickly identify duplicate values in your data so that you can scrub them later as needed. If you’ve been struggling to remove repeated entries or values, read on.
Preparing the data
We’ll get started by creating a fictitious sales dataframe to practice on. Use the following code to create it:
# First off, we'll import the Pandas library
import pandas as pd
# Next, we'll create our test dataframe
sales = pd.DataFrame({"Year": [2018, 2019, 2019, 2020, 2020],
                      "Quarter": ["Q2", "Q4", "Q3", "Q4", "Q4"],
                      "Person": ["John", "Kim", "Kim", "John", "John"],
                      "Sales": [4800, 10000, 7000, 10000, 10000]})
The dataframe purposely contains duplicated values in the Person column, as well as one fully duplicated row, which we will identify and drop later on.
Check for duplicated values in one or more columns
We’ll start by identifying duplicates in a specific column (in our case, Person).
# Find any duplicated values in the Person column.
sales[sales.duplicated("Person", keep=False)]
The result will be:
| | Year | Quarter | Person | Sales |
|---|---|---|---|---|
| 0 | 2018 | Q2 | John | 4800 |
| 1 | 2019 | Q4 | Kim | 10000 |
| 2 | 2019 | Q3 | Kim | 7000 |
| 3 | 2020 | Q4 | John | 10000 |
| 4 | 2020 | Q4 | John | 10000 |
But what if we want to keep the first occurrence and flag only the 2nd and subsequent occurrences?
sales[sales.duplicated("Person", keep="first")]
| | Year | Quarter | Person | Sales |
|---|---|---|---|---|
| 2 | 2019 | Q3 | Kim | 7000 |
| 3 | 2020 | Q4 | John | 10000 |
| 4 | 2020 | Q4 | John | 10000 |
Note that you can also look for repeated values in more than one column. Pass a list of column labels as shown below:
sales[sales.duplicated(["Person", "Year"], keep="first")]
| | Year | Quarter | Person | Sales |
|---|---|---|---|---|
| 2 | 2019 | Q3 | Kim | 7000 |
| 4 | 2020 | Q4 | John | 10000 |
Find duplicated rows
So far we have only identified repeated values in one or more columns. What if we want to check for complete rows that are duplicated? Here we go:
# Show entire rows
sales[sales.duplicated(keep=False)]
| | Year | Quarter | Person | Sales |
|---|---|---|---|---|
| 3 | 2020 | Q4 | John | 10000 |
| 4 | 2020 | Q4 | John | 10000 |
As shown above, we can also identify duplicates from the 2nd occurrence onward:
# Keep the first occurrence
sales[sales.duplicated(keep="first")]
# Note that you can also keep the last occurrence and flag the others:
# sales[sales.duplicated(keep="last")]
| | Year | Quarter | Person | Sales |
|---|---|---|---|---|
| 4 | 2020 | Q4 | John | 10000 |
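For completeness, here is a quick sketch of the keep="last" variant mentioned in the comment above. It flags every duplicate except the final occurrence, so in our sample data only row 3 is marked (row 4 survives as the kept copy):

```python
import pandas as pd

sales = pd.DataFrame({"Year": [2018, 2019, 2019, 2020, 2020],
                      "Quarter": ["Q2", "Q4", "Q3", "Q4", "Q4"],
                      "Person": ["John", "Kim", "Kim", "John", "John"],
                      "Sales": [4800, 10000, 7000, 10000, 10000]})

# keep="last" marks all duplicated rows except the last occurrence
flagged = sales[sales.duplicated(keep="last")]
print(flagged)  # only row 3 is flagged
```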
Count duplicated rows
Use this snippet if you want to quickly calculate the number of duplicated rows:
# Count number of duplicated rows
len(sales[sales.duplicated(keep=False)])
# Here's an alternative
sales[sales.duplicated(keep=False)].shape[0]
Both will return the same value: 2.
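Since duplicated() returns a boolean Series, another common idiom is to sum it directly. Note the difference the keep argument makes: keep=False counts every row involved in a duplicate group, while the default keep="first" counts only the redundant extra copies:

```python
import pandas as pd

sales = pd.DataFrame({"Year": [2018, 2019, 2019, 2020, 2020],
                      "Quarter": ["Q2", "Q4", "Q3", "Q4", "Q4"],
                      "Person": ["John", "Kim", "Kim", "John", "John"],
                      "Sales": [4800, 10000, 7000, 10000, 10000]})

# Count all rows that belong to a duplicate group
all_dups = sales.duplicated(keep=False).sum()  # 2 in our sample data

# Count only the redundant copies (first occurrence not counted)
extra_copies = sales.duplicated().sum()  # 1 in our sample data

print(all_dups, extra_copies)
```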
Drop duplicates from your dataframe
Finally, pandas offers a very easy way to get rid of redundant repeated rows:
# Drop the repeated rows
sales.drop_duplicates()
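Keep in mind that drop_duplicates() returns a new dataframe rather than modifying sales in place, so assign the result to a variable (or pass inplace=True). A quick sketch of the related options: the subset parameter restricts the duplicate check to specific columns, and keep controls which copy survives:

```python
import pandas as pd

sales = pd.DataFrame({"Year": [2018, 2019, 2019, 2020, 2020],
                      "Quarter": ["Q2", "Q4", "Q3", "Q4", "Q4"],
                      "Person": ["John", "Kim", "Kim", "John", "John"],
                      "Sales": [4800, 10000, 7000, 10000, 10000]})

# drop_duplicates returns a new dataframe; reassign to keep the result
deduped = sales.drop_duplicates()  # drops row 4, the exact copy of row 3

# Restrict the check to the Person column and keep each person's last row
by_person = sales.drop_duplicates(subset=["Person"], keep="last")

print(len(deduped), len(by_person))
```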