How to show and count duplicated values and rows in Pandas?

As part of the data scrubbing process for machine learning and data visualization, you might want to get rid of duplicated occurrences in your data set. In today’s post, we’ll learn how to use Python and pandas to quickly identify duplicate values and rows in your data, so that you can scrub them later as needed. If removing repeated entries or values hasn’t been working for you, read on.

Preparing the data

We’ll get started by creating a fictitious sales dataframe. Run the following code to create it:

# First off, we'll import the Pandas library
import pandas as pd
# Next, we'll create our test dataframe
sales = pd.DataFrame({"Year": [2018, 2019, 2019, 2020, 2020],
                      "Quarter": ["Q2", "Q4", "Q3", "Q4", "Q4"],
                      "Person": ["John", "Kim", "Kim", "John", "John"],
                      "Sales": [4800, 10000, 7000, 10000, 10000]})

Purposely, the dataframe has some duplicated values in the Person column, as well as one fully duplicated row, which we will later identify and drop from our data.

Check for duplicated values in column/s

We’ll start by identifying dups in a specific column (in our case that would be Person).

# Find any duplicated values in the Person column
sales[sales.duplicated("Person", keep=False)]

The result will be:

   Year Quarter Person  Sales
0  2018      Q2   John   4800
1  2019      Q4    Kim  10000
2  2019      Q3    Kim   7000
3  2020      Q4   John  10000
4  2020      Q4   John  10000
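To see why every row was flagged here, it can help to count how often each value actually occurs. Below is a minimal sketch using value_counts; the dataframe is recreated so the snippet stands on its own:

```python
import pandas as pd

# Recreate the example dataframe
sales = pd.DataFrame({"Year": [2018, 2019, 2019, 2020, 2020],
                      "Quarter": ["Q2", "Q4", "Q3", "Q4", "Q4"],
                      "Person": ["John", "Kim", "Kim", "John", "John"],
                      "Sales": [4800, 10000, 7000, 10000, 10000]})

# Count occurrences of each value in the Person column
person_counts = sales["Person"].value_counts()
print(person_counts)  # John appears 3 times, Kim appears 2 times
```

Since every name appears more than once, keep=False marks all five rows as duplicates.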

But what if we want to keep the first occurrence and only flag the second occurrence and beyond?


sales[sales.duplicated("Person", keep="first")]

   Year Quarter Person  Sales
2  2019      Q3    Kim   7000
3  2020      Q4   John  10000
4  2020      Q4   John  10000

Note that you can also look for repeated values across more than one column. You’ll need to pass a list of the column labels, as shown below:

sales[sales.duplicated(["Person", "Year"], keep="first")]

   Year Quarter Person  Sales
2  2019      Q3    Kim   7000
4  2020      Q4   John  10000
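If you also want to know how often each combination of columns repeats, one option is to group by the same labels and take the group sizes. A sketch, assuming the same sales dataframe:

```python
import pandas as pd

sales = pd.DataFrame({"Year": [2018, 2019, 2019, 2020, 2020],
                      "Quarter": ["Q2", "Q4", "Q3", "Q4", "Q4"],
                      "Person": ["John", "Kim", "Kim", "John", "John"],
                      "Sales": [4800, 10000, 7000, 10000, 10000]})

# Size of each (Person, Year) group; any size above 1 means duplicates
pair_counts = sales.groupby(["Person", "Year"]).size()
print(pair_counts)
```

Here the (Kim, 2019) and (John, 2020) pairs each occur twice, which is exactly why rows 2 and 4 were flagged above.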

Find duplicated rows

So far we have only identified repeated values in one or more columns. What if we want to find entire rows that are duplicated? Here we go:

# Show entire rows that are duplicated
sales[sales.duplicated(keep=False)]

   Year Quarter Person  Sales
3  2020      Q4   John  10000
4  2020      Q4   John  10000

As shown above, we can also identify dups only from the 2nd occurrence onward:

# Keep the first occurrence and flag the rest
sales[sales.duplicated(keep="first")]
# Note that you can also keep the last occurrence and flag the earlier ones
# sales[sales.duplicated(keep="last")]

   Year Quarter Person  Sales
4  2020      Q4   John  10000
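For completeness, here is what the keep="last" variant mentioned in the comment returns: it flags every copy except the final one, so row 3 is reported instead of row 4. A self-contained sketch:

```python
import pandas as pd

sales = pd.DataFrame({"Year": [2018, 2019, 2019, 2020, 2020],
                      "Quarter": ["Q2", "Q4", "Q3", "Q4", "Q4"],
                      "Person": ["John", "Kim", "Kim", "John", "John"],
                      "Sales": [4800, 10000, 7000, 10000, 10000]})

# keep="last" marks all copies except the final occurrence
earlier_dups = sales[sales.duplicated(keep="last")]
print(earlier_dups)  # row 3: 2020, Q4, John, 10000
```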

Count duplicated rows

Use this snippet if you want to quickly calculate the number of duplicated rows:

# Count the number of duplicated rows
len(sales[sales.duplicated(keep=False)])

# Here's an alternative
sales[sales.duplicated(keep=False)].shape[0]

Both will return the same value: 2
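Since duplicated returns a boolean Series, another common idiom is to sum the mask directly; each True counts as 1. A quick sketch on the same data:

```python
import pandas as pd

sales = pd.DataFrame({"Year": [2018, 2019, 2019, 2020, 2020],
                      "Quarter": ["Q2", "Q4", "Q3", "Q4", "Q4"],
                      "Person": ["John", "Kim", "Kim", "John", "John"],
                      "Sales": [4800, 10000, 7000, 10000, 10000]})

# Sum of the boolean mask = number of rows flagged as duplicates
n_dups = sales.duplicated(keep=False).sum()
print(n_dups)  # 2
```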

Drop duplicates from your dataframe

Finally, pandas offers a very easy way to get rid of redundant repeated rows:

# Drop the repeated rows (keeping the first occurrence of each)
sales.drop_duplicates()
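Keep in mind that drop_duplicates returns a new dataframe rather than modifying sales in place, and it accepts the same subset and keep arguments as duplicated. A sketch illustrating both points:

```python
import pandas as pd

sales = pd.DataFrame({"Year": [2018, 2019, 2019, 2020, 2020],
                      "Quarter": ["Q2", "Q4", "Q3", "Q4", "Q4"],
                      "Person": ["John", "Kim", "Kim", "John", "John"],
                      "Sales": [4800, 10000, 7000, 10000, 10000]})

# Drop fully duplicated rows; the first copy of each is kept
deduped = sales.drop_duplicates()

# Keep only the first row per Person, using the subset argument
one_per_person = sales.drop_duplicates(subset="Person", keep="first")

print(len(sales), len(deduped), len(one_per_person))  # 5 4 2
```

Assign the result back (e.g. sales = sales.drop_duplicates()) if you want to keep the cleaned version.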
