Remove duplicate words from dataframe python

If you're looking to get rid of consecutive duplicates only, this should suffice:

df['Desired'] = df['Current'].str.replace(r'\b(\w+)(\s+\1)+\b', r'\1', regex=True)
df

           Current          Desired
0       Racoon Dog       Racoon Dog
1          Cat Cat              Cat
2  Dog Dog Dog Dog              Dog
3  Rat Fox Chicken  Rat Fox Chicken

Details

\b        # word boundary
(\w+)     # 1st capture group: a single word
(         # 2nd group
\s+       # one or more whitespace characters
\1        # backreference to the 1st group (the same word again)
)+        # one or more repeats of the 2nd group
\b        # word boundary

Regex from here.
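To see what the pattern does outside of pandas, here is a minimal sketch using only the standard re module. The helper name collapse_repeats and the extra sample strings are made up for illustration; the pattern itself is the one shown above.

```python
import re

# Same pattern as above: a word followed by one or more repeats of itself.
CONSECUTIVE_DUPES = re.compile(r'\b(\w+)(\s+\1)+\b')

def collapse_repeats(text):
    """Collapse a run of an immediately repeated word into one occurrence."""
    return CONSECUTIVE_DUPES.sub(r'\1', text)

samples = ['Racoon Dog', 'Cat Cat', 'Dog Dog Dog Dog', 'Dog Cat Dog', 'Dog dog']
results = [collapse_repeats(s) for s in samples]
print(results)
# ['Racoon Dog', 'Cat', 'Dog', 'Dog Cat Dog', 'Dog dog']
```

Two things stand out: 'Dog Cat Dog' is left alone because the repeats are not adjacent, and 'Dog dog' is left alone because the backreference is case-sensitive. When the same pattern goes through Series.str.replace, recent pandas versions need regex=True for it to be treated as a regular expression.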


To remove non-consecutive duplicates, I'd suggest a solution involving the OrderedDict data structure:

from collections import OrderedDict

df['Desired'] = (df['Current'].str.split()
                              .apply(lambda x: OrderedDict.fromkeys(x).keys())
                              .str.join(' '))
df

           Current          Desired
0       Racoon Dog       Racoon Dog
1          Cat Cat              Cat
2  Dog Dog Dog Dog              Dog
3  Rat Fox Chicken  Rat Fox Chicken
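The sample data above has no non-consecutive repeats, so the effect is easier to see on other inputs. The sketch below uses plain strings and relies on the fact that a regular dict preserves insertion order on Python 3.7+, so dict.fromkeys can stand in for OrderedDict.fromkeys. The function name unique_words and the sample strings are assumptions for illustration.

```python
def unique_words(text):
    """Drop repeated words, keeping the first occurrence of each in order."""
    # dict.fromkeys keeps insertion order on Python 3.7+.
    return ' '.join(dict.fromkeys(text.split()))

samples = ['Dog Cat Dog', 'Rat Fox Rat Chicken Fox', 'Cat Cat']
deduped = [unique_words(s) for s in samples]
print(deduped)
# ['Dog Cat', 'Rat Fox Chicken', 'Cat']
```

A function like this could presumably be applied to the column with df['Current'].apply(unique_words), which avoids the intermediate split/join steps on the Series.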

Need to remove duplicates from a Pandas DataFrame?

If so, you can apply the following syntax to remove duplicates from your DataFrame:

df.drop_duplicates()

In the next section, you’ll see the steps to apply this syntax in practice.

Step 1: Gather the data that contains the duplicates

Firstly, you’ll need to gather the data that contains the duplicates.

For example, let’s say that you have the following data about boxes, where each box may have a different color or shape:

Color Shape
Green Rectangle
Green Rectangle
Green Square
Blue Rectangle
Blue Square
Red Square
Red Square
Red Rectangle

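As you can see, some rows are exact duplicates (for example, Green Rectangle and Red Square each appear twice), and values also repeat within each column.

Before you remove those duplicates, you’ll need to create a Pandas DataFrame to capture that data in Python.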

Step 2: Create Pandas DataFrame

Next, create a Pandas DataFrame using this code:

import pandas as pd

boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle']
        }
df = pd.DataFrame(boxes, columns = ['Color', 'Shape'])

print(df)

Once you run the code in Python, you’ll get the same values as in step 1:

   Color      Shape
0  Green  Rectangle
1  Green  Rectangle
2  Green     Square
3   Blue  Rectangle
4   Blue     Square
5    Red     Square
6    Red     Square
7    Red  Rectangle

Step 3: Remove duplicates from Pandas DataFrame

To remove duplicates from the DataFrame, you may use the following syntax that you saw at the beginning of this guide:

df.drop_duplicates()

Let’s say that you want to remove the duplicates across the two columns of Color and Shape.

In that case, apply the code below in order to remove those duplicates:

import pandas as pd

boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle']
        }
df = pd.DataFrame(boxes, columns = ['Color', 'Shape'])

df_duplicates_removed = df.drop_duplicates()
print(df_duplicates_removed)

As you can see, only the distinct values across the two columns remain:

   Color      Shape
0  Green  Rectangle
2  Green     Square
3   Blue  Rectangle
4   Blue     Square
5    Red     Square
7    Red  Rectangle

But what if you want to remove the duplicates on a specific column, such as the Color column?

In that case, you can specify the column name using a subset:

df.drop_duplicates(subset=['Color'])

So the full Python code to remove the duplicates for the Color column would look like this:

import pandas as pd

boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle']
        }
df = pd.DataFrame(boxes, columns = ['Color', 'Shape'])

df_duplicates_removed = df.drop_duplicates(subset=['Color'])
print(df_duplicates_removed)

Here is the result:

   Color      Shape
0  Green  Rectangle
3   Blue  Rectangle
5    Red     Square
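By default, drop_duplicates keeps the first row it encounters for each distinct value. The keep parameter changes that. The sketch below reuses the boxes data and assumes pandas is installed; the variable name last_per_color is made up for illustration.

```python
import pandas as pd

boxes = {'Color': ['Green', 'Green', 'Green', 'Blue', 'Blue', 'Red', 'Red', 'Red'],
         'Shape': ['Rectangle', 'Rectangle', 'Square', 'Rectangle', 'Square',
                   'Square', 'Square', 'Rectangle']}
df = pd.DataFrame(boxes)

# keep='last' retains the last row seen for each Color instead of the first.
last_per_color = df.drop_duplicates(subset=['Color'], keep='last')
print(last_per_color)
```

With this data, the rows that remain are the ones at index 2, 4, and 7, so the Shape kept for each color differs from the default result.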

You may want to check the Pandas Documentation to learn more about removing duplicates from a DataFrame.

Sometimes, when working with a Python list of strings, we need to remove the duplicated words from each string. This comes up often when cleaning data. Let’s discuss a few ways in which this task can be performed.

Method #1: Using set() + split() + loop

The combination of the above functions can be used to perform this task. We first split each string into its individual words and then apply set() to remove the duplicates. Note that a set is unordered, so the original word order is not preserved and the order shown in the output may vary between runs.

Python3

# Each string holds comma-separated words
test_list = ['gfg, best, gfg', 'I, am, I', 'two, two, three']

print("The original list is : " + str(test_list))

# Split each string on ", " and keep only the distinct words
res = []
for strs in test_list:
    res.append(set(strs.split(", ")))

print("The list after duplicate words removal is : " + str(res))

Output:

The original list is : ['gfg, best, gfg', 'I, am, I', 'two, two, three']
The list after duplicate words removal is : [{'best', 'gfg'}, {'I', 'am'}, {'three', 'two'}]

Method #2: Using list comprehension + set() + split()

This method is similar to the one above. The difference is that we use a list comprehension instead of an explicit loop to perform the iteration.

Python3

# Each string holds comma-separated words
test_list = ['gfg, best, gfg', 'I, am, I', 'two, two, three']

print("The original list is : " + str(test_list))

# Build one set of distinct words per string
res = [set(strs.split(", ")) for strs in test_list]

print("The list after duplicate words removal is : " + str(res))

Output:

The original list is : ['gfg, best, gfg', 'I, am, I', 'two, two, three']
The list after duplicate words removal is : [{'best', 'gfg'}, {'I', 'am'}, {'three', 'two'}]

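Method #3: Using sorted() + index() + split()

Sorting the distinct words by their first position in the original word list (words.index) restores the original word order, which set() alone does not keep.

Python3

# Each string holds space-separated words
test_list = ['gfg best gfg', 'I am I', 'two two three']

for i in test_list:
    words = i.split()
    # Sort the distinct words by first occurrence to keep the original order
    print(" ".join(sorted(set(words), key=words.index)), end=" ")

Output:

gfg best I am two three 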
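Sorting by words.index rescans the word list for every distinct word. As an alternative sketch (not one of the methods above), a single pass with a set of already-seen words also keeps the original order and returns the result instead of printing it. The function name dedupe_words and its sep parameter are assumptions for illustration.

```python
def dedupe_words(text, sep=", "):
    """Return the words of `text` with later repeats removed, order kept."""
    seen = set()
    unique = []
    for word in text.split(sep):
        if word not in seen:
            seen.add(word)       # remember the word so repeats are skipped
            unique.append(word)  # first occurrence: keep it
    return unique

test_list = ['gfg, best, gfg', 'I, am, I', 'two, two, three']
res = [dedupe_words(s) for s in test_list]
print(res)
# [['gfg', 'best'], ['I', 'am'], ['two', 'three']]
```

Because each result is a list rather than a set, the output order is deterministic.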


How do I remove repeating words in Python?

1) Split the input sentence into words on spaces. 2) If the input is a list of strings, first join them into one string so that all words can be split together. 3) Create a dictionary using Counter, with the words as keys and their frequencies as values. 4) Join the keys, which are unique, to form a single string.
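The steps above can be sketched as follows for a single sentence. The function name remove_repeated_words and the sample sentence are made up for illustration. On Python 3.7+, Counter keeps its keys in first-seen order, so the word order survives.

```python
from collections import Counter

def remove_repeated_words(sentence):
    """Keep the first occurrence of each word in a space-separated sentence."""
    # Counter maps each word to its frequency; its keys are the unique words.
    counts = Counter(sentence.split())
    return " ".join(counts.keys())

sentence = "Python is great and Java is also great"
cleaned = remove_repeated_words(sentence)
print(cleaned)
# Python is great and Java also
```

The frequencies themselves are not needed for removal, so a plain dict.fromkeys would do the same job. Counter is used here only because the steps name it.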

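How do you remove duplicates from a DataFrame in Python?

Call drop_duplicates() on the DataFrame. By default (keep='first'), the first occurrence of each duplicated row is retained. You can set keep=False to remove every row that has a duplicate, including the first occurrence. For example: df.drop_duplicates(keep=False).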
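A minimal sketch of keep=False, reusing the boxes data from the tutorial above and assuming pandas is installed. The variable name only_unique_rows is made up for illustration.

```python
import pandas as pd

boxes = {'Color': ['Green', 'Green', 'Green', 'Blue', 'Blue', 'Red', 'Red', 'Red'],
         'Shape': ['Rectangle', 'Rectangle', 'Square', 'Rectangle', 'Square',
                   'Square', 'Square', 'Rectangle']}
df = pd.DataFrame(boxes)

# keep=False drops every row that has a duplicate, including the first copy.
only_unique_rows = df.drop_duplicates(keep=False)
print(only_unique_rows)
```

Here the two Green Rectangle rows and the two Red Square rows are all removed, leaving the rows at index 2, 3, 4, and 7.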

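How do you remove duplicates from a DataFrame in Python based on column?

To drop duplicate rows based on specific columns, pass those columns as a subset, for example df.drop_duplicates(subset=['Color']). To drop duplicate columns instead, use df.T.drop_duplicates().T, which removes all columns that hold the same data regardless of their column names.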
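The transpose approach can be sketched on a small frame. The column names and values below are invented for illustration, and pandas is assumed to be installed.

```python
import pandas as pd

# Columns 'a' and 'b' hold identical data under different names.
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [1, 2, 3],
                   'c': [4, 5, 6]})

# Transpose so columns become rows, drop duplicate rows, transpose back.
deduped = df.T.drop_duplicates().T
print(deduped)
```

The first of the identical columns ('a') is kept and 'b' is dropped. Transposing a frame with mixed column types can change dtypes, so this sketch is safest on frames whose columns share one type.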

How do you find duplicate words in Python?

Python

string = "big black bug bit a big black dog on his big black nose"
# Convert the string to lowercase
string = string.lower()
# Split the string into words using the built-in split() function
words = string.split(" ")
print("Duplicate words in a given string : ")
for i in range(0, len(words)):
    count = 1
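
The listing above ends after initializing the counter. As a separate sketch, and not a reconstruction of the missing lines, the same question can be answered with collections.Counter. The function name find_duplicate_words is made up for illustration.

```python
from collections import Counter

def find_duplicate_words(text):
    """Return the words that occur more than once, in first-seen order."""
    # Lowercase first so that 'Big' and 'big' count as the same word.
    counts = Counter(text.lower().split())
    return [word for word, n in counts.items() if n > 1]

string = "big black bug bit a big black dog on his big black nose"
duplicates = find_duplicate_words(string)
print("Duplicate words in a given string : ")
print(duplicates)
# ['big', 'black']
```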