How do you filter out similar text in python?

I have a dataframe, where one column consists of strings:

import pandas as pd

d = pd.DataFrame({'text': ["hello, this is a test. we want to remove entries, where the text is similar to other texts",
                           "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
                           "where are you going",
                           "i'm going to the zoo to pet the animals",
                           "where are you going jane"]})

Question: Some of these strings are very similar and differ only in, e.g., one or two words. I want to remove all "duplicates", i.e. all texts that are similar to another text. In the example above, rows 1 and 2 are alike, so I want to keep only row 1. Likewise, rows 3 and 5 are similar, and I want to keep only row 3. The actual dataframe has around 100k rows.

My attempt: I figured a good starting point is to convert the strings into sets for easy and efficient comparison:

d["text"].str.split().apply(set)

Next, I'd write a function that compares each row to all the others and removes it if it is at least 90% similar to the others. Here is how I have done it:

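def find_duplicates(df):
    # Turn each text into a set of whitespace-separated tokens
    df = df.str.split().apply(set)
    ls_duplicates = []
    for i in range(len(df)):
        doc_i = df.iloc[i]
        for j in range(i+1, len(df)):
            doc_j = df.iloc[j]
            # Share of doc_i's tokens that also occur in doc_j
            score = len(doc_i.intersection(doc_j)) / len(doc_i)
            if score > 0.9:
                # Note: this records i (the earlier row), once per matching j
                ls_duplicates.append(i)
    return ls_duplicates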

find_duplicates(d['text'])

This works for my purposes, but it runs very slowly. Is there a way to optimize it?
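One possible direction, sketched under a few assumptions: build the token sets once as a plain Python list, skip rows that are already flagged, and skip pairs whose lengths make a score above the threshold impossible. The function name find_near_duplicates and the threshold parameter are hypothetical, and unlike the loop above this sketch flags the later row of each pair so that the first occurrence is kept. It is still quadratic in the worst case, so it is a constant-factor improvement rather than a scalable solution.

```python
def find_near_duplicates(texts, threshold=0.9):
    """Return positions of texts that look like near-duplicates of an earlier text."""
    # Build the token sets once, as a plain list (avoids repeated .iloc lookups).
    token_sets = [set(t.split()) for t in texts]
    duplicates = set()
    for i, doc_i in enumerate(token_sets):
        if i in duplicates or not doc_i:
            continue  # already flagged, or empty text (would divide by zero)
        min_len = threshold * len(doc_i)
        for j in range(i + 1, len(token_sets)):
            doc_j = token_sets[j]
            # Cheap length check: the overlap can never exceed len(doc_j).
            if j in duplicates or len(doc_j) <= min_len:
                continue
            if len(doc_i & doc_j) / len(doc_i) > threshold:
                duplicates.add(j)  # flag the later row, keep the earlier one
    return sorted(duplicates)


texts = [
    "hello, this is a test. we want to remove entries, where the text is similar to other texts",
    "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
    "where are you going",
    "i'm going to the zoo to pet the animals",
    "where are you going jane",
]
dupes = find_near_duplicates(texts)
kept = [t for pos, t in enumerate(texts) if pos not in set(dupes)]
```

With the dataframe from the question, the same function could be called as find_near_duplicates(d["text"]), and the flagged rows dropped with d.drop(d.index[dupes]).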

Given a list of strings, the task is to write a Python program that keeps only the strings written in a single case, i.e. entirely uppercase or entirely lowercase.

Examples:

Input : test_list = ["GFG", "Geeks", "best", "FOr", "all", "GEEKS"]
Output : ['GFG', 'best', 'all', 'GEEKS']
Explanation : GFG is all uppercase, best is all lowercase.

Input : test_list = ["GFG", "Geeks", "best"]
Output : ['GFG', 'best']
Explanation : GFG is all uppercase, best is all lowercase.
 

Method #1 : Using islower() + isupper() + list comprehension

Here, we check whether each string is entirely lowercase or entirely uppercase using islower() and isupper(), and a list comprehension iterates over the strings.

Python3

test_list = ["GFG", "Geeks",
             "best", "FOr", "all", "GEEKS"]

print("The original list is : " + str(test_list))

# keep only the strings that are entirely lowercase or entirely uppercase
res = [sub for sub in test_list if sub.islower() or sub.isupper()]

print("Strings with same case : " + str(res))

Output:

The original list is : ['GFG', 'Geeks', 'best', 'FOr', 'all', 'GEEKS']
Strings with same case : ['GFG', 'best', 'all', 'GEEKS']

Method #2 : Using islower() + isupper() + filter() + lambda

Here, the filtering is done with filter() and a lambda function. The rest of the logic is the same as in the method above.

Python3

test_list = ["GFG", "Geeks", "best",
             "FOr", "all", "GEEKS"]

print("The original list is : " + str(test_list))

# filter() keeps the strings for which the lambda returns True
res = list(filter(lambda sub: sub.islower() or sub.isupper(), test_list))

print("Strings with same case : " + str(res))

Output:

The original list is : ['GFG', 'Geeks', 'best', 'FOr', 'all', 'GEEKS']
Strings with same case : ['GFG', 'best', 'all', 'GEEKS']

The time and space complexity are the same for both methods:

Time Complexity: O(n)

Space Complexity: O(n)


How do you filter text in Python?

filter() is a built-in Python function. It can filter the items of any iterable, such as a string, a list, or the keys of a dictionary, based on a condition. It keeps an item when the condition returns true and discards it when the condition returns false.
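As a small illustration of filtering text by a condition, here is a sketch using made-up sample strings (the lines list and its contents are assumptions for the example only). It also shows that passing None as the function keeps only the truthy items, which drops empty strings.

```python
lines = ["error: disk full", "ok", "error: timeout", ""]

# Keep only the strings for which the condition returns True.
errors = list(filter(lambda s: s.startswith("error"), lines))

# With None as the function, filter() keeps the truthy items,
# so the empty string is discarded.
non_empty = list(filter(None, lines))

print(errors)
print(non_empty)
```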

How do I find similar text in Python?

import string
def match(a, b):
    a, b = a.lower(), b..
compare takes two strings and returns a positive integer.
You can edit the allowed variable in compare; it indicates how large a range to search through. ...
length indicates how many results you want, i.e. the items most similar to the input string.
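The match and compare functions in the snippet above are truncated, so they are not reconstructed here. As a substitute technique, the standard library's difflib module offers similar functionality: get_close_matches returns the n candidates most similar to an input string, and SequenceMatcher.ratio() gives a similarity score between 0 and 1. The sample strings below are illustrative only.

```python
from difflib import SequenceMatcher, get_close_matches

candidates = ["ape", "apple", "peach", "puppy"]

# n limits how many results are returned; cutoff is the minimum
# similarity (0 to 1) a candidate needs in order to be included.
best = get_close_matches("appel", candidates, n=1, cutoff=0.6)

# Similarity score between two strings, lowercased first so that
# the comparison ignores case.
a = "Where are you going".lower()
b = "where are you going jane".lower()
ratio = SequenceMatcher(None, a, b).ratio()

print(best)
print(round(ratio, 3))
```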

What is the filter() function in Python?

The filter() function returns an iterator in which the items are filtered through a function that tests whether each item is accepted or not.
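Because the result is an iterator rather than a list, it is evaluated lazily and can be consumed only once. A minimal sketch, using a sample list of strings:

```python
words = ["GFG", "Geeks", "best"]

# filter() returns a lazy iterator, not a list.
upper_only = filter(str.isupper, words)

first_pass = list(upper_only)   # consumes the iterator
second_pass = list(upper_only)  # the iterator is now exhausted

print(first_pass)
print(second_pass)
```

Wrapping the call in list() right away is the usual way to keep the filtered items around for repeated use.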

How do you filter a string list in Python?

How to Filter List Elements in Python:

With a loop:

scores = [70, 60, 80, 90, 50]
filtered = []
for score in scores:
    if score >= 70:
        filtered.append(score)
print(filtered)

With filter(fn, list):

scores = [70, 60, 80, 90, 50]
filtered = filter(lambda score: score >= 70, scores)
print(list(filtered))
