I have a dataframe, where one column consists of strings:
import pandas as pd

d = pd.DataFrame({'text': ["hello, this is a test. we want to remove entries, where the text is similar to other texts",
                           "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
                           "where are you going",
                           "i'm going to the zoo to pet the animals",
                           "where are you going jane"]})
Question: Some of these strings are very similar and differ only in, e.g., one or two words. I want to remove all "duplicates", i.e. remove all articles that are similar to another one. In the above example, rows 1 and 2 are alike, so I want to keep only the first. Likewise, rows 3 and 5 are similar, and I want to keep only row 3. The actual dataframe has around 100k rows.
My attempt: I figured a good starting point is to convert the strings into sets for easy and efficient comparison:
d["text"].str.split().apply(set)
Next, I wrote a function that compares each row to all the rows after it and flags it if more than 90% of its words also appear in one of them. Here is how I have done it:
def find_duplicates(df):
    df = df.str.split().apply(set)
    ls_duplicates = []
    for i in range(len(df)):
        doc_i = df.iloc[i]
        for j in range(i+1, len(df)):
            doc_j = df.iloc[j]
            score = len(doc_i.intersection(doc_j)) / len(doc_i)
            if score > 0.9:
                ls_duplicates.append(i)
    return ls_duplicates

find_duplicates(d['text'])
This works for my purposes, but runs very slowly. Is there a way to optimize it?
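One possible direction, sketched here under assumptions rather than offered as a definitive answer: build an inverted index from each word to the rows that contain it, so that a row is only compared against later rows that share at least one word with it. The name `find_duplicates_indexed` and the `threshold` parameter are hypothetical. The sketch takes a plain list of strings (for the dataframe above, `d['text'].tolist()`) and, like the function above, reports the index of the earlier row in each similar pair, though each index is reported only once.

```python
from collections import defaultdict


def find_duplicates_indexed(texts, threshold=0.9):
    """Return indices i for which some later row j satisfies
    len(words_i & words_j) / len(words_i) > threshold."""
    word_sets = [set(t.split()) for t in texts]

    # Inverted index: word -> indices of the rows containing that word.
    index = defaultdict(list)
    for row, words in enumerate(word_sets):
        for word in words:
            index[word].append(row)

    duplicates = []
    for i, words in enumerate(word_sets):
        if not words:
            continue  # an empty text has no score; skip it
        # Count shared words only for later rows that share at least one word.
        shared = defaultdict(int)
        for word in words:
            for j in index[word]:
                if j > i:
                    shared[j] += 1
        if any(count / len(words) > threshold for count in shared.values()):
            duplicates.append(i)
    return duplicates


texts = [
    "hello, this is a test. we want to remove entries, where the text is similar to other texts",
    "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
    "where are you going",
    "i'm going to the zoo to pet the animals",
    "where are you going jane",
]
print(find_duplicates_indexed(texts))
```

Whether this is actually faster depends on the data: words that occur in most rows still produce long candidate lists, so the worst case remains quadratic.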
Given a list of strings, the task is to write a Python program that keeps only the strings written in a single case, either all uppercase or all lowercase.
Examples:
Input : test_list = ["GFG", "Geeks", "best", "FOr", "all", "GEEKS"]
Output : ['GFG', 'best', 'all', 'GEEKS']
Explanation : GFG is all uppercase, best is all lowercase.
Input : test_list = ["GFG", "Geeks", "best"]
Output : ['GFG', 'best']
Explanation : GFG is all uppercase, best is all lowercase.
Method #1 : Using islower() + isupper() + list comprehension
In this method, we check whether each string is entirely lowercase or entirely uppercase using islower() and isupper(), and use a list comprehension to iterate through the strings.
Python3
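# initializing the list
test_list = ["GFG", "Geeks", "best", "FOr", "all", "GEEKS"]

# printing the original list
print("The original list is : " + str(test_list))

# keep only strings that are entirely lowercase or entirely uppercase
res = [sub for sub in test_list if sub.islower() or sub.isupper()]

# printing the result
print("Strings with same case : " + str(res))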
Output:
The original list is : ['GFG', 'Geeks', 'best', 'FOr', 'all', 'GEEKS']
Strings with same case : ['GFG', 'best', 'all', 'GEEKS']
Method #2 : Using islower() + isupper() + filter() + lambda
In this method, we filter the strings using filter() and a lambda function. The rest of the functionality is the same as in the above method.
Python3
# initializing the list
test_list = ["GFG", "Geeks", "best", "FOr", "all", "GEEKS"]

# printing the original list
print("The original list is : " + str(test_list))

# filter() keeps the strings for which the lambda returns True
res = list(filter(lambda sub: sub.islower() or sub.isupper(), test_list))

# printing the result
print("Strings with same case : " + str(res))
Output:
The original list is : ['GFG', 'Geeks', 'best', 'FOr', 'all', 'GEEKS']
Strings with same case : ['GFG', 'best', 'all', 'GEEKS']
Both methods have the same time and space complexity, where n is the number of strings and the length of each string is treated as bounded:
Time Complexity: O(n)
Space Complexity: O(n)
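Both methods rely on how str.islower() and str.isupper() treat characters that have no case. The sketch below wraps the same check in a helper (the name `same_case_strings` is hypothetical, not part of the methods above) and runs it on a few edge cases: digits are ignored when at least one cased character is present, while strings with no cased characters at all, including the empty string, are rejected by both checks.

```python
def same_case_strings(strings):
    """Keep strings whose cased characters are all lowercase or all uppercase."""
    return [s for s in strings if s.islower() or s.isupper()]


# "abc123" and "A1" contain cased characters in a single case, so they are kept.
# "123" and "" contain no cased characters, so islower() and isupper() are both False.
samples = ["GFG", "Geeks", "abc123", "123", "", "A1"]
print(same_case_strings(samples))  # ['GFG', 'abc123', 'A1']
```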