All of these answers seem to either overcomplicate things or misunderstand regex. I recommend using special sequences to catch any and all punctuation you want to replace with spaces.
My answer simplifies Jonathan's by using Python's regex special sequences rather than a manual list of punctuation and spaces to catch.
import re

tweet = 'I am tired! I like fruit...and milk'
clean = re.sub(r'''      # Start raw string block
               \W+       # Accept one or more non-word characters
               \s*       # plus zero or more whitespace characters,
               ''',      # Close string block
               ' ',      # and replace it with a single space
               tweet,
               flags=re.VERBOSE)
print(tweet + '\n' + clean)
Results:
I am tired! I like fruit...and milk
I am tired I like fruit and milk
Compact version:
tweet = 'I am tired! I like fruit...and milk'
clean = re.sub(r'\W+\s*', ' ', tweet)
print(tweet + '\n' + clean)
What separates my version from Jonathan's is that symbols like hyphens, tildes, parentheses, brackets, etc. are all caught and replaced, not just the given list of punctuation. It also catches any non-space whitespace, like tabs and newlines, and converts it to a single space.
Jonathan's version is good if you want to remove a specific list of punctuation rather than all punctuation, as my solution does.
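To make that difference concrete, here is a minimal sketch of what \W+ catches. The helper name clean_non_word and the sample string are made up for illustration. Since \W already matches whitespace, the trailing \s* is left out here; as far as I can tell that does not change the result.

```python
import re

def clean_non_word(text):
    """Replace each run of non-word characters with a single space."""
    return re.sub(r'\W+', ' ', text)

sample = 'well-known (maybe)~ [yes]\tand\nno_underscore'
print(clean_non_word(sample))
# The hyphen, parentheses, tilde, brackets, tab and newline all become
# single spaces; the underscore survives because it is a word character.
```

If the text can end in punctuation, the result will end in a space, so you may want to call .strip() on it afterwards.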
If you don't want to allow even underscores in your text, you can replace the special sequence \W with a simple [^a-zA-Z0-9], i.e.
tweet = 'I am tired! I like fruit...and milk'
clean = re.sub(r'[^a-zA-Z0-9]+\s*', ' ', tweet)
print(tweet + '\n' + clean)
Special sequence explanation, from Python's documentation on regex:
"The special sequences consist of '\' and a character from the list below."
\W: Matches any character which is not a word character. (A word character, \w, includes most characters that can be part of a word in any language, as well as numbers and the underscore.)
\s: For Unicode (str) patterns: Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages).
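One practical consequence of that Unicode behaviour is that \W and [^a-zA-Z0-9] treat accented letters differently. A small sketch, using a made-up sample string (written with escapes so the accented characters are unambiguous):

```python
import re

# 'café_au_lait: déjà vu!' spelled with explicit code points
text = 'caf\u00e9_au_lait: d\u00e9j\u00e0 vu!'

# \w is Unicode-aware for str patterns, so accented letters are kept.
unicode_clean = re.sub(r'\W+', ' ', text)

# The explicit ASCII class treats accented letters and the underscore
# as characters to replace.
ascii_clean = re.sub(r'[^a-zA-Z0-9]+', ' ', text)

print(repr(unicode_clean))
print(repr(ascii_clean))
```

Which one you want depends on whether your text is expected to contain non-English words.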
When working with Python strings, we often need to remove certain characters, such as punctuation. This has applications in data preprocessing for data science as well as in day-to-day programming. Let’s discuss several ways to perform this task in Python.
Method 1: Remove Punctuation from a String with Translate
The first two arguments to str.maketrans are empty strings, and the third is a string containing the punctuation characters to remove (here string.punctuation). The resulting table tells str.translate to delete each of those characters from the string. This is one of the best ways to strip punctuation from a string, because the whole string is processed in a single built-in call.
import string

test_str = 'Gfg, is best: for ! Geeks ;'

# Build a table that deletes every punctuation character, then apply it
test_str = test_str.translate(str.maketrans('', '', string.punctuation))
print(test_str)
Output:
Gfg is best for  Geeks
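To see why this works, it can help to look at the table that str.maketrans builds. In the sketch below (the names PUNCT_TABLE and strip_punctuation are just for illustration), each punctuation character's code point maps to None, which str.translate interprets as "delete this character".

```python
import string

# Build the table once; it can be reused for many strings.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Delete every ASCII punctuation character from text."""
    return text.translate(PUNCT_TABLE)

print(PUNCT_TABLE[ord('!')])               # the entry for '!' is None
print(strip_punctuation('Hello, world!'))
```

Note that string.punctuation only covers ASCII punctuation, so characters such as curly quotes are left untouched by this approach.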
Method 2: Remove Punctuation from a String with Python loop
This is the brute-force way to perform the task. We check each character against a string of punctuation characters (string.punctuation is used here) and remove every match from the string.
import string

test_str = "Gfg, is best : for ! Geeks ;"
print("The original string is : " + test_str)

# ASCII punctuation characters to strip
punc = string.punctuation

# Remove each punctuation character found in the string
for ele in test_str:
    if ele in punc:
        test_str = test_str.replace(ele, "")

print("The string after punctuation filter : " + test_str)
Output:
The original string is : Gfg, is best : for ! Geeks ;
The string after punctuation filter : Gfg is best  for  Geeks
Method 3: Remove Punctuation from a String with regex
Punctuation can also be removed with a regular expression. Here, re.sub replaces every character that is neither a word character nor whitespace with an empty string.
import re

test_str = "Gfg, is best : for ! Geeks ;"
print("The original string is : " + test_str)

# Remove every character that is not a word character or whitespace
res = re.sub(r'[^\w\s]', '', test_str)

print("The string after punctuation filter : " + res)
Output:
The original string is : Gfg, is best : for ! Geeks ;
The string after punctuation filter : Gfg is best  for  Geeks
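One thing to keep in mind with [^\w\s] is that the underscore counts as a word character, so it is not removed. If underscores should go too, one possible variant (an assumption about what you might want, not part of the method above) adds the underscore as an alternative. The pattern names and the sample string are made up for illustration.

```python
import re

PUNCT = re.compile(r'[^\w\s]')                   # keeps underscores
PUNCT_AND_UNDERSCORE = re.compile(r'[^\w\s]|_')  # removes them as well

sample = 'snake_case, is best : for ! Geeks ;'
print(PUNCT.sub('', sample))
print(PUNCT_AND_UNDERSCORE.sub('', sample))
```

Compiling the pattern once is optional, but it reads well when the same pattern is applied to many strings.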
Method 4: Using for loop, punctuation string and not in operator
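import string

test_str = "Gfg, is best : for ! Geeks ;"
print("The original string is : " + test_str)

# ASCII punctuation characters to skip
punc = string.punctuation

# Build the result from the characters that are not punctuation
res = ""
for ele in test_str:
    if ele not in punc:
        res += ele

print("The string after punctuation filter : " + res)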
Output:
The original string is : Gfg, is best : for ! Geeks ;
The string after punctuation filter : Gfg is best  for  Geeks
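Building the string one character at a time works, but a common alternative is to filter with a generator expression and join the pieces once. The sketch below is that swapped-in variant rather than the method above; remove_punctuation and PUNCT_SET are hypothetical names, and string.punctuation is assumed to be the set of characters to drop.

```python
import string

# A frozenset gives constant-time membership checks.
PUNCT_SET = frozenset(string.punctuation)

def remove_punctuation(text):
    """Keep only the characters that are not ASCII punctuation."""
    return ''.join(ch for ch in text if ch not in PUNCT_SET)

print(remove_punctuation("Gfg, is best : for ! Geeks ;"))
```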
The time and space complexity is the same for all the methods, where n is the length of the string:
Time Complexity: O(n)
Auxiliary Space: O(n)
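The same asymptotic complexity does not mean the same running time, so if speed matters it is worth measuring on your own data. A rough timeit sketch is shown below; the function names, the sample text and the repetition count are arbitrary choices for illustration, and the numbers will vary by machine and Python version.

```python
import re
import string
import timeit

TABLE = str.maketrans('', '', string.punctuation)
PATTERN = re.compile(r'[^\w\s]')

def with_translate(text):
    return text.translate(TABLE)

def with_regex(text):
    return PATTERN.sub('', text)

def with_loop(text):
    res = ''
    for ch in text:
        if ch not in string.punctuation:
            res += ch
    return res

# Repeat the short example so the timings are measurable.
sample = "Gfg, is best : for ! Geeks ;" * 1000

for func in (with_translate, with_regex, with_loop):
    seconds = timeit.timeit(lambda: func(sample), number=20)
    print(f'{func.__name__}: {seconds:.4f} s')
```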