Guide: searching for special characters in Python

I have a file (I only show a part) where I would like to remove a special character.

OTU1359 UniRef90_A0A095VQ09 UniRef90_A0A0C1UI80 UniRef90_A0A1M4ZSK2 UniRef90_A0A1W1CJV7 UniRef90_A0A1Z9J2X0 UniRef90_A0A1Z9THL2 UniRef90_A0A2E3B6A5 UniRef90_A0A2E5MT47 UniRef90_A0A2E5VCW9 UniRef90_A0A2E6CDK4 UniRef90_A0A2E6KTE6 UniRef90_A0A2E8AIM6 UniRef90_A0A2E8RIG1 UniRef90_A0A2E8YNS3 UniRef90_A0A2E9VEK0 UniRef90_W6RCT6

OTU0980 UniRef90_A0A084TMQ7 UniRef90_A0A090PK65 UniRef90_A0A0P1G8P0 UniRef90_A0A0P1IHL1 UniRef90_A0A286ILS7 UniRef90_A0A2A5E7H9 UniRef90_A0A2D9J217 UniRef90_H3NS47 UniRef90_H3NSN9 UniRef90_H3NSP0 UniRef90_H3NSP7 UniRef90_H3NUB2 UniRef90_H3NY28 UniRef90_H3NY47 UniRef90_UPI000C2CBC51

I would like to remove the token "OTUXXXX" (it always starts with OTU and is always followed by 4 digits). Several OTUXXXX tokens can appear on one line.

I tried:

re.search("OTU[0-9]{4}", line)

It doesn't work. Any help?

asked Jun 20, 2019 at 9:45


I suggest using re.sub and matching your pattern as a whole word, to avoid partial matches inside other words. re.search only reports whether (and where) the pattern occurs; it does not modify the string.

s = re.sub(r"\s*\bOTU[0-9]{4}\b", "", line).strip()

The .strip() at the end removes any leading/trailing whitespace that remains after removing matches at the start/end of the string.
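As a minimal sketch of the approach above, the snippet below applies the same pattern to a couple of lines. The sample lines are shortened, hypothetical variations of the question's data (the second one places the OTU tokens in the middle and at the end of the line), and the names `pattern`, `lines` and `cleaned` are only illustrative.

```python
import re

# Hypothetical, shortened sample lines modelled on the question's data.
lines = [
    "OTU1359 UniRef90_A0A095VQ09 UniRef90_W6RCT6",
    "UniRef90_H3NS47 OTU0980 UniRef90_H3NSN9 OTU1234",
]

# Optional leading whitespace, then OTU + exactly 4 digits as a whole word.
pattern = re.compile(r"\s*\bOTU[0-9]{4}\b")

# search() only locates the first match; the line itself is left unchanged.
first_match = pattern.search(lines[0])
print(first_match.group())  # OTU1359

# sub() returns a new string with every match removed.
cleaned = [pattern.sub("", line).strip() for line in lines]
for line in cleaned:
    print(line)
```

Compiling the pattern once is just a convenience when the same expression is applied to many lines; calling re.sub with the pattern string on each line behaves the same way.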


answered Jun 20, 2019 at 9:49

Wiktor Stribiżew

You could use re.sub, which replaces the text matching a pattern with the replacement you provide. Here is the documentation: https://docs.python.org/3/library/re.html

And here is one possible implementation:

from re import compile, sub, MULTILINE

text = '''
OTU1359 UniRef90_A0A095VQ09 UniRef90_A0A0C1UI80 UniRef90_A0A1M4ZSK2 UniRef90_A0A1W1CJV7 UniRef90_A0A1Z9J2X0 UniRef90_A0A1Z9THL2 UniRef90_A0A2E3B6A5 UniRef90_A0A2E5MT47 UniRef90_A0A2E5VCW9 UniRef90_A0A2E6CDK4 UniRef90_A0A2E6KTE6 UniRef90_A0A2E8AIM6 UniRef90_A0A2E8RIG1 UniRef90_A0A2E8YNS3 UniRef90_A0A2E9VEK0 UniRef90_W6RCT6

OTU0980 UniRef90_A0A084TMQ7 UniRef90_A0A090PK65 UniRef90_A0A0P1G8P0 UniRef90_A0A0P1IHL1 UniRef90_A0A286ILS7 UniRef90_A0A2A5E7H9 UniRef90_A0A2D9J217 UniRef90_H3NS47 UniRef90_H3NSN9 UniRef90_H3NSP0 UniRef90_H3NSP7 UniRef90_H3NUB2 UniRef90_H3NY28 UniRef90_H3NY47 UniRef90_UPI000C2CBC51
'''

replacement = ''
regex = compile(r'OTU\d{4}', flags=MULTILINE)
cleaned = sub(regex, replacement, text)

answered Jun 20, 2019 at 9:58

Giova


I need to remove all special characters, punctuation and spaces from a string so that I only have letters and numbers.


asked Apr 30, 2011 at 17:41

This can be done without regex:

>>> string = "Special $#! characters   spaces 888323"
>>> ''.join(e for e in string if e.isalnum())
'Specialcharactersspaces888323'

You can use str.isalnum:

S.isalnum() -> bool

Return True if all characters in S are alphanumeric
and there is at least one character in S, False otherwise.

If you insist on using regex, other solutions will do fine. However note that if it can be done without using a regular expression, that's the best way to go about it.
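One detail worth knowing about this approach: str.isalnum is Unicode-aware, so accented and other non-ASCII letters count as alphanumeric. The sketch below shows one way to optionally restrict the result to ASCII; the function name `keep_alnum` and its `ascii_only` parameter are illustrative assumptions, not part of the answer above, and str.isascii requires Python 3.7 or later.

```python
def keep_alnum(text, ascii_only=False):
    """Return text with every non-alphanumeric character removed."""
    if ascii_only:
        # str.isascii() is available from Python 3.7 onward.
        return "".join(c for c in text if c.isascii() and c.isalnum())
    return "".join(c for c in text if c.isalnum())


sample = "Smørrebrød, 2 stk!"
print(keep_alnum(sample))                   # Smørrebrød2stk
print(keep_alnum(sample, ascii_only=True))  # Smrrebrd2stk
```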


answered Apr 30, 2011 at 17:47

user225312

Here is a regex that matches a run of characters that are not letters or numbers:

[^A-Za-z0-9]+

Here is the Python command to do a regex substitution:

re.sub('[^A-Za-z0-9]+', '', mystring)


answered Apr 30, 2011 at 17:46

Andy White

A shorter way:

import re
cleanString = re.sub(r'\W+', '', string)

If you want spaces between words and numbers, replace '' with ' '.

answered Aug 7, 2014 at 13:26

tuxErrante

TLDR

I timed the provided answers.

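import re
re.sub(r'\W+', '', string)

is up to about 3x faster than the slowest of the top answers I timed, and roughly 1.5x to 2.3x faster than the next fastest.

Caution should be taken when using this option. Some special characters (e.g. ø) may not be stripped by this method, because in Python 3 \W treats Unicode letters as word characters; underscores are kept as well.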
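To make the caution above concrete, here is a small sketch comparing \W with and without the re.ASCII flag in Python 3. The sample string and variable names are made up for illustration.

```python
import re

s = "snake_case ø 42"

# Default (Unicode) matching: ø counts as a word character and the
# underscore is a word character too, so both survive.
unicode_result = re.sub(r"\W+", "", s)

# With re.ASCII only [a-zA-Z0-9_] are word characters, so ø is removed.
ascii_result = re.sub(r"\W+", "", s, flags=re.ASCII)

# Adding "_" to the class removes underscores as well.
strict_result = re.sub(r"[\W_]+", "", s, flags=re.ASCII)

print(unicode_result)  # snake_caseø42
print(ascii_result)    # snake_case42
print(strict_result)   # snakecase42
```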


After seeing this, I was interested in expanding on the provided answers by finding out which executes in the least amount of time, so I went through and checked some of the proposed answers with timeit against two of the example strings:

  • string1 = 'Special $#! characters spaces 888323'
  • string2 = 'how much for the maple syrup? $20.99? That s ridiculous!!!'

Example 1

''.join(e for e in string if e.isalnum())
  • string1 - Result: 10.7061979771
  • string2 - Result: 7.78372597694

Example 2

import re
re.sub('[^A-Za-z0-9]+', '', string)
  • string1 - Result: 7.10785102844
  • string2 - Result: 4.12814903259

Example 3

import re
re.sub(r'\W+', '', string)
  • string1 - Result: 3.11899876595
  • string2 - Result: 2.78014397621

Each result above is the lowest time, in seconds, among the runs returned by repeat(3, 2000000).

Example 3 can be about 3x faster than Example 1.
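For anyone who wants to repeat the comparison, the following is a rough sketch of how such a measurement could be set up with timeit. It is not the original benchmark script: the helper names and the much smaller `number` of iterations are assumptions chosen so that it finishes quickly, and absolute timings will differ between machines and Python versions.

```python
import re
import timeit

string1 = 'Special $#! characters spaces 888323'


def join_isalnum(s):
    return ''.join(e for e in s if e.isalnum())


def sub_char_class(s):
    return re.sub('[^A-Za-z0-9]+', '', s)


def sub_non_word(s):
    return re.sub(r'\W+', '', s)


def best_time(func, text, repeat=3, number=20000):
    # timeit.repeat returns one total time (in seconds) per repetition;
    # the minimum is usually the least noisy figure.
    return min(timeit.repeat(lambda: func(text), repeat=repeat, number=number))


results = {
    func.__name__: best_time(func, string1)
    for func in (join_isalnum, sub_char_class, sub_non_word)
}
for name, seconds in results.items():
    print(f"{name}: {seconds:.4f} s")
```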

answered Aug 6, 2016 at 1:04

mbeacom

Python 2.*

I think just filter(str.isalnum, string) works

In [20]: filter(str.isalnum, 'string with special chars like !,#$% etcs.')
Out[20]: 'stringwithspecialcharslikeetcs'

Python 3.*

In Python 3, filter() returns an iterator (instead of a string, as above). You have to join it back to get a string:

''.join(filter(str.isalnum, string))

or pass a list to join (not sure, but it may be slightly faster):

''.join([*filter(str.isalnum, string)])

Note: unpacking with [*args] is valid from Python 3.5 onward.

answered Apr 14, 2016 at 9:32

Grijesh Chauhan

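#!/usr/bin/python
import re

strs = "how much for the maple syrup? $20.99? That's ridiculous!!!"
print(strs)
# Remove a specific set of characters; inside [...] no '|' separators are needed
nstr = re.sub(r'[?$.!]', '', strs)
print(nstr)
# Remove everything that is not a letter, a digit or a space
nestr = re.sub(r'[^a-zA-Z0-9 ]', '', nstr)
print(nestr)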

You can add more special characters to the character class; each one is replaced by '' (the empty string), i.e. removed.

answered May 25, 2014 at 9:28

pkm

Unlike the other regex answers, I would exclude every character that is not what I want, instead of explicitly enumerating what I don't want.

For example, if I want only characters from 'a to z' (upper and lower case) and numbers, I would exclude everything else:

import re
s = re.sub(r"[^a-zA-Z0-9]","",s)

This means "substitute every character that is not a number, or a character in the range 'a to z' or 'A to Z' with an empty string".

In fact, if you put the special character ^ first inside a character class ([...]), you get its negation.

Extra tip: if you also need to lowercase the result, lowercase the string first; the regex then becomes simpler because it no longer has to list the uppercase range.

import re
s = re.sub(r"[^a-z0-9]","",s.lower())

answered Sep 5, 2018 at 10:02

Andrea

string.punctuation contains the following characters:

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

You can use the translate and maketrans functions to map punctuation characters to nothing (i.e. delete them):

import string

'This, is. A test!'.translate(str.maketrans('', '', string.punctuation))

Output:

'This is A test'

answered Mar 17, 2020 at 15:14

Vlad Bezden

s = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", s)

answered Jun 15, 2018 at 12:09

sneha

Assuming you want to use a regex and you want/need Unicode-cognisant 2.x code that is 2to3-ready:

>>> import re
>>> rx = re.compile(u'[\W_]+', re.UNICODE)
>>> data = u''.join(unichr(i) for i in range(256))
>>> rx.sub(u'', data)
u'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb2 [snip] \xfe\xff'
>>>

answered Apr 30, 2011 at 21:07

John Machin

The most generic approach is using the 'categories' of the unicodedata table which classifies every single character. E.g. the following code filters only printable characters based on their category:

import unicodedata

# Strip unwanted characters based on their Unicode database category:
# http://www.sql-und-xml.de/unicode-database/#kategorien

# Uppercase letters, lowercase letters, decimal digits, space separators
PRINTABLE = {'Lu', 'Ll', 'Nd', 'Zs'}

def filter_non_printable(s):
    # Replace every character outside the allowed categories with a space
    return ''.join(
        c if unicodedata.category(c) in PRINTABLE else ' '
        for c in s
    )

Look at the given URL above for all related categories. You also can of course filter by the punctuation categories.
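As a sketch of the punctuation-category idea mentioned above: every Unicode category code starts with a letter naming its major class, with "P" for punctuation and "S" for symbols. The function name and the sample string below are illustrative assumptions, not part of the original answer.

```python
import unicodedata


def strip_punctuation_and_symbols(text):
    """Drop characters whose Unicode category is punctuation (P*) or symbol (S*)."""
    return "".join(
        c for c in text
        if unicodedata.category(c)[0] not in ("P", "S")
    )


print(strip_punctuation_and_symbols("¿Qué tal? 20€ (ok!)"))  # Qué tal 20 ok
```

Unlike an ASCII character class, this keeps accented letters while still removing non-ASCII punctuation such as ¿ and currency symbols such as €.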


answered Apr 30, 2011 at 18:00


For other languages such as German, Spanish, Danish or French, which contain special characters (like the German umlauts ü, ä, ö), simply add these characters to the regex character class:

Example for German:

re.sub('[^A-Za-z0-9ÄÖÜäöüß]+', '', mystring)

answered Jun 27, 2020 at 10:00

petezurich

This will remove all special characters, punctuation, and spaces from a string and only have numbers and letters.

import re

sample_str = "Hel&&lo %% Wo$#rl@d"

# using isalnum()
print("".join(k for k in sample_str if k.isalnum()))


# using regex (keep letters and digits)
op2 = re.sub("[^A-Za-z0-9]", "", sample_str)
print("op2 =", op2)


special_char_list = ["$", "@", "#", "&", "%"]

# using list comprehension
op1 = "".join([k for k in sample_str if k not in special_char_list])
print("op1 =", op1)


# using lambda function
op3 = "".join(filter(lambda x: x not in special_char_list, sample_str))
print("op3 =", op3)

answered May 11, 2021 at 8:29

Use translate:

import string

def clean(instr):
    return instr.translate(None, string.punctuation + ' ')

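Caveat: this is the Python 2 form of str.translate and only works on ASCII (byte) strings. In Python 3, str.translate takes a single table argument, which you can build with str.maketrans.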
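A possible Python 3 counterpart of the function above is sketched here. It builds the deletion table once with str.maketrans; the module-level name `_DELETE_TABLE` is an illustrative choice.

```python
import string

# The third argument of str.maketrans lists the characters to delete:
# every ASCII punctuation character plus the space.
_DELETE_TABLE = str.maketrans("", "", string.punctuation + " ")


def clean(instr):
    return instr.translate(_DELETE_TABLE)


print(clean("This, is. A test!"))  # ThisisAtest
print(clean("Größe: 5!"))          # Größe5
```

Only the ASCII punctuation listed in string.punctuation is removed; other Unicode punctuation passes through unchanged.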

answered Mar 23, 2016 at 19:37

jjmurre

This will remove all non-alphanumeric characters except spaces.

string = "Special $#! characters   spaces 888323"
''.join(e for e in string if (e.isalnum() or e.isspace()))

'Special  characters   spaces 888323'
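Removing characters this way can leave runs of consecutive spaces behind. If single spaces are wanted, one option is to split on whitespace and re-join, as in this small sketch (the function name is illustrative):

```python
def keep_alnum_and_single_spaces(text):
    # Keep letters, digits and whitespace ...
    kept = "".join(c for c in text if c.isalnum() or c.isspace())
    # ... then collapse every run of whitespace into one space and trim the ends.
    return " ".join(kept.split())


string = "Special $#! characters   spaces 888323"
print(keep_alnum_and_single_spaces(string))  # Special characters spaces 888323
```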


answered Feb 1, 2021 at 16:57


import re

my_string = """Strings are amongst the most popular data types in Python. We can create the strings by enclosing characters in quotes. Python treats single quotes the same as double quotes."""

# Count the word "python" whether or not it is followed by ',' or '.'

# Lowercase and split into words so that "Python." and "Python" compare equal
text = my_string.lower().split()
for count, word in enumerate(text):
    if word.endswith((".", ",")):
        # Strip a single trailing punctuation character
        text[count] = re.sub(r"^([a-z]+)[.,]$", r"\1", word)
print("The count of Python : ", text.count("python"))

answered Jul 16, 2018 at 11:52

Ten years later, I think the solution below is the best one. It can remove/clean all special characters, punctuation, ASCII characters and spaces from a string.

from cleantext import clean

string = 'Special $#! characters   spaces 888323'
new = clean(string, lower=False, no_currency_symbols=True, no_punct=True, replace_with_currency_symbol='')
print(new)
# Output ==> 'Special characters spaces 888323'
# You can also remove the spaces if you want:
update = new.replace(' ', '')
print(update)
# Output ==> 'Specialcharactersspaces888323'

answered Oct 27, 2021 at 13:21

function regexFuntion(st) {
  const regx = /[^\w\s]/gi; // allow : [a-zA-Z0-9, space]
  st = st.replace(regx, ''); // remove all data without [a-zA-Z0-9, space]
  st = st.replace(/\s\s+/g, ' '); // remove multiple space

  return st;
}

console.log(regexFuntion('$Hello; # -world--78asdf+-===asdflkj******lkjasdfj67;'));
// Output: Hello world78asdfasdflkjlkjasdfj67

answered Apr 6 at 15:02

Art Bindu

import re
abc = "askhnl#$%askdjalsdk"
ddd = abc.replace("#$%","")
print (ddd)

and you will see the result:

askhnlaskdjalsdk

answered Feb 25, 2016 at 8:00

Dsw Wds
