I am using urllib to get a string of html from a website and need to put each word in the html document into a list.
Here is the code I have so far. I keep getting an error. I have also copied the error below.
import urllib.request
url = input("Please enter a URL: ")
z=urllib.request.urlopen(url)
z=str(z.read())
removeSpecialChars = str.replace("!@#$%^&*()[]{};:,./?\|`~-=_+", " ")
words = removeSpecialChars.split()
print ("Words list: ", words[0:20])
Here is the error.
Please enter a URL: //simleyfootball.com
Traceback (most recent call last):
File "C:\Users\jeremy.KLUG\My Documents\LiClipse Workspace\Python Project 2\Module2.py", line 7, in
removeSpecialChars = str.replace("!@#$%^&*()[]{};:,./?\|`~-=_+", " ")
TypeError: replace() takes at least 2 arguments (1 given)
asked Jun 2, 2014 at 13:47
One way is to use re.sub, which is my preferred approach.
import re
my_str = "hey th~!ere"
# Remove every character that is not a letter, digit, space, newline, or period.
my_new_string = re.sub(r'[^a-zA-Z0-9 \n\.]', '', my_str)
print(my_new_string)
Output:
hey there
Another way is to use re.escape:
import string
import re
my_str = "hey th~!ere"
# Escape the punctuation so it is safe to place inside a regex character set.
chars = re.escape(string.punctuation)
print(re.sub(r'['+chars+']', '', my_str))
Output:
hey there
A small tip on naming style: per PEP 8, the variable should be named remove_special_chars rather than removeSpecialChars.
Also, if you want to remove the spaces as well, just change [^a-zA-Z0-9 \n\.]
to [^a-zA-Z0-9\n\.]
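To see what the space inside the brackets actually does, here is a small sketch comparing the two patterns. The sample string and the variable names are made up for illustration.

```python
import re

sample = "hey th~!ere, world"

# The space is listed inside the negated set, so spaces are left alone.
keeps_spaces = re.sub(r"[^a-zA-Z0-9 \n.]", "", sample)

# The space is not listed, so spaces are removed along with the punctuation.
drops_spaces = re.sub(r"[^a-zA-Z0-9\n.]", "", sample)

print(keeps_spaces)  # hey there world
print(drops_spaces)  # heythereworld
```

For building a word list you would most likely want the first form, since the spaces are what split() later uses to separate the words.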
answered Jun 2, 2014 at 14:01
Kobi K
str.replace is the wrong function for what you want to do (apart from it being used incorrectly). You want to replace any character of a set with a space, not the whole set with a single space (the latter is what replace does). You can use translate like this:
removeSpecialChars = z.translate({ord(c): " " for c in "!@#$%^&*()[]{};:,./?\|`~-=_+"})
This creates a mapping which maps every character in your list of special characters to a space, then calls translate() on the string, replacing every single character in the set of special characters with a space.
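Putting the mapping and the split together, a minimal sketch could look like the following. The names SPECIAL_CHARS and to_words are hypothetical, and the sample text is invented; the character list is the one from the question.

```python
# Character list from the question; the backslash is doubled so that the
# string literal contains exactly one backslash.
SPECIAL_CHARS = "!@#$%^&*()[]{};:,./?\\|`~-=_+"

# Map the code point of each special character to a space.
TABLE = {ord(c): " " for c in SPECIAL_CHARS}


def to_words(text):
    """Replace every special character with a space, then split on whitespace."""
    return text.translate(TABLE).split()


print(to_words("one,two;three... four"))  # ['one', 'two', 'three', 'four']
```

Building the table once and reusing it avoids recreating the dictionary for every string you process.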
answered Jun 2, 2014 at 14:02
rassahah
You need to call replace on z and not on str, since you want to replace characters located in the string variable z:
removeSpecialChars = z.replace("!@#$%^&*()[]{};:,./?\|`~-=_+", " ")
But this will still not work, because replace looks for the whole argument as a single substring. You will most likely need the regular expression module re with its sub function:
import re
removeSpecialChars = re.sub(r"[!@#$%^&*()\[\]{};:,./?\\|`~\-=_+]", " ", z)
Don't forget the enclosing [], which indicates that this is a set of characters to be replaced. Inside the set, the characters ], \ and - have to be escaped with a backslash.
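If writing the set by hand feels error-prone, one option is to let re.escape build it for you. This is only a sketch, assuming the same character list as in the question; SPECIAL_CHARS and split_words are names invented for the example.

```python
import re

# Character list from the question, with the backslash doubled so the
# literal holds a single backslash.
SPECIAL_CHARS = "!@#$%^&*()[]{};:,./?\\|`~-=_+"

# re.escape backslash-escapes the regex metacharacters (including ], \ and -),
# so the result is safe to wrap in [ ].
PATTERN = re.compile("[" + re.escape(SPECIAL_CHARS) + "]")


def split_words(text):
    """Replace each special character with a space and split into words."""
    return PATTERN.sub(" ", text).split()


print(split_words("a[b]c-d\\e^f"))  # ['a', 'b', 'c', 'd', 'e', 'f']
```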
answered Jun 2, 2014 at 13:58
Danny M
replace operates on a specific string, so you need to call it like this:
removeSpecialChars = z.replace("!@#$%^&*()[]{};:,./?\|`~-=_+", " ")
But this is probably not what you need, since it will look for a single substring containing all those characters in the same order. You can do it with a regexp, as Danny Michaud pointed out.
As a side note, you might want to look at BeautifulSoup, which is a library for parsing messy HTML-formatted text like what you usually get from scraping websites.
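BeautifulSoup is a third-party package, so as a rough standard-library stand-in, here is a sketch that uses html.parser to keep only the text between the tags before splitting it into words. This is not BeautifulSoup's API; TextExtractor and visible_text are hypothetical names, and the HTML snippet is invented.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text found between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text outside of tags
        # (this also includes the contents of <script> and <style>).
        self.chunks.append(data)


def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


print(visible_text("<p>Hello, <b>world</b>!</p>").split())
# ['Hello,', 'world', '!']
```

The punctuation is still there at this point, so you would run one of the character-removal approaches on the extracted text afterwards.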
answered Jun 2, 2014 at 13:51
Pavel
You can replace the special characters with the desired characters as follows (in Python 3, maketrans is a static method of str rather than a function in the string module):
specialCharacterText = "H#y #@w @re &*]?"
inCharSet = "!@#$%^&*()[]{};:,./?\|`~-=_+\""
# maketrans needs both strings to have the same length: one replacement per input character
outCharSet = " " * len(inCharSet)
splCharReplaceList = str.maketrans(inCharSet, outCharSet)
splCharFreeString = specialCharacterText.translate(splCharReplaceList)
answered Feb 12, 2015 at 16:08
surendran
Translate seems faster:
N=100000, 30 special characters, string length=70
replace: 0.3251810073852539
re.sub: 0.2859320640563965
translate: 0.12320685386657715
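For anyone who wants to try a comparison like this themselves, here is a rough sketch using timeit. It is not the script that produced the numbers above: the sample text, the character list (taken from the question) and all function names are assumptions, and the timings will vary by machine and Python version.

```python
import re
import timeit

# Assumed inputs: the question's character list and an invented sample string.
SPECIAL_CHARS = "!@#$%^&*()[]{};:,./?\\|`~-=_+"
TEXT = "Hello, world! (a sample) [of text] with: some; special-chars... ok?"

TABLE = {ord(c): " " for c in SPECIAL_CHARS}
PATTERN = re.compile("[" + re.escape(SPECIAL_CHARS) + "]")


def with_replace(text):
    # One replace call per special character.
    for c in SPECIAL_CHARS:
        text = text.replace(c, " ")
    return text


def with_re_sub(text):
    return PATTERN.sub(" ", text)


def with_translate(text):
    return text.translate(TABLE)


def run_benchmark(number=100_000):
    """Return the total seconds each approach takes for `number` calls."""
    return {
        fn.__name__: timeit.timeit(lambda: fn(TEXT), number=number)
        for fn in (with_replace, with_re_sub, with_translate)
    }


for name, seconds in run_benchmark().items():
    print(f"{name}: {seconds:.4f}")
```

All three functions produce the same output string, so the comparison is only about speed.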
answered Sep 2 at 17:35
Yano