How do I remove all HTML tags in Python?

Pyparsing makes it easy to write an HTML stripper by defining a pattern matching all opening and closing HTML tags, and then transforming the input using that pattern as a suppressor. This still leaves the &xxx; HTML entities to be converted - you can use xml.sax.saxutils.unescape to do that:

source = """
Editors' Pick: Originally published March 22.
 
 Apple (AAPL - Get Report) is waking up the echoes with the reintroduction of a 4-inch iPhone, a model its creators hope will lead the company to victory not just in emerging markets, but at home as well. 
"There's significant pent-up demand within Apple's base of iPhone owners who want a smaller iPhone with up-to-date specs and newer features," Jackdaw Research Chief Analyst Jan Dawson said in e-mailed comments. 
The new model, dubbed the iPhone SE, "should unleash a decent upgrade cycle over the coming months," Dawson said. Prior to the iPhone 6 and 6 Plus, introduced in 2014, Apple's iPhones were small, at 3.5 inches and 4 inches tall, respectively, compared with models by Samsung and others that approached 6 inches.

 """

from pyparsing import anyOpenTag, anyCloseTag
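from xml.sax.saxutils import unescape
# unescape() converts &amp;, &lt; and &gt; on its own; the dict adds the other common entities.
unescape_xml_entities = lambda s: unescape(s, {"&apos;": "'", "&quot;": '"', "&nbsp;": " "})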

stripper = (anyOpenTag | anyCloseTag).suppress()

print(unescape_xml_entities(stripper.transformString(source)))

gives:

Editors' Pick: Originally published March 22.  Apple (AAPL - Get Report) is waking up the echoes with the reintroduction of a 4-inch iPhone, a model its creators hope will lead the company to victory not just in emerging markets, but at home as well. 
"There's significant pent-up demand within Apple's base of iPhone owners who want a smaller iPhone with up-to-date specs and newer features," Jackdaw Research Chief Analyst Jan Dawson said in e-mailed comments. 
The new model, dubbed the iPhone SE, "should unleash a decent upgrade cycle over the coming months," Dawson said. Prior to the iPhone 6 and 6 Plus, introduced in 2014, Apple's iPhones were small, at 3.5 inches and 4 inches tall, respectively, compared with models by Samsung and others that approached 6 inches.
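
The entity step can also be tried on its own. The sketch below uses only the standard library; the sample string is invented for illustration, and EXTRA_ENTITIES is just a name chosen here. xml.sax.saxutils.unescape always converts &amp;, &lt; and &gt;, and the dictionary supplies any further replacements.

```python
from xml.sax.saxutils import unescape

# Entities beyond the three that unescape() handles by default.
EXTRA_ENTITIES = {"&apos;": "'", "&quot;": '"', "&nbsp;": " "}


def unescape_xml_entities(text):
    """Convert the default XML entities plus the extras listed above."""
    return unescape(text, EXTRA_ENTITIES)


# Hypothetical sample text, not taken from the article above.
sample = "Editors&apos; Pick: &quot;AT&amp;T&quot;&nbsp;&lt;news&gt;"
print(unescape_xml_entities(sample))
```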

(And in future, please do not provide sample text or code as non-copy-pasteable images.)

Earlier this week I needed to remove some HTML tags from a piece of text. The target string was already saved with HTML tags in the database, and one of the requirements specified that on some pages we need to render it as raw text.

I knew from the beginning that regular expressions could handle this, but since I am not an expert with regular expressions I looked for some advice on Stack Overflow and found what I needed.

Below is the function I have defined:

def remove_html_tags(text):
    """Remove html tags from a string"""
    import re
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

The idea is to build a regular expression that matches everything from a "<" up to the first ">" that follows it, and then use the sub function to replace each match (the whole tag, brackets included) with an empty string.

Let's see this in the shell:
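
A quick way to try the function is a short script like the one below. The sample string is made up for illustration and is not from the original post.

```python
import re


def remove_html_tags(text):
    """Remove HTML tags from a string."""
    clean = re.compile(r'<.*?>')
    return clean.sub('', text)


# Hypothetical input with nested tags.
sample = '<p>Hello, <b>world</b>!</p>'
print(remove_html_tags(sample))  # Hello, world!
```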

Hope this can help you!

Table of Contents

Strip the HTML tags from a string in Python
Strip the HTML tags from a string using regex in Python

Strip the HTML tags from a string in Python

To strip the HTML tags from a string in Python:

Extend the HTMLParser class from the html.parser module.
Implement the handle_data method to get the data between the HTML tags.
Store the data in a list on the class instance.
Call the get_data() method on an instance of the class.

from html.parser import HTMLParser


class MLRemover(HTMLParser):
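    def __init__(self):
        # convert_charrefs=True decodes character references (e.g. &amp;)
        # to Unicode before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, data):
        self.fed.append(data)

    # The two handlers below run only when convert_charrefs is False;
    # they keep the reference in the output unchanged.
    def handle_entityref(self, name):
        self.fed.append(f'&{name};')

    def handle_charref(self, name):
        self.fed.append(f'&#{name};')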

    def get_data(self):
        return ''.join(self.fed)


def strip_html(value):
    remover = MLRemover()

    remover.feed(value)
    remover.close()
    return remover.get_data()


my_html = """

  First line
  Second line
  Third line

"""


# First line
# Second line
# Third line

print(strip_html(my_html))

Scroll down to the next subheading if you prefer a RegExp solution.

We extended the HTMLParser class. The code snippet is very similar to the one used internally by Django.

The HTMLParser class is used to find tags and other markup and call handler functions.

The data between the HTML tags is passed from the parser to the derived class by calling self.handle_data().

When convert_charrefs is set to True, character references automatically get converted to the corresponding Unicode character.

If convert_charrefs is set to False, character references are passed by calling the self.handle_entityref() or self.handle_charref() methods.
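
To make the difference concrete, here is a minimal sketch that runs the same input through both settings. TextCollector and extract_text are names invented for this example, and the sample markup is hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text; the flag decides how character references are handled."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # Called only when convert_charrefs is False.
        self.parts.append(f'&{name};')

    def handle_charref(self, name):
        # Called only when convert_charrefs is False.
        self.parts.append(f'&#{name};')


def extract_text(markup, convert_charrefs):
    parser = TextCollector(convert_charrefs)
    parser.feed(markup)
    parser.close()
    return ''.join(parser.parts)


sample = '<p>Fish &amp; chips &#169; 2024</p>'
print(extract_text(sample, True))   # Fish & chips © 2024
print(extract_text(sample, False))  # Fish &amp; chips &#169; 2024
```

Which setting is appropriate depends on whether the stripped text will be displayed as plain text or embedded in HTML again.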

The get_data() method uses the str.join() method to join the list of strings without a separator.

The str.join method takes an iterable as an argument and returns a string which is the concatenation of the strings in the iterable.

The strip_html() function takes a string containing HTML tags and strips them.

The function instantiates the class and feeds the string containing the html tags to the parser.

The next step is to call the close() method on the instance to handle any buffered data.

Lastly, we call the get_data() method on the instance to join the list of strings into a single string that doesn't contain any HTML tags.

Alternatively, you can use a regular expression.

Strip the HTML tags from a string using regex in Python

Use the re.sub() method to strip the HTML tags from a string, e.g. result = re.sub('<.*?>', '', html_string). The re.sub() method will strip all opening and closing HTML tags by replacing them with empty strings.

import re

html_string = """

  First
  Second
  
Third

"""
result = re.sub(r'<.*?>', '', html_string)
# First
# Second
# Third
print(result)

The re.sub method returns a new string that is obtained by replacing the occurrences of the pattern with the provided replacement.

If the pattern isn't found, the string is returned as is.

The first argument we passed to the re.sub() method is a regular expression.

The brackets < and > match the opening and closing characters of an HTML tag.

The dot . matches any character except a newline.
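
One consequence is that a tag spanning several lines is not matched by the pattern as written. A possible workaround, shown here as a sketch with a made-up input, is to pass re.DOTALL so that the dot also matches newlines.

```python
import re

# Hypothetical input: the opening tag is split across two lines.
sample = '<a\nhref="https://example.com">link</a>'

# Without DOTALL the opening tag survives because "." stops at the newline.
without_dotall = re.sub(r'<.*?>', '', sample)
print(repr(without_dotall))

# With DOTALL the dot also matches "\n", so both tags are removed.
with_dotall = re.sub(r'<.*?>', '', sample, flags=re.DOTALL)
print(repr(with_dotall))
```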

The asterisk * matches 0 or more repetitions of the preceding expression (here the dot, so any character).

Adding a question mark ? after the quantifier makes it perform a non-greedy or minimal match.

For example, when matched against the string <a> b <c>, the regular expression <.*?> will match only <a>, whereas the greedy <.*> would match the entire string.
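
The difference between the greedy and non-greedy forms can be checked with a short sketch. The sample string is only an illustration.

```python
import re

sample = '<a> b <c>'

# Greedy: ".*" grabs as much as possible, so the match runs to the last ">".
greedy = re.findall(r'<.*>', sample)
print(greedy)  # ['<a> b <c>']

# Non-greedy: ".*?" stops at the first ">", so each tag is matched separately.
lazy = re.findall(r'<.*?>', sample)
print(lazy)  # ['<a>', '<c>']
```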
