Guide: pretty-printing HTML in Python

I am using lxml.html to generate some HTML. I want to pretty print (with indentation) my final result into an html file. How do I do that?

This is what I have tried so far:

import lxml.html as lh
from lxml.html import builder as E
sliderRoot=lh.Element("div", E.CLASS("scroll"), style="overflow-x: hidden; overflow-y: hidden;")
scrollContainer=lh.Element("div", E.CLASS("scrollContainer"), style="width: 4340px;")
sliderRoot.append(scrollContainer)
print(lh.tostring(sliderRoot, pretty_print=True, method="html", encoding="unicode"))

As you can see, I am passing pretty_print=True. I thought that would produce indented output, but it doesn't really help. This is the output:

OneCricketeer

asked May 27, 2011 at 9:09

I ended up using BeautifulSoup directly. It is the library that lxml.html.soupparser uses for parsing HTML.

BeautifulSoup has a prettify method that does exactly what it says: it re-serializes the HTML with proper indentation.

BeautifulSoup will NOT fix the HTML, so broken code remains broken. But in this case, since the code is generated by lxml, the HTML should at least be well-formed.

For the example given in my question, I would do this:

import lxml.html as lh
from bs4 import BeautifulSoup as bs

root = lh.tostring(sliderRoot)  # serialize the generated HTML (sliderRoot comes from the question)
soup = bs(root, "html.parser")  # parse it; naming the parser avoids a "no parser specified" warning
prettyHTML = soup.prettify()    # re-serialize with indentation

Tyrannas

answered May 29, 2011 at 11:14

bcosynot

Though my answer might not be helpful now, I am leaving it here as a reference for anyone else in the future.

lxml.html.tostring() indeed doesn't pretty-print the provided HTML, in spite of pretty_print=True.

However, the "sibling" module of lxml.html, lxml.etree, handles it well.

So one might use it as follows:

from lxml import etree, html

document_root = html.fromstring("<html><body><h1>hello world</h1></body></html>")
print(etree.tostring(document_root, encoding='unicode', pretty_print=True))

The output is like this:

<html>
  <body>
    <h1>hello world</h1>
  </body>
</html>

answered May 12, 2013 at 9:01

Jayesh Bhoot

If you store the HTML as an unformatted string, in a variable html_string, it can be done using beautifulsoup4 as follows:

from bs4 import BeautifulSoup
print(BeautifulSoup(html_string, 'html.parser').prettify())

answered Nov 13, 2017 at 7:26

Alex

If adding one more dependency is not a problem, you can use the html5print package. Its advantage over the other solutions is that it also beautifies CSS and JavaScript code embedded in the HTML document.

To install it, execute:

pip install html5print

Then, you can either use it as a command:

html5-print ugly.html -o pretty.html

or as Python code:

from html5print import HTMLBeautifier
html = '<title>Page Title</title><p>Some text here</p>'
print(HTMLBeautifier.beautify(html, 4))

answered Mar 19, 2018 at 20:49

pgmank

I tried both BeautifulSoup's prettify and html5print's HTMLBeautifier, but since I'm using yattag to generate the HTML, it seems more appropriate to use its own indent function, which produces nicely indented output.

from yattag import indent

rawhtml = "String with some HTML code..."

result = indent(
    rawhtml,
    indentation = '    ',
    newline = '\r\n',
    indent_text = True
)

print(result)

answered May 6, 2018 at 8:02

Vadym Pasko

Under the hood, lxml uses libxml2 to serialize the tree back into a string. Here is the relevant snippet of code that determines whether to append a newline after closing a tag:

    xmlOutputBufferWriteString(buf, ">");
    if ((format) && (!info->isinline) && (cur->next != NULL)) {
        if ((cur->next->type != HTML_TEXT_NODE) &&
            (cur->next->type != HTML_ENTITY_REF_NODE) &&
            (cur->parent != NULL) &&
            (cur->parent->name != NULL) &&
            (cur->parent->name[0] != 'p')) /* p, pre, param */
            xmlOutputBufferWriteString(buf, "\n");
    }
    return;

So, with formatting enabled, a newline is written only if the node is not an inline element, it is followed by a sibling (cur->next != NULL) that is neither a text node nor an entity reference, and its parent's name does not start with 'p' (p, pre, param).
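To make the rule easier to experiment with, here is a rough Python model of the C condition quoted above. It is only a sketch: the function name, the node-type constants and the argument names are made up for illustration and are not part of lxml or libxml2.

```python
# Hypothetical stand-ins for libxml2's node types (illustrative only).
ELEMENT_NODE = "element"
TEXT_NODE = "text"
ENTITY_REF_NODE = "entity_ref"


def writes_newline(fmt, is_inline, next_type, parent_name):
    """Mirror the quoted libxml2 condition.

    fmt         -- True when pretty-printing was requested
    is_inline   -- True when the current element is an inline tag
    next_type   -- type of the next sibling, or None when there is none
    parent_name -- tag name of the parent, or None when unavailable
    """
    if not fmt or is_inline or next_type is None:
        return False
    if next_type in (TEXT_NODE, ENTITY_REF_NODE):
        return False
    if parent_name is None:
        return False
    # Only the first character is checked, which covers p, pre and param.
    return not parent_name.startswith("p")


print(writes_newline(True, False, ELEMENT_NODE, "div"))  # block sibling inside a div
print(writes_newline(True, False, ELEMENT_NODE, "pre"))  # parent name starts with 'p'
print(writes_newline(True, False, None, "div"))          # no next sibling
```

Note the last case: an element with no following sibling never gets a newline from this check, whatever its parent is.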

answered May 27, 2011 at 15:40

samplebias

Couldn't you just pipe it into HTML Tidy? Either from the shell or through os.system().
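A rough sketch of what that could look like from Python. It uses subprocess instead of os.system so the HTML can be piped in and the result captured. It assumes the tidy executable is on the PATH; the helper name tidy_html is made up for this example.

```python
import shutil
import subprocess


def tidy_html(html_string):
    """Pipe html_string through HTML Tidy and return the indented result.

    Returns None when no tidy executable is found on the PATH.
    """
    exe = shutil.which("tidy")
    if exe is None:
        return None
    proc = subprocess.run(
        [exe, "-indent", "-quiet", "--tidy-mark", "no"],
        input=html_string,
        capture_output=True,
        text=True,
    )
    # Tidy exits with 0 on success, 1 on warnings and 2 on errors,
    # so only treat codes above 1 as failure.
    if proc.returncode > 1:
        raise RuntimeError(proc.stderr)
    return proc.stdout


result = tidy_html("<div><p>hello world</p></div>")
print(result if result is not None else "tidy is not installed")
```

Keep in mind that Tidy also repairs the markup, so a fragment like the one above comes back wrapped in a full document.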

answered May 27, 2011 at 13:14

tsm

If you don't care about quirky HTMLness (e.g. you don't absolutely have to support those hordes of Netscape 2.0-using clients, for whom having <br> instead of <br /> is a must), you can always change your method to "xml", which seems to work. This is probably a bug in lxml or in libxml, but I couldn't find the reason for it.
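A minimal sketch of that workaround, reusing the element names from the question. It assumes lxml is installed, and the exact whitespace may differ between lxml/libxml2 versions.

```python
import lxml.html as lh

# Build the same kind of nested divs as in the question.
root = lh.Element("div", {"class": "scroll"})
child = lh.Element("div", {"class": "scrollContainer"})
root.append(child)

# method="xml" switches to the XML serializer, which indents the output.
# Trade-off: empty elements are written self-closed (<div/>), which a
# browser parsing the result as HTML may not treat as a closed element.
pretty = lh.tostring(root, pretty_print=True, method="xml", encoding="unicode")
print(pretty)
```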

answered May 27, 2011 at 12:56

Boaz Yaniv

This is not really my code; I picked it up somewhere:

def indent(elem, level=0):
    # Whitespace that goes before a tag at this nesting level.
    i = '\n' + level * '  '
    if len(elem):
        # Put the first child on its own line, one level deeper.
        if not elem.text or not elem.text.strip():
            elem.text = i + '  '
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for child in elem:
            indent(child, level+1)
        # Dedent after the last child so elem's closing tag lines up.
        if not child.tail or not child.tail.strip():
            child.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i

I use it with:

indent(page)
tostring(page)
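For comparison, the standard library's xml.etree.ElementTree has had a built-in indent() helper since Python 3.9, and it uses the same text/tail whitespace technique. A minimal sketch, assuming the tree is built with ElementTree rather than lxml (the element names are borrowed from the question):

```python
import xml.etree.ElementTree as ET

# Build a small tree similar to the one in the question.
root = ET.Element("div", {"class": "scroll"})
ET.SubElement(root, "div", {"class": "scrollContainer"})

# Rewrites the .text/.tail whitespace in place (requires Python 3.9+).
ET.indent(root, space="  ")

# method="html" keeps the empty div as <div ...></div> instead of <div ... />.
pretty = ET.tostring(root, encoding="unicode", method="html")
print(pretty)
```

Like the recipe above, this changes the tree itself, so it is best applied just before serializing.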

answered May 27, 2011 at 15:22

sherpya