Learning about parsing XML and HTML with lxml

December 22, 2025

Following along with “Parsing XML and HTML with lxml”.

lxml provides an API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing with an event-driven API (currently only for XML).¹

Tutorial setup #

The examples use StringIO and BytesIO for parsing from files and file-like objects.

In [119]: from lxml import etree

In [120]: from io import StringIO, BytesIO

lxml parsers #

Parsers are represented by parser objects. (I’m assuming this will become clear later on.)

Parsing XML from an in-memory string:

In [121]: xml = '<a xmlns="zalgorithm"><b xmlns="zalgorithm"/></a>'

In [122]: root = etree.fromstring(xml)

In [123]: etree.tostring(root)
Out[123]: b'<a xmlns="zalgorithm"><b xmlns="zalgorithm"/></a>'

To read from a file or file-like object rather than an in-memory string, use the parse() function. The call to StringIO(xml) creates a file-like object:

In [128]: tree = etree.parse(StringIO(xml))

In [129]: etree.tostring(tree)
Out[129]: b'<a xmlns="zalgorithm"><b xmlns="zalgorithm"/></a>'
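
A minimal sketch of how the two entry points differ, reusing the same xml string as above. The variable names tree_from_text and tree_from_bytes are just for illustration. It shows that fromstring() hands back the root element directly, while parse() hands back a tree object that wraps the root, and that BytesIO works the same way as StringIO when the data is bytes:

```python
from io import BytesIO, StringIO

from lxml import etree

xml = '<a xmlns="zalgorithm"><b xmlns="zalgorithm"/></a>'

# fromstring() parses an in-memory string and returns the root element.
root = etree.fromstring(xml)

# parse() reads from a file-like object (or a filename) and returns a tree object.
tree_from_text = etree.parse(StringIO(xml))
tree_from_bytes = etree.parse(BytesIO(xml.encode("utf-8")))

print(type(root).__name__)            # _Element
print(type(tree_from_text).__name__)  # _ElementTree

# getroot() goes from the tree wrapper to its root element.
print(tree_from_bytes.getroot().tag)  # {zalgorithm}a
```

The tag is printed with the namespace in curly braces because lxml stores tags as qualified names.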

In practice it’s more common to pass a filename. E.g.:

tree = etree.parse('./foo.xml')

Support for parsing from HTTP URLs, FTP URLs, and zlib-compressed files depends on how lxml was compiled. (I haven’t checked my installation, as I’m currently only interested in parsing local files.)

lxml parser options #

See the documentation’s Parser Options section for details.

Parsers accept setup options as keyword arguments. (This answers my question about parsers being parser objects: the parser object is created first, then passed as the second argument in the call xml_root = etree.fromstring(xml, parser).)

In [130]: parser = etree.XMLParser(ns_clean=True)  # try to clean redundant namespace declarations

In [131]: type(parser)
Out[131]: lxml.etree.XMLParser

In [132]: xml_root = etree.fromstring(xml, parser)

In [133]: etree.tostring(xml_root)
Out[133]: b'<a xmlns="zalgorithm"><b/></a>'
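
To see what ns_clean actually changes, here is a small side-by-side sketch using the same xml string. The names default_root and cleaned_root are mine. As far as I can tell, the option only affects the redundant declaration in the serialized output; the elements keep their namespace either way:

```python
from lxml import etree

xml = '<a xmlns="zalgorithm"><b xmlns="zalgorithm"/></a>'

# Same input, parsed once with the default parser and once with ns_clean=True.
default_root = etree.fromstring(xml)
cleaned_root = etree.fromstring(xml, etree.XMLParser(ns_clean=True))

print(etree.tostring(default_root))
print(etree.tostring(cleaned_root))

# The child element is still in the "zalgorithm" namespace in both trees.
print(default_root[0].tag)
print(cleaned_root[0].tag)
```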

lxml parser error logs #

Parsers have an error_log property:

In [134]: parser = etree.XMLParser()

In [135]: print(len(parser.error_log))
0

In [138]: tree = etree.XML("<root>\n</b>", parser)
# error stack trace follows, then:
In [139]: print(len(parser.error_log))
1

In [140]: error = parser.error_log[0]

In [141]: print(error.message)
Opening and ending tag mismatch: root line 1 and b
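
Outside of an interactive session, the stack trace would stop the script, so a sketch like the following catches the exception and then reads the log. It assumes the failure is raised as etree.XMLSyntaxError, and the failed flag is only there for illustration. Each log entry also carries a line and column alongside the message:

```python
from lxml import etree

parser = etree.XMLParser()
failed = False

try:
    etree.XML("<root>\n</b>", parser)
except etree.XMLSyntaxError as exc:
    failed = True
    print("parse failed:", exc)

# The parser keeps its own log, which is still readable after the exception.
for entry in parser.error_log:
    print(entry.line, entry.column, entry.level_name, entry.message)
```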

lxml parsing HTML #

lxml parsers have a recover keyword argument that the HTML parser sets to True by default.

Example of broken HTML getting fixed by the parser:

In [145]: broken_html = "<html><head><title>foo</title><body><h1>bar></h3></html>"

In [146]: parser = etree.HTMLParser()

In [147]: html_root = etree.fromstring(broken_html, parser)

In [148]: result = etree.tostring(html_root, pretty_print=True, method="html")

In [149]: print(result)
b'<html>\n<head><title>foo</title></head>\n<body><h1>bar&gt;</h1></body>\n</html>\n'

Confirming that when recover=False, broken HTML triggers an error:

In [154]: broken_html = "<html><head><title>foo</title><body><h1>bar></h3></html>"

In [155]: parser = etree.HTMLParser(recover=False)

In [156]: html_root = etree.fromstring(broken_html, parser)
Traceback (most recent call last):
# ...
In [157]: print(len(parser.error_log))
1

In [158]: print(parser.error_log[0].message)
Unexpected end tag : h3
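
The two behaviours above can be put side by side in one script. This is a rough sketch: lenient_root and strict_parser are my own names, and I'm catching the broad etree.LxmlError because I'm not certain which errors the strict parser treats as fatal on every libxml2 version, so the script handles both outcomes:

```python
from lxml import etree

broken_html = "<html><head><title>foo</title><body><h1>bar></h3></html>"

# Default HTMLParser: recover=True, so the parser repairs what it can.
lenient_root = etree.fromstring(broken_html, etree.HTMLParser())
print(lenient_root.findtext(".//title"))
print(lenient_root.find(".//h1").text)

# Strict parser: keep a reference so its error log can be read afterwards.
strict_parser = etree.HTMLParser(recover=False)
try:
    etree.fromstring(broken_html, strict_parser)
except etree.LxmlError as exc:
    print("strict parse failed:", exc)
else:
    # Assumption: some builds may tolerate this input even with recover=False.
    print("strict parse succeeded on this build")

for entry in strict_parser.error_log:
    print(entry.line, entry.message)
```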

The lxml HTML function #

The HTML() function is similar to the XML() function. It uses a default HTMLParser unless a parser is passed as its second argument:

In [159]: html_root = etree.HTML("""
     ...: <html>
     ...: <body>
     ...: <h1>this is a test</h1>
     ...: </body>
     ...: </html>
     ...: """)

In [160]: etree.tostring(html_root)
Out[160]: b'<html>\n<body>\n<h1>this is a test</h1>\n</body>\n</html>'
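
A quick check of that understanding, as a sketch: parse the same markup with HTML() and with fromstring() plus an explicit HTMLParser, then compare. The markup string and variable names here are made up for the example. The last part shows that a bare fragment still comes back wrapped in an html root:

```python
from lxml import etree

markup = "<html><body><h1>this is a test</h1></body></html>"

# HTML() with no parser argument versus fromstring() with an explicit HTMLParser.
via_html = etree.HTML(markup)
via_fromstring = etree.fromstring(markup, etree.HTMLParser())
print(etree.tostring(via_html) == etree.tostring(via_fromstring))

# A fragment without <html> or <body> is still returned under an html root.
fragment_root = etree.HTML("<h1>this is a test</h1>")
print(fragment_root.tag)
print(fragment_root.find(".//h1").text)
```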

Loading an HTML file to parse #

Skipping ahead a bit here, but I want to load an existing HTML file:

In [173]: from lxml import html

In [174]: tree = html.parse("/home/scossar/zalgorithm/public/notes/learning-about-parsing-xml-and-html-with-lxml/index.html")

In [177]: docinfo = tree.docinfo

In [179]: print(docinfo.doctype)
<!DOCTYPE html>

In [182]: root = tree.getroot()

In [183]: root.tag
Out[183]: 'html'
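
Since the path above only exists on my machine, here is a self-contained sketch of the same steps that writes a small page to a temporary directory first. The page content and the index.html file name are placeholders. It also pulls a couple of values out of the tree to show what can be done once the root is loaded:

```python
import tempfile
from pathlib import Path

from lxml import html

# Placeholder page, standing in for a real file on disk.
page = (
    "<!DOCTYPE html>\n"
    "<html><head><title>demo</title></head>"
    "<body><h1>hello</h1></body></html>"
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "index.html"
    path.write_text(page, encoding="utf-8")
    # html.parse() takes a filename and returns a tree object.
    tree = html.parse(str(path))

root = tree.getroot()
print(tree.docinfo.doctype)
print(root.tag)
print(root.findtext(".//title"))
print(root.xpath("//h1/text()"))
```

The file is read in full during parse(), so the tree stays usable after the temporary directory is gone.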

I’m giving myself permission to proceed onto [parsing-html-files-with-lxml].

References #

Behnel, Stefan. “The lxml.etree Tutorial”. https://lxml.de/tutorial.html .

“Parsing XML and HTML with lxml”. https://lxml.de/parsing.html . Generated on June 26, 2025.

¹ “Parsing XML and HTML with lxml”. https://lxml.de/parsing.html . Generated on June 26, 2025.
