Learning about parsing XML and HTML with lxml
Following along with Parsing XML and HTML with lxml.
Related to Following the lxml Etree Tutorial.
lxml provides an API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing with an event-driven API (currently only for XML). [1]
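A minimal sketch of what the event-driven side can look like, using etree.iterparse(). The sample document and the `seen` list are my own illustration, not taken from the lxml docs.

```python
from io import BytesIO
from lxml import etree

# Hypothetical sample document, used only for illustration.
source = BytesIO(b"<root><a>1</a><b>2</b></root>")

seen = []
# iterparse() yields (event, element) pairs while the input is being read,
# instead of handing back a finished tree in one step.
for event, element in etree.iterparse(source, events=("start", "end")):
    seen.append((event, element.tag))

print(seen)
```

Each element shows up twice here: once when its opening tag is read ("start") and once when its closing tag is read ("end").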
Tutorial setup
The examples use StringIO and BytesIO for parsing from files and file-like objects.
In [119]: from lxml import etree
In [120]: from io import StringIO, BytesIO
lxml parsers
Parsers are represented by parser objects. (I’m assuming this will become clear later on.)
Parsing XML from an in-memory string:
In [121]: xml = '<a xmlns="zalgorithm"><b xmlns="zalgorithm"/></a>'
In [122]: root = etree.fromstring(xml)
In [123]: etree.tostring(root)
Out[123]: b'<a xmlns="zalgorithm"><b xmlns="zalgorithm"/></a>'
To read from a file or file-like object instead of an in-memory string, use the parse() function. Note that parse() returns an ElementTree object, whereas fromstring() returns the root Element.
The call to StringIO(xml) creates a file-like object:
In [128]: tree = etree.parse(StringIO(xml))
In [129]: etree.tostring(tree)
Out[129]: b'<a xmlns="zalgorithm"><b xmlns="zalgorithm"/></a>'
In practice it’s more common to pass a filename. E.g.:
tree = etree.parse('./foo.xml')
Support for parsing from HTTP, FTP, and zlib depends on lxml compile options. (I haven’t checked my installation, as I’m currently only interested in parsing local files.)
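To make the difference between the two return types concrete, here is a small sketch that parses from a BytesIO object and then asks the tree for its root element. The sample bytes are made up for the example.

```python
from io import BytesIO
from lxml import etree

# Hypothetical sample data; bytes input may carry an XML encoding declaration.
xml_bytes = b'<?xml version="1.0" encoding="UTF-8"?><a><b>text</b></a>'

# parse() reads from a file-like object and returns an ElementTree.
tree = etree.parse(BytesIO(xml_bytes))

# getroot() returns the root Element, the same kind of object fromstring() returns.
root = tree.getroot()

print(type(tree).__name__)
print(root.tag)
print(root[0].text)
```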
lxml parser options
See the documentation’s Parser Options section for details.
Parsers accept setup options as keyword arguments. (Note: this answers my earlier question about parsers being parser objects. The parser object is created first, then passed as an argument in the call to etree.fromstring(xml, parser).) For example:
In [130]: parser = etree.XMLParser(ns_clean=True) # try to clean redundant namespace declarations
In [131]: type(parser)
Out[131]: lxml.etree.XMLParser
In [132]: xml_root = etree.fromstring(xml, parser)
In [133]: etree.tostring(xml_root)
Out[133]: b'<a xmlns="zalgorithm"><b/></a>'
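The same comparison as a small script, so the effect of the option is visible side by side. The serialize() helper is a hypothetical name I made up for this sketch; it just forwards keyword arguments to etree.XMLParser.

```python
from lxml import etree

xml = '<a xmlns="zalgorithm"><b xmlns="zalgorithm"/></a>'


def serialize(xml_text, **parser_options):
    """Parse xml_text with an XMLParser built from parser_options, then serialize it."""
    parser = etree.XMLParser(**parser_options)
    return etree.tostring(etree.fromstring(xml_text, parser))


default_output = serialize(xml)                 # default parser options
cleaned_output = serialize(xml, ns_clean=True)  # redundant xmlns on <b> removed

print(default_output)
print(cleaned_output)
```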
lxml parser error logs
Parsers have an error_log property:
In [134]: parser = etree.XMLParser()
In [135]: print(len(parser.error_log))
0
In [138]: tree = etree.XML("<root>\n</b>", parser)
# error stack trace follows, then:
In [139]: print(len(parser.error_log))
1
In [140]: error = parser.error_log[0]
In [141]: print(error.message)
Opening and ending tag mismatch: root line 1 and b
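Outside of an interactive session, the stack trace would stop the script, so a sketch like the following catches the exception and then walks the log. Each log entry also carries position information (line, column) and a severity name (level_name) in addition to the message.

```python
from lxml import etree

parser = etree.XMLParser()
caught = None

try:
    etree.XML("<root>\n</b>", parser)
except etree.XMLSyntaxError as exc:
    # The exception stops the parse; the details stay available on the parser.
    caught = exc

for entry in parser.error_log:
    # Print where the problem was found, how severe it is, and the message.
    print(entry.line, entry.column, entry.level_name, entry.message)
```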
lxml parsing HTML
lxml parsers have a recover keyword argument that the HTML parser sets to True by default.
Example of broken HTML getting fixed by the parser:
In [145]: broken_html = "<html><head><title>foo</title><body><h1>bar></h3></html>"
In [146]: parser = etree.HTMLParser()
In [147]: html_root = etree.fromstring(broken_html, parser)
In [148]: result = etree.tostring(html_root, pretty_print=True, method="html")
In [149]: print(result)
b'<html>\n<head><title>foo</title></head>\n<body><h1>bar></h1></body>\n</html>\n'
Confirming that when recover=False, broken HTML triggers an error:
In [154]: broken_html = "<html><head><title>foo</title><body><h1>bar></h3></html>"
In [155]: parser = etree.HTMLParser(recover=False)
In [156]: html_root = etree.fromstring(broken_html, parser)
Traceback (most recent call last):
# ...
In [157]: print(len(parser.error_log))
1
In [158]: print(parser.error_log[0].message)
Unexpected end tag : h3
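One way to combine the two modes, as a rough sketch: try a strict parse first to collect the errors, then fall back to the default recovering parser. The parse_html() helper is hypothetical, and I'm assuming the strict HTML parser raises etree.XMLSyntaxError, which is what I saw in my session; whether a given input triggers an error at all may depend on the underlying libxml2 version.

```python
from lxml import etree

broken_html = "<html><head><title>foo</title><body><h1>bar></h3></html>"


def parse_html(text):
    """Try a strict parse first; fall back to the recovering parser on error."""
    strict = etree.HTMLParser(recover=False)
    try:
        return etree.fromstring(text, strict), list(strict.error_log)
    except etree.XMLSyntaxError:
        # Keep the errors from the strict attempt, then parse leniently.
        errors = list(strict.error_log)
        return etree.fromstring(text, etree.HTMLParser()), errors


html_root, errors = parse_html(broken_html)
print(html_root.tag)
for entry in errors:
    print(entry.line, entry.message)
```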
The lxml HTML function
The HTML() function is similar to the XML() function. If I’m understanding things correctly, it automatically creates the HTMLParser:
In [159]: html_root = etree.HTML("""
...: <html>
...: <body>
...: <h1>this is a test</h1>
...: </body>
...: </html>
...: """)
In [160]: etree.tostring(html_root)
Out[160]: b'<html>\n<body>\n<h1>this is a test</h1>\n</body>\n</html>'
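Two more things worth trying here, sketched with made-up input strings: HTML() wraps a bare fragment in html and body elements, and it also accepts an explicit parser as its second argument when the defaults aren't what I want.

```python
from lxml import etree

# A bare fragment still comes back as a full document tree.
fragment_root = etree.HTML("<p>just a paragraph</p>")
print(fragment_root.tag)
print(etree.tostring(fragment_root))

# Passing an explicit parser instead of relying on the default one.
parser = etree.HTMLParser(remove_comments=True)
root = etree.HTML("<p>text<!-- note --></p>", parser)
print(etree.tostring(root))
```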
Loading an HTML file to parse
Skipping ahead a bit here, but I want to load an existing HTML file:
In [173]: from lxml import html
In [174]: tree = html.parse("/home/scossar/zalgorithm/public/notes/learning-about-parsing-xml-and-html-with-lxml/index.html")
In [177]: docinfo = tree.docinfo
In [179]: print(docinfo.doctype)
<!DOCTYPE html>
In [182]: root = tree.getroot()
In [183]: root.tag
Out[183]: 'html'
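A self-contained version of the same steps that doesn't depend on my local path. It writes a small, made-up page to a temporary directory, parses it with lxml.html.parse(), and then pulls a couple of values out of the tree. The page content and variable names are placeholders for illustration.

```python
import tempfile
from pathlib import Path

from lxml import html

# Hypothetical page content, standing in for a real file on disk.
page = (
    "<!DOCTYPE html>\n"
    "<html><head><title>Notes</title></head>"
    "<body><h1>Hello</h1><a href='/a'>A</a></body></html>"
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "index.html"
    path.write_text(page, encoding="utf-8")

    tree = html.parse(str(path))        # returns an ElementTree
    root = tree.getroot()               # the <html> element
    doctype = tree.docinfo.doctype      # the document's DOCTYPE declaration
    title = root.findtext(".//title")   # text of the first <title>
    links = root.xpath("//a/@href")     # all href values on <a> elements

print(doctype)
print(root.tag, title, links)
```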
I’m giving myself permission to proceed onto [parsing-html-files-with-lxml].
References
Behnel, Stefan. “The lxml.etree Tutorial”. https://lxml.de/tutorial.html.
“Parsing XML and HTML with lxml”. https://lxml.de/parsing.html. Generated on June 26, 2025.
[1] “Parsing XML and HTML with lxml”, https://lxml.de/parsing.html. Generated on June 26, 2025.