Characteristics of XML and HTML elements
Related: Parsing HTML files with lxml
At the moment I’m interested in HTML elements and the HTML structure of my Hugo blog. I’m making the assumption that what I’m learning applies to XML documents in general.
I’m using the Python lxml library to parse HTML: Python lxml
XML elements versus element trees
“An Element is the main container object for the ElementTree API”. An element is (or can be(?)) a
tree: “Elements are organized in an XML tree structure”. (Is <br> on its own a tree?)
In the context of lxml, an “ElementTree is mainly a document wrapper around a tree with a root node.”
What’s significant for now is that an ElementTree is what you get back when you call the lxml
parse() function, and an Element is what you get back when you call the find() function on
that ElementTree.
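A minimal sketch of that distinction, assuming an inline document in place of the real file (the SAMPLE bytes and variable names are made up for illustration):

```python
from io import BytesIO

from lxml import etree, html

# Hypothetical stand-in for the HTML file on disk.
SAMPLE = b"<html><body><main><p>Hello</p><br></main></body></html>"

tree = html.parse(BytesIO(SAMPLE))  # parse() accepts file-like objects
main = tree.find(".//main")         # find() returns one element, or None

print(type(tree).__name__)  # _ElementTree
print(type(main).__name__)  # HtmlElement
print(tree.getroot().tag)   # html

# A childless element still has the same API as any other element;
# it just has no children to index or iterate over.
br = main.find(".//br")
print(len(br))  # 0
```

So the ElementTree wraps the whole document, and getroot() hands back the top-level Element. A lone <br> behaves like a tree with a single node.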
A simple document for testing
I’ll use an actual post, Hello HTML elements. The post will change over time. Here’s the current state of its <main> element:
<main class="mx-1 my-4 grow">
  <div class="max-w-4xl mx-auto">
    <article class="prose dark:prose-invert">
      <h1>Hello HTML Elements</h1>
      <time class="text-xs" datetime="2025-12-24T13:03:37-08:00"
        >December 24, 2025</time
      >
      <p>Starting with a simple paragraph.</p>
      <div class="text-gray-900 dark:text-gray-100">
        <div>Tags:</div>
        <ul>
          <li><a href="/tags/scratch/">Scratch</a></li>
          <li><a href="/tags/coding/">Coding</a></li>
        </ul>
      </div>
    </article>
  </div>
</main>
For my reference, the path to the HTML on my computer is
/home/scossar/zalgorithm/public/notes/hello-html-elements/index.html
Creating an ElementTree object with the lxml parse function
The lxml parse() function is used to parse from files and file-like objects:
from lxml import html
filepath = "/home/scossar/zalgorithm/public/notes/hello-html-elements/index.html"
tree = html.parse(filepath) # <class 'lxml.etree._ElementTree'>
Getting an element with the etree or html find method
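Note that lxml.etree._ElementTree has a similar API to the Python xml.etree.ElementTree module. Both provide the search methods iterfind(), findall(), find(), and findtext(). The lxml implementation adds some functions on top of those, such as xpath(). Here find() is used to get the main element: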
root = tree.find(".//main") # <class 'lxml.html.HtmlElement'>
print(root.tag) # main
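For comparison, here is a small sketch of the four search methods side by side. The DOC string and the variable names are assumptions for illustration, not the real post:

```python
from lxml import html

# Hypothetical document, loosely modelled on the post's markup.
DOC = (
    "<html><body><main>"
    "<h1>Hello HTML Elements</h1>"
    '<ul><li><a href="/tags/scratch/">Scratch</a></li>'
    '<li><a href="/tags/coding/">Coding</a></li></ul>'
    "</main></body></html>"
)

page = html.fromstring(DOC)

first_link = page.find(".//a")     # first match, or None if nothing matches
all_links = page.findall(".//a")   # list of every match
hrefs = [a.get("href") for a in page.iterfind(".//a")]  # same matches, lazily
title = page.findtext(".//h1")     # text of the first match
missing = page.find(".//table")    # no <table> in the document

print(first_link.text)   # Scratch
print(len(all_links))    # 2
print(hrefs)             # ['/tags/scratch/', '/tags/coding/']
print(title)             # Hello HTML Elements
print(missing)           # None
```

Because find() returns None when nothing matches, it is worth checking the result before calling methods on it.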
Documentation:
Elements are organized in an XML tree structure
Iterating over all elements of the main element’s subtree:
for element in root.iter():
    print(element.tag)
# main
# div
# article
# h1
# time
# p
# div
# div
# ul
# li
# a
# li
# a
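A short sketch of how iter() differs from looping over an element directly, using a made-up document (DOC and the variable names are assumptions). iter() walks the whole subtree in document order, starting with the element itself, and can be filtered by tag. A plain for loop only visits direct children:

```python
from lxml import html

# Hypothetical document for illustration.
DOC = (
    "<html><body><main>"
    "<div><h1>Title</h1><p>Text</p></div>"
    "<ul><li>a</li><li>b</li></ul>"
    "</main></body></html>"
)
main = html.fromstring(DOC).find(".//main")

all_tags = [el.tag for el in main.iter()]    # whole subtree, including main
children = [el.tag for el in main]           # direct children only
items = [el.text for el in main.iter("li")]  # subtree, filtered by tag

print(all_tags)  # ['main', 'div', 'h1', 'p', 'ul', 'li', 'li']
print(children)  # ['div', 'ul']
print(items)     # ['a', 'b']
```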
An element is a kind of list:
In [11]: root[0]
Out[11]: <Element div at 0x7fb498c0d1d0>
In [12]: root[1]
--------------------------------------------------------------------------
IndexError: list index out of range
In [13]: root[0][0]
Out[13]: <Element article at 0x7fb498501e00>
root.iter() yields main first because it starts with the element itself. Indexing only looks at children, so root[0] is the first div (the root’s first child) and root[0][0] is the article tag:
In [19]: root[0][0].tag
Out[19]: 'article'
In [20]: len(root[0][0])
Out[20]: 4
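The list behaviour goes a bit further than indexing. A minimal sketch with a made-up article element (the DOC string is an assumption, not the real post):

```python
from lxml import html

# Hypothetical article with four children, like the one in the post.
DOC = (
    "<html><body><article>"
    "<h1>Title</h1><time>today</time><p>Text</p><div>Tags</div>"
    "</article></body></html>"
)
article = html.fromstring(DOC).find(".//article")

print(len(article))                       # 4
print(article[0].tag)                     # h1
print(article[-1].tag)                    # div (negative indexes work)
print([el.tag for el in article[1:3]])    # ['time', 'p'] (slices work too)
print(article.index(article[2]))          # 2

# Indexing past the last child raises IndexError, as with a list.
try:
    article[4]
except IndexError as exc:
    print("IndexError:", exc)
```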
Copying elements to a new element
Iterating over the elements and adding them one-by-one to a new element doesn’t work:
from lxml import etree  # needed for etree.Element; html is imported above

new_root = etree.Element("div")
for element in root.iter():
    print("element.tag", element.tag)
    new_root.append(element)  # append() moves the element out of its current parent
new_html = html.tostring(new_root, method="html", encoding="unicode", pretty_print=True)
print("new html:\n", new_html)  # prints <div><main></main><div></div><article></article></div>...
original_html = html.tostring(
    root, method="html", encoding="unicode", pretty_print=True
)  # gives <main></main>; the subelements have been removed
print("original html:\n", original_html)
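One way to copy elements one-by-one without emptying the source is to append a deep copy of each child instead of the child itself. This is a sketch using the standard library's copy module on a made-up document (DOC and the variable names are assumptions):

```python
import copy

from lxml import etree, html

# Hypothetical document for illustration.
DOC = "<html><body><main><h1>Title</h1><p>Text</p></main></body></html>"
main = html.fromstring(DOC).find(".//main")

new_root = etree.Element("div")
for child in main:
    # The copy is an independent element, so main keeps its own children.
    new_root.append(copy.deepcopy(child))

print([el.tag for el in new_root])  # ['h1', 'p']
print([el.tag for el in main])      # ['h1', 'p']
```

Each deep copy brings its own subtree along, so only the direct children need to be visited. The source is never modified, which also makes it safe to iterate over while appending.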
Simply appending root to new_root works. In the code below, new_root now contains the root element wrapped in div tags. Surprisingly, root is still in its original form:
In [27]: new_root = etree.Element("div")
In [29]: new_root.append(root)
In [34]: print(html.tostring(new_root, method="html", encoding="unicode", pretty_print=True))
# check to see if this has altered the root element:
In [36]: print(html.tostring(root, method="html", encoding="unicode", pretty_print=True)) # unaltered
Contrast that with what happens when I append root[0][0] to new_root:
In [40]: tree = html.parse(filepath)
In [41]: root = tree.find(".//main")
In [42]: new_root = etree.Element("div")
In [43]: new_root.append(root[0][0])
In [44]: print(html.tostring(root, method="html", encoding="unicode", pretty_print=True))
<main class="mx-1 my-4 grow">
<div class="max-w-4xl mx-auto">
</div>
</main>
In [45]: print(html.tostring(new_root, method="html", encoding="unicode", pretty_print=True))
<div>
<article class="prose dark:prose-invert">
<h1>Hello HTML Elements</h1>
<time class="text-xs" datetime="2025-12-24T13:03:37-08:00">December 24, 2025</time>
<p>Starting with a simple paragraph.</p>
<div class="text-gray-900 dark:text-gray-100">
<div>Tags:</div>
<ul>
<li><a href="/tags/scratch/">Scratch</a></li>
<li><a href="/tags/coding/">Coding</a></li>
</ul>
</div>
</article>
</div>
This actually makes sense. Here’s a code demo:
In [85]: tree = html.parse(filepath)
In [86]: main = tree.find('.//main')
In [87]: main.getparent()
Out[87]: <Element body at 0x7fb498045400> # main's parent is the doc's body tag
In [88]: new_fragment = etree.Element("div")
In [89]: new_fragment.append(main)
In [90]: main.getparent()
Out[90]: <Element div at 0x7fb491881b00> # main's parent is new_fragment (0x7fb491881b00)
In [91]: new_fragment[0]
Out[91]: <Element main at 0x7fb4916783c0>
In [92]: new_fragment[0].getparent()
Out[92]: <Element div at 0x7fb491881b00>
In [93]: new_fragment
Out[93]: <Element div at 0x7fb491881b00>
When I print main with:
In [104]: print(html.tostring(main, method="html", encoding="unicode", pretty_print=True))
It still works because main itself hasn’t been altered. It has just been given a new parent. If,
instead, main[0][0] (the article tag) is appended to new_fragment, the article is removed from
main, so printing main returns the element with its article tag missing.
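My understanding is that lxml elements are Python wrappers around C structures (lxml is built on
the libxml2 library). lxml maintains the parent-child relationships in those internal C data
structures, and an element can only sit in one position of one tree at a time. Calls to append
modify those structures, so appending an element that already has a parent moves it.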
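This move-on-append behaviour is specific to lxml. A small sketch comparing it with the standard library's xml.etree.ElementTree, where the same element can be referenced from more than one parent (move_child is a hypothetical helper written for this comparison):

```python
import xml.etree.ElementTree as ET

from lxml import etree


def move_child(api):
    """Build <a><c/></a> and <b/>, append c to b, and return both child counts."""
    a = api.Element("a")
    c = api.SubElement(a, "c")
    b = api.Element("b")
    b.append(c)
    return len(a), len(b)


stdlib_counts = move_child(ET)   # c ends up referenced by both a and b
lxml_counts = move_child(etree)  # c is moved out of a and into b

print(stdlib_counts)  # (1, 1)
print(lxml_counts)    # (0, 1)
```

So code that relies on append() leaving the source untouched behaves differently depending on which library built the elements.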