Zalgorithm

Characteristics of XML and HTML elements

Related: Parsing HTML files with lxml

At the moment I’m interested in HTML elements and the HTML structure of my Hugo blog. I’m making the assumption that what I’m learning applies to XML documents in general.

The Python lxml library is being used to parse HTML: Python lxml

XML elements versus element trees

“An Element is the main container object for the ElementTree API”. An element is also the root of its own subtree: “Elements are organized in an XML tree structure”. (By that reading, even a childless element like <br> is a tree with a single node.)

In the context of lxml, an “ElementTree is mainly a document wrapper around a tree with a root node.”

What’s significant for now is that an ElementTree is what you get back when you call the lxml parse() function, and an Element is what you get back when you call the find() function on that ElementTree.
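To make the distinction concrete, here is a minimal sketch. It assumes a small in-memory document (the hypothetical SAMPLE below) in place of the real file, so the types can be checked without touching the disk:

```python
from io import BytesIO

from lxml import html

# Hypothetical stand-in for the file on disk, so the example is self-contained.
SAMPLE = b"<html><body><main><p>Hello<br>world</p></main></body></html>"

tree = html.parse(BytesIO(SAMPLE))   # parse() accepts file-like objects too
main = tree.find(".//main")          # find() on the tree returns an element
doc_root = tree.getroot()            # the element the tree wraps

print(type(tree).__name__)
print(type(main).__name__)
print(doc_root.tag)

# A childless element is still an element; walking its subtree yields only itself.
br = main.find(".//br")
print(len(br), [el.tag for el in br.iter()])
```

The ElementTree is the document-level wrapper, and getroot() is the way back from it to the top-level element.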

A simple document for testing

I’ll use an actual post, Hello HTML elements. The post will change over time. Here’s the current state of its <main> element:

<main class="mx-1 my-4 grow">
  <div class="max-w-4xl mx-auto">
    <article class="prose dark:prose-invert">
      <h1>Hello HTML Elements</h1>

      <time class="text-xs" datetime="2025-12-24T13:03:37-08:00"
        >December 24, 2025</time
      >

      <p>Starting with a simple paragraph.</p>

      <div class="text-gray-900 dark:text-gray-100">
        <div>Tags:</div>
        <ul>
          <li><a href="/tags/scratch/">Scratch</a></li>
          <li><a href="/tags/coding/">Coding</a></li>
        </ul>
      </div>
    </article>
  </div>
</main>

For my reference, the path to the HTML on my computer is /home/scossar/zalgorithm/public/notes/hello-html-elements/index.html

Creating an ElementTree object with the lxml parse function

The lxml parse() function is used to parse from files and file-like objects:

from lxml import html

filepath = "/home/scossar/zalgorithm/public/notes/hello-html-elements/index.html"
tree = html.parse(filepath)  # <class 'lxml.etree._ElementTree'>

Getting an element with the etree or html find method

Note that lxml.etree._ElementTree has a similar API to the Python xml.etree.ElementTree module. Both provide the path-based search methods iterfind(), findall(), find(), and findtext(); lxml adds extras such as xpath():

root = tree.find(".//main")  # <class 'lxml.html.HtmlElement'>
print(root.tag)  # main
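The other search methods follow the same pattern. This is a rough sketch, assuming a trimmed-down, hypothetical version of the post's markup (SAMPLE) rather than the real file:

```python
from lxml import html

# Hypothetical, trimmed-down version of the post's markup.
SAMPLE = """
<html><body><main>
  <h1>Hello HTML Elements</h1>
  <ul>
    <li><a href="/tags/scratch/">Scratch</a></li>
    <li><a href="/tags/coding/">Coding</a></li>
  </ul>
</main></body></html>
"""

doc = html.document_fromstring(SAMPLE)
main = doc.find(".//main")

title = main.findtext(".//h1")                          # text of the first match
items = main.findall(".//li")                           # list of every match
hrefs = [a.get("href") for a in main.iterfind(".//a")]  # iterator over matches
missing = main.find(".//table")                         # None when nothing matches

print(title, len(items), hrefs, missing)
```

find() returning None for a missing element is worth remembering, since calling a method on the result without checking raises an AttributeError.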

Documentation:

Elements are organized in an XML tree structure

Iterating over all elements of the main element’s subtree:

for element in root.iter():
    print(element.tag)

# main
# div
# article
# h1
# time
# p
# div
# div
# ul
# li
# a
# li
# a
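The flat list of tags hides the nesting. One way to recover it is to count each element's ancestors. This is a sketch on a hypothetical SAMPLE document; the depth() helper is my own, not part of lxml:

```python
from lxml import html

# Hypothetical sample with the same general shape as the post.
SAMPLE = """
<html><body><main>
  <article>
    <h1>Hello</h1>
    <ul><li><a href="/a/">A</a></li><li><a href="/b/">B</a></li></ul>
  </article>
</main></body></html>
"""

main = html.document_fromstring(SAMPLE).find(".//main")


def depth(element, top):
    """Nesting level of element relative to top, counted via ancestors."""
    return len(list(element.iterancestors())) - len(list(top.iterancestors()))


outline = [(depth(el, main), el.tag) for el in main.iter()]
for level, tag in outline:
    print("  " * level + tag)

# Passing a tag name to iter() restricts the walk to matching elements.
link_tags = [el.tag for el in main.iter("a")]
print(link_tags)
```

iter() walks the subtree depth-first in document order, starting with the element it was called on, which is why main shows up first.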

An element is a kind of list:

In [11]: root[0]
Out[11]: <Element div at 0x7fb498c0d1d0>

In [12]: root[1]
--------------------------------------------------------------------------
IndexError: list index out of range

In [13]: root[0][0]
Out[13]: <Element article at 0x7fb498501e00>

Even though main is the first tag returned from the call to root.iter(), root[0] is the first div (the root’s first child), and root[0][0] is the article tag:

In [19]: root[0][0].tag
Out[19]: 'article'

In [20]: len(root[0][0])
Out[20]: 4
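The list-like behaviour covers direct children only, which is the difference from iter(). Here is a small sketch of that, using a hypothetical SAMPLE that mirrors the article's four children:

```python
from lxml import html

# Hypothetical sample: article has four direct children, like the real post.
SAMPLE = (
    "<html><body><main><div><article>"
    "<h1>T</h1><time>now</time><p>text</p><div>Tags</div>"
    "</article></div></main></body></html>"
)

main = html.document_fromstring(SAMPLE).find(".//main")
article = main[0][0]

child_tags = [child.tag for child in article]   # iterating yields direct children
last_tag = article[-1].tag                      # negative indexes work
middle_tags = [el.tag for el in article[1:3]]   # so do slices
subtree_size = len(list(main.iter()))           # iter() counts the whole subtree

print(len(main), child_tags, last_tag, middle_tags, subtree_size)
```

So len(main) is 1 even though the subtree under main holds seven elements.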

Copying elements to a new element

Iterating over the elements and adding them one-by-one to a new element doesn’t work:

from lxml import etree, html

new_root = etree.Element("div")

# root comes from the earlier tree.find(".//main") call.
# Each append() pulls the element out of root's subtree while root.iter() is still walking it.
for element in root.iter():
    print("element.tag", element.tag)
    new_root.append(element)

new_html = html.tostring(new_root, method="html", encoding="unicode", pretty_print=True)
print("new html:\n", new_html)  # prints <div><main></main><div></div><article></article></div>...

original_html = html.tostring(
    root, method="html", encoding="unicode", pretty_print=True
)
print("original html:\n", original_html)  # prints <main></main>; the subelements have been removed
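The flattening is easier to see on a tiny document. The sketch below assumes a hypothetical four-element SAMPLE and takes a snapshot of the walk first, so that the moves cannot disturb the iteration:

```python
from lxml import etree, html

# Hypothetical sample: main > div > two paragraphs.
SAMPLE = "<html><body><main><div><p>one</p><p>two</p></div></main></body></html>"

main = html.document_fromstring(SAMPLE).find(".//main")
new_root = etree.Element("div")

# Snapshot the walk up front: [main, div, p, p].
snapshot = list(main.iter())

for element in snapshot:
    # append() relocates the element, along with whatever is still inside it.
    new_root.append(element)

flat_tags = [child.tag for child in new_root]
print(flat_tags)                      # every element is now a direct child
print(len(main), len(new_root[1]))    # main and the inner div end up empty
```

Each append lifts one more element out of its parent, so the nesting is lost and the original main ends up with no children.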

Simply appending root to new_root works. In the code below, new_root now contains the root element wrapped in a div. Surprisingly, root is still in its original form:

In [27]: new_root = etree.Element("div")
In [29]: new_root.append(root)

In [34]: print(html.tostring(new_root, method="html", encoding="unicode",
        pretty_print=True))

# check to see if this has altered the root element:
In [36]: print(html.tostring(root, method="html", encoding="unicode", pretty_print=True))  # unaltered
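What does change is the element's old parent. Here is a hedged sketch on a hypothetical SAMPLE document, checking the body's children before and after the append:

```python
from lxml import etree, html

# Hypothetical sample: body has two children, main and a trailing paragraph.
SAMPLE = "<html><body><main><p>Hello</p></main><p>bye</p></body></html>"

doc = html.document_fromstring(SAMPLE)
body = doc.find("body")
main = body.find("main")

before = [child.tag for child in body]

wrapper = etree.Element("div")
wrapper.append(main)                 # main leaves body and joins wrapper

after = [child.tag for child in body]
serialized = html.tostring(main, encoding="unicode")

print(before, after)
print(serialized)                    # main's own subtree is untouched
```

The element looks unaltered because its subtree moved with it; it is the body that lost a child.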

Contrast that with what happens when I append root[0][0] to new_root:

In [40]: tree = html.parse(filepath)

In [41]: root = tree.find(".//main")

In [42]: new_root = etree.Element("div")

In [43]: new_root.append(root[0][0])

In [44]: print(html.tostring(root, method="html", encoding="unicode", pretty_print=True))
<main class="mx-1 my-4 grow">
      <div class="max-w-4xl mx-auto">
</div>
    </main>



In [45]: print(html.tostring(new_root, method="html", encoding="unicode", pretty_print=True))
<div>
<article class="prose dark:prose-invert">
  <h1>Hello HTML Elements</h1>


  <time class="text-xs" datetime="2025-12-24T13:03:37-08:00">December 24, 2025</time>

  <p>Starting with a simple paragraph.</p>

<div class="text-gray-900 dark:text-gray-100">
  <div>Tags:</div>
  <ul>
    <li><a href="/tags/scratch/">Scratch</a></li>
    <li><a href="/tags/coding/">Coding</a></li>
  </ul>
</div>

</article>
</div>

This actually makes sense. Here’s a code demo:

In [85]: tree = html.parse(filepath)

In [86]: main = tree.find('.//main')

In [87]: main.getparent()
Out[87]: <Element body at 0x7fb498045400>  # main's parent is the doc's body tag

In [88]: new_fragment = etree.Element("div")

In [89]: new_fragment.append(main)

In [90]: main.getparent()
Out[90]: <Element div at 0x7fb491881b00>  # main's parent is new_fragment (0x7fb491881b00)

In [91]: new_fragment[0]
Out[91]: <Element main at 0x7fb4916783c0>

In [92]: new_fragment[0].getparent()
Out[92]: <Element div at 0x7fb491881b00>

In [93]: new_fragment
Out[93]: <Element div at 0x7fb491881b00>

When I print main with:

In [104]: print(html.tostring(main, method="html", encoding="unicode", pretty_print=True))

It still works because main itself hasn’t been altered; it has just been given a new parent. If main[0][0] (the article tag) is appended to new_fragment instead, the article is removed from main, so printing main returns the element with its article tag missing.

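My understanding is that lxml elements are Python wrappers around C structures (lxml is built on the libxml2 C library). lxml maintains the parent-child relationships in those internal C data structures, where a node has exactly one parent, so a call to append() relinks the node instead of copying it.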
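If the goal is an actual copy, the standard library's copy.deepcopy() can be used on an element before appending it. This is a minimal sketch, assuming a hypothetical SAMPLE document; it is one way to do it, not necessarily the best fit for every case:

```python
import copy

from lxml import etree, html

# Hypothetical sample with the same main > div > article nesting as the post.
SAMPLE = (
    "<html><body><main><div><article><h1>Title</h1></article></div></main>"
    "</body></html>"
)

main = html.document_fromstring(SAMPLE).find(".//main")
article = main[0][0]

new_fragment = etree.Element("div")
new_fragment.append(copy.deepcopy(article))   # the copy is appended, the original stays put

print(article.getparent().tag)                # the original is still inside main's div
print(len(main[0]))
print(new_fragment.findtext(".//h1"))
print(new_fragment[0] is article)
```

The trade-off is that the two articles are now independent, so later changes to one are not reflected in the other.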
