Zalgorithm

Parsing HTML files with lxml

Related to Learning about parsing XML and HTML with lxml

lxml API documentation (the links in the tutorial point to http://effbot.org/zone/element-index.htm#documentation, which seems to be for sale): https://lxml.de/apidoc/ .

lxml GitHub: https://github.com/lxml/lxml .

Trial by error — this is not a tutorial.

Loading an HTML file

Loading an HTML file with html.parse:

In [187]: from lxml import html

In [188]: tree = html.parse("/home/scossar/zalgorithm/public/notes/learning-about-parsing-xml-and-html-with-lxml/index.html")

In [189]: docinfo = tree.docinfo

In [190]: docinfo
Out[190]: <lxml.etree.DocInfo at 0x7f49a05e03d0>

In [191]: docinfo.doctype
Out[191]: '<!DOCTYPE html>'
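A few other DocInfo attributes are handy too. A quick sketch (assuming lxml is installed), parsing from a string instead of a file so it's self-contained:

```python
from io import StringIO

from lxml import html

tree = html.parse(StringIO("<!DOCTYPE html><html><body><p>Hi</p></body></html>"))
info = tree.docinfo

print(info.doctype)    # '<!DOCTYPE html>'
print(info.root_name)  # 'html'
```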

What is the tree’s type?

In [192]: type(tree)
Out[192]: lxml.etree._ElementTree

lxml.etree._ElementTree documentation: https://lxml.de/apidoc/lxml.etree.html#lxml.etree._ElementTree .

Trying some ElementTree methods

ElementTree find and getroot methods

find(path, namespaces=None): finds the first top-level element with the given tag. Same as tree.getroot().find(path).

In [195]: body = tree.find('body')

In [196]: body.tag
Out[196]: 'body'

In [197]: type(body)
Out[197]: lxml.html.HtmlElement

In [199]: body = tree.getroot().find('body')

In [200]: type(body)
Out[200]: lxml.html.HtmlElement

ElementTree findall method

findall(path, namespaces=None): finds all elements matching the ElementPath expression. Same as getroot().findall(path). (Note: a warning that I received when trying this seemed to indicate that the root should be explicitly found first.)

The dumb attempt goes nowhere:

In [201]: tree.findall('p')
Out[201]: []

In [202]: tree.findall('doctype')
Out[202]: []

In [203]: tree.findall('html')
Out[203]: []

Have a look at the Python xml library’s supported XPath syntax: https://docs.python.org/3/library/xml.etree.elementtree.html#supported-xpath-syntax .

The XPath . syntax selects the current node:

In [204]: tree.findall('.')
Out[204]: [<Element html at 0x7f499af3c690>]

The XPath // syntax selects all subelements, on all levels beneath the current element.

Interestingly, this works:

In [210]: paragraphs = tree.findall('//p')

But it gives a warning that seems to indicate I’m understanding things correctly:

<ipython-input-210-8a3fbf13b8bb>:1: FutureWarning: This search incorrectly ignores the root element, and will be fixed in a future version.  If you rely on the current behaviour, change it to './/p'
  paragraphs = tree.findall('//p')

I think this is correct:

In [211]: paragraphs = tree.findall('.//p')

In [212]: len(paragraphs)
Out[212]: 21

But maybe this is better (more explicit):

In [213]: paragraphs = tree.find('.//body').findall('.//p')

In [214]: len(paragraphs)
Out[214]: 21

Or:

In [217]: paragraphs = tree.getroot().findall('.//p')

In [218]: len(paragraphs)
Out[218]: 21

ElementTree iter method

iter(tag=None, *tags): creates an iterator for the root element. Loops over all elements in the tree, in document order. Can be restricted to find only elements with specific tags.

In [219]: for element in tree.iter():
     ...:     print(element.tag)
     ...:
html
head
script
meta
meta
title
link
link
script
body
header
div
div
h1
a
main
div
article
h1
time
p
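The tag-restriction part of iter() is worth noting: passing tag names filters the walk. A minimal sketch (the HTML fragment here is made up):

```python
from lxml import html

fragment = html.fromstring(
    "<div><h1>Title</h1><p>One</p><h2>Sub</h2><p>Two</p></div>"
)

# With tag arguments, iter() yields only matching elements, in document order.
print([el.tag for el in fragment.iter("h1", "h2")])  # ['h1', 'h2']
```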

Trying some Element methods

Get the body element:

In [223]: body = tree.find('.//body')

In [225]: type(body)
Out[225]: lxml.html.HtmlElement

The relevant documentation for HtmlElement methods: https://lxml.de/apidoc/lxml.html.html#lxml.html.HtmlElement .

Confirming that the docs are relevant:

In [229]: body.getprevious().tag
Out[229]: 'head'

Find all H1 elements:

In [235]: headings = body.findall('.//h1')

In [236]: headings
Out[236]: [<Element h1 at 0x7f49a04cd4f0>, <Element h1 at 0x7f49a03cc820>]

In [237]: for heading in headings:
     ...:     print(heading.text)
     ...:
None
Learning About Parsing XML and HTML With lxml

That’s interesting. Hugo adds the title as an H1 element. I’ve also got an H1 element that wraps an anchor element in the site’s header. For what I’m trying to do, I should start from the main element:

In [238]: main = tree.find('.//main')

In [239]: h1 = main.findall('.//h1')

In [241]: h1[0].text
Out[241]: 'Learning About Parsing XML and HTML With lxml'

Getting the text content for each heading section

Note that | is the XPath union operator:

In [254]: headings = main.xpath('.//h1 | .//h2 | .//h3 | .//h4 | .//h5 | .//h6')
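A self-contained version of the same query (the fragment is made up):

```python
from lxml import html

doc = html.fromstring("<div><h1>A</h1><p>x</p><h2>B</h2></div>")

# xpath() returns a list; | unions the h1 and h2 node-sets.
print([el.tag for el in doc.xpath(".//h1 | .//h2")])  # ['h1', 'h2']
```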

Not sure how to get started, I asked Claude for a suggestion. It’s interesting that its attempt was so wrong: overly complex, and it only iterates through the siblings of the heading elements. It’s similar to the way it fails on some DSP problems.

from lxml import html

filepath = "/home/scossar/zalgorithm/public/notes/learning-about-parsing-xml-and-html-with-lxml/index.html"

tree = html.parse(filepath)

root = tree.find(".//main")


def extract_sections_by_heading(root):
    headings = root.xpath(".//h1 | .//h2 | .//h3 | .//h4 | .//h5 | .//h6")
    sections = []

    for i, heading in enumerate(headings):
        next_heading = headings[i + 1] if i + 1 < len(headings) else None

        section_text = []
        current = (
            heading.getnext()
        )  # getnext returns next sibling, not going to work here

        while current is not None:
            if current == next_heading:
                break

            text = current.text
            if text and text.strip():
                section_text.append(text)
            current = current.getnext()
        sections.append(" ".join(section_text))
    return sections

My first attempt. It needs some fine tuning, but the general idea is there:

from lxml import html
from lxml.html import HtmlElement

filepath = "/home/scossar/zalgorithm/public/notes/learning-about-parsing-xml-and-html-with-lxml/index.html"

tree = html.parse(filepath)
root = tree.find(".//main")

def extract_sections(root: HtmlElement):
    sections = []
    current_section = ""
    heading_tags = ("h1", "h2", "h3", "h4", "h5", "h6")
    for element in root.iter():
        if element.tag in heading_tags:
            if current_section:
                sections.append(current_section)

            current_section = ""

        text = element.text
        if text and text.strip():
            current_section += " " + text

    return sections

Assembling an HTML fragment for each heading section

Rough first attempt:

from lxml import etree, html
from lxml.html import HtmlElement


def extract_html_fragments(root: HtmlElement):
    sections = []
    current_fragment = None
    current_parent = None
    heading_tags = ("h1", "h2", "h3", "h4", "h5", "h6")

    for element in root.iter():
        if element.tag in heading_tags:
            if current_fragment is not None:
                serialized = html.tostring(current_fragment)
                sections.append(serialized)
                # sections.append(current_fragment)

            current_fragment = etree.Element(element.tag, element.attrib)
            current_fragment.text = element.text
            current_fragment.tail = None  # ???
            current_parent = current_fragment
        elif current_fragment is not None:
            print("element.tag", element.tag)
            new_elem = etree.Element(element.tag, element.attrib)
            new_elem.text = element.text
            new_elem.tail = element.tail
            current_parent.append(new_elem)

    if current_fragment is not None:
        serialized = html.tostring(current_fragment)
        sections.append(serialized)

    return sections

The call to etree.Element(element.tag, element.attrib) triggers an error if the element isn’t a “regular” element, e.g. a comment (<!-- raw HTML omitted -->):

TypeError: Argument must be bytes or unicode, got 'cython_function_or_method'

Comments have the type HtmlComment (a comment’s tag attribute is the etree.Comment function, not a string, hence the error message). This should fix it:

    for element in root.iter():
        if not isinstance(element, HtmlElement):
            continue
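A quick check (on a made-up fragment) that the isinstance filter behaves as expected: comments parsed by lxml.html are HtmlComment instances, not HtmlElement, so they get skipped.

```python
from lxml import html
from lxml.html import HtmlElement

div = html.fromstring("<div><!-- a comment --><p>Text</p></div>")

# iter() yields the comment node too; only the div and p pass the check.
for el in div.iter():
    print(type(el).__name__, isinstance(el, HtmlElement))
```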

The sections created from the call html.tostring(current_fragment) are bytes objects:

b'<h2 id="a-second-level-heading">A second level heading<p>A paragraph that </p>\n<em>falls</em> beneath the second level heading.</h2>'

That can be fixed with html.tostring(current_fragment).decode(), though passing encoding="unicode" makes tostring return a str directly. (Note: I think the method="html" argument also needs to be set here.)

Adding the pretty_print=True argument to the call to tostring reveals an obvious issue:

<h2 id="a-second-level-heading">A second level heading<p>A paragraph that </p>
<em>falls</em> beneath the second level heading.</h2>
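The mangling makes sense given how lxml stores mixed content: text that follows a child’s closing tag lives on that child’s .tail, not on the parent. A quick check on a made-up paragraph:

```python
from lxml import html

p = html.fromstring("<p>A paragraph that <em>falls</em> beneath the heading.</p>")
em = p.find("em")

print(repr(p.text))   # 'A paragraph that '
print(repr(em.text))  # 'falls'
print(repr(em.tail))  # ' beneath the heading.'
```

Copying only element.text when rebuilding the fragment misplaces these tails, which is why the paragraph text ends up split across the serialized heading.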

I think the problem is that I’m setting the heading elements as the fragment root element.

I don’t think I want the heading to be part of the fragment. This is getting closer:

def extract_html_fragments(root: HtmlElement):
    sections = []
    current_fragment = None
    current_parent = None
    heading_tags = ("h1", "h2", "h3", "h4", "h5", "h6")

    for element in root.iter():
        if not isinstance(element, HtmlElement):
            continue
        if element.tag in heading_tags:
            if current_fragment is not None:
                serialized = html.tostring(
                    current_fragment,
                    pretty_print=True,
                    method="html",
                    encoding="unicode",
                )
                sections.append(serialized)

            current_fragment = etree.Element("div")
            current_parent = current_fragment

        elif current_fragment is not None and current_parent is not None:
            new_elem = etree.Element(element.tag, element.attrib)
            new_elem.text = element.text
            new_elem.tail = element.tail
            current_parent.append(new_elem)

    if current_fragment is not None:
        serialized = html.tostring(
            current_fragment, pretty_print=True, method="html", encoding="unicode"
        )
        sections.append(serialized)

    return sections

Getting the heading for each section

The sections list should be a list of section dicts: {"html": "", "heading": ""}. Or even better (this works great with the HTML I’m using, as long as the article tag is set as the root; for future reference, with more inconsistent HTML I’ll likely have to use iter()):

def extract_html_fragments(root: HtmlElement):
    sections = []
    current_fragment = None
    current_heading = None
    heading_tags = ("h1", "h2", "h3", "h4", "h5", "h6")

    for element in root.iterchildren():
        if element.tag in heading_tags:
            if current_fragment is not None:
                serialized_fragment = serialize_element(current_fragment)
                serialized_heading = serialize_element(current_heading)
                sections.append(
                    {
                        "html_heading": serialized_heading,
                        "html_fragment": serialized_fragment,
                    }
                )

            current_heading = element
            current_fragment = etree.Element("div", {"class": "article-fragment"})

        elif current_fragment is not None:
            current_fragment.append(element)

    if current_fragment is not None:
        serialized_fragment = serialize_element(current_fragment)
        serialized_heading = serialize_element(current_heading)

        sections.append(
            {"html_heading": serialized_heading, "html_fragment": serialized_fragment}
        )

    return sections


filepath = (
    "/home/scossar/zalgorithm/public/notes/a-simple-document-for-testing/index.html"
)
tree = html.parse(filepath)
# print("type(tree)", type(tree))  # lxml.etree._ElementTree
root = tree.find(".//article")

sections = extract_html_fragments(root)
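For reference, the difference between iter() and iterchildren() that the loop above relies on (made-up fragment):

```python
from lxml import html

root = html.fromstring(
    "<div><h2>Heading</h2><section><p>Nested paragraph</p></section></div>"
)

# iter() walks every descendant (including the root itself);
# iterchildren() yields direct children only.
print([el.tag for el in root.iter()])          # ['div', 'h2', 'section', 'p']
print([el.tag for el in root.iterchildren()])  # ['h2', 'section']
```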

Getting the HTML and text content all in one go

The itertext() method makes it easy to get the text in a crude way, but for providing context for semantic-search embeddings, things like images, code blocks, and possibly links need special handling.
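What “crude” means here: itertext() yields every text and tail chunk in document order, so code blocks get flattened straight into the prose (made-up fragment):

```python
from lxml import html

section = html.fromstring(
    "<div><p>Some <em>emphasised</em> text.</p><pre><code>x = 1</code></pre></div>"
)

# No separation between prose and code in the joined result.
print("".join(section.itertext()))  # 'Some emphasised text.x = 1'
```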

Crude implementation for future reference:

def extract_html_fragments(root: HtmlElement):
    sections = []
    current_fragment = None
    current_heading = None
    current_text = ""
    heading_tags = ("h1", "h2", "h3", "h4", "h5", "h6")

    for element in root.iterchildren():
        if element.tag in heading_tags:
            if current_fragment is not None:
                serialized_fragment = serialize_element(current_fragment)
                serialized_heading = serialize_element(current_heading)
                sections.append(
                    {
                        "html_heading": serialized_heading,
                        "html_fragment": serialized_fragment,
                        "section_text": current_text,
                    }
                )

            current_heading = element
            current_fragment = etree.Element("div", {"class": "article-fragment"})
            current_text = f"{element.text} > "

        elif current_fragment is not None:
            for text in element.itertext():
                current_text += f" {text}"

            current_fragment.append(element)

    if current_fragment is not None:
        serialized_fragment = serialize_element(current_fragment)
        serialized_heading = serialize_element(current_heading)

        sections.append(
            {
                "html_heading": serialized_heading,
                "html_fragment": serialized_fragment,
                "section_text": current_text,
            }
        )

    return sections

Instead of for text in element.itertext():, I’ll pass the element to a function.

The characteristics of XML and HTML elements

This deserves its own note: Characteristics of XML and HTML elements

Tags: