Parsing HTML files with lxml
Related to Learning about parsing XML and HTML with lxml
lxml API documentation (the links in the tutorial point to http://effbot.org/zone/element-index.htm#documentation, a domain that now seems to be for sale): https://lxml.de/apidoc/
lxml GitHub: https://github.com/lxml/lxml
Trial and error; this is not a tutorial.
Loading an HTML file
Loading an HTML file with html.parse:
In [187]: from lxml import html
In [188]: tree = html.parse("/home/scossar/zalgorithm/public/notes/learning-about-parsing-xml-and-html-with-lxml/index.html")
In [189]: docinfo = tree.docinfo
In [190]: docinfo
Out[190]: <lxml.etree.DocInfo at 0x7f49a05e03d0>
In [191]: docinfo.doctype
Out[191]: '<!DOCTYPE html>'
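If I'm remembering the DocInfo API correctly, it exposes a few other useful attributes (encoding, root_name, URL). Continuing from the session above:

print(docinfo.encoding)   # e.g. 'UTF-8'
print(docinfo.root_name)  # 'html'
print(docinfo.URL)        # the path that was passed to html.parse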
What is the tree’s type?
In [192]: type(tree)
Out[192]: lxml.etree._ElementTree
lxml.etree._ElementTree documentation:
https://lxml.de/apidoc/lxml.etree.html#lxml.etree._ElementTree
Trying some ElementTree methods
ElementTree find and getroot methods
find(path, namespaces=None): finds the first toplevel element with the given tag. Same as
tree.getroot().find(path).
In [195]: body = tree.find('body')
In [196]: body.tag
Out[196]: 'body'
In [197]: type(body)
Out[197]: lxml.html.HtmlElement
In [199]: body = tree.getroot().find('body')
In [200]: type(body)
Out[200]: lxml.html.HtmlElement
ElementTree findall method
findall(path, namespaces=None): finds all elements matching the ElementPath expression. Same as
getroot().findall(path). (Note: a warning that I received when trying this seemed to indicate that
the root should be explicitly found first.)
The dumb attempt goes nowhere:
In [201]: tree.findall('p')
Out[201]: []
In [202]: tree.findall('doctype')
Out[202]: []
In [203]: tree.findall('html')
Out[203]: []
Have a look at the Python xml library’s supported XPath syntax: https://docs.python.org/3/library/xml.etree.elementtree.html#supported-xpath-syntax
The XPath . syntax selects the current node:
In [204]: tree.findall('.')
Out[204]: [<Element html at 0x7f499af3c690>]
The XPath // syntax selects all subelements, on all levels beneath the current element.
Interestingly, this works:
In [210]: paragraphs = tree.findall('//p')
But it gives a warning that seems to indicate I’m understanding things correctly:
<ipython-input-210-8a3fbf13b8bb>:1: FutureWarning: This search incorrectly ignores the root element, and will be fixed in a future version. If you rely on the current behaviour, change it to './/p'
paragraphs = tree.findall('//p')
I think this is correct:
In [211]: paragraphs = tree.findall('.//p')
In [212]: len(paragraphs)
Out[212]: 21
But maybe this is better (more explicit):
In [213]: paragraphs = tree.find('.//body').findall('.//p')
In [214]: len(paragraphs)
Out[214]: 21
Or:
In [217]: paragraphs = tree.getroot().findall('.//p')
In [218]: len(paragraphs)
Out[218]: 21
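A quick sanity check that the spellings find the same elements (findall returns the same element objects in document order, so list equality should hold):

assert tree.findall(".//p") == tree.getroot().findall(".//p")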
ElementTree iter method
iter(tag=None, *tags): creates an iterator for the root element. Loops over all elements in the
tree, in document order. Can be restricted to find only elements with specific tags.
In [219]: for element in tree.iter():
...: print(element.tag)
...:
html
head
script
meta
meta
title
link
link
script
body
header
div
div
h1
a
main
div
article
h1
time
p
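Per the signature above, iter() can also be restricted to specific tags:

for element in tree.iter("h1", "time"):
    print(element.tag, element.text)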
Trying some Element methods
Get the body element:
In [223]: body = tree.find('.//body')
In [225]: type(body)
Out[225]: lxml.html.HtmlElement
The relevant documentation for HtmlElement methods:
https://lxml.de/apidoc/lxml.html.html#lxml.html.HtmlElement
.
Confirming that the docs are relevant:
In [229]: body.getprevious().tag
Out[229]: 'head'
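The related navigation methods getparent() and getnext() work the same way; a quick check:

print(body.getparent().tag)              # 'html'
print(body.getprevious().getnext().tag)  # back to 'body'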
Find all H1 elements:
In [235]: headings = body.findall('.//h1')
In [236]: headings
Out[236]: [<Element h1 at 0x7f49a04cd4f0>, <Element h1 at 0x7f49a03cc820>]
In [237]: for heading in headings:
...: print(heading.text)
...:
None
Learning About Parsing XML and HTML With lxml
That’s interesting. Hugo adds the title as an H1 element. I’ve also got an H1 element that wraps an
anchor element in the site’s header. For what I’m trying to do, I should start from the main element:
In [238]: main = tree.find('.//main')
In [239]: h1 = main.findall('.//h1')
In [241]: h1[0].text
Out[241]: 'Learning About Parsing XML and HTML With lxml'
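Note that .text only returns the text before an element’s first child, so it would come back incomplete if the heading contained inline markup. lxml.html elements also have a text_content() method that gathers all descendant text:

print(h1[0].text_content())  # same result here, since this heading has no child elements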
Getting the text content for each heading section
Note that | is the XPath union operator:
In [254]: headings = main.xpath('.//h1 | .//h2 | .//h3 | .//h4 | .//h5 | .//h6')
Not sure how to get started, I asked Claude for a suggestion. It’s interesting that its attempt was so wrong: it’s overly complex, and because getnext() only walks siblings, it misses any content nested inside other elements. It’s similar to the way it fails on some DSP problems.
from lxml import html

filepath = "/home/scossar/zalgorithm/public/notes/learning-about-parsing-xml-and-html-with-lxml/index.html"
tree = html.parse(filepath)
root = tree.find(".//main")

def extract_sections_by_heading(root):
    headings = root.xpath(".//h1 | .//h2 | .//h3 | .//h4 | .//h5 | .//h6")
    sections = []
    for i, heading in enumerate(headings):
        next_heading = headings[i + 1] if i + 1 < len(headings) else None
        section_text = []
        current = (
            heading.getnext()
        )  # getnext returns next sibling, not going to work here
        while current is not None:
            if current == next_heading:
                break
            text = current.text
            if text and text.strip():
                section_text.append(text)
            current = current.getnext()
        sections.append(" ".join(section_text))
    return sections
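A small demonstration of why the getnext()/text approach comes up short: it never descends into an element, so nested text and tail text are dropped:

from lxml import html

p = html.fromstring("<div><h2>Title</h2><p>Some <em>nested</em> text.</p></div>").find("h2").getnext()
print(p.text)            # 'Some ' -- the <em> text and its tail are missed
print(p.text_content())  # 'Some nested text.'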
My first attempt. It needs some fine tuning, but the general idea is there:
from lxml import html
from lxml.html import HtmlElement

filepath = "/home/scossar/zalgorithm/public/notes/learning-about-parsing-xml-and-html-with-lxml/index.html"
tree = html.parse(filepath)
root = tree.find(".//main")

def extract_sections(root: HtmlElement):
    sections = []
    current_section = ""
    heading_tags = ("h1", "h2", "h3", "h4", "h5", "h6")
    for element in root.iter():
        if element.tag in heading_tags:
            if current_section:
                sections.append(current_section)
            current_section = ""
        text = element.text
        if text and text.strip():
            current_section += " " + text
    if current_section:  # don't drop the final section
        sections.append(current_section)
    return sections
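Part of the fine tuning: element.text only covers the text before an element’s first child, and tail text that follows inline children is missed entirely:

from lxml import html

p = html.fromstring("<p>A paragraph that <em>falls</em> beneath the heading.</p>")
print(p.text)             # 'A paragraph that '
print(p.find("em").tail)  # ' beneath the heading.'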
Assembling an HTML fragment for each heading section
Rough first attempt:
from lxml import etree, html
from lxml.html import HtmlElement

def extract_html_fragments(root: HtmlElement):
    sections = []
    current_fragment = None
    current_parent = None
    heading_tags = ("h1", "h2", "h3", "h4", "h5", "h6")
    for element in root.iter():
        if element.tag in heading_tags:
            if current_fragment is not None:
                serialized = html.tostring(current_fragment)
                sections.append(serialized)
                # sections.append(current_fragment)
            current_fragment = etree.Element(element.tag, element.attrib)
            current_fragment.text = element.text
            current_fragment.tail = None  # ???
            current_parent = current_fragment
        elif current_fragment is not None:
            print("element.tag", element.tag)
            new_elem = etree.Element(element.tag, element.attrib)
            new_elem.text = element.text
            new_elem.tail = element.tail
            current_parent.append(new_elem)
    if current_fragment is not None:
        serialized = html.tostring(current_fragment)
        sections.append(serialized)
    return sections
The call to etree.Element(element.tag, element.attrib) triggers an error if the element isn’t a
“regular” element, e.g. a comment: <!-- raw HTML omitted -->:
TypeError: Argument must be bytes or unicode, got 'cython_function_or_method'
Comments have the type HtmlComment, not HtmlElement. This should fix it:
for element in root.iter():
    if not isinstance(element, HtmlElement):
        continue
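The error makes sense once you look at what a comment’s tag actually is: as I understand it, for comments element.tag is the etree.Comment factory function rather than a string:

from lxml import etree, html

div = html.fromstring("<div><!-- raw HTML omitted --><p>text</p></div>")
comment = div[0]
print(type(comment))                 # <class 'lxml.html.HtmlComment'>
print(comment.tag is etree.Comment)  # True: a function, hence the TypeError above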
The sections created by the call to html.tostring(current_fragment) are bytes objects:
b'<h2 id="a-second-level-heading">A second level heading<p>A paragraph that </p>\n<em>falls</em> beneath the second level heading.</h2>'
That can be fixed with html.tostring(current_fragment).decode(). (Note: passing encoding="unicode" makes tostring return a str directly, so the decode() becomes unnecessary; method="html" should also be set, to get HTML rather than XML serialization.)
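A quick check that encoding="unicode" changes the return type:

from lxml import html

el = html.fromstring("<p>hello</p>")
print(type(html.tostring(el)))                                      # <class 'bytes'>
print(type(html.tostring(el, method="html", encoding="unicode")))  # <class 'str'>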
Adding the pretty_print=True argument to the call to tostring reveals an obvious issue:
<h2 id="a-second-level-heading">A second level heading<p>A paragraph that </p>
<em>falls</em> beneath the second level heading.</h2>
I think the problem is that I’m setting the heading elements as the fragment root element. (There’s also a flattening issue: root.iter() visits every descendant, and each one gets appended as a direct child of the fragment, which is why the <em> above ends up as a sibling of its former parent.)
I don’t think I want the heading to be part of the fragment. This is getting closer:
def extract_html_fragments(root: HtmlElement):
    sections = []
    current_fragment = None
    current_parent = None
    heading_tags = ("h1", "h2", "h3", "h4", "h5", "h6")
    for element in root.iter():
        if not isinstance(element, HtmlElement):
            continue
        if element.tag in heading_tags:
            if current_fragment is not None:
                serialized = html.tostring(
                    current_fragment,
                    pretty_print=True,
                    method="html",
                    encoding="unicode",
                )
                sections.append(serialized)
            current_fragment = etree.Element("div")
            current_parent = current_fragment
        elif current_fragment is not None and current_parent is not None:
            new_elem = etree.Element(element.tag, element.attrib)
            new_elem.text = element.text
            new_elem.tail = element.tail
            current_parent.append(new_elem)
    if current_fragment is not None:
        serialized = html.tostring(
            current_fragment, pretty_print=True, method="html", encoding="unicode"
        )
        sections.append(serialized)
    return sections
Getting the heading for each section
The sections list should be a list of section dicts: {"html": "", "heading": ""}. Or even better, the version below. (It works great with the HTML I’m using, as long as the article tag is set as the root; for future reference, with more inconsistent HTML I’ll likely have to use iter().)
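serialize_element isn’t defined in the note; a minimal version, based on the tostring call used earlier:

def serialize_element(element):
    return html.tostring(element, pretty_print=True, method="html", encoding="unicode")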
def extract_html_fragments(root: HtmlElement):
    sections = []
    current_fragment = None
    current_heading = None
    heading_tags = ("h1", "h2", "h3", "h4", "h5", "h6")
    for element in root.iterchildren():
        if element.tag in heading_tags:
            if current_fragment is not None:
                serialized_fragment = serialize_element(current_fragment)
                serialized_heading = serialize_element(current_heading)
                sections.append(
                    {
                        "html_heading": serialized_heading,
                        "html_fragment": serialized_fragment,
                    }
                )
            current_heading = element
            current_fragment = etree.Element("div", {"class": "article-fragment"})
        elif current_fragment is not None:
            current_fragment.append(element)
    if current_fragment is not None:
        serialized_fragment = serialize_element(current_fragment)
        serialized_heading = serialize_element(current_heading)
        sections.append(
            {"html_heading": serialized_heading, "html_fragment": serialized_fragment}
        )
    return sections

filepath = (
    "/home/scossar/zalgorithm/public/notes/a-simple-document-for-testing/index.html"
)
tree = html.parse(filepath)
# print("type(tree)", type(tree))  # lxml.etree._ElementTree
root = tree.find(".//article")
sections = extract_html_fragments(root)
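A quick look at the result, using the dict keys from above:

for section in sections:
    print(section["html_heading"].strip())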
Getting the HTML and text content all in one go
The itertext() method makes it easy to get the text in a crude way, but for providing context for semantic search embeddings, things like images, code blocks, and possibly links need special handling.
Crude implementation for future reference:
def extract_html_fragments(root: HtmlElement):
    sections = []
    current_fragment = None
    current_heading = None
    current_text = ""
    heading_tags = ("h1", "h2", "h3", "h4", "h5", "h6")
    for element in root.iterchildren():
        if element.tag in heading_tags:
            if current_fragment is not None:
                serialized_fragment = serialize_element(current_fragment)
                serialized_heading = serialize_element(current_heading)
                sections.append(
                    {
                        "html_heading": serialized_heading,
                        "html_fragment": serialized_fragment,
                        "section_text": current_text,
                    }
                )
            current_heading = element
            current_fragment = etree.Element("div", {"class": "article-fragment"})
            current_text = f"{element.text} > "
        elif current_fragment is not None:
            for text in element.itertext():
                current_text += f" {text}"
            current_fragment.append(element)
    if current_fragment is not None:
        serialized_fragment = serialize_element(current_fragment)
        serialized_heading = serialize_element(current_heading)
        sections.append(
            {
                "html_heading": serialized_heading,
                "html_fragment": serialized_fragment,
                "section_text": current_text,
            }
        )
    return sections
Instead of for text in element.itertext():, I’ll pass the element to a function.
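A rough sketch of what that function might look like (the tag handling here is just a guess at which elements will need special treatment):

from lxml.html import HtmlElement

def element_text(element: HtmlElement) -> str:
    # Hypothetical special cases for the images and code blocks mentioned above.
    if element.tag == "pre":
        return " [code block] "
    if element.tag == "img":
        return f" [image: {element.get('alt', '')}] "
    return " ".join(element.itertext())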
The characteristics of XML and HTML elements
This deserves its own note: Characteristics of XML and HTML elements