Zalgorithm

Debugging wrong HTML fragment being served

Related to [notes / todo#wrong-html-fragment-being-served-for-imaginary-numbers-link]

This might involve the template I’m using to render links. There are some details about it here: notes / Link render hooks.

The issue

The htmx-powered link to Real and complex numbers in space on notes / Introduction to imaginary numbers is displaying the fragment for notes / Introduction to complex numbers.

Run a local deploy

I don’t think the issue is stale data, but to be sure, run the local deploy script.

This is where the context that’s passed to the link template is assembled:

{{/* Try to get dbId if conditions are met */}}
{{ $dbId := "" }}
{{- if not $rel }}
  {{ $fragmentsMap := site.Data.fragments.sections }}
  {{ $url := $attrs.href }}
  {{ $fragmentData := index $fragmentsMap $url }}
  {{ $dbId = $fragmentData.db_id }}

  {{ if eq .Destination "/notes/real-and-complex-numbers-in-space/" }}
    {{ warnf "Destination: %v" .Destination }}
    {{ warnf "Text: %v" .Text }}
    {{ warnf "Page.Title: %v" .Page.Title }}
  {{ end }}

{{- end }}

The link destination is being found:

WARN  Destination: /notes/real-and-complex-numbers-in-space/
WARN  Text: Real and complex numbers in
space
WARN  Page.Title: Debugging wrong HTML fragment being served
WARN  Text: notes / Real and complex numbers in space
WARN  Page.Title: Introduction to Imaginary Numbers

Interestingly, I added the link that’s causing the issue to this page too. It’s also pulling in the wrong HTML fragment.

These should all be the same:

  {{ if eq .Destination "/notes/real-and-complex-numbers-in-space/" }}
    {{ warnf "Destination: %v" .Destination }}
    {{ warnf "$attrs.href: %v" $attrs.href }}
    {{ warnf "$url: %v" $url }}
  {{ end }}

Messy code, but they’re identical values:

WARN  Destination: /notes/real-and-complex-numbers-in-space/
WARN  $attrs.href: /notes/real-and-complex-numbers-in-space/
WARN  $url: /notes/real-and-complex-numbers-in-space/

Get the value for the key from the data JSON file

Use jq to check the value stored for the key "/notes/real-and-complex-numbers-in-space/":

❯  jq '.["/notes/real-and-complex-numbers-in-space/"]' data/fragments/sections.json
{
  "db_id": 699
}

Check what value’s being found for the key in the template

  {{ if eq .Destination "/notes/real-and-complex-numbers-in-space/" }}
    {{ warnf "Destination: %v" .Destination }}
    {{ warnf "$dbId: %v" $dbId }}
  {{ end }}

It’s the right value (it matches the db_id in the JSON file):

WARN  $dbId: 699

Check what’s stored for dbId 699 in the database

It’s looking like the issue isn’t on the Hugo end. In the SQLite database that stores the section fragments, the row with id=699 should belong to the Real and complex numbers in space page. Instead, it’s storing the fragment for Introduction to complex numbers.

Hmmm:

embeddings_generator on master [!] via v3.11.13 (.venv)
❯ ipython

In [1]: import sqlite3

In [2]: con = sqlite3.connect("./sqlite/sections.db")

In [3]: cur = con.cursor()

In [4]: cur.execute("SELECT * FROM sections WHERE id = 699")
Out[4]: <sqlite3.Cursor at 0x7ff3216c7cc0>

In [5]: cur.fetchone()
Out[5]:
(699,
 '01KCTBZF1BN6WGKKF96BA3KYN0-',
 '01KCTBZF1BN6WGKKF96BA3KYN0',
 '',
 '<h2><a href="/notes/introduction-to-complex-numbers/">Introduction to complex numbers</a></h2>',
 '<div class="article-fragment"><p>What is a complex number?</p>\n<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">\n      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/SP-YJe7Vldo?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>\n    </div>\n\n</div>',
 1768784646.1574106)

The database table:

CREATE TABLE IF NOT EXISTS sections (
    id INTEGER PRIMARY KEY,
    section_id TEXT NOT NULL UNIQUE,
    post_id TEXT NOT NULL,
    section_heading_slug TEXT NOT NULL,
    html_heading TEXT NOT NULL,
    html_fragment TEXT NOT NULL,
    updated_at REAL NOT NULL,
    UNIQUE(post_id, section_heading_slug)
);

The section_id and post_id columns in the entry that seems to be causing the issue look a little strange (identical except for the trailing - symbol), but that’s OK. The - would be followed by the heading slug if the link was to a section heading rather than to the first section of the post.
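Checking rows one at a time in ipython gets tedious, so here is a minimal sketch of a cross-check between the JSON map and the database. It assumes that the html_heading column always links back to the page the JSON key names (as it does in the row above) and that keys for section links carry a # suffix. find_mismatched_fragments is a hypothetical helper, not part of the existing code.

```python
import sqlite3


def find_mismatched_fragments(
    fragments: dict[str, dict], con: sqlite3.Connection
) -> list[tuple[str, int, str]]:
    """Return (url, db_id, html_heading) for rows whose heading links elsewhere."""
    mismatches = []
    for url, entry in fragments.items():
        db_id = entry["db_id"]
        row = con.execute(
            "SELECT html_heading FROM sections WHERE id = ?", (db_id,)
        ).fetchone()
        # A missing row is reported as a mismatch with an empty heading.
        html_heading = row[0] if row else ""
        # Assumption: the heading markup links back to the page the key names.
        page_path = url.split("#")[0]
        if f'href="{page_path}' not in html_heading:
            mismatches.append((url, db_id, html_heading))
    return mismatches
```

Called with the parsed contents of data/fragments/sections.json and a connection to ./sqlite/sections.db, it should list every key whose stored heading points at a different page. With the data shown above, that would include the entry for db_id 699.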

The cause of the issue(s)

What actually triggered the issue was my new deploy script and a lack of attention on my part. The script makes a call to:

.venv/bin/python main.py

That Python script traverses the built Hugo HTML files to extract heading sections that are saved to an SQLite database, and text content that’s used to generate embeddings for Chromadb. If an error happened in the Python code, it would get output to the console, then the deploy script would happily chug along, pushing corrupt data to the Docker container that’s running the API.

This should fix that problem:

#!/usr/bin/env bash
set -euo pipefail  # Exit on errors, unset variables, and failed pipeline stages

# Trap errors
trap 'echo "ERROR: Deployment failed at line $LINENO. Remote deployment aborted."; exit 1' ERR

# ...
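For set -e to abort the deploy, main.py has to exit with a non-zero status. An uncaught exception does that on its own, while a try/except that only prints the error does not. Here is a small sketch to confirm that behaviour. exit_code_of is a hypothetical helper, and the snippets it runs are made up for illustration.

```python
import subprocess
import sys


def exit_code_of(snippet: str) -> int:
    """Run a Python snippet in a child interpreter and return its exit status."""
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,  # keep the child's traceback off the console
        text=True,
    )
    return result.returncode


# An uncaught exception ends the interpreter with a non-zero status.
failing = exit_code_of("raise TypeError('expected str instance, NoneType found')")

# A clean run exits with status 0.
passing = exit_code_of("print('sections saved')")

# Catching the exception and carrying on hides the failure from bash.
swallowed = exit_code_of(
    "try:\n    raise TypeError('boom')\nexcept TypeError as e:\n    print(e)"
)

print(failing, passing, swallowed)  # 1 0 0
```

So the deploy script’s new error handling only helps as long as the Python code lets its exceptions propagate (or calls sys.exit with a non-zero value).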

What was triggering the errors in the Python code?

There were two issues:

The first was happening here:

    for child in root.iterchildren():
        if child.tag in heading_tags:
            if current_fragment is not None and has_text(current_fragment):
                current_fragment = fix_relative_links(current_fragment, rel_path)
                html_fragment = serialize(current_fragment, pretty_print=False)
                html_heading = serialize(current_heading_element, pretty_print=False)
                embeddings_text = section_texts(current_fragment, headings_path)
                sections.append(
                    {
                        "html_fragment": html_fragment,
                        "html_heading": html_heading,
                        "heading_id": heading_id,
                        "heading_href": heading_href,
                        "headings_path": headings_path,
                        "embeddings_text": embeddings_text,
                    }
                )

            current_fragment = etree.Element("div", {"class": "article-fragment"})
            heading_level = get_heading_level(child.tag)
            headings_path = (
                headings_path[:heading_level] + [child.text]
            )  # if child.text is None, an error will be triggered in heading_link function
            current_heading_element, heading_id, heading_href = heading_link(
                child, headings_path, rel_path
            )

It turns out that a markdown heading like ## $1$ is a number, with markup.goldmark.extensions.passthrough.delimiters set to this:

[markup.goldmark.extensions.passthrough.delimiters]
block = [['\[', '\]'], ['$$', '$$']]
inline = [['\(', '\)'], ['$', '$']]

results in the text of the heading element (child.text in the loop above) being a null value. So headings_path ends up looking like:

['foo', 'bar', 'baz', None]
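One possible explanation (an assumption, not confirmed against the built HTML): in lxml, .text only holds the text that comes before an element’s first child. If the rendered heading starts with a child element, such as a wrapper around the math, .text is None even though the heading has visible text. The sketch below uses the standard library’s xml.etree.ElementTree, which handles .text and .tail the same way, and the markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a heading whose first child is an element.
heading = ET.fromstring('<h2><span class="math">1</span> is a number</h2>')

# No text sits between <h2> and <span>, so .text is None.
print(heading.text)

# The rest of the heading lives in the child's .text and .tail.
print(repr(heading[0].text), repr(heading[0].tail))

# itertext() walks the whole element and collects every text piece.
full_text = "".join(heading.itertext())
print(full_text)  # 1 is a number
```

If that’s what is happening here, joining itertext() would be one way to recover the full heading text, at the cost of the math showing up as raw characters.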

The heading_link function then calls:

" > ".join(headings_path)

That doesn’t work:

In [7]:  arr
Out[7]: ['foo', 'bar', 'baz', None]

In [8]: " > ".join(arr)
--------------------------------------------------------------------------
TypeError                                Traceback (most recent call last)
Cell In[8], line 1
----> 1 " > ".join(arr)

TypeError: sequence item 3: expected str instance, NoneType found

Error handling could be added to the code, but really I just want the code to fail at this point.
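If the bare TypeError ever turns out to be too cryptic, a guard can still stop the run while saying which heading is the problem. This is only a sketch: join_headings_path is a hypothetical helper, not the existing heading_link function.

```python
def join_headings_path(headings_path: list) -> str:
    """Join heading texts with ' > ', failing loudly on a missing heading text."""
    for index, text in enumerate(headings_path):
        if not isinstance(text, str):
            # Still an error, so the deploy stops, but with a pointer to the cause.
            raise ValueError(
                f"heading {index} in {headings_path!r} has no text; "
                "check for a heading that starts with math or inline markup"
            )
    return " > ".join(headings_path)
```

The exception still propagates, so the script exits with a non-zero status just as it does with the TypeError.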

The second issue was related to some overly complex SQL:

    def save_to_sqlite(
        self,
        section_id: str,
        post_id: str,
        section_heading_slug: str,
        html_heading: str,
        html_fragment: str,
        updated_at: float,
    ) -> int:
        cursor = self.con.execute(
            """
INSERT INTO sections
    (section_id, post_id, section_heading_slug, html_heading, html_fragment, updated_at)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT(section_id) DO UPDATE SET
    html_heading = excluded.html_heading,
    html_fragment = excluded.html_fragment,
    updated_at = excluded.updated_at
RETURNING id
        """,
            (
                section_id,
                post_id,
                section_heading_slug,
                html_heading,
                html_fragment,
                updated_at,
            ),
        )

        return cursor.fetchone()[0]

The table has a unique constraint on section_id. (UNIQUE(post_id, section_heading_slug) is actually redundant due to the way that section_id is created. So is the call to CREATE UNIQUE INDEX..., since the UNIQUE constraint on the column already creates an index.)

    def create_sections_table(self, con: sqlite3.Connection) -> None:
        cur = con.cursor()
        cur.execute("""
CREATE TABLE IF NOT EXISTS sections (
    id INTEGER PRIMARY KEY,
    section_id TEXT NOT NULL UNIQUE,
    post_id TEXT NOT NULL,
    section_heading_slug TEXT NOT NULL,
    html_heading TEXT NOT NULL,
    html_fragment TEXT NOT NULL,
    updated_at REAL NOT NULL,
    UNIQUE(post_id, section_heading_slug)
);
        """)
        cur.execute(
            "CREATE UNIQUE INDEX IF NOT EXISTS idx_section_id ON sections(section_id);"
        )
        cur.execute("CREATE INDEX IF NOT EXISTS idx_post_id ON sections(post_id);")

        return None

The ON CONFLICT clause is (was) causing issues. If there’s a unique key violation, something has gone wrong and the script should exit. What went wrong today is that I copied a Hugo markdown file and gave it a new name. The copy kept the original’s id, which resulted in two posts on the site with identical id values in their front matter. There are a few ways that can happen, as the id is only added when a post is created, via the default.md file, like this:

+++
date = "{{ .Date }}"
id = "{{ .File.UniqueID }}"  # usually OK.
draft = true
title = "{{ replace .File.ContentBaseName "-" " " | title }}"
summary = """
"""
tags = []
+++
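To see how the upsert could hide a duplicate id, here is a minimal sketch against an in-memory database. The table is cut down to three columns, the values are made up, and the row is read back with a SELECT instead of RETURNING. upsert is a hypothetical stand-in for save_to_sqlite.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    """
CREATE TABLE sections (
    id INTEGER PRIMARY KEY,
    section_id TEXT NOT NULL UNIQUE,
    html_heading TEXT NOT NULL
)
"""
)

UPSERT = """
INSERT INTO sections (section_id, html_heading) VALUES (?, ?)
ON CONFLICT(section_id) DO UPDATE SET html_heading = excluded.html_heading
"""


def upsert(section_id: str, html_heading: str) -> int:
    """Insert or overwrite a row, then return its integer id."""
    con.execute(UPSERT, (section_id, html_heading))
    row = con.execute(
        "SELECT id FROM sections WHERE section_id = ?", (section_id,)
    ).fetchone()
    return row[0]


# Two different posts that share one front matter id (made-up values).
first_id = upsert("DUPLICATE-ID-", "Real and complex numbers in space")
second_id = upsert("DUPLICATE-ID-", "Introduction to complex numbers")

stored = con.execute("SELECT html_heading FROM sections").fetchall()
print(first_id == second_id)  # True: both posts map to one row
print(stored)  # [('Introduction to complex numbers',)]
```

No error is raised, both posts end up with the same database id, and the row holds whichever post was processed last. That matches what the row for id 699 looked like.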

A (possibly adequate) solution for now is to remove the ON CONFLICT clause from the INSERT statement and call the INSERT statement in a try block:

        try:
            cursor = self.con.execute(
                """
    INSERT INTO sections
        (section_id, post_id, section_heading_slug, html_heading, html_fragment, updated_at)
    VALUES (?, ?, ?, ?, ?, ?)
    RETURNING id
                """,
                (
                    section_id,
                    post_id,
                    section_heading_slug,
                    html_heading,
                    html_fragment,
                    updated_at,
                ),
            )
            return cursor.fetchone()[0]

        except sqlite3.IntegrityError:
            # print debugging info
            # ...
            raise  # re-raise the exception to stop execution
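A quick way to convince myself that a plain INSERT really does stop on a duplicate is the in-memory sketch below. The table is cut down to three columns, the values are made up, and insert_section is a hypothetical stand-in for save_to_sqlite that uses lastrowid rather than RETURNING.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    """
CREATE TABLE sections (
    id INTEGER PRIMARY KEY,
    section_id TEXT NOT NULL UNIQUE,
    html_heading TEXT NOT NULL
)
"""
)


def insert_section(section_id: str, html_heading: str) -> int:
    """Insert a row and return its id; a duplicate section_id raises."""
    try:
        cursor = con.execute(
            "INSERT INTO sections (section_id, html_heading) VALUES (?, ?)",
            (section_id, html_heading),
        )
        return cursor.lastrowid
    except sqlite3.IntegrityError as error:
        # Say which section collided, then let the exception propagate.
        print(f"duplicate section_id {section_id!r}: {error}")
        raise


insert_section("DUPLICATE-ID-", "Real and complex numbers in space")

try:
    insert_section("DUPLICATE-ID-", "Introduction to complex numbers")
except sqlite3.IntegrityError:
    duplicate_rejected = True  # in main.py this would end the run instead
else:
    duplicate_rejected = False

stored = con.execute("SELECT html_heading FROM sections").fetchall()
print(duplicate_rejected, stored)
```

The first post’s row is left untouched, and the uncaught exception in the real script gives the deploy script a non-zero exit status to react to.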