Zalgorithm

Save HTML fragments associated with embeddings to an SQLite database

I’m running a local application that breaks this site’s HTML files into HTML fragments based on each file’s heading structure, then generates vector embeddings for each fragment. The vector embeddings are saved to a Chroma database. The HTML headings and fragments are also saved to the database as metadata. Note that multiple embedding vectors can be generated for a single section of HTML, so HTML fragments and headings can be associated with multiple embeddings.
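The actual splitter walks the full heading hierarchy; as a simplified, stdlib-only sketch of the idea (not the real `extract_sections` implementation), cutting a page at each `<h2>` looks something like this:

```python
import re

def split_on_headings(html: str) -> list[dict]:
    """Split an HTML document into fragments, one per <h2> section.

    Simplified sketch: the real pipeline tracks the whole heading
    path; here we just cut the document at each <h2> tag.
    """
    # Find the start offset of every <h2> heading.
    starts = [m.start() for m in re.finditer(r"<h2[ >]", html)]
    sections = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(html)
        fragment = html[start:end].strip()
        heading_match = re.search(r"<h2[^>]*>(.*?)</h2>", fragment, re.S)
        sections.append({
            "html_heading": heading_match.group(1) if heading_match else "",
            "html_fragment": fragment,
        })
    return sections
```

Each returned dict pairs a section's heading with the fragment of HTML that will later be embedded and stored.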

HTML fragments have multiple possible uses on the site

The initial goal for generating the HTML fragments along with the embeddings was to display the fragments as results on the search page.

Reading Ted Nelson’s book Literary Machines has me thinking about other possible uses for the fragments. I don’t have Ted Nelson’s level of discipline in terms of application design. I just want to get the HTML fragments and headings into a database other than the Chroma db so that I can fool around with them.

The blog’s source of truth

My approach to the blog is local-first. Its ultimate source of truth is a directory of markdown files (~/zalgorithm/content/). Its secondary source of truth is the directory of HTML pages that are built from the markdown files during the Hugo build process. (Related: notes / How the Hugo build process processes footnotes.)

Using a file system instead of a database as the source of truth adds some complexity and risk of errors when attempting to sync files with one or more databases.

The best approach I’ve come up with is to add an “id” field to each markdown file. This is done through the /archetypes/default.md file:

+++
date = "{{ .Date }}"
id = "{{ .File.UniqueID }}"
draft = true
title = "{{ replace .File.ContentBaseName "-" " " | title }}"
summary = """
"""
tags = []
+++

What needs to be associated?

With the markdown files as the source of truth, each markdown file’s “id” needs to be used to generate an embedding_id for each embedding, and a fragment_id for each HTML fragment.

embedding_ids allow for calling the Chroma collection.upsert method so that embeddings can be updated in the Chroma database. This is implemented now:

for index, text in enumerate(embeddings_text):
    embedding_id = f"{post_id}-{index}-{section_heading_slug}"
    # ...
    self.collection.upsert(
        ids=embedding_id, metadatas=metadatas, documents=text
    )
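To make the id scheme concrete, here is what it produces for a section that was chunked into three embedding texts (the post id and slug values are hypothetical):

```python
post_id = "f1a2b3"  # hypothetical Hugo .File.UniqueID
section_heading_slug = "keeping-everything-in-sync"
embeddings_text = ["chunk one", "chunk two", "chunk three"]

embedding_ids = [
    f"{post_id}-{index}-{section_heading_slug}"
    for index, _ in enumerate(embeddings_text)
]
# Every chunk of a section gets its own id, but all of them share
# the same post id and section slug.
```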

Currently HTML headings and fragments are saved as Chroma metadata, so there’s no issue associating HTML with search results:

    metadatas = {
        "page_title": page_heading,
        "section_heading": section_heading,
        "html_heading": html_heading,
        "html_fragment": html_fragment,
        "updated_at": file_mtime,
    }

But, as each HTML section can be associated with multiple embeddings, I can’t simply use the embedding_id as the key in a separate (SQLite) database.

I’ll try creating section_ids. That will change the Chroma metadatas to:

    metadatas = {
        "page_title": page_heading,
        "section_heading": section_heading,
        "section_id": section_id,
        "updated_at": file_mtime,
    }

Generating ids for each HTML section

The solution (for now) is to use SQLite row ids to identify each section:

    def create_sections_table(self, con: sqlite3.Connection) -> None:
        cur = con.cursor()
        cur.execute("""
CREATE TABLE IF NOT EXISTS sections (
    id INTEGER PRIMARY KEY,  -- alias for SQLite's auto-assigned rowid
    section_id TEXT NOT NULL UNIQUE,
    post_id TEXT NOT NULL,
    section_heading_slug TEXT NOT NULL,
    html_heading TEXT NOT NULL,
    html_fragment TEXT NOT NULL,
    updated_at REAL NOT NULL,
    UNIQUE(post_id, section_heading_slug)
);
        """)
        cur.execute(
            "CREATE UNIQUE INDEX IF NOT EXISTS idx_section_id ON sections(section_id);"
        )
        cur.execute("CREATE INDEX IF NOT EXISTS idx_post_id ON sections(post_id);")

        return None

The section HTML is inserted into the SQLite database prior to generating the embeddings and the row’s ID is returned:

    def save_to_sqlite(
        self,
        section_id: str,
        post_id: str,
        section_heading_slug: str,
        html_heading: str,
        html_fragment: str,
        updated_at: float,
    ) -> int:
        cursor = self.con.execute(
            """
INSERT INTO sections
    (section_id, post_id, section_heading_slug, html_heading, html_fragment, updated_at)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT(section_id) DO UPDATE SET
    html_heading = excluded.html_heading,
    html_fragment = excluded.html_fragment,
    updated_at = excluded.updated_at
RETURNING id
        """,
            (
                section_id,
                post_id,
                section_heading_slug,
                html_heading,
                html_fragment,
                updated_at,
            ),
        )

        return cursor.fetchone()[0]
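The insert-or-update-and-RETURNING pattern above can be exercised on its own. A sketch against an in-memory database (schema copied from create_sections_table) shows that re-saving the same section returns the same rowid while the fragment gets updated:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE sections (
    id INTEGER PRIMARY KEY,
    section_id TEXT NOT NULL UNIQUE,
    post_id TEXT NOT NULL,
    section_heading_slug TEXT NOT NULL,
    html_heading TEXT NOT NULL,
    html_fragment TEXT NOT NULL,
    updated_at REAL NOT NULL,
    UNIQUE(post_id, section_heading_slug)
)
""")

def save_section(section_id, post_id, slug, heading, fragment, mtime):
    # Same upsert as save_to_sqlite: conflict on section_id updates
    # the HTML columns in place and still returns the row's id.
    cur = con.execute(
        """
        INSERT INTO sections
            (section_id, post_id, section_heading_slug,
             html_heading, html_fragment, updated_at)
        VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT(section_id) DO UPDATE SET
            html_heading = excluded.html_heading,
            html_fragment = excluded.html_fragment,
            updated_at = excluded.updated_at
        RETURNING id
        """,
        (section_id, post_id, slug, heading, fragment, mtime),
    )
    return cur.fetchone()[0]

first = save_section("p1-intro", "p1", "intro", "<h2>Intro</h2>", "<p>v1</p>", 1.0)
second = save_section("p1-intro", "p1", "intro", "<h2>Intro</h2>", "<p>v2</p>", 2.0)
# first == second: the rowid is stable across re-runs.
```

(RETURNING requires SQLite 3.35 or newer, which Python's bundled sqlite3 has shipped with for a while.)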

In the generate_embedding method, set the row’s id (named db_id for now) in the metadatas dict:

    def generate_embedding(self, filepath: Path) -> None:
        html_path, relative_path = self.get_file_paths(filepath)
        if not html_path or not relative_path:
            return None

        print(f"Processing {html_path}")

        post = frontmatter.load(str(filepath))
        file_mtime = filepath.stat().st_mtime
        title = str(post.get("title"))
        post_id = post.get("id", None)
        if not post_id:
            print(
                f"The post '{title}' is missing an 'id' field. Skipping generating an embedding."
            )
            return None

        sections = extract_sections(html_path, relative_path)

        for section in sections:
            html_fragment = section["html_fragment"]
            html_heading = section["html_heading"]
            page_heading = section["headings_path"][0]
            section_heading = section["headings_path"][-1]
            section_heading_slug = self._slugify(section_heading)
            embeddings_text = section["embeddings_text"]
            section_id = f"{post_id}-{section_heading_slug}"

            db_id = self.save_to_sqlite(
                section_id=section_id,
                post_id=str(post_id),  # it's a string!
                section_heading_slug=section_heading_slug,
                html_heading=html_heading,
                html_fragment=html_fragment,
                updated_at=file_mtime,
            )
            self.con.commit()  # do I also need to be closing the connection? (no?)

            for index, text in enumerate(embeddings_text):
                embedding_id = f"{post_id}-{index}-{section_heading_slug}"

                metadatas = {
                    "page_title": page_heading,
                    "section_heading": section_heading,
                    "db_id": db_id,
                    "updated_at": file_mtime,
                }

                self.collection.upsert(
                    ids=embedding_id, metadatas=metadatas, documents=text
                )

Keeping everything in sync

The current setup could break if headings are edited in the markdown files (the source of truth). The current solution for that is to just nuke the full sections.db and chroma.db databases each time the embeddings are generated. I’ll come up with something more sophisticated.
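One incremental alternative I might try: before re-indexing a post, delete its existing rows and collect the removed section_ids so the matching Chroma embeddings could be deleted too. A hypothetical helper (not part of the current code):

```python
import sqlite3

def purge_post_sections(con: sqlite3.Connection, post_id: str) -> list[str]:
    """Delete all stored sections for a post before re-indexing it.

    Hypothetical helper: returns the removed section_ids so the
    corresponding Chroma embeddings could also be cleaned up.
    """
    rows = con.execute(
        "SELECT section_id FROM sections WHERE post_id = ?", (post_id,)
    ).fetchall()
    con.execute("DELETE FROM sections WHERE post_id = ?", (post_id,))
    con.commit()
    return [section_id for (section_id,) in rows]
```

Stale embeddings would still need to be found and removed on the Chroma side, so this is only half of a real sync story.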

Using the sections.db data in the search application

This change actually simplified the application code (although probably made it somewhat less efficient):

@app.post("/query", response_class=HTMLResponse)
async def query_collection(
    query: Annotated[str, Form()],
    db: aiosqlite.Connection = Depends(get_db_connection),
    collection: AsyncCollection = Depends(get_chroma_collection),
):
    try:
        results = await collection.query(query_texts=[query], n_results=5)
        seen_sections = set()
        html_sections = []

        if not results["metadatas"]:
            return ""  # for now

        for i in range(len(results["ids"][0])):
            metadata = results["metadatas"][0][i]

            section_heading = metadata.get("section_heading", "")
            if section_heading in seen_sections:
                continue

            seen_sections.add(section_heading)

            row_id = metadata.get("db_id")
            if row_id is None:
                continue
            cursor = await db.execute(
                "SELECT html_heading, html_fragment FROM sections WHERE id = ?",
                (row_id,),
            )
            row = await cursor.fetchone()
            if row:
                section_html = "".join(row)
                html_sections.append(section_html)

        response_html = "".join(html_sections)
        return response_html

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
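The loop above runs one SELECT per search hit. A possible refinement is to collect the db_ids and fetch them in a single IN query. A sketch, written against the synchronous sqlite3 API (the app itself uses aiosqlite, but the SQL is the same):

```python
import sqlite3

def fetch_sections(con: sqlite3.Connection, row_ids: list[int]) -> dict[int, str]:
    """Fetch several sections in one query instead of one query per hit.

    Sketch of a batching refinement; placeholders are generated so the
    values stay parameterized.
    """
    if not row_ids:
        return {}
    placeholders = ",".join("?" * len(row_ids))
    rows = con.execute(
        f"SELECT id, html_heading, html_fragment FROM sections"
        f" WHERE id IN ({placeholders})",
        row_ids,
    ).fetchall()
    # Map row id -> concatenated heading + fragment, ready for rendering.
    return {row_id: heading + fragment for row_id, heading, fragment in rows}
```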