Zalgorithm

Embeddings visualization with UMAP

This is not a math tutorial. See Why am I writing about math?

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualization similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data

  1. The data is uniformly distributed on [a] Riemannian manifold;
  2. The Riemannian metric is locally constant (or can be approximated as such);
  3. The manifold is locally connected.

From these assumptions…1

From those assumptions I realize I’m in over my head. Backtracking a bit…

This gets clearer (in a non-mathematical way) further down in: An attempt at clarifying UMAP’s assumptions.

What is Euclidean space?

“Euclidean space is the fundamental space of geometry, intended to represent physical space. Originally, in Euclid’s Elements, it was the three-dimensional space of Euclidean geometry, but in modern mathematics there are Euclidean spaces of any positive integer dimension n, which are called Euclidean n-spaces….”2

Examples:


For details about typing math symbols in markdown files, see notes / Typing math symbols with LaTeX.


“Euclidean space is the fundamental, ‘flat,’ multi-dimensional space of classical geometry where standard rules apply, allowing us to measure distances and angles with the familiar Pythagorean theorem…”3

What is a manifold?

“In mathematics, a manifold is a topological space that locally resembles Euclidean space near each point. More precisely, an n-dimensional manifold, or n-manifold…is a topological space with the property that each point has a neighborhood that is homeomorphic to an open subset of n-dimensional Euclidean space.”4

What is a space in the context of math?

A space is a set of points.

What is a topological space?

A topology is a set of neighbourhoods (for each point in a space(?)) that allows some concept of “closeness” to be defined.(?)

What is a neighbourhood?

“…a neighbourhood of a point is a set of points containing that point where one can move some amount in any direction away from that point without leaving the set.”5

In the image below: a set VV in the plane is a neighbourhood of a point pp if a small disc around pp is contained in VV. The small disc around pp is an open set UU.6

By Oleg Alexandrov - From the source code at File:Neighborhood illust1.png, heavily cleaned-up, Public Domain, 




https://commons.wikimedia.org/w/index.php?curid=14630911
By Oleg Alexandrov - From the source code at File:Neighborhood illust1.png, heavily cleaned-up, Public Domain, https://commons.wikimedia.org/w/index.php?curid=14630911

What is a Riemannian manifold?

I’ll start with “[a]ny smooth surface in three-dimensional Euclidean space is a Riemannian manifold with a Riemannian metric coming from the way it sits inside the ambient space.”7

The ambient space of a line is the line of real numbers, the ambient space of a point in 2D space is Euclidean 2D space?

A Riemannian manifold is a space that locally resembles Euclidean space.8 Remember that a manifold is a topological space that locally resembles Euclidean space.

What is a Riemannian metric?

A Riemannian manifold has a Riemannian metric — a way to measure lengths, angles, (and volumes(?)) across the space. Getting back to the task at hand (understanding UMAP (uniform manifold approximation and projection)), a space being a Riemannian manifold implies that it’s meaningful to talk about distances between points — distances between vector embeddings.

The second assumption about the data that’s made by UMAP is “the Riemannian metric is locally consistent (or can be approximated as such).”9

An attempt at clarifying UMAP’s assumptions

A manifold is a shape that locally looks flat, even if it’s curved globally. (The earth is a sphere by locally we treat it as flat.)

There’s a hypothesis in ML (the “manifold hypothesis” ) that high dimensional data like language actually lives on a lower dimensional manifold.

That means that this site’s 384 dimensional embeddings aren’t spread uniformly across all the dimensions. They live (mostly(?)) on a lower-dimensional surface (a manifold). That’s why UMAP can compress them down to 2D without losing too much information.

An explanation (simplification) of UMAP’s assumptions:

UMAP is assuming the embeddings live on a smooth, lower-dimensional shape that can be flattened to 2D whild preserving the neighborhood structure.

Visualizing this site’s vector embeddings

Things are getting too abstract. I need something to hold on to.

This site’s search functionality uses a Chroma database. The process of generating the database’s embedding vectors isn’t yet fully documented (todo: document how embeddings are generated.)

Here’s an overview of the database:

import chromadb

chroma_client = chromadb.PersistentClient()
collection = chroma_client.get_collection(name="zalgorithm")

results = collection.get(include=["embeddings", "metadatas", "documents"])

print(results.keys())
# dict_keys(['ids', 'embeddings', 'documents', 'uris', 'included', 'data', 'metadatas'])

print(len(results["metadatas"]))
# 918
print(results["metadatas"][0].keys())
# dict_keys(['db_id', 'page_title', 'section_heading', 'updated_at'])

print(len(results["embeddings"]))
# 918
print(len(results["embeddings"][0]))
# 384

print(results["metadatas"][0]["section_heading"])
# SimCSE: Simple Constrative Learning of Sentence Embeddings (paper)
print(results["embeddings"][0][:10])
# [-0.00372653 -0.05460251  0.03505962  0.04954087  0.06700463  0.07712093
#   0.02034441  0.05408454  0.05195715 -0.01918259]

print(results["metadatas"][1]["section_heading"])
# The Meaning of Meaning (book)
print(results["embeddings"][1][:10])
# [ 0.00272638  0.09517232 -0.05058211  0.04337437 -0.01964302  0.00047342
#   0.04571834  0.08724379  0.04383722  0.03747305]

So there are (currently) 918 entries in the database. Each entry has an id, embedding, metadata, and more.

The embeddings are 384 dimension vectors. (Generated with Chroma DB’s default embedding model, the all-miniLM-L6-v2 model from the Sentence Transformers library.)

Embedding metadata is data that I’ve added to the database. For example, the metadata "db_id" key ("results["metadatas"][<row_number>]["db_id"]) makes it possible to get an embedding’s associated HTML fragment from an SQLite database.

The HTML headings associated with each embedding are in the metadata "section_heading" field. A quick glance shows that the embedding associated with the heading “SimCSE: Simple Constrative Learning of Sentence Embeddings (paper) " is different from the embedding associated with the heading “The Meaning of Meaning (book)”. This is as expected.

Installing lmcinnes/umap

It’s got a few dependencies: https://github.com/lmcinnes/umap:

Python 3.6 or greater

PyNNDescent

PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors. It provides a python implementation of Nearest Neighbor Descent for k-neighbor-graph construction and approximate nearest neighbor search…10

GitHub: https://github.com/lmcinnes/pynndescent

First visualization attempt

Just getting something to appear:

import chromadb
import numpy as np
import matplotlib.pyplot as plt
import umap
import random

chroma_client = chromadb.PersistentClient()
collection = chroma_client.get_collection(name="zalgorithm")

results = collection.get(include=["embeddings", "metadatas", "documents"])

embeddings = np.array(results["embeddings"])
metadatas = results["metadatas"]

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="cosine")

embedding_2d = reducer.fit_transform(embeddings)

plt.scatter(embedding_2d[:, 0], embedding_2d[:, 1], s=140, alpha=0.75)

for i, metadata in enumerate(metadatas):
    print(f"{i}: {metadata['page_title']}: {metadata['section_heading']}")
    label = str(i)
    x = embedding_2d[i, 0]
    y = embedding_2d[i, 1]
    plt.text(x, y, label, ha="center", va="center", color="white", fontsize=6)

plt.show()

UMAP test one
UMAP test one

At that zoom level an overall pattern can be seen, but the results are illegible. Fortunately, PyPlot windows can be zoomed in on:

UMAP zoom selection
UMAP zoom selection

UMAP zoomed
UMAP zoomed

From there I can see:

The three entries at the bottom right of the plot:

There’s a definite order to the results. My guess is that section headings are having a strong effect on the vectors. The text that’s used to generate the vectors has the full heading path of each section prepended to it, so there’s some heading context that’s not being shown in this (loose) analysis of the data.

What’s in the cluster that’s closest to the center?

UMAP plot center selection
UMAP plot center selection

UMAP plot center
UMAP plot center

A bunch of embeddings related to the cp command. There aren’t a lot of posts like that on the site:

Vectors (remember these are 2D representations of vectors) in the far left, y ~= 0 part of the plot are all related to XML.

Vectors in the far right y ~= 0 are all related to Newton’s method.

Just to the left of the Newton’s method vectors are vectors related to Halley’s method.

The cluster at the far bottom left of the plot is made up of embeddings related to an issue with video rendering on the Omarchy Linux distribution. They form an oddly straight line:

UMAP: a cluster that’s in a straight line
UMAP: a cluster that’s in a straight line

Literary Machines cluster

Maybe the tightest cluster in the plot is made of embeddings related to the book Literary Machines. The embeddings are mostly quotes from the book:

Umap: Literary Machines
Umap: Literary Machines

UMAP: Literary Machines, zoomed
UMAP: Literary Machines, zoomed

Using vector embeddings to reveal the structure of a group of documents

What’s interesting about the Literary Machines cluster is that Ted Nelson is writing about ways presenting the structure of information to the reader. Possibly something along those lines can be done with vector embeddings.

References

McInnes, Leland. “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.” https://umap-learn.readthedocs.io/en/latest/ .

Wikipedia contributors. “Euclidean space.” Accessed: January 14, 2026. https://en.wikipedia.org/wiki/Euclidean_space .

Wikipedia contributors. “Manifold.” Accessed: January 14, 2026. https://en.wikipedia.org/wiki/Manifold .

Wikipedia contributors. “Neighbourhood (mathematics).” Accessed: January 14, 2026. https://en.wikipedia.org/wiki/Neighbourhood_(mathematics) .

Wikipedia contributors. “Riemannian manifold.” Accessed: January 14, 2026. https://en.wikipedia.org/wiki/Riemannian_manifold .


  1. Leland McInnes, “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction,” https://umap-learn.readthedocs.io/en/latest/↩︎

  2. Wikipedia contributors, “Euclidean space,” Accessed: January 14, 2026, https://en.wikipedia.org/wiki/Euclidean_space↩︎

  3. Google AI, January 13, 2026. ↩︎

  4. Wikipedia contributors, “Manifold.” Accessed: January 14, 2026, https://en.wikipedia.org/wiki/Manifold↩︎

  5. Wikipedia contributors, “Neighbourhood (mathematics),” Accessed: January 14, 2026, https://en.wikipedia.org/wiki/Neighbourhood_(mathematics)↩︎

  6. Wikipedia contributors, “Neighbourhood (mathematics),” Accessed: January 14, 2026, https://en.wikipedia.org/wiki/Neighbourhood_(mathematics)↩︎

  7. Wikipedia contributors, “Riemannian manifold,” Accessed: January 14, 2026, https://en.wikipedia.org/wiki/Riemannian_manifold↩︎

  8. Google AI, January 13, 2026. ↩︎

  9. Leland McInnes, “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction,” https://umap-learn.readthedocs.io/en/latest/↩︎

  10. Leland McInnes, “PyNNDescent (GitHub),” https://github.com/lmcinnes/pynndescent↩︎