
Calculating per-document embeddings from chunk embeddings

Can the embeddings that have been generated for chunks of text be averaged into embeddings that represent entire documents?

Related to notes/Embeddings visualization with UMAP, embedding vectors have been generated for chunks of this site's text. Each chunk has some metadata associated with it:

print(len(results["metadatas"]))
# 918  # 1 metadata dict per embedding

print(results["metadatas"][0].keys())
# dict_keys(['db_id', 'page_title', 'section_heading', 'updated_at'])

Based on the operations that can be performed on vectors, an average embedding can be generated for each 'page_title' by summing the embeddings that share that title and dividing the result by the number of embeddings for the title.
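
A minimal sketch of that averaging step, assuming `results` also contains an "embeddings" list aligned with "metadatas" (for example, from a Chroma `collection.get(include=["embeddings", "metadatas"])` call):

import numpy as np
from collections import defaultdict

# Group chunk embeddings by page title.
chunks_by_title = defaultdict(list)
for embedding, metadata in zip(results["embeddings"], results["metadatas"]):
    chunks_by_title[metadata["page_title"]].append(embedding)

# Average each group: sum the vectors, divide by the count (np.mean does both).
page_embeddings = {
    title: np.mean(np.array(vectors), axis=0)
    for title, vectors in chunks_by_title.items()
}

print(len(page_embeddings))
# one averaged embedding per page_title

Each value in `page_embeddings` has the same dimensionality as the chunk embeddings, so it can be stored, compared, or visualized alongside them.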