Calculating per-document embeddings from chunk embeddings
Can the embeddings that have been generated for chunks of text be averaged into embeddings that represent entire documents?
Related to notes / Embeddings visualization with UMAP, embedding vectors have been generated for chunks of this site’s text. Each chunk has some metadata associated with it:
print(len(results["metadatas"]))
# 918 # 1 metadata dict per embedding
print(results["metadatas"][0].keys())
# dict_keys(['db_id', 'page_title', 'section_heading', 'updated_at'])
Based on the operations that can be performed on vectors, an average embedding can be generated by summing the embeddings for each 'page_title' and dividing the result by the number of embeddings that exist for that title.
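A minimal sketch of that averaging, assuming a Chroma-style results dict where results["embeddings"] and results["metadatas"] are parallel lists (the "embeddings" key is not shown above and is an assumption here):

```python
from collections import defaultdict

import numpy as np

# Group chunk embeddings by the page they came from.
# Assumes results["embeddings"] and results["metadatas"] are parallel lists.
chunks_by_page = defaultdict(list)
for embedding, metadata in zip(results["embeddings"], results["metadatas"]):
    chunks_by_page[metadata["page_title"]].append(embedding)

# Average the chunks for each page: sum the vectors element-wise and divide
# by the number of chunks, which is np.mean along axis 0.
page_embeddings = {
    title: np.mean(np.array(embeddings), axis=0)
    for title, embeddings in chunks_by_page.items()
}

print(len(page_embeddings))  # one averaged vector per page_title
```

Each resulting vector has the same dimensionality as the chunk embeddings, so it can be fed into the same UMAP projection as the chunk-level vectors.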