Sparse Vector Search Setup

Learn how to configure and use sparse vectors for keyword-based search, and combine them with dense embeddings for powerful hybrid search capabilities.

What are Sparse Vectors?

Sparse vectors are high-dimensional vectors with mostly zero values, designed for keyword-based retrieval. Unlike dense embeddings, which capture semantic meaning, sparse vectors excel at:
  • Exact keyword matching: Finding documents containing specific terms
  • Domain-specific terminology: Better at matching technical terms, proper nouns, and rare words
  • Lexical retrieval: BM25-style retrieval patterns
Sparse vectors are produced by models like SPLADE, which assign importance weights to specific tokens, making them complementary to dense semantic embeddings.
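Conceptually, a sparse embedding is just a mapping from vocabulary tokens to importance weights, with every unlisted dimension implicitly zero. The following sketch is purely illustrative (the token IDs, weights, and vocabulary size are made up, not real SPLADE output):
# Illustrative only: a sparse vector stores just its non-zero dimensions.
sparse_vector = {
    "indices": [1012, 4827, 9031],   # hypothetical vocabulary token IDs
    "values": [0.82, 1.41, 0.07],    # importance weights assigned by the model
}

# The equivalent dense view would be a vocabulary-sized array of mostly zeros.
vocab_size = 30_522                  # e.g. a BERT-style vocabulary size
dense_view = [0.0] * vocab_size
for idx, val in zip(sparse_vector["indices"], sparse_vector["values"]):
    dense_view[idx] = val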

Enabling Sparse Vector Index

To use sparse vectors, add a sparse vector index to your schema. The key parameter is the name of the metadata field where sparse embeddings will be stored; you can choose any name you like:
from chromadb import Schema, SparseVectorIndexConfig, K
from chromadb.utils.embedding_functions import ChromaCloudSpladeEmbeddingFunction

schema = Schema()

# Add sparse vector index for keyword-based search
# "sparse_embedding" is just a metadata key name - use any name you prefer
sparse_ef = ChromaCloudSpladeEmbeddingFunction()
schema.create_index(
    config=SparseVectorIndexConfig(
        source_key=K.DOCUMENT,
        embedding_function=sparse_ef
    ),
    key="sparse_embedding"
)
The source_key specifies which field to generate sparse embeddings from (typically K.DOCUMENT for document text), and embedding_function specifies the function to generate the sparse embeddings. This example uses ChromaCloudSpladeEmbeddingFunction, but you can also use other sparse embedding functions like HuggingFaceSparseEmbeddingFunction or FastembedSparseEmbeddingFunction. The sparse embeddings are automatically generated and stored in the metadata field you specify as the key.
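If you want to generate sparse embeddings outside of Chroma Cloud, the index is configured the same way; only the embedding function changes. The sketch below is an assumption-laden example: the import path and constructor arguments for HuggingFaceSparseEmbeddingFunction may differ, so check that embedding function's documentation before using it.
# Sketch: same sparse index, different sparse embedding function.
# Assumption: HuggingFaceSparseEmbeddingFunction lives alongside the other
# embedding functions and can be constructed with default arguments.
from chromadb.utils.embedding_functions import HuggingFaceSparseEmbeddingFunction

local_sparse_ef = HuggingFaceSparseEmbeddingFunction()

schema.create_index(
    config=SparseVectorIndexConfig(
        source_key=K.DOCUMENT,
        embedding_function=local_sparse_ef
    ),
    key="sparse_embedding"
)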

Create Collection and Add Data

Create Collection with Schema

import chromadb

client = chromadb.CloudClient(
    tenant="your-tenant",
    database="your-database",
    api_key="your-api-key"
)

collection = client.create_collection(
    name="hybrid_search_collection",
    schema=schema
)

Add Data

When you add documents, sparse embeddings are automatically generated from the source key:
collection.add(
    ids=["doc1", "doc2", "doc3"],
    documents=[
        "The quick brown fox jumps over the lazy dog",
        "A fast auburn fox leaps over a sleepy canine",
        "Machine learning is a subset of artificial intelligence"
    ],
    metadatas=[
        {"category": "animals"},
        {"category": "animals"},
        {"category": "technology"}
    ]
)

# Sparse embeddings for "sparse_embedding" are generated automatically
# from the documents (source_key=K.DOCUMENT)

Search with Sparse Vectors

Once configured, you can search using sparse vectors alone or combine them with dense embeddings for hybrid search. Use sparse vectors for keyword-based retrieval:
from chromadb import Search, K, Knn

# Search using sparse embeddings only
sparse_rank = Knn(query="fox animal", key="sparse_embedding")

# Build and execute search
search = (Search()
    .rank(sparse_rank)
    .limit(10)
    .select(K.DOCUMENT, K.SCORE))

results = collection.search(search)

# Process results
for row in results.rows()[0]:
    print(f"Score: {row['score']:.3f} - {row['document']}")

Hybrid Search

Hybrid search combines dense semantic embeddings with sparse keyword embeddings for improved retrieval quality. By merging results from both approaches using Reciprocal Rank Fusion (RRF), you often achieve better results than either approach alone.
  • Semantic + Lexical: Dense embeddings capture meaning while sparse vectors catch exact keywords
  • Improved recall: Finds relevant documents that either semantic or keyword search might miss alone
  • Balanced results: Combines the strengths of both retrieval methods

Combining Dense and Sparse with RRF

Use RRF (Reciprocal Rank Fusion) to merge dense and sparse search results:
from chromadb import Search, K, Knn, Rrf

# Create RRF ranking combining dense and sparse embeddings
hybrid_rank = Rrf(
    ranks=[
        Knn(query="fox animal", return_rank=True),           # Dense semantic search
        Knn(query="fox animal", key="sparse_embedding", return_rank=True)  # Sparse keyword search
    ],
    weights=[0.7, 0.3],  # 70% semantic, 30% keyword
    k=60
)

# Build and execute search
search = (Search()
    .rank(hybrid_rank)
    .limit(10)
    .select(K.DOCUMENT, K.SCORE))

results = collection.search(search)

# Process results
for row in results.rows()[0]:
    print(f"Score: {row['score']:.3f} - {row['document']}")
For comprehensive details on RRF parameters, weight tuning, and advanced hybrid search strategies, see the Search API Hybrid Search documentation.

Next Steps