Look at Your Data
Before building our RAG pipelines and inserting data into Chroma collections, it is worth asking ourselves the following questions:
- What types of searches do we want to support? (semantic, regex, keyword, etc.)
- What embedding models should we use for semantic and keyword searches?
- Should chunks live in one Chroma collection, or should we use different collections for different chunk types?
- What are the meaningful units of data we want to store as records in our Chroma collections?
- What metadata fields can we leverage when querying?
Search Modalities
Chroma supports various search techniques that are useful for different use cases. Dense search (semantic) uses embeddings to find records that are semantically similar to a query. It excels at matching meaning and intent: a query like “how do I return a product” can surface relevant chunks even if they never use the word “return.” The weakness? Dense search can struggle with exact terms: product SKUs, part numbers, legal case citations, or domain-specific jargon that didn’t appear often in the embedding model’s training data. All Chroma collections enable semantic search by default. You can specify the embedding function your collection will use to embed your data when creating a collection:
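For example, a minimal sketch using the Python client (the collection name and the choice of text-embedding-3-small here are placeholders, not a recommendation):

```python
import os
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.Client()

# Embedding function used for both inserted documents and queries.
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",
)

collection = client.create_collection(
    name="support-docs",  # placeholder name
    embedding_function=openai_ef,
)
```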
Sparse search (lexical) matches on the exact terms in a query rather than on their meaning. This is exactly where dense search falls short: when a user searches for SKU-4892-X, a drug name like omeprazole, a legal citation like Smith v. Jones (2019), or a model number in a technical manual, dense search might miss these entirely or return semantically related but wrong results. The tradeoff is that lexical search can’t bridge synonyms or paraphrases; searching “cancel” won’t find chunks that only mention “terminate.”

To enable lexical search on your collection, add a sparse vector index to your collection’s schema with a sparse embedding function such as BM25 or SPLADE.
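The exact schema calls for this depend on your Chroma version, so check the current docs for the precise interface. As a stand-in, the sketch below uses the rank_bm25 package (not part of Chroma; the corpus is illustrative) to show what BM25-style lexical scoring does with exact terms:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "Order SKU-4892-X ships in two business days.",
    "Returns are accepted within 30 days of purchase.",
    "Omeprazole reduces stomach acid production.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

# The exact token "sku-4892-x" scores highly only against the first
# document; a dense model could blur this distinction.
print(bm25.get_scores("sku-4892-x".split()))
```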
Chroma also lets you constrain results by document content at query time with the where_document parameter. You can filter results to only include chunks that contain an exact string or match a regex pattern. This is useful for enforcing structural constraints, like ensuring results contain a specific identifier, or for pattern matching on things like email addresses, dates, or phone numbers.
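For instance (the query text and identifier are illustrative; the regex operator requires a recent Chroma version):

```python
# Only return chunks that contain an exact identifier.
results = collection.query(
    query_texts=["how do I return a product"],
    n_results=5,
    where_document={"$contains": "SKU-4892-X"},
)

# Or keep only chunks matching a pattern, e.g. an email address.
results = collection.query(
    query_texts=["who do I contact about billing"],
    n_results=5,
    where_document={"$regex": r"[\w.+-]+@[\w-]+\.[\w.]+"},
)
```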
Embedding Models
Dense embedding models map text to vectors where semantic similarity is captured by vector distance. Chroma has first-class support for many embedding models. The tradeoffs include cost (API-based vs. local), latency, embedding dimensions (which affect storage and search speed), and quality on your specific domain. General-purpose models work well for most text, but specialized models trained on code, legal documents, or medical text can outperform them on domain-specific tasks. Larger models typically produce better embeddings but cost more and run slower, so the right choice depends on your quality requirements and constraints; the sketch after the lists below makes one of these options concrete.
- If you’re building a customer support bot over general documentation, a model like text-embedding-3-small offers a good balance of quality and cost.
- For a codebase search tool, code-specific models will better capture the semantics of function names, syntax, and programming patterns. Chroma works with code-specific models from OpenAI, Cohere, Mistral, Morph, and more.
- If you need to run entirely locally for privacy or cost reasons, smaller open-source models like all-MiniLM-L6-v2 are a practical choice, though with some quality tradeoff.
Sparse embedding models serve lexical search, and the main choices trade precision against flexibility:
- If your data contains lots of exact identifiers that must match precisely (SKUs, legal citations, chemical formulas), BM25 is straightforward and effective.
- If you want lexical search that’s more forgiving of vocabulary mismatches, SPLADE can help.
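For the local dense option mentioned above, for example, a minimal sketch (the collection name is a placeholder):

```python
import chromadb
from chromadb.utils import embedding_functions

# Runs fully locally via sentence-transformers; no API key required.
local_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2",
)

client = chromadb.Client()
collection = client.create_collection(
    name="local-docs",  # placeholder name
    embedding_function=local_ef,
)
```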
Collections in your Chroma Database
A Chroma collection indexes records using a specific embedding model and configuration. Whether your records live in one Chroma collection or many depends on your application’s access patterns and data types. Use a single collection when:
- You are using the same embedding model for all of your data.
- You want to search across everything at once.
- You can distinguish between records using metadata filtering.
Use multiple collections when:
- You have different types of data, requiring different embedding models. For example, you have text data and images, which are embedded using different models.
- You have multi-tenant requirements. In this case, establishing a collection per user or organization helps you avoid filtering overhead at query time.
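To make the single-collection pattern concrete, here is a minimal sketch (metadata keys like doc_type and org_id are illustrative, and collection and client are assumed to be the ones created earlier):

```python
# One collection: distinguish record types with metadata filters.
collection.add(
    ids=["policy-001", "faq-001"],
    documents=[
        "Refunds are issued within 14 days of return approval.",
        "To reset your password, open Settings > Security.",
    ],
    metadatas=[
        {"doc_type": "policy", "org_id": "acme"},
        {"doc_type": "faq", "org_id": "acme"},
    ],
)

results = collection.query(
    query_texts=["refund window"],
    n_results=3,
    where={"doc_type": "policy"},  # restrict to one record type
)

# Multi-tenant alternative: one collection per organization, so
# queries never need a tenant filter.
acme_docs = client.get_or_create_collection(name="tenant-acme")
```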