Filtering with Where

Learn how to filter search results using Where expressions and the Key/K class to narrow down your search to specific documents, IDs, or metadata values.

The Key/K Class

The Key class (aliased as K for brevity) provides a fluent interface for building filter expressions. Use K to reference document fields, IDs, and metadata properties.
from chromadb import K

# K is an alias for Key - use K for more concise code
# Filter by metadata field
K("status") == "active"

# Filter by document content
K.DOCUMENT.contains("machine learning")

# Filter by document IDs
K.ID.is_in(["doc1", "doc2", "doc3"])

Filterable Fields

Field            Usage                          Description
K.ID             K.ID.is_in(["id1", "id2"])     Filter by document IDs
K.DOCUMENT       K.DOCUMENT.contains("text")    Filter by document content
K("field_name")  K("status") == "active"        Filter by any metadata field

Comparison Operators

Supported operators:
  • == - Equality (all types: string, numeric, boolean)
  • != - Inequality (all types: string, numeric, boolean)
  • > - Greater than (numeric only)
  • >= - Greater than or equal (numeric only)
  • < - Less than (numeric only)
  • <= - Less than or equal (numeric only)
# Equality and inequality (all types)
K("status") == "published"     # String equality
K("views") != 0                # Numeric inequality
K("featured") == True          # Boolean equality

# Numeric comparisons (numbers only)
K("price") > 100               # Greater than
K("rating") >= 4.5             # Greater than or equal
K("stock") < 10                # Less than
K("discount") <= 0.25          # Less than or equal
Chroma supports three data types for metadata: strings, numbers (int/float), and booleans. Order comparison operators (>, <, >=, <=) currently only work with numeric types.
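Because order comparisons only apply to numbers, it can help to reject other operand types before a filter is sent. The following is a minimal sketch, not part of Chroma: order_filter and ORDER_OPERATORS are hypothetical names, and the output uses the dictionary syntax described later on this page. Treating booleans as non-numeric is an assumption based on Chroma listing them as a separate type.

```python
from numbers import Real
from typing import Any, Dict

# Maps Python-style order operators to their dictionary-syntax names.
ORDER_OPERATORS = {">": "$gt", ">=": "$gte", "<": "$lt", "<=": "$lte"}


def order_filter(field: str, op: str, value: Any) -> Dict[str, Any]:
    """Build an order-comparison filter, rejecting non-numeric operands."""
    if op not in ORDER_OPERATORS:
        raise ValueError(f"not an order operator: {op}")
    # bool is a subclass of int in Python; treat it as a separate type here.
    if isinstance(value, bool) or not isinstance(value, Real):
        raise TypeError(f"{op} needs an int or float, got {type(value).__name__}")
    return {field: {ORDER_OPERATORS[op]: value}}


print(order_filter("price", ">", 100))  # {'price': {'$gt': 100}}
```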

Set and String Operators

Supported operators:
  • is_in() - Value matches any in the list
  • not_in() - Value doesn’t match any in the list
  • contains() - String contains substring (case-sensitive, currently K.DOCUMENT only)
  • not_contains() - String doesn’t contain substring (currently K.DOCUMENT only)
  • regex() - String matches regex pattern (currently K.DOCUMENT only)
  • not_regex() - String doesn’t match regex pattern (currently K.DOCUMENT only)
# Set membership operators (works on all fields)
K.ID.is_in(["doc1", "doc2", "doc3"])           # Match any ID in list
K("category").is_in(["tech", "science"])       # Match any category
K("status").not_in(["draft", "deleted"])       # Exclude specific values

# String content operators (currently K.DOCUMENT only)
K.DOCUMENT.contains("machine learning")        # Substring search in document
K.DOCUMENT.not_contains("deprecated")          # Exclude documents with text
K.DOCUMENT.regex(r"\bAPI\b")                   # Match whole word "API" in document

# Note: String pattern matching on metadata fields not yet supported
# K("title").contains("Python")                # NOT YET SUPPORTED
# K("email").regex(r".*@company\.com$")        # NOT YET SUPPORTED
String operations like contains() and regex() are case-sensitive by default. The is_in() operator is efficient even with large lists.

Logical Operators

Supported operators:
  • & - Logical AND (all conditions must match)
  • | - Logical OR (any condition can match)
Combine multiple conditions using these operators. Always use parentheses to ensure correct precedence.
# AND operator (&) - all conditions must match
(K("status") == "published") & (K("year") >= 2020)

# OR operator (|) - any condition can match
(K("category") == "tech") | (K("category") == "science")

# Combining with document and ID filters
(K.DOCUMENT.contains("API")) & (K("author") == "Smith")
(K.ID.is_in(["id1", "id2"])) | (K("featured") == True)

# Complex nesting - use parentheses for clarity
(
    (K("status") == "published") &
    ((K("category") == "tech") | (K("category") == "science")) &
    (K("rating") >= 4.0)
)
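Always wrap each comparison in parentheses when using logical operators. In Python, & and | bind more tightly than comparison operators such as == and >=, so an unparenthesized expression like K("status") == "published" & K("year") >= 2020 is not parsed as two conditions joined by AND.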
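Python's own parser can show why the parentheses matter. This sketch uses only the standard-library ast module, so it does not need Chroma installed. top_level_node is a hypothetical helper name, and the expressions are parsed as text rather than evaluated.

```python
import ast


def top_level_node(expression: str) -> str:
    """Return the name of the outermost AST node Python builds for an expression."""
    return type(ast.parse(expression, mode="eval").body).__name__


unparenthesized = 'K("status") == "published" & K("year") >= 2020'
parenthesized = '(K("status") == "published") & (K("year") >= 2020)'

# Without parentheses, & groups "published" with K("year") first, and the
# whole expression becomes one chained comparison.
print(top_level_node(unparenthesized))  # Compare

# With parentheses, the outermost operation is the & that joins two comparisons.
print(top_level_node(parenthesized))    # BinOp
```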

Dictionary Syntax (MongoDB-style)

You can also use dictionary syntax instead of K expressions. This is useful when building filters programmatically. Supported dictionary operators:
  • Direct value - Shorthand for equality
  • $eq - Equality
  • $ne - Not equal
  • $gt - Greater than (numeric only)
  • $gte - Greater than or equal (numeric only)
  • $lt - Less than (numeric only)
  • $lte - Less than or equal (numeric only)
  • $in - Value in list
  • $nin - Value not in list
  • $contains - String contains
  • $not_contains - String doesn’t contain
  • $regex - Regex match
  • $not_regex - Regex doesn’t match
  • $and - Logical AND
  • $or - Logical OR
# Direct equality (shorthand)
{"status": "active"}                        # Same as K("status") == "active"

# Comparison operators
{"status": {"$eq": "published"}}            # Same as K("status") == "published"
{"count": {"$ne": 0}}                       # Same as K("count") != 0
{"price": {"$gt": 100}}                     # Same as K("price") > 100 (numbers only)
{"rating": {"$gte": 4.5}}                   # Same as K("rating") >= 4.5 (numbers only)
{"stock": {"$lt": 10}}                      # Same as K("stock") < 10 (numbers only)
{"discount": {"$lte": 0.25}}                # Same as K("discount") <= 0.25 (numbers only)

# Set membership operators
{"#id": {"$in": ["id1", "id2"]}}            # Same as K.ID.is_in(["id1", "id2"])
{"category": {"$in": ["tech", "ai"]}}       # Same as K("category").is_in(["tech", "ai"])
{"status": {"$nin": ["draft", "deleted"]}}  # Same as K("status").not_in(["draft", "deleted"])

# String operators (currently K.DOCUMENT only)
{"#document": {"$contains": "API"}}         # Same as K.DOCUMENT.contains("API")
# {"title": {"$not_contains": "draft"}}     # Not yet supported - metadata fields
# {"email": {"$regex": ".*@example\\.com"}} # Not yet supported - metadata fields
# {"version": {"$not_regex": "^beta"}}      # Not yet supported - metadata fields

# Logical operators
{"$and": [
    {"status": "published"},
    {"year": {"$gte": 2020}},
    {"#document": {"$contains": "API"}}
]}                                          # Combines multiple conditions with AND

{"$or": [
    {"category": "tech"},
    {"category": "science"},
    {"featured": True}
]}                                          # Combines multiple conditions with OR

# Complex nested example
{
    "$and": [
        {"$or": [
            {"category": "tech"},
            {"category": "science"}
        ]},
        {"status": "published"},
        {"quality_score": {"$gte": 0.8}}
    ]
}
Each dictionary can only contain one field or one logical operator ($and/$or). For field dictionaries, only one operator is allowed per field.
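One consequence of the one-operator-per-field rule is that a range needs two field dictionaries joined with $and. The sketch below builds such filters programmatically with plain dictionaries. The helpers between and all_of are hypothetical names introduced for illustration and are not part of Chroma.

```python
from typing import Any, Dict

Filter = Dict[str, Any]


def between(field: str, low: float, high: float) -> Filter:
    """Inclusive range: one operator per field dictionary, joined with $and."""
    return {"$and": [{field: {"$gte": low}}, {field: {"$lte": high}}]}


def all_of(*conditions: Filter) -> Filter:
    """AND conditions together; a single condition is returned unchanged."""
    if not conditions:
        raise ValueError("at least one condition is required")
    return conditions[0] if len(conditions) == 1 else {"$and": list(conditions)}


where = all_of(
    {"status": "published"},
    {"category": {"$in": ["tech", "science"]}},
    between("price", 100, 500),
)
```

Each dictionary produced here holds exactly one field or one logical operator, so the result stays within the rule stated above.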

Common Filtering Patterns

from chromadb import Search, K

# Filter by specific document IDs
search = Search().where(K.ID.is_in(["doc_001", "doc_002", "doc_003"]))

# Exclude already processed documents
processed_ids = ["doc_100", "doc_101"]
search = Search().where(K.ID.not_in(processed_ids))

# Full-text search in documents
search = Search().where(K.DOCUMENT.contains("quantum computing"))

# Combine document search with metadata
search = Search().where(
    K.DOCUMENT.contains("machine learning") &
    (K("language") == "en")
)

# Price range filtering
search = Search().where(
    (K("price") >= 100) &
    (K("price") <= 500)
)

# Multi-field filtering
search = Search().where(
    (K("status") == "active") &
    (K("category").is_in(["tech", "ai", "ml"])) &
    (K("score") >= 0.8)
)

Edge Cases and Important Behavior

Missing Keys

When filtering on a metadata field that doesn’t exist for a document:
  • Most operators (==, >, <, >=, <=, is_in()) evaluate to false - the document won’t match
  • != evaluates to true - documents without the field are considered “not equal” to any value
  • not_in() evaluates to true - documents without the field are not in any list
# If a document doesn't have a "category" field:
K("category") == "tech"         # false - won't match
K("category") != "tech"         # true - will match
K("category").is_in(["tech"])   # false - won't match
K("category").not_in(["tech"])  # true - will match
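
The missing-key rules can be modeled in a few lines of plain Python. This is a toy model of the behavior described above, not Chroma's implementation. The function name matches and its operator strings are assumptions borrowed from the dictionary syntax.

```python
from typing import Any, Dict

_MISSING = object()  # sentinel so that a stored None is not treated as missing


def matches(metadata: Dict[str, Any], field: str, op: str, value: Any) -> bool:
    """Toy model of how a single condition treats a missing metadata key."""
    actual = metadata.get(field, _MISSING)
    if actual is _MISSING:
        # Only the negative operators match documents that lack the field.
        return op in ("$ne", "$nin")
    if op == "$eq":
        return actual == value
    if op == "$ne":
        return actual != value
    if op == "$in":
        return actual in value
    if op == "$nin":
        return actual not in value
    raise ValueError(f"unsupported operator: {op}")


doc = {"status": "active"}  # no "category" field
print(matches(doc, "category", "$eq", "tech"))     # False
print(matches(doc, "category", "$ne", "tech"))     # True
print(matches(doc, "category", "$in", ["tech"]))   # False
print(matches(doc, "category", "$nin", ["tech"]))  # True
```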

Mixed Types

Avoid storing different data types under the same metadata key across documents. Query behavior is undefined when comparing values of different types.
# DON'T DO THIS - undefined behavior
# Document 1: {"score": 95}      (numeric)
# Document 2: {"score": "95"}    (string)
# Document 3: {"score": True}    (boolean)

K("score") > 90  # Undefined results when mixed types exist

# DO THIS - consistent types
# All documents: {"score": <numeric>} or all {"score": <string>}
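
One way to catch mixed types early is to scan metadata before adding it to a collection. The sketch below is a hypothetical pre-ingestion check written in plain Python. The names metadata_type_name and find_mixed_type_keys are illustrative and not part of Chroma.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Set


def metadata_type_name(value: Any) -> str:
    """Classify a value as one of the three metadata types."""
    # bool is a subclass of int in Python, so test for it first.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    return type(value).__name__


def find_mixed_type_keys(metadatas: Iterable[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Return each metadata key that appears with more than one value type."""
    seen = defaultdict(set)
    for metadata in metadatas:
        for key, value in metadata.items():
            seen[key].add(metadata_type_name(value))
    return {key: types for key, types in seen.items() if len(types) > 1}


batch = [
    {"score": 95, "lang": "en"},
    {"score": "95", "lang": "en"},
    {"score": True, "lang": "fr"},
]
mixed = find_mixed_type_keys(batch)  # only "score" is reported
```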

String Pattern Matching Limitations

Currently, contains(), not_contains(), regex(), and not_regex() operators only work on K.DOCUMENT. These operators do not yet support metadata fields. Additionally, the pattern must contain at least 3 literal characters to ensure accurate results.
# Currently supported - K.DOCUMENT only
K.DOCUMENT.contains("API")              # Works
K.DOCUMENT.regex(r"v\d\.\d\.\d")       # Works
K.DOCUMENT.contains("machine learning") # Works

# NOT YET SUPPORTED - metadata fields
K("title").contains("Python")           # Not supported yet
K("description").regex(r"API.*")        # Not supported yet

# Pattern length requirements (for K.DOCUMENT)
K.DOCUMENT.contains("API")              # 3 characters - good
K.DOCUMENT.contains("AI")               # Only 2 characters - may give incorrect results
K.DOCUMENT.regex(r"\d+")                # No literal characters - may give incorrect results
Support for string pattern matching on metadata fields is planned for a future release, which will let users opt in to additional indexes on specific metadata fields.

Complete Example

Here’s a practical example combining different filter types:
from chromadb import Search, K, Knn

# Complex filter combining IDs, document content, and metadata
search = (Search()
    .where(
        # Exclude specific documents
        K.ID.not_in(["excluded_001", "excluded_002"]) &

        # Must contain specific content
        K.DOCUMENT.contains("artificial intelligence") &

        # Metadata conditions
        (K("status") == "published") &
        (K("quality_score") >= 0.75) &
        (
            (K("category") == "research") |
            (K("category") == "tutorial")
        ) &
        (K("year") >= 2023)
    )
    .rank(Knn(query="latest AI research developments"))
    .limit(10)
    .select(K.DOCUMENT, "title", "author", "year")
)

# Assumes `collection` is an existing Chroma collection
results = collection.search(search)

Tips and Best Practices

  • Use parentheses liberally when combining conditions with & and | to avoid precedence issues
  • Filter before ranking when possible to reduce the number of vectors to score
  • Be specific with ID filters - using K.ID.is_in() with a small list is very efficient
  • String matching is case-sensitive - normalize your data if case-insensitive matching is needed
  • Use the right operator - is_in() for multiple exact matches, contains() for substring search
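
As a sketch of the normalization tip for metadata equality, one approach is to store a case-folded twin of a string field and compare against that twin at query time. The helpers with_normalized and equals_ignore_case and the "_normalized" suffix are hypothetical conventions, not Chroma features.

```python
from typing import Any, Dict


def with_normalized(metadata: Dict[str, Any], field: str) -> Dict[str, Any]:
    """Return a copy of metadata with a case-folded twin of a string field."""
    normalized = dict(metadata)
    value = metadata.get(field)
    if isinstance(value, str):
        normalized[f"{field}_normalized"] = value.casefold()
    return normalized


def equals_ignore_case(field: str, value: str) -> Dict[str, Any]:
    """Dictionary-syntax equality filter against the case-folded twin field."""
    return {f"{field}_normalized": value.casefold()}


stored = with_normalized({"author": "Smith"}, "author")
where = equals_ignore_case("author", "SMITH")
```

The original field is kept as-is for display, while the twin field is used only for filtering.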

Next Steps