RAG Versioning

Use RAG versioning to keep retrieval artifacts reproducible across document and embedding updates.

Overview

This feature tracks document hashes, embedding batches, and manifest state so teams can detect invalidation quickly and rebuild retrieval indexes deterministically.

What Engineers Use It For

Snapshot document and embedding state as manifests
Detect stale indexes after document or model changes
Rebuild only affected retrieval artifacts
Keep retrieval behavior explainable during incident analysis

Features

Manifest-Based Versioning - Atomic snapshots of document and embedding state
Drift Detection - Identify when documents, models, or both have changed
Invalidation Tracking - Understand why manifests become stale
Efficient Rebuilding - Rebuild indexes by re-embedding only the affected documents
Hash-Based Deduplication - Automatic content hash tracking
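
The deduplication idea can be illustrated without the library. The following is a minimal sketch, assuming that deduplication means embedding one representative per distinct content hash. The `group_ids_by_hash` helper and the tuple layout are hypothetical and are not part of briefcase_ai.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_ids_by_hash(docs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group document IDs by content hash, preserving input order within a group."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for doc_id, digest in docs:
        groups[digest].append(doc_id)
    return dict(groups)


# (doc_id, content_hash) pairs; doc_003 repeats the content of doc_001
docs = [
    ("doc_001", "sha256:abc123"),
    ("doc_002", "sha256:def456"),
    ("doc_003", "sha256:abc123"),
]
groups = group_ids_by_hash(docs)

# One representative per hash needs embedding; the others could reuse its vector
to_embed = [ids[0] for ids in groups.values()]
duplicates = {digest: ids[1:] for digest, ids in groups.items() if len(ids) > 1}
```

Whether the pipeline reuses vectors for duplicate content in this way is not stated here, so treat the sketch as an illustration of the concept only.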

Quick Start

from briefcase_ai.rag import VersionedEmbeddingPipeline, Document, ManifestStatus

# Create pipeline
pipeline = VersionedEmbeddingPipeline()

# Create documents
documents = [
    Document(
        id="doc_001",
        content="Retrieval Augmented Generation improves LLM accuracy...",
        metadata={"source": "blog", "date": "2024-01-15"},
        path="blog/rag-intro.md",
        content_hash="sha256:abc123"
    ),
    Document(
        id="doc_002",
        content="Vector embeddings encode semantic meaning...",
        metadata={"source": "docs", "date": "2024-01-20"},
        path="docs/embeddings.md",
        content_hash="sha256:def456"
    ),
]

# Create embedding batch
batch = pipeline.create_embedding_batch(documents)

# Create manifest snapshot
manifest = pipeline.create_manifest("my-index", [batch])

# Check invalidation
report = pipeline.check_invalidation("my-index", documents)
if report.status != ManifestStatus.CURRENT:
    print(f"Manifest is {report.status}: {report.reasons}")

Core Components

Document

Represents a document in the RAG system:

class Document:
    id: str                          # Unique identifier
    content: str                     # Document text content
    metadata: Dict[str, Any]         # Custom metadata (source, date, etc.)
    path: str                        # File path or URI
    content_hash: str                # SHA-256 hash of content

Example:

document = Document(
    id="policy_001",
    content="Claims handling procedure: 1. Submit form 2. Review 3. Approve",
    metadata={"policy_version": "2.1", "updated": "2024-02-01"},
    path="policies/claims.txt",
    content_hash="sha256:abc123def456"
)
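
The examples on this page use placeholder hashes such as `sha256:abc123`. A real hash can be derived from the content with the standard library. This is a sketch that assumes the `sha256:<hex digest>` format implied by the examples and UTF-8 encoding of the text; the `sha256_content_hash` helper is hypothetical, and the library may compute or normalize hashes differently.

```python
import hashlib


def sha256_content_hash(content: str) -> str:
    """Return 'sha256:' followed by the hex SHA-256 digest of the UTF-8 text."""
    return "sha256:" + hashlib.sha256(content.encode("utf-8")).hexdigest()


content = "Claims handling procedure: 1. Submit form 2. Review 3. Approve"

# Keyword arguments matching the Document fields listed above
document_fields = {
    "id": "policy_001",
    "content": content,
    "metadata": {"policy_version": "2.1", "updated": "2024-02-01"},
    "path": "policies/claims.txt",
    "content_hash": sha256_content_hash(content),
}
```

Passing `**document_fields` to `Document` would then give a hash that changes whenever the content changes, which is what staleness detection relies on.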

EmbeddingBatch

Contains embeddings for a set of documents:

class EmbeddingBatch:
    documents: List[Document]        # Source documents
    embeddings: List[List[float]]    # Vector embeddings
    model_id: str                    # Embedding model used
    created_at: datetime             # Creation timestamp

EmbeddingManifest

Atomic snapshot of embedding state:

class EmbeddingManifest:
    id: str                          # Manifest ID
    batch_id: str                    # Reference to EmbeddingBatch
    status: ManifestStatus           # Current status
    document_hashes: Dict[str, str]  # doc_id -> content_hash mapping
    model_id: str                    # Embedding model at snapshot time
    created_at: datetime             # Creation timestamp
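
To make the reproducibility of a snapshot concrete, the sketch below fingerprints the two fields that define embedding state: `document_hashes` and `model_id`. It assumes a canonical JSON encoding with sorted keys. The `manifest_fingerprint` function is hypothetical and does not describe how the library derives manifest IDs.

```python
import hashlib
import json
from typing import Dict


def manifest_fingerprint(document_hashes: Dict[str, str], model_id: str) -> str:
    """Hash a canonical JSON encoding so equal state yields an equal fingerprint."""
    payload = json.dumps(
        {"document_hashes": document_hashes, "model_id": model_id},
        sort_keys=True,            # key order must not affect the result
        separators=(",", ":"),     # fixed separators keep the encoding stable
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


first = manifest_fingerprint({"d1": "sha256:aa", "d2": "sha256:bb"}, "model-a")
second = manifest_fingerprint({"d2": "sha256:bb", "d1": "sha256:aa"}, "model-a")
changed = manifest_fingerprint({"d1": "sha256:aa", "d2": "sha256:cc"}, "model-a")
```

Here `first` and `second` are equal because only the insertion order differs, while `changed` differs because one document hash changed.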

ManifestStatus Enum

Indicates the current state of a manifest:

class ManifestStatus(Enum):
    CURRENT            # All documents and models current
    STALE_DOCUMENTS    # Documents have changed
    STALE_MODEL        # Embedding model has changed
    STALE_BOTH         # Both documents and model changed
    REBUILDING         # Index rebuild in progress
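
The four staleness states follow from two independent questions: did any document change, and did the embedding model change? The sketch below shows that mapping with a stand-in enum. The `Status` class, its string values, and `derive_status` are illustrative assumptions; `REBUILDING` is left out because it describes a lifecycle phase and is not the result of a comparison.

```python
from enum import Enum


class Status(Enum):
    """Stand-in for ManifestStatus; the string values are illustrative."""
    CURRENT = "current"
    STALE_DOCUMENTS = "stale_documents"
    STALE_MODEL = "stale_model"
    STALE_BOTH = "stale_both"


def derive_status(documents_changed: bool, model_changed: bool) -> Status:
    """Combine the two change flags into a single status value."""
    if documents_changed and model_changed:
        return Status.STALE_BOTH
    if documents_changed:
        return Status.STALE_DOCUMENTS
    if model_changed:
        return Status.STALE_MODEL
    return Status.CURRENT
```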

Key Operations

create_embedding_batch

Create embeddings for a set of documents:

batch = pipeline.create_embedding_batch(
    documents,
    batch_id="optional-batch-id",
    source_commit="optional-lakefs-commit-sha"
)
# Returns EmbeddingBatch with vectors created from documents

create_manifest

Create an atomic snapshot of the current state:

manifest = pipeline.create_manifest(
    "my-index",      # index_name: str (required)
    [batch],         # batches: List[EmbeddingBatch] (required)
    metadata={"version": "1.0"}  # optional metadata
)
# Returns EmbeddingManifest capturing document hashes and model state

check_invalidation

Detect whether a manifest is stale:

report = pipeline.check_invalidation(
    "my-index",          # index_name: str (required)
    current_documents,   # current_documents: List[Document] (required)
    current_model="text-embedding-3-large",         # optional
    current_model_version="v3"                      # optional
)
# Returns InvalidationReport with:
# - status: ManifestStatus
# - reasons: List[str] - reasons for staleness
# - affected_documents: List[str] - IDs of changed documents

rebuild_index

Rebuild the vector index after invalidation:

new_batch = pipeline.rebuild_index(manifest)
# Re-embeds affected documents using current model
# Returns updated EmbeddingBatch
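
Partial rebuilding can be pictured as follows: reuse stored vectors for unchanged documents and call the embedding model only for the affected ones. This is a sketch under assumptions, not the behavior of `rebuild_index`. The `rebuild_affected` function and the toy `fake_embed` embedder are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Vector = List[float]


def rebuild_affected(
    existing: Dict[str, Vector],
    contents: Dict[str, str],
    affected: Iterable[str],
    embed: Callable[[str], Vector],
) -> Dict[str, Vector]:
    """Return a new id -> vector map, re-embedding only affected or new IDs."""
    affected_set = set(affected)
    rebuilt: Dict[str, Vector] = {}
    for doc_id, text in contents.items():
        if doc_id in affected_set or doc_id not in existing:
            rebuilt[doc_id] = embed(text)
        else:
            rebuilt[doc_id] = existing[doc_id]
    return rebuilt  # IDs absent from `contents` are dropped


calls: List[str] = []


def fake_embed(text: str) -> Vector:
    """Toy embedder that records its calls; a real model call would go here."""
    calls.append(text)
    return [float(len(text))]


existing = {"d1": [1.0], "d2": [2.0]}
contents = {"d1": "Updated content", "d2": "unchanged"}
rebuilt = rebuild_affected(existing, contents, ["d1"], fake_embed)
```

Reusing vectors is only sound while the embedding model is unchanged. After a model change, every document needs to be re-embedded.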

InvalidationReport

class InvalidationReport:
    status: ManifestStatus           # Overall manifest state (CURRENT or a stale variant)
    reasons: List[str]               # Human-readable reasons
    affected_documents: List[str]    # Document IDs that changed
    model_updated: bool              # Whether embedding model changed
    document_updates: Dict[str, str] # doc_id -> change_type

Example:

report = pipeline.check_invalidation("my-index", current_documents)
print(f"Status: {report.status}")
for reason in report.reasons:
    print(f"  - {reason}")
if report.affected_documents:
    print(f"Affected docs: {report.affected_documents}")
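
The `affected_documents` and `document_updates` fields can be understood as the result of comparing two hash maps: the manifest's `document_hashes` and the hashes of the current documents. The sketch below assumes the change types are `added`, `modified`, and `removed`; those labels and the `diff_document_hashes` helper are hypothetical.

```python
from typing import Dict


def diff_document_hashes(snapshot: Dict[str, str], current: Dict[str, str]) -> Dict[str, str]:
    """Map each differing doc_id to 'added', 'modified', or 'removed'."""
    updates: Dict[str, str] = {}
    for doc_id, digest in current.items():
        if doc_id not in snapshot:
            updates[doc_id] = "added"
        elif snapshot[doc_id] != digest:
            updates[doc_id] = "modified"
    for doc_id in snapshot:
        if doc_id not in current:
            updates[doc_id] = "removed"
    return updates


snapshot = {"d1": "sha256:v1hash", "d2": "sha256:keep", "d3": "sha256:old"}
current = {"d1": "sha256:v2hash", "d2": "sha256:keep", "d4": "sha256:new"}

document_updates = diff_document_hashes(snapshot, current)
affected_documents = sorted(document_updates)
```

An empty result means that no document contributes to staleness, which leaves a model change as the only remaining cause.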

Usage Patterns

Tracking Document Updates

from briefcase_ai.rag import VersionedEmbeddingPipeline, Document, ManifestStatus

pipeline = VersionedEmbeddingPipeline()

# Initial documents
v1_docs = [
    Document(id="d1", content="Version 1 content", path="doc.txt",
             metadata={}, content_hash="sha256:v1hash")
]
v1_batch = pipeline.create_embedding_batch(v1_docs)
v1_manifest = pipeline.create_manifest("my-index", [v1_batch])

# Later, document is updated
v2_docs = [
    Document(id="d1", content="Updated content", path="doc.txt",
             metadata={}, content_hash="sha256:v2hash")
]
report = pipeline.check_invalidation("my-index", v2_docs)
if report.status == ManifestStatus.STALE_DOCUMENTS:
    print("Document changed, need to re-embed")
    v2_batch = pipeline.create_embedding_batch(v2_docs)
    v2_manifest = pipeline.create_manifest("my-index", [v2_batch])

Handling Model Updates

# Detect when embedding model changes
report = pipeline.check_invalidation(
    "my-index", current_docs,
    current_model="text-embedding-3-large",
    current_model_version="v4"
)
if report.status in (ManifestStatus.STALE_MODEL, ManifestStatus.STALE_BOTH):
    print(f"Embedding model updated: {report.reasons}")
    new_batch = pipeline.create_embedding_batch(current_docs)
    new_manifest = pipeline.create_manifest("my-index", [new_batch])

Batch Document Processing

# Process large document sets
documents = [Document(...) for _ in range(1000)]
batch = pipeline.create_embedding_batch(documents)
manifest = pipeline.create_manifest("bulk-index", [batch])

# Later verify state
report = pipeline.check_invalidation("bulk-index", documents)
if not report.affected_documents:
    print("All documents current, index is fresh")
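
Because `create_manifest` accepts a list of batches, a large corpus can be split into fixed-size chunks before embedding. The helper below is a plain-Python sketch. The `chunked` function and the chunk size of 256 are assumptions chosen for illustration.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Stand-in for 1000 Document objects
doc_ids = [f"doc_{i:04d}" for i in range(1000)]
chunks = list(chunked(doc_ids, 256))
```

With real documents, each chunk could be passed to `pipeline.create_embedding_batch`, and the resulting list of batches to `pipeline.create_manifest("bulk-index", batches)`.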

Installation

RAG versioning requires optional dependencies:

pip install "briefcase-ai[rag]"

This installs:

pyarrow>=12.0.0 - Parquet support for embeddings
pinecone-client>=2.2.0 - Pinecone integration
weaviate-client>=3.22.0 - Weaviate integration
chromadb>=0.4.0 - ChromaDB integration
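
To check which of these optional packages are importable in the current environment, a small standard-library probe is enough. This sketch assumes the usual import names for each distribution (`pinecone` for pinecone-client, `weaviate` for weaviate-client); the `available_rag_extras` helper is hypothetical.

```python
from importlib.util import find_spec
from typing import Dict

# Distribution name -> top-level import name
OPTIONAL_MODULES = {
    "pyarrow": "pyarrow",
    "pinecone-client": "pinecone",
    "weaviate-client": "weaviate",
    "chromadb": "chromadb",
}


def available_rag_extras() -> Dict[str, bool]:
    """Report which optional distributions can be imported, without importing them."""
    return {
        dist: find_spec(module) is not None
        for dist, module in OPTIONAL_MODULES.items()
    }


availability = available_rag_extras()
missing = sorted(dist for dist, ok in availability.items() if not ok)
```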

Integration Example

from briefcase_ai.rag import VersionedEmbeddingPipeline, Document, ManifestStatus
from typing import List

class RAGApplication:
    def __init__(self):
        self.pipeline = VersionedEmbeddingPipeline()
        self.current_manifest = None

    def ingest_documents(self, documents: List[Document]):
        batch = self.pipeline.create_embedding_batch(documents)
        self.current_manifest = self.pipeline.create_manifest(
            "main-index", [batch]
        )
        return self.current_manifest

    def retrieve(self, query: str, documents: List[Document]):
        # Check if retrieval index is current
        report = self.pipeline.check_invalidation("main-index", documents)
        if report.status != ManifestStatus.CURRENT:
            print(f"Rebuilding index: {report.reasons}")
            new_batch = self.pipeline.create_embedding_batch(documents)
            self.current_manifest = self.pipeline.create_manifest(
                "main-index", [new_batch]
            )

        # Perform retrieval with current embeddings.
        # retrieve_similar is application-specific and is not defined in this example.
        return self.retrieve_similar(query)
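
The retrieval step itself depends on the vector store in use. As a stand-in, the sketch below ranks stored vectors by cosine similarity in plain Python. It is an assumption-laden illustration: this `retrieve_similar` takes an already embedded query vector and an in-memory `index` dictionary, neither of which comes from briefcase_ai.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_similar(
    query_vector: Vector, index: Dict[str, Vector], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Return the top_k (doc_id, score) pairs, highest similarity first."""
    scored = [(doc_id, cosine_similarity(query_vector, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


index = {"doc_001": [1.0, 0.0], "doc_002": [0.0, 1.0], "doc_003": [1.0, 1.0]}
results = retrieve_similar([1.0, 0.1], index, top_k=2)
```

A production system would normally delegate this ranking to the vector database, such as one of the integrations listed under Installation.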
