RAG Versioning

Use RAG versioning to keep retrieval artifacts reproducible across document and embedding updates.

Overview

This feature tracks document hashes, embedding batches, and manifest state so teams can detect invalidation quickly and rebuild retrieval indexes deterministically.

What Engineers Use It For

  • Snapshot document and embedding state as manifests
  • Detect stale indexes after document or model changes
  • Rebuild only affected retrieval artifacts
  • Keep retrieval behavior explainable during incident analysis

Features

  • Manifest-Based Versioning - Atomic snapshots of document and embedding state
  • Drift Detection - Identify when documents, models, or both have changed
  • Invalidation Tracking - Understand why manifests become stale
  • Efficient Rebuilding - Rebuild indexes from only the affected documents
  • Hash-Based Deduplication - Automatic content hash tracking
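
One way to picture hash-based deduplication is to group documents by content hash and embed each distinct hash only once. The sketch below is illustrative only: `unique_by_hash` is a hypothetical helper, not part of `briefcase_ai`, and it works on plain `(doc_id, content_hash)` pairs rather than `Document` objects.

```python
from typing import Dict, Iterable, List, Tuple


def unique_by_hash(
    docs: Iterable[Tuple[str, str]],
) -> Tuple[List[str], Dict[str, str]]:
    """Split (doc_id, content_hash) pairs into unique IDs and duplicates.

    Hypothetical helper, not a briefcase_ai API. Returns the IDs to embed
    (first occurrence of each hash) and a map of duplicate ID -> the ID
    whose embedding it could reuse.
    """
    first_seen: Dict[str, str] = {}  # content_hash -> first doc_id seen
    to_embed: List[str] = []
    duplicates: Dict[str, str] = {}
    for doc_id, content_hash in docs:
        if content_hash in first_seen:
            duplicates[doc_id] = first_seen[content_hash]
        else:
            first_seen[content_hash] = doc_id
            to_embed.append(doc_id)
    return to_embed, duplicates


pairs = [
    ("doc_001", "sha256:abc123"),
    ("doc_002", "sha256:def456"),
    ("doc_003", "sha256:abc123"),  # same content as doc_001
]
to_embed, duplicates = unique_by_hash(pairs)
print(to_embed)    # ['doc_001', 'doc_002']
print(duplicates)  # {'doc_003': 'doc_001'}
```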

Quick Start

# ManifestStatus is assumed to be importable from briefcase_ai.rag
from briefcase_ai.rag import VersionedEmbeddingPipeline, Document, ManifestStatus

# Create pipeline
pipeline = VersionedEmbeddingPipeline()

# Create documents
documents = [
    Document(
        id="doc_001",
        content="Retrieval Augmented Generation improves LLM accuracy...",
        metadata={"source": "blog", "date": "2024-01-15"},
        path="blog/rag-intro.md",
        content_hash="sha256:abc123"
    ),
    Document(
        id="doc_002",
        content="Vector embeddings encode semantic meaning...",
        metadata={"source": "docs", "date": "2024-01-20"},
        path="docs/embeddings.md",
        content_hash="sha256:def456"
    ),
]

# Create embedding batch
batch = pipeline.create_embedding_batch(documents)

# Create manifest snapshot
manifest = pipeline.create_manifest("my-index", [batch])

# Check invalidation
report = pipeline.check_invalidation("my-index", documents)
if report.status != ManifestStatus.CURRENT:
    print(f"Manifest is {report.status}: {report.reasons}")

Core Components

Document

Represents a document in the RAG system:

class Document:
    id: str                   # Unique identifier
    content: str              # Document text content
    metadata: Dict[str, Any]  # Custom metadata (source, date, etc.)
    path: str                 # File path or URI
    content_hash: str         # SHA-256 hash of content

Example:

document = Document(
    id="policy_001",
    content="Claims handling procedure: 1. Submit form 2. Review 3. Approve",
    metadata={"policy_version": "2.1", "updated": "2024-02-01"},
    path="policies/claims.txt",
    content_hash="sha256:abc123def456"
)
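
The hash values in these examples are placeholders. A minimal sketch of deriving a real value from the document text follows. `compute_content_hash` is a hypothetical helper, the `sha256:` prefix is inferred from the placeholder values, and UTF-8 encoding is an assumption.

```python
import hashlib


def compute_content_hash(content: str) -> str:
    """Return a "sha256:<hex digest>" string for the given text.

    Hypothetical helper; the prefix format mirrors the placeholder values
    used in the examples, and UTF-8 encoding is an assumption.
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


h1 = compute_content_hash("Version 1 content")
h2 = compute_content_hash("Updated content")
print(h1 != h2)  # True: different text yields a different hash
print(h1 == compute_content_hash("Version 1 content"))  # True: deterministic
```

Because the hash depends only on the text, identical content always produces the same value, which is what makes hash comparison usable for change detection.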

EmbeddingBatch

Contains embeddings for a set of documents:

class EmbeddingBatch:
    documents: List[Document]      # Source documents
    embeddings: List[List[float]]  # Vector embeddings
    model_id: str                  # Embedding model used
    created_at: datetime           # Creation timestamp

EmbeddingManifest

Atomic snapshot of embedding state:

class EmbeddingManifest:
    id: str                          # Manifest ID
    batch_id: str                    # Reference to EmbeddingBatch
    status: ManifestStatus           # Current status
    document_hashes: Dict[str, str]  # doc_id -> content_hash mapping
    model_id: str                    # Embedding model at snapshot time
    created_at: datetime             # Creation timestamp
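
The `document_hashes` mapping is what makes document drift detectable: comparing the snapshot's mapping with one built from the current documents shows what changed. The sketch below shows one plausible comparison, not the library's implementation. `diff_document_hashes` and the labels `"added"`, `"modified"`, and `"removed"` are hypothetical.

```python
from typing import Dict


def diff_document_hashes(
    snapshot: Dict[str, str], current: Dict[str, str]
) -> Dict[str, str]:
    """Map doc_id -> change type between two doc_id -> content_hash maps.

    Hypothetical helper; the change-type labels are illustrative and are
    not values defined by briefcase_ai.
    """
    changes: Dict[str, str] = {}
    for doc_id, content_hash in current.items():
        if doc_id not in snapshot:
            changes[doc_id] = "added"
        elif snapshot[doc_id] != content_hash:
            changes[doc_id] = "modified"
    for doc_id in snapshot:
        if doc_id not in current:
            changes[doc_id] = "removed"
    return changes


snapshot = {"d1": "sha256:v1hash", "d2": "sha256:aaa"}
current = {"d1": "sha256:v2hash", "d3": "sha256:bbb"}
changes = diff_document_hashes(snapshot, current)
print(changes)  # {'d1': 'modified', 'd3': 'added', 'd2': 'removed'}
```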

ManifestStatus Enum

Indicates the current state of a manifest:

class ManifestStatus(Enum):
    CURRENT          # All documents and models current
    STALE_DOCUMENTS  # Documents have changed
    STALE_MODEL      # Embedding model has changed
    STALE_BOTH       # Both documents and model changed
    REBUILDING       # Index rebuild in progress
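
The four freshness states follow from two independent questions: did any document change, and did the embedding model change? A minimal sketch of that mapping follows. It uses a local stand-in enum (`Status`) and a hypothetical `classify` function rather than the library's own types, and it omits `REBUILDING`, which describes an in-progress operation rather than a comparison result.

```python
from enum import Enum, auto


class Status(Enum):
    """Local stand-in mirroring the freshness members of ManifestStatus."""
    CURRENT = auto()
    STALE_DOCUMENTS = auto()
    STALE_MODEL = auto()
    STALE_BOTH = auto()


def classify(documents_changed: bool, model_changed: bool) -> Status:
    """Combine the two change flags into a single status (hypothetical)."""
    if documents_changed and model_changed:
        return Status.STALE_BOTH
    if documents_changed:
        return Status.STALE_DOCUMENTS
    if model_changed:
        return Status.STALE_MODEL
    return Status.CURRENT


print(classify(False, False).name)  # CURRENT
print(classify(True, False).name)   # STALE_DOCUMENTS
print(classify(False, True).name)   # STALE_MODEL
print(classify(True, True).name)    # STALE_BOTH
```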

Key Operations

create_embedding_batch

Create embeddings for a set of documents:

batch = pipeline.create_embedding_batch(
    documents,
    batch_id="optional-batch-id",
    source_commit="optional-lakefs-commit-sha"
)
# Returns EmbeddingBatch with vectors created from documents

create_manifest

Create an atomic snapshot of the current state:

manifest = pipeline.create_manifest(
    "my-index",                  # index_name: str (required)
    [batch],                     # batches: List[EmbeddingBatch] (required)
    metadata={"version": "1.0"}  # optional metadata
)
# Returns EmbeddingManifest capturing document hashes and model state

check_invalidation

Detect whether a manifest is stale:

report = pipeline.check_invalidation(
    "my-index",                              # index_name: str (required)
    current_documents,                       # current_documents: List[Document] (required)
    current_model="text-embedding-3-large",  # optional
    current_model_version="v3"               # optional
)
# Returns InvalidationReport with:
# - status: ManifestStatus
# - reasons: List[str] - reasons for staleness
# - affected_documents: List[str] - IDs of changed documents

rebuild_index

Rebuild the vector index after invalidation:

new_batch = pipeline.rebuild_index(manifest)
# Re-embeds affected documents using current model
# Returns updated EmbeddingBatch
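
To illustrate what "re-embeds affected documents" can mean in practice, the sketch below rebuilds a `doc_id -> vector` map while calling the embedder only for affected or new documents. It is not the library's implementation: `rebuild_vectors`, `toy_embed`, and `counting_embed` are hypothetical names, and the toy embedder exists only to make the example runnable.

```python
from typing import Callable, Dict, List

Vector = List[float]


def rebuild_vectors(
    existing: Dict[str, Vector],
    contents: Dict[str, str],
    affected_ids: List[str],
    embed: Callable[[str], Vector],
) -> Dict[str, Vector]:
    """Re-embed only affected or new documents; reuse vectors for the rest.

    Hypothetical sketch. Documents absent from `contents` are dropped,
    so deletions are handled as well.
    """
    affected = set(affected_ids)
    rebuilt: Dict[str, Vector] = {}
    for doc_id, text in contents.items():
        if doc_id in affected or doc_id not in existing:
            rebuilt[doc_id] = embed(text)
        else:
            rebuilt[doc_id] = existing[doc_id]
    return rebuilt


def toy_embed(text: str) -> Vector:
    """Stand-in embedder: [character count, word count]."""
    return [float(len(text)), float(len(text.split()))]


calls: List[str] = []


def counting_embed(text: str) -> Vector:
    """Wrap toy_embed and record which texts were actually embedded."""
    calls.append(text)
    return toy_embed(text)


existing = {"d1": [17.0, 3.0], "d2": [5.0, 1.0]}
contents = {"d1": "Updated content", "d2": "hello"}
rebuilt = rebuild_vectors(existing, contents, ["d1"], counting_embed)
print(rebuilt)  # {'d1': [15.0, 2.0], 'd2': [5.0, 1.0]}
print(calls)    # ['Updated content']
```

Reusing vectors this way is only sound while the embedding model is unchanged. After a model change, every document counts as affected, because vectors from different models are not comparable.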

InvalidationReport

class InvalidationReport:
    status: ManifestStatus            # Resulting manifest status
    reasons: List[str]                # Human-readable reasons
    affected_documents: List[str]     # Document IDs that changed
    model_updated: bool               # Whether embedding model changed
    document_updates: Dict[str, str]  # doc_id -> change_type

Example:

report = pipeline.check_invalidation("my-index", current_documents)
print(f"Status: {report.status}")
for reason in report.reasons:
    print(f" - {reason}")
if report.affected_documents:
    print(f"Affected docs: {report.affected_documents}")

Usage Patterns

Tracking Document Updates

# ManifestStatus is assumed to be importable from briefcase_ai.rag
from briefcase_ai.rag import VersionedEmbeddingPipeline, Document, ManifestStatus

pipeline = VersionedEmbeddingPipeline()

# Initial documents
v1_docs = [
    Document(id="d1", content="Version 1 content", path="doc.txt",
             metadata={}, content_hash="sha256:v1hash")
]
v1_batch = pipeline.create_embedding_batch(v1_docs)
v1_manifest = pipeline.create_manifest("my-index", [v1_batch])

# Later, document is updated
v2_docs = [
    Document(id="d1", content="Updated content", path="doc.txt",
             metadata={}, content_hash="sha256:v2hash")
]
report = pipeline.check_invalidation("my-index", v2_docs)
if report.status == ManifestStatus.STALE_DOCUMENTS:
    print("Document changed, need to re-embed")
    v2_batch = pipeline.create_embedding_batch(v2_docs)
    v2_manifest = pipeline.create_manifest("my-index", [v2_batch])

Handling Model Updates

# Detect when embedding model changes
report = pipeline.check_invalidation(
    "my-index", current_docs,
    current_model="text-embedding-3-large",
    current_model_version="v4"
)
if report.status in (ManifestStatus.STALE_MODEL, ManifestStatus.STALE_BOTH):
    print(f"Embedding model updated: {report.reasons}")
    new_batch = pipeline.create_embedding_batch(current_docs)
    new_manifest = pipeline.create_manifest("my-index", [new_batch])

Batch Document Processing

# Process large document sets
# Document(...) is a placeholder; construct each document with real field values
documents = [Document(...) for _ in range(1000)]
batch = pipeline.create_embedding_batch(documents)
manifest = pipeline.create_manifest("bulk-index", [batch])

# Later verify state
report = pipeline.check_invalidation("bulk-index", documents)
if not report.affected_documents:
    print("All documents current, index is fresh")
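
Because `create_manifest` accepts a list of batches, a large document set could in principle be split into several smaller batches before embedding. The sketch below shows only the splitting step. `chunked` is a hypothetical helper, the chunk size of 256 is arbitrary, and whether splitting helps depends on the embedding backend.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Plain IDs stand in for Document objects to keep the example self-contained.
doc_ids = [f"doc_{i:04d}" for i in range(1000)]
chunks = list(chunked(doc_ids, 256))
print([len(c) for c in chunks])  # [256, 256, 256, 232]
```

Each chunk could then be passed to `create_embedding_batch`, with the resulting batches handed to `create_manifest` together as one list.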

Installation

RAG versioning requires optional dependencies:

pip install "briefcase-ai[rag]"

This installs:

  • pyarrow>=12.0.0 - Parquet support for embeddings
  • pinecone-client>=2.2.0 - Pinecone integration
  • weaviate-client>=3.22.0 - Weaviate integration
  • chromadb>=0.4.0 - ChromaDB integration
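
To confirm which optional packages are available in the current environment, a small check along these lines may help. The import names in `OPTIONAL_MODULES` are assumptions about how each listed package is imported, and `check_optional_dependencies` is a hypothetical helper rather than something `briefcase_ai` provides.

```python
import importlib.util
from typing import Dict

# Assumed mapping of package name -> top-level import name.
OPTIONAL_MODULES = {
    "pyarrow": "pyarrow",
    "pinecone-client": "pinecone",
    "weaviate-client": "weaviate",
    "chromadb": "chromadb",
}


def check_optional_dependencies() -> Dict[str, bool]:
    """Report which optional packages can be found, without importing them."""
    return {
        package: importlib.util.find_spec(module) is not None
        for package, module in OPTIONAL_MODULES.items()
    }


for package, available in check_optional_dependencies().items():
    print(f"{package}: {'installed' if available else 'missing'}")
```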

Integration Example

# ManifestStatus is assumed to be importable from briefcase_ai.rag
from briefcase_ai.rag import VersionedEmbeddingPipeline, Document, ManifestStatus
from typing import List

class RAGApplication:
    def __init__(self):
        self.pipeline = VersionedEmbeddingPipeline()
        self.current_manifest = None

    def ingest_documents(self, documents: List[Document]):
        batch = self.pipeline.create_embedding_batch(documents)
        self.current_manifest = self.pipeline.create_manifest(
            "main-index", [batch]
        )
        return self.current_manifest

    def retrieve(self, query: str, documents: List[Document]):
        # Check if retrieval index is current
        report = self.pipeline.check_invalidation("main-index", documents)
        if report.status != ManifestStatus.CURRENT:
            print(f"Rebuilding index: {report.reasons}")
            new_batch = self.pipeline.create_embedding_batch(documents)
            self.current_manifest = self.pipeline.create_manifest(
                "main-index", [new_batch]
            )

        # Perform retrieval with current embeddings
        # retrieve_similar is not defined in this example; supply your own vector search
        return self.retrieve_similar(query)
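
The integration example calls `retrieve_similar` without defining it. As a rough illustration of what such a step might do, the sketch below ranks stored vectors by cosine similarity in pure Python. Everything here is hypothetical: the function takes a query vector rather than a query string (so the query would need to be embedded first), and a real deployment would more likely delegate this to a vector store.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_similar(
    query_vector: Vector, index: Dict[str, Vector], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Return the top_k (doc_id, score) pairs, highest similarity first."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


index = {
    "doc_001": [1.0, 0.0],
    "doc_002": [0.0, 1.0],
    "doc_003": [1.0, 1.0],
}
results = retrieve_similar([1.0, 0.0], index, top_k=2)
print([doc_id for doc_id, _ in results])  # ['doc_001', 'doc_003']
```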