# RAG Versioning

Use RAG versioning to keep retrieval artifacts reproducible across document and embedding updates.
## Overview

This feature tracks document hashes, embedding batches, and manifest state so teams can detect invalidation quickly and rebuild retrieval indexes deterministically.
## What Engineers Use It For

- Snapshot document and embedding state as manifests
- Detect stale indexes after document or model changes
- Rebuild only the affected retrieval artifacts
- Keep retrieval behavior explainable during incident analysis
## Features

- **Manifest-Based Versioning** - Atomic snapshots of document and embedding state
- **Drift Detection** - Identify when documents, models, or both have changed
- **Invalidation Tracking** - Understand why manifests become stale
- **Efficient Rebuilding** - Rebuild indices with only the affected documents
- **Hash-Based Deduplication** - Automatic content hash tracking
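The content hashes used throughout this page follow a `sha256:<hex>` convention. As a minimal sketch, such a hash can be computed with the standard library (the `content_hash` helper below is illustrative, not part of briefcase-ai):

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a hash in the "sha256:<hex>" form used by Document.content_hash."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


# Identical content always produces the same hash, which is what makes
# hash-based deduplication and document drift detection possible.
h1 = content_hash("Vector embeddings encode semantic meaning...")
h2 = content_hash("Vector embeddings encode semantic meaning...")
assert h1 == h2
assert h1.startswith("sha256:")
```

Hashing the raw content (rather than comparing timestamps or file sizes) means two documents with identical text are detected as duplicates regardless of where they came from.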
## Quick Start

```python
from briefcase_ai.rag import VersionedEmbeddingPipeline, Document, ManifestStatus

# Create pipeline
pipeline = VersionedEmbeddingPipeline()

# Create documents
documents = [
    Document(
        id="doc_001",
        content="Retrieval Augmented Generation improves LLM accuracy...",
        metadata={"source": "blog", "date": "2024-01-15"},
        path="blog/rag-intro.md",
        content_hash="sha256:abc123",
    ),
    Document(
        id="doc_002",
        content="Vector embeddings encode semantic meaning...",
        metadata={"source": "docs", "date": "2024-01-20"},
        path="docs/embeddings.md",
        content_hash="sha256:def456",
    ),
]

# Create embedding batch
batch = pipeline.create_embedding_batch(documents)

# Create manifest snapshot
manifest = pipeline.create_manifest("my-index", [batch])

# Check invalidation
report = pipeline.check_invalidation("my-index", documents)
if report.status != ManifestStatus.CURRENT:
    print(f"Manifest is {report.status}: {report.reasons}")
```
## Core Components

### Document

Represents a document in the RAG system:

```python
class Document:
    id: str                    # Unique identifier
    content: str               # Document text content
    metadata: Dict[str, Any]   # Custom metadata (source, date, etc.)
    path: str                  # File path or URI
    content_hash: str          # SHA256 hash of content
```
Example:

```python
document = Document(
    id="policy_001",
    content="Claims handling procedure: 1. Submit form 2. Review 3. Approve",
    metadata={"policy_version": "2.1", "updated": "2024-02-01"},
    path="policies/claims.txt",
    content_hash="sha256:abc123def456",
)
```
### EmbeddingBatch

Contains embeddings for a set of documents:

```python
class EmbeddingBatch:
    documents: List[Document]      # Source documents
    embeddings: List[List[float]]  # Vector embeddings
    model_id: str                  # Embedding model used
    created_at: datetime           # Creation timestamp
```
### EmbeddingManifest

Atomic snapshot of embedding state:

```python
class EmbeddingManifest:
    id: str                          # Manifest ID
    batch_id: str                    # Reference to EmbeddingBatch
    status: ManifestStatus           # Current status
    document_hashes: Dict[str, str]  # doc_id -> content_hash mapping
    model_id: str                    # Embedding model at snapshot time
    created_at: datetime             # Creation timestamp
```
### ManifestStatus Enum

Indicates the current state of a manifest:

```python
class ManifestStatus(Enum):
    CURRENT          # All documents and models current
    STALE_DOCUMENTS  # Documents have changed
    STALE_MODEL      # Embedding model has changed
    STALE_BOTH       # Both documents and model changed
    REBUILDING       # Index rebuild in progress
```
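The staleness states above can be illustrated with a plain-Python sketch that compares a manifest's recorded state against the current state. This shows the decision logic only; it is not the library's implementation:

```python
from enum import Enum
from typing import Dict


class ManifestStatus(Enum):
    CURRENT = "current"
    STALE_DOCUMENTS = "stale_documents"
    STALE_MODEL = "stale_model"
    STALE_BOTH = "stale_both"
    REBUILDING = "rebuilding"


def classify(
    stored_hashes: Dict[str, str],
    current_hashes: Dict[str, str],
    stored_model: str,
    current_model: str,
) -> ManifestStatus:
    """Derive a staleness state by diffing stored vs. current state."""
    docs_changed = stored_hashes != current_hashes
    model_changed = stored_model != current_model
    if docs_changed and model_changed:
        return ManifestStatus.STALE_BOTH
    if docs_changed:
        return ManifestStatus.STALE_DOCUMENTS
    if model_changed:
        return ManifestStatus.STALE_MODEL
    return ManifestStatus.CURRENT


# A changed content hash alone yields STALE_DOCUMENTS
status = classify({"d1": "sha256:v1"}, {"d1": "sha256:v2"}, "m1", "m1")
assert status == ManifestStatus.STALE_DOCUMENTS
```

Separating `STALE_DOCUMENTS` from `STALE_MODEL` matters operationally: the former needs only the changed documents re-embedded, while the latter invalidates every vector in the index.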
## Key Operations

### create_embedding_batch

Create embeddings for a set of documents:

```python
batch = pipeline.create_embedding_batch(
    documents,
    batch_id="optional-batch-id",
    source_commit="optional-lakefs-commit-sha",
)
# Returns an EmbeddingBatch with vectors created from the documents
```
### create_manifest

Create an atomic snapshot of the current state:

```python
manifest = pipeline.create_manifest(
    "my-index",                   # index_name: str (required)
    [batch],                      # batches: List[EmbeddingBatch] (required)
    metadata={"version": "1.0"},  # optional metadata
)
# Returns an EmbeddingManifest capturing document hashes and model state
```
### check_invalidation

Detect whether a manifest is stale:

```python
report = pipeline.check_invalidation(
    "my-index",                              # index_name: str (required)
    current_documents,                       # current_documents: List[Document] (required)
    current_model="text-embedding-3-large",  # optional
    current_model_version="v3",              # optional
)
# Returns an InvalidationReport with:
# - status: ManifestStatus
# - reasons: List[str] - reasons for staleness
# - affected_documents: List[str] - IDs of changed documents
```
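Conceptually, the affected documents can be found by diffing the manifest's stored hash map against the hashes of the current documents. A hedged sketch of that diff (the `diff_documents` name and change-type labels are illustrative, not the library's internals):

```python
from typing import Dict


def diff_documents(
    stored: Dict[str, str],   # doc_id -> hash recorded in the manifest
    current: Dict[str, str],  # doc_id -> hash of the document right now
) -> Dict[str, str]:
    """Return doc_id -> change_type ("added", "modified", or "removed")."""
    changes: Dict[str, str] = {}
    for doc_id, h in current.items():
        if doc_id not in stored:
            changes[doc_id] = "added"
        elif stored[doc_id] != h:
            changes[doc_id] = "modified"
    for doc_id in stored:
        if doc_id not in current:
            changes[doc_id] = "removed"
    return changes


# Only d2 was modified and d3 is new; d1 is untouched
changes = diff_documents(
    {"d1": "sha256:a", "d2": "sha256:b"},
    {"d1": "sha256:a", "d2": "sha256:c", "d3": "sha256:d"},
)
assert changes == {"d2": "modified", "d3": "added"}
```

Because the comparison is hash-based, an unchanged document that was merely re-saved or moved through the pipeline again produces no diff entry.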
### rebuild_index

Rebuild the vector index after invalidation:

```python
new_batch = pipeline.rebuild_index(manifest)
# Re-embeds affected documents using the current model
# Returns an updated EmbeddingBatch
```
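The efficiency win comes from re-embedding only the documents flagged as affected and reusing cached vectors for everything else. A simplified sketch of that idea (the `embed` callable and cache shape are assumptions for illustration, not briefcase-ai API):

```python
from typing import Callable, Dict, List, Set


def selective_rebuild(
    documents: Dict[str, str],            # doc_id -> content
    cached: Dict[str, List[float]],       # doc_id -> previously computed vector
    affected: Set[str],                   # doc IDs from the invalidation report
    embed: Callable[[str], List[float]],  # embedding function
) -> Dict[str, List[float]]:
    """Re-embed only affected or uncached documents; reuse the rest."""
    vectors: Dict[str, List[float]] = {}
    for doc_id, text in documents.items():
        if doc_id in affected or doc_id not in cached:
            vectors[doc_id] = embed(text)
        else:
            vectors[doc_id] = cached[doc_id]
    return vectors


calls = []


def fake_embed(text: str) -> List[float]:
    calls.append(text)
    return [float(len(text))]


docs = {"d1": "unchanged", "d2": "edited text"}
cache = {"d1": [9.0], "d2": [5.0]}
vectors = selective_rebuild(docs, cache, {"d2"}, fake_embed)
assert calls == ["edited text"]   # only the affected document was re-embedded
assert vectors["d1"] == [9.0]     # the cached vector was reused
```

For a model change (`STALE_MODEL`), the same routine degenerates to a full rebuild, since every cached vector belongs to the old model.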
### InvalidationReport

```python
class InvalidationReport:
    status: ManifestStatus            # Why the manifest is invalid
    reasons: List[str]                # Human-readable reasons
    affected_documents: List[str]     # Document IDs that changed
    model_updated: bool               # Whether the embedding model changed
    document_updates: Dict[str, str]  # doc_id -> change_type
```

Example:

```python
report = pipeline.check_invalidation("my-index", current_documents)
print(f"Status: {report.status}")
for reason in report.reasons:
    print(f"  - {reason}")
if report.affected_documents:
    print(f"Affected docs: {report.affected_documents}")
```
## Usage Patterns

### Tracking Document Updates

```python
from briefcase_ai.rag import VersionedEmbeddingPipeline, Document, ManifestStatus

pipeline = VersionedEmbeddingPipeline()

# Initial documents
v1_docs = [
    Document(id="d1", content="Version 1 content", path="doc.txt",
             metadata={}, content_hash="sha256:v1hash")
]
v1_batch = pipeline.create_embedding_batch(v1_docs)
v1_manifest = pipeline.create_manifest("my-index", [v1_batch])

# Later, the document is updated
v2_docs = [
    Document(id="d1", content="Updated content", path="doc.txt",
             metadata={}, content_hash="sha256:v2hash")
]
report = pipeline.check_invalidation("my-index", v2_docs)
if report.status == ManifestStatus.STALE_DOCUMENTS:
    print("Document changed, need to re-embed")
    v2_batch = pipeline.create_embedding_batch(v2_docs)
    v2_manifest = pipeline.create_manifest("my-index", [v2_batch])
```
### Handling Model Updates

```python
# Detect when the embedding model changes
report = pipeline.check_invalidation(
    "my-index", current_docs,
    current_model="text-embedding-3-large",
    current_model_version="v4",
)
if report.status in (ManifestStatus.STALE_MODEL, ManifestStatus.STALE_BOTH):
    print(f"Embedding model updated: {report.reasons}")
    new_batch = pipeline.create_embedding_batch(current_docs)
    new_manifest = pipeline.create_manifest("my-index", [new_batch])
```
### Batch Document Processing

```python
# Process large document sets
documents = [Document(...) for _ in range(1000)]
batch = pipeline.create_embedding_batch(documents)
manifest = pipeline.create_manifest("bulk-index", [batch])

# Later, verify state
report = pipeline.check_invalidation("bulk-index", documents)
if not report.affected_documents:
    print("All documents current, index is fresh")
```
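For very large corpora it is often preferable to split documents into several batches rather than embedding them all at once; `create_manifest` already accepts a list of batches. A generic chunking helper (plain Python, not part of briefcase-ai):

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# e.g. 1000 documents embedded 256 at a time (sketch):
# batches = [pipeline.create_embedding_batch(chunk)
#            for chunk in chunked(documents, 256)]
# manifest = pipeline.create_manifest("bulk-index", batches)
```

Smaller batches keep memory bounded and, if an embedding call fails midway, limit the work that has to be retried.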
## Installation

RAG versioning requires optional dependencies:

```shell
pip install "briefcase-ai[rag]"
```

This installs:

- `pyarrow>=12.0.0` - Parquet support for embeddings
- `pinecone-client>=2.2.0` - Pinecone integration
- `weaviate-client>=3.22.0` - Weaviate integration
- `chromadb>=0.4.0` - ChromaDB integration
## Integration Example

```python
from typing import List

from briefcase_ai.rag import VersionedEmbeddingPipeline, Document, ManifestStatus


class RAGApplication:
    def __init__(self):
        self.pipeline = VersionedEmbeddingPipeline()
        self.current_manifest = None

    def ingest_documents(self, documents: List[Document]):
        batch = self.pipeline.create_embedding_batch(documents)
        self.current_manifest = self.pipeline.create_manifest(
            "main-index", [batch]
        )
        return self.current_manifest

    def retrieve(self, query: str, documents: List[Document]):
        # Check whether the retrieval index is current
        report = self.pipeline.check_invalidation("main-index", documents)
        if report.status != ManifestStatus.CURRENT:
            print(f"Rebuilding index: {report.reasons}")
            new_batch = self.pipeline.create_embedding_batch(documents)
            self.current_manifest = self.pipeline.create_manifest(
                "main-index", [new_batch]
            )
        # Perform retrieval with current embeddings
        # (retrieve_similar is application-specific and not shown here)
        return self.retrieve_similar(query)
```