Full RAG Pipeline: 4 Vector Stores, Hybrid Search, and Reranking in One Template
TL;DR: Version 0.2.2 of the full-stack AI agent template adds a complete production RAG pipeline with 4 swappable vector stores (Milvus, Qdrant, ChromaDB, pgvector), 4 embedding providers (OpenAI, Voyage, Gemini multimodal, SentenceTransformers local), hybrid search combining BM25 keyword matching with vector similarity via Reciprocal Rank Fusion, and optional reranking through Cohere’s API or local CrossEncoder models. Each of the 5 pipeline steps (parse, chunk, embed, store, search) is a pluggable abstraction configurable via environment variables such as VECTOR_STORE=qdrant, RAG_HYBRID_SEARCH=true, and EMBEDDING_MODEL=voyage-3. The reranking strategy retrieves 3x more results than needed and re-scores them, consistently improving precision without touching embeddings. Document versioning uses SHA256 content hashing to prevent duplicate chunks on re-ingestion. The RAG tool integrates with all 5 supported AI frameworks and includes source attribution with filename, page number, chunk number, and similarity score for agent citations.
You know the drill. You want to add RAG to your AI app. So you start: pick a vector database, write an embedding pipeline, figure out chunking, wire up retrieval, add it to your agent as a tool, build a frontend to manage documents…
Three weeks later you have a working prototype. Then someone asks “can we try Qdrant instead of Milvus?” and you realize your vector store is hardcoded in 14 places.
We just shipped v0.2.2 of our open-source full-stack AI template, and RAG was the biggest addition. Not a toy demo — a production pipeline with 4 vector stores, 4 embedding providers, hybrid search, reranking, document versioning, and a management dashboard. All configurable. All swappable.
Here’s what we built and why.
The Architecture: 5 Steps, Every One Configurable
Every RAG system does the same thing: parse → chunk → embed → store → search. The difference is how many decisions you have to make at each step.
In our template, each step is a pluggable abstraction:
```
Document Upload
│
├── Parse: PyMuPDF (default) | LlamaParse (130+ formats) | python-docx
│
├── Chunk: recursive (default) | markdown | fixed
│   └── chunk_size=512, overlap=50 (configurable via env vars)
│
├── Embed: OpenAI | Voyage | Gemini (multimodal) | SentenceTransformers (local)
│   └── dimensions auto-derived from model name
│
├── Store: Milvus | Qdrant | ChromaDB | pgvector
│
└── Search: vector | hybrid (BM25 + vector + RRF) | + reranking (Cohere | CrossEncoder)
```

You pick your stack during project generation. The template wires everything up. No glue code.
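To make the chunking defaults concrete, here is a minimal sketch of the simplest of the three strategies, fixed-size chunking with overlap. The function name `chunk_fixed` is hypothetical, and the sketch counts characters; the template may well measure `chunk_size` in tokens instead.

```python
def chunk_fixed(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = chunk_fixed("a" * 1000)
print([len(c) for c in chunks])  # [512, 512, 76]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.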
4 Vector Stores, 1 Interface
The biggest design decision was making vector stores swappable. We implemented BaseVectorStore with four backends:
```python
from abc import ABC, abstractmethod


class BaseVectorStore(ABC):
    @abstractmethod
    async def insert_document(self, collection_name: str, document: Document) -> None: ...

    @abstractmethod
    async def search(self, collection_name: str, query: str, limit: int = 4) -> list[SearchResult]: ...

    @abstractmethod
    async def delete_document(self, collection_name: str, document_id: str) -> None: ...

    @abstractmethod
    async def get_collection_info(self, collection_name: str) -> CollectionInfo: ...
```

Milvus: production-grade, runs as 3 Docker services (etcd + MinIO + Milvus). Best for large-scale deployments. Cosine similarity with IVF_FLAT indexing. See Milvus documentation for details.

Qdrant: single Docker service, great balance of performance and simplicity. Our default recommendation for most teams. See Qdrant documentation for details.

ChromaDB: embedded mode, zero Docker required. Perfect for prototyping and local development. Just pip install chromadb.

pgvector: uses your existing PostgreSQL. No new infrastructure. HNSW indexing. If you already have Postgres, this is the lowest-friction option.
Switching between them? One environment variable:
```
# In your .env:
VECTOR_STORE=qdrant  # or: milvus, chromadb, pgvector
```

The template handles connection strings, Docker services, schema creation, and index configuration automatically.
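One common way to back a switch like this is a small registry keyed by the environment variable. The sketch below is an illustration only: `STORE_REGISTRY`, `create_vector_store`, and the `InMemoryStore` stand-in are hypothetical names, and falling back to `qdrant` when the variable is unset is an assumption, not confirmed template behavior.

```python
import os


class InMemoryStore:
    """Stand-in backend; a real one would wrap a Milvus, Qdrant, ChromaDB, or pgvector client."""

    def __init__(self, name: str) -> None:
        self.name = name


# Hypothetical registry: backend name -> zero-argument constructor
STORE_REGISTRY = {
    name: (lambda name=name: InMemoryStore(name))
    for name in ("milvus", "qdrant", "chromadb", "pgvector")
}


def create_vector_store(env=os.environ) -> InMemoryStore:
    """Pick the backend named by VECTOR_STORE, failing loudly on typos."""
    name = env.get("VECTOR_STORE", "qdrant").strip().lower()  # default is an assumption
    try:
        return STORE_REGISTRY[name]()
    except KeyError:
        expected = ", ".join(sorted(STORE_REGISTRY))
        raise ValueError(f"Unknown VECTOR_STORE {name!r}; expected one of: {expected}") from None


store = create_vector_store({"VECTOR_STORE": "pgvector"})
print(store.name)  # pgvector
```

Because the rest of the code only talks to the shared interface, the lookup is the single place that needs to know which backend is active.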
Hybrid Search: Why Vector-Only Isn’t Enough
Pure vector search works well for semantic queries (“documents about building safety”). It fails on exact matches (“find contract #2024-0847”) because embeddings don’t preserve exact strings.
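The keyword side of that trade-off can be illustrated with a from-scratch Okapi BM25 scorer. This is a sketch for intuition, not the template's `_bm25_search`: the function names, the tokenizer, the `k1` and `b` values, and the sample documents are all made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Keep identifiers such as "2024-0847" together as a single token
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Contract #2024-0847 covers the warehouse lease renewal.",
    "Contract #2024-0848 covers the office lease renewal.",
    "Building safety guidelines for warehouse staff.",
]
scores = bm25_scores("find contract #2024-0847", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

The two contracts differ by a single digit, which an embedding may barely register, but the exact token `2024-0847` occurs in only one document, so BM25 ranks that document first.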
Our hybrid search combines both:
```python
async def retrieve(self, query: str, collection_name: str, limit: int = 5):
    # fetch_multiplier and should_rerank are set elsewhere in the class (not shown in this excerpt)
    # Step 1: Vector search (semantic), over-fetching so later stages have candidates to work with
    raw_results = await self.store.search(collection_name, query, limit=limit * fetch_multiplier)

    # Step 2: BM25 keyword search, fused with the vector results
    if self._hybrid_enabled:
        bm25_results = await self._bm25_search(query, collection_name, limit * fetch_multiplier)
        if bm25_results:
            raw_results = self._rrf_fuse(raw_results, bm25_results)

    # Step 3: Rerank (optional); fall back to the fused results when reranking is off
    results = raw_results
    if should_rerank and self.rerank_service:
        results = await self.rerank_service.rerank(query=query, results=raw_results, top_k=limit * 2)

    return results[:limit]
```

The fusion uses Reciprocal Rank Fusion (RRF), a simple but effective algorithm that combines rankings from multiple sources:
```python
@staticmethod
def _rrf_fuse(vector_results, bm25_results, k=60):
    scores = {}  # dedup key -> accumulated RRF score
    by_key = {}  # dedup key -> first result seen with that key
    for results in (vector_results, bm25_results):
        for rank, r in enumerate(results):
            key = r.content[:100]  # the first 100 characters identify a chunk
            scores[key] = scores.get(key, 0) + 1.0 / (k + rank + 1)
            by_key.setdefault(key, r)
    # Highest fused score first
    return [by_key[key] for key in sorted(scores, key=scores.get, reverse=True)]
```

Enable it with one env var: RAG_HYBRID_SEARCH=true.
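To see how the fusion behaves on concrete rankings, here is a self-contained sketch that fuses lists of document IDs instead of result objects. The function name `rrf_fuse` and the sample IDs are illustrative; the formula is the same `1 / (k + rank + 1)` with zero-based ranks.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs; each list is ordered best-first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_ranking = ["doc_a", "doc_b", "doc_c"]
bm25_ranking = ["doc_c", "doc_a", "doc_d"]
fused = rrf_fuse([vector_ranking, bm25_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

The fused order is doc_a, doc_c, doc_b, doc_d: documents that appear in both lists accumulate two terms and outrank documents found by only one retriever. Because RRF uses ranks and never raw scores, BM25 scores and cosine similarities need no normalization against each other.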
Reranking: The Quality Multiplier
Initial retrieval casts a wide net. Reranking narrows it down. We support two options:
Cohere Reranker (API): the fastest way to improve retrieval quality. Send your results + query, get them re-scored by a model trained specifically for relevance ranking:

```python
response = await self.client.rerank(
    query=query,
    documents=[result.content for result in results],
    model="rerank-v3.5",
    top_n=top_k,
)
```

CrossEncoder (local): runs a SentenceTransformers cross-encoder model locally. No API calls, no data leaves your infrastructure:

```python
pairs = [[query, result.content] for result in results]
scores = self.model.predict(pairs)  # Runs locally on CPU/GPU
```

The pipeline is: retrieve 3x more results than needed, rerank, return top-k. This consistently improves precision without touching your embeddings or vector store.
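The over-retrieve-then-rerank pattern can be sketched independently of any particular reranker. In the example below, `retrieve_and_rerank`, `toy_search`, and `toy_score` are hypothetical names; the toy scorer just counts shared words and stands in for a Cohere or CrossEncoder model.

```python
from typing import Callable


def retrieve_and_rerank(
    query: str,
    search: Callable[[str, int], list[str]],
    score: Callable[[str, str], float],
    top_k: int = 5,
    fetch_multiplier: int = 3,
) -> list[str]:
    """Fetch top_k * fetch_multiplier candidates, re-score each one, keep the best top_k."""
    candidates = search(query, top_k * fetch_multiplier)
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:top_k]


# Toy stand-ins for the first-stage retriever and the reranking model
corpus = ["fire exits", "building safety rules", "lunch menu", "safety training for building staff"]


def toy_search(query: str, limit: int) -> list[str]:
    return corpus[:limit]  # pretend the first-stage order is only roughly relevant


def toy_score(query: str, doc: str) -> float:
    return len(set(query.split()) & set(doc.split()))  # number of shared words


top = retrieve_and_rerank("building safety", toy_search, toy_score, top_k=2)
print(top)  # ['building safety rules', 'safety training for building staff']
```

With `top_k=2` and no over-fetching, the first stage would have returned "fire exits" and "building safety rules"; fetching more candidates gives the reranker the chance to promote the fourth document.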
Document Versioning: SHA256 Dedup
Re-ingesting a document shouldn’t create duplicates. Our pipeline uses content hashing:
```python
async def ingest_file(self, filepath, collection_name, replace=True):
    document = await self.processor.process_file(filepath)
    source_path = str(filepath)  # assumed: the source path is the path the file was ingested from

    # Check for existing version by source path or content hash
    existing_id = await self._find_existing_by_source(collection_name, source_path)
    if not existing_id:
        existing_id = await self._find_existing_by_hash(collection_name, document.metadata.content_hash)

    # Replace old chunks with new ones
    if existing_id:
        await self.store.delete_document(collection_name, existing_id)

    await self.store.insert_document(collection_name, document)
```

Google Drive sync? Same logic: changed files get re-embedded, unchanged files are skipped.
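The content hash itself is a few lines with the standard library. The sketch below shows the idea with a hypothetical in-memory `HashIndex`; the template stores the hash in document metadata and looks it up in the vector store instead.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hex SHA256 digest of the raw document bytes."""
    return hashlib.sha256(data).hexdigest()


class HashIndex:
    """Hypothetical in-memory map of content hash -> document ID."""

    def __init__(self) -> None:
        self._by_hash: dict[str, str] = {}

    def register(self, document_id: str, data: bytes) -> bool:
        """Return True if the content is new, False if an identical copy is already stored."""
        digest = content_hash(data)
        if digest in self._by_hash:
            return False
        self._by_hash[digest] = document_id
        return True


index = HashIndex()
print(index.register("doc-1", b"quarterly report"))     # True
print(index.register("doc-2", b"quarterly report"))     # False: same bytes, different ID
print(index.register("doc-3", b"quarterly report v2"))  # True
```

Hashing the content rather than the filename means a renamed but otherwise identical file is still recognized as a duplicate.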
4 Embedding Providers
| Provider | Model | Dimensions | API Key? |
|---|---|---|---|
| OpenAI | text-embedding-3-small | 1536 | Yes |
| Voyage | voyage-3 | 1024 | Yes |
| Gemini | gemini-embedding-exp-03-07 | 3072 | Yes |
| SentenceTransformers | all-MiniLM-L6-v2 | 384 | No (local) |
Dimensions are auto-derived from the model name — no manual configuration:
```python
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "voyage-3": 1024,
    "gemini-embedding-exp-03-07": 3072,
    "all-MiniLM-L6-v2": 384,
}
```

Gemini is the interesting one: it supports multimodal embeddings. Text and images in the same vector space. We use it for image description extraction from PDFs.
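A lookup like this is typically wrapped in a helper that fails early, since a wrong dimension only surfaces later as a schema or insert error in the vector store. The helper name `get_embedding_dimensions` below is hypothetical; the table is the one from this post.

```python
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "voyage-3": 1024,
    "gemini-embedding-exp-03-07": 3072,
    "all-MiniLM-L6-v2": 384,
}


def get_embedding_dimensions(model_name: str) -> int:
    """Look up the vector size for a model, failing early on unknown names."""
    try:
        return EMBEDDING_DIMENSIONS[model_name]
    except KeyError:
        known = ", ".join(sorted(EMBEDDING_DIMENSIONS))
        raise ValueError(f"Unknown embedding model {model_name!r}; known models: {known}") from None


print(get_embedding_dimensions("voyage-3"))  # 1024
```

Note that switching to a model with a different dimension generally means re-embedding existing documents, because vectors of different sizes cannot share a collection.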
The Agent Integration
RAG becomes an agent tool — search_knowledge_base — available to all 5 AI frameworks (Pydantic AI, LangChain, LangGraph, CrewAI, DeepAgents):
```python
async def search_knowledge_base(
    query: str,
    collection: str = "documents",
    collections: list[str] | None = None,  # Multi-collection search
    top_k: int = 5,
) -> str:
    """Search with automatic reranking & hybrid search if enabled."""
```

Results include source attribution: filename, page number, chunk number, and similarity score. The agent’s system prompt instructs it to cite sources with [1], [2] references.
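Since the tool returns a string, the attribution has to be rendered into text the agent can quote. The sketch below shows one plausible layout; the `SearchHit` dataclass, the `format_results` function, and the exact output format are assumptions, not the template's actual output.

```python
from dataclasses import dataclass


@dataclass
class SearchHit:
    """Hypothetical result shape carrying the attribution fields."""
    content: str
    filename: str
    page: int
    chunk: int
    score: float


def format_results(hits: list[SearchHit]) -> str:
    """Render hits as numbered blocks the agent can cite as [1], [2], ..."""
    if not hits:
        return "No relevant documents found."
    blocks = []
    for number, hit in enumerate(hits, start=1):
        header = f"[{number}] {hit.filename}, page {hit.page}, chunk {hit.chunk} (score {hit.score:.2f})"
        blocks.append(f"{header}\n{hit.content}")
    return "\n\n".join(blocks)


hits = [
    SearchHit("Fire doors must remain closed.", "handbook.pdf", 3, 7, 0.91),
    SearchHit("Inspections happen quarterly.", "policy.docx", 1, 2, 0.84),
]
print(format_results(hits))
```

Numbering the blocks in the tool output gives the model stable handles, so a citation like [2] in the answer maps back to a specific file, page, and chunk.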
Key Takeaways
- RAG is a pipeline of 5 decisions (parse, chunk, embed, store, search) — our template makes each one configurable without code changes
- Vector-only search misses exact matches — hybrid (BM25 + vector + RRF) catches both semantic and keyword queries
- Reranking is the cheapest quality improvement — 3x over-retrieve + rerank consistently beats tuning embeddings
- Document versioning prevents duplicate chunks — SHA256 content hash + source path tracking
- One env var per component switches everything: VECTOR_STORE=pgvector, RAG_HYBRID_SEARCH=true, EMBEDDING_MODEL=voyage-3
If you’re choosing which AI framework to pair with your RAG pipeline, our framework comparison guide covers the trade-offs. And to get a full production stack running quickly, see how to ship a production AI app fast.