Introduction to Production RAG
Retrieval-Augmented Generation (RAG) is easy to prototype but notoriously hard to push to production. When moving from a Jupyter notebook to an enterprise SaaS, you face challenges around multi-tenancy, chunking strategies, vector database latency, and embedding cost.
At VegaRAG, we built our entire vector ingestion and retrieval pipeline on standard AWS architectural primitives to ensure it scaled horizontally.
1. The Ingestion Pipeline
Whenever a user uploads a PDF or website URL to VegaRAG, the document lands in S3 first. This acts as our source of truth and audit log. The upload asynchronously triggers our ingestion queue, where a worker parses the file with pypdf or BeautifulSoup, depending on the MIME type.
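A minimal sketch of that dispatch step (the helper name and lazy imports are illustrative, not our production worker):

```python
from io import BytesIO


def extract_text(payload: bytes, mime_type: str) -> str:
    """Route a raw document to the right parser based on its MIME type."""
    if mime_type == "application/pdf":
        # Imported lazily so workers that never see PDFs skip the dependency.
        from pypdf import PdfReader

        reader = PdfReader(BytesIO(payload))
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if mime_type == "text/html":
        from bs4 import BeautifulSoup

        soup = BeautifulSoup(payload, "html.parser")
        return soup.get_text(separator="\n", strip=True)
    raise ValueError(f"unsupported MIME type: {mime_type}")
```

Unsupported types fail loudly rather than producing empty chunks, which keeps bad uploads visible in the queue's dead-letter path.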
We chose the RecursiveCharacterTextSplitter from LangChain with a chunk size of 1000 characters and 200 characters of overlap. This provides enough semantic context for Amazon Bedrock Nova to formulate an answer while fitting neatly into the embedding model's context window.
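Setting aside LangChain's recursive separator logic, the sliding-window effect of a 1000/200 configuration can be sketched in plain Python (this stand-in is illustrative, not the splitter itself):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks, each repeating the last `overlap`
    characters of its predecessor so no sentence is cut off without context."""
    if not text:
        return []
    step = chunk_size - overlap  # advance 800 chars per chunk
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]
```

The overlap is what lets a fact straddling a chunk boundary still appear whole in at least one chunk.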
2. Amazon Titan Embeddings v2
We deliberately chose Amazon Titan Text Embeddings V2 over OpenAI's text-embedding-3-large. Not only does it keep our data exclusively within our AWS environment (simplifying SOC 2 compliance), but its invocation latency via the Bedrock runtime is consistently low.
```python
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

response = bedrock_runtime.invoke_model(
    modelId="amazon.titan-embed-text-v2:0",
    body=json.dumps({"inputText": chunk_text}),
)
embedding = json.loads(response["body"].read())["embedding"]
```

3. Pinecone Serverless and Multi-Tenancy
We use Pinecone Serverless for our vector database. The biggest engineering decision was how to isolate data between organizations. Instead of provisioning separate indexes per tenant (which introduces massive overhead and cold starts), we use a single global index and rely on Pinecone Namespaces.
When user A queries their bot (bot_123), the request hits only the namespace bot_123. This guarantees a strict data boundary and speeds up search, since only the vectors in that namespace are scanned.
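Assuming an index handle obtained from the Pinecone Python client, the routing rule reduces to passing the bot id as the `namespace` argument (the helper is illustrative):

```python
def query_bot(index, bot_id: str, query_embedding: list[float], top_k: int = 5):
    """Query only the tenant's namespace; vectors belonging to other
    tenants live in other namespaces and are never scanned."""
    return index.query(
        namespace=bot_id,  # e.g. "bot_123" -- the bot id doubles as the namespace
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True,
    )
```

Because the namespace is derived server-side from the authenticated bot, a client can never request another tenant's namespace.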
Conclusion
Combining S3, Amazon Bedrock, and Pinecone Serverless delivers a highly scalable, isolated, and low-latency RAG infrastructure. By keeping components loosely coupled, we are able to iteratively improve our chunking without impacting the API surface area.