The Tabular Data Problem in AI
Passing a massive CSV into an LLM context window is a recipe for hallucinations and token overload. Vector-based RAG (embedding strings) is equally poor for numbers: a question like "What was the average revenue in Q3?" requires a mathematical aggregation, not cosine similarity.
Enter DuckDB: In-Process Analytical SQL
To solve this, VegaRAG implements a Text-to-SQL pipeline powered by DuckDB. DuckDB is an in-process analytical database that can query raw data files natively from Python, with no hosted PostgreSQL cluster to deploy.
1. Schema Registration
When a user uploads a CSV or Excel file, we archive it in S3 and briefly parse it with Pandas to extract the column names and data types. We save this schema definition to DynamoDB under the partition key AGENT#bot_id.
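The registration step above can be sketched as follows. This is a minimal illustration, not the production code: the function names, the sort-key layout (`TABLE#<name>`), and the item shape are assumptions; only the `AGENT#bot_id` partition key comes from the text, and the actual write would go through boto3's `put_item`.

```python
import io

import pandas as pd


def extract_schema(csv_bytes: bytes, nrows: int = 100) -> list[dict]:
    """Infer column names and dtypes from the first rows of an uploaded CSV.

    Only a small sample is parsed; the full file stays archived in S3.
    """
    df = pd.read_csv(io.BytesIO(csv_bytes), nrows=nrows)
    return [{"name": col, "dtype": str(dtype)} for col, dtype in df.dtypes.items()]


def build_schema_item(bot_id: str, table_name: str, s3_uri: str,
                      columns: list[dict]) -> dict:
    """Shape the DynamoDB item stored under the AGENT#<bot_id> partition key.

    The TABLE#<name> sort key is a hypothetical layout for this sketch.
    """
    return {
        "PK": f"AGENT#{bot_id}",
        "SK": f"TABLE#{table_name}",
        "s3_uri": s3_uri,
        "columns": columns,
    }
```

Sampling with `nrows` keeps the parse cheap even for very large uploads, since the schema (not the data) is all the SQL generator needs.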
2. LLM SQL Generation
When the LangGraph router classifies a query as the sql intent, it fetches all schemas associated with the bot and prompts Bedrock Nova to generate a syntactically correct DuckDB SQL statement:
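A sketch of how such a prompt might be assembled from the stored schemas. The wording and the schema-dict shape (`s3_uri` plus a `columns` list) are assumptions for illustration, not VegaRAG's actual prompt:

```python
def build_sql_prompt(question: str, schemas: list[dict]) -> str:
    """Assemble a Text-to-SQL prompt from schema definitions registered at upload.

    Each schema dict is assumed to hold the table's S3 URI and its column list.
    """
    lines = ["You write DuckDB SQL. Tables are raw CSVs read via read_csv_auto."]
    for schema in schemas:
        cols = ", ".join(f"{c['name']} ({c['dtype']})" for c in schema["columns"])
        lines.append(f"Table read_csv_auto('{schema['s3_uri']}'): {cols}")
    lines.append(f"Question: {question}")
    lines.append("Return only a syntactically correct DuckDB SQL statement.")
    return "\n".join(lines)
```

Spelling out the `read_csv_auto('s3://…')` form in the prompt nudges the model to emit table references the executor can run verbatim.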
```sql
SELECT department, SUM(salary)
FROM read_csv_auto('s3://bucket/bot_123/tables/employees.csv')
GROUP BY department;
```
3. Execution and Natural Language Formatting
We execute the generated SQL against the raw CSV in S3 via DuckDB's HTTPFS extension. Once the result rows are returned (e.g., [("Sales", 5000000), ("Engineering", 6000000)]), we pass them back to Bedrock in the streaming chat endpoint, instructing the model to render the table cleanly for the end user.
This keeps analytical SQL execution safely isolated in our backend while giving the user a near-instant, plain-English answer over massive spreadsheet data.