Technical Deep Dive

Full Technical Breakdown

Every component, data flow, design decision, and architectural trade-off explained in precise detail — from DNS resolution to Bedrock token streaming.

Infrastructure

AWS Cloud Infrastructure

Three independent ECS Fargate services sit behind a single Application Load Balancer. Traffic is split by URL path prefix — no EC2 instances, no SSH access, no hardcoded credentials anywhere in the stack.

ALB Listener Rules — How Traffic Is Split

The ALB has a single HTTPS listener on port 443 with three forwarding rules evaluated in strict priority order. When a request arrives, the ALB walks through the rules top-to-bottom and sends the request to the first matching target group.

Priority 1: /api/* → TG-Backend (port 8000)

FastAPI — handles all AI, ingestion, and config API calls

Priority 2: /chat/* → TG-ChatUI (port 3001)

Next.js Chat UI — serves the chat interface and its proxy routes

Priority 3: /* → TG-Frontend (port 3000)

Default catch-all — Next.js dashboard, landing page, all marketing pages
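The first-match behaviour of the three rules can be illustrated with a small Python sketch. This is only a model of the routing logic, not ALB configuration; the function name pick_target_group and the use of fnmatch-style wildcards are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# (priority, path pattern, target group), mirroring the three listener rules.
LISTENER_RULES = [
    (1, "/api/*", "TG-Backend"),
    (2, "/chat/*", "TG-ChatUI"),
    (3, "/*", "TG-Frontend"),
]


def pick_target_group(path: str) -> str:
    """Return the target group of the lowest-numbered rule whose pattern matches."""
    for _priority, pattern, target in sorted(LISTENER_RULES):
        if fnmatchcase(path, pattern):
            return target
    raise LookupError(f"no rule matched {path!r}")


if __name__ == "__main__":
    for path in ("/api/chat", "/chat/api/langgraph/info", "/pricing"):
        print(path, "->", pick_target_group(path))
```

Because the catch-all rule has the highest priority number, it is only reached when neither prefix rule matches.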

Route 53 + ACM (TLS)

The custom domain resolves via Route 53 Alias record pointing directly at the ALB DNS name. An ACM wildcard certificate covers *.vegarag.com and is auto-attached to the HTTPS listener. Renewal is fully automatic — no manual certificate rotation. HTTP on port 80 is permanently redirected to HTTPS at the ALB layer, so no plaintext traffic ever reaches a container.

VPC + Security Groups

All three Fargate tasks share one security group. Inbound rules allow ports 80 and 443 (ALB), plus 3000, 3001, and 8000 for ALB health check probes. No direct public internet access to container ports — all traffic flows through the ALB. Tasks use 'assignPublicIp: ENABLED' purely so they can pull images from ECR over the internet gateway; no inbound connections can initiate to the task directly.

Fargate + ECR

Each service has its own ECR repository. On deploy, the Docker image is built locally, pushed to ECR, and a new ECS task definition revision is registered. ECS then performs a rolling update — spinning up the new task, waiting for ALB health checks to pass, then draining and killing the old task. Zero-downtime deployments by default.

Task Definition Specs

Backend Service
CPU: 512 CPU units (0.5 vCPU)
Memory: 1024 MB
Port: 8000
IAM Role: ecsTaskExecutionRole (Bedrock, DynamoDB, S3, ECR)
Health Check: GET /health → 200 OK
Log Group: /ecs/vegarag-backend

Frontend Service
CPU: 256 CPU units (0.25 vCPU)
Memory: 512 MB
Port: 3000
IAM Role: ecsTaskExecutionRole (ECR pull only)
Health Check: GET / → 200 OK
Log Group: /ecs/vegarag-frontend

Chat UI Service
CPU: 512 CPU units (0.5 vCPU)
Memory: 1024 MB
Port: 3001
IAM Role: ecsTaskExecutionRole (ECR pull only)
Health Check: GET /chat → 200 OK
Log Group: /ecs/vegarag-chat-ui

Orchestration

LangGraph StateGraph Agent

Every chat request runs through a compiled LangGraph StateGraph: a directed graph of nodes and conditional edges (acyclic in this design). State flows through the nodes as a TypedDict, and each node returns a partial update that LangGraph merges into the shared state, which makes every step of the agent easy to trace and debug.

What GraphState Contains

Every node in the graph reads from and writes to a shared state object. This state is passed forward through the graph — no global variables, no shared memory between requests. Each chat invocation gets its own isolated state.

    query

    The raw user message as typed in the chat input.

    bot_id

    Which agent is responding — determines which Pinecone namespace, system prompt, and DynamoDB records to use.

    session_id

    Identifies the conversation thread — used to group activity records for history.

    intent

    Populated after the intent router runs: exactly one of 'casual', 'rag', or 'sql'.

    retrieved_context

    Populated by the RAG retriever — the top-5 document chunks concatenated as a single string.

    sql_result

    Populated by the SQL executor — the DuckDB query result formatted as a markdown table.

    final_response

    Set after Bedrock response generation — the full streamed AI reply.
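As a rough sketch, the state described above could be declared as a TypedDict like the one below. The field names come from the list above; the Optional typing and the initial_state helper are assumptions added for illustration.

```python
from typing import Literal, Optional, TypedDict


class GraphState(TypedDict):
    query: str                                         # raw user message
    bot_id: str                                        # selects namespace, prompt, records
    session_id: str                                    # conversation thread
    intent: Optional[Literal["casual", "rag", "sql"]]  # set by the intent router
    retrieved_context: Optional[str]                   # set by the RAG retriever
    sql_result: Optional[str]                          # set by the SQL executor
    final_response: Optional[str]                      # set after response generation


def initial_state(query: str, bot_id: str, session_id: str) -> GraphState:
    """Hypothetical helper: build the per-request state with nothing populated yet."""
    return GraphState(
        query=query,
        bot_id=bot_id,
        session_id=session_id,
        intent=None,
        retrieved_context=None,
        sql_result=None,
        final_response=None,
    )
```

Each chat invocation would start from its own initial_state(...) value, which is what keeps requests isolated from one another.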

The 5-Node Graph Topology

START
Entry

LangGraph's built-in START node. Receives the initial state and immediately passes it to the intent router. No logic here — just the graph entry point.

01
Intent Router Node

Calls Bedrock Nova Lite with the user's query and a strict JSON schema. The LLM must return exactly one of three intents. Uses temperature=0 for maximum determinism. The schema constrains the output format, so free-text or malformed answers never reach the router; the model can still misclassify an ambiguous query.

Conditional Edge

A routing function reads the intent from state and returns a node name as a string. LangGraph uses this to decide the next node. This is the only branching point in the entire graph.

02a
RAG Retriever Node

Only runs if intent='rag'. Embeds the query with Titan v2, queries Pinecone, and appends the top-5 chunks to state as retrieved_context.

02b
SQL Executor Node

Only runs if intent='sql'. Fetches table schemas from DynamoDB, asks Nova Lite to generate DuckDB SQL, executes it against the S3 CSV via HTTPFS, and appends the result table to state.

03
Response Generation Node

All three branches converge here. Injects context (RAG chunks or SQL table, or nothing for casual) into the Bedrock Nova Pro prompt via XML <context> markers and streams tokens back via SSE.
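The topology above can be sketched without LangGraph as a routing function plus a tiny driver. This is a plain-Python stand-in for the conditional edge, not the LangGraph API; the node names (intent_router, rag_retriever, sql_executor, generate_response) are hypothetical.

```python
from typing import Callable, Dict

State = Dict[str, object]
Node = Callable[[State], State]

# Hypothetical node names; the intent values are the three named in the text.
ROUTES = {
    "rag": "rag_retriever",
    "sql": "sql_executor",
    "casual": "generate_response",
}


def route_by_intent(state: State) -> str:
    """Conditional-edge function: map the classified intent to the next node name."""
    return ROUTES[str(state["intent"])]


def run_graph(state: State, nodes: Dict[str, Node]) -> State:
    """Run router, then at most one branch node, then response generation."""
    state = {**state, **nodes["intent_router"](state)}
    next_node = route_by_intent(state)
    if next_node != "generate_response":
        state = {**state, **nodes[next_node](state)}
    return {**state, **nodes["generate_response"](state)}
```

Each node returns only the keys it changed, and the driver merges them into a new dict, which mirrors how partial state updates flow forward through the graph.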

Why LangGraph Instead of a Simple if/else Chain?

Deterministic Execution

LangGraph compiles the graph into a static execution plan. Every run follows the same deterministic path based on state — no hidden branching, no surprise side effects between requests.

Observability

Each node and edge is instrumented. LangSmith can trace every invocation — which node ran, what state looked like entering and exiting, how long each step took. Critical for debugging production issues.

Extensibility

Adding a new capability (e.g., a web search node) requires adding one node and one conditional edge. No need to refactor the entire flow — the graph handles routing.

Retrieval

RAG Ingestion + Retrieval Pipeline

Every agent gets its own isolated Pinecone namespace. Documents are chunked, embedded with Amazon Titan v2, and upserted at ingestion time. At query time, the user's question is embedded and compared against all stored vectors using approximate nearest-neighbour search.

Ingestion — How Documents Enter the System

1
Source Loading

URLs are loaded with LangChain's WebBaseLoader, which fetches the raw HTML and parses it with BeautifulSoup to extract the page text (it does not execute JavaScript, so client-rendered content is not captured). PDFs are processed with PyPDFLoader, which extracts text page by page. CSVs and Excel files skip the RAG pipeline entirely and go to the SQL engine instead.

2
Text Chunking

Documents are split using RecursiveCharacterTextSplitter with chunk_size=1000 characters and chunk_overlap=200. With its default separators, the splitter tries paragraph boundaries first (\n\n), then line breaks, then spaces between words, and finally individual characters, working from the largest separator to the smallest until chunks fit. The 200-character overlap preserves context across chunk edges.
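To make the size and overlap parameters concrete, here is a deliberately simplified fixed-window chunker. It is not RecursiveCharacterTextSplitter (it ignores separators entirely); it only shows how chunk_size=1000 and chunk_overlap=200 interact. The function name chunk_text is an assumption.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size windows where each window re-reads the last
    chunk_overlap characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks
```

With the defaults, each new chunk starts 800 characters after the previous one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.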

3
Embedding Generation

Each chunk is passed separately to Amazon Titan Embed Text v2 via Bedrock. The model returns a 1024-dimensional dense vector by default (256 and 512 dimensions are also supported), with each dimension a float32 value capturing a different semantic feature of the text. Titan v2 supports an 8,192 token input window, so even large chunks fit in a single embedding call.

4
Pinecone Upsert

Vectors are upserted to Pinecone Serverless in batches. Each vector includes the raw chunk text and source URL as metadata. The namespace parameter is always set to the bot_id, which partitions all vectors between agents. Because the namespace is set by the backend and never taken from the prompt, a prompt injection attack cannot redirect retrieval to another tenant's vectors.

Retrieval — How Context Is Found

1
Query Embedding

The user's raw query string is sent to Titan Embed Text v2 through the same pipeline as ingestion. This produces a query vector with the same dimensionality, in the exact same vector space, as the stored document vectors, which is a prerequisite for meaningful similarity comparison.

2
ANN Search

Pinecone runs Approximate Nearest Neighbour (ANN) search. Graph-based ANN indexes such as HNSW build a hierarchical graph of vectors at index time and traverse it at query time, trading a small amount of recall for large speed gains over exact brute-force search. That trade-off is what keeps p99 latency under 50 ms even at millions of vectors.

3
Top-K Selection

Pinecone returns the 5 most similar vectors by cosine similarity score. Top-5 is deliberately small — sending too much context fills the Bedrock prompt context window and causes the LLM to 'average' the information instead of focusing on the most relevant chunks. 5 chunks at ~1000 chars each = ~5,000 chars of context.
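For intuition, the ranking that Pinecone approximates can be written as an exact brute-force search in a few lines. This sketch is not how Pinecone executes queries; it only shows what "top-5 by cosine similarity" means. The names cosine_similarity and top_k are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; magnitude is ignored."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    stored: Sequence[Tuple[str, Sequence[float]]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec)) for chunk_id, vec in stored]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Brute force scores every vector on every query, which is exactly the cost an ANN index avoids at the price of occasionally missing a true neighbour.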

4
Context Assembly

The 5 retrieved chunk texts are joined with a '---' separator and set as retrieved_context in the graph state. The response generation node then wraps this in <context>...</context> XML tags when constructing the Bedrock prompt — a clear structural signal to the LLM that this is external reference material, not part of the conversation.
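A minimal sketch of the assembly step follows, assuming the separator is placed on its own line and the user query follows the context block. The exact prompt layout is not given in the text, so both function names and the layout are assumptions.

```python
from typing import List, Optional


def assemble_context(chunks: List[str]) -> str:
    """Join retrieved chunk texts with a '---' separator line."""
    return "\n---\n".join(chunk.strip() for chunk in chunks)


def build_user_prompt(query: str, retrieved_context: Optional[str]) -> str:
    """Wrap context in <context> tags; casual queries get no context block."""
    if not retrieved_context:
        return query
    return f"<context>\n{retrieved_context}\n</context>\n\n{query}"
```

Keeping the tags out of assemble_context means the same wrapper can be reused for the SQL result table on the sql branch.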

Technical Specs

Embedding Model: Amazon Titan Embed Text v2 (amazon.titan-embed-text-v2:0)
Vector Dimension: 1024 float32 values per vector (model default)
Input Window: 8,192 tokens per embedding call
Similarity Metric: Cosine similarity (angle between vectors, not magnitude)
Index Type: HNSW (Hierarchical Navigable Small World graph)
Retrieval Latency: Sub-50ms p99 for indexes up to 10M vectors
Tenant Isolation: Pinecone namespace = bot_id, enforced on every query

SQL Analytics

Text-to-SQL with DuckDB

RAG is fundamentally broken for aggregate math. When intent is classified as 'sql', VegaRAG generates SQL against the user's uploaded CSV and Excel tables and executes it in-memory with DuckDB — no separate database server, no ETL, no persistence overhead.

Why RAG Fails for Numeric Questions

RAG retrieves the most similar text chunks to a question. For "What was total revenue in Q3?", it will retrieve chunks that mention revenue or Q3 — but those chunks are raw text fragments. Summing numbers scattered across text fragments is something embedding similarity cannot do. The LLM then has to hallucinate an aggregate from partial evidence.

SQL, by contrast, operates on the actual numbers. A SUM(revenue) WHERE quarter='Q3' query returns the exact answer from every row — no inference, no approximation, no hallucination risk on the numeric result.

SQL Generation — How Nova Writes the Query

Before calling the LLM, VegaRAG fetches the table schema for this agent from DynamoDB. The schema record stores the column names, types, and S3 URI for every uploaded table. This schema is formatted into a natural-language prompt that tells Nova exactly what tables exist and what their columns are called.

Nova Lite is instructed with strict rules: return only valid DuckDB SQL, use single quotes for strings, use ILIKE for case-insensitive matching, reference the S3 URI directly in the FROM clause so DuckDB can read it via HTTPFS. Temperature is set to 0 for maximum determinism — the same question should always produce the same SQL.
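The schema-to-prompt step might look roughly like the sketch below. The prompt wording, the function name build_sql_prompt, and the shape of the tables argument are assumptions; only the rules themselves (DuckDB SQL only, single quotes, ILIKE, S3 URI in FROM) come from the text.

```python
from typing import Dict, List


def build_sql_prompt(question: str, tables: List[Dict]) -> str:
    """Render table schemas and generation rules into one prompt string.

    Each table dict is assumed to have 'name', 's3_uri', and 'columns',
    where 'columns' is a list of (column_name, column_type) pairs.
    """
    lines = [
        "You write DuckDB SQL. Return only one valid SQL statement.",
        "Use single quotes for strings and ILIKE for case-insensitive matching.",
        "Reference each table by its S3 URI in the FROM clause.",
        "",
        "Available tables:",
    ]
    for table in tables:
        cols = ", ".join(f"{name} ({dtype})" for name, dtype in table["columns"])
        lines.append(f"- {table['name']} at '{table['s3_uri']}' with columns: {cols}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)
```

Putting the column names and types in the prompt is what lets the model write a query without ever seeing the data itself.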

A regex post-processor strips any markdown code fence characters before the SQL is executed — a defensive measure against models wrapping output in backtick blocks despite being told not to.
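A post-processor of this kind can be sketched as follows. The exact pattern used by VegaRAG is not shown in the text, so this regex and the name strip_code_fences are assumptions; the fence marker is built from single backticks only to keep the example easy to embed.

```python
import re

FENCE = "`" * 3  # three backticks, the markdown code fence marker

# Matches an opening fence with optional language tag, or a closing fence.
_FENCE_RE = re.compile(rf"^\s*{FENCE}[a-zA-Z]*[ \t]*\n?|\n?[ \t]*{FENCE}\s*$")


def strip_code_fences(sql: str) -> str:
    """Remove a leading and trailing markdown fence from model output."""
    return _FENCE_RE.sub("", sql).strip()
```

Output that was never fenced passes through unchanged apart from whitespace trimming, so the function is safe to apply unconditionally.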

DuckDB Execution — Why No Database Server

    HTTPFS from S3

    DuckDB's HTTPFS extension reads CSV and Parquet files directly from S3 URLs using signed requests. The file is streamed in chunks — never fully downloaded to the container's local disk. This means no storage overhead per query.

    Columnar Execution

    DuckDB uses a vectorised columnar execution engine that processes data in batches of 2,048 values per column at a time, which lets the CPU apply SIMD instructions. For GROUP BY and SUM aggregations, this is typically far faster than row-by-row processing in Python.

    In-Memory, Stateless

    Each SQL execution spawns a fresh DuckDB connection object, runs the query, returns the result as a DataFrame, and the connection is garbage-collected. No persistent database state between requests. No connection pooling needed. No locking.

    Full SQL Dialect

    DuckDB supports window functions (RANK, LAG, LEAD), CTEs, PIVOT, UNNEST for JSON arrays, ASOF joins for time-series, and automatic type casting. Essentially a full analytical SQL engine that fits in a Python import.

Schema Record in DynamoDB

When a CSV is uploaded, VegaRAG reads the header row, infers column types, and writes a schema record to DynamoDB. This record is what the SQL executor reads at query time — it never re-scans the file to learn the structure.

PK: AGENT#bot_8159fbf0
SK: TABLE#sales_data.csv
s3_uri: s3://vegarag-data/bot_8159fbf0/sales_data.csv
columns: date (DATE), product (VARCHAR), revenue (DECIMAL), quantity (INTEGER)
row_count: 14,832 rows indexed at upload time

Real-time

Server-Sent Events (SSE) Streaming

Tokens stream from Amazon Bedrock into a FastAPI AsyncIterator, get re-wrapped by the Next.js proxy into LangGraph-format SSE events, and land in the browser DOM via the LangGraph React SDK — no buffering, instant time-to-first-token.

Step 1 — FastAPI SSE Response

The FastAPI /api/chat endpoint returns a StreamingResponse with media_type="text/event-stream". This tells the HTTP client not to wait for the response body to complete before reading — it can consume each newline-delimited chunk as it arrives.

Bedrock's invoke_model_with_response_stream returns an EventStream object. FastAPI iterates this stream asynchronously — each iteration yields one chunk from Bedrock's token buffer. Each chunk contains a delta object with a "text" field containing one or more characters.

Each token is immediately formatted as an SSE event: data: {"text": "token"}. A final data: [DONE] event signals stream end. Response headers disable all caching and set Connection: keep-alive to prevent ALB timeout during long responses.
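The framing described above can be sketched as a plain generator. In the real service this would be wrapped in a FastAPI StreamingResponse and fed from Bedrock; here the token source is just an iterable, and the name sse_events is an assumption.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Frame each token as one SSE event, then emit the [DONE] sentinel.

    An SSE event is terminated by a blank line, hence the double newline.
    """
    for token in tokens:
        payload = json.dumps({"text": token})
        yield f"data: {payload}\n\n"
    yield "data: [DONE]\n\n"


if __name__ == "__main__":
    for event in sse_events(["Hel", "lo"]):
        print(repr(event))
```

Using json.dumps for the payload ensures that quotes and newlines inside a token cannot break the event framing.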

Step 2 — Next.js Proxy Re-wrap

The LangGraph SDK doesn't know how to consume VegaRAG's raw SSE format — it expects a specific event type called values containing a full LangGraph message array. The Next.js proxy bridges this gap.

For every token received from the backend, the proxy constructs a new complete message array (human message + accumulated AI text so far), serialises it as JSON, and emits it as an SSE event with event: values. Before the first token, it emits an event: metadata with the run ID.

This means the browser receives N events for N tokens — each containing the full accumulated message. The SDK's React hook updates its internal state on every event, triggering a React re-render that appends the new token to the DOM. The visual effect is identical to ChatGPT's streaming output.
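The re-wrap logic is implemented in the Next.js proxy, but its shape is easy to show in Python. This is an illustration of the accumulation pattern only; the message dictionary layout and the name rewrap_as_values are assumptions, not the SDK's exact schema.

```python
import json
from typing import Iterable, Iterator


def rewrap_as_values(user_text: str, tokens: Iterable[str], run_id: str) -> Iterator[str]:
    """Emit one metadata event, then one values event per token carrying the
    full message array with the AI text accumulated so far."""
    metadata = json.dumps({"run_id": run_id})
    yield f"event: metadata\ndata: {metadata}\n\n"
    accumulated = ""
    for token in tokens:
        accumulated += token
        messages = [
            {"type": "human", "content": user_text},
            {"type": "ai", "content": accumulated},
        ]
        payload = json.dumps({"messages": messages})
        yield f"event: values\ndata: {payload}\n\n"
```

Sending the full array each time costs more bytes than sending deltas, but it lets the client replace its state wholesale instead of tracking partial updates.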

Step 3 — LangGraph SDK React Hook

The useStream hook from @langchain/langgraph-sdk/react manages the entire client-side streaming lifecycle. When the user submits a message, it: creates or reuses a thread ID, POSTs to the proxy's /runs/stream endpoint, and opens a ReadableStream on the response body.

The hook's internal reducer processes each incoming SSE event, merges the message array into the component's React state via an immutable update, and triggers a re-render. The UI component reads stream.messages, which updates after every token; React's update batching keeps rendering smooth even on fast streams.

When the stream ends, the hook automatically fires a GET /threads/{id}/state request to hydrate the final canonical message state from DynamoDB — ensuring tool call messages and metadata survive beyond the stream lifetime.

SSE Event Wire Format — 4 Event Types

metadata: First event, before any tokens

Carries the run_id (UUID). Used by the SDK to correlate this stream with a specific LangGraph run.

values: Once per token (N events total)

Carries the full updated message array. The AI message content field grows by one token each event.

error: On backend failure

Carries an error object. SDK surfaces this as an error toast in the UI and stops rendering.

[DONE]: Last event in stream

Plain SSE data with literal string [DONE]. Signals the proxy to close the ReadableStream controller.

Database

DynamoDB Single-Table Design

Every piece of VegaRAG's persistent state — agents, configs, data sources, chat logs, and analytics — lives in one DynamoDB table. Composite primary keys enable O(1) access patterns with zero SQL joins.

Why Single-Table Design?

DynamoDB charges for read and write capacity units and has no join operation. Multiple tables would require multiple round-trips to assemble related data. Single-table design co-locates related items under a shared partition key (for example, everything stored under AGENT#{bot_id}), so a single Query call can fetch an agent's config, sources, and table schemas together. The trade-off is that the key schema must be designed upfront to support all access patterns, with no ad-hoc querying.

USER#{email} / AGENT#{bot_id}
ListAgents: Query PK=USER#{email}, SK begins_with 'AGENT#'

One item per agent per user. The email is the Cognito identity that created the agent. All agent metadata lives here — the dashboard's agent list is a single DynamoDB Query on this PK with a SK prefix scan.

bot_id: Short unique identifier (bot_8159fbf0, 8 hex chars of UUID4)
name: Human-readable agent name set at creation
status: Draft or Active, used to filter displayed agents
createdAt: ISO 8601 timestamp, used for newest-first sort in the dashboard

AGENT#{bot_id} / CONFIG
GetConfig: GetItem PK=AGENT#{id}, SK='CONFIG' | SaveConfig: PutItem (overwrites entire item)

Single item per agent. PutItem with SK='CONFIG' overwrites the entire config atomically — no partial updates, no merge conflicts. GetItem is O(1) key lookup — no scan, no filter.

system_prompt: The full system-level instruction given to the LLM on every request
brand_color: Hex color used in embed widget styling (e.g. #2563eb)
name: Display name used in the chat window header
chat_title: Custom chat window title, overrides agent name if set
chat_logo_url: Direct URL to a PNG/SVG logo shown in the chat UI header
welcome_message: First message shown when the chat window opens with no history

AGENT#{bot_id} / SOURCE#{type}#{identifier}
ListSources: Query PK=AGENT#{id}, SK begins_with 'SOURCE#' | Delete: DeleteItem exact SK

One item per document. The full URL or filename is embedded in the SK so deletions are an exact-key lookup — no scan needed to find the record. TABLE sources have a second companion record at TABLE#{filename} which stores the column schema for SQL execution.

SK format (URL): SOURCE#URL#https://docs.example.com/page
SK format (PDF): SOURCE#PDF#{filename}.pdf
SK format (TABLE): SOURCE#TABLE#{filename}.csv, which also has a paired TABLE# schema record
status: 'indexed' once Pinecone upsert completes, 'pending' during ingestion
chunk_count: Number of vector chunks written to Pinecone for this source
ingestedAt: Timestamp of successful ingestion

ACTIVITY#{bot_id} / ENTRY#{timestamp}#{session_id}
GetActivity: Query PK=ACTIVITY#{id}, SK begins_with 'ENTRY#', newest-first sort by SK

One item per chat exchange (one human + one AI turn). The timestamp in the SK gives range-key time ordering: ISO 8601 timestamps sort lexicographically in chronological order, so the newest items come last and the application layer reverses the list for display. Session grouping is done in Python by iterating items and grouping by session_id value.

session_id: UUID or browser-generated thread ID that groups all turns in one conversation
user_msg: Verbatim user question as typed
ai_response: Full accumulated AI response after stream ended
intent: Classified intent (casual, rag, or sql), used for analytics
timestamp: ISO 8601, used for time-ordering and display in chat history sidebar
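The application-layer ordering and grouping of activity items could be sketched like this. The item shape (dicts with SK and session_id keys) follows the record described above; the function name group_by_session is an assumption.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_session(items: List[dict]) -> Dict[str, List[dict]]:
    """Order items newest-first by sort key, then bucket them by session_id.

    ISO 8601 timestamps in the SK sort lexicographically in time order, so a
    reverse string sort yields newest-first. Dict insertion order means the
    session with the most recent activity comes first.
    """
    ordered = sorted(items, key=lambda item: item["SK"], reverse=True)
    sessions: Dict[str, List[dict]] = defaultdict(list)
    for item in ordered:
        sessions[item["session_id"]].append(item)
    return dict(sessions)
```

The result maps naturally onto a history sidebar: one entry per session, ordered by latest activity, with that session's turns attached.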
STATS#{bot_id} / DAY#{YYYY-MM-DD}
GetAnalytics: Query PK=STATS#{id}, SK begins_with 'DAY#', sorted by date

One item per agent per day. DynamoDB's ADD operation atomically increments numeric attributes without read-modify-write race conditions. The analytics chart queries the last 30 days by date range on the SK. No separate analytics database needed.

query_count: Atomic ADD increment on every chat request, with no read-modify-write needed
token_count: Approximate Bedrock token usage, incremented after stream completes
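An atomic counter update of this kind is expressed through an UpdateItem call with an ADD expression. The sketch below only builds the arguments (suitable for passing to a boto3 Table.update_item call) and does not contact AWS. The function name is hypothetical, and combining both counters in one call is a simplification, since the text increments them at different times.

```python
from typing import Any, Dict


def build_stats_update(bot_id: str, day: str, tokens_used: int = 0) -> Dict[str, Any]:
    """Build UpdateItem arguments that atomically bump the daily counters.

    ADD creates the attribute if it does not exist yet, so the first request
    of the day needs no special-casing.
    """
    return {
        "Key": {"PK": f"STATS#{bot_id}", "SK": f"DAY#{day}"},
        "UpdateExpression": "ADD query_count :one, token_count :tokens",
        "ExpressionAttributeValues": {":one": 1, ":tokens": tokens_used},
    }
```

Because the increment happens server-side, two concurrent requests cannot overwrite each other's counts the way a read-then-write would.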
Chat UI

Chat UI Service & Protocol Proxy

The Chat UI is a separate Next.js 15 app deployed as its own Fargate service on port 3001 with basePath='/chat'. Its primary role is bridging the LangGraph SDK's wire protocol to VegaRAG's REST API — not the UI itself.

Why Is This a Separate Service?

The open-source LangGraph SDK (used for the streaming chat UI) expects a specific backend protocol: thread creation, runs/stream SSE events in a particular format, thread state hydration endpoints. VegaRAG's FastAPI backend was built independently and has a different REST API shape.

Rather than rewrite either the SDK or the backend, the Chat UI service acts as a translation layer. Its Next.js API routes at /chat/api/langgraph/* implement the exact interface the SDK expects, then translate each call into the appropriate VegaRAG API call.

Being a separate service also enables independent scaling — if chat traffic spikes independently of the dashboard, only the Chat UI Fargate task needs more capacity, not the entire frontend.

basePath — The Critical Routing Configuration

Next.js has a basePath configuration option that prefixes all routes with a path segment. With basePath: "/chat", a Next.js API route defined at /api/langgraph is actually served at /chat/api/langgraph externally.

This is essential. Without basePath, the LangGraph SDK would make browser requests to vegarag.com/api/langgraph/* — which the ALB routes to the backend, not the Chat UI proxy, causing 404s.

With basePath, requests go to vegarag.com/chat/api/langgraph/* — which the ALB's /chat/* rule correctly routes to the Chat UI container where the proxy handles it. In local dev, basePath is disabled (the app is accessed directly at localhost:3001 with no prefix).

Proxy Route Translation Table

GET /info (handled locally)

Returns hardcoded {version: '1.0.0'}; just confirms the proxy is alive

POST /threads (handled locally)

Generates a UUID thread_id and returns it; no backend call needed

POST /threads/search (forwarded to backend)

→ GET /api/agents/{id}/activity: groups by session_id, returns thread list for history sidebar

GET /threads/{id} (handled locally)

Returns a synthetic thread object with idle status; no backend call

GET /threads/{id}/state (forwarded to backend)

→ GET /api/activity/session/{id}: fetches all turns, rebuilds LangGraph message format with stable IDs

POST /threads/{id}/history (forwarded to backend)

→ GET /api/activity/session/{id}: returns checkpoint array for history navigation

POST /threads/{id}/runs/stream (forwarded to backend)

→ POST /api/chat: forward query, re-stream Bedrock tokens as LangGraph values SSE events

Message ID Stability — Why It Matters

When the LangGraph SDK finishes consuming a stream, it immediately calls GET /threads/{id}/state to get the canonical final state. If the message IDs in this response differ from what was streamed, React sees new objects and re-renders the entire message list — causing tool call messages and structured outputs to visually disappear and reappear.

VegaRAG solves this by deriving message IDs deterministically from the session_id and SK key rather than generating random UUIDs. The same formula runs during streaming and during state reconstruction — producing identical IDs so React's reconciler treats them as the same elements and skips re-render.
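One way to derive stable IDs is a name-based UUID. The actual formula VegaRAG uses is not given in the text, so the version below (uuid5 over session_id, sort key, and role) is purely an assumed illustration of the idea.

```python
import uuid


def message_id(session_id: str, sk: str, role: str) -> str:
    """Derive a deterministic message ID: the same inputs always give the same
    UUID, so streaming and state reconstruction agree without coordination."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{session_id}|{sk}|{role}"))


if __name__ == "__main__":
    sk = "ENTRY#2024-05-01T12:00:00Z#thread-1"
    print(message_id("thread-1", sk, "human"))
    print(message_id("thread-1", sk, "ai"))
```

Including the role in the hashed name keeps the human and AI messages of the same exchange distinct while both remain reproducible.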

Security

Zero-Trust Security Model

No hardcoded AWS credentials. No cross-tenant data leakage. All AWS auth via short-lived IAM session tokens. Bedrock inference stays inside the AWS backbone — never touches the public internet.

IAM Task Roles — No Static Keys

  • ECS Fargate tasks obtain temporary credentials from the ECS task credentials endpoint, a link-local HTTP address (169.254.170.2) that the AWS SDK locates through the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable and that is reachable only from inside the task.
  • Credentials are short-lived and rotated automatically by AWS before they expire. The application never stores them; boto3 refreshes them transparently on expiry.
  • The task role policy is scoped to exactly the services needed: bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream, dynamodb:Query/GetItem/PutItem/UpdateItem, s3:GetObject/PutObject, ecr:GetAuthorizationToken, logs:CreateLogStream/PutLogEvents.
  • No AWS_ACCESS_KEY_ID or AWS_SECRET_ACCESS_KEY environment variables exist anywhere in the system — not in Docker images, task definitions, or CI/CD pipelines.

Cognito Authentication

  • The frontend dashboard is protected by Amazon Cognito. Every dashboard page load verifies an RS256-signed JWT access token issued by the Cognito User Pool.
  • Tokens are signed with the User Pool's RSA private key and verified client-side against the public JWKS endpoint — no round-trip to Cognito required on every page load.
  • The Cognito User Pool enforces email verification at sign-up. Optional MFA with TOTP authenticator apps is supported via Cognito's built-in MFA settings.
  • Public-facing chat UI at /chat is intentionally unauthenticated — end users of a deployed chat widget should not need to create accounts just to chat.

Multi-Tenant Data Isolation

  • Pinecone namespace equals bot_id on every single query — hardcoded in the retrieval node, never derived from user input. A prompt injection attack that tries to set namespace='other_bot' cannot override this parameter.
  • DynamoDB primary keys always include the bot_id segment (PK: AGENT#{bot_id}). The application never performs full table scans — every query is key-anchored to the specific agent, making cross-tenant reads structurally impossible.
  • S3 objects are stored at s3://bucket/{bot_id}/{filename} — IAM prefix conditions can further restrict per-bot access if needed.
  • CloudWatch log streams are keyed per ECS task — no log interleaving between tenants sharing the same container.

Bedrock — Why AI Inference Never Hits the Public Internet

Every OpenAI, Anthropic, or Cohere API call your application makes traverses the public internet. The packet leaves your server, enters the public routing table, passes through multiple ISP hops, and arrives at the AI provider's data centre. At every hop, TLS provides confidentiality — but the provider still terminates TLS and processes your prompt in plaintext on their infrastructure. Your data is subject to their retention and training policies.

Amazon Bedrock differs in one important respect. When a VegaRAG Fargate task calls Bedrock's API, the request goes to a Bedrock endpoint in the same AWS region, and traffic between AWS services stays on the AWS network rather than crossing third-party ISP hops. For a strictly private path that bypasses the internet gateway entirely, the call can be routed through a VPC interface endpoint (AWS PrivateLink) for Bedrock, which makes it equivalent to a private network call rather than an internet API call.

Data residency
All inference stays within your selected AWS region. No data leaves the region boundary.
No training on your data
AWS contractually guarantees Bedrock prompts and responses are not used to train foundation models.
VPC routing
Traffic routes over AWS backbone fiber, not public internet. No TLS termination by third parties.
Audit trail
All Bedrock API calls are logged in CloudTrail with full request metadata. Queryable by security team.
Compliance
Bedrock is HIPAA eligible, SOC 1/2/3 compliant, and ISO 27001 certified. Applications built on it start from that baseline but still need their own compliance assessment under the shared responsibility model.