Technical Deep Dive

Full Technical Breakdown

Every component, data flow, design decision, and architectural trade-off explained in precise detail — from DNS resolution to Bedrock token streaming.

Infrastructure

AWS Cloud Infrastructure

Three independent ECS Fargate services sit behind a single Application Load Balancer. Traffic is split by URL path prefix — no EC2 instances, no SSH access, no hardcoded credentials anywhere in the stack.

ALB Listener Rules — How Traffic Is Split

The ALB has a single HTTPS listener on port 443 with three forwarding rules evaluated in strict priority order. When a request arrives, the ALB walks through the rules top-to-bottom and sends the request to the first matching target group.

Priority 1: /api/* → TG-Backend (port 8000)

FastAPI — handles all AI, ingestion, and config API calls

Priority 2: /chat/* → TG-ChatUI (port 3001)

Next.js Chat UI — serves the chat interface and its proxy routes

Priority 3: /* → TG-Frontend (port 3000)

Default catch-all — Next.js dashboard, landing page, all marketing pages
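The first-match behaviour of the three rules can be sketched in a few lines of Python. This is only a model of the rule table above, not the ALB's implementation: the `route` helper is hypothetical, and `fnmatchcase` is used as a rough stand-in for ALB path-pattern matching.

```python
from fnmatch import fnmatchcase

# (priority, path pattern, target group), copied from the rule list above.
RULES = [
    (1, "/api/*", "TG-Backend"),
    (2, "/chat/*", "TG-ChatUI"),
    (3, "/*", "TG-Frontend"),
]


def route(path: str) -> str:
    """Return the target group of the lowest-numbered rule that matches."""
    for _priority, pattern, target in sorted(RULES):
        if fnmatchcase(path, pattern):
            return target
    raise LookupError(f"no rule matches {path!r}")


if __name__ == "__main__":
    for p in ("/api/chat", "/chat/api/langgraph/info", "/pricing"):
        print(p, "->", route(p))
```

One detail this model makes visible: a bare `/chat` (no trailing slash) does not match `/chat/*` and falls through to the catch-all rule.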

Route 53 + ACM (TLS)

The custom domain resolves via Route 53 Alias record pointing directly at the ALB DNS name. An ACM wildcard certificate covers *.vegarag.com and is auto-attached to the HTTPS listener. Renewal is fully automatic — no manual certificate rotation. HTTP on port 80 is permanently redirected to HTTPS at the ALB layer, so no plaintext traffic ever reaches a container.

VPC + Security Groups

All three Fargate tasks share one security group. Inbound rules allow ports 80 and 443 (ALB), plus 3000, 3001, and 8000 for ALB health check probes. No direct public internet access to container ports — all traffic flows through the ALB. Tasks use 'assignPublicIp: ENABLED' purely so they can pull images from ECR over the internet gateway; no inbound connections can initiate to the task directly.

Fargate + ECR

Each service has its own ECR repository. On deploy, the Docker image is built locally, pushed to ECR, and a new ECS task definition revision is registered. ECS then performs a rolling update — spinning up the new task, waiting for ALB health checks to pass, then draining and killing the old task. Zero-downtime deployments by default.

Task Definition Specs

Backend Service
CPU: 512 CPU units (0.5 vCPU)
Memory: 1024 MB
Port: 8000
IAM Role: ecsTaskExecutionRole (Bedrock, DynamoDB, S3, ECR)
Health Check: GET /health → 200 OK
Log Group: /ecs/vegarag-backend

Frontend Service
CPU: 256 CPU units (0.25 vCPU)
Memory: 512 MB
Port: 3000
IAM Role: ecsTaskExecutionRole (ECR pull only)
Health Check: GET / → 200 OK
Log Group: /ecs/vegarag-frontend

Chat UI Service
CPU: 512 CPU units (0.5 vCPU)
Memory: 1024 MB
Port: 3001
IAM Role: ecsTaskExecutionRole (ECR pull only)
Health Check: GET /chat → 200 OK
Log Group: /ecs/vegarag-chat-ui

Orchestration

LangGraph StateGraph Agent

Every chat request runs through a compiled LangGraph StateGraph: a directed acyclic graph of nodes and conditional edges. State is a TypedDict that each node reads and returns updates to, which keeps the control flow predictable and easy to debug.

What GraphState Contains

Every node in the graph reads from and writes to a shared state object. This state is passed forward through the graph — no global variables, no shared memory between requests. Each chat invocation gets its own isolated state.

    query

    The raw user message as typed in the chat input.

    bot_id

    Which agent is responding — determines which Pinecone namespace, system prompt, and DynamoDB records to use.

    session_id

    Identifies the conversation thread — used to group activity records for history.

    intent

    Populated after the intent router runs: exactly one of 'casual', 'rag', or 'sql'.

    retrieved_context

    Populated by the RAG retriever — the top-5 document chunks concatenated as a single string.

    sql_result

    Populated by the SQL executor — the DuckDB query result formatted as a markdown table.

    final_response

    Set after Bedrock response generation — the full streamed AI reply.
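The fields listed above map naturally onto a TypedDict. The following is a minimal sketch; the field names come from the list, while the `Optional` typing and the `initial_state` helper are assumptions made for illustration.

```python
from typing import Literal, Optional, TypedDict

Intent = Literal["casual", "rag", "sql"]


class GraphState(TypedDict):
    query: str                        # raw user message
    bot_id: str                       # selects namespace, prompt, records
    session_id: str                   # conversation thread
    intent: Optional[Intent]          # set by the intent router
    retrieved_context: Optional[str]  # set by the RAG retriever
    sql_result: Optional[str]         # set by the SQL executor
    final_response: Optional[str]     # set after response generation


def initial_state(query: str, bot_id: str, session_id: str) -> GraphState:
    """Hypothetical helper: a fresh, isolated state for one chat invocation."""
    return GraphState(
        query=query,
        bot_id=bot_id,
        session_id=session_id,
        intent=None,
        retrieved_context=None,
        sql_result=None,
        final_response=None,
    )
```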

The 5-Node Graph Topology

START
Entry

LangGraph's built-in START node. Receives the initial state and immediately passes it to the intent router. No logic here — just the graph entry point.

01
Intent Router Node

Calls Bedrock Nova Lite with the user's query and a strict JSON schema. The LLM must return exactly one of three intents. Uses temperature=0 for maximum determinism. The schema constrains the output format, so the router cannot return free-form text, although the model can still pick the wrong intent.

Conditional Edge

A routing function reads the intent from state and returns a node name as a string. LangGraph uses this to decide the next node. This is the only branching point in the entire graph.

02a
RAG Retriever Node

Only runs if intent='rag'. Embeds the query with Titan v2, queries Pinecone, and appends the top-5 chunks to state as retrieved_context.

02b
SQL Executor Node

Only runs if intent='sql'. Fetches table schemas from DynamoDB, asks Nova Lite to generate DuckDB SQL, executes it against the S3 CSV via HTTPFS, and appends the result table to state.

03
Response Generation Node

All three branches converge here. Injects context (RAG chunks or SQL table, or nothing for casual) into the Bedrock Nova Pro prompt via XML <context> markers and streams tokens back via SSE.
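The topology above can be sketched in plain Python. Every node here is a stub: the keyword classifier stands in for the Nova Lite call, and the node names are assumptions. In LangGraph, a function like `route_after_intent` is the kind of callable passed to `add_conditional_edges`.

```python
from typing import Callable, Dict

State = Dict[str, object]


def intent_router(state: State) -> State:
    # Stand-in for the LLM classifier: crude keyword rules.
    q = str(state["query"]).lower()
    if any(word in q for word in ("total", "sum", "average")):
        intent = "sql"
    elif q.rstrip("!. ") in ("hi", "hello", "thanks"):
        intent = "casual"
    else:
        intent = "rag"
    return {**state, "intent": intent}


def route_after_intent(state: State) -> str:
    """The single branching point: map intent to the next node name."""
    mapping = {"rag": "rag_retriever", "sql": "sql_executor"}
    return mapping.get(str(state["intent"]), "generate_response")


def rag_retriever(state: State) -> State:
    return {**state, "retrieved_context": "<top-5 chunks>"}


def sql_executor(state: State) -> State:
    return {**state, "sql_result": "<markdown table>"}


def generate_response(state: State) -> State:
    context = state.get("retrieved_context") or state.get("sql_result") or ""
    return {**state, "final_response": f"answer using [{context}]"}


NODES: Dict[str, Callable[[State], State]] = {
    "rag_retriever": rag_retriever,
    "sql_executor": sql_executor,
}


def run(query: str) -> State:
    state = intent_router({"query": query})
    next_node = route_after_intent(state)
    if next_node != "generate_response":
        state = NODES[next_node](state)
    return generate_response(state)  # all branches converge here
```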

Why LangGraph Instead of a Simple if/else Chain?

Deterministic Execution

LangGraph compiles the graph into a static execution plan. Every run follows the same deterministic path based on state — no hidden branching, no surprise side effects between requests.

Observability

Each node and edge is instrumented. LangSmith can trace every invocation — which node ran, what state looked like entering and exiting, how long each step took. Critical for debugging production issues.

Extensibility

Adding a new capability (e.g., a web search node) requires adding one node and one conditional edge. No need to refactor the entire flow — the graph handles routing.

Retrieval

RAG Ingestion + Retrieval Pipeline

Every agent gets its own isolated Pinecone namespace. Documents are chunked, embedded with Amazon Titan v2, and upserted at ingestion time. At query time, the user's question is embedded and compared against all stored vectors using approximate nearest-neighbour search.

Ingestion — How Documents Enter the System

1
Source Loading

URLs are loaded with LangChain's WebBaseLoader, which fetches the page and parses the HTML with BeautifulSoup; it does not execute JavaScript, so only server-rendered content is captured. PDFs are processed with PyPDFLoader, which extracts text page by page. CSVs and Excel files skip the RAG pipeline entirely and go to the SQL engine instead.

2
Text Chunking

Documents are split using RecursiveCharacterTextSplitter with chunk_size=1000 characters and chunk_overlap=200. The splitter tries paragraph boundaries first (\n\n), then line breaks, then sentence endings, then word boundaries — working from biggest to smallest separator until chunks fit. The 200-character overlap preserves context across chunk edges.

3
Embedding Generation

Each chunk is passed separately to Amazon Titan Embed Text v2 via Bedrock. The model returns a dense vector of float32 values, 1024-dimensional by default (256 and 512 are also selectable). Titan v2 supports an 8,192 token input window, so even large chunks fit in a single embedding call.

4
Pinecone Upsert

Vectors are upserted to Pinecone Serverless in batches. Each vector includes the raw chunk text and source URL as metadata. The namespace parameter is always set to the bot_id, which partitions vectors between agents. Because the namespace is set server-side and never taken from user input, a prompt injection cannot redirect retrieval to another tenant's vectors.
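The chunking arithmetic and the namespace-per-agent batching can be sketched as follows. This is a simplified fixed-window splitter, not RecursiveCharacterTextSplitter (it ignores separators), and the record ID scheme and batch size of 100 are assumptions for illustration.

```python
from typing import Dict, List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Fixed-size windows; each window starts chunk_size - chunk_overlap after the last."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def build_upsert_batches(bot_id: str, source_url: str, chunks: List[str],
                         vectors: List[List[float]], batch_size: int = 100) -> List[Dict]:
    """Group records into batches, always tagged with namespace = bot_id."""
    records = [
        {"id": f"{source_url}#{i}", "values": vec,
         "metadata": {"text": chunk, "source": source_url}}
        for i, (chunk, vec) in enumerate(zip(chunks, vectors))
    ]
    return [
        {"namespace": bot_id, "vectors": records[i:i + batch_size]}
        for i in range(0, len(records), batch_size)
    ]
```

With the default sizes, a 2,500-character document yields three chunks, and the last 200 characters of each chunk repeat as the first 200 of the next.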

Retrieval — How Context Is Found

1
Query Embedding

The user's raw query string is sent to Titan Embed Text v2 through the same pipeline as ingestion. This produces a query vector with the same dimension (1024 by default for Titan v2) and in the same vector space as the stored document vectors, which is a prerequisite for meaningful similarity comparison.

2
ANN Search

Pinecone runs Approximate Nearest Neighbour (ANN) search instead of an exact scan. Graph-based ANN indexes such as HNSW build a hierarchical graph of vectors at index time and traverse it at query time, trading a small amount of recall for large speed gains over brute-force search. Pinecone Serverless manages its index internals itself, so the exact algorithm is not something VegaRAG configures; the design target is sub-50ms p99 retrieval latency at millions of vectors.

3
Top-K Selection

Pinecone returns the 5 most similar vectors by cosine similarity score. Top-5 is deliberately small — sending too much context fills the Bedrock prompt context window and causes the LLM to 'average' the information instead of focusing on the most relevant chunks. 5 chunks at ~1000 chars each = ~5,000 chars of context.

4
Context Assembly

The 5 retrieved chunk texts are joined with a '---' separator and set as retrieved_context in the graph state. The response generation node then wraps this in <context>...</context> XML tags when constructing the Bedrock prompt — a clear structural signal to the LLM that this is external reference material, not part of the conversation.
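The retrieval steps can be sketched with an exact, brute-force version of what the ANN index approximates. The record shape and helper names are hypothetical, and placing the '---' separator on its own line is an assumption.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: Sequence[float], records: List[Dict], k: int = 5) -> List[Dict]:
    """Exact top-k by cosine similarity (an ANN index approximates this)."""
    ranked = sorted(records, key=lambda r: cosine(query_vec, r["values"]), reverse=True)
    return ranked[:k]


def assemble_context(chunks: List[str]) -> str:
    return "\n---\n".join(chunks)


def wrap_for_prompt(context: str) -> str:
    """Mark retrieved text as reference material, not conversation."""
    return f"<context>\n{context}\n</context>"
```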

Technical Specs

Embedding Model: Amazon Titan Embed Text v2 (amazon.titan-embed-text-v2:0)
Vector Dimension: 1024 float32 values per vector (model default)
Input Window: 8,192 tokens per embedding call
Similarity Metric: Cosine similarity (angle between vectors, not magnitude)
Index Type: Approximate nearest-neighbour index managed by Pinecone Serverless
Retrieval Latency: Sub-50ms p99 target for indexes up to 10M vectors
Tenant Isolation: Pinecone namespace = bot_id, enforced on every query

Data Warehouse

Enterprise Text-to-SQL with PostgreSQL

RAG is fundamentally broken for aggregate math. When intent is classified as 'sql', VegaRAG generates SQL against the user's data and executes it securely in PostgreSQL using strict Row-Level Security (RLS).

Why RAG Fails for Numeric Questions

RAG retrieves text chunks. For "What was total revenue in Q3?", it retrieves chunks mentioning revenue or Q3. Summing numbers scattered across text is something embedding similarity cannot do.

SQL, by contrast, operates on actual numbers. A SUM(revenue) WHERE quarter='Q3' query returns exact math — no inference, no hallucination.
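A tiny illustration of the difference, using Python's built-in SQLite purely as an in-process stand-in for the warehouse. The table, column names, and rows are made up for the example.

```python
import sqlite3

# In-memory database with a few illustrative rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (quarter TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Q2", 1200.0), ("Q3", 500.0), ("Q3", 250.5), ("Q4", 900.0)],
)

# The aggregate is computed by the engine over every matching row,
# not inferred from whichever text chunks happened to be retrieved.
(total_q3,) = conn.execute(
    "SELECT SUM(revenue) FROM sales WHERE quarter = ?", ("Q3",)
).fetchone()
print(total_q3)
```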

Row-Level Security (RLS) — The Multi-Tenant Barrier

Executing LLM-generated SQL against a database is traditionally extremely dangerous. We use an Enterprise PostgreSQL Warehouse to solve this using RLS.

Before any SQL executes, the backend runs a context setup command that sets the tenant variable for the session. Postgres RLS policies evaluate that variable on every query. Even if a prompt injection attack generates a malicious query, the database engine filters out all rows not belonging to that tenant before returning data.

The Postgres connection is also forced to read-only, so generated SQL cannot insert, update, or delete data even if it is malicious.
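One way the tenant context and read-only wrapping might look, as a sketch rather than the actual backend code. The setting name `app.tenant_id`, the `sales` table, and the helper are assumptions; the function only builds the statement sequence and leaves execution to whatever Postgres driver is in use.

```python
from typing import List, Tuple

# Example policy DDL (run once by an admin). Table owners and superusers
# bypass RLS by default, so the querying role must be neither.
POLICY_DDL = """
ALTER TABLE sales ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON sales
    USING (tenant_id = current_setting('app.tenant_id'));
"""

Statement = Tuple[str, tuple]


def tenant_scoped_statements(tenant_id: str, generated_sql: str) -> List[Statement]:
    """Wrap one LLM-generated query in a read-only, tenant-scoped transaction."""
    sql = generated_sql.strip().rstrip(";")
    if ";" in sql:
        # Conservative: also rejects semicolons inside string literals.
        raise ValueError("only a single statement is allowed")
    if not sql.lower().startswith(("select", "with")):
        raise ValueError("only SELECT queries are allowed")
    return [
        ("BEGIN READ ONLY", ()),
        # Third argument true = setting lasts for this transaction only.
        ("SELECT set_config('app.tenant_id', %s, true)", (tenant_id,)),
        (sql, ()),
        ("COMMIT", ()),
    ]
```

The prefix check is defence in depth only; the read-only transaction and the RLS policy are what the database actually enforces.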

Asynchronous S3 Data Ingestion

    FastAPI BackgroundTasks

    When users upload large CSV or Excel files, the endpoint returns '202 Accepted' immediately. Parsing, DataFrame conversion, and the S3 upload run in a background task after the response is sent, which avoids the ALB's 60-second idle timeout.

    Automated Schema Extraction

    During the background task, the file headers are read and types inferred. A schema map is saved to DynamoDB so the LLM knows exactly what columns exist when generating the SQL later.

    No ETL Pipeline Required

    The data flows directly from the user's browser, into S3, and is securely mapped to their tenant ID in PostgreSQL — ready for immediate natural language querying.
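The schema extraction step could look roughly like the sketch below, which reads the header row and infers a type per column from a sample of rows. The type names and the 100-row sample size are assumptions, and the standard csv module stands in for the DataFrame-based parsing.

```python
import csv
import io
from typing import Dict, List


def infer_type(values: List[str]) -> str:
    """Pick the narrowest type that every non-empty sample value fits."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "TEXT"
    for caster, name in ((int, "INTEGER"), (float, "DOUBLE")):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue
    return "TEXT"


def extract_schema(csv_text: str, sample_rows: int = 100) -> Dict[str, str]:
    """Map each header to an inferred type, using the first sample_rows rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [row for _, row in zip(range(sample_rows), reader)]
    columns = reader.fieldnames or []
    return {col: infer_type([row[col] or "" for row in rows]) for col in columns}
```

A map like this is what would be saved alongside the source record so the SQL-generating prompt can list the exact column names and types.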

Guardrails

Zero-Trust Security & Guardrails

VegaRAG intercepts user input and LLM output in real time, masking PII before it reaches Bedrock and flagging suspected hallucinations in the activity logs.

Input Layer: Microsoft Presidio PII Redaction

Sending raw user prompts to external LLMs can leak Social Security Numbers, credit cards, or internal emails. VegaRAG runs Microsoft Presidio with a SpaCy NLP engine locally inside the container.

User: My SSN is 123-45-6789. What's my balance?

System intercept: My SSN is <US_SSN>. What's my balance?

Bedrock never sees the SSN. Because redaction runs locally inside the container, the platform's SOC 2 and GDPR posture does not depend on third-party API promises.
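The placeholder substitution can be illustrated with a regex-only stand-in. This is not Presidio, which combines pattern recognizers with NLP-based entity detection and validation; the two patterns below are simplified assumptions that only show the shape of the transformation.

```python
import re

# Simplified patterns for illustration only.
PATTERNS = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL_ADDRESS": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def redact(text: str) -> str:
    """Replace each detected entity with a <ENTITY_TYPE> placeholder."""
    for entity, pattern in PATTERNS.items():
        text = pattern.sub(f"<{entity}>", text)
    return text


if __name__ == "__main__":
    print(redact("My SSN is 123-45-6789. What's my balance?"))
```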

Output Layer: Dual-LLM Entailment Checks

How do you prove a RAG system isn't hallucinating? VegaRAG implements a "Dual-LLM" architecture. After the primary Bedrock Nova Pro model streams an answer, a secondary, smaller Amazon Nova Micro model is triggered in the background.

  • It runs a strict "entailment check" comparing the generated answer against the retrieved Pinecone context.
  • If the Micro model detects facts in the answer that do not exist in the context, it flags the response as a Hallucination in the DynamoDB activity logs.
  • This automated evaluation replaces "vibes-based" testing with a logged, repeatable accuracy signal.
Observability

Telemetry, Caching & Rate Limiting

A system is only enterprise-grade if you can monitor it, cache it, and protect it from noisy neighbors.

Pinecone Semantic Caching

Instead of paying for Bedrock tokens on every identical query, VegaRAG stores previous Q&A pairs in a separate Pinecone Cache index.

When a user asks a similar question (cosine similarity > 0.95), the system returns the cached response in under 50ms. No Bedrock generation tokens are consumed on a cache hit.
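The cache lookup logic, sketched with an in-memory list in place of the Pinecone cache index. The class and method names are hypothetical; only the 0.95 threshold comes from the text above.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Return a stored answer when a new query embedding is close enough."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []

    def put(self, query_vec: Sequence[float], answer: str) -> None:
        self.entries.append((list(query_vec), answer))

    def get(self, query_vec: Sequence[float]) -> Optional[str]:
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score > self.threshold else None
```

A miss (`None`) falls through to the normal retrieval and generation path, after which the new Q&A pair can be stored with `put`.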

Token Bucket Rate Limiting

Multi-tenant architecture is vulnerable to "noisy neighbors" who exhaust API quotas. We implemented a memory-based Token Bucket algorithm inside FastAPI.

Each tenant gets a fixed capacity of tokens per minute. If they burst above it, they receive an HTTP 429 Too Many Requests, protecting the cluster without requiring Redis.
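A minimal in-memory token bucket along these lines is sketched below. The class, the per-tenant dictionary, and the default numbers are assumptions; the injectable clock exists only to make the behaviour testable.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float,
                 now: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.now = now
        self.updated = now()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        t = self.now()
        elapsed = t - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


_buckets: Dict[str, TokenBucket] = {}


def status_for(tenant_id: str, capacity: float = 60, refill_per_sec: float = 1.0) -> int:
    """Return the HTTP status a request from this tenant should receive."""
    bucket = _buckets.setdefault(tenant_id, TokenBucket(capacity, refill_per_sec))
    return 200 if bucket.allow() else 429
```

Because the buckets live in process memory, each Fargate task keeps its own counts; a limit enforced this way is per task rather than cluster-wide.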

OpenTelemetry Distributed Tracing

VegaRAG outputs Structured JSON logs injected with trace_id and span_id values.

When a request flows from the Next.js Proxy → FastAPI → Bedrock → Pinecone, AWS X-Ray renders the spans as a waterfall timeline, which makes it possible to isolate latency bottlenecks per hop at millisecond resolution.

Real-time

Server-Sent Events (SSE) Streaming

Tokens stream from Amazon Bedrock into a FastAPI AsyncIterator, get re-wrapped by the Next.js proxy into LangGraph-format SSE events, and land in the browser DOM via the LangGraph React SDK — no buffering, instant time-to-first-token.

Step 1 — FastAPI SSE Response

The FastAPI /api/chat endpoint returns a StreamingResponse with media_type="text/event-stream". This tells the HTTP client not to wait for the response body to complete before reading — it can consume each newline-delimited chunk as it arrives.

Bedrock's invoke_model_with_response_stream returns a response whose body is an EventStream. The backend iterates this stream and yields one chunk per event. Each chunk contains a delta object with a "text" field containing one or more characters.

Each token is immediately formatted as an SSE event: data: {"text": "token"}. A final data: [DONE] event signals stream end. Response headers disable all caching and set Connection: keep-alive to prevent ALB timeout during long responses.
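The wire format described here can be sketched as a plain generator. The function name and the exact header values are assumptions; in FastAPI, a generator like this would be handed to StreamingResponse with media_type="text/event-stream".

```python
import json
from typing import Dict, Iterable, Iterator

# Assumed header values for "disable caching" and "keep the connection open".
SSE_HEADERS: Dict[str, str] = {
    "Cache-Control": "no-cache",
    "Connection": "keep-alive",
}


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Format each token as one SSE event, then emit the end-of-stream marker."""
    for token in tokens:
        # An SSE event is terminated by a blank line, hence the double newline.
        yield f"data: {json.dumps({'text': token})}\n\n"
    yield "data: [DONE]\n\n"


if __name__ == "__main__":
    for event in sse_events(["Hel", "lo"]):
        print(repr(event))
```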

Step 2 — Next.js Proxy Re-wrap

The LangGraph SDK doesn't know how to consume VegaRAG's raw SSE format — it expects a specific event type called values containing a full LangGraph message array. The Next.js proxy bridges this gap.

For every token received from the backend, the proxy constructs a new complete message array (human message + accumulated AI text so far), serialises it as JSON, and emits it as an SSE event with event: values. Before the first token, it emits an event: metadata with the run ID.

This means the browser receives N events for N tokens — each containing the full accumulated message. The SDK's React hook updates its internal state on every event, triggering a React re-render that appends the new token to the DOM. The visual effect is identical to ChatGPT's streaming output.
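The re-wrapping logic is shown below in Python for consistency with the other sketches, although the real proxy is a Next.js route. The message dictionary shape and the function signature are assumptions; the event names (metadata, values) and the accumulate-and-resend behaviour follow the description above.

```python
import json
from typing import Iterable, Iterator


def rewrap(backend_lines: Iterable[str], human_text: str,
           run_id: str, ai_message_id: str) -> Iterator[str]:
    """Turn backend `data: {"text": ...}` lines into LangGraph-style events."""
    yield f"event: metadata\ndata: {json.dumps({'run_id': run_id})}\n\n"
    accumulated = ""
    for line in backend_lines:
        if not line.startswith("data: "):
            continue  # skip blank separators and comments
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break
        accumulated += json.loads(payload)["text"]
        messages = [
            {"type": "human", "content": human_text},
            {"type": "ai", "id": ai_message_id, "content": accumulated},
        ]
        # Every event carries the full message array so far, not just the delta.
        yield f"event: values\ndata: {json.dumps({'messages': messages})}\n\n"
```

Resending the whole array on every token costs bandwidth that grows with response length, but it lets the client replace its state wholesale instead of merging deltas.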

Step 3 — LangGraph SDK React Hook

The useStream hook from @langchain/langgraph-sdk/react manages the entire client-side streaming lifecycle. When the user submits a message, it: creates or reuses a thread ID, POSTs to the proxy's /runs/stream endpoint, and opens a ReadableStream on the response body.

The hook's internal reducer processes each incoming SSE event, merges the message array into the component's React state via an immutable update, and triggers a re-render. The UI component reads stream.messages, which updates after every token; React batches these updates, so fast streams do not cause visible jank.

When the stream ends, the hook automatically fires a GET /threads/{id}/state request to hydrate the final canonical message state from DynamoDB — ensuring tool call messages and metadata survive beyond the stream lifetime.

SSE Event Wire Format — 4 Event Types

metadata: First event, before any tokens

Carries the run_id (UUID). Used by the SDK to correlate this stream with a specific LangGraph run.

values: Once per token (N events total)

Carries the full updated message array. The AI message content field grows by one token each event.

error: On backend failure

Carries an error object. SDK surfaces this as an error toast in the UI and stops rendering.

[DONE]: Last event in stream

Plain SSE data with literal string [DONE]. Signals the proxy to close the ReadableStream controller.

Database

DynamoDB Single-Table Design

Every piece of VegaRAG's persistent state — agents, configs, data sources, chat logs, and analytics — lives in one DynamoDB table. Composite primary keys enable O(1) access patterns with zero SQL joins.

Why Single-Table Design?

DynamoDB charges for read and write capacity units, not joins. Multiple tables would require multiple round-trips to assemble related data. Single-table design groups related items under a shared partition key (for example, an agent's config and sources under AGENT#{bot_id}), so one Query call fetches a whole item collection. The trade-off is that the key schema must be designed upfront to support all access patterns; there is no ad-hoc querying.

USER#{email} / AGENT#{bot_id}
ListAgents: Query PK=USER#{email}, SK begins_with 'AGENT#'

One item per agent per user. The email is the Cognito identity that created the agent. All agent metadata lives here — the dashboard's agent list is a single DynamoDB Query on this PK with a SK prefix scan.

bot_id: Short unique identifier (e.g. bot_8159fbf0, 8 hex chars of a UUID4)
name: Human-readable agent name set at creation
status: Draft or Active; used to filter displayed agents
createdAt: ISO 8601 timestamp; used for newest-first sort in the dashboard

AGENT#{bot_id} / CONFIG
GetConfig: GetItem PK=AGENT#{id}, SK='CONFIG' | SaveConfig: PutItem (overwrites entire item)

Single item per agent. PutItem with SK='CONFIG' overwrites the entire config atomically — no partial updates, no merge conflicts. GetItem is O(1) key lookup — no scan, no filter.

system_prompt: The full system-level instruction given to the LLM on every request
brand_color: Hex color used in embed widget styling (e.g. #2563eb)
name: Display name used in the chat window header
chat_title: Custom chat window title; overrides agent name if set
chat_logo_url: Direct URL to a PNG/SVG logo shown in the chat UI header
welcome_message: First message shown when the chat window opens with no history

AGENT#{bot_id} / SOURCE#{type}#{identifier}
ListSources: Query PK=AGENT#{id}, SK begins_with 'SOURCE#' | Delete: DeleteItem exact SK

One item per document. The full URL or filename is embedded in the SK so deletions are an exact-key lookup — no scan needed to find the record. TABLE sources have a second companion record at TABLE#{filename} which stores the column schema for SQL execution.

SK format (URL): SOURCE#URL#https://docs.example.com/page
SK format (PDF): SOURCE#PDF#{filename}.pdf
SK format (TABLE): SOURCE#TABLE#{filename}.csv (also has a paired TABLE# schema record)
status: 'indexed' once Pinecone upsert completes, 'pending' during ingestion
chunk_count: Number of vector chunks written to Pinecone for this source
ingestedAt: Timestamp of successful ingestion

ACTIVITY#{bot_id} / ENTRY#{timestamp}#{session_id}
GetActivity: Query PK=ACTIVITY#{id}, SK begins_with 'ENTRY#' (newest-first sort by SK)

One item per chat exchange (one human + one AI turn). The timestamp in the SK enables range-key time ordering — newest items sort last alphabetically which is then reversed in the application layer. Session grouping is done in Python by iterating items and grouping by session_id value.

session_id: UUID or browser-generated thread ID; groups all turns in one conversation
user_msg: Verbatim user question as typed
ai_response: Full accumulated AI response after stream ended
intent: Classified intent (casual, rag, or sql); used for analytics
timestamp: ISO 8601; used for time-ordering and display in chat history sidebar

STATS#{bot_id} / DAY#{YYYY-MM-DD}
GetAnalytics: Query PK=STATS#{id}, SK begins_with 'DAY#' (sorted by date)

One item per agent per day. DynamoDB's ADD operation atomically increments numeric attributes without read-modify-write race conditions. The analytics chart queries the last 30 days by date range on the SK. No separate analytics database needed.

query_count: Atomic ADD increment on every chat request; no read-modify-write needed
token_count: Approximate Bedrock token usage; incremented after stream completes

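The key patterns above can be captured in a few small builders. The function names are hypothetical; the key formats follow the access patterns listed in this section, and `stats_increment` returns keyword arguments shaped for a DynamoDB UpdateItem call using the ADD action.

```python
from typing import Any, Dict


def agent_key(email: str, bot_id: str) -> Dict[str, str]:
    return {"PK": f"USER#{email}", "SK": f"AGENT#{bot_id}"}


def config_key(bot_id: str) -> Dict[str, str]:
    return {"PK": f"AGENT#{bot_id}", "SK": "CONFIG"}


def source_key(bot_id: str, kind: str, identifier: str) -> Dict[str, str]:
    # kind is one of URL, PDF, TABLE; identifier is the URL or filename.
    return {"PK": f"AGENT#{bot_id}", "SK": f"SOURCE#{kind}#{identifier}"}


def activity_key(bot_id: str, timestamp: str, session_id: str) -> Dict[str, str]:
    # ISO 8601 timestamps sort lexicographically in chronological order.
    return {"PK": f"ACTIVITY#{bot_id}", "SK": f"ENTRY#{timestamp}#{session_id}"}


def stats_increment(bot_id: str, day: str, tokens: int) -> Dict[str, Any]:
    """UpdateItem arguments that atomically bump the day's counters."""
    return {
        "Key": {"PK": f"STATS#{bot_id}", "SK": f"DAY#{day}"},
        "UpdateExpression": "ADD query_count :one, token_count :tokens",
        "ExpressionAttributeValues": {":one": 1, ":tokens": tokens},
    }
```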
Chat UI

Chat UI Service & Protocol Proxy

The Chat UI is a separate Next.js 15 app deployed as its own Fargate service on port 3001 with basePath='/chat'. Its primary role is bridging the LangGraph SDK's wire protocol to VegaRAG's REST API — not the UI itself.

Why Is This a Separate Service?

The open-source LangGraph SDK (used for the streaming chat UI) expects a specific backend protocol: thread creation, runs/stream SSE events in a particular format, thread state hydration endpoints. VegaRAG's FastAPI backend was built independently and has a different REST API shape.

Rather than rewrite either the SDK or the backend, the Chat UI service acts as a translation layer. Its Next.js API routes at /chat/api/langgraph/* implement the exact interface the SDK expects, then translate each call into the appropriate VegaRAG API call.

Being a separate service also enables independent scaling — if chat traffic spikes independently of the dashboard, only the Chat UI Fargate task needs more capacity, not the entire frontend.

basePath — The Critical Routing Configuration

Next.js has a basePath configuration option that prefixes all routes with a path segment. With basePath: "/chat", a Next.js API route defined at /api/langgraph is actually served at /chat/api/langgraph externally.

This is essential. Without basePath, the LangGraph SDK would make browser requests to vegarag.com/api/langgraph/* — which the ALB routes to the backend, not the Chat UI proxy, causing 404s.

With basePath, requests go to vegarag.com/chat/api/langgraph/* — which the ALB's /chat/* rule correctly routes to the Chat UI container where the proxy handles it. In local dev, basePath is disabled (the app is accessed directly at localhost:3001 with no prefix).

Proxy Route Translation Table

GET /info (handled locally)

Returns hardcoded {version: '1.0.0'}; just confirms the proxy is alive

POST /threads (handled locally)

Generates a UUID thread_id and returns it; no backend call needed

POST /threads/search (forwarded to backend)

→ GET /api/agents/{id}/activity: groups by session_id, returns thread list for history sidebar

GET /threads/{id} (handled locally)

Returns a synthetic thread object with idle status; no backend call

GET /threads/{id}/state (forwarded to backend)

→ GET /api/activity/session/{id}: fetches all turns, rebuilds LangGraph message format with stable IDs

POST /threads/{id}/history (forwarded to backend)

→ GET /api/activity/session/{id}: returns checkpoint array for history navigation

POST /threads/{id}/runs/stream (forwarded to backend)

→ POST /api/chat: forwards the query, re-streams Bedrock tokens as LangGraph values SSE events

Message ID Stability — Why It Matters

When the LangGraph SDK finishes consuming a stream, it immediately calls GET /threads/{id}/state to get the canonical final state. If the message IDs in this response differ from what was streamed, React sees new objects and re-renders the entire message list — causing tool call messages and structured outputs to visually disappear and reappear.

VegaRAG solves this by deriving message IDs deterministically from the session_id and SK key rather than generating random UUIDs. The same formula runs during streaming and during state reconstruction — producing identical IDs so React's reconciler treats them as the same elements and skips re-render.
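One way to derive stable IDs is a name-based UUID, sketched below. The use of uuid5, the namespace string, and the inclusion of the message role are assumptions; the source only states that IDs are derived deterministically from the session_id and SK.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
MESSAGE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "vegarag/messages")


def message_id(session_id: str, sort_key: str, role: str) -> str:
    """Same inputs always give the same ID, during streaming and on reload."""
    return str(uuid.uuid5(MESSAGE_NAMESPACE, f"{session_id}|{sort_key}|{role}"))
```

Because the ID depends only on stored fields, the streaming path and the state-reconstruction path can compute it independently and still agree.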

Security

Zero-Trust Security Model

No hardcoded AWS credentials. No cross-tenant data leakage. All AWS auth via short-lived IAM session tokens. Bedrock inference stays inside the AWS backbone — never touches the public internet.

IAM Task Roles — No Static Keys

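  • ECS Fargate tasks obtain temporary credentials from the ECS container credentials endpoint, a link-local HTTP address (169.254.170.2) that the SDK locates through the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable and that is reachable only from inside the task. This is separate from the EC2 Instance Metadata Service at 169.254.169.254.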
  • Credentials are short-lived and rotated automatically by AWS. The application never stores them; boto3 refreshes them transparently before expiry.
  • The task role policy is scoped to the services needed: bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream (required for token streaming), dynamodb:Query/GetItem/PutItem/UpdateItem, s3:GetObject/PutObject, ecr:GetAuthorizationToken, logs:CreateLogStream/PutLogEvents.
  • No AWS_ACCESS_KEY_ID or AWS_SECRET_ACCESS_KEY environment variables exist anywhere in the system — not in Docker images, task definitions, or CI/CD pipelines.

Cognito Authentication

  • The frontend dashboard is protected by Amazon Cognito. Every dashboard page check verifies a JWT RS256 access token issued by the Cognito User Pool.
  • Tokens are signed with the User Pool's RSA private key and verified client-side against the public JWKS endpoint — no round-trip to Cognito required on every page load.
  • The Cognito User Pool enforces email verification at sign-up. Optional MFA with TOTP authenticator apps is supported via Cognito's built-in MFA settings.
  • Public-facing chat UI at /chat is intentionally unauthenticated — end users of a deployed chat widget should not need to create accounts just to chat.

Multi-Tenant Data Isolation

  • Pinecone namespace equals bot_id on every single query — hardcoded in the retrieval node, never derived from user input. A prompt injection attack that tries to set namespace='other_bot' cannot override this parameter.
  • DynamoDB primary keys always include the bot_id segment (PK: AGENT#{bot_id}). The application never performs full table scans — every query is key-anchored to the specific agent, making cross-tenant reads structurally impossible.
  • S3 objects are stored at s3://bucket/{bot_id}/{filename} — IAM prefix conditions can further restrict per-bot access if needed.
  • CloudWatch log streams are keyed per ECS task — no log interleaving between tenants sharing the same container.

Bedrock — Why AI Inference Never Hits the Public Internet

Every OpenAI, Anthropic, or Cohere API call your application makes traverses the public internet. The packet leaves your server, enters the public routing table, passes through multiple ISP hops, and arrives at the AI provider's data centre. At every hop, TLS provides confidentiality — but the provider still terminates TLS and processes your prompt in plaintext on their infrastructure. Your data is subject to their retention and training policies.

Amazon Bedrock differs in where the traffic goes. When a VegaRAG Fargate task calls Bedrock's API, the request travels from the container's ENI to a Bedrock endpoint in the same AWS region and stays on AWS's own network rather than crossing third-party ISP hops. For a strict guarantee that requests never use a public address, a VPC interface endpoint (AWS PrivateLink) for Bedrock can be added so the call resolves to a private IP inside the VPC.

Data residency
All inference stays within your selected AWS region. No data leaves the region boundary.
No training on your data
AWS contractually guarantees Bedrock prompts and responses are not used to train foundation models.
VPC routing
Traffic routes over AWS backbone fiber, not public internet. No TLS termination by third parties.
Audit trail
All Bedrock API calls are logged in CloudTrail with full request metadata. Queryable by security team.
Compliance
Bedrock is HIPAA eligible and in scope for AWS's SOC 1/2/3 reports and ISO 27001 certification. Applications built on it benefit from those controls but still need their own compliance work.