
Designing a Production-Minded RAG Chatbot for a Personal Website

8 min read · llm · architecture · rag · system-design

Most RAG tutorials optimize for enterprise scale: huge corpora, distributed infrastructure, and complex orchestration. My use case was different.

I wanted a chatbot on edwingao.com that could answer questions about my writing and projects, cite sources, stream responses, and stay cheap enough to run continuously.

This article explains the architecture, tradeoffs, implementation details, and production hardening work that made it reliable.

Goals And Constraints

I set explicit constraints before writing code:

| Constraint | Why it mattered |
| --- | --- |
| Cost under about $1/month | Personal project should be sustainable |
| No manual indexing steps | Publishing content should auto-refresh retrieval |
| Source-grounded answers with citations | Avoid hallucinated claims about my work |
| Good UX on desktop and mobile | Chat should feel native, not bolted on |
| Clear failure behavior | Invalid input should return 400, infra failures degrade gracefully |

High-Level Architecture

The system has two pipelines:

  1. Build-time indexing
  2. Runtime retrieval and generation

RAG Chatbot — End-to-End Architecture

Stack: Next.js App Router, Vercel AI SDK, OpenAI (text-embedding-3-small, gpt-4o-mini), Upstash Vector.

Why I Use Markdown + Diagrams For Visual Experience

I wanted this system to be understandable within a few minutes, especially for readers scanning quickly (including recruiters). So I intentionally structured the post with Markdown primitives and diagram callouts.

What I optimized for:

  1. Fast scanning: short sections, numbered headings, concise bullet lists.
  2. Decision visibility: tables for constraints and architectural choices.
  3. Concrete credibility: code blocks for implementation details, not just high-level claims.
  4. Mental model first: architecture and sequence diagrams placed before deep implementation sections.

Markdown made this easy to maintain in the same content pipeline as the rest of the site, and diagrams reduced cognitive load for non-specialist readers.

Design Decisions

| Decision | Choice | Reason |
| --- | --- | --- |
| Retrieval strategy | Vector search + threshold filter | Better semantic match than lexical overlap |
| Invocation strategy | LLM tool calling | Model decides when retrieval is needed |
| Indexing schedule | Build-time script | Predictable, cheap, no runtime indexing cost |
| Chunking | Markdown-aware + token-aware + overlap | Better recall across section boundaries |
| Citation mapping | Stable per-request source map | Deterministic [1], [2] references |
| Cost control | Delta indexing by contentHash | Avoid re-embedding unchanged chunks |

1) Chunking: Retrieval Quality Starts Here

Chunking is usually the highest leverage part of a small RAG system. If chunk boundaries are poor, retrieval quality drops no matter how good the model is.

My chunker is:

  1. Markdown-aware (splits on H1-H3 boundaries)
  2. Token-aware (js-tiktoken, GPT tokenizer)
  3. Overlap-aware (50-token overlap between adjacent chunks)
  4. Code-fence-aware (never splits inside fenced code blocks)

```ts
const TARGET_CHUNK_TOKENS = 450;
const MAX_CHUNK_TOKENS = 1000;
const OVERLAP_TOKENS = 50;
```

I also generate deterministic chunk IDs (post:{slug}:{index}), so re-indexing can compare stable IDs and hashes.
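As a sketch, the ID and hash scheme can be as simple as the following (helper names are my own, not taken from the actual codebase):

```typescript
import { createHash } from "node:crypto";

// Deterministic chunk ID in the post:{slug}:{index} form described above.
function chunkId(slug: string, index: number): string {
  return `post:${slug}:${index}`;
}

// Stable content hash, so re-indexing can detect unchanged chunks.
function contentHash(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}
```

Because both values are pure functions of the content, two indexing runs over identical posts produce identical IDs and hashes, which is what makes delta comparison possible later.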

A subtle bug worth calling out

I hit an overlap bug where boundary text could merge words (...startsNEXT) when prepending overlap. That degraded embedding quality at chunk boundaries.

Fix: insert a separator when the overlap suffix and the next chunk's prefix do not already supply whitespace between them.
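A minimal sketch of that fix (the function name is assumed; the real chunker is structured differently):

```typescript
// Join overlap text and the next chunk, adding a space only when neither
// side supplies whitespace at the boundary (avoids "...startsNEXT" merges).
function joinWithOverlap(overlapSuffix: string, nextChunk: string): string {
  const needsSeparator =
    overlapSuffix.length > 0 &&
    nextChunk.length > 0 &&
    !/\s$/.test(overlapSuffix) &&
    !/^\s/.test(nextChunk);
  return needsSeparator
    ? `${overlapSuffix} ${nextChunk}`
    : overlapSuffix + nextChunk;
}
```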

2) Embeddings + Retrieval: Keep It Simple, Deterministic, Cheap

I use text-embedding-3-small for both indexing and query embeddings.

Reasons:

  1. Strong enough quality for a small corpus
  2. Very low cost
  3. Consistent dimensions and behavior across indexing/runtime

Retrieval flow:

  1. Embed query
  2. Query top-K vectors (topK=10)
  3. Filter by RAG_MIN_SCORE (default 0.3)
  4. Return top 5 chunks for the LLM

```ts
const results = await index.query({
  vector: queryEmbedding,
  topK: 10,
  includeMetadata: true,
  includeData: true,
});

const filtered = results.filter(
  (r) => typeof r.score === "number" && r.score >= minScore,
);

return filtered.slice(0, 5);
```

The retrieval module returns structured statuses (ok, no_results, unavailable) so the chat route can degrade cleanly.
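The shapes below are an illustrative sketch of that contract, not the actual module:

```typescript
type ScoredChunk = { id: string; score: number; text: string };

// Structured statuses let the chat route branch explicitly
// instead of inferring failure modes from thrown exceptions.
type RetrievalResult =
  | { status: "ok"; chunks: ScoredChunk[] }
  | { status: "no_results" }
  | { status: "unavailable"; reason: string };

// Pure selection step: threshold filter, then cap at 5 chunks.
function selectChunks(
  results: ScoredChunk[],
  minScore = 0.3,
  limit = 5,
): RetrievalResult {
  const filtered = results
    .filter((r) => typeof r.score === "number" && r.score >= minScore)
    .slice(0, limit);
  return filtered.length > 0
    ? { status: "ok", chunks: filtered }
    : { status: "no_results" };
}
```

Keeping the filter pure also makes it trivially unit-testable, which pays off in the test strategy section below.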

3) Delta Indexing: Production Behavior On A Small Budget

Delta Indexing — Three Scenarios

Naive indexing re-embeds everything every run. That is expensive, slow, and unnecessary.

I store a contentHash in vector metadata, then compare current chunks to existing vectors:

  1. Same ID + same hash -> skip
  2. Same ID + different hash -> re-embed/upsert
  3. Missing ID in current set -> prune stale vector

```ts
if (existingHash && existingHash === chunk.metadata.contentHash) {
  stats.skipped++;
} else {
  changed.push(chunk);
  stats.embedded++;
}
```

This changed indexing behavior from "always re-embed" to true incremental indexing.
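In pure-function form, the delta comparison looks roughly like this (a sketch with assumed names, not the actual script):

```typescript
type ChunkRecord = { id: string; contentHash: string };

// Compare current chunks against existing vector metadata (id -> contentHash).
// Produces the three delta-indexing outcomes: skip, re-embed, prune.
function diffChunks(
  current: ChunkRecord[],
  existing: Map<string, string>,
): { toEmbed: ChunkRecord[]; skipped: number; toPrune: string[] } {
  const toEmbed: ChunkRecord[] = [];
  let skipped = 0;
  for (const chunk of current) {
    if (existing.get(chunk.id) === chunk.contentHash) skipped++;
    else toEmbed.push(chunk);
  }
  const currentIds = new Set(current.map((c) => c.id));
  const toPrune = [...existing.keys()].filter((id) => !currentIds.has(id));
  return { toEmbed, skipped, toPrune };
}
```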

Automating indexing in the build pipeline

To avoid manual mistakes, I created a dedicated indexing script and wired it into the build command:

```json
{
  "scripts": {
    "index-content": "tsx scripts/index-content.ts",
    "build": "npm run index-content && next build"
  }
}
```

This gives me a simple guarantee: every time I run npm run build, content is chunked and indexed before the app is built. So when I add a new article or project, I do not need to remember a separate indexing step.

4) API Route: Tool-Based Retrieval With Guardrails

The chat endpoint uses streamText and exposes a retrieval tool.

```ts
const result = streamText({
  model: openai(process.env.OPENAI_CHAT_MODEL?.trim() || "gpt-4o-mini"),
  system: SYSTEM_PROMPT,
  messages: await convertToModelMessages(messages),
  maxOutputTokens: 1000,
  stopWhen: stepCountIs(5),
  tools: {
    getInformation: tool({ ... }),
  },
});
```

Why tool calling instead of always-retrieve

It allows the model to route behavior:

  1. Off-topic/general question -> answer directly
  2. Site-specific question -> call retrieval tool
  3. Follow-up question -> optionally re-query with refined terms

Validation and status semantics

I hardened request handling with explicit validation error types so invalid client payloads return 400 instead of accidental 500.

Examples:

  1. Empty or malformed message list -> 400
  2. Oversized message content -> 400
  3. Infra/provider failures -> 500

This matters for observability, client behavior, and API correctness.
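A sketch of that split (the error class and size limit here are illustrative assumptions, not the real route code):

```typescript
// Distinguish client mistakes from infrastructure failures by error type.
class ValidationError extends Error {}

const MAX_MESSAGE_CHARS = 4000; // assumed limit for illustration

function validateMessages(messages: unknown): void {
  if (!Array.isArray(messages) || messages.length === 0) {
    throw new ValidationError("messages must be a non-empty array");
  }
  for (const m of messages) {
    const content = (m as { content?: unknown })?.content;
    if (typeof content === "string" && content.length > MAX_MESSAGE_CHARS) {
      throw new ValidationError("message content too large");
    }
  }
}

// The route's catch block maps error type to HTTP status.
function statusFor(error: unknown): number {
  return error instanceof ValidationError ? 400 : 500;
}
```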

5) Client Streaming + Citation Rendering

The UI uses the AI SDK chat stream and extracts citations from tool output metadata (not from raw text tokens).

```ts
function extractCitations(parts: UIMessage["parts"]): ChatCitation[] {
  const citationsById = new Map<number, ChatCitation>();
  // parse tool output parts with state === "output-available"
  return Array.from(citationsById.values());
}
```

Then the message component parses [1], [2] in response text and binds them to source links.

Two production details that matter:

  1. Stable memoization in ChatMessage with a custom comparator that checks citation content, not only citation count.
  2. Safe link routing for internal vs external URLs (<Link> for internal paths, <a target="_blank" rel="noopener noreferrer"> for external links).
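The marker-parsing step can be sketched as a pure function (a simplified stand-in for the actual component logic):

```typescript
// Split response text into plain-text segments and [n] citation markers,
// so the renderer can bind each marker to its source link.
type Segment =
  | { type: "text"; value: string }
  | { type: "citation"; id: number };

function parseCitationMarkers(text: string): Segment[] {
  const segments: Segment[] = [];
  const pattern = /\[(\d+)\]/g;
  let last = 0;
  for (const match of text.matchAll(pattern)) {
    const start = match.index ?? 0;
    if (start > last) {
      segments.push({ type: "text", value: text.slice(last, start) });
    }
    segments.push({ type: "citation", id: Number(match[1]) });
    last = start + match[0].length;
  }
  if (last < text.length) {
    segments.push({ type: "text", value: text.slice(last) });
  }
  return segments;
}
```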

Streaming UX: How I approached the user experience

Streaming quality is not only a backend concern. It is a user-experience concern.

I focused on these interaction details:

  1. Immediate feedback: responses start streaming token-by-token instead of waiting for the full completion.
  2. Visible progress: the active assistant message shows a typing indicator while streaming.
  3. Stable reading flow: previous messages are memoized so only the active message updates during the stream.
  4. Citation timing: citations render after the response stabilizes, which avoids distracting layout jumps mid-stream.
  5. Input behavior: the send path is guarded while a request is in flight, preventing accidental duplicate submissions.
  6. Mobile ergonomics: visual viewport handling prevents keyboard and drawer animation conflicts on iOS.

These decisions improved perceived performance and made the chat feel more reliable even before raw latency optimizations.

6) Reliability And Test Strategy

Test Strategy — Three Layers

I treat this as an engineering system, not only a demo. I added tests at multiple layers:

  1. Pure-function tests for delta indexing logic
  2. Component tests for citation rendering and link behavior
  3. Route-level tests with mocked dependencies to verify request validation and model call settings

```ts
expect(streamTextMock).toHaveBeenCalledWith(
  expect.objectContaining({ maxOutputTokens: 1000 }),
);
```

That gives confidence the contract stays correct while iterating quickly.

Results

Cost (estimated at ~1K monthly queries)

| Service | Cost |
| --- | --- |
| Embeddings (indexing) | < $0.01 |
| Embeddings (queries) | ~ $0.01 |
| GPT-4o-mini generation | ~ $0.50 - $2.00 |
| Upstash Vector (free tier) | $0 |
| Total | ~ $0.50 - $2.00/month |

Runtime

| Metric | Target | Observed |
| --- | --- | --- |
| Time to first token (warm) | < 1.5s | ~1s |
| Full answer latency | < 8s | ~4-6s |
| No-change re-indexing | < 3s | ~2s |

What I Would Improve Next

  1. Add an eval set for retrieval quality (query -> expected source) to catch regressions automatically.
  2. Run chunk-size experiments (300, 450, 600) and measure citation accuracy, not just latency.
  3. Consider a reranker only if corpus size and ambiguity grow enough to justify extra cost/latency.
  4. Add lightweight request tracing for retrieval status and citation usage in production.

Practical Takeaways

  1. Start with chunk quality. Better chunking beats premature model tuning.
  2. Keep retrieval explicit and observable through tools and structured statuses.
  3. Use content hashes for build-time indexing economics.
  4. Treat API semantics (400 vs 500) as product behavior, not implementation detail.
  5. Build tests at the right layer: pure logic, component behavior, and route integration.

Why This Project Matters

This project was intentionally scoped small, but engineered with production discipline:

  1. Architecture and tradeoff decisions were explicit.
  2. Reliability bugs were identified and fixed.
  3. Test coverage was added for core failure modes.
  4. Cost, latency, and UX constraints were measured and met.

If you are building a similar assistant for a docs site, portfolio, or internal knowledge base, this pattern is a practical baseline that scales from side project constraints to production expectations.