Building a RAG Chatbot for My Personal Website
How I built a retrieval-augmented generation chatbot that answers questions about my blog posts and projects, for under $1/month.
All the articles I've posted.
How gathering your information in one place transforms AI from a generic assistant into your personal superpower.
Why this feels like the shift from horsepower to a real vehicle—and why I am excited about the future.
Why vLLM requests can take 2-3 hours under heavy load: an analysis of KV cache block exhaustion and queue starvation.
Continuing the Mini SGLang deep dive: request batching, overlap scheduling, and tensor parallelism.
A deep dive into the Mini SGLang architecture: system design, engine initialization, KV cache, and the lifecycle of a single request.
A comprehensive look at modern techniques for optimizing transformer model inference.
Best practices for managing GPU memory in deep learning applications.
Breaking down attention mechanisms from self-attention to multi-head attention.