Model Support in a VLM Serving Stack Is Not a Checkbox - It Is a Six-Layer Systems Contract
Why real multimodal model support is a six-layer serving-stack contract, from API extraction to decoder reentry.
All the articles I've posted.
How I designed, implemented, and hardened a cost-efficient RAG chatbot for my personal site with citations, streaming, and build-time indexing.
How gathering your information in one place transforms AI from a generic assistant into your personal superpower.
Why this feels like the shift from horsepower to a real vehicle—and why I am excited about the future.
Why vLLM requests can take 2-3 hours under heavy load - analyzing KV cache block exhaustion and queue starvation.
Continuing the Mini SGLang deep dive - covering request batching, overlap scheduling, and tensor parallelism.
Deep dive into Mini SGLang architecture - covering system design, engine initialization, the KV cache, and the single-request lifecycle.