How the chat works: RAG, grounded in my own writing

TL;DR

It’s RAG. Retrieval-Augmented Generation: fetch the relevant text first, then let the model answer from it.
Content becomes chunks. Every MDX file is split by ## heading into passages.
Chunks become vectors. Each chunk is embedded into a 1,536-dimension vector and stored in Postgres (pgvector, HNSW index, cosine distance).
Questions get embedded too. Your question becomes a vector, and the five nearest chunks above a similarity threshold come back.
The model answers only from those. Retrieved chunks go into the prompt; the answer is grounded and cited. If nothing matches, it says so instead of inventing.

Why ground a chatbot at all

A raw language model will happily answer “what does Peter do?” from whatever it half-remembers from training — which, for a specific private person, is mostly nothing, so it fills the gap with plausible fiction. That’s the worst possible failure mode for a personal site: a confident, wrong biography.

Grounding flips the contract. The model is told, in effect: here are the only passages you’re allowed to use; answer from these or admit you don’t know. Every claim traces back to something I actually wrote, and the answer shows you the sources. When I want the chat to know something new, I edit the writing, not a prompt or a database row.

What RAG actually is

RAG (Retrieval-Augmented Generation) is two steps glued together. Retrieval: find the passages most relevant to the question. Generation: give those passages to the model and have it write the answer. The “augmented” part is that the model’s own knowledge is supplemented, at question time, with text it didn’t see during training.

The trick that makes retrieval work is embeddings: turning text into a list of numbers (a vector) such that passages about similar things land near each other in space. Find the question’s neighbors, and you’ve found the relevant passages. So yes — to answer the obvious question directly: this chat is RAG.
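To make “near each other in space” concrete, here is a minimal sketch of cosine similarity in plain Python. The three-number vectors and their labels are toy values invented for illustration; the real embeddings in this chat have 1,536 dimensions and come from an embedding model, not from hand-picked numbers.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, not real embeddings).
question = [0.9, 0.1, 0.0]
music_chunk = [0.8, 0.2, 0.1]
work_chunk = [0.0, 0.2, 0.9]

scores = {
    "music": cosine_similarity(question, music_chunk),
    "work": cosine_similarity(question, work_chunk),
}
nearest = max(scores, key=scores.get)  # the chunk pointing the same way as the question
```

The question vector points almost the same way as the music chunk and almost at right angles to the work chunk, so the music chunk is its nearest neighbor.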

From writing to vectors

Everything the chat can cite lives in MDX files under content/ — the About page, the six case studies, the essays. A build step chunks and embeds them:

Chunk by heading. Each file is split on its ## sections, so one chunk is one coherent topic. The frontmatter (title + summary) becomes its own lead-in chunk. Anything over 4,000 characters is split further on paragraph breaks.
Embed each chunk. Every chunk goes through openai/text-embedding-3-small (via the Vercel AI Gateway) and comes back as a 1,536-number vector.
Only re-embed what changed. Each chunk’s text is hashed (SHA-256). On the next build, unchanged chunks keep their existing vector; only edited ones get re-embedded. Editing one section of one file re-embeds a handful of chunks, not the whole corpus.
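The chunking rules above can be sketched in a few lines of Python. This is an illustration under assumptions, not the site’s build step: the names `Chunk`, `chunk_mdx`, and `split_long` are hypothetical, frontmatter parsing is skipped (title and summary are passed in directly), and labeling the frontmatter chunk “(preamble)” is a guess based on the labels that show up in the retrieval results.

```python
import re
from dataclasses import dataclass

MAX_CHARS = 4000  # size limit stated in the text


@dataclass
class Chunk:
    heading: str
    text: str


def split_long(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Greedily pack paragraphs into pieces of at most `limit` characters.

    A single paragraph longer than `limit` is kept whole in this sketch.
    """
    if len(text) <= limit:
        return [text]
    pieces: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) > limit and current:
            pieces.append(current)
            current = para
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


def chunk_mdx(title: str, summary: str, body: str) -> list[Chunk]:
    """One lead-in chunk from the frontmatter, then one chunk per `## ` section."""
    chunks = [Chunk("(preamble)", f"{title}\n\n{summary}")]
    # Split in front of every line that starts a level-2 heading, keeping the heading.
    for section in re.split(r"(?m)^(?=## )", body):
        section = section.strip()
        if not section:
            continue
        first_line = section.split("\n", 1)[0]
        # Body text before the first heading gets a placeholder label (assumption).
        heading = first_line[3:].strip() if first_line.startswith("## ") else "(untitled)"
        for piece in split_long(section):
            chunks.append(Chunk(heading, piece))
    return chunks
```

Splitting on headings first and on paragraph breaks only as a fallback keeps each chunk about one topic, which is what makes a single similarity score per chunk meaningful.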

pnpm embeddings (diff by content hash)

  about: +2 -2 =6
  work/bonvivant: +0 -0 =5
  writing/the-software-house-of-one: +1 -1 =4
  Done. Added 3, deleted 3, unchanged 64.

That’s a real run from editing two sections: 3 chunks changed, 64 untouched. No wasted embedding calls, no wasted tokens.
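The bookkeeping behind that output can be sketched as a set comparison on SHA-256 hashes. The function names and the shape of the stored data are assumptions for illustration; the actual script may track hashes per file rather than in one set.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 of the chunk text, as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_chunks(stored_hashes: set[str], current_texts: list[str]):
    """Compare stored chunk hashes with the chunks produced by this build.

    Returns (texts that need embedding, hashes to delete, unchanged count).
    """
    current = {content_hash(t): t for t in current_texts}
    current_hashes = set(current)
    to_add = [t for h, t in current.items() if h not in stored_hashes]
    to_delete = sorted(stored_hashes - current_hashes)
    unchanged = len(stored_hashes & current_hashes)
    return to_add, to_delete, unchanged


# Hypothetical example: one of three chunks was edited since the last build.
stored = {content_hash(t) for t in ["intro", "path", "side b"]}
to_add, to_delete, unchanged = diff_chunks(stored, ["intro", "path (edited)", "side b"])
summary = f"+{len(to_add)} -{len(to_delete)} ={unchanged}"
```

An edited chunk shows up as one addition plus one deletion, because its new text hashes to a value the table has never seen while the old hash no longer matches anything in the build.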

Where the vectors live

Chunks and their vectors sit in a single Postgres table (Supabase + the pgvector extension). The vector column has an HNSW index built on cosine distance. HNSW is an approximate-nearest-neighbor index: it finds the closest vectors fast without scanning every row. A scope column lets two knowledge bases share the table: peterwd for this chat, and pico-demo for the embedded restaurant demo. Every search filters on scope, so the two never bleed into each other.
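A sketch of what such a table and its search query could look like, held as Python string constants so they can be handed to a Postgres driver. The table name, column names, and index name are hypothetical. The pgvector pieces are real: the `vector(1536)` type, the `hnsw` index method with `vector_cosine_ops`, and the `<=>` cosine-distance operator (similarity is 1 minus that distance). The `%(name)s` placeholders assume a driver with pyformat parameters, such as psycopg.

```python
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS chunks (
    id           bigserial PRIMARY KEY,
    scope        text NOT NULL,   -- 'peterwd' or 'pico-demo'
    source       text NOT NULL,   -- e.g. 'about', 'work/bonvivant'
    heading      text NOT NULL,
    content      text NOT NULL,
    content_hash text NOT NULL,   -- SHA-256 of content
    embedding    vector(1536) NOT NULL
);

CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

SEARCH_SQL = """
SELECT source, heading, content,
       1 - (embedding <=> %(query)s::vector) AS similarity
FROM chunks
WHERE scope = %(scope)s
  AND 1 - (embedding <=> %(query)s::vector) > %(threshold)s
ORDER BY embedding <=> %(query)s::vector
LIMIT %(limit)s;
"""


def search_params(query_vector, scope="peterwd", threshold=0.3, limit=5):
    """Build the parameter dict for SEARCH_SQL.

    pgvector accepts vectors in the text form '[x1,x2,...]'.
    """
    literal = "[" + ",".join(repr(float(x)) for x in query_vector) + "]"
    return {"query": literal, "scope": scope, "threshold": threshold, "limit": limit}
```

Putting the scope filter in the `WHERE` clause of the one shared query is what keeps the two knowledge bases separate: there is no code path that searches without it.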

Answering a question, step by step

When you send a message, the chat route runs this pipeline before a single word is generated:

1. Embed the question with the same model used at index time, so it lands in the same vector space as the chunks.
2. Vector search for the top 5 chunks whose cosine similarity beats 0.3. This is the core retrieval call.
3. Check confidence. If the best match is below 0.45, a second search (threshold 0.15, titles only) collects nearby section names, so the model can point you toward adjacent topics instead of just saying “nothing matches.”
4. Build the prompt. The retrieved chunks, a voice profile (style only, no biography), and grounding rules (“use only this context; don’t invent”) go into the system prompt.
5. Generate & cite. anthropic/claude-sonnet-4-6 streams the answer. The sources behind it attach as citation chips, which are suppressed when the top match is below 0.25, so a thin retrieval never wears a citation it didn’t earn.
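The thresholds in steps 2, 3, and 5 can be sketched as one small decision function. This is an illustration, not the route’s actual code: `Hit`, `Plan`, `top_hits`, and `plan_answer` are hypothetical names, the list of already-scored chunks stands in for the database search, and dropping titles that were already retrieved from the fallback list is an assumption.

```python
from dataclasses import dataclass, field

RETRIEVE_FLOOR = 0.3    # step 2: minimum similarity for a chunk to be retrieved
CONFIDENCE_GATE = 0.45  # step 3: below this, run the titles-only fallback search
FALLBACK_FLOOR = 0.15   # step 3: looser threshold for the fallback
CITATION_FLOOR = 0.25   # step 5: below this, suppress citation chips
TOP_K = 5


@dataclass
class Hit:
    title: str
    similarity: float


@dataclass
class Plan:
    hits: list[Hit]
    nearby_titles: list[str] = field(default_factory=list)
    show_citations: bool = False


def top_hits(scored: list[Hit], floor: float, k: int = TOP_K) -> list[Hit]:
    """Best k hits strictly above the floor, highest similarity first."""
    above = [h for h in scored if h.similarity > floor]
    return sorted(above, key=lambda h: h.similarity, reverse=True)[:k]


def plan_answer(scored: list[Hit]) -> Plan:
    """Decide what goes into the prompt and whether citations are shown."""
    hits = top_hits(scored, RETRIEVE_FLOOR)
    best = hits[0].similarity if hits else 0.0
    plan = Plan(hits=hits, show_citations=best >= CITATION_FLOOR)
    if best < CONFIDENCE_GATE:
        retrieved = {h.title for h in hits}
        plan.nearby_titles = [
            h.title
            for h in top_hits(scored, FALLBACK_FLOOR)
            if h.title not in retrieved  # assumption: only suggest new titles
        ]
    return plan
```

Keeping the thresholds as named constants in one place makes the behavior easy to audit: each number in the steps above maps to exactly one comparison.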

Four questions, four retrievals

These are real results from the live vector search — the actual chunks and similarity scores that come back for four questions. Watch the threshold do its job, including the last one, where the right answer is to retrieve nothing.

Strong match · scope: peterwd · top similarity 0.556

Question

“What does Peter do at Scripps?”

Retrieved chunks (top 5, threshold 0.3)

0.556  About Peter Dirickson · Scripps Health  [about]
0.453  About Peter Dirickson · Path  [about]
0.429  About Peter Dirickson · (preamble)  [about]
0.428  About Peter Dirickson · Side B  [about]
0.416  About Peter Dirickson · Identity  [about]

What the model does

Top similarity 0.556 is well above the 0.45 confidence gate. The Scripps Health chunk lands first, the model answers straight from it, and the answer carries a citation to about.

Right case study surfaces · scope: peterwd · top similarity 0.553

Question

“Tell me about BonVivant”

Retrieved chunks (top 5, threshold 0.3)

0.553  BonVivant · (preamble)  [work/bonvivant]
0.548  BonVivant · The wager  [work/bonvivant]
0.462  About Peter Dirickson · Now  [about]
0.354  Vaifredo · The wager  [work/vaifredo]
0.349  Pico · (preamble)  [work/pico]

What the model does

The two BonVivant chunks score highest, but adjacent products (Vaifredo, Pico) also clear 0.3 because the question overlaps with vocabulary used across the portfolio. The model grounds on the top BonVivant chunks; the lower hits give it context without dominating.

Single relevant section · scope: peterwd · top similarity 0.485

Question

“Does Peter make music?”

Retrieved chunks (top 5, threshold 0.3)

0.485  About Peter Dirickson · Side B  [about]
0.444  About Peter Dirickson · (preamble)  [about]
0.327  About Peter Dirickson · Path  [about]
0.312  About Peter Dirickson · Identity  [about]

What the model does

Only one section (Side B) really answers this, and it tops out at 0.485 — above 0.45, so confidence holds. Just four chunks clear the 0.3 floor; the rest of the corpus is correctly left out.

Nothing matches (correct) · scope: peterwd · top similarity n/a

Question

“What's the capital of France?”

Retrieved chunks (top 5, threshold 0.3)

(none scored above 0.3 — retrieval returns empty)

What the model does

Zero chunks score above 0.3, so retrieval returns empty. This is the trigger for the honest path: the model is told the context is empty and says it can't help with that rather than answering from general knowledge. No citation is attached.
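One way the empty case could reach the model is through the system prompt itself. The sketch below is an assumption about shape, not the chat’s real prompt: `build_system_prompt`, the wording of `GROUNDING_RULES`, and the empty-context marker are all hypothetical.

```python
GROUNDING_RULES = (
    "Use only the context below. If the context does not answer the "
    "question, say you don't know. Do not invent facts."
)

EMPTY_CONTEXT = "(empty: no passages matched this question)"


def build_system_prompt(voice_profile: str, chunks: list[tuple[str, str]]) -> str:
    """Assemble voice profile, grounding rules, and retrieved context.

    `chunks` is a list of (source, text) pairs; an empty list produces an
    explicit empty-context marker instead of silently omitting the section.
    """
    if chunks:
        context = "\n\n".join(f"[{source}]\n{text}" for source, text in chunks)
    else:
        context = EMPTY_CONTEXT
    return "\n\n".join([voice_profile, GROUNDING_RULES, "Context:\n" + context])
```

Stating the emptiness explicitly, rather than leaving the context section out, gives the model a clear signal to decline instead of falling back on general knowledge.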

What grounding buys you

Honest gaps. When the corpus doesn’t cover something, the chat says so instead of fabricating a confident answer.
Real citations. Every grounded answer shows the sources it drew from, and they’re suppressed when retrieval was too thin to count.
Edits flow from the writing. To change what the chat knows, I edit an MDX file and re-run the embeddings. The rendered page and the chat’s knowledge come from the same source.
No hallucinated biography. The voice profile controls how it talks; the indexed content controls what it can claim. Identity facts stay grounded, never improvised.

Grounding is one half of keeping an AI feature honest; scoring its output is the other. The companion guide, LLM-as-judge, covers how every answer this chat gives is graded for groundedness and voice. If you’re building an AI feature and want to talk about keeping it honest in production, the chat panel on the home page is the fastest way to reach me — or .