What “index-matched chunks” meant in our stack

The phrase collides with two different “index” ideas in this workspace:

  1. Memory RAG (this article): Each retrieved unit is a vector row keyed by chunk_id, with folder-like area_path and document fields. Prompt formatting labels hits as numbered entries and prints area_path/document so the model can cite a stable source. Duplicate hits from multiple queries collapse by chunk_id, keeping the highest score.

  2. macOS UI automation (Rabbit / BCL): “Index” means the interactive element index from the last discover pass; not text-chunk identity. That is a different pipeline (accessibility tree, click-by-index).

This article is about (1) only.

Folder and file system as the user-visible database

Each conversation gets a directory under conversations/<id>/ with multiple Markdown artifacts—for example a full thread, a user-inputs-only file (to bias retrieval toward the user’s words), rotating AI notes, and summaries. The design treats the filesystem as a human-auditable backing store while SQLite holds embeddings for fast search.

Chunking prefers semantic units: entries parsed from time-stamped sections (not arbitrary token windows), producing chunk IDs that combine area path, document name, and timestamp. That makes chunk_id match the logical entry in the tree, which is what we mean by “chunk-matched” in prompts: the model sees text that traces back to a specific path and time.

Every message: assemble, format, inject

On each user message, the backend can call getContextForTurn, which aggregates AI notes, a life-areas knowledge tree, multi-query vector search over the current transcript, and recent conversation history. The formatted string becomes conversation.ragContext before streaming the LLM response.

Multi-query strategy runs three embedding searches (recent user text, all user text, full conversation) with weights, merges results, applies recency and optional area boosts, and takes a top-K after deduplication. The intent is to balance immediate intent against longer user history and full-dialog drift without a single naive query.

A later architecture fix documented in code moved RAG and artifact context into a shared prefix shared by OpenAI, Anthropic, and Gemini builders—previously some providers could miss artifact or RAG context entirely.

Troubleshooting and operational lessons

Graceful degradation: If embeddings or the database are not ready (first launch, dev flags skipping the DB), the server logs a warning and continues without RAG so the user still gets a reply.

Cold start: First conversations legitimately return empty vectors; that is expected, not a bug.

Integration gap (fixed in code comments): Context assembly existed before the websocket handler called it; wiring getContextForTurn into handleUserMessage was the critical path fix so the feature actually ran in production flows.

What may be novel

Combining local Transformers.js embeddings, chunk_id-stable dedup across three queries, explicit area_path/document provenance in the prompt, and provider-agnostic injection is a coherent pattern for privacy-sensitive desktop voice agents. Novelty is relative; the contribution is end-to-end documentation of a working integration, not a new embedding model.

Boundaries

Token budgets and weights are configurable; exact production numbers should be read from PromptManagementService / admin config at runtime. Do not conflate this RAG pipeline with iOS Accountability’s timer-based turn-taking—they are separate products in this repo collection.