What “index-matched chunks” meant in our stack
The phrase collides with two different “index” ideas in this workspace:
Memory RAG (this article): Each retrieved unit is a vector row keyed by
chunk_id, with folder-likearea_pathanddocumentfields. Prompt formatting labels hits as numbered entries and printsarea_path/documentso the model can cite a stable source. Duplicate hits from multiple queries collapse bychunk_id, keeping the highest score.macOS UI automation (Rabbit / BCL): “Index” means the interactive element index from the last
discoverpass; not text-chunk identity. That is a different pipeline (accessibility tree, click-by-index).
This article is about (1) only.
Folder and file system as the user-visible database
Each conversation gets a directory under conversations/<id>/ with multiple Markdown artifacts—for example a full thread, a user-inputs-only file (to bias retrieval toward the user’s words), rotating AI notes, and summaries. The design treats the filesystem as a human-auditable backing store while SQLite holds embeddings for fast search.
Chunking prefers semantic units: entries parsed from time-stamped sections (not arbitrary token windows), producing chunk IDs that combine area path, document name, and timestamp. That makes chunk_id match the logical entry in the tree, which is what we mean by “chunk-matched” in prompts: the model sees text that traces back to a specific path and time.
Every message: assemble, format, inject
On each user message, the backend can call getContextForTurn, which aggregates AI notes, a life-areas knowledge tree, multi-query vector search over the current transcript, and recent conversation history. The formatted string becomes conversation.ragContext before streaming the LLM response.
Multi-query strategy runs three embedding searches (recent user text, all user text, full conversation) with weights, merges results, applies recency and optional area boosts, and takes a top-K after deduplication. The intent is to balance immediate intent against longer user history and full-dialog drift without a single naive query.
A later architecture fix documented in code moved RAG and artifact context into a shared prefix shared by OpenAI, Anthropic, and Gemini builders—previously some providers could miss artifact or RAG context entirely.
Troubleshooting and operational lessons
Graceful degradation: If embeddings or the database are not ready (first launch, dev flags skipping the DB), the server logs a warning and continues without RAG so the user still gets a reply.
Cold start: First conversations legitimately return empty vectors; that is expected, not a bug.
Integration gap (fixed in code comments): Context assembly existed before the websocket handler called it; wiring getContextForTurn into handleUserMessage was the critical path fix so the feature actually ran in production flows.
What may be novel
Combining local Transformers.js embeddings, chunk_id-stable dedup across three queries, explicit area_path/document provenance in the prompt, and provider-agnostic injection is a coherent pattern for privacy-sensitive desktop voice agents. Novelty is relative; the contribution is end-to-end documentation of a working integration, not a new embedding model.
Boundaries
Token budgets and weights are configurable; exact production numbers should be read from PromptManagementService / admin config at runtime. Do not conflate this RAG pipeline with iOS Accountability’s timer-based turn-taking—they are separate products in this repo collection.