Quality & improvement

Iterating on accuracy, corrections, verification, speed, and outdated notes.

Iterating

Plain-language notes on quality, trust, and improvement. Nothing here is built yet; these are directions to explore.

Accuracy

How can I make sure it is accurate?

The honest baseline: RAG does not guarantee truth. It finds passages that look relevant and asks a model to summarize them. Accuracy depends on (1) what you ingested, (2) whether retrieval found the right passages, and (3) whether the model stayed faithful to those passages.

Things to try over time:

  • Citations first. Always show which excerpts were used. If the answer cannot point to a source, treat it as suspect.
  • Retrieve before you answer. The model should only speak from retrieved text, not from general world knowledge (your system prompt already pushes this way).
  • Better source material. Garbage in, garbage out. Notes that are clear, dated, and generalized beat messy OCR.
  • Smell tests. Ask questions you already know the answer to. If it gets those wrong, fix retrieval or prompts before trusting it on hard questions.
  • Separate “I don’t know” from guessing. Prefer a short “the notes don’t say” over a confident wrong answer.

Accuracy is not one knob. It is retrieval quality + faithful synthesis + corpus quality working together.


Corrections

When it gives a bad answer, how do I correct it?

You are not “re-training” a model in the big ML sense. With local RAG, you usually fix the knowledge base or the pipeline, not the weights.

Practical correction paths:

  1. Fix the note. Edit or add a Markdown entry with the right information. Re-ingest that file (or that notebook). The next query should pick up the fix.
  2. Add a correction entry. New dated section: “2026-06-21 — correction: previously I wrote X; the current view is Y.” Retrieval can then find both history and update.
  3. Re-tag / metadata (future). Tags like topic:pricing, status:current, status:archived so retrieval can prefer “current” chunks.
  4. Feedback log (future). A simple local file: question, bad answer, what it should have said, which note ID was wrong. Use that to tune chunk size, k, or prompts — not magic, but a record of failure modes.

What “re-tag” means in practice: you are labeling chunks so search knows what they are (topic, date, superseded or not). Chroma already stores metadata per chunk; extending that is engineering, not re-training.


Verification

Can I add a verifier step that double-checks every response?

Yes, in principle. A common pattern:

  1. Retrieve passages.
  2. Draft an answer from those passages.
  3. Verify — a second pass (same or smaller model) that asks: “Is every claim in this answer supported by these passages? List unsupported claims.”
  4. Publish only if verification passes, or return the draft with warnings.

Tradeoffs:

  • Pros: Catches “the model made something up” when the claim is not in the excerpts.
  • Cons: Slower (two model calls), costs more compute, verifier can also be wrong, and it cannot fix bad retrieval — if the right note was never retrieved, verification only checks against the wrong context.

Plain rule: Verifier reduces unsupported synthesis; it does not fix missing or wrong sources.


Accuracy

How can I reduce or limit hallucinations?

LeverIdea
**Strict prompt**Answer only from excerpts; say when information is missing.
**Lower temperature**Less creative paraphrase, more literal (when your stack exposes this).
**Citation requirement**Every factual sentence should map to a chunk `[1]`, `[2]`.
**Verifier pass**See above.
**Smaller claims**Ask narrow questions; big vague questions invite invention.
**Stronger retrieval**If the right chunk is in the top results, the model has less room to wander.
**Do not ask it to predict**“What should I do?” pulls in general advice; “What did I write about X?” stays grounded.

Hallucination here usually means: the model said something that is not in your notes (or twisted what they say).

Levers:

You will not get to zero. Goal is: fail visibly (missing info, weak citations) instead of fail confidently.


Accuracy

How can I make it fast to respond?

Where time goes today:

  1. Embedding your question (Ollama).
  2. Vector search (Chroma — usually fast).
  3. Chat completion (Ollama — usually the slowest part).
  4. Optional verifier (doubles chat time).

Speed ideas:

  • Smaller chat model for drafts; bigger model only when you need quality.
  • Fewer chunks (k) — less context to read, faster, but easier to miss the right note.
  • Cache embeddings — do not re-embed unchanged files on every ingest.
  • Keep Ollama warm — first request after idle is slower.
  • Skip verifier by default — turn on for “important” queries only.
  • Pre-filter by date or notebook — search a smaller index when you know the scope.

Tradeoff to accept: faster often means sloppier retrieval or shorter answers. Tune for “good enough in 5 seconds” vs “best possible in 30 seconds.”


Accuracy

How can it get better over time — including conflicting or outdated notes?

Real notebooks contradict themselves. That is normal. The system should reflect decisions, not pretend there was only ever one truth.

Patterns to iterate on:

1. Dates are your friend

You already chunk by ## YYYY-MM-DD. Newer entries should often win when the question is “what do we do now?” Prompts can say: prefer the latest dated entry when entries conflict, unless the question is historical.

2. Explicit status on chunks (metadata)

When you ingest, attach fields like:

  • status: current — this is how things are done now.
  • status: archived — kept for history, not active guidance.
  • supersedes: notebook-01:3 — this entry replaces an older chunk.

Retrieval can boost current and down-rank archived, or filter them out unless you ask “what did I used to think?”

3. Archive, do not delete

When you change your mind, keep the old note but mark it archived in metadata or add a line at the top: “Superseded 2026-06-21 — see entry on that date.” Deleting old text loses audit trail; archiving preserves “why we changed.”

4. Decision entries

When two docs disagree, add a short decision note:

Decision (2026-06-21): We no longer follow approach A. Approach B is current. Reason: …

That becomes the chunk retrieval should prefer for “how do we do this?”

5. Feedback loop without ML

“Better over time” does not require retraining models:

  • Better notes and decision entries.
  • Better tags and status fields.
  • Tune k, chunk size, prompts from repeated failures.
  • Optional reviewer log of bad answers.

6. Conflicting info in one answer

When retrieval pulls both old and new, the model should say they conflict, cite both dates, and state which is marked current — not silently merge them.


Roadmap

Suggested order to experiment

  1. 1

    Citations + “I don’t know” — already partly there; tighten prompts.

  2. 2

    Correction workflow — edit note → re-ingest → re-ask.

  3. 3

    Status metadata — `current` / `archived` on ingest.

  4. 4

    Decision entries — manual, when you change your mind.

  5. 5

    Verifier pass — optional, for high-stakes questions.

  6. 6

    Speed tuning — model size, `k`, skip verifier by default.

  7. 7

    Multimodal — images and scans (separate track from accuracy of text RAG).

Bar

What success looks like (personal bar)

  • I can see which notes an answer came from.
  • When it is wrong, I know whether to fix the note, the tags, or the question.
  • Outdated guidance stays findable but not default.
  • Responses feel fast enough to use daily, not just demo once.
  • I am not paying a cloud service to hold the corpus, and I am not guessing whether the model made it up without checking the citations.