Did That Prompt Still Work? Keep a Plain-Text Eval File

A personal AI eval is a plain-text file per prompt — the input, the expected shape, and the last answer you trusted — that you re-run and eyeball-diff yourself. You are not buying a dashboard. You are keeping a fixture you own, so that when a prompt that worked last month returns garbage, you can say exactly what changed.

Most people keep nothing. They write a prompt, it works, they move on. Then the hosted model updates underneath them and the output quietly shifts, and they have no earlier answer to compare against, so they cannot even prove it broke. A Stanford and UC Berkeley team measured this directly: "the behavior of the 'same' LLM service can change substantially in a relatively short amount of time" ¹. The fix is not a platform. It is a habit, and the habit fits in a folder.

What most people believe about a working prompt

The comforting belief is that a prompt is a settled thing. You tuned the wording, the output looked right, and the model is the same model it was yesterday, so the prompt should keep working. On this view a prompt is like code that compiled once and never needs recompiling.

That belief is reasonable for software you run locally. It is wrong for a prompt aimed at a hosted model. The model is not a fixed dependency you pinned; it is a remote service that the vendor revises on a schedule you do not see. Your prompt text stays identical while the thing reading it changes. The output can drift without a single character of your wording moving, which is exactly why "I didn't touch it" is no defense.

Why "I didn't change anything" stops being true

Hosted models change underneath you, and the change is invisible until the output is wrong. The same prompt can degrade because the service was updated, not because you edited it. Developers name two causes: "prompt drift" and "model drift (upstream model updates such as GPT-4o change behaviour)" ². Model drift is the one you cannot see coming.

The Stanford and UC Berkeley study put a number on it. "GPT-4 (March 2023) was reasonable at identifying prime vs. composite numbers (84% accuracy) but GPT-4 (June 2023) was poor on these same questions (51% accuracy)" ¹. Same service, same task, three months apart, and accuracy fell by 33 percentage points. Read it carefully, though: the paper found change in both directions, since GPT-3.5 improved at the same task over the same window ¹. The honest lesson is not "models always get worse." It is that a hosted model can shift, silently, and in either direction, and you will not notice on the one prompt you care about unless you kept something to check it against. The drift in the study showed up across three months, not three years, so the practical cadence is not "set it and forget it." Re-check when a vendor announces a model update, when you edit the prompt, and otherwise on a slow calendar you actually keep. The authors close on the obvious consequence: the result highlights "the need for continuous monitoring of LLMs" ¹.
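The three re-check triggers can be written down as a small rule. What follows is a minimal sketch in Python, assuming you note the run date, the model label, and a hash of the prompt text each time you re-run. The function names, the model labels, and the 90-day default are illustrative assumptions (90 days simply mirrors the study's three-month window), not anything the study prescribes.

```python
import hashlib
from datetime import date, timedelta


def prompt_fingerprint(prompt_text: str) -> str:
    """Short hash of the prompt text, so a later edit is detectable."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


def needs_recheck(last_run, last_model, last_fingerprint,
                  today, current_model, prompt_text, max_age_days=90):
    """Return the reasons a re-run is due; an empty list means nothing is due."""
    reasons = []
    if current_model != last_model:
        reasons.append("model changed")
    if prompt_fingerprint(prompt_text) != last_fingerprint:
        reasons.append("prompt edited")
    if today - last_run > timedelta(days=max_age_days):
        reasons.append("calendar check overdue")
    return reasons


# Hypothetical values: the model labels are placeholders, not real model names.
prompt = "Extract action items from the meeting notes below as a bulleted list."
saved = prompt_fingerprint(prompt)
print(needs_recheck(date(2026, 6, 21), "model-a", saved,
                    date(2026, 6, 30), "model-b", prompt))
```

The rule only says that a re-run is due. It cannot detect an unannounced model change, which is why the calendar trigger is there as a backstop.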

The fixture you own instead

The alternative is small and durable: one plain-text file per prompt, holding three things — the exact input you send, the shape you expect back, and the last answer you were happy with. You re-run the prompt, paste the new answer in, and read the two side by side. The file is yours to keep, diff, and version forever.

This is the personal-scale form of a practice teams already document. Engineers call the saved baseline a golden dataset, "a curated collection of inputs and their ideal outputs or evaluation criteria" ³, and the small form is legitimate. The guidance is to "start small with 10-20 high-priority examples that cover your most critical use cases and common edge cases" and to "store this dataset as a CSV or JSON file in your repository" ⁴. A folder of plain .md files is the same idea, scaled to one person and one prompt at a time. The point is not the file format. The point is that a reproducible baseline lives somewhere you control, not on a server someone else can change or retire.

Why plain text, specifically? Because the eval should outlast every tool you ran it through. Steph Ango, Obsidian's CEO, put the principle plainly: "Apps are ephemeral, but your files have a chance to last" ⁵. An eval kept inside a vendor's dashboard is only as durable as that vendor's roadmap. An eval kept in a file you own meets Ango's standard of "files you can control, in formats that are easy to retrieve and read" ⁶, and it is yours whether the platform survives or not.

What to write down tomorrow

The whole technique is three fields and a habit. Save a file per prompt, re-run it on a trigger, and read the diff with your own eyes. Nothing to install, nothing to log into: just a file and a routine. Here is the fixture, and the rule for when to re-run it.

Save a fixture per prompt: the input verbatim, the expected shape, the last-good answer.
Re-run it on a trigger: when the model updates, when you edit the prompt, or on a fixed cadence.
Diff by eye, then version: paste the new answer, compare against last-good, and keep the file under version control so the history shows when behavior moved.
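If reading two answers side by side gets tiring, a line diff can point your eye at what moved. This is a minimal sketch using Python's standard `difflib`, assuming both answers are plain strings you pasted in. The function name `eyeball_diff` and the sample "new answer" are made up for illustration. The script only lays out the differences; judging them is still your job.

```python
import difflib

DASH = "\u2014"  # the dash used as a field separator in the saved answer


def eyeball_diff(last_good: str, new_answer: str) -> str:
    """Return a unified diff of two answers, or an empty string if identical."""
    lines = difflib.unified_diff(
        last_good.splitlines(),
        new_answer.splitlines(),
        fromfile="last-good",
        tofile="new-answer",
        lineterm="",
    )
    return "\n".join(lines)


last_good = "\n".join([
    f"- Maya {DASH} send the budget draft {DASH} Friday",
    f"- Sam {DASH} own the vendor call {DASH} {DASH}",
])
# Invented drifted answer: a preamble crept in and a due date appeared.
new_answer = "\n".join([
    "Here are the action items:",
    f"- Maya {DASH} send the budget draft {DASH} Friday",
    f"- Sam {DASH} own the vendor call {DASH} Monday",
])

print(eyeball_diff(last_good, new_answer))
```

Lines prefixed with `+` are new in the latest answer and lines prefixed with `-` were only in the last-good one. An empty result means the two answers match exactly.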

Here is the shape of one fixture, ready to copy:

# eval: extract-action-items

## input
System: Extract action items from the meeting notes below as a
bulleted list. Each item: owner, task, due date. Notes:
"Maya will send the budget draft by Friday. Sam owns the vendor call."

## expected shape
- A bulleted list, one action per line
- Each line names an owner and a task; due date if stated, else "—"
- No prose preamble, no closing summary

## last-good (re-run: 2026-06-21, model updated → re-check)
- Maya — send the budget draft — Friday
- Sam — own the vendor call — —

## notes
Re-run when the model version changes or I touch the prompt.
Watch for: preamble creeping back in; due dates hallucinated.

The fixture earns its keep on the expected shape and notes lines, not the answer. The shape is what a smoke test actually checks: did the output stop being a clean bulleted list, did a preamble creep back, did it invent a due date. Those are the obvious breaks a person catches at a glance — and they are most of what silent drift does to a working prompt. The notes line is where you record the failure you have seen before, so the next re-run starts with a list of things to look for rather than a blank stare. A fixture you have re-run three times is worth more than one you wrote once, because each pass teaches it what "broken" looks like for this particular prompt.
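The mechanical part of that shape check can be sketched in a few lines of Python. The version below assumes the answer format shown in the fixture: lines starting with "- " and three fields separated by a spaced dash. The function name `shape_problems` and its messages are hypothetical. Treat it as a prompt for where to look, not as a verdict.

```python
DASH = "\u2014"  # field separator and "no due date" placeholder from the fixture


def shape_problems(answer: str) -> list[str]:
    """List the ways an answer departs from the fixture's expected shape."""
    lines = [ln for ln in answer.splitlines() if ln.strip()]
    if not lines:
        return ["empty answer"]
    problems = []
    for i, ln in enumerate(lines):
        if not ln.startswith("- "):
            if i == 0:
                where = "prose preamble before the list"
            elif i == len(lines) - 1:
                where = "closing summary after the list"
            else:
                where = "non-list line inside the list"
            problems.append(f"{where}: {ln!r}")
            continue
        # Expect exactly: owner, task, due date (or the dash placeholder).
        fields = [f.strip() for f in ln[2:].split(f" {DASH} ")]
        if len(fields) != 3 or not fields[0] or not fields[1]:
            problems.append(f"malformed item: {ln!r}")
    return problems


good = "\n".join([
    f"- Maya {DASH} send the budget draft {DASH} Friday",
    f"- Sam {DASH} own the vendor call {DASH} {DASH}",
])
drifted = "Here are the action items:\n" + good + "\nLet me know if you need more."

print(shape_problems(good))
print(shape_problems(drifted))
```

A check like this catches a preamble or a collapsed format. It cannot tell whether a due date was invented, because "Monday" has the right shape whether or not anyone said it. That part stays with the reader.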

The honest limits of this

A plain-text eval is a smoke test, not a metric. It catches obvious breakage — wrong shape, a refusal, a format that collapsed — and misses the subtle quality drift a real harness scoring many examples would surface. One person eyeball-diffing one fixture is the floor of the practice: reproducibility you own, not enterprise rigor.

Three caveats keep it honest. First, do not hand the diff to an AI and call it judged. An LLM grading text is itself biased. Researchers find "the self-preference bias in LLMs has posed significant risks, including promoting specific styles or policies intrinsic to the LLMs," ⁷ and the root cause is that "LLMs prefer texts more familiar to them" ⁸. The human eyeball stays load-bearing. Second, a fixed fixture is the floor, not the ceiling: the practitioner consensus is that "a combined approach works best: random sampling for breadth, golden datasets for deterministic guards" ². A saved baseline is one half of that, not a substitute for the other. Third, keep regulated work out of it. If a prompt touches health, legal, or anything carrying personal data, a hand-rolled eyeball test is not a safeguard, and the fixture should hold a synthetic example, never real records.

Frequently asked questions

How do I know if my AI prompt still works after the model changed?

Re-run the saved input from your eval file and compare the new answer against your last-good baseline. Hosted models drift — the Stanford and UC Berkeley study found GPT-4 fell from 84% to 51% on one task in three months ¹ — so the only reliable signal is a fixture you kept and can diff today.

Why did my ChatGPT prompt suddenly get worse or stop working?

Usually model drift: the vendor updated the service, and your unchanged prompt now meets a changed model. Developers name "prompt drift" and "model drift" as the two causes ²; the second is invisible from your side. Without a saved baseline you cannot tell which one moved, which is the entire case for keeping an eval file.

Do I need an eval platform to test my prompts?

No. Hosted eval platforms exist and are genuinely useful at team scale, but the documented mechanic scales down. The same engineers who build the dashboards advise you to "store this dataset as a CSV or JSON file in your repository" and "start small with 10-20 high-priority examples" ⁴. A folder of plain .md fixtures is that, owned by you.

What is a golden dataset for LLM evals?

A golden dataset is "a curated collection of inputs and their ideal outputs or evaluation criteria" ³ — the saved baseline a regression test compares against. A team manages large ones; your personal version is one file per prompt, holding the input, the expected shape, and the last answer you trusted.

Can an LLM grade its own outputs reliably?

Not on its own. LLM-as-judge carries self-preference bias: it favors styles "intrinsic to the LLMs" ⁷ and prefers "texts more familiar to them" ⁸, so an AI grading your diff is not a neutral check. Use it to draft a first read if you like, but the human eyeball decides whether the output actually broke.

How is this different from a prompt library?

A prompt library stores the prompts; an eval tests whether a stored prompt still works over time. The library is the noun, the eval is the verb. They pair naturally: once you have a prompt worth keeping, the eval file is how you know it still does its job a month later.

The model you rent will change without telling you; the fixture you own is the only thing that will tell you when. Keep the file, re-run it, read the diff with your own eyes — and your AI work stays in plain text you control, the way a note kept in mnmnote.com stays yours regardless of which tool you opened it in.

¹ Chen, L., Zaharia, M., & Zou, J. "How Is ChatGPT's Behavior Changing over Time?" Stanford University / UC Berkeley, arXiv:2307.09009, 18 July 2023 (study window March–June 2023). https://arxiv.org/abs/2307.09009. Accessed 2026-06-21.
² "Practical Developer." "Random Prompt Sampling vs Golden Dataset: Which Works Better for LLM Regression Tests?" DEV Community, 23 June 2025. https://dev.to/practicaldeveloper/random-prompt-sampling-vs-golden-dataset-which-works-better-for-llm-regression-tests-1ln7. Accessed 2026-06-21.
³ Kinde. "CI/CD for Evals: Running Prompt and Agent Regression Tests in GitHub Actions." Kinde Learn. https://www.kinde.com/learn/ai-for-software-engineering/ai-devops/ci-cd-for-evals-running-prompt-and-agent-regression-tests-in-github-actions/. Accessed 2026-06-21.
⁴ Kinde. "CI/CD for Evals: Running Prompt and Agent Regression Tests in GitHub Actions." Kinde Learn. https://www.kinde.com/learn/ai-for-software-engineering/ai-devops/ci-cd-for-evals-running-prompt-and-agent-regression-tests-in-github-actions/. Accessed 2026-06-21.
⁵ Ango, S. "File over app." stephango.com, 1 July 2023. https://stephango.com/file-over-app. Accessed 2026-06-21.
⁶ Ango, S. "File over app." stephango.com, 1 July 2023. https://stephango.com/file-over-app. Accessed 2026-06-21.
⁷ Wataoka, K., Takahashi, T., & Ri, R. "Self-Preference Bias in LLM-as-a-Judge." arXiv:2410.21819, 29 October 2024. https://arxiv.org/abs/2410.21819. Accessed 2026-06-21.
⁸ Wataoka, K., Takahashi, T., & Ri, R. "Self-Preference Bias in LLM-as-a-Judge." arXiv:2410.21819, 29 October 2024. https://arxiv.org/abs/2410.21819. Accessed 2026-06-21.