An eval harness is what tells you whether your agent is actually working. Without one, you have intuition, demos, and the occasional painful realization. With one, you have a number that moves over time.
The hard part isn't running the harness. It's writing the test cases. Most agent behavior is fuzzy: there isn't a single right answer to "summarize this earnings transcript" or "analyze whether this Polymarket dispute is likely to resolve YES." The eval has to score outputs against an expected shape rather than an exact string, and writing those shapes takes more thought than running the agent does.
Two patterns hold up. Deterministic checks where possible: did the agent produce valid JSON, did it call the right tool, did it stay within position-size limits. LLM-as-judge for the parts that aren't deterministic: a separate, frozen model scores the agent's output against a rubric. Both are imperfect; both are better than guessing.
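Here is a minimal sketch of both patterns, assuming the agent emits a JSON object with `tool` and `position_size` fields. Those field names, the `MAX_POSITION_SIZE` limit, the 1-to-5 scale, and the `judge` callable are all hypothetical and stand in for whatever your agent and judge model actually use.

```python
import json
from typing import Callable

# Hypothetical limit, chosen only for illustration.
MAX_POSITION_SIZE = 1_000.0


def deterministic_checks(raw_output: str, expected_tool: str) -> dict[str, bool]:
    """Run the checks that have a single right answer."""
    results = {"valid_json": False, "right_tool": False, "within_limit": False}
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return results
    # This sketch requires a top-level JSON object, not a bare list or string.
    if not isinstance(data, dict):
        return results
    results["valid_json"] = True
    results["right_tool"] = data.get("tool") == expected_tool
    size = data.get("position_size")
    results["within_limit"] = (
        isinstance(size, (int, float))
        and not isinstance(size, bool)
        and 0 <= size <= MAX_POSITION_SIZE
    )
    return results


def judge_score(output_text: str, rubric: str, judge: Callable[[str], str]) -> int:
    """Ask a separate, frozen model to grade the output against a rubric.

    `judge` is any function that sends a prompt to that model and returns
    its text reply; it is injected so the harness stays provider-agnostic.
    """
    prompt = (
        f"Rubric:\n{rubric}\n\n"
        f"Output to grade:\n{output_text}\n\n"
        "Reply with a single integer from 1 to 5."
    )
    score = int(judge(prompt).strip())  # ValueError if the judge goes off-format
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned out-of-range score: {score}")
    return score
```

Passing the judge in as a plain callable also makes the harness itself testable: a stub like `lambda prompt: "4"` exercises the scoring path without a model call.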
The point isn't testing in the software-engineering sense. It's catching silent regressions. Things change underneath you: a provider swaps the frontier model overnight, an open-weight model gets retrained, a system prompt gets edited. The eval harness is the only thing standing between you and quietly worse output.
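One way to make a silent regression loud is to store per-case pass/fail results from a known-good run and compare every new run against them. The sketch below assumes results are kept as a mapping from case ID to a boolean; the function names and the 5-point tolerance are illustrative choices, not a standard.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty run."""
    return sum(results.values()) / len(results) if results else 0.0


def find_regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Case IDs that passed in the baseline run but fail (or are missing) now."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not current.get(case_id, False)
    )


def has_regressed(
    baseline: dict[str, bool],
    current: dict[str, bool],
    tolerance: float = 0.05,  # hypothetical allowance for run-to-run noise
) -> bool:
    """True when the pass rate dropped by more than `tolerance`."""
    return pass_rate(baseline) - pass_rate(current) > tolerance


baseline = {"a": True, "b": True, "c": False, "d": True}
current = {"a": True, "b": False, "c": False, "d": True}
print(find_regressions(baseline, current))  # ['b']
print(has_regressed(baseline, current))     # True
```

The per-case list matters as much as the aggregate number: a flat pass rate can hide one case breaking while another starts passing.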
The first version of any agent I build now ships with a small eval suite. Twenty cases minimum. The agent and the harness grow together. The day you need the harness is the day you wish you already had one.