Evals before claims

eduardo@aguilera.ee (Eduardo Aguilera) — Wed, 20 May 2026 00:00:00 +0000

Most AI features ship on vibes. Someone tries three prompts, the output looks good, and it goes to production. Then it fails quietly on the fourth case and nobody notices until a customer does.

Write the eval first. Pin the cases you care about. Measure before you claim the feature works.

1def test_extraction(model, cases):
2 passed = 0
3 for case in cases:
4 result = model.extract(case.input)
5 passed += result == case.expected
6 return passed / len(cases)

An eval is not a unit test. It is a measurement you keep running as the model, the prompt, and the data drift underneath you. Treat the score as a contract.

Evals on Aguilera Engineering

Evals before claims