Aguilera Engineering

Evals before claims

Most AI features ship on vibes. Someone tries three prompts, the output looks good, and it goes to production. Then it fails quietly on the fourth case and nobody notices until a customer does.

Write the eval first. Pin the cases you care about. Measure before you claim the feature works.

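def test_extraction(model, cases):
    # Fraction of pinned cases where the extraction matches exactly.
    if not cases:
        raise ValueError("no eval cases pinned")
    passed = 0
    for case in cases:
        result = model.extract(case.input)
        passed += result == case.expected  # True counts as 1
    return passed / len(cases)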

An eval is not a unit test. It is a measurement you keep running as the model, the prompt, and the data drift underneath you. Treat the score as a contract.
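One way to make "contract" concrete is to gate changes on the score. The sketch below is a minimal illustration, not a prescribed setup: `Contract`, `check_contract`, and the numbers in the usage note are hypothetical names and values chosen for the example. It assumes you record the score from the last accepted run as a baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    """Hypothetical contract terms for an eval score (both values are assumptions)."""
    min_score: float  # absolute floor the score must never fall below
    max_drop: float   # largest tolerated regression against the recorded baseline


def check_contract(score, baseline, contract):
    """Return a list of violation messages; an empty list means the contract holds."""
    violations = []
    if score < contract.min_score:
        violations.append(
            f"score {score:.3f} is below the floor {contract.min_score:.3f}"
        )
    if baseline - score > contract.max_drop:
        violations.append(
            f"score dropped {baseline - score:.3f} from baseline {baseline:.3f}"
        )
    return violations
```

Run it on every model, prompt, or data change, for example `check_contract(score, baseline=0.93, contract=Contract(min_score=0.90, max_drop=0.02))`, and block the change when the list is non-empty. Checking both a floor and a drop catches a slow slide as well as a sudden break.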

The discipline is old. The subject is new. That’s the whole job.

#evals #reliability
