<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Evals on Aguilera Engineering</title><link>https://aguilera.ee/blog/tags/evals/</link><description>Recent content in Evals on Aguilera Engineering</description><generator>Hugo -- gohugo.io</generator><language>en-US</language><managingEditor>eduardo@aguilera.ee (Eduardo Aguilera)</managingEditor><webMaster>eduardo@aguilera.ee (Eduardo Aguilera)</webMaster><copyright>Eduardo Aguilera</copyright><lastBuildDate>Wed, 20 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://aguilera.ee/blog/tags/evals/index.xml" rel="self" type="application/rss+xml"/><item><title>Evals before claims</title><link>https://aguilera.ee/blog/evals-before-claims/</link><pubDate>Wed, 20 May 2026 00:00:00 +0000</pubDate><author>eduardo@aguilera.ee (Eduardo Aguilera)</author><guid>https://aguilera.ee/blog/evals-before-claims/</guid><description>&lt;p&gt;Most AI features ship on vibes. Someone tries three prompts, the output looks
good, and it goes to production. Then it fails quietly on the fourth case and
nobody notices until a customer does.&lt;/p&gt;
&lt;p&gt;Write the eval first. Pin the cases you care about. Measure before you claim the
feature works.&lt;/p&gt;





&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="ln"&gt;1&lt;/span&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_extraction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="ln"&gt;2&lt;/span&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;passed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="ln"&gt;3&lt;/span&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="ln"&gt;4&lt;/span&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="ln"&gt;5&lt;/span&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;passed&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="ln"&gt;6&lt;/span&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;passed&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;An eval is not a unit test. It is a measurement you keep running as the model,
the prompt, and the data drift underneath you. Treat the score as a contract.&lt;/p&gt;</description><content:encoded><![CDATA[<p>Most AI features ship on vibes. Someone tries three prompts, the output looks
good, and it goes to production. Then it fails quietly on the fourth case and
nobody notices until a customer does.</p>
<p>Write the eval first. Pin the cases you care about. Measure before you claim the
feature works.</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">def</span> <span class="nf">test_extraction</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">cases</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">    <span class="n">passed</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">    <span class="k">for</span> <span class="k">case</span> <span class="ow">in</span> <span class="n">cases</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">        <span class="n">result</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">extract</span><span class="p">(</span><span class="k">case</span><span class="o">.</span><span class="n">input</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">        <span class="n">passed</span> <span class="o">+=</span> <span class="n">result</span> <span class="o">==</span> <span class="k">case</span><span class="o">.</span><span class="n">expected</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">    <span class="k">return</span> <span class="n">passed</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">cases</span><span class="p">)</span></span></span></code></pre></div><p>An eval is not a unit test. It is a measurement you keep running as the model,
the prompt, and the data drift underneath you. Treat the score as a contract.</p>
<p>The discipline is old. The subject is new. That&rsquo;s the whole job.</p>
]]></content:encoded></item></channel></rss>