SERIES

Evals Are Checks, Not Tests

4 parts

A hands-on series on testing GenAI like real software: where eval dashboards stop and classical testing discipline takes over.

  1. Part 1 GENAI_TESTING

    Your Evals Are Checks, Not Tests

    Air Canada's chatbot cost the airline CAD $812 for an answer its evals scored as faithful. Five classical software-testing patterns catch what your eval dashboard misses.

    JUN 11, 2026 38 min read
  2. Part 2 GENAI_TESTING

    The System Under Test: A Broadband Support Agent

    Meet Atlas: a broadband support agent on LangGraph and MCP, mapped against OWASP's 10 agentic risks and the real incidents that prove each part can fail.

    JUN 18, 2026 28 min read
  3. Part 3 GENAI_TESTING

    The Harness Is the Product: Models Are a Commodity

    The teams shipping agents don't have a better model; they have a better harness. Five properties that make one trustworthy around a LangGraph and MCP agent.

    JUN 21, 2026 22 min read
  4. Part 4 GENAI_TESTING

    The Golden Dataset: Building the Oracle You Test Against

    Nothing downstream beats the oracle. Before a metric, before a judge, before a regression gate, you build the small set of human-verified cases that says what correct means, and every choice inside it carries a measured cost.

    JUN 23, 2026 28 min read