SERIES

Evals Are Checks, Not Tests

4 parts

A hands-on series on testing GenAI like real software — where eval dashboards stop and classical testing discipline takes over.

Part 1 GENAI_TESTING

Your Evals Are Checks, Not Tests

Air Canada's chatbot cost CAD $812 for an answer its evals scored as faithful. Five classical software-testing patterns catch what your eval dashboard misses.

JUN 11, 2026 38 min read

Part 2 GENAI_TESTING

The System Under Test: A Broadband Support Agent

Meet Atlas: a broadband support agent on LangGraph and MCP, mapped against OWASP's 10 agentic risks and the real incidents that prove each part can fail.

JUN 18, 2026 28 min read

Part 3 GENAI_TESTING

The Harness Is the Product: Models Are a Commodity

The teams shipping agents don't have a better model; they have a better harness. Five properties that make one trustworthy around a LangGraph and MCP agent.

JUN 21, 2026 22 min read

Part 4 GENAI_TESTING

The Golden Dataset: Building the Oracle You Test Against

Nothing downstream beats the oracle. Before a metric, before a judge, before a regression gate, you build the small set of human-verified cases that says what correct means, and every choice inside it carries a measured cost.

JUN 23, 2026 28 min read