Evals Are Checks, Not Tests
4 parts
A hands-on series on testing GenAI like real software — where eval dashboards stop and classical testing discipline takes over.
- Part 1 GENAI_TESTING
Your Evals Are Checks, Not Tests
Air Canada's chatbot cost CAD $812 for an answer its evals scored as faithful. Five classical software-testing patterns catch what your eval dashboard misses.
JUN 11, 2026 38 min read
- Part 2 GENAI_TESTING
The System Under Test: A Broadband Support Agent
Meet Atlas: a broadband support agent on LangGraph and MCP, mapped against OWASP's 10 agentic risks and the real incidents that prove each part can fail.
JUN 18, 2026 28 min read
- Part 3 GENAI_TESTING
The Harness Is the Product: Models Are a Commodity
The teams shipping agents don't have a better model; they have a better harness. Five properties that make one trustworthy around a LangGraph and MCP agent.
JUN 21, 2026 22 min read
- Part 4 GENAI_TESTING
The Golden Dataset: Building the Oracle You Test Against
Nothing downstream beats the oracle. Before a metric, before a judge, before a regression gate, you build the small set of human-verified cases that says what correct means, and every choice inside it carries a measured cost.
JUN 23, 2026 28 min read