GenAI

4 posts

Testing GenAI in production: RAG retrieval, evals vs tests, agentic trajectories, and the classical failures that hide behind a green dashboard. Grouped into hands-on series.

GENAI_TESTING

The Golden Dataset: Building the Oracle You Test Against

Nothing downstream beats the oracle. Before a metric, before a judge, before a regression gate, you build the small set of human-verified cases that says what correct means, and every choice inside it carries a measured cost.

JUN 23, 2026 28 min read

GENAI_TESTING 22 MIN

The Harness Is the Product: Models Are a Commodity

The teams shipping agents don't have a better model; they have a better harness. Five properties that make one trustworthy around a LangGraph and MCP agent.

JUN 21, 2026

GENAI_TESTING 28 MIN

The System Under Test: A Broadband Support Agent

Meet Atlas: a broadband support agent on LangGraph and MCP, mapped against OWASP's 10 agentic risks and the real incidents that prove each part can fail.

JUN 18, 2026

GENAI_TESTING 38 MIN

Your Evals Are Checks, Not Tests

Air Canada's chatbot cost CAD $812 for an answer its evals scored as faithful. Five classical software-testing patterns catch what your eval dashboard misses.

JUN 11, 2026
