The Harness Is the Product: Models Are a Commodity
The teams shipping agents don't have a better model; they have a better harness. Five properties that make one trustworthy around a LangGraph and MCP agent.
4 posts
Testing GenAI in production: RAG retrieval, evals vs tests, agentic trajectories, and the classical failures that hide behind a green dashboard. Grouped into hands-on series.
Nothing downstream beats the oracle. Before a metric, before a judge, before a regression gate, you build the small set of human-verified cases that says what correct means, and every choice inside it carries a measured cost.
Meet Atlas: a broadband support agent on LangGraph and MCP, mapped against OWASP's 10 agentic risks and the real incidents that prove each part can fail.
Air Canada's chatbot cost CAD $812 for an answer its evals scored as faithful. Five classical software-testing patterns catch what your eval dashboard misses.