Blog

Writing about AI model evaluation, trust, and the tools to measure both.

2026-03-27

Why AI Models Disagree (And Why It Matters)

When you ask GPT-4 and Claude the same question, they often give different answers. Understanding why this happens is the first step toward building AI systems you can trust.

Colin Smillie · 6 min read

2026-03-27

How to Evaluate AI Models for Enterprise Use

Generic benchmarks tell you how a model performs on average. Enterprise deployment requires knowing how it performs on your specific problems.

Colin Smillie · 8 min read

2026-03-27

The Case for Structured AI Benchmarking

Ad hoc testing feels productive but produces unreliable conclusions. Structured benchmarking with defined question types gives you data you can act on.

Colin Smillie · 7 min read