Blog
Writing about AI model evaluation, trust, and the tools to measure both.
Why AI Models Disagree (And Why It Matters)
When you ask GPT-4 and Claude the same question, they often give different answers. Understanding why this happens is the first step toward building AI systems you can trust.
Colin Smillie | 6 min read
How to Evaluate AI Models for Enterprise Use
Generic benchmarks tell you how a model performs on average. Enterprise deployment requires knowing how it performs on your specific problems.
Colin Smillie | 8 min read
The Case for Structured AI Benchmarking
Ad hoc testing feels productive but produces unreliable conclusions. Structured benchmarking with defined question types gives you data you can act on.
Colin Smillie | 7 min read