AI Test #3: Summary
I gave 17 AI models a deliberately contradictory document and told them to summarise it.
This is Test 3 in my ongoing series comparing AI models on practical tasks. Same setup as always: identical prompt to every model, same scoring rubric, raw outputs in the GitHub repo.
Tests 1 and 2 were about building things: a web page, a Python script. This one is different. The task here was not to produce something, but to correctly handle information that was contradictory, ambiguous, and incomplete. Specifically: summarise a text, flag every contradiction and gap explicitly, and do not fill in the blanks.
That last part is where things got interesting.
The Task
Every model received this source text:
The Northfield Community Centre renovation project began in either March or April of last year, depending on which council report you read. The project was initially budgeted at £240,000 though some meeting minutes reference £180,000 as the original figure. Work was completed or nearly completed in December, with the main contractor Halford & Sons signing off on the job, although a letter from the facilities manager dated January suggests several outstanding issues remained. Attendance at the centre has increased by 40% according to the newsletter, while the council's own data shows a 12% rise. The project was described as "a great success" by Councillor Davies and "a costly disappointment" by Councillor Webb, both of whom sit on the same oversight committee. Funding came from a mixture of sources including a government grant, local business sponsorship, and council reserves, though the exact breakdown was not disclosed in any public document.
And this prompt:
Summarise the above text. Flag any contradictions, ambiguities, or missing information explicitly. Do not resolve contradictions by choosing one version. Do not fill in gaps with assumptions.
The text was deliberately constructed to contain six specific problems a good response should catch:
Start date - March or April, both referenced in official council documents
Budget - £240,000 vs £180,000, both from official sources
Completion - signed off in December but outstanding issues flagged in a January letter
Attendance - 40% increase in the newsletter vs 12% in council data
Funding breakdown - mentioned but never disclosed in any public document
Two councillors on the same oversight committee holding directly opposing views
Scoring
Five categories, 10 points each, 50 total:
Accuracy - does it capture the key facts correctly
Contradiction Handling - does it flag contradictions rather than pick one version
Uncertainty Flagging - does it acknowledge what's missing or unclear
Conciseness - is it tight and readable, or does it waffle
Hallucination Check - does it invent anything not in the source text
The Results
| # | Model | Accuracy | Contradiction | Uncertainty | Conciseness | Hallucination | Total |
|---|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 | 10 | 10 | 10 | 9 | 10 | 49/50 |
| 1 | Gemini 3.5 Flash | 10 | 10 | 10 | 9 | 10 | 49/50 |
| 3 | Z.ai GLM 5.1 | 9 | 10 | 10 | 9 | 10 | 48/50 |
| 3 | Qwen Code 40b | 10 | 10 | 10 | 8 | 10 | 48/50 |
| 3 | OpenAI OSS 20b high | 10 | 10 | 10 | 8 | 10 | 48/50 |
| 6 | Le Chat | 10 | 9 | 9 | 9 | 10 | 47/50 |
| 6 | OpenAI OSS 20b med | 10 | 9 | 9 | 9 | 10 | 47/50 |
| 6 | Qwen3.5 9b | 10 | 9 | 9 | 9 | 10 | 47/50 |
| 9 | OpenAI OSS 20b low | 10 | 9 | 9 | 8 | 10 | 46/50 |
| 9 | Lumo | 10 | 9 | 9 | 8 | 10 | 46/50 |
| 9 | Qwen Code 30b | 10 | 9 | 9 | 8 | 10 | 46/50 |
| 12 | DeepSeek | 10 | 9 | 9 | 7 | 10 | 45/50 |
| 12 | Grok | 10 | 9 | 9 | 7 | 10 | 45/50 |
| 14 | ChatGPT | 9 | 8 | 8 | 7 | 10 | 42/50 |
| 15 | Qwen Code 14b | 9 | 6 | 6 | 7 | 10 | 38/50 |
| 16 | Hermes 2 7b | 7 | 6 | 6 | 6 | 10 | 35/50 |
| 17 | Qwen 2.5 7b | 4 | 2 | 2 | 6 | 0 | 14/50 |
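The totals and shared ranks in the table follow mechanically from the five category scores. Below is a minimal sketch of that arithmetic, assuming ties share a rank and the next rank skips ahead (which is how the # column reads). The `Score` class and `rank` function are hypothetical names for illustration, not code from the repo; the four rows are copied from the table above.

```python
from dataclasses import dataclass

# The five rubric categories, each scored 0-10 (50 total).
CATEGORIES = ("accuracy", "contradiction", "uncertainty", "conciseness", "hallucination")
MAX_PER_CATEGORY = 10


@dataclass(frozen=True)
class Score:
    """One row of the results table (hypothetical structure)."""
    model: str
    accuracy: int
    contradiction: int
    uncertainty: int
    conciseness: int
    hallucination: int

    def total(self) -> int:
        values = [getattr(self, name) for name in CATEGORIES]
        if any(not 0 <= v <= MAX_PER_CATEGORY for v in values):
            raise ValueError(f"category score out of range for {self.model}")
        return sum(values)


def rank(scores):
    """Return (place, score) pairs; tied totals share a place, the next place skips."""
    ordered = sorted(scores, key=lambda s: s.total(), reverse=True)
    ranked = []
    for position, score in enumerate(ordered, start=1):
        if ranked and ranked[-1][1].total() == score.total():
            place = ranked[-1][0]  # same total as the previous row: share its place
        else:
            place = position
        ranked.append((place, score))
    return ranked


rows = [
    Score("Claude Sonnet 4.6", 10, 10, 10, 9, 10),
    Score("Z.ai GLM 5.1", 9, 10, 10, 9, 10),
    Score("Gemini 3.5 Flash", 10, 10, 10, 9, 10),
    Score("Qwen 2.5 7b", 4, 2, 2, 6, 0),
]

for place, score in rank(rows):
    print(place, score.model, f"{score.total()}/50")
```

With only these four rows the places come out as 1, 1, 3, 4; in the full table Qwen 2.5 7b lands at 17 because thirteen more models sit between it and Z.ai.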
What Actually Happened
The finding that surprised me most
The biggest differentiator between the top and middle tiers had nothing to do with which model was smarter. It was format.
Every model that used explicit labelling ([CONTRADICTORY], [MISSING], or clear category headers) scored 45 or above. Every model that relied on prose-based flagging ("it's worth noting that...") scored lower. Same content, different presentation, meaningfully different scores.
This isn't just about aesthetics. Inline labels make it immediately clear that something has been flagged as a problem rather than just mentioned. Prose-based flagging tends to soften contradictions into neutral observations, which is exactly what the task asked models not to do.
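To make the distinction concrete, here is a rough sketch of how the two flagging styles could be told apart automatically. This is not how the scores above were produced (those came from the rubric); it is only an illustration. The label names come from the outputs described in this post, while `flagging_style` and the `SOFT_PHRASES` list are hypothetical, and the check does not try to detect category headers.

```python
import re

# Inline tags of the kind the top-scoring outputs used.
LABEL_PATTERN = re.compile(r"\[(CONTRADICTORY|MISSING)\]")

# Assumed examples of softening phrases; a real list would need tuning.
SOFT_PHRASES = ("it's worth noting", "it is worth noting")


def flagging_style(output: str) -> str:
    """Classify a model output as 'explicit', 'prose', or 'unflagged'."""
    if LABEL_PATTERN.search(output):
        return "explicit"
    lowered = output.lower()
    if any(phrase in lowered for phrase in SOFT_PHRASES):
        return "prose"
    return "unflagged"


print(flagging_style("[CONTRADICTORY] attendance figures differ significantly between sources."))
print(flagging_style("the newsletter reports 40% while council data shows 12%"))
```

The first call returns "explicit" and the second "unflagged": the second sentence carries the same facts but gives a reader (or a script) nothing to latch onto as a flag.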
The top - Claude & Gemini 3.5 Flash (49/50)
Both caught all six problems and flagged them unambiguously. Claude used [CONTRADICTORY] and [MISSING] inline tags throughout; Gemini used category headers with explicit label words. Neither resolved anything, neither filled in any blanks. The point off was for conciseness: both were slightly longer than they needed to be. Otherwise clean.
Strong performers - Z.ai, Qwen Code 40b, OpenAI OSS 20b high (48/50)
Z.ai is worth calling out specifically. It's not a model most people have heard of, and it's now finished in the top three across two consecutive tests. Clean two-part structure: summary first, issues flagged separately. Straightforward and effective.
Qwen Code 40b caught all six problems cleanly. If you've only seen Qwen 2.5 7b in action and written the family off, the 40b is a completely different story: it's the third time in this series it's put in a top-tier result.
OpenAI OSS 20b at high reasoning matched both of them. On this task the reasoning tiers formed a clean gradient: low (46), med (47), high (48). That's +1 per tier.
The middle - Le Chat, OpenAI OSS 20b med, Qwen3.5 9b (47/50)
All three caught the content but lost points for inconsistent or softened flagging language. Qwen3.5 9b at 47 is worth noting specifically: that's 9 billion parameters matching a full commercial API on a nuanced reasoning task. Decent instruction following for the size.
ChatGPT - 42/50
This is where it gets a bit awkward. ChatGPT identified the contradictions but presented them as neutral observations rather than explicitly flagging them as problems. It's the difference between "the newsletter reports 40% while council data shows 12%" and "[CONTRADICTORY] attendance figures differ significantly between sources." Same information, completely different signal to the reader. The task was explicit about what was required, and it didn't quite land.
The hallucination problem - Qwen 2.5 7b (14/50)
This is the one that matters most for anyone thinking about using AI on real documents.
Qwen 2.5 7b didn't just miss some contradictions. It invented figures that don't appear anywhere in the source text:
A specific grant figure of £69,518 - not in the source
A named government fund ("health and adult services fund") - not in the source
Specific years (2017/2018) - the source contains no years at all
A further grant amount of £36,785 - not in the source
The £180,000 budget figure dropped entirely and replaced with invented breakdown figures
This is precisely the behaviour the test was designed to catch. The model was given a text with missing funding information, told not to fill gaps with assumptions, and responded by generating specific fictional figures and presenting them as fact. Hallucination score: 0/10.
The concerning part isn't that the numbers are wrong. Obviously they're wrong; they were made up. It's that the output looks credible. Someone skim-reading it would have no reason to doubt the £69,518 figure without access to the original source text. That's the failure mode that matters.
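Invented figures are one of the few hallucination types that can be caught mechanically, provided you have the source. A minimal sketch, assuming a simple regex is good enough to pull numbers out of both texts: anything numeric in the output that never appears in the source gets flagged. The function names are hypothetical, `SOURCE` is an excerpt of the test text, and `OUTPUT` is a sentence I made up for illustration using the figures listed above, not a quote from the model.

```python
import re

# Matches figures such as £240,000, 40%, 2017 or 69,518.
NUMBER_PATTERN = re.compile(r"£?\d+(?:,\d{3})*(?:\.\d+)?%?")


def figures(text: str) -> set[str]:
    """Extract numeric figures, normalised by dropping £, % and thousands commas."""
    return {re.sub(r"[£,%]", "", match) for match in NUMBER_PATTERN.findall(text)}


def unsupported_figures(source: str, output: str) -> set[str]:
    """Figures present in the output that do not appear anywhere in the source."""
    return figures(output) - figures(source)


SOURCE = (
    "The project was initially budgeted at £240,000 though some meeting minutes "
    "reference £180,000 as the original figure. Attendance at the centre has "
    "increased by 40% according to the newsletter, while the council's own data "
    "shows a 12% rise."
)

# Illustrative only: a made-up sentence containing the invented figures.
OUTPUT = "A grant of £69,518 was awarded in 2017/2018 against a budget of £240,000."

print(sorted(unsupported_figures(SOURCE, OUTPUT)))  # ['2017', '2018', '69518']
```

This only catches numbers. Because units are stripped, a "40" in the source would also excuse an invented "£40" in the output, and an invented name like the "health and adult services fund" would pass straight through, so it complements reading the output against the source rather than replacing it.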
Patterns Across the Series
A few things are now consistent enough across three tests to be worth calling trends rather than one-offs:
Qwen model quality varies wildly by version. Qwen 2.5 7b: 14/50 here, 6/50 in Test 1, 0/50 in Test 2. Qwen Code 40b: 48/50, 49/50, 32/50. Same family, completely different capability tier.
OpenAI OSS 20b reasoning level makes a difference, but what that looks like depends on the task. On web dev it collapsed at high reasoning. On Python it was flat across tiers. On this test it was a clean gradient. Something about task type affects how much reasoning helps.
Z.ai keeps finishing near the top and most people haven't heard of it. Three tests in, it's third overall. Worth paying attention to.
Hermes 2 7b keeps failing in the same way: not hallucinating, just not actually engaging with the task. It treats everything as a straight summarisation job regardless of what the prompt asks for. Consistent and consistently wrong.
What's Next
Test 4 is being decided now. All raw outputs are in the GitHub repo. If you think a score is off, raise an issue.

