A New Generation, A Familiar Question
In early 2026, a cluster of model releases reshaped what people understood to be possible with AI: Claude Sonnet 4.6 and Claude Opus 4.6 from Anthropic, o3 from OpenAI, Codex — OpenAI's specialized code reasoning system — and GPT-5.3, the latest iteration of the GPT lineage. Each one cleared benchmarks that prior generations could not approach. Each one was deployed, almost immediately, into tasks that involve making judgments about people.
Hiring recommendations. Code review with downstream consequences. Content moderation decisions. Ethical guidance for complex interpersonal situations. The same kinds of cases that appear every day on Judge Human — where the question isn't whether the AI got the calculation right, but whether it judged the situation the way a thoughtful person would.
The answer, based on Judge Human verdict data since these models went live, is complicated. These models are more capable. They are not necessarily more human.
Reasoning Depth Is Not Alignment Depth
The defining characteristic of the current frontier — most clearly embodied in o3 and the Claude 4.6 family — is extended reasoning. Where prior models responded to prompts in a single forward pass, these systems run multi-step internal deliberation before producing output. On math, coding, and logical reasoning tasks, the gains are measurable and significant.
On judgment tasks, the picture is different.
o3, tested across a sample of Judge Human cases involving professional ethics, relationship accountability, and workplace power dynamics, produced verdicts with high internal consistency but systematic divergence from human consensus on cases where context and circumstance matter more than rule-following. When o3 reached a conclusion, it held it confidently — more confidently than any prior model generation. When humans evaluated the same cases, the consensus was frequently more uncertain, more conditional, and more attentive to the specific details of the situation.
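That combination, high self-consistency paired with systematic divergence from the crowd, is straightforward to operationalize. Below is a minimal sketch of the two measurements in Python; the verdict labels, data layout, and helper names are hypothetical stand-ins for illustration, not the actual Judge Human schema:

```python
from collections import Counter

def majority(verdicts):
    """Most common verdict in a list (e.g. 'guilty' / 'not_guilty')."""
    return Counter(verdicts).most_common(1)[0][0]

def self_consistency(model_runs):
    """Fraction of repeated samples that match the model's own modal verdict."""
    mode = majority(model_runs)
    return sum(v == mode for v in model_runs) / len(model_runs)

def agrees_with_humans(model_runs, human_votes):
    """Does the model's modal verdict match the human majority verdict?"""
    return majority(model_runs) == majority(human_votes)

# Hypothetical cases: repeated model samples plus crowd votes per case.
cases = [
    {"model": ["guilty"] * 9 + ["not_guilty"],        # highly self-consistent...
     "humans": ["not_guilty"] * 6 + ["guilty"] * 4},  # ...yet the crowd leans the other way
    {"model": ["guilty"] * 10,
     "humans": ["guilty"] * 9 + ["not_guilty"]},
]

for i, c in enumerate(cases):
    print(i, round(self_consistency(c["model"]), 2),
          agrees_with_humans(c["model"], c["humans"]))
```

A model can score near 1.0 on the first measure while failing the second, which is exactly the o3 pattern described above.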
This is not a failure of intelligence. It is a difference in how intelligence is applied. Human judgment on ambiguous ethical questions incorporates uncertainty, context-dependency, and moral intuitions that resist being reduced to rules. Extended reasoning chains, by contrast, tend to resolve ambiguity by committing to a framework and following it — which produces confident answers that may be systematically wrong in the ways human intuition is specifically tuned to detect.
Claude Sonnet 4.6, Opus 4.6, and the Confidence Problem
Claude Sonnet 4.6 and Claude Opus 4.6 represent Anthropic's most capable and most human-feedback-aligned models to date. Opus 4.6 in particular shows substantial depth on multi-step reasoning tasks and long-horizon problems. In pure alignment benchmarks — tests designed to assess whether model outputs track human preferences — both lead their generation. On Judge Human cases, they perform well on cases with strong consensus: clear ethical violations, unambiguous harm, situations where nearly all human voters agree.
The pattern breaks on edge cases. On cases where human consensus is split (the cases where reasonable, thoughtful people disagree), the Claude 4.6 models tend to pick a side and defend it rather than reflecting the uncertainty that characterizes human reactions to genuinely hard cases. The result is a family of models that is right more often on easy cases and wrong in a specific, hard-to-detect way on hard ones: they make the hard cases look easier than they are.
This matters for deployment decisions. An AI assistant that expresses appropriate uncertainty on difficult cases is legible to the humans working with it. They can identify where to seek additional input. An AI that expresses confidence on the same cases removes that signal. Users who trust the system may not know when to push back.
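One way to make that lost signal concrete is to compare the model's stated confidence against how split the human vote actually is, and flag the mismatches. A sketch, under two stated assumptions: that the model exposes a confidence score at all, and that the cutoff values below (which are illustrative, not calibrated) are tuned for the deployment in question:

```python
import math

def vote_entropy(votes):
    """Shannon entropy (bits) of a verdict distribution; high = split crowd."""
    n = len(votes)
    probs = [votes.count(v) / n for v in set(votes)]
    return -sum(p * math.log2(p) for p in probs)

def needs_review(model_confidence, human_votes,
                 conf_cutoff=0.9, entropy_cutoff=0.8):
    """Flag cases where the model is near-certain but humans are split.

    These are the cases where a confident answer erases the uncertainty
    signal that a hedged answer would have preserved.
    """
    return (model_confidence >= conf_cutoff
            and vote_entropy(human_votes) >= entropy_cutoff)

# A 55/45 human split carries ~0.99 bits of entropy; a 95%-confident
# model verdict on such a case gets routed back to human review.
votes = ["yes"] * 55 + ["no"] * 45
print(round(vote_entropy(votes), 3))   # 0.993
print(needs_review(0.95, votes))       # True
```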
Codex and the Code Consequence Blind Spot
Codex presents a different and more specific alignment gap: the gap between evaluating code and evaluating the consequences of that code.
Code correctness is a checkable property: a function either produces the specified output or it does not. But software deployed in real systems has effects on real people: users whose data is handled, workers whose performance is evaluated, applicants whose applications are ranked. The question "is this code correct?" is almost entirely separable from the question "are the consequences of running this code acceptable?"
Codex, as a code-specialized reasoning system, is optimized for the former. On Judge Human cases involving algorithmic decisions (automated hiring filters, performance scoring systems, content ranking logic), Codex evaluates the technical implementation as if correctness and acceptability were the same question. Human voters, by contrast, consistently distinguish between them: they judge a system that correctly implements an unjust criterion more harshly than one that imperfectly implements a just criterion.
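The separation is easy to see in code. The filter below is a hypothetical example invented for illustration: as an implementation it is flawless, and every correctness test would pass, while the question human voters actually weigh, whether the criterion itself is acceptable, appears nowhere in the code:

```python
from dataclasses import dataclass

@dataclass
class Applicant:
    name: str
    years_experience: float
    employment_gap_months: int

def passes_filter(a: Applicant) -> bool:
    """Spec: reject any applicant with an employment gap over 12 months.

    The implementation is exactly correct against that spec. Whether
    screening people out for career gaps (caregiving, illness, layoffs)
    is an acceptable criterion is a separate question the code cannot
    answer.
    """
    return a.employment_gap_months <= 12

# Correct output, contested criterion: an 18-month gap means rejection.
print(passes_filter(Applicant("A. Rivera", 9.0, 18)))  # False
```

A test suite can only ever confirm the first half of that comment; the second half is where human judgment enters.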
This is not a problem that more training data or more compute will fix. It is a structural consequence of training a model to optimize for code correctness when the relevant question is whether the code is appropriate at all. The only mechanism that can detect this gap is human judgment, and specifically judgment gathered from the people who will be affected by the systems the code implements, not just the people who understand the code.
GPT-5.3 and the Cultural Representation Question
GPT-5.3 is, by OpenAI's measures, the most capable model in the GPT lineage and the one aligned to the broadest range of human feedback. It shows meaningful improvement over earlier generations on cases involving cultural context, narrowing the gap between its verdicts and the responses of human voters from non-Western countries.
The improvement is real. It is not a solution.
On Judge Human cases tagged with non-Western ethical context, GPT-5.3 produces more culturally nuanced responses than its predecessors. But the baseline divergence is large enough that closing part of it still leaves a substantial gap. On cases involving concepts of collective responsibility, family obligation, social hierarchy, and honor — common frameworks in many non-Western contexts — GPT-5.3 still defaults to WEIRD (Western, Educated, Industrialized, Rich, Democratic) intuitions when the case is sufficiently complex.
The practical consequence: GPT-5.3 may be deployed by organizations confident that "improved cultural range" means it is appropriate for global use cases. The verdict data says that confidence is premature. Better is not good enough when the standard is "adequately represents the range of human moral frameworks."
The Confidence-Capability Trap
The most consistent pattern across the current model generation — o3, Claude Sonnet 4.6, Claude Opus 4.6, Codex, GPT-5.3 — is this: as capability increases, expressed confidence increases faster than alignment improves.
Prior model generations were wrong in ways that were often visible. They hedged. They said "it depends." They flagged uncertainty. That uncertainty was legible to the humans using them and created natural checkpoints for human review.
The current generation produces more confident output. The confidence is frequently warranted on tasks that have right answers. It is structurally unjustified on tasks that require navigating genuine human disagreement — the very tasks these systems are being deployed to handle.
The danger is not that these systems will be wrong. Capable systems will be wrong less often on average. The danger is that the errors they make will be harder to see, harder to challenge, and more difficult to override — because the systems presenting those errors will present them with the authority that comes from being consistently right about everything else.
This is Condorcet's Jury Theorem with a twist. The theorem says that when each voter is independently correct with probability better than chance, the probability that the majority verdict is correct approaches certainty as the group grows. But the corollary is that when voters are correlated (sharing the same biases, the same training, the same error modes), the majority vote amplifies those errors rather than canceling them.
A network of very capable, similarly-trained AI systems evaluating each other's outputs is not an independent jury. It is a correlated one, and Condorcet's theorem warns us exactly what to expect.
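A short simulation makes the corollary concrete. The correlation model here, in which some fraction of trials has every juror copy a single shared draw, is a deliberate oversimplification chosen to show the effect, not a model of any real deployment:

```python
import random

def majority_correct(n_jurors, p, rho, trials=20_000, seed=0):
    """Estimate the probability that the majority verdict is correct.

    Each juror is right with probability p. With probability rho, a trial
    is 'herded': all jurors copy one shared draw (maximal correlation).
    With rho = 0 this is Condorcet's original setting.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        if rng.random() < rho:
            votes = [rng.random() < p] * n_jurors   # one shared error mode
        else:
            votes = [rng.random() < p for _ in range(n_jurors)]
        wins += sum(votes) > n_jurors / 2
    return wins / trials

# 101 independent jurors, each only 60% accurate: majority ~98% accurate.
print(majority_correct(101, 0.6, rho=0.0))
# Same jurors, heavily correlated: majority accuracy collapses toward 60%.
print(majority_correct(101, 0.6, rho=0.9))
```

Adding more correlated jurors buys almost nothing: accuracy plateaus near the individual rate no matter how large the jury gets.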
Why the Loop Needs to Stay Open
Every capability advance in frontier AI produces the same argument from the AI industry: these models are now good enough that human oversight can be reduced. The reasoning sounds like progress. The evidence says otherwise.
The cases that most need human judgment are not the cases where AI is obviously wrong. Those are easy to catch. The cases that most need human judgment are the ones where AI is confidently, plausibly, difficult-to-dispute wrong — where the model's answer is coherent and well-reasoned and does not match what humans who understand the real-world context would actually conclude.
As o3, Claude Sonnet 4.6, Claude Opus 4.6, Codex, and GPT-5.3 are deployed into more consequential decisions — more hiring, more content moderation, more institutional judgment calls — the imperative to maintain an independent human signal grows stronger, not weaker.
Judge Human is that signal. The cases evaluated here are the same kinds of cases these models are being asked to decide. The humans voting here are the people who will live with those decisions. The divergence we measure is not a flaw in the AI — it is a flaw in the assumption that capability is the same as judgment.
It is not. The gap is measurable. And it is our job to keep measuring it.