Altered Riddles
Altered Riddles investigates whether language models can override memorised solutions when familiar riddles are minimally perturbed. The central hypothesis is that models with sufficient exposure to canonical riddles during pretraining will exhibit a measurable tendency to reproduce memorised answers even when the altered prompt logically demands a different one.
Leaderboard
Models are ranked by Conditioned Override Rate (COR) — the fraction of altered riddles on which the model produced the original memorised answer, conditional on having solved the unaltered riddle correctly. Lower COR is better.
Loading benchmark data…
Charts
Verbosity vs COR
Does thinking longer help? Each dot is a model: its x-position is the average number of output tokens per riddle (log scale), its y-position is the conditioned override rate. A ring indicates a reasoning-enabled model.
By alteration type
Not every alteration is equally hard. Each column is a type of alteration; every dot is one model's COR on that subset. The shaded band is the interquartile range across models, the horizontal tick is the median, and the diamond marks the mean.
Background
Motivation
The benchmark originated from an observation made while constructing the academic-chains dataset. Testing a minor variant of the classic surgeon riddle produced a surprising failure pattern:
The surgeon, who is the boy's father, says "I cannot operate on this boy — he's my son!" Who is the surgeon to the boy?
For reference, the original riddle reads:
A man and his son are in a terrible accident and are rushed to the hospital in critical condition. The doctor looks at the boy and exclaims, "I can't operate on this boy; he's my son!" How could this be?
The canonical answer — "The doctor is the mother" — exploits
implicit gender assumptions to produce an apparent paradox. In
the altered version, the father's identity is stated explicitly,
making "the father" the only correct response. Yet models
including claude-sonnet-4.6, gemini-3.1-flash,
and several others consistently produce "the mother," apparently
pattern-matching to training exposure rather than reasoning over
the modified prompt.
The likely mechanism is pattern override: pretraining exposure is dense enough that recognition of the riddle's surface form activates a stored answer, overriding careful processing of the modified constraints.
Alteration types
The benchmark includes four categories of perturbation, each designed to test a different failure mode:
- Constraint addition — a new condition is appended that strictly determines a different answer, while leaving the original structure recognisable.
- Meaning shift — a minimal lexical substitution inverts the logical conclusion without altering surface form substantially.
- Context swap — key entities or roles are replaced while the puzzle structure is preserved, changing which answer is correct.
- Bias probe — the alteration directly counteracts a known societal or linguistic bias that the original riddle relies on (as in the surgeon example above).
This taxonomy allows per-category analysis of which perturbation types are most and least resistant to pattern override, and which model families show differentiated performance across categories.
Metrics
- Conditioned Override Rate (COR) — lower is better
- The proportion of altered riddles on which the model produces the original memorised answer, restricted to riddles the model solves correctly in unaltered form. A high COR means the model is pattern-matching to training data instead of reasoning from the altered text.
- Original Accuracy — higher is better
- Accuracy on the unmodified riddles. Serves as a knowledge baseline and conditioning variable for COR. A model with low original accuracy cannot exhibit meaningful pattern override, so this figure contextualises all other metrics.
- Altered Accuracy — higher is better
- Accuracy on the modified riddle set. Unlike COR, this metric is unconditional — it reflects raw performance on the harder task and provides a direct measure of downstream usefulness.
- Pattern Override Rate — lower is better
- The unconditional rate at which the model produces the original canonical answer on modified riddles, regardless of whether it solved the original.
Detailed Results
The full leaderboard including confidence intervals, raw pattern override rate, and average output tokens per riddle. Raw data available as leaderboard.json.
Resources
For full methodology, data and design details, see the Hugging Face dataset page and the GitHub repository.