Altered Riddles

While working on the academic-chains dataset, I tested a well-known alteration of a common riddle, "just for fun":

"The surgeon, who is the boy's father, says 'I cannot operate on this boy, he's my son!' — Who is the surgeon to the boy?"
(Below is the original riddle for reference)
A man and his son are in a terrible accident and are rushed to the hospital in critical condition. The doctor looks at the boy and exclaims, "I can't operate on this boy; he's my son!" How could this be?

You likely immediately thought, "The father!", but surprisingly, many powerful LLMs (including claude-sonnet-4-6, gemini-3.1-flash, and others in my tests) fail this simple variation. The classic riddle expects "The mother" as the answer, revealing societal biases. However, when the text explicitly states the father is the surgeon, why do models get it wrong? This was the perfect starting point for a new benchmark.

For more details on the motivation and design of the benchmark, check our Hugging Face page and GitHub repo.

What the metrics mean

  • Altered Accuracy: Share of modified riddles answered correctly. This is the primary ranking metric.
  • Pattern Override Rate: How often the model falls back to the memorised (original) answer on the altered riddle — the central failure signal.
  • Weighted Accuracy: Like altered accuracy, but competing answers receive 0.5× partial credit. A large gap between weighted and altered accuracy means a model is reasoning about the altered text rather than recalling from memory.
  • Original Accuracy: Performance on the unaltered riddles as a sanity check. High original accuracy paired with low altered accuracy is the clearest failure signature.
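
As a rough illustration of how the first three metrics relate, here is a minimal sketch that computes them from per-response labels. The label names, the `altered_metrics` and `original_accuracy` helpers, and the reading of "competing" as a distinct grading category are assumptions made for this example; the benchmark's actual grading code may differ.

```python
from collections import Counter

# Hypothetical per-response labels (assumed for illustration only).
CORRECT = "correct"      # matches the answer to the altered riddle
ORIGINAL = "original"    # falls back to the memorised answer of the classic riddle
COMPETING = "competing"  # assumed: an alternative answer that earns 0.5x credit
WRONG = "wrong"          # anything else


def altered_metrics(labels):
    """Compute the three altered-riddle metrics from a list of labels."""
    n = len(labels)
    if n == 0:
        raise ValueError("need at least one graded response")
    counts = Counter(labels)
    return {
        # Share of altered riddles answered correctly.
        "altered_accuracy": counts[CORRECT] / n,
        # Share of altered riddles answered with the memorised original answer.
        "pattern_override_rate": counts[ORIGINAL] / n,
        # Correct answers count fully, competing answers count half.
        "weighted_accuracy": (counts[CORRECT] + 0.5 * counts[COMPETING]) / n,
    }


def original_accuracy(is_correct):
    """Share of unaltered riddles answered correctly (list of booleans)."""
    if not is_correct:
        raise ValueError("need at least one graded response")
    return sum(is_correct) / len(is_correct)


# Four altered-riddle responses: one correct, two overrides, one competing.
metrics = altered_metrics([CORRECT, ORIGINAL, ORIGINAL, COMPETING])
```

In this toy run, half of the responses fall back to the memorised answer, so the pattern override rate is 0.5 while altered accuracy is only 0.25. The single competing answer lifts weighted accuracy to 0.375.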
