On the long-running NPR segment known as the Sunday Puzzle, Will Shortz, The New York Times’ crossword puzzle editor, quizzes thousands of listeners every Sunday. The brainteasers are written to be solvable without specialized knowledge, yet even experienced participants typically find them difficult.
Because of this, some researchers believe they are a promising method to test the boundaries of AI’s capacity for problem-solving.
Using riddles from Sunday Puzzle episodes, a group of researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and startup Cursor recently developed an AI benchmark. According to the team, the test surfaced unexpected findings, such as that reasoning models, including OpenAI’s o1, sometimes “give up” and produce answers they know are incorrect.
Arjun Guha, a Northeastern computer science professor and one of the study’s co-authors, told TechCrunch, “We wanted to create a benchmark with problems that humans can understand with only general knowledge.”
The AI sector is currently facing a bit of a benchmarking conundrum. The majority of tests that are frequently used to assess AI models look for abilities that are irrelevant to the typical user, such as proficiency in PhD-level math and science problems. In the meantime, a lot of benchmarks are rapidly reaching the saturation point, even those that were only recently released.
A public radio quiz game such as the Sunday Puzzle has the advantage of not testing for esoteric knowledge, and the questions are written in a way that prevents models from using “rote memory” to solve them, explained Guha.
“I think what makes these problems hard is that it’s really hard to make meaningful progress on a problem until you solve it, and that’s when everything clicks together all at once,” Guha stated. “That calls for a combination of insight and a process of elimination.”
Of course, no benchmark is flawless. The Sunday Puzzle is English-only and U.S.-centric. And since the quizzes are publicly available, models trained on them could conceivably “cheat” in some way, though Guha says he has seen no evidence of this.
He continued, “New questions are released every week, and we can anticipate the most recent ones to be genuinely unseen. We plan to maintain a current benchmark and monitor changes in model performance over time.”
On the researchers’ benchmark, which consists of roughly 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek’s R1 score significantly better than the rest. Reasoning models carefully fact-check themselves before producing results, which helps them avoid some of the common mistakes that trip up AI models. The trade-off is that they often take anywhere from a few extra seconds to minutes to reach conclusions.
On some of the Sunday Puzzle problems, at least one model, DeepSeek’s R1, gives answers it knows to be incorrect. R1 will literally say, “I give up,” and then deliver a wrong response that seems to be selected at random. It is behavior plenty of humans can identify with.
The models make other strange choices, too, such as giving a wrong answer only to retract it immediately, try to tease out a better one, and fail once more. They also get stuck “thinking” indefinitely, offer nonsensical justifications for their answers, or arrive at the correct solution right away and then, for no apparent reason, go on to examine other options.
Guha remarked, “R1 literally says that it’s getting ‘frustrated’ on hard problems.” It was amusing to watch a model mimic what a human might say, he added; whether “frustration” during reasoning affects the quality of model results remains to be seen.

With a score of 59%, o1 is currently the best-performing model on the benchmark, followed by the newly released o3-mini set to high “reasoning effort” at 47%. (R1 scored 35%.) As a next step, the researchers plan to broaden their testing to additional reasoning models, which they hope will help identify areas where these models could be improved.

“Since reasoning skills don’t require a PhD, it should be possible to create reasoning benchmarks that don’t call for PhD-level expertise,” Guha stated. A more widely accessible benchmark lets a larger group of researchers understand and evaluate the findings, potentially resulting in improved solutions down the line, he added. Furthermore, given that state-of-the-art models are increasingly deployed in settings that affect everyone, the team believes everyone should be able to understand what those models can and cannot do.
Researchers Use NPR Sunday Puzzle Questions to Benchmark AI ‘Reasoning’ Models: A New Frontier in AI Evaluation
Artificial Intelligence (AI) has made remarkable strides in recent years, from generating human-like text to diagnosing diseases. However, one of the most challenging aspects of AI development is evaluating its reasoning capabilities. Traditional benchmarks often focus on narrow tasks, leaving a gap in understanding how well AI systems can handle complex, open-ended problems. Enter an unconventional yet brilliant approach: using NPR’s Sunday Puzzle questions to benchmark AI reasoning models. This innovative method is shedding new light on how AI thinks—and where it falls short.
The Challenge of Measuring AI Reasoning
Reasoning is a cornerstone of human intelligence. It involves the ability to analyze information, draw connections, and solve problems creatively. For AI, reasoning is far more complex than memorizing data or following predefined rules. It requires flexibility, contextual understanding, and the ability to handle ambiguity—traits that are difficult to quantify.
Traditional benchmarks, such as standardized tests or datasets like MNIST for image recognition, are useful but limited. They often focus on specific skills, like math or language comprehension, without capturing the broader, more nuanced aspects of reasoning. This is where NPR’s Sunday Puzzle comes in.
Why NPR’s Sunday Puzzle?
For decades, NPR’s Sunday Puzzle has captivated listeners with its clever, often playful challenges. Hosted by Will Shortz, the puzzles range from wordplay and anagrams to logic problems and lateral thinking exercises. What makes these puzzles unique is their blend of creativity, language, and logic—qualities that are essential for human-like reasoning.
Researchers realized that these puzzles could serve as an excellent benchmark for AI models. Unlike traditional tests, NPR puzzles are open-ended, require multi-step reasoning, and often involve wordplay or cultural knowledge. Solving them demands more than just pattern recognition; it requires the ability to think outside the box.
How Researchers Used the Puzzles
In a groundbreaking study, researchers compiled a dataset of NPR Sunday Puzzle questions and used them to evaluate state-of-the-art AI models, including large language models like GPT-4. The goal was to assess how well these models could handle tasks that require reasoning, creativity, and contextual understanding.
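The study’s actual evaluation harness isn’t reproduced here, but in rough outline an evaluation of this kind resembles the following minimal Python sketch. The `puzzles.json` file, the `ask_model` helper, and the exact-match scoring are all illustrative placeholders, not the researchers’ code.

```python
import json
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so answers like 'East!' and 'east' compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def ask_model(question: str) -> str:
    """Placeholder for a call to the model under test (API client, local model, etc.)."""
    raise NotImplementedError("Wire this up to the model you want to benchmark.")

def run_benchmark(path: str = "puzzles.json") -> float:
    """Score a model on a list of {'question': ..., 'answer': ...} puzzle records."""
    with open(path) as f:
        puzzles = json.load(f)

    correct = 0
    for item in puzzles:
        prediction = ask_model(item["question"])  # puzzle sent verbatim, no hints
        if normalize(prediction) == normalize(item["answer"]):
            correct += 1
    return correct / len(puzzles)
```

In practice, grading open-ended puzzle answers is messier than a string comparison, which is part of what makes benchmark design difficult in the first place.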
The puzzles were presented to the AI models in their original form, without additional hints or simplifications. For example, one puzzle asked: “Take the word ‘EASY.’ Its first three letters—E, A, S—are the 5th, 1st, and 19th letters of the alphabet. Can you think of a common five-letter word whose first four letters are the 5th, 1st, 19th, and 20th letters of the alphabet?”
To solve this, an AI model would need to understand the relationship between letters and their positions in the alphabet, apply that mapping step by step, and come up with a plausible five-letter word that fits the criteria. This kind of task goes beyond simple memorization or surface-level pattern matching—it requires genuine reasoning.
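For readers who want the arithmetic spelled out, here is a small Python sketch of the mechanical check a solver, human or machine, is implicitly performing. The function names and the final print statements are illustrative and not drawn from the study.

```python
def letter_positions(word: str) -> list[int]:
    """Map each letter to its 1-based position in the alphabet."""
    return [ord(c) - ord("a") + 1 for c in word.lower()]

def fits_puzzle(word: str) -> bool:
    """Check the stated criterion: a five-letter word whose first four letters
    are the 5th, 1st, 19th, and 20th letters of the alphabet (E, A, S, T)."""
    return len(word) == 5 and letter_positions(word)[:4] == [5, 1, 19, 20]

# Verify the puzzle's premise: E, A, S are the 5th, 1st, and 19th letters.
print(letter_positions("easy"))   # [5, 1, 19, 25]

# Any candidate answer can be checked mechanically; judging whether a candidate
# is a "common" English word is the part that still requires judgment.
print(fits_puzzle("easy"))        # False: the fourth letter, Y, is the 25th, not the 20th
```

The mechanical check is trivial; what makes the puzzle hard is generating candidates and deciding which of them count as ordinary English words, which is exactly the kind of open-ended step the benchmark is designed to probe.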
The Results: AI’s Strengths and Weaknesses
The findings were both impressive and revealing. On one hand, the AI models demonstrated remarkable capabilities, solving many puzzles with ease. For example, they excelled at tasks involving straightforward logic or pattern recognition, such as identifying anagrams or completing sequences.
However, the models struggled with puzzles that required deeper reasoning or cultural context. For instance, one puzzle asked: “Name a famous American landmark that has the same number of letters as the number of states in the U.S. when it was built.” This question requires not only knowledge of U.S. history and geography but also the ability to connect disparate pieces of information—a task that proved challenging for the AI.
These results highlight a critical gap in AI reasoning: while models can process vast amounts of data and perform complex calculations, they often lack the contextual understanding and creativity needed to tackle more nuanced problems.
Implications for AI Development
The use of NPR Sunday Puzzle questions as a benchmark has far-reaching implications for AI research. First, it underscores the importance of developing more sophisticated evaluation methods. Traditional benchmarks are useful, but they don’t capture the full spectrum of human reasoning. By incorporating puzzles and other open-ended challenges, researchers can gain a more comprehensive understanding of AI capabilities.
Second, this approach highlights the need for AI models to improve their contextual understanding and creativity. While current models are highly advanced, they still fall short in areas that require human-like reasoning. Addressing these limitations will be crucial for developing AI systems that can truly think like humans.
Finally, the study demonstrates the value of interdisciplinary collaboration. By drawing inspiration from a popular radio show, researchers were able to create a novel and effective benchmark. This kind of creative thinking is essential for pushing the boundaries of AI research.
Conclusion
The use of NPR Sunday Puzzle questions to benchmark AI reasoning models is a testament to the ingenuity of researchers and the complexity of human intelligence. While AI has made significant progress, this study reveals that there is still much work to be done in developing systems that can reason like humans. By embracing unconventional approaches and focusing on creativity and context, we can unlock new possibilities for AI and bring us closer to machines that truly understand the world. As we continue to explore the frontiers of AI, one thing is clear: the journey is as fascinating as the puzzles themselves.