AI's mathematical mirage: Apple study challenges notion of AI reasoning
21 October 2024
By Sascha Brodsky, Tech Reporter, IBM

In a study that's sending ripples through the tech world, Apple researchers have cast doubt on the notion that large language models (LLMs) are capable of genuine reasoning.

The paper’s authors set out to examine the mathematical reasoning capabilities of current AI models, including industry leader GPT-4 from OpenAI, by introducing a new symbolic dataset. By presenting familiar mathematical concepts in unfamiliar ways, the researchers sought to challenge the models' understanding beyond mere pattern recognition.

The results were striking: most of the LLMs they tested performed significantly worse when faced with these novel representations of math problems, suggesting that these systems may rely more on pattern matching than actual problem-solving skills.

"This paper has fundamentally proven that LLMs can't reason," says Ash Minhas, IBM Technical Content Manager. "They're just pattern matching.”

The road to AGI

This revelation has experts questioning the depth of AI's current capabilities and the path forward in the field. The study's findings underscore the distinction between artificial narrow intelligence (ANI) and artificial general intelligence (AGI), suggesting that current LLMs land firmly in the former category, Minhas said.

The AI field is increasingly embracing the possibility of achieving AGI, which refers to AI systems capable of learning and understanding like humans, applying knowledge across various domains, performing diverse tasks and potentially surpassing human abilities in everything from reasoning to creative pursuits.

Helen Toner, a former board member of OpenAI and director of strategy at Georgetown University’s Center for Security and Emerging Technology, recently testified before a US Senate Judiciary subcommittee that "the biggest disconnect I see between public perceptions and AI insider perspectives comes from inside the handful of companies that are working to build ‘artificial general intelligence’ (AGI), i.e. AI that is roughly as smart as a human.” She said that leading AI companies such as OpenAI, Google and Anthropic are treating building AGI as “an entirely serious goal.”

However, some experts say that AGI is far from a reality. "This paper underscores that we're still in the world of ANI," Minhas says. "We haven't reached AGI."

Benchmark controversy

The paper also highlights the need for better benchmarks in the AI industry. According to Minhas, current benchmark problems are flawed because models can solve them through pattern matching rather than actual reasoning. "If the benchmarks were based on actual reasoning, or if the reasoning problems were more complex, then all the models would perform terribly," he says.

Minhas says the Apple researchers created the synthetic dataset, a collection of data used to train and test AI models and algorithms, by mixing up the symbols in the math problems.

“They've proven that these models' performance degrades when you start tweaking and changing things in the input sequence, whether through the symbols themselves or extra context like superfluous tokens," he says.
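The paper's actual templates are not reproduced here, but a minimal sketch of the symbol-swapping idea Minhas describes, in Python with a made-up apple-counting problem and hypothetical helper names, might look like the following. The underlying arithmetic stays fixed while the surface symbols (names and numbers) are re-sampled, so a model that genuinely reasons should score roughly the same on every variant.

import random

# Illustrative only: a GSM-8K-style word problem written as a template, so the same
# arithmetic can be re-posed with different surface symbols (names and numbers).
TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "How many apples does {name} have in total?"
)

def make_variant(rng: random.Random) -> tuple[str, int]:
    """Instantiate the template with fresh symbols; the reasoning step never changes."""
    name = rng.choice(["Ava", "Noah", "Priya", "Kenji"])
    a, b = rng.randint(2, 40), rng.randint(2, 40)
    return TEMPLATE.format(name=name, a=a, b=b), a + b

rng = random.Random(0)
for _ in range(3):
    question, answer = make_variant(rng)
    print(question, "->", answer)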

The Apple study's methodology involved introducing various "fluffs" and clauses to the training set to observe how model performance changed. However, Jess Bozorg, IBM Data Scientist, points out a potential limitation: "They didn't specify how many categories of fluffs they considered in their additions, or what types of fluffs they used from which categories," she says.
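The "fluffs" Bozorg refers to are clauses that add tokens but no information needed for the solution. A hedged sketch of that kind of perturbation, reusing the hypothetical problem text from the example above rather than anything from the Apple paper, could look like this:

import random

# Illustrative only: append an irrelevant ("no-op") clause that adds tokens but no
# information needed to solve the problem. A model that reasons should ignore it;
# a pattern matcher may be thrown off by the extra context.
DISTRACTORS = [
    "Five of the apples are slightly smaller than the rest",
    "The orchard is a ten-minute walk from the farmhouse",
    "It rained briefly on Tuesday afternoon",
]

def add_fluff(question: str, rng: random.Random) -> str:
    """Insert one irrelevant clause just before the final question sentence."""
    sentences = question.split(". ")
    sentences.insert(-1, rng.choice(DISTRACTORS))
    return ". ".join(sentences)

rng = random.Random(1)
base = ("Ava picks 17 apples on Monday and 25 apples on Tuesday. "
        "How many apples does Ava have in total?")
print(add_fluff(base, rng))  # one distractor sentence now precedes the final question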

One of the paper's critiques of current LLM benchmarks is the issue of data contamination. Bozorg explains that the Apple study used the GSM-8K dataset, a set of grade-school math word problems created by humans. "There's data leakage," she says. "This means that the model had already seen some of this data during the testing stage in their training."

Contamination is a widespread issue in the industry. Minhas says that the GSM-8K dataset “is such an industry benchmark that there are bits and pieces of it all over the training data that all models know about. This is a fundamental problem with all of these created benchmarks."
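The article does not describe how such contamination is detected, but one common and simple screen, shown here as an illustrative sketch rather than anything from the Apple paper, is to look for long verbatim n-gram overlaps between benchmark items and training text:

# Illustrative only: a crude contamination screen that flags a benchmark item if any
# long n-gram from it appears verbatim in the training text. Real pipelines are more involved.
def ngrams(text: str, n: int) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(benchmark_item: str, training_text: str, n: int = 8) -> bool:
    """True if the item shares at least one n-gram of length n with the training text."""
    return bool(ngrams(benchmark_item, n) & ngrams(training_text, n))

item = ("Ava picks 17 apples on Monday and 25 apples on Tuesday. "
        "How many apples does Ava have in total?")
training_text = "assorted web text " + item + " more web text"  # the wording leaked into training data
print(looks_contaminated(item, training_text))  # True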

Interestingly, the study revealed that GPT-4 performed notably better than other models when tested on the new symbolic dataset. Minhas speculates on the reason: "Is it possible that when training GPT-4, they thought about symbolic representations and generated test data like that? Maybe it's still just doing pattern matching, but it had this data type in its training dataset."

Minhas points out that researchers are trying to move beyond pattern matching by introducing memory into AI systems. "That's one way we're trying to make them more general, but it's still only pattern matching based on what you've given it," he says.

The Apple study has exposed significant limitations in current AI systems, revealing that the journey toward truly intelligent machines is still far from complete. Now, experts say, the AI community faces the challenge of bridging the gap between pattern matching and genuine reasoning.

“The transformer architecture alone isn’t enough for reasoning,” Minhas says. “Advancements in model architecture are needed for reasoning capabilities.”
