arXiv:2402.00157v1 [cs.CL] 31 Jan 2024
Large Language Models for Mathematical Reasoning:
Progresses and Challenges
Janice Ahn
Rishu Verma
Renze Lou
Di Liu
Rui Zhang
and Wenpeng Yin
The Pennsylvania State University Temple University
{jfa5672, wenpeng}@psu.edu; diliu@temple.edu
Abstract
Mathematical reasoning serves as a cornerstone
for assessing the fundamental cognitive capa-
bilities of human intelligence. In recent times,
there has been a notable surge in the devel-
opment of Large Language Models (LLMs)
geared towards the automated resolution of
mathematical problems. However, the land-
scape of mathematical problem types is vast
and varied, with LLM-oriented techniques un-
dergoing evaluation across diverse datasets and
settings. This diversity makes it challenging
to discern the true advancements and obsta-
cles within this burgeoning field. This survey
endeavors to address four pivotal dimensions:
i) a comprehensive exploration of the various
mathematical problems and their correspond-
ing datasets that have been investigated; ii) an
examination of the spectrum of LLM-oriented
techniques that have been proposed for math-
ematical problem-solving; iii) an overview of
factors and concerns affecting LLMs in solving
math; and iv) an elucidation of the persisting
challenges within this domain. To the best of
our knowledge, this survey stands as one of the
first extensive examinations of the landscape
of LLMs in the realm of mathematics, provid-
ing a holistic perspective on the current state,
accomplishments, and future challenges in this
rapidly evolving field.
1 Introduction
Mathematical reasoning is crucial to human intel-
ligence, driving ongoing efforts in the AI commu-
nity to autonomously tackle math challenges. This
pursuit inherently calls for an augmentation of AI
capabilities, delving into the intricate realms of tex-
tual comprehension, image interpretation, tabular
analysis, symbolic manipulation, operational logic,
and a nuanced grasp of world knowledge. As the
AI landscape evolves, the endeavor to empower
machines with a comprehensive understanding of
diverse mathematical facets becomes not only a tes-
tament to technological prowess but also a pivotal
stride towards achieving a more generalized and
adept AI.
In recent times, the landscape of AI has been
reshaped by the ascendancy of Large Language
Models (LLMs) as formidable tools for automating
intricate tasks. Notably, LLMs have proven to be
potent assets in unraveling the nuances of mathe-
matical problem-solving (Romera-Paredes et al.,
2023; Imani et al., 2023). Their language capabili-
ties fuel focused exploration in utilizing them for
mathematical reasoning, uncovering fresh insights
into the synergy between language and logic.
However, amid this progress, the current state
of LLM-oriented research in mathematics presents
a complex panorama. Diverse mathematical prob-
lem types pose a formidable challenge, exacerbated
by the varied evaluation metrics, datasets, and set-
tings employed in the assessment of LLM-oriented
techniques (Testolin, 2023; Lu et al., 2023c). The
lack of a unified framework hampers our ability to
gauge the true extent of progress achieved and im-
pedes a coherent understanding of the challenges
that persist in this evolving field.
This survey endeavors to cast a spotlight on the
multifaceted landscape of LLMs in the realm of
mathematics. We plan to traverse four crucial di-
mensions: a meticulous exploration of math prob-
lem types and the datasets associated with them;
an in-depth analysis of the evolving techniques em-
ployed by LLMs in mathematical problem-solving;
an examination of factors that affect LLMs in solving
math problems; and a critical discussion on the
persisting challenges that loom over this burgeon-
ing field.
To our knowledge, this survey marks one of the
first comprehensive examinations of LLMs specif-
ically tailored for mathematics. By weaving to-
gether insights from various dimensions, we aim to
provide a holistic understanding of the current state
of affairs in LLM-driven mathematical reasoning,
shedding light on achievements, challenges, and
the uncharted territories that await exploration in
this captivating intersection of language and logic.
2 Related Work
To the best of our knowledge, the existing literature
on summarizing mathematical research, particu-
larly within the context of LLMs, remains limited.
Notably, Chang et al. (2023) conducted a compre-
hensive evaluation of LLMs, incorporating an ex-
amination of their performance in mathematical
problem-solving, albeit with a relatively brief ex-
ploration of the mathematical field. Conversely,
both Testolin (2023) and Lu et al. (2023c) delved
into the application of Deep Learning in the domain
of mathematical reasoning. Our work distinguishes
itself on three fronts: firstly, we concentrate on
LLMs, providing a more in-depth analysis of their
various advancements; secondly, beyond merely
reporting progress, we engage in a thorough discus-
sion of the challenges inherent in this trajectory;
and thirdly, we extend our scrutiny to encompass
the perspective of mathematics pedagogy. In do-
ing so, we contribute a nuanced perspective that
seeks to broaden the understanding of LLMs in the
context of mathematical research.
The only work contemporaneous with ours is that of
Liu et al. (2023b). In comparison, our contribution
lies in: i) not only introducing various methods
but also paying closer attention to the factors
affecting model performance; ii) taking a broader
perspective on the progress of LLMs in the field
of mathematics, examining it not only from the AI
perspective but also from the perspective of
education. We emphasize that pursuing model
performance alone, while neglecting human factors,
is a concern that deserves attention.
3 Math Problems & Datasets
This section concisely overviews prominent
mathematical problem types and their associated datasets.
3.1 Arithmetic
This category of problems entails pure mathemati-
cal operations and numerical manipulation, devoid
of the need for the model to interpret text, images,
or other contextual elements. An illustrative
example is presented below, where “Q” denotes the
question and “A” the answer.
Q: 21 + 97
A: 118
The dataset MATH 401 (Yuan et al., 2023) contains
401 arithmetic expressions in 17 groups.
3.2 Math Word Problems
MATH WORD PROBLEMS (MWP) are mathemati-
cal exercises or scenarios presented in the form of
written or verbal descriptions rather than straight-
forward equations in ARITHMETIC. These prob-
lems require individuals to decipher the informa-
tion provided, identify relevant mathematical con-
cepts, and formulate equations or expressions to
solve the given problem. MWP often reflect real-
world situations, allowing individuals to apply
mathematical principles to practical contexts. Solv-
ing these problems typically involves critical think-
ing, problem-solving skills, and the application of
mathematical operations to find a solution.
MWP invariably comprise a question (Q) and
its corresponding final answer (A) (referred to as
Question-Answer). However, the presence or ab-
sence of additional clues can give rise to various
versions of these problems. Variations may emerge
based on factors such as the availability of an equa-
tion (E; referred to as Question-Equation-Answer)
or the provision of a step-by-step rationale (R;
Question-Rationale-Answer) to guide the problem-
solving process.
Question-Answer. An instance of this type of
MWP consists of a question (Q) and the final
answer (A), such as:
Q: Lily received $20 from her mum. After
spending $10 on a storybook and $2.5 on
a lollipop, how much money does she have
left?
A: $7.5
Question-Equation-Answer. Compared with
Question-Answer, this MWP type provides the
equation solution, such as
Q: Jack had 8 pens and Mary had 5 pens.
Jack gave 3 pens to Mary. How many pens
does Jack have now?
E: 8 − 3
A: 5 (optional)
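In the Question-Equation-Answer format, the final answer can be derived mechanically from the equation, which is why the answer field can be optional. The following is a minimal Python sketch of that derivation, assuming equations contain only numeric literals and the four basic operators; the helper name `evaluate_equation` is hypothetical and not part of any dataset toolkit.

```python
import ast
import operator

# Supported binary operators; any other syntax is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate_equation(equation: str):
    """Evaluate a plain arithmetic expression such as '8 - 3' without eval()."""

    def _eval(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError(f"unsupported expression: {equation!r}")

    # Normalize the typographic minus sign that appears in typeset equations.
    tree = ast.parse(equation.replace("\u2212", "-"), mode="eval")
    return _eval(tree.body)


print(evaluate_equation("8 \u2212 3"))       # equation E from the pens example
print(evaluate_equation("20 - 10 - 2.5"))   # the Lily example written as an equation
```

Walking the syntax tree instead of calling `eval` keeps the sketch from executing arbitrary code if an equation field is malformed.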
Dataset | Notes
CMATH (Wei et al., 2023) | Chinese; grade 1-6
SAT-MATH (Zhong et al., 2023) |
SVAMP (Patel et al., 2021) | Three types of variations
ASDIV (Miao et al., 2020) | Problem type and grade level annotated
MAWPS (Koncel-Kedziorski et al., 2016) | Extension of ADDSUB, MULTIARITH, etc.
PARAMAWPS (Raiyan et al., 2023) | Paraphrased, adversarial MAWPS
SINGLEEQ (Koncel-Kedziorski et al., 2015) |
ADDSUB (Hosseini et al., 2014) | Only addition and subtraction
MULTIARITH (Roy and Roth, 2015) | Multi-step reasoning
DRAW-1K (Upadhyay and Chang, 2017) |
MATH23K (Wang et al., 2017) |
APE210K (Zhao et al., 2020) |
K6 (Yang et al., 2023) | Chinese; grade 1-6
CM17K (Qin et al., 2021) | Chinese; grade 6-12
CARP (Zhang et al., 2023a) |
GSM8K (Cobbe et al., 2021) | Linguistically diverse
MATH (Hendrycks et al., 2021) | Problems are put into difficulty levels 1-5
PRM800K (Lightman et al., 2023) | MATH w/ step-wise labels
MATHQA (Amini et al., 2019) | GRE examinations; have quality concern
AQUA (Ling et al., 2017) | GRE&GMAT questions
ARB (Sawada et al., 2023) | Contest problems and university math proof
GHOSTS (Frieder et al., 2023) |
THEOREMQA-MATH (Chen et al., 2023b) | Theorem as rationale
LILA (Mishra et al., 2022) | Incorporates 20 existing datasets
MATH-INSTRUCT (Yue et al., 2023) | Instruction-following style
TABMWP (Lu et al., 2023b) | Tabular MWP; below the College level
Table 1: Datasets for Math Word Problems.
E = Elementary, M = Middle School, H = High School, C = College, H = Hybrid

Question-Rationale-Answer. This type of
MWP includes answers and reasoning paths, akin
to the Chain-of-Thought method, which explicates
reasoning steps rather than defining problem types
(Wei et al., 2022). The rationale guides correct
problem-solving and serves as a valuable reference
for model training, including fine-tuning and
few-shot learning.
Q: Beth bakes 4, 2 dozen batches of
cookies in a week. If these cookies are
shared amongst 16 people equally, how
many cookies does each person consume?
R: Beth bakes 4 2 dozen batches of
cookies for a total of 4*2 = <<4*2=8>>8
dozen cookies. There are 12 cookies in a
dozen and she makes 8 dozen cookies for a
total of 12*8 = <<12*8=96>>96 cookies.
She splits the 96 cookies equally amongst
16 people so they each eat
96/16 = <<96/16=6>>6 cookies.
A: 6
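The `<<expression=result>>` markers in the rationale above are calculator-style annotations, so each intermediate step can be checked independently of the final answer. The sketch below shows one way to do that in Python; the function name `check_annotations` and the whitelist of allowed characters are assumptions made for illustration.

```python
import re

# Matches calculator annotations of the form <<expression=result>>.
ANNOTATION = re.compile(r"<<([^=<>]+)=([^<>]+)>>")


def check_annotations(rationale: str) -> list[tuple[str, float, bool]]:
    """Return (expression, stated result, is_consistent) for each annotation."""
    checks = []
    for expr, stated in ANNOTATION.findall(rationale):
        # Allow only digits, whitespace, and basic operators before evaluating.
        if not re.fullmatch(r"[\d\s.+\-*/()]+", expr):
            raise ValueError(f"unsupported expression: {expr!r}")
        computed = eval(expr, {"__builtins__": {}}, {})
        consistent = abs(computed - float(stated)) < 1e-9
        checks.append((expr.strip(), float(stated), consistent))
    return checks


rationale = (
    "Beth bakes 4 2 dozen batches of cookies for a total of "
    "4*2 = <<4*2=8>>8 dozen cookies. There are 12 cookies in a dozen and "
    "she makes 8 dozen cookies for a total of 12*8 = <<12*8=96>>96 cookies. "
    "She splits the 96 cookies equally amongst 16 people so they each eat "
    "96/16 = <<96/16=6>>6 cookies."
)
for expr, stated, ok in check_annotations(rationale):
    print(expr, stated, ok)
```

A step-level check of this kind only verifies the arithmetic inside each annotation; it says nothing about whether the step is the right one to take.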
Table 1 lists most datasets that are summarized
in three categories: Question-Answer, Question-
Equation-Answer, and Question-Rationale-Answer.
In addition to the above three MWP types of con-
ventional styles, recent work studied MWP in
given tables and even MWP generation.
Tabular MWP. TABMWP (Lu et al., 2023b) is
the first dataset to study MWP over tabular context
on open domains and is the largest in terms of data
size. Each problem in TABMWP is accompanied
by a tabular context, which is represented in three
formats: an image, a semi-structured text, and a
structured table.
Table 2: Table for the tabular MWP example.
T : Table 2
Q: Henrik bought 2.5 kilograms of oval
beads. How much did he spend? (Unit: $)
A: 5

MWP Generation. Instead of deriving the an-
swer for a given math question, this type of mathe-
matical reasoning tries to generate MWP questions.
For example, Wang et al. (2021) fine-tuned GPT-
2 (Radford et al., 2019) on equation-to-MWP in-
stances for MWP generation. The effectiveness of
GPT-3’s question-generation capabilities was as-
sessed by Zong and Krishnamachari (2023), who
instructed the model to generate a question similar
to a provided MWP question. Deb et al. (2023) an-
alyzed a group of LLMs (GPT-4, GPT-3.5, PaLM-
2 (Anil et al., 2023), and LLaMa (Touvron et al.,
2023a)), and found a significant drop in accuracy
for backward reasoning compared to forward rea-
soning. Norberg et al. (2023) used GPT-4 to rewrite
human-written MWP, reporting optimal readabil-
ity, lexical diversity, and cohesion scores, although
GPT-4 rewrites incorporated more low-frequency words.
3.3 Geometry
Compared with MWP, GEOMETRY problems in-
volve a distinct set of challenges. While MWP of-
ten requires logical reasoning and arithmetic op-
erations, geometry problems demand a spatial un-
derstanding of shapes, sizes, and their interrela-
tionships. Solving geometry problems typically
entails applying geometric principles, theorems,
and formulas to analyze and deduce properties of
geometric figures. Furthermore, current geometry
approaches mainly rely on symbolic methods and
predefined search heuristics, highlighting the spe-
cialized strategies required in this domain (Trinh
et al., 2024). This contrast in problem-solving
approaches highlights the multifaceted nature of
mathematical challenges and the varied skill sets
required in different mathematical domains. An
example can be seen as follows and Table 3 lists
mainstream datasets.
Q: a=7 inches; b=24 inches; c=25 inches;
h=5.4 inches; What is its area? (Unit:
square inches)
A: 24.03
GEOSHADER (Alvin et al., 2017)
GEOS (Seo et al., 2015)
GEOS++ (Sachan et al., 2017)
GEOS-OS (Sachan and Xing, 2017)
GEOMETRY3K (Lu et al., 2021)
GEOQA (Chen et al., 2021a)
UNIGEO (Chen et al., 2022)
Table 3: Geometry datasets
3.4 Automated theorem proving
In the specialized area of Automated Theorem
Proving (ATP), the inherent challenges are unique
and encompass a wide spectrum, akin to those
found in distinct mathematical fields. ATP’s core
focus is on autonomously constructing proofs for
specified conjectures, requiring a blend of logical
analysis and a profound grasp of formal languages,
supported by an extensive knowledge base. Its
application is crucial in areas like the validation
and development of both software and hardware systems.
For example, the MINIF2F dataset (Zheng et al.,
2022) stands out in ATP, featuring a series of com-
plex Olympiad-level mathematical problems, de-
signed to evaluate theorem-proving systems includ-
ing Metamath (Yu et al., 2023), Lean (Han et al.,
2022), and Isabelle (Wenzel et al., 2008). In a
similar vein, the HOList benchmark (Bansal et al.,
2019), with its comprehensive array of theorem
statements from various corpora, sets a sequential
proving challenge for ATP systems, where each
theorem must be proved using only the lemmas
preceding it. Additionally, the COQGYM dataset
(Yang and Deng, 2019) provides a broad ATP en-
vironment, showcasing a rich collection of more
than 71,000 proofs penned by humans, all within
the framework of the Coq proof assistant. These
datasets illustrate the diverse methodologies and
skillsets necessary in ATP, reflecting the multi-
faceted nature of solving mathematical problems.
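To make the notion of a formal conjecture concrete, here is a toy statement and proof in Lean 4. It is purely illustrative and is not drawn from MINIF2F or any other dataset discussed above; the theorem name `add_comm_example` is made up for this sketch.

```lean
-- Toy conjecture: addition on natural numbers is commutative.
-- `Nat.add_comm` is a lemma from Lean 4's core library.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

An ATP system is given only the statement and must produce a proof that the proof checker accepts; benchmark problems are far harder than this one-line case.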
3.5 Math in vision-language context
CHARTQA (Masry et al., 2022) contains 9.6K
human-written questions and 23.1K model-generated
questions that require complex reasoning involving
several logical and arithmetic operations over
charts. MATHVISTA (Lu et al., 2023a) contains 6K
examples and features seven types of mathematical
reasoning: algebraic reasoning, arithmetic
reasoning, geometry reasoning, logical reasoning,
numeric common sense, scientific reasoning, and
statistical reasoning. In addition, fine-grained
metadata are available, including question type,
answer type, language, source, category, task,
grade level, and visual context.
4 Methodologies
We summarize these methods into three progressive
levels: i) Prompting frozen LLMs, ii) Strategies en-
hancing frozen LLMs, and iii) Fine-tuning LLMs.
4.1 Prompting frozen LLMs
We organize prior work by typical LLMs.
GPT-3. Zong and Krishnamachari (2023) evaluated
the use of GPT-3, a 175B-parameter transformer
model, for three related challenges pertaining
to math word problems: i) classifying word
problems, ii) extracting equations from word
problems, and iii) generating word problems.
ChatGPT. Shakarian et al. (2023) reported the
first independent evaluation of ChatGPT on MWP,
and found that ChatGPT’s performance changes
dramatically based on the requirement to show its
work. Cheng and Zhang (2023) assessed Chat-
GPT, OpenAI’s latest conversational chatbot and
LLM, on its performance in elementary-grade arith-
metic and logic problems, and found that Chat-
GPT performed better than previous models such
as InstructGPT (Ouyang et al., 2022) and Minerva
(Lewkowycz et al., 2022).
GPT-4. Wu et al. (2023) adapted and evaluated
several existing prompting methods to the usage
of GPT-4, including a vanilla prompt, Program-
of-Thoughts prompt (Chen et al., 2023a), and Pro-
gram Synthesis prompt (Drori et al., 2022). The
study by Gu (2023) investigated the capability of
GPT-4 to actively engage in math-oriented brain-
storming sessions. This includes tasks like iden-
tifying new research problems, refining problem
formulations, and suggesting potential methods or
unconventional solutions, all achieved through
iterative ideation with a human partner, a common
practice in collaborative brainstorming with other
professionals.
GPT4V & Bard. Lu et al. (2023a) presented
MATHVISTA, a benchmark for evaluating mathematical
reasoning in visual contexts, and conducted
a comprehensive, quantitative evaluation of three
LLMs (i.e., ChatGPT, GPT-4, Claude-2 (Bai et al.,
2022)), two proprietary large multimodal models
(LMMs) (i.e., GPT4V, Bard), and seven
open-source LMMs, with Chain-of-Thought and
Program-of-Thought prompting.
Multiple. Wei et al. (2023) evaluated a variety
of popular LLMs, including both commercial and
open-source options, aiming to provide a bench-
mark tool for assessing the following question:
to what grade level of Chinese elementary school
math do the abilities of popular LLMs correspond?
4.2 Strategies enhancing frozen LLMs
Preprocessing the math question. An et al.
(2023a) explored ChatGPT for the dataset SVAMP
and observed that substituting numerical expressions
with English expressions can elevate the performance.
More advanced prompts. Chain-of-Thought
(Wei et al., 2022) was the first method to steer
LLMs toward step-by-step math reasoning.
Self-Consistency (Wang et al., 2023) samples multiple
Chain-of-Thought reasoning paths and leverages a
consistency mechanism to identify the most probable
answer. Zhou et al. (2023a) proposed a prompting
method, explicit code-based self-verification,
to further boost the mathematical reasoning
potential of GPT-4 Code Interpreter. This method
employs a zero-shot prompt on GPT-4 Code
Interpreter to encourage it to use
code to self-verify its answers.
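The consistency mechanism of Self-Consistency can be sketched as a majority vote over the final answers of several sampled reasoning paths. The Python snippet below is a minimal illustration, assuming the final answers have already been extracted as strings; the sample list is invented and no model is actually called.

```python
from collections import Counter


def self_consistent_answer(sampled_answers: list[str]) -> str:
    """Return the most frequent final answer among sampled reasoning paths."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    # Normalize trivially different spellings of the same answer.
    normalized = [answer.strip().rstrip(".") for answer in sampled_answers]
    answer, _count = Counter(normalized).most_common(1)[0]
    return answer


# Hypothetical final answers extracted from five sampled chains of thought.
samples = ["118", "118", "108", "118.", " 118"]
print(self_consistent_answer(samples))
```

Real systems usually need more careful answer normalization (units, fractions, equivalent forms) than the simple string cleanup used here.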
Using external tool. Yamauchi et al. (2023) em-
ployed an external tool, specifically the Python
REPL, to correct errors in Chain-of-Thought. Their
demonstration highlighted that integrating Chain-
of-Thought and Python REPL using a markup
language improves the reasoning capabilities of
ChatGPT. In a related context, He-Yueya et al.
(2023) introduced an approach that merges an
LLM, Codex (Chen et al., 2021b), capable of pro-
gressively formalizing word problems into vari-
ables and equations, with an external symbolic
solver adept at solving the generated equations.
Program-of-Thought (Chen et al., 2023a) separates
the computational aspect from the reasoning by
utilizing a Language Model (primarily Codex) to
articulate the reasoning procedure as a program.
The actual computation is delegated to an external
computer, responsible for executing the generated
programs to arrive at the desired answer.
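The division of labor in Program-of-Thought, where the model writes a program and an interpreter computes the result, can be sketched as follows. The program string is a hand-written stand-in for model output, and reading the result from a variable named `ans` is an assumption of this sketch.

```python
# Hand-written stand-in for a program an LLM might generate for the question:
# "Lily received $20 ... After spending $10 ... and $2.5 ... how much is left?"
GENERATED_PROGRAM = """
money_received = 20
storybook_cost = 10
lollipop_cost = 2.5
ans = money_received - storybook_cost - lollipop_cost
"""


def run_generated_program(program: str, answer_var: str = "ans"):
    """Execute generated code in a fresh namespace and read back the answer."""
    namespace: dict = {}
    # WARNING: exec runs arbitrary code; a real system needs sandboxing.
    exec(program, namespace)
    if answer_var not in namespace:
        raise KeyError(f"program did not define {answer_var!r}")
    return namespace[answer_var]


print(run_generated_program(GENERATED_PROGRAM))
```

Because the arithmetic is carried out by the interpreter, calculation slips are avoided, although mistakes in how the problem was translated into code remain possible.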

Improving the whole interaction. Wu et al.
(2023) introduced MathChat, a conversational
framework designed for chat-based LLMs. In
this framework, math problems from the MATH
dataset are resolved through a simulated conversa-
tion between the model and a user proxy agent.
Considering more comprehensive factors in eval-
uation. While accuracy is crucial in evaluating
LLMs for math problem-solving, it shouldn’t be the
sole metric. Other important dimensions include:
i) Confidence Provision: Imani et al. (2023)’s
“MathPrompter” boosts LLM performance and
confidence by generating algebraic expressions,
providing diverse prompts, and evaluating consensus
among multiple runs. ii) Verifiable Explanations:
Gaur and Saunshi (2023) used concise, verifiable
explanations to assess LLM reasoning, revealing
their proficiency in zero-shot solving of symbolic
MWP and their ability to produce succinct explanations.
4.3 Fine-tuning LLMs
Learning to select in-context examples. As in-
dicated by prior research, few-shot GPT-3’s perfor-
mance is susceptible to instability and may decline
to near chance levels due to the reliance on in-
context examples. This instability becomes more
pronounced when dealing with intricate problems
such as TABMWP. In addressing this issue, Lu
et al. (2023b) introduced PROMPTPG, which can
autonomously learn to select effective in-context
examples through policy gradient interactions with
the GPT-3 API, eliminating the need for manually
designed heuristics.
Generating intermediate steps. Nye et al.
(2021) initiated the fine-tuning of decoder-only
LLMs, ranging from 2M to 137B in size. Their
approach involved training these models to solve
integer addition and polynomial evaluation by gen-
erating intermediate computation steps into a des-
ignated “scratchpad.” In a related effort, Zhang
et al. (2023b) introduced a fine-tuning strategy for
GPT-2 or T5, enabling them to produce step-by-
step solutions with a combination of textual and
mathematical tokens leading to the final answer.
Additionally, Yang et al. (2023) applied a step-by-
step strategy in fine-tuning a series of GLM models
(Zeng et al., 2023), specifically tailored for solving
distinct Chinese mathematical problems. Minerva,
developed by Lewkowycz et al. (2022), enhances
LLMs’ ability to generate intermediate steps in
complex math problems. Its fine-tuning on diverse
datasets enables nuanced, step-by-step
problem-solving, demonstrating advanced handling of
intricate mathematical concepts.
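To illustrate what intermediate computation steps for integer addition might look like as training targets, the Python sketch below writes out a digit-by-digit trace with carries. The textual format is an assumption made for illustration and is not the exact scratchpad format of Nye et al. (2021).

```python
def addition_scratchpad(a: int, b: int) -> list[str]:
    """Return digit-by-digit addition steps followed by the final sum."""
    if a < 0 or b < 0:
        raise ValueError("this sketch handles non-negative integers only")
    # Reverse the digit strings so index 0 is the least significant digit.
    digits_a, digits_b = str(a)[::-1], str(b)[::-1]
    steps, result_digits, carry = [], [], 0
    for i in range(max(len(digits_a), len(digits_b))):
        da = int(digits_a[i]) if i < len(digits_a) else 0
        db = int(digits_b[i]) if i < len(digits_b) else 0
        total = da + db + carry
        steps.append(
            f"{da} + {db} + carry {carry} = {total}: "
            f"write {total % 10}, carry {total // 10}"
        )
        result_digits.append(str(total % 10))
        carry = total // 10
    if carry:
        result_digits.append(str(carry))
    steps.append("answer: " + "".join(reversed(result_digits)))
    return steps


for line in addition_scratchpad(21, 97):
    print(line)
```

A model fine-tuned on traces of this kind is asked to emit every line, not just the last one, so each step conditions on explicit earlier steps.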
Learning an answer verifier. OpenAI re-
searchers, per Cobbe et al. (2021), fine-tuned a
GPT-3 model of 175B as a verifier, assigning
probabilities to solution candidates. In explor-
ing reexamination processes for MWP solving,
Bin et al. (2023) introduced Pseudo-Dual Learn-
ing, involving solving and reexamining modules.
For MWP solution, Zhu et al. (2023) developed a
cooperative reasoning-induced PLM, with GPT-J
(Wang and Komatsuzaki, 2021) generating paths
and DeBERTa-large (He et al., 2021) supervising
evaluation. Google researchers, as per Liu et al.
(2023c), observed improved correctness in LLMs
with multiple attempts, which hints that LLMs
might generate correct solutions while struggling
to differentiate between accurate and inaccurate
ones. They sequentially fine-tuned their PaLM 2
model (Anil et al., 2023) as a solution generator,
evaluator, and generator again.
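The role of a verifier at inference time can be sketched as reranking: several candidate solutions are generated, each is scored, and the highest-scoring one is kept. In the Python sketch below, `toy_verifier` is a hypothetical rule-based stand-in for a learned verifier, and the candidates are invented.

```python
from typing import Callable


def select_with_verifier(candidates: list[str],
                         verifier: Callable[[str], float]) -> str:
    """Return the candidate solution with the highest verifier score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=verifier)


def toy_verifier(solution: str) -> float:
    """Hypothetical stand-in for a learned verifier: checks 'a + b = c'."""
    lhs, rhs = solution.split("=")
    a, b = (int(term) for term in lhs.split("+"))
    return 1.0 if a + b == int(rhs) else 0.0


candidates = ["21 + 97 = 108", "21 + 97 = 118", "21 + 97 = 128"]
print(select_with_verifier(candidates, toy_verifier))
```

A learned verifier would output a probability of correctness instead of a hard 0 or 1, but the selection step is the same.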
Learning from enhanced dataset. Emulating
the error-driven learning process observed in hu-
man learning, An et al. (2023b) conducted fine-
tuning on various open-source LLMs within the
LLaMA (Touvron et al., 2023a), LLaMA-2 (Touvron
et al., 2023b), CodeLLaMA (Rozière et al.,
2023), WizardMath (Luo et al., 2023), MetaMath
(Yu et al., 2023), and Llemma (Azerbayev et al.,
2023) families. This fine-tuning utilized mistake-
correction data pairs generated by GPT-4. To
mitigate over-reliance on knowledge distillation
from LLM teachers, Liang et al. (2023a) fine-
tuned LLaMA-7B on existing mathematical prob-
lem datasets that exhibit diverse annotation styles.
In a related approach, Raiyan et al. (2023)
demonstrated that training on linguistic variants of
problem statements and implementing a voting
mechanism for candidate predictions enhance the
mathematical reasoning and overall robustness of the
model.
Teacher-Student knowledge distillation. Liang
et al. (2023b) utilized GPT-3 to coach a more
efficient MWP solver (RoBERTa-based encoder-
decoder (Liu et al., 2019)). They shifted the focus
from explaining existing exercises to identifying
the student model’s learning needs and generating
new, tailored exercises. The resulting smaller LLM
achieves competitive accuracy on the SVAMP
dataset with significantly fewer parameters com-
pared to state-of-the-art LLMs.
Finetuning on many datasets. Mishra et al.
(2022) conducted fine-tuning on a series of GPT-
Neo2.7B causal language models (Black et al.,
2021) using LILA, a composite of 20 existing math
datasets. Similarly, Yue et al. (2023) created “Math-
Instruct”, a meticulously curated instruction tun-
ing dataset. Comprising 13 math datasets with
intermediate Chain-of-Thought and Program-of-
Thought rationales, this dataset was used to
fine-tune Llama (Touvron et al., 2023a,b; Rozière et al.,
2023) models across different scales. The result-
ing models demonstrate unprecedented potential in
cross-dataset generalization.
Math solver ensemble. Yao et al. (2023) incor-
porated a problem typing subtask that combines
the strengths of the tree-based solver and the LLM
solver (ChatGLM-6B (Zeng et al., 2023)).
5 Analysis
5.1 LLMs’ robustness in math
Patel et al. (2021) provided strong evidence that the
pre-LLM MWP solvers, mostly LSTM-equipped
encoder-decoder models, rely on shallow heuristics
to achieve high performance on some simple bench-
mark datasets, then introduced a more challenging
dataset, SVAMP, created by applying carefully
chosen variations over examples sampled from
preceding datasets. Stolfo et al. (2023) observed
that, among non-instruction-tuned LLMs, the larger
ones tend to be more sensitive to changes in the
ground-truth result of an MWP, but not necessarily
more robust. However, a different behavior exists
in the instruction-tuned GPT-3 models, which show
a remarkable improvement in both sensitivity and
robustness, although the robustness reduces when
problems get more complicated. Wei et al. (2023)
assessed the robustness of several top-performing
LLMs by augmenting the original problems in the
curated CMATH dataset with distracting informa-
tion. Their findings reveal that GPT-4 can maintain
robustness while other models fail.
Zhou et al. (2023b) proposed a new dataset RO-
BUSTMATH to evaluate the robustness of LLMs in
math-solving ability. Extensive experiments show
that (i) Adversarial samples from higher-accuracy
LLMs are also effective for attacking LLMs with
lower accuracy; (ii) Complex MWPs (such as more
solving steps, longer text, more numbers) are more
vulnerable to attack; (iii) We can improve the ro-
bustness of LLMs by using adversarial samples in
few-shot prompts.
5.2 Factors influencing LLMs in math
The comprehensive evaluation conducted by Yuan
et al. (2023) encompasses OpenAI’s GPT series,
including GPT-4, ChatGPT, and GPT-3.5, along
with various open-source LLMs. This analysis
methodically examines the elements that impact the
arithmetic skills of LLMs, covering aspects such as
tokenization, pre-training, prompting techniques,
interpolation and extrapolation, scaling laws, Chain
of Thought (COT), and In-Context Learning (ICL).
Tokenization. This research underscores tok-
enization’s critical role in LLMs’ arithmetic perfor-
mance (Yuan et al., 2023). Models like T5, lacking
specialized tokenization for arithmetic, are less ef-
fective than those with advanced methods, such as
Galactica (Taylor et al., 2022) and LLaMA, which
show superior accuracy in arithmetic tasks. This
indicates that token frequency in pre-training and
the method of tokenization are key to arithmetic
performance.
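One commonly discussed form of arithmetic-friendly tokenization is to split every number into single digits, so that a number is never represented as an arbitrary multi-digit chunk. The Python sketch below illustrates the idea with a regex-based toy tokenizer; it is an assumption made for illustration and is not the tokenizer of any model named above.

```python
import re


def tokenize_digits(text: str) -> list[str]:
    """Toy tokenizer: each digit is its own token; words and symbols stay whole."""
    return re.findall(r"\d|[A-Za-z]+|[^\sA-Za-z\d]", text)


print(tokenize_digits("21 + 97 = 118"))
```

With digit-level tokens, the same digit always maps to the same token regardless of the number it appears in, which makes place-value patterns more regular for the model.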
Pre-training Corpus. Enhanced arithmetic skills
in LLMs correlate with the inclusion of code and
LaTeX in pre-training data (Yuan et al., 2023).
Galactica, heavily utilizing LaTeX, excels in
arithmetic tasks, while models like Code-DaVinci-002,
better at reasoning, lag in arithmetic, highlighting
a distinction between arithmetic and reasoning
skills.
Prompts. The nature of input prompts greatly
affects LLMs’ arithmetic performance (Liu et al.,
2023a; Lou et al., 2023). Without prompts, perfor-
mance drops (Yuan et al., 2023). Models like Chat-
GPT, which respond well to instructional system-
level messages, demonstrate the importance of
prompt type. Instruction tuning in pre-training also
emerges as a significant factor (Yue et al., 2023).
Model Scale. There is a noted correlation between
parameter count and arithmetic capability
in LLMs (Yuan et al., 2023). Larger models
generally perform better, but a performance plateau
is observed, as shown by Galactica’s similar
outcomes at 30B and 120B parameters. Moreover,
greater scale does not always mean superior
performance, with smaller models like ChatGPT
occasionally outperforming larger ones.

5.3 Perspectives of mathematics pedagogy
While machine learning emphasizes LLMs’
problem-solving abilities in mathematics, in prac-
tical education, their primary role is to aid learn-
ing. Thus, the focus shifts from mere mathematical
performance to a crucial consideration of LLMs’
understanding of students’ needs, capabilities, and
learning methods.
Advantages of deploying LLMs in math edu-
cation. Educators have observed the following
benefits of leveraging LLMs for math education. (i)
LLMs foster critical thinking and problem-solving
skills, as they provide comprehensive solutions and
promote rigorous error analysis (Matzakos et al.,
2023); (ii) Educators and students prefer LLM-
generated hints because of their detailed, sequen-
tial format and clear, coherent narratives (Gattupalli
et al., 2023); (iii) LLMs introduce a conversational
style in problem-solving, an invaluable asset in
math education (Gattupalli et al., 2023); (iv) The
impact of LLMs extends beyond mere computa-
tional assistance, offering deep insights and under-
standing spanning diverse disciplines like Algebra,
Calculus, and Statistics (Rane, 2023).
Disadvantages of deploying LLMs in math edu-
cation. (i) Potential for misinterpretation. Misin-
terpretation of students’ queries or errors in provid-
ing explanations by LLMs could lead to confusion.
Inaccurate responses might result in the reinforce-
ment of misconceptions, impacting the quality of
education (Yen and Hsu, 2023). (ii) Limited un-
derstanding of individual learning styles. LLMs
may struggle to cater to diverse learning styles, as
they primarily rely on algorithms and might not
fully grasp the unique needs of each student. Some
learners may benefit more from hands-on activi-
ties or visual aids that LLMs may not adequately
address. Gresham (2021) proposed that hints pro-
duced by GPT-4 could be excessively intricate for
younger students who have shorter attention spans.
(iii) Privacy and data security issues. Deploying
LLMs involves collecting and analyzing substan-
tial amounts of student data. Privacy concerns may
arise if proper measures are not in place to safe-
guard this data from unauthorized access or misuse.
6 Challenges
Data-driven & limited generalization. The pre-
vailing trend in current research revolves around
the curation of extensive datasets. Despite this
emphasis, there is a noticeable lack of robust gener-
alization across various datasets, grade levels, and
types of math problems. Examining how humans
acquire math-solving skills suggests that machines
may need to embrace continual learning to enhance
their capabilities.
LLMs’ brittleness in math reasoning. The
fragility of LLMs in mathematical reasoning is
evident across three dimensions. Firstly, when pre-
sented with questions expressed in varying textual
forms (comprising words and numbers), LLMs ex-
hibit inconsistent performance. Secondly, for iden-
tical questions, an LLM may yield different final
answers through distinct reasoning paths during
multiple trials. Lastly, pre-trained math-oriented
LLMs are susceptible to attacks from adversarial
inputs, highlighting their vulnerability in the face
of manipulated data.
Human-oriented math interpretation. Current
LLM-oriented math reasoning, such as
Chain-of-Thought, does not take into account the needs
and comprehension abilities of users, such as
students. As an example, Yen and Hsu (2023) discovered
that GPT-3.5 had a tendency to misinterpret
students’ questions in the conversation, resulting
in a failure to deliver adaptive feedback. Addi-
tionally, research conducted by Gresham (2021)
revealed that GPT-4 frequently overlooks the prac-
tical comprehension abilities of younger students.
It tends to generate overly intricate hints that even
confuse those students. Consequently, there is a
pressing need for increased AI research that ac-
tively incorporates human factors into its design,
ensuring future developments align more closely
with the nuanced requirements of users.
7 Conclusion
This survey on LLMs for Mathematics delves into
various aspects of LLMs in mathematical reason-
ing, including their capabilities and limitations.
The paper discusses different types of math prob-
lems, datasets, and the persisting challenges in the
domain. It highlights the advancements in LLMs,
their application in educational settings, and the
need for a human-centric approach in math edu-
cation. We hope this paper will guide and inspire
future research in the LLM community, fostering
further advancements and practical applications in
diverse mathematical contexts.

