Can a Large Language Model be a Calculator?
I’ve recently been exploring the maths capabilities of large language models (LLMs). In particular, I’ve been interested in whether they solve maths problems through memorisation or whether their methods generalise. Previously, I’ve asked models questions requiring complex mathematical reasoning, encouraging them to explain their answers in detail (which can help LLMs solve maths problems). However, I was also curious about what LLMs can do when restricted to concise answers, and wanted to experiment to find out. I focused on straightforward calculations, rather than anything requiring more complicated reasoning and logical deduction. These were the kinds of questions that, in my experience, models found easiest, so this seemed like the best place to start. Essentially, I was trying to see whether an LLM could act like a calculator, or whether it could only mimic one from memory.
Experiments
I tested a GPT-3.5-turbo instance, fine-tuned on some maths data (from the MATH dataset), on a range of randomly generated calculations. The responses were filtered to include only those where the model provided a solution with no other text.
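A minimal sketch of the kind of harness this involves is below; the model ID, prompt wording, and filtering pattern are placeholders rather than the exact ones I used.

```python
import re
from openai import OpenAI

client = OpenAI()

# Placeholder ID for the fine-tuned GPT-3.5-turbo instance.
MODEL = "ft:gpt-3.5-turbo-0613:my-org::example"

# Keep only responses that are a bare (possibly negative) integer.
ANSWER_ONLY = re.compile(r"^-?\d+$")

def ask(question: str) -> str | None:
    """Query the model; return the response only if it contains nothing but a number."""
    response = client.chat.completions.create(
        model=MODEL,
        temperature=0,
        messages=[{"role": "user", "content": question}],
    )
    text = response.choices[0].message.content.strip()
    return text if ANSWER_ONLY.match(text) else None
```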
I started with multiplication, and measured the model accuracy on some randomly generated questions with numbers of various sizes:
The model was good at multiplying smaller numbers, but got worse as the numbers grew, and at four digits and above it failed completely. So it definitely hadn’t learned to do long multiplication correctly; it could only generalise so far. To achieve this level of accuracy through memorisation, the LLM would have to remember over a million question-answer pairs (multiplying two numbers of up to three digits each already gives roughly 999 × 999 ≈ 10^6 possible questions). That is not completely beyond the realm of possibility, given how much factual information large models can store.
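For reference, the multiplication runs can be reproduced with something along these lines (a sketch using the ask helper above; the digit ranges, prompt format, and sample counts are illustrative rather than exactly what I used):

```python
import random

def random_n_digit(n: int) -> int:
    """Uniformly sample an n-digit number."""
    return random.randint(10 ** (n - 1), 10 ** n - 1)

def multiplication_accuracy(digits: int, trials: int = 200) -> float:
    """Accuracy on random digits-by-digits multiplications, over kept responses."""
    kept = []
    for _ in range(trials):
        a, b = random_n_digit(digits), random_n_digit(digits)
        answer = ask(f"What is {a} * {b}?")
        if answer is None:      # response contained extra text, so exclude it
            continue
        kept.append(int(answer) == a * b)
    return sum(kept) / max(1, len(kept))  # guard against all responses being excluded

for digits in range(1, 7):
    print(digits, multiplication_accuracy(digits))
```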
After multiplication, I tried generating questions made up of sequences of operations (+, -, ×) on double-digit numbers. I varied the lengths of the questions and computed the accuracy for each length:
The model could do simple calculations reliably and sometimes solved questions involving many operations in sequence. From length 6 to 10, however, the accuracy tailed away completely, so again the model did not demonstrate a reliable method that generalised arbitrarily. Even so, this level of accuracy cannot be down to memorisation alone. The number of possible questions with 6 operations on 7 double-digit numbers (roughly 90^7 × 3^6 ≈ 3.5 × 10^16) is vastly bigger than the number of parameters in the model, i.e. the entire model itself doesn’t have enough capacity to store the required questions and answers.
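These questions can be generated and scored in the same way (a sketch, again using the ask helper above; I’m assuming standard operator precedence, with × evaluated before + and -, when computing the reference answer):

```python
import random

def random_sequence_question(length: int) -> tuple[str, int]:
    """Build an expression with `length` operators applied to double-digit numbers."""
    tokens = [str(random.randint(10, 99))]
    for _ in range(length):
        tokens.append(random.choice(["+", "-", "*"]))
        tokens.append(str(random.randint(10, 99)))
    expression = " ".join(tokens)
    # The expression contains only digits, spaces and + - *, so eval is safe here
    # and applies standard operator precedence.
    return expression, eval(expression)

question, target = random_sequence_question(6)
answer = ask(f"What is {question}?")
correct = answer is not None and int(answer) == target
```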
Lastly, I looked at finding square roots:
The model was able to answer very accurately up to roots of size 800 (meaning it could find the square root of six-digit numbers). The accuracy tailed off fairly quickly as the numbers scaled up further, although even at very large values the model still occasionally got it right, e.g. answering √160157638809 = 400197 and √158748855489 = 398433. To get scores like these through memorisation, it would only have to learn around 50,000 question-answer pairs, which is very plausible.
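The square root questions can be generated in the same style (a sketch; I’m assuming the questions were built from perfect squares, which is consistent with the exact integer answers quoted above):

```python
import random

def random_square_root_question(max_root: int) -> tuple[str, int]:
    """Pick a root, square it, and ask for the root back."""
    root = random.randint(2, max_root)
    return f"What is the square root of {root * root}?", root

question, target = random_square_root_question(800)
answer = ask(question)
correct = answer is not None and int(answer) == target
```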
Interpretation
These results show the LLM has limitations which wouldn’t be there if it had learned to compute like a calculator. However, it does demonstrate some interesting abilities. For example, even if it has memorised simple arithmetic, it seems to be able to chain together such calculations to compute fairly long sequences of operations.
Looking back over the model responses, I found that, for multiplication and square roots, when the model was incorrect, it was usually close to the right answer. Even for 6-digit × 6-digit multiplication, the model was within 1% of the correct solution 80% of the time. Similar results held for large square roots. This closeness suggests something more complex than memorisation is going on.
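Here, ‘close’ means small relative error; the check is something like:

```python
def within_one_percent(predicted: int, target: int) -> bool:
    """True if the prediction is within 1% of the correct answer."""
    return abs(predicted - target) <= 0.01 * abs(target)

# A wrong but close answer to 654321 * 123456 (the correct product is 80779853376).
print(within_one_percent(80750000000, 654321 * 123456))  # True
```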
I also found that for multiplication the last digit was correct over 95% of the time, even up to 6-digit × 6-digit numbers. The last digit of a product follows a simple rule (it depends only on the last digits of the two factors), which the model seems to have learned to follow.
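That rule is easy to write down:

```python
def last_digit_of_product(a: int, b: int) -> int:
    """The last digit of a * b depends only on the last digits of a and b."""
    return (a % 10) * (b % 10) % 10

print(last_digit_of_product(123454, 987657))  # 8
print((123454 * 987657) % 10)                 # 8, matching without the full product
```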
When answering questions, the model has to generate its answer piece by piece, a few digits at a time, rather than producing it in one step. So it would make sense if it had learned some approximate methods for finding each part of the output. These methods could work well together on smaller numbers, but start to fail on larger results that have more digits to compute.
The square root experiments showed the least generalisation. We might expect square roots to be less common in the training data, and so fewer to have been memorised. Another explanation is that taking square roots is more complicated than simple arithmetic: there are no easy methods for doing it by hand, so we might expect the model to struggle to learn simple approximations. The fact that the model is often close to correct suggests it is doing something more than memorisation, but that could be something simple, such as giving the answer to the closest question it can remember.
Last Words
When solving maths questions directly, with no additional working, it seems LLMs are not truly calculating the answer, but nor do they rely on memorisation alone. The results suggest that for this kind of problem, when possible, models learn heuristics that work well for smaller numbers. For more difficult problems, it is more plausible that LLMs rely on memorisation, but even then there is some smoothing going on, allowing them to get approximately correct answers. It may even be that LLMs learn a large number of heuristics that each apply to a narrow area, so that when a question is asked, the model has to recall the correct heuristic. If so, there may not even be a clean distinction between memorisation and generalisation.
Can an LLM act like a calculator, or does it just mimic one from memory? Not exactly either of these, but something in between. As noted before, evaluating LLM intelligence can be a lot more complicated than it first appears.