Testing the Limits of ChatGPT in Predictive Analytics
Testing whether prompting ChatGPT can solve a complex time series forecasting task.
Large Language Models (LLMs) are reshaping industries today because they are a disruptively powerful tool for language-based tasks such as:
- Sentiment Analysis
- Named Entity Recognition
- Summarization
- Translation
- Chatbots
- Content Generation such as blogs, your emails, creative writing, poetry
- Code completion
- Q&A
As a subset of Deep Learning models built specifically for natural language processing tasks, LLMs are really great at language tasks. So much so that we might be tempted to try to use them to solve a lot of our problems — not just language problems.
But AI and ML models are task specific, and it is important to choose the right model for the right task. Things can go horribly awry if we use the wrong model, with significant negative impact on people and businesses. Misapplying powerful technology like this is dangerous when faulty tech ends up in critical use cases that affect human health, opportunity, or livelihoods. The wrong model for the job will lead to bad predictions, and there are some things you don’t want to predict poorly: the stock market, medical diagnoses, insurance approvals, and loan approvals are just a few that affect our quality of life. We can’t get these wrong.
I’ve also noticed a growing temptation to apply LLMs beyond the natural language tasks they are designed for. One of the clearest examples of this overreach is, in my opinion, using language models for what should be mathematical tasks. Although LLMs are great at language-based pattern recognition and “reasoning”, they are bad at math. Just take a look at these examples:
- We Tested an AI Tutor for Kids. It Struggled With Basic Math.
- Apple Says Generative AI Isn’t Good At Math
- Maths test stumps AI models: which number is bigger, 9.90 or 9.11?
- Researchers question AI’s ‘reasoning’ ability as models stumble on math problems with trivial changes
So will LLMs replace Machine Learning and other traditional mathematical algorithms we use to solve numeric tasks? I think not, but I wanted to try it myself for good measure, so I did an experiment using an LLM for a tabular data task that standard old-school Machine Learning models traditionally knocked out of the park — time series forecasting. The results might surprise you.
Experiment
I used an example inventory forecasting task, which is clearly a numerical problem and not a language problem, and compared four different approaches to it: SARIMAX, XGBoost, an LSTM, and prompting OpenAI’s ChatGPT-4o.
My time series data is from Kaggle. The dataset contains daily sales trends for products in a convenience store. To simplify the task, I filtered it to a single category, beauty products. My goal was to forecast the weekly sales trends, a common industry task that was often handled by simple mathematical models before we had LLMs. I wanted to see if the LLM would be able to perform well on this data-intensive task.
This task is interesting for a few reasons:
- Requires completely numeric inputs
- The output must be completely numeric too
- The data contains sequential patterns. There are seasonal and weekly patterns in this data, which lends itself well to modeling approaches that capture patterns in sequences, such as LSTMs. Capturing sequential patterns is also something LLMs can generally do.
The results put SARIMAX on top, but just barely, as it is almost on par with XGBoost and the LSTM. With more hyperparameter tuning I could probably get either XGBoost or the LSTM to win. But this was enough for me to end the project here, because prompting ChatGPT was the worst approach of the four! Going through this experiment helped highlight for me why that could be, and I want to share those learnings with you too. So let’s dive in.
SARIMAX
The SARIMAX model is an extension of the autoregressive forecasting method ARIMA from our Stats 101 course. It combines autoregressive (AR) and moving average (MA) components with seasonal components, and it can also incorporate signals from external variables (or “features”). The AR terms model the dependency of the current value on previous values in the sequence, so they capture correlation between sequential terms for cyclical and linear patterns. The MA terms model the current value as a function of past forecast errors, which helps account for short-term shocks in the signal.
So this model can account for recurring patterns in a series, and consider additional variables as well, as it forecasts a series into the future. (If you want to build more intuition around this model, I enjoyed this article on Medium by Brendan Artley). I’ve included the equation to help build that intuition.
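For reference, one standard textbook way to write a SARIMAX model is as a regression with seasonal ARIMA errors. This is a sketch in common notation, and the symbols (the lag operator $L$, the orders $p, d, q$, the seasonal orders $P, D, Q$, and the seasonal period $s$) are my assumed notation rather than anything specific to this dataset:

```latex
\begin{aligned}
y_t &= \beta^{\top} x_t + u_t \\
\phi_p(L)\,\Phi_P(L^{s})\,(1-L)^{d}\,(1-L^{s})^{D}\,u_t &= \theta_q(L)\,\Theta_Q(L^{s})\,\varepsilon_t
\end{aligned}
```

Here $y_t$ is the series being forecast, $x_t$ holds the external features (such as promotion or holiday flags) with coefficients $\beta$, $\phi_p$ and $\theta_q$ are the non-seasonal AR and MA polynomials, $\Phi_P$ and $\Theta_Q$ are their seasonal counterparts, the $(1-L)^{d}$ and $(1-L^{s})^{D}$ factors apply regular and seasonal differencing, and $\varepsilon_t$ is white noise. For daily data with a weekly cycle, $s$ would typically be 7.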
Notice how each mathematical component in the equation is representing a physical trend in the data that you could assume exists theoretically based on your knowledge of the space. So SARIMAX is particularly useful for time series forecasting when the data exhibits seasonality and external factors (like weather, promotions, or holidays) influence the patterns. That’s specifically a good fit for this project. It is almost like it was designed for that.
In the forecasting graph, we can see that the SARIMAX forecast captures the jumps based on the promotions and holidays, and it has a sense for the weekly trends too. Nice.
XGBoost
XGBoost is my favorite ML model. Here is why. At this point in my career I have gotten to know hundreds upon hundreds of Data Scientists, and I always ask them about their past modeling projects. I want to know what modeling approaches they tested, and which one won. Over and over and over the winner is … XGBoost!
XGBoost is an implementation of a boosted trees model. It builds an ensemble of many small decision trees (weak learners) sequentially, with each new tree trained to correct the errors of the trees before it. The outputs of these many weak models are then added together to make one powerful prediction.
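In the usual notation for boosted trees, the prediction is a sum over the individual trees, and training minimizes a loss plus a penalty on tree complexity. A sketch in standard notation, where $K$ (the number of trees), $l$ (the per-example loss), and $\Omega$ (the complexity penalty) are symbols I'm assuming:

```latex
\begin{aligned}
\hat{y}_t &= \sum_{k=1}^{K} f_k(x_t), \qquad f_k \in \mathcal{F} \\
\mathcal{L} &= \sum_{t} l\left(y_t, \hat{y}_t\right) + \sum_{k=1}^{K} \Omega(f_k)
\end{aligned}
```

Each $f_k$ is one small decision tree from the space of trees $\mathcal{F}$, $x_t$ is the feature vector at time $t$, and $\hat{y}_t$ is the combined prediction.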
The best way to illustrate the mathematics of XGBoost is with a tree diagram. Here is an example of one of the many trees (f_k(x_t)) from a project I did a while back to classify wine quality based on another free dataset from Kaggle.
XGBoost is a simple yet nonlinear modeling method, so it is useful for capturing complex nonlinear trends in your data. It is capturing the jumps based on the promotions, holidays, and 7-day lags, but it is off. As an example, a promo day from 7 days ago is pushing the prediction for today waaaay up. So the effect of the holidays on the time series is not properly captured in the forecasts. Even though overall it looks pretty accurate, I would not roll this out until I experimented more with properly incorporating features to help the model manage this effect.
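To make the lag problem concrete, here is a minimal sketch of how 7-day lag features could be built for a tree model. The function name, the feature names, and the toy data are all hypothetical, not the features used in the experiment. The idea is that adding a flag for whether the lagged day was a promo day gives the model a way to discount an inflated lagged sales value.

```python
def add_lag_features(sales, promo, lag=7):
    """Build one feature row per day that has a full `lag`-day history."""
    rows = []
    for t in range(lag, len(sales)):
        rows.append({
            "sales_lag": sales[t - lag],  # sales from `lag` days ago
            "promo_today": promo[t],      # is there a promotion today?
            "promo_lag": promo[t - lag],  # was the lagged day a promo day?
            "target": sales[t],           # value the model should predict
        })
    return rows


# Toy data: flat sales of 10 units with a promo spike on day 2.
sales = [10, 10, 30, 10, 10, 10, 10, 10, 10, 10]
promo = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

rows = add_lag_features(sales, promo)
spike_row = rows[2]  # day 9, whose lag-7 day is the promo day (day 2)
print(spike_row)
```

In the row for day 9, the lagged sales value is inflated by the old promotion while the target is back to normal. Without the `promo_lag` flag, the model has no way to tell that spike apart from genuine demand.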
LSTM
An LSTM (Long Short-Term Memory) is a type of Recurrent Neural Network (RNN) designed to capture sequential patterns in data. LSTMs were popular for NLP tasks, and they also gained popularity for time series forecasting, since a time series is also a sequence and their architecture is designed to capture exactly that. They use a gating mechanism to retain some of the information learned earlier in the sequence, and that information is used at inference time. LSTMs have been effective at learning patterns and trends from historical data, which lets them make accurate predictions about future values by retaining important information across long sequences.
I think colah’s blog provides the best description of LSTMs with the perfect illustrations, so no need for me to recreate it ☺.
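The gating mechanism can be sketched in a few lines. This is a simplified, scalar version of a single LSTM step (real LSTMs use weight matrices and vectors), and the weight values below are made up for illustration rather than learned.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, w):
    """One scalar LSTM step. `w` maps a gate name to (w_x, w_h, bias)."""
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much old memory to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much memory to expose
    g = math.tanh(pre("candidate"))   # proposed new memory content
    c = f * c_prev + i * g            # updated cell state (long-term memory)
    h = o * math.tanh(c)              # updated hidden state (output)
    return h, c


# Hypothetical weights; in practice these are learned during training.
weights = {
    "forget": (0.5, 0.1, 1.0),
    "input": (0.6, 0.2, 0.0),
    "output": (0.4, 0.3, 0.0),
    "candidate": (0.9, 0.1, 0.0),
}

h, c = 0.0, 0.0
for x in [0.2, 0.4, 0.1, 0.9]:  # a short toy sequence
    h, c = lstm_step(x, h, c, weights)
print(h, c)
```

The cell state `c` is what carries information forward: the forget gate decides how much of it survives each step, which is how the model can hold on to a pattern across a long sequence.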
The LSTM produces a good best fit line as the forecast here. Notice that it is not explaining as much of the variance as SARIMAX does. But it is accurately capturing the 7 day and promo effects in the forecast. This is a good model for the task, and if I messed around with tuning the hyperparameters, I could probably get this one to win.
ChatGPT
Ok now for the part we’ve all been waiting for — solving the time series forecast by prompting ChatGPT-4o!
I actually tried two methods here. As a first attempt, I simply pasted in the data in JSON format and wrote a prompt asking the LLM to perform the forecasting tasks.
To my surprise, the LLM decided to write code to run an ARIMA model to produce the forecast without being prompted to do so! (For context, ARIMA is very similar to the SARIMAX model I used above.) The code the LLM provided was full of tiny errors, but if edited properly it would’ve produced something very similar to the approaches I tried above. What was interesting to me is that ChatGPT perhaps knew better than to rely on its language processing skills to return the forecast as a number. It instead used code to define the math required for this task. Because code is just a language, LLMs actually do pretty well on coding tasks. I personally love using LLMs as a coding assistant to write code more productively, although it still requires a code-savvy human in the loop to edit and guide the process.
I still wanted to see if I could get the LLM to directly produce the forecast through a prompting approach. So as a second approach, I prompted ChatGPT-4o with something called a “data narrative”. Instead of dumping the raw data into the prompt, I ran Python code to extract and summarize key trends in the data, giving the LLM the right context to perform the forecasting task. It is similar to feature engineering.
Then, I used a prompt template that incorporated this data narrative and requested a forecast. For the inference step, a single call would need to be made to the LLM with this information, and a forecast would be produced.
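As a rough illustration of the idea, here is a minimal sketch of a data narrative and a prompt template. The function name, the statistics chosen, the template wording, and the toy data are all hypothetical stand-ins, not the ones used in the experiment, and the actual call to the LLM is left out.

```python
def average(values):
    return sum(values) / len(values)


def build_data_narrative(daily_sales, promo_flags):
    """Summarize the last two weeks of daily sales as plain-English sentences."""
    last_week, prev_week = daily_sales[-7:], daily_sales[-14:-7]
    change = average(last_week) - average(prev_week)
    direction = "up" if change > 0 else "down" if change < 0 else "flat"
    return (
        f"Average daily sales over the last 7 days: {average(last_week):.1f} units. "
        f"That is {direction} {abs(change):.1f} units from the week before. "
        f"Best day in the last week: {max(last_week)} units. "
        f"Promotion days in the last week: {sum(promo_flags[-7:])}."
    )


# Hypothetical template; the real wording would need prompt engineering.
PROMPT_TEMPLATE = (
    "You are a forecasting assistant.\n"
    "{narrative}\n"
    "Forecast unit sales for the next day. "
    "Reply with a single number and nothing else."
)

# Toy data: 14 days of sales, with one promotion on the second-to-last day.
daily_sales = [12, 15, 11, 14, 18, 25, 22, 13, 16, 12, 15, 19, 27, 24]
promo_flags = [0] * 12 + [1, 0]

narrative = build_data_narrative(daily_sales, promo_flags)
prompt = PROMPT_TEMPLATE.format(narrative=narrative)
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM in a single call at inference time.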
Although it was somewhat of an intensive coding task to orchestrate this, it worked. I was able to get predictions and make the same metric RMSE and MAE calculations and graph the forecast so I could compare this approach with the other models.
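For anyone who wants to reproduce the comparison, both metrics are simple to compute by hand. This is a minimal sketch with made-up numbers, not the results from the experiment.

```python
import math


def mae(actual, predicted):
    """Mean absolute error: the average size of the miss."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: like MAE, but punishes big misses more."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


# Toy values for illustration only.
actual = [10, 12, 15, 11]
predicted = [12, 12, 11, 11]

print(mae(actual, predicted))
print(rmse(actual, predicted))
```

Because RMSE squares each error before averaging, a single large miss (like the 4-unit miss here) pulls it further above MAE.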
One interesting issue I ran into was that it was difficult to force numeric output, even with a lot of prompt engineering. The prediction the LLM made was sometimes a bare number, but sometimes it would be a string like “15 units”. If this were my chosen approach, I’d have to monitor for this and run some regex processing on the predictions to handle it, along with other similar issues that come from using a language generation model here.
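The kind of regex cleanup I have in mind could look like the following. It is a minimal sketch, and `parse_forecast` is a hypothetical helper name, not code from the experiment.

```python
import re

# An optional minus sign, digits, and an optional decimal part.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_forecast(reply):
    """Return the first number found in an LLM reply, or None if there is none."""
    match = _NUMBER.search(reply.replace(",", ""))  # drop thousands separators
    return float(match.group()) if match else None


for reply in ["15", "15 units", "About 1,250.5 units", "I cannot forecast that."]:
    print(repr(reply), "->", parse_forecast(reply))
```

Note that this takes the first number it sees, so a reply like “Forecast for 7 days: 15 units” would be parsed as 7. A `None` result is worth logging and monitoring rather than silently filling in.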
Here is an example of the inference step in ChatGPT.
So the best modeling approach in this experiment was SARIMAX, and the other mathematical models XGBoost and LSTM are competitive and probably the winning options with further tuning.
The Basic Math of LLMs
So why are LLMs so bad at math? Isn’t AI supposed to be a powerful reasoning engine? Why don’t they seem to have a head for numbers too?
Understanding the architecture of any model makes it somewhat obvious what it could be good at, when it should be used, and what it will fail at. So understanding the mathematics is critical in helping us choose the right model for the right task. That math is what is used to define and optimize these algorithms and tune them to perform well at specific tasks. The math behind LLMs reveals that these models are optimized to perform well on language-based tasks, as they are designed mathematically to process text data. To select the right model for the job, it is critical to understand some core math concepts in the underlying architecture of both LLMs and deep learning algorithms in general. (If you need a primer on the math behind LLMs, I loved this one from the 3Blue1Brown YouTube channel).
Another reason it is important to understand the math behind the model chosen for the job is that this knowledge will also allow you to form a meaningful hypothesis about the possible failure points of the model so you can monitor for them and mitigate them in production.
To illustrate this, let’s take a peek at the math behind LLMs and rationalize why they’re so great at language tasks, and contrast that with the models that are great at the numerical tasks of forecasting.
Embeddings
LLMs need to mathematically encode the complex language patterns in the datasets they are trained on. But the training data is words, not numbers, so how do we do math on words? The first concept to understand is embeddings. Embeddings are a method for mathematically encoding the meaning of words as numeric vectors. Once a word is transformed into a vector we can start doing math with it, and the math represents language patterns. The meaning of a word and the context around it end up encoded in the otherwise uninterpretable numbers of this vector, so words with similar meanings get similar vectors.
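Here is a small sketch of what “doing math on words” can mean in practice: comparing two word vectors with cosine similarity. The three-dimensional vectors are made up for illustration. Real embeddings are learned during training and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1.0 means 'pointing the same way'."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy "embeddings", not output from any real model.
embeddings = {
    "lipstick": [0.9, 0.8, 0.1],
    "mascara": [0.8, 0.9, 0.2],
    "invoice": [0.1, 0.2, 0.9],
}

similar = cosine_similarity(embeddings["lipstick"], embeddings["mascara"])
different = cosine_similarity(embeddings["lipstick"], embeddings["invoice"])
print(similar, different)
```

In this toy setup the two beauty products score much closer to each other than either does to “invoice”, which is the kind of relationship a trained embedding is meant to capture.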
Transformers
LLMs are trained to generate new text based on huge datasets of training data (like the entire internet, for example). These models need a unique architecture specifically designed to capture and mathematically encode the complex language patterns in the data they are trained on. This is accomplished via transformers and attention mechanisms.
Transformers are a type of deep learning model architecture designed to process the complex patterns in a sequence of words. A transformer processes sequential data in parallel instead of step by step, and uses layers of attention mechanisms to encode long-range patterns in the text that then influence the text generation step.
This allows the model to focus on the most relevant parts of an input sequence when generating sequential text. The LLM can efficiently capture the context around a word in a document, and capture the relationship between words in human language. It is this unique ability that gives LLMs the perceived power to understand language and speak coherently with users. The transformer architecture, which is built around attention, was introduced in the famous 2017 paper Attention Is All You Need.
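The core attention calculation is small enough to sketch in plain Python. This is a simplified, single-query version of scaled dot-product attention with made-up vectors. A real transformer adds learned projection matrices, multiple heads, and many stacked layers on top of this.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)  # how much to focus on each position
    output = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return output, weights


# Toy vectors for a three-token sequence (illustrative values only).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

output, weights = attention(query, keys, values)
print(weights)
print(output)
```

The weights show where the model is “looking”: positions whose keys line up with the query get more weight, and the output is the weighted blend of their values.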
Loss Functions
Loss (and Cost) Function(s) are common concepts in predictive mathematical models and are a core component of all the algorithms I tested in this experiment. The Loss function is an equation designed to measure the difference between the predicted output and the real output. MAE and RMSE are great tangible examples of loss functions.
When these models are ‘trained’, computations are run to find the parameters of the model’s equation such that the loss is minimized (in other words, so the most accurate results are produced). For neural networks like LSTMs and LLMs, this is done using an optimization technique called gradient descent, and XGBoost uses a related gradient-based approach. That’s how we teach the model to produce the right results from the training data.
The brilliance of loss functions is in knowing what they are, and being able to control and tweak them. Because the models are purpose-built to optimize an outcome by minimizing the loss function you defined, you can use the loss function to define the purpose of these models. I think that could give us a little control over the robots.
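To show what minimizing a loss looks like in practice, here is a minimal sketch of gradient descent fitting a straight line by minimizing mean squared error. The toy data, learning rate, and step count are arbitrary choices for illustration.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n  # d(MSE)/dw
        grad_b = 2 * sum(errors) / n                             # d(MSE)/db
        w -= lr * grad_w  # step downhill on the loss surface
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

w, b = fit_line(xs, ys)
print(w, b)
```

Each step nudges the parameters in the direction that reduces the loss, and the fitted values settle very close to the true slope of 2 and intercept of 1. Swap in a different loss and the same loop would optimize for a different goal.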
LLMs are already optimized to complete the next word in a sentence. In LLMs, the model is typically trained to predict the probability of the next word (or token) in a sequence, given the preceding words. Cross-entropy loss is used to measure how well the predicted probabilities match the actual next word in the sequence.
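For a single prediction, cross-entropy boils down to the negative log of the probability the model gave to the token that actually came next. A minimal sketch, with a made-up probability distribution:

```python
import math


def cross_entropy(predicted_probs, actual_token):
    """Loss for one prediction: -log of the probability given to the true token."""
    return -math.log(predicted_probs[actual_token])


# Hypothetical model output for the next token after "Sales next week will be".
predicted = {"higher": 0.6, "lower": 0.3, "15": 0.1}

loss_if_higher = cross_entropy(predicted, "higher")
loss_if_15 = cross_entropy(predicted, "15")
print(loss_if_higher, loss_if_15)
```

Notice that this loss only looks at the probability assigned to the correct token. It has no notion of numeric distance, so guessing “14” when the answer was “15” is not rewarded for being close the way it would be under MAE or RMSE.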
Probabilistic Output
These models are “generative” meaning they produce new output based on patterns observed in the training data instead of simply analyzing the patterns in the existing data. The output is nondeterministic, meaning there is natural variance in the output produced so it won’t always produce the same output even when the input is the same. This is due to the probabilistic nature of constructing the output.
To generate text, an LLM leverages the information about its training text that it captured and encoded mathematically during training, and probabilistically generates natural language by predicting the next word in a sequence. The prediction takes the form of a probability distribution over the next words or chunks of text (tokens) that could complete the sequence. This is what enables these algorithms to perform so well on a wide range of natural language processing tasks.
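The sampling step can be sketched as follows. The candidate tokens and their scores (logits) are invented for illustration, and this is a simplified picture of what happens inside a real LLM at each generation step.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores into probabilities; lower temperature sharpens them."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(tokens, logits, temperature=1.0, rng=random):
    """Draw one token at random according to its probability."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical candidates for the model's answer to a forecasting prompt.
tokens = ["15", "16", "15 units", "about 15"]
logits = [2.0, 1.5, 1.0, 0.5]

rng = random.Random(0)  # seeded only so the demo is repeatable
draws = [sample_next_token(tokens, logits, rng=rng) for _ in range(200)]
print({token: draws.count(token) for token in tokens})
```

The same input produces different outputs across draws, including the “15 units” style of answer. Lowering the temperature concentrates probability on the top token, but any temperature above zero still leaves some randomness in the draw.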
I love this diagram to illustrate how the LLMs operate as probabilistic text generators.
So in a nutshell, time series forecasting tasks require the model to understand explicit seasonal and cyclical relationships, and explicit optimization criteria related to the outcome we are trying to optimize (minimizing error in the forecast of a series of numbers) that LLMs are not designed to facilitate. The mathematical models I used against the LLM facilitate this explicitly in their design. Statistical and machine learning models are purpose-built to minimize numeric error metrics. In contrast, LLMs are optimized to reduce token prediction loss. When choosing a model for a task, it is key to know that the LLMs operate as probabilistic text generators, and should be applied as such.
Can we get LLMs to do our Math for us?
Maybe.
LLMs seem applicable to many domains and disrupt a variety of industries when applied properly. So can we continue to advance the technology to get them to be good at math?
It may be possible with some technological advancements to get these models to effectively bridge domains into mathematics.
There are a lot of things we could continue to experiment with to study the possibility of improving LLMs’ mathematical capabilities, such as tinkering with prompting and with parameters like temperature. I’ve even wondered about fine-tuning one on math papers and books to provide it with the proper targeted domain knowledge.
Increasing LLMs’ effectiveness at mathematical tasks is still an active and exciting area of research, and I want to point you to a few promising studies that stood out to me:
- Mathematical Reasoning Through LLM Finetuning
- WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct
- MathCoder: Seamless Code Integrations in LLMs for Enhanced Mathematical Reasoning
- Scaling test-time compute allows LLMs to “think harder” and improve performance on math problems
But should I throw all the traditional ML modeling approaches I know out the window and turn my attention towards LLMs 100%? Will these language models replace Machine Learning or other more traditional mathematical methods for processing data? No. Replacing mathematical algorithms with LLMs in mathematical tasks such as time series forecasting is not only unnecessary but risky. I believe that we should instead be using purpose-built models that still dominate predictive and numerical tasks.
But don’t forget what LLMs are great at. LLMs are built to process language data, and their architecture is beautiful and complex and built to capture the complex nuances of human language.
Another capability worth noting is that LLMs are great coding assistants and code completion tools, because code is a language. So if we want LLMs to do our math for us, asking them to generate code (e.g. SQL queries, LaTeX equations, or Python functions) to solve direct numerical tasks is a brilliant approach given the current limitations of the technology.
LLMs and Generative AI in general are catching on as a new technology, and there are new advancements almost daily. As practitioners, we need to continue to experiment to push the boundaries of what the technology can do, and contribute to the discovery of new capabilities. The early phase of an industry like this involves a lot of experimentation. I expect some projects to fail, and some to win, and we need to approach projects with curiosity to experiment and resilience against the failures we may experience. As we tinker, we’ll discover more and more and reach an equilibrium where we fully understand the limits and proper applications of the tech.
Keep studying your math. I’m excited to see what you build. 🚀