Humans + AI forecasting far outperforms either alone: 6 lessons learned

Since well before the advent of Generative AI, machine learning models have exceeded human forecasting performance across a whole range of specific domains. Within a bounded domain with sufficient data, machine learning is often extremely good at predicting outcomes.

However, machine learning only works within defined domains where there is sufficient data. In most real-world decision-making situations, its forecasts need to be treated with a high degree of caution.

One of the critical differences between most traditional analytic AI approaches and Large Language Models (LLMs) is that the former almost always apply to bounded domains, while LLMs are by nature unbounded in scope. As such, they have the potential to help humans make better forecasts across domains including business, economics, politics, science, and more.

A very interesting new pre-print paper, AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy, explores the role of Generative AI in improving forecasts. Here are some of the most interesting insights:

Human forecasters’ use of LLMs increased accuracy by 23%

LLMs making predictions on their own have been shown to significantly underperform humans. In the study, human forecasters were given access to LLMs with a superforecaster prompt (see below), which provides forecasts along with its reasoning. Those who used the LLMs improved their forecasting accuracy by 23%. The diverse forecasting tasks included predicting exchange rates, the number of research papers produced, refugee numbers, and commercial flights.
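
The paper reports this gain as a relative improvement in accuracy. As a rough illustration of how such a relative gain might be computed for one numeric question (my own sketch; the numbers and the error metric are hypothetical, not taken from the study):

```python
import numpy as np

# Hypothetical forecasts for one numeric question (e.g., a future exchange rate).
# All values are illustrative only, not data from the study.
actual = 1.12                                             # realized outcome
human_only = np.array([1.30, 0.95, 1.25, 1.40, 1.00])     # control group
llm_assisted = np.array([1.18, 1.05, 1.15, 1.22, 1.08])   # treatment group

def mean_abs_pct_error(forecasts, outcome):
    """Mean absolute percentage error: lower means more accurate."""
    return np.mean(np.abs(forecasts - outcome)) / abs(outcome) * 100

err_control = mean_abs_pct_error(human_only, actual)
err_treatment = mean_abs_pct_error(llm_assisted, actual)
relative_gain = (err_control - err_treatment) / err_control * 100
print(f"Control error {err_control:.1f}%, treatment error {err_treatment:.1f}%")
print(f"Relative accuracy improvement: {relative_gain:.0f}%")
```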

Use of LLMs improved outcomes equally across human skill levels

A number of other studies have shown that use of LLMs improves the performance of lower-skilled people more than that of higher-skilled people. That did not prove to be the case here: participants with a superforecaster pedigree saw performance improvements similar to those of less experienced forecasters.

Even biased models improve human forecasting performance

One of the interesting insights was that deliberately biased models improved performance as much as apparently unbiased models. This is a wonderful illustration of the ‘Humans + AI’ frame for using generative AI, in which LLMs provide additional considerations for people’s thinking processes, augmenting human thinking even if the input is not highly accurate. As the authors wrote:

LLM cognition may synergistically improve human cognition in the domain of forecasting when used as a human tool, even when LLM cognition by itself is somewhat ineffective.

Human-LLM back-and-forth is important in generating improved outcomes

Some studies of Humans + AI performance force a particular structure on the process, for example using AI outputs as input to human decision-making. The forecasters in this study were free to use the LLMs however they chose, ranging from simply generating predictions to consider to interacting more extensively to explore issues, factors, or lines of thought. This human-guided, free-form interaction is likely to generate better results than imposing any specific thought architecture.

Prediction diversity is not degraded

The value of the “wisdom of crowds” comes from the aggregation of diverse perspectives. If LLMs, through their often fairly consistent outputs, were to guide or anchor a range of forecasters toward a particular way of thinking, this could homogenize predictions and make them less accurate and useful. However, this was found not to be the case.
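
To see why homogenization was the worry, consider a toy simulation (entirely my own, with made-up numbers): if an LLM’s consistent output anchored every forecaster, the spread of the crowd’s predictions would collapse, and that collapse would be easy to detect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical probability forecasts (0-1) from 50 forecasters on one question.
independent = np.clip(rng.normal(0.60, 0.15, 50), 0, 1)
# Simulate anchoring: each forecast pulled halfway toward an LLM answer of 0.70.
anchored = 0.5 * independent + 0.5 * 0.70

for label, forecasts in [("independent", independent), ("anchored", anchored)]:
    print(f"{label}: crowd mean = {forecasts.mean():.2f}, "
          f"spread (std) = {forecasts.std():.2f}")
# Anchoring halves the spread in this toy example; the study found no such
# loss of prediction diversity among LLM-assisted forecasters.
```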

Forecasting is an excellent use case for demonstrating AI-augmented thinking

Too many people focus on AI as a substitute for human capabilities when its greatest value is in augmenting our thinking. In fact, forecasting is a highly pertinent use case.

Accurate forecasting requires a wide range of distinctive human capabilities due to the extreme complexity of decision factors. LLMs severely underperform humans when compared directly, but used effectively they can substantially improve human performance. As the authors write:

Our results show the promise of augmenting human decision-making with LLMs…  the augmentation ability of LLMs, ranging from providing answers outright to engaging with it in a back-and-forth manner can improve human performance and reasoning in contexts that are strictly outside the model’s training data environment… LLM augmentation may prove to be a valuable approach to integrating machine and human capabilities.

The ‘Superforecaster’ prompt

Below is the Superforecaster prompt used in the study. In my own trials it yields variable results and outcomes depending on how it is used, but it always provides a solid starting point for useful back-and-forth interaction and refinement of thinking on forecasts. It is also available in the ThoughtWeaver app. After the prompt text I include two brief sketches: a toy illustration of the base-rate process the prompt describes, and a minimal example of wiring the prompt into an LLM API call.

###

In this chat, you are a superforecaster providing forecasting assistance. You are a seasoned superforecaster with an impressive track record of accurate future predictions.

Drawing from your extensive experience, you meticulously evaluate historical data and trends to inform your forecasts, understanding that past events are not always perfect indicators of the future. This requires you to assign probabilities to potential outcomes and provide estimates for continuous events. Your primary objective is to achieve the utmost accuracy in these predictions, often providing uncertainty intervals to reflect the potential range of outcomes.

You begin your forecasting process by identifying reference classes of past similar events and grounding your initial estimates in their base rates. After setting an initial probability or estimate, you adjust based on current information and unique attributes of the situation at hand. The balance between relying on historical patterns and being adaptive to new information is crucial.

When outlining your rationale for each prediction, you will detail the most compelling evidence and arguments for and against your estimate, and clearly explain how you’ve weighed this evidence to reach your final forecast. Your reasons will directly correlate with your probability judgment or continuous estimate, ensuring consistency. Furthermore, you’ll often provide an uncertainty interval to capture the range within which the actual outcome is likely to fall, highlighting the inherent uncertainties in forecasting.

To aid in your forecasting, you draw upon the 10 commandments of superforecasting:
1. Triage
2. Break seemingly intractable problems into tractable sub-problems
3. Strike the right balance between inside and outside views
4. Strike the right balance between under- and overreacting to evidence
5. Look for the clashing causal forces at work in each problem
6. Strive to distinguish as many degrees of doubt as the problem permits but no more
7. Strike the right balance between under- and overconfidence, between prudence and decisiveness
8. Look for the errors behind your mistakes but beware of rearview-mirror hindsight biases
9. Bring out the best in others and let others bring out the best in you
10. Master the error-balancing bicycle

After careful consideration, you will provide your final forecast. For categorical events, this will be a specific probability between 0 and 100 (to 2 decimal places). For continuous outcomes, you’ll give a best estimate along with an uncertainty interval, representing the range within which the outcome is most likely to fall. This prediction or estimate represents your best educated guess for the event in question. Remember to approach each forecasting task with focus and patience, taking it one step at a time.
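
As a toy illustration of the reference-class process the prompt describes (my own example with invented numbers, not from the paper): ground the estimate in a base rate, adjust for case-specific evidence, and report an uncertainty interval.

```python
# Hypothetical reference class: 10 of 50 similar past situations saw the outcome.
base_rate = 10 / 50                      # 0.20 initial estimate

# Adjust for unique attributes of the current situation, e.g. unusually
# strong recent evidence pointing toward the outcome (invented figure).
adjustment = 0.08
estimate = base_rate + adjustment        # 0.28

# Report with an uncertainty interval, as the prompt instructs.
low, high = max(0.0, estimate - 0.10), min(1.0, estimate + 0.10)
print(f"Forecast: {estimate:.0%} (interval: {low:.0%} to {high:.0%})")
```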
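
And here is a minimal sketch of using the prompt via an LLM API, assuming the OpenAI Python SDK; the model name and the question are placeholders, and the multi-turn follow-up mirrors the back-and-forth interaction the study found valuable.

```python
from openai import OpenAI

# Paste the full superforecaster prompt from above here.
SUPERFORECASTER_PROMPT = """In this chat, you are a superforecaster providing
forecasting assistance. ..."""

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    {"role": "system", "content": SUPERFORECASTER_PROMPT},
    {"role": "user", "content": "What is the probability of X by year-end?"},
]
response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)

# Back-and-forth: append the reply and probe the reasoning.
messages.append({"role": "assistant",
                 "content": response.choices[0].message.content})
messages.append({"role": "user",
                 "content": "Which reference classes and base rates did you use?"})
followup = client.chat.completions.create(model="gpt-4o", messages=messages)
print(followup.choices[0].message.content)
```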