Humans-in-the-loop forecasting: integrating data science and business planning

by THOMAS OLAVSON

Thomas leads a team at Google called "Operations Data Science" that helps Google scale its infrastructure capacity optimally. In this post he describes where and how having “humans in the loop” in forecasting makes sense, and reflects on past failures and successes that have led him to this perspective.



Our team does a lot of forecasting. It also owns Google’s internal time series forecasting platform described in an earlier blog post. I am sometimes asked whether there should be any role at all for "humans-in-the-loop” in forecasting. For high stakes, strategic forecasts, my answer is: yes! But this doesn't have to be an either-or choice, as I explain below.

Forecasting at the “push of a button”?

In conferences and research publications, there is a lot of excitement these days about machine learning methods and forecast automation that can scale across many time series. My team and I are excited by this too (see [1] for reflections on the recent M4 forecasting competition by my colleagues). But looking through the blogosphere, some go further and posit that “platformization” of forecasting and “forecasting as a service” can turn anyone into a data scientist at the push of a button. Others argue that there will still be a unique role for the data scientist to deal with ambiguous objectives, messy data, and knowing the limits of any given model. These perspectives can be useful if seen as part of a spectrum of forecasting problems, each calling for different approaches. But what is missing from this discussion is that the role of humans in the loop extends beyond the data scientist. For some problems, not only should the data scientist be heavily involved, but the data scientist should also bring non-data-scientist stakeholders into the forecasting process.

Tactical vs strategic forecasts

Forecasting problems may be usefully characterized on a continuum between tactical on the one hand, and strategic on the other. This classification is based on the purpose, horizon, update frequency and uncertainty of the forecast. These characteristics of the problem drive the forecasting approaches.

The table below summarizes different forecasting problems as tactical and strategic:

Strategic Forecasts

Problem characteristics:
  • Purpose: an input to medium- to long-term planning in order to guide product, investment, or high stakes capacity planning decisions
  • Horizon: months to years
  • Update frequency: monthly or less
  • Uncertainty: difficult to quantify solely based on historical data due to long horizon, non-stationarity and possibly censored data

Forecasting approaches:
  • Methods: triangulation between alternate modeling methods and what-if analysis
  • Key metrics: summaries of forecast changes and drivers; thresholds to flag significant gaps between alternate forecasts
  • Humans in the loop: data scientist to suggest different forecast methods and generate or collect those forecasts; stakeholders to review differences and approve a “consensus” forecast

Tactical Forecasts

Problem characteristics:
  • Purpose: an input to short-term and mostly automated planning processes like inventory replenishment, workforce planning, production scheduling, etc.
  • Horizon: days to weeks
  • Update frequency: weekly or more
  • Uncertainty: quantifiable through model fitting or backtesting on historical data

Forecasting approaches:
  • Methods: automated pipeline of time series forecasts; if many related series are available then global, ML and/or hierarchical models may be appropriate
  • Key metrics: point forecast and prediction interval accuracy metrics for model evaluation
  • Humans in the loop: data scientist to build and maintain models; may include judgmental adjustments to model output, but used sparingly

Table 1: Strategic and tactical forecasts.

Some also distinguish further between tactical and operational forecasts [2]. The latter involve updates at least daily. In this case there is no time at all for human review, and forecast automation is essential.

In choosing the appropriate method, a key distinction lies in the business stakes associated with a given forecast publication cycle. Based on the decisions being made and how quickly plans can adjust to new forecast updates, what is the cost of forecasting too high or too low? If the costs of prediction error are asymmetric (e.g. predicting too low is more costly than predicting too high), decisions should plan to a certain quantile forecast (e.g. 95th percentile). This may be true for both strategic and tactical forecasts. For example, long-term capacity or short-term inventory may be planned to a high quantile forecast, if the cost of a shortage is much greater than the cost of holding excess.
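To make the cost asymmetry concrete, here is a minimal Python sketch (not from any production system) of the classic newsvendor-style calculation: an assumed 9:1 ratio of shortage cost to excess cost implies planning to the 90th percentile of an assumed normal demand forecast. All costs and distribution parameters are invented for illustration.

```python
from scipy.stats import norm

# Illustrative, invented costs: being short one unit costs far more than
# carrying one excess unit.
cost_under = 9.0   # cost per unit of under-forecasting (shortage)
cost_over = 1.0    # cost per unit of over-forecasting (excess)

# Critical ratio: the quantile that balances the two costs.
critical_ratio = cost_under / (cost_under + cost_over)   # 0.90

# Suppose demand over the planning period is summarized by a normal forecast
# (point forecast = mean, uncertainty = standard deviation); both invented.
point_forecast, sigma = 1000.0, 200.0

# Plan to the quantile implied by the critical ratio rather than the mean.
plan_quantity = norm.ppf(critical_ratio, loc=point_forecast, scale=sigma)
print(f"Plan to the {critical_ratio:.0%} quantile: {plan_quantity:.0f} units")
```

With a 9:1 cost ratio, the plan lands at the 90th percentile, about 1.28 standard deviations above the point forecast.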

The ROI of human involvement

When it comes to human involvement, the key difference is the magnitude of the costs at stake in any one forecast cycle. How much does human intervention reduce the cost of forecast error in that cycle? That reduction, relative to the human time invested, defines the ROI of human involvement.

Tactical forecasts have a higher frequency of updates and a shorter forecast horizon. Thus, there is both less time to make adjustments and less return on human time in doing so. If the number of forecasts is not too unwieldy and the forecasts not too frequent, there may be some room for what Hyndman calls “judgmental adjustments” to the model output, as a sort of light-weight version of strategic forecasting. Hyndman cautions that adjustments should be made sparingly using a structured and systematic approach and are “most effective when there is significant additional information at hand or strong evidence of the need for an adjustment.” [3].

In contrast, strategic forecasts benefit from a higher level of human review and a more formal process for triangulating between different forecast methods, some of which may rely primarily on judgment and forward-looking information.

Figure 1: A Google data center

As an example, consider Google’s forecasting and planning for data center capacity. This capacity is planned years in advance due to long lead times for land, utility infrastructure, and construction of physical buildings with cooling infrastructure. Once built, the data centers can be populated with compute and storage servers at much shorter lead times. Future demand for servers is uncertain, but the cost of empty data center space is much less than the shortage cost of not being able to deploy compute and storage when needed to support product growth. We therefore plan capacity to a high quantile demand forecast.

Prediction intervals are critical for our quantile forecast. But unlike in the tactical case, we have a limited time series history available for backtesting. Nor can we learn prediction intervals across a large set of parallel time series, since we are trying to generate intervals for a single global time series. With those stakes and the long forecast horizon, we do not rely on a single statistical model based on historical trends.

I sometimes see the erroneous application of a tactical approach to strategic forecasting problems. Strategic forecasts drive high stakes decisions at longer horizons, so they should not be approached simply as a black box forecasting service, divorced from decision-making. Done right, strategic forecasts can provide insights to decision makers on trends, incorporate forward-looking knowledge of product plans and technology roadmaps when relevant, expose the risks and biases of relying on any one forecasting methodology, and invite input from stakeholders on the uncertainty ranges. In this case, there is a high return on investment (ROI) of human time to triangulate among different forecasts to arrive at a consensus forecast.

This focus on the ROI of human time turns on its head the conventional wisdom from 50 years ago, where the essence of how to choose the right forecasting technique was how much computational time to invest to arrive at “good enough” forecast accuracy [4]. Today, as computation has become cheap, the key tradeoff is between the human time invested vs “good enough” forecast accuracy. For tactical forecasts of many parallel time series, computational time may still be a consideration. But even here, a greater concern is the time invested by data scientists (mostly at the development stage) in data analysis and cleaning, feature engineering, and model development.

A forecast triangulation framework

As stated earlier, strategic forecasts should triangulate between a variety of methodologies. But this does not mean simply presenting a menu of forecasts from which decision makers can choose. Consider again the example of long-term forecasts for data center capacity planning. We might generate at least three types of forecasts with fundamentally different world views on which factors really drive growth: an “affordability” forecast based on forecasted revenue growth and the relationship between data center capacity and revenue growth; a “resource momentum” forecast based on historical trends for compute and storage usage translated into data center capacity needs using technology roadmaps; or a “power momentum” time series forecast based on historical consumption of usable data center capacity (measured in watts of usable capacity). Each has some merit, but simply presenting all three as a choice shirks responsibility for actually arriving at the “best” forecast.
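To make the notion of competing world views concrete, here is a hypothetical Python sketch of what three such capacity forecasts might look like side by side. Every growth rate, conversion factor, and unit below is invented for illustration; the real models behind each view are considerably richer.

```python
import numpy as np

years = np.arange(2020, 2025)
t = years - years[0]

# "Affordability" view: capacity follows forecasted revenue growth.
revenue_index = 100 * 1.15 ** t          # invented revenue forecast
capacity_per_revenue = 0.8               # invented MW per unit of revenue index
affordability_mw = revenue_index * capacity_per_revenue

# "Resource momentum" view: extrapolate compute/storage usage trends, then
# translate to data center capacity via a technology roadmap.
compute_usage = 50 * 1.25 ** t           # invented usage trend
watts_per_compute = 1.6 * 0.95 ** t      # invented roadmap: efficiency improves
resource_momentum_mw = compute_usage * watts_per_compute

# "Power momentum" view: extrapolate usable data center watts consumed.
power_momentum_mw = 80 * 1.18 ** t       # invented consumption trend

for y, a, r, p in zip(years, affordability_mw, resource_momentum_mw, power_momentum_mw):
    print(f"{y}: affordability={a:6.1f}  resource momentum={r:6.1f}  power momentum={p:6.1f} MW")
```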

The data scientist could try to build a single model that integrates all the signals together, but doing so typically relies on historical data to determine which features have the most predictive value. Boiling all the information down to a single model does not help us challenge to what degree we think the future will differ from the past. A single model may also not shed light on the uncertainty range we actually face. For example, we may prefer one model to generate a range, but use a second scenario-based model to “stress test” the range. If the alternate model is plausible with a small probability, then we’d like to see that the “stress test” forecast scenario still falls inside the prediction interval generated from our preferred model.
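Here is a minimal sketch, with invented numbers, of the “stress test” check described above: given a per-period prediction interval from the preferred model and a scenario path from an alternate model, flag the periods where the scenario escapes the interval.

```python
import numpy as np

# Prediction interval per period from the preferred model (invented numbers).
lower = np.array([ 90, 100, 112, 125, 140])
upper = np.array([130, 155, 185, 220, 265])

# Scenario path from an alternate, lower-probability model (invented).
stress_scenario = np.array([120, 150, 190, 230, 280])

# Flag periods where the stress scenario escapes the preferred interval;
# those are the periods where the interval may be too narrow.
outside = (stress_scenario < lower) | (stress_scenario > upper)
for t, (s, lo, hi, out) in enumerate(zip(stress_scenario, lower, upper, outside)):
    status = "OUTSIDE interval, revisit the range" if out else "inside interval"
    print(f"period {t}: scenario={s} vs [{lo}, {hi}] -> {status}")
```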

Rather than providing a menu of models, or a single model, the data scientist needs to play a bigger role in reviewing and evaluating forecasts. In particular, the data scientist must take responsibility for guiding stakeholders to approve the “best” forecast from all available information sources. By “best” forecast, we mean the most accurate point forecasts and prediction intervals. Using multiple forecasts forces a conversation about the drivers and a revisiting of the input assumptions. It provides the occasion for deeper exploration of which inputs can be influenced and which risks can be proactively managed.

Over the life of the forecast, the data scientist will publish historical accuracy metrics. But due to the long time lag between forecasts and actuals, these metrics alone are insufficient. The data scientist will conduct post-mortem analyses and adjustments when actual demand deviates significantly from the forecast. Every forecast update will include metrics to provide insight on change drivers, and will flag significant gaps between different model forecasts.
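As a hypothetical illustration of these update metrics, the sketch below compares the current forecast publication against the previous one (to surface change drivers) and against a benchmark model (to surface triangulation gaps), flagging anything beyond a tolerance. The 10% threshold and all forecast values are invented.

```python
import numpy as np

def flag_gaps(current, reference, tolerance=0.10, label="reference"):
    """Print relative gaps between two forecast vectors, flagging large ones."""
    rel_gap = (current - reference) / reference
    for t, g in enumerate(rel_gap):
        flag = "  <-- REVIEW" if abs(g) > tolerance else ""
        print(f"  period {t}: {g:+.1%} vs {label}{flag}")

previous_publication = np.array([100, 110, 120, 130])   # invented
benchmark_forecast   = np.array([102, 108, 118, 150])   # invented
current_publication  = np.array([101, 115, 135, 128])   # invented

print("Change vs previous publication (change drivers to explain):")
flag_gaps(current_publication, previous_publication, label="previous publication")
print("Gap vs benchmark model (triangulation check):")
flag_gaps(current_publication, benchmark_forecast, label="benchmark")
```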

Note that this approach assumes that the forecasts directly drive high stakes decisions, and are not judged subsequently. If forecasts are simply used as a baseline to detect trend changes, then other approaches and less investment may be appropriate. But even in those cases, it may be worth the data scientist’s time to understand which decisions are in fact being made based on the trends detected by the forecast. (If the forecast cannot influence decisions, it does not merit a data scientist). Finally, data scientists must recognize that their forecasts may be used more broadly than first anticipated, and broad communications may have more value than first realized. It is therefore good discipline to provide forecast release notes explaining key risks and drivers of change.

The diagram below shows how we approach strategic forecasting for high stakes infrastructure capacity planning decisions at Google. The data scientist surfaces differences between a proposed forecast and one or more benchmark forecasts. The proposed forecast is the forecast believed to require the fewest subsequent adjustments. This proposed forecast may still have other shortcomings, such as being prone to biases of human judgment, or lacking a robust prediction interval. The benchmark forecasts are used as cross-checks, and to gain insight into how the future may differ from the past.

The data scientist advocates for methods to include as benchmarks, as well as the method used as the proposed forecast. Some but not necessarily all of these forecasts may be generated by the data scientist directly. For example, proposed forecasts may come from customers, if their forecasts are based on forward-looking information about product and technology plans that would be difficult for the data scientist to extract as inputs into a predictive model. At least one of the forecast methods will have a quantitative prediction interval generated by the data scientist, so that other forecasts can be considered in the context of this range.

Figure 2: Forecast triangulation

Integrating customer forecasts with statistical forecasts

In strategic forecasting, the proposed forecast may rely partially on forecasts or assumptions not owned by the data scientist. In the supply chain context, forecast and information sharing between buyers and suppliers is called “collaborative forecasting and planning.” This collaboration might also be between internal customers and an internal supplier. Using customer forecasts as the proposed forecast can capture valuable information about future inorganic growth events or trends that are difficult to extract as features for a predictive model. On the other hand, these customer forecasts can be aspirational and often lack high quality prediction intervals. Customer forecasts may further suffer from what Kahneman calls the “inside view”, where a forecaster (a customer in this case) may extrapolate from a narrow set of personal experiences and specific circumstances without the benefit of an “outside view” that can learn from a much larger set of analogous experiences. 

So what to do? An operations team in need of a forecast to plan against may poorly frame this as an either-or proposition: either they accept the customer forecast (perhaps interpreting it as a high quantile forecast scenario), or they discard it in favor of a statistical time series forecast with a quantitative prediction interval. The alternative we use is the forecast triangulation framework described above. We collect base and high scenario customer forecasts and generate statistical forecasts, and we build a process to approve a “consensus” forecast using as inputs the proposed customer forecasts and one or more benchmark time series forecasts. This allows us to capture forward-looking information signaled in the customer forecast, while checking for bias and adding prediction intervals from the time series forecasts. A variant of this method is to provide the customer with a baseline statistical forecast and allow them to make adjustments to it. Either can work well, as long as the difference between the statistical forecast and the approved consensus forecast is reviewed, understood and approved by a set of decision makers who are accountable for the costs of both under-forecasting and over-forecasting.

Where there are significant gaps between a customer forecast and a statistical forecast, the process requires a good story, understood by everyone, to explain the gap before it is approved as the consensus forecast. It may take multiple forecasting cycles to resolve, but typically we see one or more of the following:
  • an approval by all parties that the gap is legitimate due to forward-looking factors
  • removal of outlier events from history or model adjustments to improve the accuracy of statistical forecasts
  • convergence between the customer and statistical forecast
In an internal customer-supplier setting, we have found it useful to require “consensus” to mean alignment between customers and other stakeholders, since the customers are the ones who most acutely feel the pain of shortages due to under-forecasting. Also included in the approver group are Finance (who are particularly concerned about the costs of excess capacity) and the operations teams (who are responsible for executing on the plans and help mediate between customers and Finance to drive forecast alignment).
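As a hypothetical sketch of the gap review that precedes consensus approval, the snippet below compares a customer's base and high scenario forecasts against a statistical benchmark and its 95th percentile, and flags quarters where the gap requires a good story before sign-off. The 25% threshold and all values are invented.

```python
customer_base = [100, 120, 150, 200]   # customer forecast, base scenario (invented)
customer_high = [110, 140, 190, 260]   # customer forecast, high scenario (invented)
stats_point   = [ 98, 112, 128, 146]   # statistical benchmark point forecast (invented)
stats_p95     = [115, 135, 158, 184]   # statistical 95th percentile (invented)

GAP_THRESHOLD = 0.25  # relative gap that triggers review before consensus approval

for q, (base, high, point, p95) in enumerate(
        zip(customer_base, customer_high, stats_point, stats_p95), start=1):
    gap = (base - point) / point
    needs_story = abs(gap) > GAP_THRESHOLD or high > p95
    status = "needs explanation before consensus" if needs_story else "ok to approve"
    print(f"Q{q}: customer base vs stats {gap:+.0%}, "
          f"customer high {'above' if high > p95 else 'within'} stats P95 -> {status}")
```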

There are more sophisticated alternatives, such as contracts and risk-sharing agreements between customers and suppliers. In fact, there is a body of literature on optimal contracting structures between buyers and suppliers [5]. Unfortunately, formal risk-sharing agreements can be cumbersome and difficult to put in place. This is particularly true in operational planning domains where contracting around risk is not common, as it is in financial markets. We have found that a simple but effective approach is to help customers forecast by providing useful statistical benchmark forecasts, while also inviting their input on what “inorganic” events may require adjustment of the statistical forecasts. Not only does this improve forecast quality and build a common understanding of forecast drivers, it also creates a shared fate. As the old adage goes, all forecasts are wrong. It is easy in hindsight for stakeholders to second-guess the data scientist’s statistical forecast if the data scientist did not make concerted efforts to consult them about forward-looking information they may have had.

Case study: machines demand planning

Below is an example of the evolution of an important strategic forecast process at Google. It illustrates the benefits and pitfalls of automation, and follows the thesis-antithesis-synthesis narrative.

Original process — customer-driven forecast

In our supply chain planning for new machines (storage and compute servers), we would stock inventory of component parts based on forecasts so that we could quickly build and deliver machines to fulfill demand from internal customers such as Search, Ads, Cloud, and YouTube. The operations teams would plan with only a high-level “machine count” forecast based on input from our internal customers. There was no accounting for error in the customer forecasts and no credible benchmark time series forecast. There was little internal alignment between product and finance functions on the machine count forecasts: it was not uncommon to see a 2x difference in the machine count forecasts during annual planning discussions. These were effectively competing forecasts, and there was no clear process for reconciling them or for documenting a single plan-of-record consensus forecast visible to all.

This disconnect between customer and finance forecasts often went unresolved until inside the supply chain lead time (over six months for some components). The teams planning component inventory were therefore left to judge how much to discount the customer forecast (and risk being responsible for a shortage) or to uplift the finance guidance (and risk being responsible for excess inventory). Even if the operations team brought materials in early as a safety stock hedge, the forecasted mix of components would often be wrong. The outcome was both poor on-time delivery due to shortages on specific components and excess inventory due to overall machine count forecasts being too high. In the face of poor on-time delivery and long lead times for new machines, internal customers needed to hold large reserves to buffer against unpredictable machine deliveries. We had the worst possible outcomes: high supply chain inventory and a poor customer experience that led to high idle deployed inventory in the fleet.

First attempt — Stats forecast

Our first attempt at fixing this problem was to remove all humans from the forecasting loop — we placed all our bets on the statistical forecasts for new component demand. Our data scientists developed statistical forecasts for each machine component category and determined the forecast quantile we needed to plan against for each component type to meet on-time delivery targets for machines. We invested several quarters in building and tuning the forecasting models to reduce the error as much as possible. But this was never implemented because the predictions required us to triple our safety stock inventory in order to meet our on-time delivery goals. The forecast ranges were too wide and so the solution was just too costly. Component inventory is planned to high quantile forecasts, and the high quantile forecast was driven by outlier behavior in the past, such as inorganic jumps in demand due to new product launches or specific machine configuration changes our customers requested. Our customers often knew when those changes were coming. Trying to forecast based on history alone, our fully-automated approach was ignoring this forward-looking information.
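A back-of-the-envelope sketch of why wide prediction intervals translated directly into a much larger safety stock requirement, assuming (purely for illustration) normally distributed component demand over the lead time: the buffer above the point forecast scales linearly with the forecast standard deviation at the planning quantile.

```python
from scipy.stats import norm

service_quantile = 0.95
z = norm.ppf(service_quantile)      # ~1.645

mean_demand = 1000.0                # invented component demand over the lead time

# Safety stock is the buffer above the point forecast needed to hit the quantile;
# it scales linearly with the forecast standard deviation.
for sigma in (100.0, 300.0):        # "narrow" vs "wide" forecast uncertainty (invented)
    safety_stock = z * sigma
    print(f"sigma={sigma:.0f}: plan to {mean_demand + safety_stock:.0f} units "
          f"(safety stock {safety_stock:.0f}, {safety_stock / mean_demand:.0%} of mean demand)")
```

Tripling the forecast standard deviation triples the buffer; intervals inflated by past inorganic jumps have the same effect.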

Forecast triangulation

We had to find a more efficient way to buffer for customer demand uncertainty that did not allow for unbounded customer forecast error based on past inorganic events or machine configuration changes. We therefore invested in processes and tools that would consider customer forecasts as proposed forecasts, compare with benchmark forecasts, and arrive at consensus high scenario forecasts that supply chain teams could plan materials against. This required investment from software engineering teams to capture forecasts from our customers in machine-readable form and convert those forecasts into units relevant to component capacity planning. It required investments from our data science team to re-think our statistical forecasting approach to make it easier to compare against customer forecasts. Instead of forecasting machine components as we had first tried, we forecasted closer to the true source of demand — customer compute and storage load at the data center level. We forecasted load at a finer granularity that allowed us to compare customer and statistical capacity forecasts directly, netting load growth versus existing capacity already deployed in the fleet. Our prediction intervals were also shared with our customers as a tool to rightsize the amount of deployed inventory they needed to hold in the fleet. With this change in focus, while leaving the exact mix of machine configs to those customers who required specific machine types, our statistical forecasts were more stable with narrower ranges and a much more credible benchmark.
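A minimal, hypothetical sketch of the netting step described above: forecast customer load at the data center level and subtract the usable capacity already deployed, so that only incremental growth generates new machine (and component) demand. Units and numbers are invented.

```python
import numpy as np

# Forecasted customer load per quarter at one data center (invented units).
forecast_load = np.array([400, 430, 470, 520])

# Usable capacity already deployed in the fleet at that site (invented).
deployed_capacity = 410

# Only load above what is already deployed generates new machine
# (and hence component) demand.
incremental_demand = np.clip(forecast_load - deployed_capacity, 0, None)
print("Incremental capacity demand by quarter:", incremental_demand)
```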

The data science team also defined metrics to drive forecast accountability for both the operations teams and customer teams. Any significant shortage or excess in the consensus forecast could be traced back to its root cause based on the consensus forecast of record at lead time. The operations team facilitated a monthly process to drive a rolling 4-quarter alignment between customers and Finance on the consensus “base” and “high” forecasts so that all downstream teams could confidently execute according to the high scenario forecast.
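A hypothetical sketch of the accountability metric: each quarter's actual is scored against the consensus forecast of record published one lead time earlier, and bias is summarized as mean percent error. All values are invented.

```python
import numpy as np

# Consensus forecasts of record, each published one supply chain lead time ahead
# of the quarter it covers, alongside the actuals that later arrived (invented).
forecast_of_record = np.array([100, 110, 125, 140])
actuals            = np.array([118, 104, 121, 160])

percent_error = (forecast_of_record - actuals) / actuals
print("Percent error at lead time:", np.round(percent_error, 3))
print(f"Mean percent error (bias): {percent_error.mean():+.1%}")

# A persistently positive MPE (as defined here) signals over-forecasting and
# excess inventory risk; a persistently negative MPE signals under-forecasting
# and shortage risk.
```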

This consensus building process helped us shape the future in a way that neither the customer-driven nor stats forecast alone could do. By combining data science with process rigor, we could expose key risks and better manage them, expose disagreements between stakeholders and negotiate to resolve them. This process reduced mean percent error (forecast bias) of the consensus forecast. We also learned that most of our customers most of the time were not actually asking us to cover nearly as large of an uncertainty range as estimated by our first attempt at an automated component forecast. The results are summarized in the table below.


Customer-driven forecast with added safety stock
  • Component safety stock inventory levels: 1x
  • Excess component inventory due to forecast bias: ~30% of forecast
  • On-time delivery and fleet deployed inventory: poor on-time delivery → high fleet inventory

Stats forecast with prediction interval
  • Component safety stock inventory levels: 3x
  • Excess component inventory due to forecast bias: 0 (hypothetically)
  • On-time delivery and fleet deployed inventory: good (hypothetically)

Forecast triangulation
  • Component safety stock inventory levels: 0.6x
  • Excess component inventory due to forecast bias: 0
  • On-time delivery and fleet deployed inventory: good

Table 2: Forecast methods compared.

Conclusion

Compared with a purely algorithmic forecast, including humans in the loop certainly adds ambiguity, complexity and effort. This can be uncomfortable for data scientists, and can make us vulnerable to feeling insufficiently technical or scientific in our approach. To the extent possible, we all want to take technically rigorous approaches that are free from human bias. We dream of the one model to rule them all that has access to the perfect set of useful features with a long data history from which to learn. But in strategic forecasting, the available time series is short relative to the forecast horizon, and the time series is likely to be non-stationary. Perfection is not possible. Ambiguity already exists in the business problem and in the variety of information one can bring to bear to solve it. Models that ignore key business drivers or uncertainties due to lack of hard data bring their own type of bias. It is the data scientist’s job to grapple with the ambiguity, frame the analytical problem, and establish a process in which decision makers make good decisions based on all the relevant information at hand. We believe this applies as much to forecasting as to any other kind of data science. With oversight from good data scientists, there is much value in having humans in the loop of strategic forecasts.


References

[1] C. Fry and M. Brundage, “The M4 forecasting competition – a practitioner’s view,” International Journal of Forecasting (2019), https://doi.org/10.1016/j.ijforecast.2019.02.013.

[2] T. Januschowski and S. Kolassa, “A classification of business forecasting problems,” Foresight: The International Journal of Applied Forecasting, International Institute of Forecasters, issue 52, pp. 36–43, Winter 2019.

[3] R. J. Hyndman and G. Athanasopoulos, Forecasting: Principles and Practice, 2nd edition, OTexts: Melbourne, Australia, 2018. OTexts.com/fpp2. Accessed 2019-08-01.

[4] J. Chambers, S. Mullick and D. Smith, “How to choose the right forecasting technique,” Harvard Business Review, July 1971, https://hbr.org/1971/07/how-to-choose-the-right-forecasting-technique.

[5] S. C. Graves and A. G. de Kok (eds.), Supply Chain Management: Design, Coordination and Operation, 2003. See the chapters “Supply Chain Coordination with Contracts” by G. Cachon and “Information Sharing and Supply Chain Coordination” by F. Chen.






