What a Pandemic Should Teach Us About Our Predictive Models
If your organization has a predictive model in place, you might be interested in its accuracy during an economic shock like the one we are currently experiencing with COVID-19. While it's not reasonable to expect your model to be perfect 100% of the time (remember, models are just generalizations of the data they were trained on), there are three major lessons to be learned from a risk management perspective to ensure your model isn't crushed by a jarring event.
Lesson #1: Normal distributions are rare
Most traditional economic models (in particular, forecasting and regression models that output a number, as opposed to a classification model that might output a "Yes/No") lean on an assumption of normality. Strictly speaking, the classical assumption applies to the model's errors rather than to each variable, but heavily skewed or long-tailed inputs tend to produce errors that violate it. The reasoning is that, in a linear relationship, the variance from the mean in each independent ("predictor") variable should explain the variance from the mean in the dependent ("target") variable. However, economic data in the real world rarely follow a normal distribution. Let's look at an example:
If we plot a histogram of the monthly percentage changes in the Producer Price Index for grain in the U.S., we see data that look very close to a normal distribution:
Most of the data follow a normal distribution centered around a 0% change. However, there is one observation that sits far from the rest, representing a month where there was a 70% increase. This creates a "long-tailed" distribution, which is dangerous when it comes to building models. The question now becomes: How do we handle this outlier?
On the one hand, we know that including this observation in the data we use to develop our model will break the "normality" assumption. On the other hand, removing that outlier means that our model won't learn from what happened that month. For example, if you are training a model to predict default rates in a loan portfolio using this percentage change in grain PPI variable, removing this outlier means you also have to throw out the observed default rate during that month. You can hope and pray that the effect that month followed the same linear pattern as the rest of the data, but as we see in just about every COVID-19 chart, the effect of outliers in real life is typically exponential rather than linear.
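To make the impact of a single extreme month concrete, here is a minimal sketch that measures skewness and excess kurtosis with and without one +70% observation. The data are synthetic stand-ins generated for illustration (not the actual grain PPI series), and the helper names `skewness` and `excess_kurtosis` are hypothetical.

```python
import numpy as np

def skewness(x):
    """Third standardized moment; roughly 0 for a symmetric distribution."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

def excess_kurtosis(x):
    """Fourth standardized moment minus 3; roughly 0 for a normal distribution."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(seed=0)

# Synthetic stand-in for monthly % changes in grain PPI (NOT real data):
# 240 roughly normal months centered on 0%, plus a single +70% month.
typical_months = rng.normal(loc=0.0, scale=5.0, size=240)
with_outlier = np.append(typical_months, 70.0)

for label, data in [("without outlier", typical_months),
                    ("with outlier", with_outlier)]:
    print(f"{label}: skew={skewness(data):.2f}, "
          f"excess kurtosis={excess_kurtosis(data):.2f}")
```

Both statistics should sit near zero for the typical months and jump sharply once the single extreme month is appended, which is one quick way to spot a long tail before fitting a model.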
Some possible methods for handling this situation include:
Remove the outlier from your model training data, but keep what happened when that outlier occurred handy and visible to the end consumer of your model
Move to another algorithm that is more robust to non-linearity (e.g., tree-based models) or consider Bayesian approaches
Simulate sensitivity in your parameter estimates (i.e., "Our model tells us that for every 1% decrease in the monthly percentage change in PPI for grain, there is a 0.05% increase in the monthly default rate. What if the effect were instead a 0.10% increase in the monthly default rate?") and present this as an addendum to your model's output.
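The sensitivity idea in the last item can be sketched in a few lines. This is only an illustration under stated assumptions: the 0.05 and 0.10 effect sizes come from the example above, while the 0.20 scenario, the list of PPI shocks, and the function name `default_rate_change` are hypothetical.

```python
def default_rate_change(ppi_change_pct, effect_per_pct):
    """Change in the monthly default rate implied by a linear effect.

    effect_per_pct is the increase in the default rate for every 1% DECREASE
    in the monthly grain PPI change, so the sign of the PPI change is flipped.
    """
    return -ppi_change_pct * effect_per_pct

# Effect sizes to compare: the model's estimate (0.05) and the "what if" value
# (0.10) from the text; 0.20 is an extra, purely hypothetical stress value.
scenarios = {
    "model estimate": 0.05,
    "what-if": 0.10,
    "severe (hypothetical)": 0.20,
}

# Hypothetical shocks to the monthly % change in grain PPI.
ppi_shocks = [-1.0, -5.0, -10.0]

for shock in ppi_shocks:
    for name, effect in scenarios.items():
        change = default_rate_change(shock, effect)
        print(f"PPI change {shock:+.0f}%, {name}: default rate {change:+.2f}")
```

A table like this, shown next to the model's point prediction, lets the reader see how much the conclusion depends on the estimated coefficient being right.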
Lesson #2: Make your model output probabilistic (Bootstrap it!)
The term "risk" implies that there is more than one possible outcome, and each outcome has a particular likelihood of occurring. If we are trying to model risk, then by default our models should be returning a distribution of possible outcomes and associated likelihoods of occurrence.
A great first step toward moving from a fully deterministic model to a probabilistic approach is to bootstrap your model. Instead of fitting a model once to all of the data, bootstrapping takes a pre-defined number of random samples (with replacement) of your training data, each sample equal in size to the original training dataset. This means that if we ask for 1,000 bootstrap samples, we can create 1,000 different models, each one trained on a slightly different variation of the original training dataset. Furthermore, we get 1,000 different coefficient estimates for each independent variable in our model.
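As a rough sketch of that procedure, the snippet below bootstraps a one-variable linear fit with NumPy. Everything about the data is an assumption made for illustration: the series are synthetic, the true slope of -0.04 is chosen arbitrarily, and a straight-line fit via `np.polyfit` stands in for whatever model you actually use.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Synthetic data (NOT real): monthly % change in grain PPI and a default rate
# that falls by an assumed 0.04 per 1% rise in PPI, plus random noise.
n_months = 240
ppi_change = rng.normal(0.0, 5.0, size=n_months)
default_rate = 2.0 - 0.04 * ppi_change + rng.normal(0.0, 0.3, size=n_months)

n_bootstraps = 1000
slopes = np.empty(n_bootstraps)

for i in range(n_bootstraps):
    # Draw row indices with replacement; each resample is as large as the original.
    idx = rng.integers(0, n_months, size=n_months)
    # Fit a straight line to the resample; polyfit returns [slope, intercept].
    slope, intercept = np.polyfit(ppi_change[idx], default_rate[idx], deg=1)
    slopes[i] = slope

# Percentile interval: the middle 95% of the bootstrapped slope estimates.
lower, upper = np.percentile(slopes, [2.5, 97.5])
print(f"95% bootstrap interval for the slope: [{lower:.3f}, {upper:.3f}]")
```

The `slopes` array is the distribution of parameter estimates; a histogram of it, or the percentile interval printed at the end, is what turns a single coefficient into a range of plausible values.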
In our earlier example, we used the monthly percentage change in the PPI for grain to predict the monthly default rate in our loan portfolio. If we took a bootstrap approach and plotted the 1,000 different parameter estimates for our independent variable, it might look something like this:
Since the bootstrapping approach generates a distribution of possible values for our parameter estimate, we can draw conclusions like:
"We are 95% confident that the weight of the percentage change in grain PPI on our default rate is between -0.043 and -0.035."
Said another way:
"For every 1% decrease in the monthly percentage change in grain PPI, we are 95% confident that the effect will be between a 0.043% and 0.035% increase in the monthly default rate."
While the difference between 0.043% and 0.035% may seem small, it equates to an $800,000 difference on a $10 billion loan portfolio. This effect compounds as you add additional independent variables, each bootstrapped with its own distribution of possible parameter estimates.
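The dollar figure above can be checked with a short calculation. It assumes, as the example implies, that the default rate is applied to the full portfolio balance; the variable names are illustrative.

```python
portfolio_balance = 10_000_000_000  # $10 billion loan portfolio

# Ends of the 95% interval for the effect on the monthly default rate, in percent.
low_effect_pct = 0.035
high_effect_pct = 0.043

# Gap between the two ends of the interval (in percent), converted to a
# fraction and applied to the portfolio balance.
gap_pct = high_effect_pct - low_effect_pct
dollar_difference = portfolio_balance * gap_pct / 100

print(f"${dollar_difference:,.0f}")
```

The gap of 0.008 percentage points is tiny as a rate, but on a balance of this size it is the difference between two materially different loss estimates.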
Lesson #3: Data is king
We have seen three economic disasters since 2000: the dot-com bubble, the 2008 housing crisis, and now COVID-19. Maybe you weren't collecting data in 2000 or even 2008, but hopefully you are collecting it now. This may be your first opportunity to use data from an economic downturn in all of your models going forward.
Additionally, frequency is critical. If you are only collecting data quarterly, there is no way to make monthly predictions using that data. Conversely, you can always aggregate high-frequency data up to a lower frequency (you could make quarterly predictions if you are collecting data monthly).
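Aggregating upward is usually a one-liner. The sketch below rolls twelve monthly observations up to four quarterly ones with pandas; the default-rate values are made up for illustration, and averaging is only one reasonable choice (a count or dollar amount would normally be summed instead).

```python
import pandas as pd

# Hypothetical monthly default rates (%) for one year; values are illustrative only.
monthly = pd.Series(
    [1.0, 1.2, 1.1, 1.3, 1.5, 1.4, 2.0, 2.2, 2.1, 1.8, 1.6, 1.7],
    index=pd.date_range("2020-01-01", periods=12, freq="MS"),
    name="default_rate_pct",
)

# Group months by calendar quarter and average them: 12 monthly points -> 4 quarterly.
quarterly = monthly.groupby(monthly.index.to_period("Q")).mean()
print(quarterly)
```

Going the other way, from quarterly back to monthly, would require inventing values that were never observed, which is why the collection frequency sets a floor on the prediction frequency.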
Lastly, collecting data and modeling at a higher frequency decreases the chance of a glaring singular outlier in your data. If you have a bad year and are collecting data annually (only one data point to represent that bad year), then you are stuck deciding how to handle that outlier. If you are collecting data often throughout the year, however, that bad year may no longer appear as a single outlier and may instead fit a much more model-able distribution.