"All models are wrong, but some are useful"

The above quote from George Box's 1976 paper still rings true *45 years later* (Happy New Year!) in many ways. The advances in machine learning over the past decade have brought a lot of promise (and a lot of hype) to helping solve the toughest challenge in business: knowing what is going to happen before it happens. We propose that even if all models are wrong, some are *less wrong* than others.

### Background

To provide some basic background on how machine learning works for classification problems, the visual representation below shows how a **dataset** made up of cases with known outcomes (i.e., each row in the dataset represents a unique loan, with attributes about that loan in each column, and a final column representing whether that loan defaulted or not) can be fed through an **algorithm** to *train* (i.e., create) a **model**:

Once the model has been *trained* on the known historical cases and outcomes (default vs. no default), it can be fed new cases where the outcome is not known; the model makes a prediction about what it expects the outcome to be for each of these new cases, based upon what it learned from the *training* dataset:
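To make the train-then-predict flow concrete, here is a minimal sketch in plain Python. The loan attributes (`dti`, `ltv`), the four training rows, and the nearest-centroid "algorithm" are all assumptions chosen for illustration; a real credit model would use far more data and a more capable algorithm.

```python
from statistics import mean

# Hypothetical training dataset: one row per unique loan, known outcome last.
training_rows = [
    {"dti": 0.15, "ltv": 0.60, "defaulted": 0},
    {"dti": 0.20, "ltv": 0.65, "defaulted": 0},
    {"dti": 0.45, "ltv": 0.95, "defaulted": 1},
    {"dti": 0.50, "ltv": 0.90, "defaulted": 1},
]
FEATURES = ("dti", "ltv")


def train(rows):
    """Toy algorithm: store the average feature values (centroid) per class."""
    model = {}
    for label in {r["defaulted"] for r in rows}:
        members = [r for r in rows if r["defaulted"] == label]
        model[label] = tuple(mean(r[f] for r in members) for f in FEATURES)
    return model


def predict(model, case):
    """Predict the class whose centroid is closest to the new case."""
    def sq_dist(centroid):
        return sum((case[f] - c) ** 2 for f, c in zip(FEATURES, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


model = train(training_rows)

# New cases where the outcome is not yet known.
new_cases = [{"dti": 0.18, "ltv": 0.62}, {"dti": 0.48, "ltv": 0.92}]
predictions = [predict(model, c) for c in new_cases]
print(predictions)
```

The point is only the shape of the workflow: known cases go in, a model comes out, and the model labels new cases.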

Now we have the basics of how machine learning works. It sounds great in theory, but let's explore why models like the one we trained above fall short in practice for many applications, including most credit loss modeling.

### An Important Distinction

In our data warehouse, we likely store data that represents transactions. For example, we probably have multiple rows of data for each loan: a new row representing the outstanding balance at each month if the loan is paid back monthly. This is in contrast to the structure of the dataset we need to train a model; each row in our training dataset should be a *unique* case of what we are going to later ask it to make predictions on. If we are building a model to predict whether a banana is ripe or rotten, each row in our training dataset must represent information about a __unique__ banana, and **not** daily observations of redundant bananas.
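As a rough sketch of that reshaping step, the snippet below collapses monthly snapshots into one row per loan. The column layout, status values, and loan IDs are hypothetical; your warehouse schema will differ.

```python
# Hypothetical monthly snapshots as they might sit in a data warehouse:
# (loan_id, month_on_book, outstanding_balance, status)
snapshots = [
    ("A", 1, 9_800, "current"),
    ("A", 2, 9_600, "current"),
    ("A", 3, 9_600, "default"),
    ("B", 1, 5_000, "current"),
    ("B", 2, 2_500, "current"),
    ("B", 3, 0, "paid_off"),
    ("C", 1, 7_000, "current"),
    ("C", 2, 6_900, "current"),
]


def one_row_per_loan(rows):
    """Keep only the latest snapshot of each loan and derive its outcome."""
    latest = {}
    for loan_id, month, balance, status in sorted(rows, key=lambda r: (r[0], r[1])):
        latest[loan_id] = (month, balance, status)  # later months overwrite earlier ones

    outcome_by_status = {"default": "default", "paid_off": "no default"}
    return [
        {
            "loan_id": loan_id,
            "months_observed": month,
            "last_balance": balance,
            # Loans still on the books have no known outcome yet.
            "outcome": outcome_by_status.get(status, "unknown"),
        }
        for loan_id, (month, balance, status) in latest.items()
    ]


loans = one_row_per_loan(snapshots)
for loan in loans:
    print(loan)
```

Eight transactional rows become three loan-level rows, one of which (loan C) still has an unknown outcome.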

### The Shortfall

First, recall that our training dataset was composed of cases which either defaulted or did not default. Thinking about a loan portfolio, we have some historical data in our data warehouse containing loans that either defaulted or repaid in full on schedule (let's ignore the idea of prepayment for now -- we can discuss that another day). However, a large portion of our data probably represents loans that are **still on the books**: they have not defaulted, but they also have not reached the end of their loan term yet (their outcomes are *unknown*). This raises a couple of important questions:

If we are not including these "active" cases in our training dataset, aren't we leaving out a lot of important information that we should have told our model? After all, if they are still on the books, they haven't defaulted, yet...

Furthermore, knowing that each row in our training dataset has to represent a *unique* loan:

If one of the cases in our training dataset has a known outcome of "default", isn't it important to know whether that loan defaulted after 1 month or after 1 year?

Surely we would want our model to assess the customer who paid back their 30-year mortgage on time for 28 years and then defaulted differently than the customer who went belly-up after the first few months.

Consider a different problem, like whether or not a visitor will click on an ad on our webpage. In this example, there probably isn't much of a "lifespan" to consider. They went to our website and in the course of a few seconds, either clicked on the ad or moved on -- there isn't much more to it than that. We can quickly create a training dataset that can be fed into an algorithm and produce a model, as described previously in our **Background** section.

Conversely, a customer defaulting on a loan isn't a process that happens in a few seconds; it is almost certainly the result of a series of events that took place over the span of months or years leading up to the default.

We can boil down the two questions called out earlier into one general question that encompasses the heart of what we are trying to get at:

How do we incorporate more information about the "lifespan" of a case (a loan) into our model?

Let's look at three different methodologies we can use to accomplish this:

- Survival Analysis
- IPCW (Inverse Probability of Censoring Weighting)
- Bayesian Inference

### Survival Analysis

Survival Analysis is a *regression* methodology that aims to predict the __time until an event will occur__. It also allows us to introduce cases where the "event" hasn't happened yet. When training a Survival model, each row in our training dataset is a unique loan and the columns are specified as follows:

- **'Time (months)'** represents how long we observed the loan for
- **'Event'** represents whether or not the event (default) occurred during the observed period; Default = 1, No Default = 0
- **'Attribute 1'** and **'Attribute 2'** represent some other information (i.e., independent variables) about the loan; these could be *categorical independent variables* like the region of the customer associated with the loan, or *continuous independent variables* such as the customer's average debt coverage ratio during the observed period; you can introduce as many *independent variables* as you think will help to explain the variance in your model

The output of a Survival model is different than that of a traditional machine learning model. Instead of outputting its guess at which class (default or non-default) a new loan case belongs to, the Survival model outputs a probability of that new loan case "surviving" (not defaulting) past a particular number of months. More specifically, when fed a new case, the model returns a probability of surviving past each unique time value found in the '**Time (months)**' column in the training dataset.

In our mock dataset above -- ignoring **Attribute 1** & **2** -- our Survival model would tell us that a new loan has a 67% probability of surviving past 13 months (since 2/3 of the loans were observed past 13 months), and a 33% probability of surviving past 27 months (since 1 of the 3 loans "survived" past 27 months). Of course, the more complex model would take into account our additional independent variables (such as '**Attribute 1**' and '**Attribute 2**') and alter those probabilities.
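For readers who want to see where numbers like these come from, here is a minimal sketch of the Kaplan-Meier estimator, one common way to estimate survival probabilities without any attributes. The three loans below are hypothetical, chosen only so that the output matches the 67% and 33% figures quoted above; they are not the original mock dataset.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))]: estimated probability of surviving past each observed time t."""
    curve = []
    survival = 1.0
    for t in sorted(set(durations)):
        # Loans still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Loans that defaulted exactly at time t (event flag 1).
        defaults = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        survival *= 1 - defaults / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical loans: 'Time (months)' and 'Event' (1 = default, 0 = no default yet).
durations = [13, 27, 40]
events = [1, 1, 0]

curve = kaplan_meier(durations, events)
for months, prob in curve:
    print(f"P(survive past {months} months) = {prob:.2f}")
```

Note that the third loan never defaulted during the observed period, yet it still contributes: it counts as "at risk" at months 13 and 27, which is exactly the information a plain classifier would throw away.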

Survival Analysis is an improvement over traditional machine learning because it allows us to incorporate into our model all of that information embedded within cases where we don't yet know the outcome. It also allows us to answer very powerful questions like, "Which loans in this segment of our portfolio have a greater than 50% probability of defaulting before 12 months? How about 24 months?" However, Survival models assume that the *event* (in our case, default) will **eventually** happen at some point in the future. This doesn't entirely line up with our business problem, since many loans will not experience a default (they will be paid in full). Perhaps there is a way to merge some of the methodology from Survival Analysis into traditional machine learning models...

### IPCW (Inverse Probability of Censoring Weighting)

In Survival Analysis, cases with unknown outcomes are referred to as being "censored". In their 2016 paper, "*Adapting machine learning techniques to censored time-to-event health record data: A general-purpose approach using inverse probability of censoring weighting*", Vock et al. discuss an approach to introducing Survival methods into any machine learning classification model.

The approach they propose calls for using only cases with known outcomes in the training dataset, but *weighting* each of those cases individually using a technique called **Inverse Probability of Censoring Weighting (IPCW)**. These weights are calculated based on the __entire__ dataset (cases with known outcomes *and* cases with unknown outcomes). Instead of estimating the probability that the event will occur by time *t*, IPCW estimates the probability that a case has not yet been censored at time *t* (that is, that it is still under observation), and weights each known-outcome case by the inverse of that probability. Cases that had to be followed for a long time before their outcome became known therefore count for more, since they stand in for similar cases that were censored along the way.

Using an IPCW-weighted training dataset to develop a model allows us to utilize all of the latest and greatest machine learning algorithms, while also encoding information from all of the data we have on loans with unknown outcomes. In particular, it helps our model to understand the difference between loans that defaulted shortly into their life and loans that defaulted but were nearly paid off.
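The weighting idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the paper's implementation: the six loans and the 24-month prediction horizon are hypothetical, and the censoring probabilities are estimated with a basic Kaplan-Meier curve in which "being censored" plays the role of the event.

```python
def censoring_curve(durations, events):
    """Kaplan-Meier estimate of G(t) = P(loan is still uncensored past t)."""
    steps, g = [], 1.0
    for t in sorted(set(durations)):
        at_risk = sum(1 for d in durations if d >= t)
        censored = sum(1 for d, e in zip(durations, events) if d == t and e == 0)
        g *= 1 - censored / at_risk
        steps.append((t, g))
    return steps


def g_at(steps, t, inclusive):
    """G(t) if inclusive, else G just before t (steps are sorted by time)."""
    value = 1.0
    for step_time, g in steps:
        if step_time < t or (inclusive and step_time == t):
            value = g
    return value


def ipcw_weights(durations, events, horizon):
    """Weight known-outcome loans by 1 / P(uncensored); unknown outcomes get 0."""
    steps = censoring_curve(durations, events)
    weights = []
    for d, e in zip(durations, events):
        if e == 1 and d <= horizon:
            # Defaulted within the horizon: outcome known at time d.
            weights.append(1 / g_at(steps, d, inclusive=False))
        elif d > horizon:
            # Observed beyond the horizon: known "no default by horizon".
            weights.append(1 / g_at(steps, horizon, inclusive=True))
        else:
            # Censored before the horizon: outcome unknown, excluded from training.
            weights.append(0.0)
    return weights


# Hypothetical loans: months observed and event flag (1 = default, 0 = censored).
durations = [6, 10, 12, 20, 30, 36]
events = [1, 0, 1, 0, 1, 0]
weights = ipcw_weights(durations, events, horizon=24)
print([round(w, 3) for w in weights])
```

These weights could then be passed to any classifier that accepts per-row sample weights. In this toy example the four known-outcome loans end up with a total weight of 6, the size of the full dataset, which is how the censored loans still make themselves felt.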

### Bayesian Inference

If you have made it this far, you should see that we are slightly altering the business problem of estimating credit loss in different ways to line up with appropriate modeling approaches.

The last methodology we believe lines up well with modeling credit loss is Bayesian Inference. For those not familiar, these models ask for a *prior* distribution of possible outcomes and associated probabilities, then add the data ('**evidence**' in the chart below) that has been collected, to create a final *posterior* distribution. It works a lot like our brain works -- we typically have some prior belief about the range of possible outcomes and their likelihoods, and we gather additional evidence over time which updates and refines our understanding.

Bayesian Inference provides us with two unique advantages in the context of modeling credit loss:

- We often don't need as much data to train a statistically robust Bayesian model as we need for a frequentist model; this is particularly useful for organizations that don't have a lot of historical loss data.
- These models output a distribution of possible outcomes, instead of a single point estimate. This allows us to draw conclusions like, "We are 90% confident that credit losses will be less than $5 million, and we are 95% confident that credit losses will be less than $10 million". This is often more useful than the traditional frequentist models that are limited to only concluding, "We predict that credit losses next year will total $6 million".
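A small sketch may help show what "prior plus evidence gives posterior" looks like in practice. It uses a Beta prior on a portfolio's default rate with binomial evidence, one of the simplest Bayesian setups; the prior parameters and the loan counts are hypothetical, and a real credit loss model would be considerably richer.

```python
import random

# Hypothetical prior belief: default rate around 5% (Beta(2, 38) has mean 0.05).
prior_alpha, prior_beta = 2, 38

# Hypothetical evidence: 50 resolved loans, 3 of which defaulted.
defaults, non_defaults = 3, 47

# Conjugate update: the posterior is also a Beta distribution.
post_alpha = prior_alpha + defaults
post_beta = prior_beta + non_defaults
posterior_mean = post_alpha / (post_alpha + post_beta)

# Approximate upper credible bounds by sampling from the posterior.
rng = random.Random(42)
draws = sorted(rng.betavariate(post_alpha, post_beta) for _ in range(20_000))
upper_90 = draws[int(0.90 * len(draws))]
upper_95 = draws[int(0.95 * len(draws))]

print(f"Posterior mean default rate: {posterior_mean:.3f}")
print(f"90% confident the default rate is below {upper_90:.3f}")
print(f"95% confident the default rate is below {upper_95:.3f}")
```

With only 50 resolved loans the prior still carries real weight, which is the small-data advantage described in the first bullet; the two upper bounds illustrate the distribution-style statements described in the second.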

Using Bayesian approaches to model credit losses is perhaps best outlined in Kwon's 2012 paper, *"Three Essays on Credit Risk Models and Their Bayesian Estimation"*.

### Wrapping Up

Although the promise around using machine learning to help improve credit loss prediction is justified, doing so requires ensuring that you are setting up your data and model appropriately. It can be tempting to plug your data directly into a machine learning algorithm too quickly, so consider using some methods from Survival Analysis -- including IPCW -- or Bayesian Inference to ensure you are building the best model with the information you have.

If you want to learn more about how to improve your credit loss estimation models, don't hesitate to get in touch with us at Ketchbrook Analytics.
