Updated: Feb 14, 2019
It seems like everywhere you turn these days, the software your company is thinking about upgrading to has an add-on feature that will do "Automated Machine Learning" or "Automated Artificial Intelligence". The up-sell attempt usually contains dazzling promises of "unlocking insights you never knew existed", and "putting the power of AI in every user's hands". As a data scientist, this scares me. And it's probably not for the reasons you might think...
First, these are NOT the 3 reasons why auto-ML and auto-AI tools scare me:
"They're going to put data scientists out of work"
"AI is going to end humanity"
"...something something, robots, something something..."
The real reasons that automated ML/AI tools scare me have to do with data science responsibility. When you just plug your raw data into an algorithm, you'll always get a prediction. It might come as a surprise to others, but as a data scientist, the algorithm is usually the easiest step in the predictive modeling process. The toughest (and arguably most important) part of the predictive modeling process is everything that comes before the algorithm. Three of these pre-modeling steps are exploratory data analysis, stratifying your sample, and meeting model assumptions.
Reason #1: You're not taking the time to truly learn about your data (Exploratory Data Analysis)
There's a step in the modeling process that data scientists refer to as "exploratory data analysis", where you visualize, filter, and transform the data in creative ways to learn as much as you can about the data that you're working with. While there are a few standard steps you should take each time you do EDA, there is really no way to automate all of the different ways you should analyze a dataset, because every dataset is different.
Let's use an example to demonstrate this. Last week I was looking at some housing data for my hometown of Ellington, CT, that I got from Connecticut's Open Data repository. If you're interested in following along, you can get my Rmarkdown notebook here. I handled the missing values in the data and made sure my distributions looked reasonable -- data-preparation steps that a good auto-ML/AI tool will do. But before I got to the modeling stage, I looked at the data in one more way: I counted how many homes were sold in each season of each year, and saw this:
Every five years (2001, 2006, 2011, 2016) there were zero homes sold in the spring and summer... this can't be right. If we just plugged the data into a model (maybe a neural network), would it predict that homes sold in the spring of 2021 would go for $0? What if we realized this too late, after we had been making decisions off that automated model for years?
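Catching a gap like this takes nothing more than a count of sales by year and season before you ever fit a model. Here's a minimal sketch of that check in Python with pandas -- the records and the column names `sale_date` and `price` are hypothetical stand-ins, since the actual Open Data schema isn't shown here:

```python
import pandas as pd

# Hypothetical sample of home-sale records; the real data would come
# from Connecticut's Open Data repository.
sales = pd.DataFrame({
    "sale_date": pd.to_datetime([
        "2015-04-10", "2015-07-22", "2016-01-15", "2016-11-03", "2017-05-30",
    ]),
    "price": [250000, 310000, 199000, 275000, 330000],
})

# Map each sale's month to a season, then count sales per (year, season).
season = sales["sale_date"].dt.month.map(
    lambda m: {12: "winter", 1: "winter", 2: "winter",
               3: "spring", 4: "spring", 5: "spring",
               6: "summer", 7: "summer", 8: "summer"}.get(m, "fall"))

counts = (sales.assign(year=sales["sale_date"].dt.year, season=season)
               .groupby(["year", "season"]).size()
               .unstack(fill_value=0))
print(counts)  # a column of zeros here flags seasons with no recorded sales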
Reason #2: Are you giving the model a fair chance to predict accurately? (Stratifying your sample)
Let's say that our raw data that we connect to the auto-ML/AI engine represents data about customer fraud, and we want to build a model that predicts whether or not a customer will be fraudulent. In some of the rows (observations) in our historical data we have cases where a customer was fraudulent, while in other rows the customers did not commit fraud. More than likely, though, you have many more cases of "non-fraudulent" customers than you do "fraudulent" customers.
If your target variable is imbalanced like this (let's say that 98% of the observations are labeled "non-fraudulent", and 2% are "fraudulent"), then your model may simply learn to predict that everything is "non-fraudulent", and it will still be right 98% of the time. There are methods to stratify, or rebalance, your data, but some of the less-robust auto-ML/AI tools may not do this before applying the algorithm to the data.
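The accuracy trap is easy to demonstrate. Below is a minimal sketch using the 98%/2% split from above (the counts themselves are made up for illustration): a "model" that always predicts the majority class scores 98% accuracy while catching zero fraud.

```python
from collections import Counter

# Hypothetical labels matching the 98% / 2% split described above.
labels = ["non-fraudulent"] * 980 + ["fraudulent"] * 20

# A "model" that ignores its input and always predicts the majority class.
predictions = ["non-fraudulent"] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
print(f"accuracy: {accuracy:.0%}")  # 98% -- looks great on paper

# ...but it never identifies a single fraudulent customer.
caught = sum(p == y == "fraudulent" for p, y in zip(predictions, labels))
print(f"fraud cases caught: {caught} / {Counter(labels)['fraudulent']}")
```

This is why metrics like recall or precision on the minority class, paired with stratified sampling or rebalancing, matter far more here than headline accuracy -- and why a tool that only reports accuracy can quietly mislead you.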
Reason #3: Are you meeting the assumptions necessary for your model? (Model/Data diagnostics)
Machine learning and AI models use what happened historically to predict the future. If your auto-ML/AI tool isn't also identifying trends in your data, then its future predictions will fall short. For example, let's say we're trying to predict whether or not a customer will click on our ad based upon some characteristics about that person, including their age. If your advertisement is for maternity products and you're using a lot of historical data, the predictive algorithm might get confused when it sees that customers in their early twenties clicked your ad most frequently in the older data, while in the more recent data it's customers in their late twenties who click most frequently. The algorithm might even tell you that age isn't a critical factor in who will click on your maternity ad if you don't analyze your data properly. Take the following chart, for example:
Just between 2006 and 2014 you can see an upward trend in the age at which people are having children. If you're not explicitly building this trend into your dataset, then you might be missing out on valuable information to power your predictions.
This concept of ensuring there are no trends in your data is called "checking for stationarity", and is just one of many assumptions that you might have to meet depending on what type of machine learning or AI algorithm you are using. Perhaps some of the more robust auto-ML/AI tools attempt to automate some of this diagnostic-checking, but I would argue that most tools on the market today haven't reached this complexity yet.
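A first-pass check for this kind of drift doesn't require a specialized tool: fit a line to the yearly mean of the feature and look at the slope. Here's a minimal sketch with made-up ages (a clearly nonzero slope means the series is trending, i.e. not stationary in its mean; real diagnostic work would use a formal test such as the Augmented Dickey-Fuller test from statsmodels):

```python
# Hypothetical mean age of customers clicking the ad, by year,
# loosely mirroring the upward trend in the chart above.
years = [2006, 2008, 2010, 2012, 2014]
mean_ages = [23.1, 23.9, 24.8, 25.6, 26.4]

# Ordinary least-squares slope of mean age against calendar year.
n = len(years)
x_bar = sum(years) / n
y_bar = sum(mean_ages) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(years, mean_ages))
         / sum((x - x_bar) ** 2 for x in years))

print(f"trend: {slope:+.2f} years of age per calendar year")
```

If a five-line check like this surfaces a trend, you know the model needs either a de-trended feature or an explicit time component -- something you can't assume a point-and-click tool will do for you.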
Moral of the Story
"With great power comes great responsibility" -- pretty much every movie ever
I don't think it's disputable that auto-ML and auto-AI make sense for a lot of organizations and are the way of the future. Long story short: just exercise some caution. Machine learning and AI are extremely powerful tools when they're used right. But part of using them right means taking responsibility for ensuring that the data they consume is not only accurate, but free from bias.
Happy data science-ing!