Steps Involved in Selecting a Model (Model Selection)
Model selection is a key ingredient in the long and essential series of steps involved in creating a machine learning (ML) model that will be deployed into production.
This article aims to serve as a guide for machine learning engineers who are new to the model selection process.
We’ll start by understanding what model selection is:
What is Model Selection?
Model selection is the task of selecting a statistical model from a set of candidate models, given data. ~Wikipedia
What this implies is that model selection is a series of tasks (or processes) that helps us determine which statistical model, among several candidates, is best suited to make predictions for a task.
In selecting a model, we start by inspecting our dataset, because everything we do afterward depends on knowing the kind of data we're working with.
Is the dataset clean?
To begin, we look through the dataset for issues like missing data, incorrectly formatted values, etc. This process is called data cleaning: the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. ~Tableau
Trust me! Data cleaning is a lengthy and tiring process. It is a whole subject of its own, so valuable materials to assist those new to it are available in the further reading section below.
What is the size of the dataset?
The next thing we look into is the size of the data. How big is it? Is it big enough to be split into 3 sets (train, validation, and test), or is it so small that we can't even extract a good enough test set (for example, the iris dataset)?
Let's start by identifying how we can address small datasets.
How do we define a small dataset?
A dataset of 1,000 rows or fewer can be considered small. A dataset larger than 1,000 rows can still be considered small depending on the problem you're trying to solve.
If you try to process a small data set naively, it will still work. If you try to process a large data set naively, it will take orders of magnitude longer than acceptable (and possibly exhaust your computing resources as well). ~Carlos Barge
I consider Carlos Barge's criterion more appropriate for distinguishing a small dataset from a large one. What constitutes a large dataset isn't just the number of rows but also the number of columns.
After defining a dataset as small, various steps should be taken to select a model for that dataset.
Note: When performing a model evaluation, consider the rule of thumb for training a model.
Your model should train on at least an order of magnitude more examples than trainable parameters. ~developers.google.com
These steps include:
- Transform categorical columns to numeric (If any)
- Perform a k-fold cross-validation
- Elect candidate models
- Perform Model Evaluation
- Model selection
To explain this better, I will use the iris dataset to walk through the steps listed above. The complete notebook on the model selection process for the iris dataset can be found on my Kaggle page.
Transform categorical columns to numeric
Machine learning models are unable to interpret non-numeric values, so before proceeding, all non-numeric columns need to be transformed to numeric values.
In most cases, the columns that need to be transformed are categorical columns with values like [low, medium, high], [Yes, No], or [Male, Female].
Scikit-learn provides tools built to handle these conversions, including LabelEncoder, OrdinalEncoder, OneHotEncoder, etc. All of these are available in sklearn.preprocessing.
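As a quick illustration, here's a minimal sketch of how two of these encoders might be used (the column names and values here are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical columns for illustration
df = pd.DataFrame({
    "risk": ["low", "medium", "high", "low"],
    "gender": ["Male", "Female", "Female", "Male"],
})

# OrdinalEncoder suits ordered categories (low < medium < high)
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
df["risk_encoded"] = ordinal.fit_transform(df[["risk"]]).ravel()

# OneHotEncoder suits unordered categories
# (the argument is named `sparse` in scikit-learn versions before 1.2)
onehot = OneHotEncoder(sparse_output=False)
gender_encoded = onehot.fit_transform(df[["gender"]])

print(df)
print(gender_encoded)
```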
Articles that provide further clarification on these tools can be found in the further reading section of this article.
Perform a k-fold cross-validation
K-fold cross-validation is a procedure used to estimate the skill of the model on new data. ~Machine Learning Mastery
K-fold cross-validation works by splitting the dataset into a specified number of folds (say 5) and then shifting the test set to a different fold at each iteration.
After performing k-fold cross-validation, we end up with N different train/test splits of the same dataset (where N is the number of folds).
There are two (2) ways to use k-fold cross-validation:
- Using k-fold cross-validation for evaluating a model’s performance
- Using k-fold cross-validation for hyper-parameter tuning
There's a lovely article by Rukshan Pramoditha titled k-fold cross-validation explained in plain English which explains both. We will, however, use k-fold for evaluating model performance in this case.
The purpose of performing k-fold cross-validation here is to make the most of a limited dataset.
What do I mean by this? The iris dataset, for instance, has a total of 150 samples, which is so small that carving out fixed test and validation sets would leave us with very little to train on.
By splitting the dataset into training and test sets across 5 different folds, we maximize the use of the available data for both training and testing.
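Here's a minimal sketch of what this splitting looks like on the iris dataset (5 folds, with variable names of my choosing):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# Split the 150 samples into 5 folds; each fold serves as the
# test set in exactly one iteration.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for i, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"Fold {i}: {len(train_idx)} train samples, {len(test_idx)} test samples")
```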
Elect candidate models
Now that we've successfully split our dataset into 5 folds, we can proceed to elect the candidate models. This is where we look at the kind of task we are solving and the models that can address it.
The Iris dataset is a classification task. It has four (4) feature columns which are `sepal length (cm)`, `sepal width (cm)`, `petal length (cm)`, and `petal width (cm)`. All are continuous feature columns.
By visualizing the dataset, we can tell that the classes are largely linearly separable in the `petal width (cm)` and `petal length (cm)` feature space. Well, this and probably more relationships.
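For instance, a quick scatter plot of the two petal features makes this visible (a sketch, not necessarily the notebook's exact plotting code):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame

# Plot petal length against petal width, coloured by class
scatter = plt.scatter(df["petal length (cm)"], df["petal width (cm)"],
                      c=df["target"], cmap="viridis")
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.legend(*scatter.legend_elements(), title="class")
plt.show()
```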
Question: What models best decide these relationships?
I'll go straight to listing models that can capture these relationships. For more on the reasons we picked these models, check out the further reading section.
We’ll be electing the LogisticRegression, SVC, KNN, and RandomForestClassifier.
Perform Model Evaluation
Now that we’ve decided on the machine learning (ML) models, we can proceed to evaluate the models with our dataset using cross-validation.
We will use sklearn.model_selection.cross_val_score to cross-validate each model and get its performance score on each fold.
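A sketch of the evaluation loop, with all hyperparameters left at their defaults (apart from `max_iter`, which I raise so Logistic Regression converges):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "KNN": KNeighborsClassifier(),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
}

# cross_val_score performs the k-fold splitting internally
# (stratified by class, by default, for classifiers)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.4f}")
```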
Model Selection
After cross-validating the dataset, we can conclude that the best performing models are Logistic Regression and K-Nearest Neighbors, which both have an accuracy of 97.33%.
This implies that either of them would be suitable for deployment. Based on the needs of the problem, we can now decide between the two: if you need a model-based learning algorithm, choose Logistic Regression; if you prefer instance-based learning, choose KNN.
Performing cross-validation experiments like this on a large dataset would be computationally very expensive.
Now that we’ve figured out how to address the smaller datasets, how do we address larger ones?
How do we define a large dataset?
What do I mean by a large dataset? A dataset of about 10,000 rows and upwards is large, while datasets in the range of, say, 2,000 to 10,000 rows are medium-sized. Of course, this metric isn't the best.
A more precise criterion, echoing Carlos Barge: if you try to process a large dataset naively, it will take an unacceptably long time and may exhaust your computing resources.
Once you've determined your dataset is large, what are the steps for selecting a model for it?
Well, unlike with smaller datasets, we can't process this dataset naively; we have to split it. This is where splitting the dataset into three (3) sets for training and evaluation comes into play.
Before we proceed though, let’s list the steps required to select a model for larger datasets:
- Transform Categorical Columns to Numeric (If any)
- Scale Continuous Columns (if necessary)
- Split the Dataset
- Elect Candidate Model
- Perform Model Evaluation
- Model Selection
You can proceed with these steps once you have a cleaned dataset. The House Prices — Advanced Regression Techniques dataset will be used for tutorial purposes as we walk through the steps involved in selecting models for larger datasets.
The House Prices dataset isn't all that large itself, but it should illustrate the concept behind these steps nicely.
The notebook compiling the code for this dataset and the work we did can be found on my Kaggle page.
I'll jump right into splitting the dataset. Below is a sketch of the cleaning and column-transformation code, in case you want to follow along with the House Prices dataset (the notebook contains the exact version; the cleaning choices shown here are illustrative).
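```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumes the Kaggle House Prices train.csv file; the exact
# cleaning choices here are illustrative, not the notebook's.
df = pd.read_csv("train.csv")

# Drop columns that are mostly missing
missing_frac = df.isna().mean()
df = df.drop(columns=missing_frac[missing_frac > 0.5].index)

# Fill remaining gaps: mode for categorical, median for numeric
for col in df.columns:
    if df[col].dtype == "object":
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].median())

# Encode the categorical columns as integers
cat_cols = df.select_dtypes(include="object").columns
df[cat_cols] = OrdinalEncoder().fit_transform(df[cat_cols])

print(df.shape)
```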
Split the Dataset
The reason we perform an evaluation on machine learning (ML) models is to ensure they don't underfit or overfit.
We were able to evaluate the iris dataset (a small dataset) using cross-validation, but since this dataset isn't as small, cross-validating every candidate naively would be computationally expensive.
Therefore, we split the dataset into a train set and a test set. Given the entire dataset has a shape of (1460, 80), and (1460, 74) after cleaning and transformation, we can fit our models on the train set and evaluate their performance on the test set.
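A sketch of the split (the 80/20 ratio is my assumption; the notebook may use a different one):

```python
from sklearn.model_selection import train_test_split

# Separate the target (SalePrice) from the features,
# continuing from the cleaned DataFrame above
X = df.drop(columns=["SalePrice"])
y = df["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)
```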
Elect Candidate Model
Now that we've split the dataset into train and test sets, we proceed to elect models that can solve this task.
We have to understand the dataset. I talked about it in my notebook House Prices Prediction (Beginner) where I gave an overview of the dataset.
We're dealing with a regression task on a dataset with lots of categorical features, so models with both linear and decision-making abilities would be useful, like the Decision Tree Regressor or Random Forest Regressor. Let's go for the Random Forest Regressor, since it's an ensemble of Decision Trees.
We should also pick models like Support Vector Regressor, Linear Regression, and K-Neighbors Regressor so we have candidates to compare.
XGBoost will prove to be a vital tool in your ML journey, and I suggest examining its usage in the XGBoost notebook by Kaggle's Dan Becker. More resources on XGBoost are in the further reading section.
Perform Model Evaluation
Now that we've successfully split our dataset and elected the models we want to use, it's time to see how the individual models perform.
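A sketch of the comparison, continuing from the train/test split above (default hyperparameters throughout):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

models = {
    "LinearRegression": LinearRegression(),
    "SVR": SVR(),
    "KNeighborsRegressor": KNeighborsRegressor(),
    "RandomForestRegressor": RandomForestRegressor(random_state=42),
}

# Fit each candidate on the train set, then score its
# mean absolute error (MAE) on the held-out test set
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name}: MAE = {mean_absolute_error(y_test, preds):,.2f}")
```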
Beyond doubt, the Random Forest Regressor performed best, beating the Linear Regression model by roughly 3x. Since our focus is on model selection, I avoided cross-validating and fine-tuning the models.
In most cases, I would fine-tune and cross-validate each model (using grid search) to find the best score it can produce before making a decision. But the models' default parameters are decent enough for this task, so let's keep it simple.
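For reference, such a fine-tuning pass might look like this (the parameter grid is illustrative, not tuned):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# A small, illustrative grid; real searches would be wider
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(-search.best_score_)  # best cross-validated MAE
```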
Model Selection
After splitting the dataset, electing the candidate model, and performing model evaluation we can come to the conclusion that the Random Forest Regressor will be best suited for deployment having a mean absolute error (MAE) of 6732.92.
Although we didn't fine-tune the model, we could get a much better MAE by fine-tuning the Random Forest Regressor; the point, however, has been established.
You could try out XGBoost and compare it to see if it performs better. What if you fine-tune the XGBoost model as well?
Conclusion
We've shown that model selection is a key ingredient in the lengthy series of steps involved in creating a machine learning (ML) model that will be deployed into production.
We covered the criteria for deciding whether a dataset is small or large, and the reasons for cross-validating smaller datasets and splitting larger ones.
We also talked about why we evaluate models and how we elect candidate models before model evaluation.
I hope this guide proves effective as you apply these steps to your own machine learning tasks.
Further Reading
Data Cleaning
- The Ultimate Guide to Data Cleaning
- Data Cleaning with Python and Pandas
Encoding Categorical Columns
- Encoding Categorical data in Machine Learning
- Guide to Encoding Categorical Features Using Scikit-Learn For Machine Learning
Scikit-Learn Models
- Support Vector Machine
- Random Forest
- K-Nearest Neighbor
- Linear Regression
- Logistic Regression
Further Reading On Model Selection
- A “short” introduction to model selection
- A Gentle Introduction to Model Selection for Machine Learning
Associated Notebooks
- Steps Involved in Selecting a Model For a Small Data-set
- Steps Involved in Selecting a Model For a larger Data-set
Book
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow