Guide to Choosing a Machine Learning Algorithm


In machine learning, “regression” is the term for a range of algorithms which estimate the value of one variable, called the dependent variable, from the values of a range of other variables called independent variables.  For example, when trying to predict the dependent variable of house price, we might build a regression model that uses size, zip code, number of bedrooms, and whether or not there is a view as the independent variables. There are many different types of regression models available to choose from, so how do you know which one best models your data?

Linear Regression is a simple model whose goal is to fit a straight line (or more generally, a hyperplane) to data points in a feature space mapped onto Cartesian coordinates.  To fit a line to the data points in this scenario, we simply minimize the sum of the least squares error (https://en.wikipedia.org/wiki/Least_squares).

The figure below is a visualization of the residuals in a linear model. The model attempts to find a blue line whose distance between the line and data (i.e. the red lines), when squared and summed is minimized.

Source: https://en.wikipedia.org/wiki/Linear_regression

Linear regression is an effective method to model the relationships between a set of variables so the value of one may be predicted from the values of the others; but, when trying to predict a variable that measures the rate of an event, we need a different approach.  In this case we assume the distribution of the event conforms to the Poisson distribution, which models the probability of a given number of events occurring in a specified interval.

We can then perform a Poisson regression – for example, there tends to be four car accidents per day on the freeway on-ramp near your office.  You want to use machine learning to predict the probability that there will be just one accident on a given day.  Our first assumption is, unfortunately, a shaky one in the real world: that the car accidents are randomly distributed in time.  Traffic volumes vary throughout the day, but for the sake of this example we will assume the Poisson distribution is a close-enough fit.  We already know the average number of accidents is four, and a quick calculation with the Poisson formula gives us approximately 7% probability of just one accident on a given day.  Perhaps weather, sporting events at a nearby stadium, and holidays might affect traffic dense enough to affect the likelihood of a collision?  To test this, we can build a generalized linear regression model that predicts the number of car accidents on a particular day, while taking advantage of the fact that the natural log of the dependent variable for a given time interval is a linear combination of the independent variables with their coefficients, plus the natural log of the time interval.  We can then use this Poisson model to predict the probability that an accident occurred given the features, and whether each of the three features are significantly related to the outcome.

In contrast to regression, classification algorithms are used to categorize data into groups.  The model can then make predictions about which category a new instance belongs to based on high-level features.  Taking the classic example of Fisher’s Iris Flower Data (https://en.wikipedia.org/wiki/Iris_flower_data_set), if we measure the sepal width and length along with the petal width and length of a set of flowers, a classification model can be used to predict which species the flower belongs to.  The limitation of this model, of course, is that all test data must belong to one of the categories present in the training set.  So, for example, if only three species of Iris were predicted in the training set, the model will predict that any unknown flower belongs to one of these even if it is in fact a fourth species.  No analogous problem exists for the continuous predictions of regression models.

When deciding on a classification model, you must first ask how many categories, or classes, are present in the training data.  Many problems are binary, and the data may be divided into two classes.  Perhaps whether or not a customer will make a purchase, while other problems try to predict which of several categories a data item should be assigned to.

A number of binary classification algorithms exist including support vector machines (SVMs), logistic regression, and random forests.  The SVM algorithm defines a boundary separating positive and negative classes by attempting to maximize the distance between the boundary and the nearest positive/negative instance.  SVMs are particularly powerful when constructing models that contain hundreds of features.  Logistic regression, perhaps the most popular binary classifier, attempts to find the Sigmoid function – a probabilistic function to separate positive and negative classes.  Logistic regression is useful because it is a fast-training linear model, and the approach has been shown to deliver results competitive with more complex methods.  Random forests containing decision trees are highly accurate; each tree in the forest splits the data by feature to generate a binary decision.  Aggregating each of these decisions via bootstrapping or boosting is a key aspect to constructing a random forest that combats over-fitting.

When building a model to sort data into three or more classes, multi-class classification models are required.  In fact, most binary classification algorithms can be generalized to handle more classes by effectively reframing the question.  So, when assigning data to categories A, B and C, one can rephrase this into a series of binary decisions – A or not A, B or not B, and C or not C.  Binary classification algorithms, such as logistic regression and random forests, require only mild tweaking to become multi-class classification algorithms.  To classify more complex data structures such as images and audio clips, neural networks are often used.  Neural networks are created by aggregating many models together in a deep, connected network, with a softmax filter applied to the final layer to force all of the outputs to probabilities.

In conclusion, the first step in the selection of an appropriate algorithm for your data science problem is to develop a deep and nuanced understanding of the problem itself.  Once this is achieved, it should be simple to decide whether the problem is one of predicting a value, in which case regression approaches should be explored, or predicting a category, in which case regression is the way to go.  Each of these broad categories is filled with analytical approaches, and an appreciation of the problem domain and data you have should guide you further.  For example, a quick K-Nearest Neighbors analysis might clarify how many categories you have, or a need for a clear understanding of the effect of a specific feature on an outcome might make a logistic regression classifier more valuable than a convolutional neural network.  In the end, data science is much more than a set of techniques that can be blindly applied, and the value of a data scientist should always be to bring an understanding of both the problem context and the limitations of different types of analyses to ensure the resulting predictions are valid.