What is overfitting in machine learning?

Sooner or later, most beginners in machine learning run into the same problem, and the same question follows: why does the model I am developing predict almost perfectly on the training data set, yet miss badly on a new one? At BETWEEN we know who the culprit is, and it's called overtraining or overfitting.

What is overfitting?

Overfitting in machine learning is a phenomenon in which a predictive algorithm achieves a low success rate on new data, producing forecasts with high variance. It typically appears in one of these situations:

The sample used to train the model is not very representative of the reality that the algorithm will have to face later.
The sample includes too many variables, some of them irrelevant, which confuse the model and prevent it from identifying the underlying trend.
The optimal number of epochs (the number of times the model processes the same input data during training) has been exceeded.
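
The optimal number of epochs in the last point can be estimated by tracking the error on data held out from training and noting where it stops improving. The following is only a minimal sketch of that idea: the function name `best_epoch`, the `patience` parameter and the loss values are all made up for illustration and do not come from any particular library.

```python
def best_epoch(val_losses, patience=2):
    """Return (epoch, loss) for the lowest validation loss seen before
    `patience` consecutive epochs pass without improvement.
    Epochs are 1-indexed; an empty list yields (0, inf)."""
    best_loss = float("inf")
    best_at = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss, best_at, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss has stopped improving
    return best_at, best_loss


# Illustrative (made-up) validation losses: they fall, bottom out, then rise.
val_losses = [0.90, 0.62, 0.48, 0.41, 0.39, 0.40, 0.44, 0.51, 0.60]
epoch, loss = best_epoch(val_losses)
print(epoch, loss)  # prints: 5 0.39
```

In this made-up run, training beyond epoch 5 only makes the model worse on data it has not seen, which is the point where overfitting starts.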

The opposite problem is underfitting, which also makes the model's predictions unreliable, in this case because they show a high bias. Underfitting occurs when the input data are insufficient to establish generalizations, or when they offer little information about the question to be answered. A common example of the latter is insisting on building a linear regression to forecast the future from a sample drawn from too short a period.

How do you know if you are overtraining your machine learning model?

There is an unmistakable sign that a machine learning model is overfitting: with the training data set, its success rate is around 100%, but when it processes new records, that rate falls to half or less. Overtraining has led the model to reproduce with pinpoint precision the characteristics of what it already knows, but it has hampered its ability to generalize to data it has not seen.
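
This gap between training and new data can be reproduced with a toy experiment. The sketch below is an assumption-laden illustration, not a recipe: it invents a noisy one-dimensional data set and uses a 1-nearest-neighbour classifier, a model that simply memorizes its training records. All function names are hypothetical.

```python
import random


def make_data(n, rng, noise=0.3):
    """Points in [0, 1): the true label is 1 when x > 0.5,
    but a share of the labels (`noise`) is flipped at random."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        data.append((x, label))
    return data


def predict_1nn(train, x):
    """Return the label of the closest training point (pure memorization)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]


def accuracy(train, data):
    """Share of records in `data` that the memorizing model labels correctly."""
    hits = sum(predict_1nn(train, x) == y for x, y in data)
    return hits / len(data)


rng = random.Random(0)       # fixed seed so the run is repeatable
train = make_data(200, rng)
test = make_data(200, rng)   # new records drawn from the same process

train_acc = accuracy(train, train)
test_acc = accuracy(train, test)
print(f"training accuracy: {train_acc:.2f}")
print(f"test accuracy:     {test_acc:.2f}")
print(f"gap:               {train_acc - test_acc:.2f}")
```

Because the model has memorized the noise in its training labels, it scores perfectly on the records it has already seen and clearly worse on new ones. A large gap of this kind is the warning sign described above.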

Underfitting, on the other hand, is diagnosed when the machine learning model provides poor results with both the training sample and unknown input records.


How to prevent overfitting in machine learning?

To avoid or solve overfitting in machine learning, we can resort to various techniques that improve model training and correct inappropriate deviations in the results. Some of them are:

Continue training with a new data set. This usually works when the lack of success is attributable to a low representativeness of the original data set.
Divide the sample into two parts. We will use one to train the machine learning algorithm, and the other to run a test that verifies whether it works correctly.
Subdivide the sample into several smaller data sets and train the model with them, each subset serving in turn as the validation set for a model trained on the others (cross-validation).
Simplify the records, eliminating variables that do not provide meaningful information and instead generate noise that makes it difficult for the algorithm to detect key patterns.
Tune the number of training epochs, stopping training at the point where overfitting starts to set in.
If we suspect that the cause of the failure is the poor representativeness of the sample, clean the data, removing redundant records, so that it better reflects the distinctive characteristics of the realities that the algorithm is going to process.
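
The two splitting techniques in this list can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than a production implementation: the helper names (`hold_out_split`, `k_fold_indices`, `cross_validate`), the toy "predict the mean" model and the synthetic data are all invented for the example.

```python
import random


def hold_out_split(n, test_fraction=0.2, seed=0):
    """Shuffle the record indices 0..n-1 and split them into (train, test)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices and deal them into k folds
    whose sizes differ by at most one."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, score, k=5):
    """Train on k-1 folds, score on the held-out fold, repeat for every fold."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train_idx = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(
            score(model, [xs[i] for i in held_out], [ys[i] for i in held_out])
        )
    return scores


# Toy model (hypothetical): always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def mse(model, xs, ys):
    """Mean squared error of the constant prediction `model`."""
    return sum((y - model) ** 2 for y in ys) / len(ys)


xs = list(range(20))
ys = [2 * x + 1 for x in xs]   # synthetic targets

train_idx, test_idx = hold_out_split(len(xs))
scores = cross_validate(xs, ys, fit_mean, mse, k=5)
print("hold-out sizes:", len(train_idx), len(test_idx))
print("error per fold:", [round(s, 1) for s in scores])
```

Every record is used for validation exactly once, so the average of the per-fold errors gives a steadier estimate of how the model behaves on unseen data than a single train/test split.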

Entering the world of machine learning involves a lot of trial and error, so don't despair if your models don't work the first time. IT professionals know well that their daily work consists of experimenting, correcting and learning from mistakes before achieving success. If this situation sounds familiar to you, why not come to BETWEEN to continue growing in your professional career thanks to our constantly updated range of job offers in IT? Place yourself in the best possible position to experience the developments coming to the sector in the next few years, such as the expansion of big data or the widespread adoption of the HTTP/3 protocol for a faster Internet. With BETWEEN you will have no limits!
