Tackling Overfitting with L1 and L2 Norm Regularisation
Introduction
In the realm of Machine Learning, overfitting is a formidable adversary in the creation of robust and accurate models. Overfitting occurs when a model fits the training data so closely that it loses the ability to generalize to new, unseen data: it latches onto the nuances and noise of the training set rather than the broader patterns, so even minor fluctuations in the features can produce significant shifts in its predictions. Overfit models may look flawless on the training data, yet they often stumble when confronted with fresh, previously unseen observations.
A prime culprit behind overfitting is model complexity. Fortunately, we have a potent tool to counter it: regularisation. Regularisation controls model complexity by adding a penalty on the model's weights to the loss function, so a regularised model strives to strike a balance between minimizing the loss on the training data and keeping its own complexity in check.
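To make that trade-off concrete, here is a minimal sketch of a regularised objective: the data loss plus a lambda-scaled penalty on the weights. The function name and signature are hypothetical, not taken from any particular library.

```python
import numpy as np

def regularised_loss(y_true, y_pred, weights, lam, norm="l2"):
    """Mean squared error plus a lambda-scaled penalty on the weights."""
    data_loss = np.mean((y_true - y_pred) ** 2)
    if norm == "l1":
        penalty = np.sum(np.abs(weights))   # L1: sum of absolute weights
    else:
        penalty = np.sum(weights ** 2)      # L2: sum of squared weights
    return data_loss + lam * penalty        # lambda balances fit vs. complexity
```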
In this article, we will delve into two prevalent regularisation techniques: L1 and L2 regularisation.
These techniques address the complexity conundrum by focusing on two critical aspects:
- Total Number of Features (L1 Regularisation): When a dataset contains a large number of features, especially sparse feature vectors that consist predominantly of zeroes, the resulting high-dimensional feature space makes the model unwieldy. L1 regularisation helps by penalizing the absolute value of the weights, which compels the weights of uninformative features to dwindle to zero over the course of training: because the penalty deducts a roughly fixed small amount from a weight at each step, the weight is eventually driven to exactly zero, leaving a sparser, simpler model.
- Weight Magnitude (L2 Regularisation): L2 regularisation, also known as weight decay, addresses complexity differently. Rather than forcing weights to become exactly zero, it exerts a gentle pull that shrinks them towards zero without ever reaching it, so the weights stay nonzero but small and manageable. The penalty is proportional to the square of each weight's magnitude, which makes it well suited to taming models dominated by a few very large weights (the sketch after this list contrasts the two updates).
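The contrast between the two penalties is easiest to see in how each one updates a single weight during gradient descent. The sketch below uses made-up values for the learning rate and lambda; it illustrates only the penalty terms' update rules, not a full training loop.

```python
import numpy as np

lam, lr, steps = 0.5, 0.1, 50   # illustrative values, not tuned
w_l1 = w_l2 = 1.0               # start both weights at the same value

for _ in range(steps):
    # L1 penalty gradient: lambda * sign(w) -- a fixed-size deduction each step,
    # so the weight reaches exactly zero (clipped to avoid overshooting past it).
    w_l1 -= lr * lam * np.sign(w_l1)
    if abs(w_l1) < lr * lam:
        w_l1 = 0.0

    # L2 penalty gradient: 2 * lambda * w -- shrinkage proportional to the weight,
    # so the weight gets close to zero but never exactly reaches it.
    w_l2 -= lr * 2 * lam * w_l2

print(f"weight under L1 after {steps} steps: {w_l1}")       # exactly 0.0
print(f"weight under L2 after {steps} steps: {w_l2:.5f}")   # small but nonzero
```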
Choosing how strongly to regularise is crucial. A hyperparameter called the regularisation rate (lambda) determines the strength of the penalty.
Setting lambda too high can lead to a model that is overly simplified and prone to underfitting. Conversely, setting lambda too low diminishes the impact of regularisation, potentially causing overfitting. A lambda value of zero eliminates regularisation entirely, heightening the risk of overfitting.
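As a rough illustration of how the rate behaves in practice, the scikit-learn sketch below fits Ridge regression (where lambda is exposed as the `alpha` parameter) on synthetic data at a few illustrative alpha values and prints the size of the learned coefficient vector. The data and the alpha values are assumptions chosen for the demo, not tuned results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data purely for illustration.
X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)

for alpha in [1e-4, 1.0, 100.0, 10000.0]:   # near-zero alpha ~ almost no regularisation
    model = Ridge(alpha=alpha).fit(X, y)
    # Larger alpha values shrink the coefficients more aggressively.
    print(f"alpha={alpha:>8}: coefficient norm = {np.linalg.norm(model.coef_):.2f}")
```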
It’s worth noting that Ridge regression employs L2 regularisation, while Lasso regression relies on L1 regularisation. Elastic Net regression takes a combined approach, blending both L1 and L2 regularisation to strike a balance.
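For completeness, here is a brief scikit-learn sketch instantiating all three estimators on synthetic data; the alpha and l1_ratio settings are illustrative defaults, not recommendations. Counting the coefficients driven exactly to zero highlights the practical difference: Lasso and Elastic Net prune features, while Ridge only shrinks them.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Synthetic data with only a handful of informative features.
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

models = {
    "Ridge (L2)": Ridge(alpha=1.0),
    "Lasso (L1)": Lasso(alpha=1.0),
    "Elastic Net (L1 + L2)": ElasticNet(alpha=1.0, l1_ratio=0.5),
}

for name, model in models.items():
    model.fit(X, y)
    # Lasso and Elastic Net typically zero out uninformative coefficients;
    # Ridge shrinks them but leaves them nonzero.
    zero_coefs = (model.coef_ == 0).sum()
    print(f"{name}: {zero_coefs} of {model.coef_.size} coefficients are exactly zero")
```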
Conclusion
In the world of machine learning, overfitting stands as a formidable obstacle that must be navigated with care. Machine learning models are designed to make predictions about the unknown, and their success hinges on their ability to generalize from training data to new observations. Rather than becoming mired in the intricacies of training points, models need to extract the underlying patterns. Regularisation serves as a guiding light in achieving this objective, striking the balance between complexity and accuracy. In the quest for robust and reliable models, understanding and implementing regularisation techniques like L1 and L2 can make all the difference.