# Visualizing the effect of $L_1/L_2$ regularization
We use a toy example with two weights $(w_0, w_1)$ to illustrate the effect $L_1$ and $L_2$ regularization has on the solution to a loss minimization problem.
## Table of Contents
1. [Draw the data loss and the $L_1/L_2$ regularization curves](#Draw-the-data-loss-and-the-%24L_1%2FL_2%24-regularization-curves)
2. [Plot the training progress](#Plot-the-training-progress)
3. [$L_1$-norm regularization leads to "near-sparsity"](#%24L_1%24-norm-regularization-leads-to-%22near-sparsity%22)
4. [References](#References)
%% Cell type:code id: tags:
``` python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
import matplotlib.animation as animation
import matplotlib.patches as mpatches
from torch.autograd import Variable
import torch
import math
```
%% Cell type:code id: tags:
``` python
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('pdf', 'png')
plt.rcParams['savefig.dpi'] = 75
plt.rcParams['figure.autolayout'] = False
plt.rcParams['figure.figsize'] = 10, 6
plt.rcParams['axes.labelsize'] = 18
plt.rcParams['axes.titlesize'] = 20
plt.rcParams['font.size'] = 16
plt.rcParams['lines.linewidth'] = 2.0
plt.rcParams['lines.markersize'] = 8
plt.rcParams['legend.fontsize'] = 14
# plt.rcParams['text.usetex'] = True
plt.rcParams['font.family'] = "sans-serif"
plt.rcParams['font.serif'] = "cm"
```
%% Cell type:markdown id: tags:
## Draw the data loss and the $L_1/L_2$ regularization curves
We choose a very simple convex loss function for illustration (drawn in blue), which has its minimum at $W = (3, 2)$:
<!-- Equation labels as ordinary links -->
<divid="eq:loss"></div>
$$
\begin{equation}
loss(W) = 0.5(w_0-3)^2 + 2.5(w_1-2)^2
\label{eq:loss} \tag{1}
\end{equation}
$$
The $L_1$-norm regularizer (aka lasso regression; lasso: least absolute shrinkage and selection operator):
%% Cell type:markdown id: tags:
<divid="eq:l1"></div>
$$
\begin{equation}
L_1(W) = \sum_{i=1}^{|W|} |w_i|
\label{eq:l1} \tag{2}
\end{equation}
$$
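%% Cell type:markdown id: tags:
As a throwaway numeric illustration of Eq. (2) (not part of the notebook's pipeline), the $L_1$ penalty of a two-weight vector is simply the sum of the absolute values of its entries:
%% Cell type:code id: tags:
``` python
import torch

# Illustration of equation (2): L1(W) = |3.0| + |-2.0| = 5.0
W = torch.tensor([3.0, -2.0])
print(W.abs().sum().item())   # 5.0
print(W.norm(p=1).item())     # the same value via the built-in 1-norm
```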
%% Cell type:markdown id: tags:
The $L_2$ regularizer (aka ridge regression or Tikhonov regularization) is actually the square of the $L_2$-norm, not the norm itself, and this little nuance is sometimes overlooked.
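Explicitly, the penalty is the sum of the squared weights:
$$
L_2(W) = \sum_{i=1}^{|W|} w_i^2 = \lVert W \rVert_2^2
$$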
The diamond-shaped contour lines are the values of the $L_1$ regularization term. Since this is a contour diagram, all the points on a given contour line have the same $L_1$ value. <br>
The oval contour lines are the values of the data loss function. The regularized solution tries to find weights that satisfy both the data loss and the regularization loss.
<br><br>
```alpha``` and ```beta``` control the strength of the regularization loss versus the data loss.
To see how the regularizers act "in the wild", set ```alpha``` and ```beta``` to a high value like 10. The regularizers will then dominate the loss, and you will see how each of the regularizers acts.
<br>
Experiment with the value of ```alpha``` to see how it works.
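%% Cell type:markdown id: tags:
As a reference for the plots discussed above, here is a minimal sketch of the kind of training loop involved: plain gradient descent on the data loss of Eq. (1) plus the two regularization terms. The names and values (```loss_fn```, ```alpha```, ```beta```, ```lr```, the starting point, the number of steps) are illustrative assumptions, not the notebook's exact cells.
%% Cell type:code id: tags:
``` python
import numpy as np
import torch
import matplotlib.pyplot as plt

# Minimal sketch: gradient descent on data loss + alpha * L1 penalty + beta * L2 (squared) penalty.
def loss_fn(W):
    # The toy data loss of equation (1): minimum at W = (3, 2)
    return 0.5 * (W[0] - 3) ** 2 + 2.5 * (W[1] - 2) ** 2

alpha, beta = 0.5, 0.5    # regularization strengths (assumed values)
lr = 0.1                  # learning rate (assumed value)

W = torch.tensor([6.0, 6.0], requires_grad=True)   # assumed starting point
trajectory = []
for step in range(200):
    total_loss = loss_fn(W) + alpha * W.abs().sum() + beta * (W ** 2).sum()
    total_loss.backward()
    with torch.no_grad():
        W -= lr * W.grad          # vanilla gradient-descent update
        W.grad.zero_()
    trajectory.append(W.detach().numpy().copy())

# Plot the training progress in weight space
trajectory = np.array(trajectory)
plt.plot(trajectory[:, 0], trajectory[:, 1], '.-')
plt.xlabel("W[0]")
plt.ylabel("W[1]")
```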
## $L_1$-norm regularization leads to "near-sparsity"
$L_1$-norm regularization is often touted as sparsity inducing, but it actually creates solutions that oscillate around 0, not exactly 0 as we'd like. <br>
To demonstrate this, we redefine our toy loss function so that the optimal solution for $w_0$ is close to 0 (0.3).
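A sketch of what that redefinition might look like (only the $w_0$ optimum of 0.3 is stated above; the coefficients and the $w_1$ target are assumptions carried over from Eq. (1)):
%% Cell type:code id: tags:
``` python
# Assumed redefinition: same shape as equation (1), but with the w_0 optimum
# moved close to zero (0.3), so that L1 regularization "should" zero it out.
def loss_fn(W):
    return 0.5 * (W[0] - 0.3) ** 2 + 2.5 * (W[1] - 2) ** 2
```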
%% Cell type:code id: tags:
``` python
# Finally add the axes, so we can see how far we are from the sparse solution.
plt.axhline(0, color='orange')
plt.axvline(0, color='orange');
plt.xlabel("W[0]")
plt.ylabel("W[1]")
```
%% Cell type:markdown id: tags:
But if we look closer at what happens to $w_0$ in the last 100 steps of the training, we see that it oscillates around 0, but never quite lands there. Why?<br>
Well, $\partial L_1/\partial w_0$ has a constant magnitude, so the $L_1$ term adds a fixed amount, ```lr * alpha```, to every update, with a sign that follows the sign of $w_0$. The weight update step:<br>
```W.data = W.data - lr * W.grad.data``` <br>
can be expanded (for $w_0$) to <br>
```w0 = w0 - lr * (alpha * sign(w0) + dloss_fn(W)/dw0)```, where ```dloss_fn(W)/dw0``` <br>is the gradient of ```loss_fn(W)``` with respect to $w_0$. <br>
Whenever $w_0$ gets close to 0, the constant $L_1$ step pushes it across zero, the sign flips, and the next step pushes it back. The oscillations are not constant in size (although they do have a rhythm) because they are also influenced by the data-loss gradient.
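%% Cell type:markdown id: tags:
To make this concrete, here is a tiny standalone simulation of a single weight under this update rule (the values of ```lr```, ```alpha``` and the data-loss target 0.3 are illustrative, not the notebook's exact settings):
%% Cell type:code id: tags:
``` python
import math

# Standalone illustration: one weight, data loss 0.5 * (w - 0.3)**2,
# plus an L1 penalty alpha * |w|, updated with plain gradient descent.
w, lr, alpha = 0.05, 0.1, 0.5
for step in range(8):
    data_grad = (w - 0.3)                      # d/dw of 0.5 * (w - 0.3)**2
    l1_grad = alpha * math.copysign(1.0, w)    # constant magnitude, sign follows w
    w = w - lr * (data_grad + l1_grad)
    print(f"step {step}: w = {w:+.4f}")        # w keeps crossing 0, never settling on it
```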