Regularization
Big weights are bad.
Why? Recall the sigmoid function: huge weights => huge weighted sum => sigmoid saturates => derivative near 0 => no learning at all.
So we want to penalize the network for learning huge weights. How can we do it?
L2 Penalization
We can ask the network to minimize the sum of squared weights in addition to the target function.
So now the "direction" to learn becomes
$$C = C_0 + \frac{\lambda}{2}\sum_w w^2 \quad\Rightarrow\quad \frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \lambda w$$
We asked the network to penalize big weights, and here we are: the learning step now is
$$w \;\leftarrow\; w - \eta\frac{\partial C_0}{\partial w} - \eta\lambda w \;=\; (1 - \eta\lambda)\,w - \eta\frac{\partial C_0}{\partial w}$$
So regardless of the direction of the gradient, the weights always shrink (by the factor \(1 - \eta\lambda\)).
For successful learning, the gradient of the target function must outweigh this weight decay.
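A minimal sketch of one such update step (all names here — eta, lam, w, grad — are illustrative, not from any particular library):

```python
import numpy as np

eta = 0.1   # learning rate
lam = 0.5   # L2 coefficient (lambda)
w = np.array([2.0, -1.0, 0.01])     # current weights
grad = np.array([0.3, -0.3, 0.0])   # gradient of the target function C0

# w <- (1 - eta*lam) * w - eta * dC0/dw :
# every weight is first shrunk by the same factor (1 - 0.05 = 0.95),
# then moved by the task gradient
w_new = (1 - eta * lam) * w - eta * grad
print(w_new)
```

Note that the shrink is multiplicative: big weights lose more in absolute terms than small ones.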
L1 Penalization
We can also ask the network to minimize the absolute values of the weights themselves. In this case
$$C = C_0 + \lambda\sum_w |w| \quad\Rightarrow\quad w \;\leftarrow\; w - \eta\frac{\partial C_0}{\partial w} - \eta\lambda\,\mathrm{sgn}(w)$$
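The same kind of sketch for the L1 step, showing that the decrement \(\eta\lambda\) is identical for every weight no matter how big it is (names again illustrative):

```python
import numpy as np

eta, lam = 0.1, 0.5
w = np.array([2.0, -1.0, 0.01])
grad = np.zeros_like(w)  # pretend the task gradient is zero, for clarity

# w <- w - eta*lam*sgn(w) - eta*dC0/dw :
# a constant step of eta*lam = 0.05 toward zero, independent of |w|
w_new = w - eta * lam * np.sign(w) - eta * grad
print(w_new)  # [1.95, -0.95, -0.04]
```

The smallest weight (0.01) overshoots zero; practical implementations usually clip such weights at exactly zero, which is how L1 produces sparse weights.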
| | L1 Penalization | L2 Penalization |
| --- | --- | --- |
| Learning | Constantly decreases each weight by \(\eta\lambda\), regardless of its value | Decreases each weight linearly, in proportion to its value |
| Choice of weights | The weights are chosen by their importance | All weights can almost freely improve the results (see pict.) |
| Late training | All weight adjustments are equally penalized: less important weights remain zero while more important weights keep growing | Big weights are quadratically penalized: the network would rather play with less important weights than keep increasing an important but already big weight |
Red: L2 regularization lets all weights push off from zero in their gradient directions and thus improve the target function, but then punishes the large weights hard.
Green: L1 regularization resists the change of both small and large weights with the same constant force.
L1 & L2 Mix
A mix of the two can train the important weights while still letting the less important weights improve slightly.
You now have to choose two hyperparameters, \(\lambda_{L1}\) and \(\lambda_{L2}\).
The free growth of small weights that L2 alone would permit is no longer allowed by the L1 term.
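A toy sketch of the mixed (elastic-net-style) update. All coefficients and the constant "task gradients" are made up for illustration: the first weight has a gradient stronger than the L1 pull, so it survives at a size limited by L2; the second weight's gradient is weaker than \(\eta\lambda_{L1}\)'s pull, so L1 drives it to exactly zero.

```python
import numpy as np

eta = 0.1
lam1, lam2 = 0.05, 0.1          # L1 and L2 coefficients (hypothetical)
w = np.array([3.0, 0.2])
grad = np.array([-0.5, -0.02])  # hypothetical constant gradients of C0

for _ in range(1000):
    step = w - eta * (lam1 * np.sign(w) + lam2 * w + grad)
    # if the L1 term pushed a weight across zero, clip it to zero
    # (a common trick; correct here because |grad[1]| < lam1, so zero
    # really is the fixed point for the second weight)
    w = np.where(np.sign(step) == np.sign(w), step, 0.0)

print(w)  # the important weight settles near 4.5, the other at exactly 0.0
```

The first weight settles where the task gradient balances the L1 and L2 pulls (\(\lambda_1 + \lambda_2 w = 0.5\), i.e. \(w = 4.5\)); the second is sparse, exactly as the table above predicts.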