Regularization
Big weights are bad.
Why? Recall the sigmoid function: huge weights => huge weighted sum => sigmoid saturates => derivative near 0 => no learning at all.
So we want to penalize the network for learning huge weights. How can we do it?
L2 Penalization
We can ask the network to minimize the sum of squared weights in addition to the target function.
So now the "direction" to learn becomes
$$C = C_0 + \frac{\lambda}{2}\sum_w w^2 \quad\Rightarrow\quad \frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \lambda w$$
We asked the network to penalize big weights, and here we are: the learning step now is
$$w \;\leftarrow\; w - \eta\frac{\partial C_0}{\partial w} - \eta\lambda w \;=\; (1 - \eta\lambda)\,w - \eta\frac{\partial C_0}{\partial w}$$
So regardless of the direction of the gradient, the weights always shrink (by the factor \(1 - \eta\lambda\)).
For successful learning, the gradient of the target function must outweigh this weight decay.
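A minimal sketch of one such update step (all names here — eta, lam, w, grad — are illustrative, not from any particular library):

```python
import numpy as np

eta = 0.1   # learning rate
lam = 0.5   # L2 coefficient (lambda)
w = np.array([2.0, -1.0, 0.01])     # current weights
grad = np.array([0.3, -0.3, 0.0])   # gradient of the target function C0

# w <- (1 - eta*lam) * w - eta * dC0/dw :
# every weight is first shrunk by the same factor (1 - 0.05 = 0.95),
# then moved by the task gradient
w_new = (1 - eta * lam) * w - eta * grad
print(w_new)
```

Note that the shrink is multiplicative: big weights lose more in absolute terms than small ones.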
L1 Penalization
We can also ask the network to minimize the absolute values of the weights themselves. In this case
$$C = C_0 + \lambda\sum_w |w| \quad\Rightarrow\quad w \;\leftarrow\; w - \eta\frac{\partial C_0}{\partial w} - \eta\lambda\,\mathrm{sgn}(w)$$
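The same kind of sketch for the L1 step, showing that the decrement \(\eta\lambda\) is identical for every weight no matter how big it is (names again illustrative):

```python
import numpy as np

eta, lam = 0.1, 0.5
w = np.array([2.0, -1.0, 0.01])
grad = np.zeros_like(w)  # pretend the task gradient is zero, for clarity

# w <- w - eta*lam*sgn(w) - eta*dC0/dw :
# a constant step of eta*lam = 0.05 toward zero, independent of |w|
w_new = w - eta * lam * np.sign(w) - eta * grad
print(w_new)  # [1.95, -0.95, -0.04]
```

The smallest weight (0.01) overshoots zero; practical implementations usually clip such weights at exactly zero, which is how L1 produces sparse weights.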
| | L1 Penalization | L2 Penalization |
| --- | --- | --- |
| Learning | Constantly decreases each weight by \(\eta\lambda\), regardless of its value | Decreases each weight linearly, in proportion to its value |
| Choice of weights | The weights are chosen by their importance | All weights can almost freely improve the results (see pict.) |
| Late training | All weight adjustments are equally penalized: less important weights remain zero while more important weights keep growing | Big weights are quadratically penalized: the network would rather play with less important weights than keep increasing an important but already big weight |
Red: L2 regularization lets all weights push off from zero in their gradient directions and thus improve the target function, but then punishes the large weights hard.
Green: L1 regularization resists the change of both small and large weights with the same constant force.
L1 & L2 Mix
A mix of the two can train the important weights while still letting the less important weights improve slightly.
You now have to choose two hyperparameters, \(\lambda_{L1}\) and \(\lambda_{L2}\).
The free growth of small weights that L2 alone would permit is no longer allowed by the L1 term.
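A toy sketch of the mixed (elastic-net-style) update. All coefficients and the constant "task gradients" are made up for illustration: the first weight has a gradient stronger than the L1 pull, so it survives at a size limited by L2; the second weight's gradient is weaker than \(\eta\lambda_{L1}\)'s pull, so L1 drives it to exactly zero.

```python
import numpy as np

eta = 0.1
lam1, lam2 = 0.05, 0.1          # L1 and L2 coefficients (hypothetical)
w = np.array([3.0, 0.2])
grad = np.array([-0.5, -0.02])  # hypothetical constant gradients of C0

for _ in range(1000):
    step = w - eta * (lam1 * np.sign(w) + lam2 * w + grad)
    # if the L1 term pushed a weight across zero, clip it to zero
    # (a common trick; correct here because |grad[1]| < lam1, so zero
    # really is the fixed point for the second weight)
    w = np.where(np.sign(step) == np.sign(w), step, 0.0)

print(w)  # the important weight settles near 4.5, the other at exactly 0.0
```

The first weight settles where the task gradient balances the L1 and L2 pulls (\(\lambda_1 + \lambda_2 w = 0.5\), i.e. \(w = 4.5\)); the second is sparse, exactly as the table above predicts.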