
Making training go well: Regularization, Dropout, Data augmentation

Making training go well: Regularization

Regularization is a method of preventing overfitting by limiting the capacity of the neural network.

 

L2 regularization is perhaps the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. That is, for every weight w in the network, we add the term 1/2 λw² to the objective, where λ is the regularization strength. It is common to see the factor of 1/2 in front because then the gradient of this term with respect to the parameter w is simply λw instead of 2λw. The L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. As we discussed in the Linear Classification section, due to multiplicative interactions between weights and inputs this has the appealing property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot. Lastly, notice that during the gradient descent parameter update, using the L2 regularization ultimately means that every weight is decayed linearly: W += -lambda * W towards zero.
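As a rough illustration, here is a minimal numpy sketch of how the 1/2 λw² penalty and its λW gradient (i.e. linear weight decay) enter a gradient-descent step. The weight matrix, `lambda_`, the learning rate, and the dummy data loss/gradient are all illustrative placeholders, not code from the course notes.

```python
import numpy as np

# Minimal sketch of L2 regularization: add 1/2 * lambda * sum(W**2) to the
# loss and lambda * W to the gradient, so each step decays W toward zero.
np.random.seed(0)
W = 0.01 * np.random.randn(10, 5)   # illustrative weight matrix
lambda_ = 1e-4                      # regularization strength
learning_rate = 1e-2

def l2_loss_and_grad(W, data_loss, data_grad):
    reg_loss = 0.5 * lambda_ * np.sum(W * W)   # 1/2 * lambda * ||W||^2
    reg_grad = lambda_ * W                     # gradient of the penalty
    return data_loss + reg_loss, data_grad + reg_grad

# one parameter update with dummy data loss/gradient
loss, grad = l2_loss_and_grad(W, data_loss=1.0, data_grad=np.zeros_like(W))
W += -learning_rate * grad                     # weights shrink linearly toward 0
```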

 

L1 regularization is another relatively common form of regularization, where for each weight w we add the term λ|w| to the objective. It is possible to combine the L1 regularization with the L2 regularization: λ₁|w| + λ₂w² (this is called Elastic net regularization). The L1 regularization has the intriguing property that it leads the weight vectors to become sparse during optimization (i.e. very close to exactly zero). In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the “noisy” inputs. In comparison, final weight vectors from L2 regularization are usually diffuse, small numbers. In practice, if you are not concerned with explicit feature selection, L2 regularization can be expected to give superior performance over L1.
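For comparison, a minimal numpy sketch of the L1 and Elastic-net penalties and of the L1 subgradient that pushes weights toward exact zeros; `lambda1` and `lambda2` are illustrative strengths, not values from the notes.

```python
import numpy as np

# Minimal sketch of the L1 and Elastic-net penalties for a weight matrix W.
lambda1, lambda2 = 1e-4, 1e-4       # illustrative regularization strengths

def l1_penalty(W):
    return lambda1 * np.sum(np.abs(W))

def elastic_net_penalty(W):
    return lambda1 * np.sum(np.abs(W)) + lambda2 * np.sum(W * W)

def l1_subgradient(W):
    # lambda1 * sign(W): a constant-size push toward exactly zero,
    # which is what drives many weights to become sparse
    return lambda1 * np.sign(W)

W = np.random.randn(10, 5)
print(l1_penalty(W), elastic_net_penalty(W))
```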

In practice, using L2 regularization seems to be a good choice.

 

Max norm constraints. Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint. In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector w of every neuron to satisfy ‖w‖₂ < c. Typical values of c are on the order of 3 or 4. Some people report improvements when using this form of regularization. One of its appealing properties is that the network cannot “explode” even when the learning rates are set too high because the updates are always bounded.
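A minimal numpy sketch of that projection step, assuming each row of W holds one neuron's incoming weight vector; the shapes and c = 3 are illustrative.

```python
import numpy as np

# Minimal sketch of a max-norm constraint: after the usual parameter update,
# rescale each neuron's weight vector so that its L2 norm is at most c.
c = 3.0

def project_max_norm(W, c):
    norms = np.linalg.norm(W, axis=1, keepdims=True)   # per-neuron norms
    scale = np.minimum(1.0, c / (norms + 1e-12))       # shrink only rows with norm > c
    return W * scale

W = np.random.randn(4, 100)       # 4 neurons, 100 incoming weights each
W = project_max_norm(W, c)        # enforce ||w||_2 <= c for every row
```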

About Dropout

Dropout is an extremely effective, simple and recently introduced regularization technique by Srivastava et al. in Dropout: A Simple Way to Prevent Neural Networks from Overfitting (pdf) that complements the other methods (L1, L2, maxnorm). While training, dropout is implemented by only keeping a neuron active with some probability p (a hyperparameter), or setting it to zero otherwise.

Dropout is a technique used mainly in fully connected layers (it is also used in convolution layers).

It randomly deactivates neurons (by setting their values to 0),

which removes the dependencies between neurons.

Alternatively, it can be interpreted as realizing a model ensemble within a single model (see the sketch below).

 

f:id:keno-lasalle-kagoshima:20171127105018p:plain
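Below is a minimal numpy sketch of the inverted-dropout variant applied to one layer's activations; the keep probability p = 0.5, the batch size, and the layer width are illustrative. Because the activations are rescaled by 1/p at training time, the test-time forward pass needs no change.

```python
import numpy as np

p = 0.5   # probability of keeping a neuron active

def dropout_train(h):
    # randomly zero activations with probability 1 - p, and rescale by 1/p
    # so the expected activation matches the test-time forward pass
    mask = (np.random.rand(*h.shape) < p) / p
    return h * mask

def dropout_test(h):
    return h   # with inverted dropout, nothing special is needed at test time

h = np.maximum(0, np.random.randn(32, 100))   # e.g. ReLU activations of a batch
h_dropped = dropout_train(h)
```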

 

In practice: It is most common to use a single, global L2 regularization strength that is cross-validated. It is also common to combine this with dropout applied after all layers. The value of p=0.5 is a reasonable default, but this can be tuned on validation data.
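As a sketch of that tuning loop, here is a small grid search over a global L2 strength and the dropout keep probability; `train_and_evaluate` is a hypothetical placeholder standing in for an actual training run that returns validation accuracy.

```python
import itertools
import random

def train_and_evaluate(l2_strength, keep_prob):
    # hypothetical stand-in: replace with a real training run that
    # returns accuracy on a held-out validation set
    return random.random()

l2_strengths = [1e-5, 1e-4, 1e-3, 1e-2]   # cross-validated global L2 strength
keep_probs = [0.5, 0.7]                   # p = 0.5 as a reasonable default

best = max(
    (train_and_evaluate(lam, p), lam, p)
    for lam, p in itertools.product(l2_strengths, keep_probs)
)
print("best val accuracy %.3f with lambda=%g, p=%.1f" % best)
```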

 

 

 

Data augmentation

The network is trained on input images that have been mirror-flipped, cropped to only a part of the image, or had their brightness adjusted.
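A minimal numpy sketch of those three augmentations on a single H×W×3 image with values in [0, 1]; the crop size and brightness range are illustrative choices.

```python
import numpy as np

def augment(img, crop_size=24, rng=np.random):
    # mirror the image left-right with probability 0.5
    if rng.rand() < 0.5:
        img = img[:, ::-1, :]
    # take a random crop of crop_size x crop_size
    h, w, _ = img.shape
    top = rng.randint(0, h - crop_size + 1)
    left = rng.randint(0, w - crop_size + 1)
    img = img[top:top + crop_size, left:left + crop_size, :]
    # jitter brightness by a random factor and clip back to [0, 1]
    img = np.clip(img * rng.uniform(0.8, 1.2), 0.0, 1.0)
    return img

img = np.random.rand(32, 32, 3)   # dummy image for illustration
aug = augment(img)
```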

 

 

Summary

For regularization, it seems good to first apply Batch Normalization, and then add L2 regularization, Dropout, and the like on top of that.

 

cs231n.github.io