Keno's Blog

Doing deep learning with Gakki to revolutionize the world of education

Deep Learning Frameworks: TensorFlow and PyTorch

I want to jot down some notes about TensorFlow and PyTorch.

 

Deep learning involves a huge amount of matrix computation.

Such computation runs much faster on a GPU in practice, so deep learning effectively requires GPU hardware.

 

Running a program on the GPU requires a GPU-specific language, so libraries such as NumPy cannot be run on the GPU as they are.

 

That is why software such as TensorFlow and PyTorch is commonly used (there are many such frameworks).

By using these frameworks, the same code can run on either the CPU or the GPU, and gradients are computed automatically.
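For example, here is a minimal PyTorch sketch (the tensor shape and the function y are only for illustration): the same code runs on CPU or GPU, and autograd computes the gradient.

import torch

# Run on the GPU if one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(3, 3, device=device, requires_grad=True)
y = (x ** 2).sum()   # any scalar function of x
y.backward()         # autograd computes dy/dx automatically
print(x.grad)        # equals 2 * x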

 

TensorFlow is highly versatile.

PyTorch seems geared toward research, while Caffe2 seems easier to apply to production applications.

Making training go well: Transfer Learning

Transfer Learning

In practice, very few people train an entire Convolutional Network from scratch (with random initialization), because it is relatively rare to have a dataset of sufficient size. Instead, it is common to pretrain a ConvNet on a very large dataset (e.g. ImageNet, which contains 1.2 million images with 1000 categories), and then use the ConvNet either as an initialization or a fixed feature extractor for the task of interest. The three major Transfer Learning scenarios look as follows:

Transfer learning means training a ConvNet on a large collection of images such as ImageNet, and then using that network for the task you actually want to solve.

  • ConvNet as fixed feature extractor. Take a ConvNet pretrained on ImageNet, remove the last fully-connected layer (this layer’s outputs are the 1000 class scores for a different task like ImageNet), then treat the rest of the ConvNet as a fixed feature extractor for the new dataset. In an AlexNet, this would compute a 4096-D vector for every image that contains the activations of the hidden layer immediately before the classifier. We call these features CNN codes. It is important for performance that these codes are ReLUd (i.e. thresholded at zero) if they were also thresholded during the training of the ConvNet on ImageNet (as is usually the case). Once you extract the 4096-D codes for all images, train a linear classifier (e.g. Linear SVM or Softmax classifier) for the new dataset.

That is, the approach is to take a network trained on images such as ImageNet

and retrain only the final fully-connected layer.
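A rough PyTorch sketch of the fixed-feature-extractor idea, assuming torchvision's pretrained ResNet-18 and a new task with, say, 10 classes (both assumptions are only for illustration):

import torch.nn as nn
from torchvision import models

# Load a ConvNet pretrained on ImageNet.
model = models.resnet18(pretrained=True)

# Freeze the pretrained weights so they act as a fixed feature extractor.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully-connected layer with a new classifier for our task.
model.fc = nn.Linear(model.fc.in_features, 10)
# Only model.fc will be trained on the new dataset.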

 

  • Fine-tuning the ConvNet. The second strategy is to not only replace and retrain the classifier on top of the ConvNet on the new dataset, but to also fine-tune the weights of the pretrained network by continuing the backpropagation. It is possible to fine-tune all the layers of the ConvNet, or it’s possible to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune some higher-level portion of the network. This is motivated by the observation that the earlier features of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors) that should be useful to many tasks, but later layers of the ConvNet becomes progressively more specific to the details of the classes contained in the original dataset. In case of ImageNet for example, which contains many dog breeds, a significant portion of the representational power of the ConvNet may be devoted to features that are specific to differentiating between dog breeds.

A method in which the later convolutional layers are also retrained, not just the classifier.

 

  • Pretrained models. Since modern ConvNets take 2-3 weeks to train across multiple GPUs on ImageNet, it is common to see people release their final ConvNet checkpoints for the benefit of others who can use the networks for fine-tuning. For example, the Caffe library has a Model Zoo where people share their network weights.

This approach uses weights that various people have already trained and made available.

 

When and how to fine-tune? How do you decide what type of transfer learning you should perform on a new dataset? This is a function of several factors, but the two most important ones are the size of the new dataset (small or big), and its similarity to the original dataset (e.g. ImageNet-like in terms of the content of images and the classes, or very different, such as microscope images). Keeping in mind that ConvNet features are more generic in early layers and more original-dataset-specific in later layers, here are some common rules of thumb for navigating the 4 major scenarios:

  1. New dataset is small and similar to original dataset. Since the data is small, it is not a good idea to fine-tune the ConvNet due to overfitting concerns. Since the data is similar to the original data, we expect higher-level features in the ConvNet to be relevant to this dataset as well. Hence, the best idea might be to train a linear classifier on the CNN codes.
  2. New dataset is large and similar to the original dataset. Since we have more data, we can have more confidence that we won’t overfit if we were to try to fine-tune through the full network.
  3. New dataset is small but very different from the original dataset. Since the data is small, it is likely best to only train a linear classifier. Since the dataset is very different, it might not be best to train the classifier from the top of the network, which contains more dataset-specific features. Instead, it might work better to train the SVM classifier from activations somewhere earlier in the network.
  4. New dataset is large and very different from the original dataset. Since the dataset is very large, we may expect that we can afford to train a ConvNet from scratch. However, in practice it is very often still beneficial to initialize with weights from a pretrained model. In this case, we would have enough data and confidence to fine-tune through the entire network.

How best to do transfer learning differs greatly depending on the nature of the images in your dataset and on how many of them you have.

 

Practical advice. There are a few additional things to keep in mind when performing Transfer Learning:

  • Constraints from pretrained models. Note that if you wish to use a pretrained network, you may be slightly constrained in terms of the architecture you can use for your new dataset. For example, you can’t arbitrarily take out Conv layers from the pretrained network. However, some changes are straight-forward: Due to parameter sharing, you can easily run a pretrained network on images of different spatial size. This is clearly evident in the case of Conv/Pool layers because their forward function is independent of the input volume spatial size (as long as the strides “fit”). In case of FC layers, this still holds true because FC layers can be converted to a Convolutional Layer: For example, in an AlexNet, the final pooling volume before the first FC layer is of size [6x6x512]. Therefore, the FC layer looking at this volume is equivalent to having a Convolutional Layer that has receptive field size 6x6, and is applied with padding of 0.
  • Learning rates. It’s common to use a smaller learning rate for ConvNet weights that are being fine-tuned, in comparison to the (randomly-initialized) weights for the new linear classifier that computes the class scores of your new dataset. This is because we expect that the ConvNet weights are relatively good, so we don’t wish to distort them too quickly and too much (especially while the new Linear Classifier above them is being trained from random initialization).
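For the learning-rate advice, one possible PyTorch sketch uses per-parameter-group learning rates; backbone and classifier are placeholder module names and the values are only illustrative:

import torch.nn as nn
import torch.optim as optim

backbone = nn.Linear(512, 256)    # stands in for the pretrained layers
classifier = nn.Linear(256, 10)   # stands in for the new, randomly initialized head

# Smaller learning rate for the fine-tuned pretrained weights,
# larger learning rate for the fresh classifier.
optimizer = optim.SGD(
    [
        {"params": backbone.parameters(), "lr": 1e-4},
        {"params": classifier.parameters(), "lr": 1e-2},
    ],
    momentum=0.9,
)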



cs231n.github.io

 

 

Making training go well: Regularization, Dropout, Data Augmentation

Regularization

Regularization is a way of limiting the capacity of a neural network in order to prevent overfitting.

 

L2 regularization is perhaps the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. That is, for every weight w in the network, we add the term ½λw² to the objective, where λ is the regularization strength. It is common to see the factor of ½ in front because then the gradient of this term with respect to the parameter w is simply λw instead of 2λw. The L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. As we discussed in the Linear Classification section, due to multiplicative interactions between weights and inputs this has the appealing property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot. Lastly, notice that during the gradient descent parameter update, using the L2 regularization ultimately means that every weight is decayed linearly: W += -lambda * W towards zero.

 

L1 regularization is another relatively common form of regularization, where for each weight w we add the term λ|w| to the objective. It is possible to combine the L1 regularization with the L2 regularization: λ₁|w| + λ₂w² (this is called Elastic net regularization). The L1 regularization has the intriguing property that it leads the weight vectors to become sparse during optimization (i.e. very close to exactly zero). In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the “noisy” inputs. In comparison, final weight vectors from L2 regularization are usually diffuse, small numbers. In practice, if you are not concerned with explicit feature selection, L2 regularization can be expected to give superior performance over L1.

In practice, L2 regularization seems to be the one to use.
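A minimal NumPy sketch of the L2 penalty described above (W, data_loss, and the value of λ are stand-ins, not recommendations):

import numpy as np

lam = 1e-4                    # regularization strength λ (placeholder value)
W = np.random.randn(10, 5)    # stand-in weight matrix
data_loss = 1.0               # stand-in data loss

loss = data_loss + 0.5 * lam * np.sum(W * W)  # add ½λw² for every weight
dW_reg = lam * W                              # gradient of the penalty: linear weight decay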

 

Max norm constraints. Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint. In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector w of every neuron to satisfy ‖w‖₂ < c. Typical values of c are on orders of 3 or 4. Some people report improvements when using this form of regularization. One of its appealing properties is that the network cannot “explode” even when the learning rates are set too high because the updates are always bounded.

About Dropout

Dropout is an extremely effective, simple and recently introduced regularization technique by Srivastava et al. in Dropout: A Simple Way to Prevent Neural Networks from Overfitting (pdf) that complements the other methods (L1, L2, maxnorm). While training, dropout is implemented by only keeping a neuron active with some probability p (a hyperparameter), or setting it to zero otherwise.

Dropout is a technique used mainly in fully-connected layers (it is also used in convolutional layers).

It randomly deactivates neurons (sets their values to zero),

which removes co-dependencies between neurons;

it can also be interpreted as realizing an ensemble of models within a single model.
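A minimal NumPy sketch of (inverted) dropout at training time, in the spirit of the cs231n notes; H stands in for a layer's activations:

import numpy as np

p = 0.5                        # probability of keeping a neuron active
H = np.random.randn(4, 100)    # stand-in hidden-layer activations

# Inverted dropout: zero out neurons at random and rescale by 1/p,
# so no extra scaling is needed at test time.
mask = (np.random.rand(*H.shape) < p) / p
H = H * mask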

 


 

In practice: It is most common to use a single, global L2 regularization strength that is cross-validated. It is also common to combine this with dropout applied after all layers. The value of p=0.5 is a reasonable default, but this can be tuned on validation data.

 

 

 

Data augmentation

The idea is to train on input images that have been mirrored horizontally, randomly cropped, or had their brightness altered, and so on.
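With torchvision, a typical training-time augmentation pipeline might look like the following sketch (the parameter values are placeholders, not recommendations):

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),       # mirror the image
    transforms.RandomResizedCrop(224),       # take a random crop
    transforms.ColorJitter(brightness=0.2),  # jitter the brightness
    transforms.ToTensor(),
])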

 

 

Summary

For regularization, it seems good to apply Batch Normalization first,

and then add L2 regularization, dropout, and so on as needed.

 

cs231n.github.io

 

 

Making training go well: Optimization

During training, the gradient of the loss with respect to the weights w is used to adjust the weights in the direction that decreases the loss.
When the weights have many dimensions, however, points where the gradient becomes locally zero appear, known as local minima and saddle points, and training may stop making progress there.

 

In that case, how should we search for the best weights?

First, looking back at the classic method:


The gradient is used to update the weights in the direction that decreases the loss, as in the sketch below.
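In code, the vanilla update is just the following (a sketch; x and dx stand in for the weights and the gradient of the loss):

import numpy as np

learning_rate = 1e-2
x = np.random.randn(5)     # stand-in weights
dx = np.random.randn(5)    # stand-in gradient of the loss w.r.t. x

x += -learning_rate * dx   # step against the gradient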

 

Various methods have been studied in search of better ways to update the weights.

 

One approach, called momentum, gives the search a velocity (a kind of impetus) during optimization.


The velocity v is initialized to 0, and mu is set to 0.9.
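The momentum update, following the cs231n notes (a sketch; x and dx stand in for the weights and their gradient):

import numpy as np

learning_rate = 1e-2
mu = 0.9                   # momentum ("friction") coefficient
x = np.random.randn(5)     # stand-in weights
dx = np.random.randn(5)    # stand-in gradient
v = np.zeros_like(x)       # velocity, initialized to zero

v = mu * v - learning_rate * dx   # build up velocity along the gradient
x += v                            # move the weights along the velocity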

 

The intuitive picture is that of rolling a ball down the loss surface.

The position of the ball corresponds to the weights being optimized.

 

If the ball has momentum, it can avoid coming to a stop even when it enters a local minimum or saddle point.

 

There is also a method called Nesterov momentum.


 

This method differs from plain momentum in how the update is performed.

The difference is as follows.


The difference comes down to whether the gradient is evaluated before or after the velocity step: Nesterov momentum evaluates the gradient at the look-ahead position that the velocity is about to carry the weights to.
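The Nesterov momentum update in the form given in the cs231n notes (a sketch; dx is the gradient evaluated at the look-ahead position):

import numpy as np

learning_rate = 1e-2
mu = 0.9
x = np.random.randn(5)     # stand-in weights
dx = np.random.randn(5)    # stand-in gradient at the look-ahead point
v = np.zeros_like(x)

v_prev = v
v = mu * v - learning_rate * dx     # velocity update as before
x += -mu * v_prev + (1 + mu) * v    # position update using the look-ahead form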

 

There are other methods as well, such as RMSprop.

Whereas the updates above use a single fixed learning rate for all parameters, these methods adapt the effective learning rate per parameter.

Adagrad, RMSprop, and Adam are methods of this kind.

The update rules for Adagrad, RMSprop, and Adam are sketched in code below.

Adam is roughly a fusion of the ideas behind RMSprop and momentum.
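A sketch of the three update rules, following the cs231n notes (x and dx stand in for the weights and their gradient; the Adam version omits bias correction for brevity, and the hyperparameter values are the usual defaults):

import numpy as np

learning_rate = 1e-3
eps = 1e-8
decay_rate = 0.99            # RMSprop decay
beta1, beta2 = 0.9, 0.999    # Adam coefficients

x = np.random.randn(5)       # stand-in weights
dx = np.random.randn(5)      # stand-in gradient
cache = np.zeros_like(x)
m, v = np.zeros_like(x), np.zeros_like(x)

# Adagrad: accumulate squared gradients; parameters with large past
# gradients get a smaller effective learning rate.
cache += dx ** 2
x += -learning_rate * dx / (np.sqrt(cache) + eps)

# RMSprop: same idea, but with a leaky moving average instead of a sum.
cache = decay_rate * cache + (1 - decay_rate) * dx ** 2
x += -learning_rate * dx / (np.sqrt(cache) + eps)

# Adam: momentum-style smoothing of the gradient (m) combined with
# RMSprop-style per-parameter scaling (v).
m = beta1 * m + (1 - beta1) * dx
v = beta2 * v + (1 - beta2) * (dx ** 2)
x += -learning_rate * m / (np.sqrt(v) + eps)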

 

 

Summary

In practice, it seems good to use Adam configured with the values recommended in the paper:

Recommended values in the paper are eps = 1e-8, beta1 = 0.9, beta2 = 0.999.
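In PyTorch, that configuration would look something like this (the model and the learning rate are placeholders):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)   # stand-in model
optimizer = optim.Adam(model.parameters(), lr=1e-3,
                       betas=(0.9, 0.999), eps=1e-8)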

SGD + Nesterov momentum also seems worth trying.

cs231n.github.io

 

 

Making training go well: Babysitting the learning process and hyperparameter optimization

First, let's go over the procedure for training.

 

1. Preprocess the input data so that it is zero-centered.

2. Decide on the CNN architecture, for example how many convolutional layers to use.

3. Check that the initial value of the loss (e.g. the softmax loss) is a sensible one. Regularization is disabled for this check.

4. Once the initial loss looks sensible, turn regularization on and check that the loss becomes larger than the value from step 3.

Steps 1 through 4 are carried out on a small subset of the training data. Deliberately overfit that subset and confirm that the accuracy reaches 100%. Once this works, you are finally ready to train for real.
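As an example of the check in step 3: with regularization turned off and random weights, the softmax loss should start near -log(1/num_classes), because every class is equally likely. A quick sketch:

import numpy as np

num_classes = 10   # e.g. a 10-class problem
expected_initial_loss = -np.log(1.0 / num_classes)
print(expected_initial_loss)   # about 2.3 for 10 classes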

 

Next, train using the full training data.

1. Set the regularization strength to a small value and search for a learning rate that makes the loss decrease.

If the loss barely changes, the learning rate is too small, so increase it.

If the loss becomes NaN or Inf, the learning rate is too large, so decrease it.


 

These hyperparameter values can be found efficiently using cross-validation.

At first, set the number of epochs (training iterations) low and get a rough estimate of good ranges.

Once you have that estimate, increase the number of epochs and examine the hyperparameter values in more detail.
If, while training with a given hyperparameter setting, the loss stays more than about 3 times larger than the initial loss, stop pursuing that setting.
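A common way to do this search is to sample hyperparameters randomly on a log scale, train briefly, and keep the best settings; a sketch (the ranges and the number of trials are only illustrative):

import numpy as np

for trial in range(10):
    lr = 10 ** np.random.uniform(-6, -1)   # learning rate
    reg = 10 ** np.random.uniform(-5, 2)   # regularization strength
    # ...run a short training job with (lr, reg) and record the result...
    print(f"trial {trial}: lr={lr:.2e}, reg={reg:.2e}")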

 


 

 

If the best results occur near the maximum or minimum of the range you originally guessed, the range itself may not have been appropriate, so you may need to widen it and search again.

Example:

Searching over 1 < x < 10 gives the best result at x = 9.6.

In that case, widen the range to something like 8 < x < 20 and search again.

 

cs231n.github.io

 

 

Making training go well: Batch Normalization

About Batch Normalization

 

Batch Normalization is applied after a convolutional or fully-connected layer: it normalizes the layer's output W*X, forcing the values toward a unit Gaussian distribution, and passes the result on to the next stage.
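A minimal PyTorch sketch of where the BatchNorm layer goes (the layer sizes are placeholders): after the affine layer and before the nonlinearity.

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 64),
    nn.BatchNorm1d(64),   # normalize the output of the affine layer
    nn.ReLU(),            # nonlinearity comes after BatchNorm
    nn.Linear(64, 10),
)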

 

This improves the flow of gradients through the network,

allows higher learning rates to be used,

reduces the need for carefully crafted initialization,

and acts in a sense as regularization, reducing the need for dropout.

 

Batch Normalization. A recently developed technique by Ioffe and Szegedy called Batch Normalization alleviates a lot of headaches with properly initializing neural networks by explicitly forcing the activations throughout a network to take on a unit gaussian distribution at the beginning of the training. The core observation is that this is possible because normalization is a simple differentiable operation. In the implementation, applying this technique usually amounts to inserting the BatchNorm layer immediately after fully connected layers (or convolutional layers, as we’ll soon see), and before non-linearities. We do not expand on this technique here because it is well described in the linked paper, but note that it has become a very common practice to use Batch Normalization in neural networks. In practice networks that use Batch Normalization are significantly more robust to bad initialization. Additionally, batch normalization can be interpreted as doing preprocessing at every layer of the network, but integrated into the network itself in a differentiable manner. Neat!

With Batch Normalization, the network becomes more robust to bad initialization, and it can be interpreted as performing preprocessing at every layer of the network.

cs231n.github.io

 

Making training go well: Weight Initialization

About Weight Initialization

 

If the initial weights are all the same, that is, if the weight filters w0 through wn all start with the same values, there is no point in having multiple filters (they would all just compute the same thing).

So the filters must start with different values.

 

However, if the weights are too small, the activations shrink toward zero as they pass through layer after layer.

If they are too large, the activations blow up (and with saturating activations such as tanh, the gradients get killed).

 

Either way, the weight updates can no longer be carried out efficiently.

 

With activations such as tanh, initializing with the Xavier scheme apparently preserves the scale of the activations across layers and works well.

However, with activations such as ReLU, negative values are zeroed out, so that scheme no longer keeps the scale right and training does not go well.

 

This is why weight initialization is so important for training.

 

In practice, the current recommendation is to use ReLU units and use the w = np.random.randn(n) * sqrt(2.0/n), as discussed in He et al..

In practice, when using ReLU as the activation function, it seems best to initialize the weights in this way.
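As a concrete sketch of that recommendation (n is the fan-in of the layer; the value is a placeholder):

import numpy as np

n = 512                                    # number of inputs feeding into the layer
w = np.random.randn(n) * np.sqrt(2.0 / n)  # He initialization for ReLU layers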

 

 

Initializing the biases. It is possible and common to initialize the biases to be zero, since the asymmetry breaking is provided by the small random numbers in the weights. For ReLU non-linearities, some people like to use small constant value such as 0.01 for all biases because this ensures that all ReLU units fire in the beginning and therefore obtain and propagate some gradient. However, it is not clear if this provides a consistent improvement (in fact some results seem to indicate that this performs worse) and it is more common to simply use 0 bias initialization.

For the biases, initializing them to 0 seems to be the usual practice.

cs231n.github.io