けの〜のブログ

Revolutionizing the education world by doing deep learning with Gakki

About CNNs, Part 5: Summary of the overall CNN structure (ConvNet Architectures)

As a wrap-up of the series so far, I'd like to look at the structure of a CNN as a whole.

 

Layer Patterns

The most common form of a ConvNet architecture stacks a few CONV-RELU layers, follows them with POOL layers, and repeats this pattern until the image has been merged spatially to a small size. At some point, it is common to transition to fully-connected layers. The last fully-connected layer holds the output, such as the class scores. In other words, the most common ConvNet architecture follows the pattern:

For an input image:

{(convolution layer → activation such as ReLU) × N (usually N <= 3)

→ pooling layer (optional)} × M

→ (fully-connected layer → activation such as ReLU) × K

→ fully-connected layer = class scores

So apparently the architecture follows this structure.
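The pattern above can be sketched in a few lines of Python. This just generates the sequence of layer names; the function and the repeat counts N, M, K mirror the pattern quoted above, and none of this is a real framework API:

```python
# Sketch of the common ConvNet layer pattern:
# INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
# (illustrative only: builds a list of layer names, not actual layers)

def convnet_pattern(N=2, M=3, K=2, use_pool=True):
    layers = ["INPUT"]
    for _ in range(M):
        for _ in range(N):
            layers += ["CONV", "RELU"]
        if use_pool:
            layers.append("POOL")  # optional downsampling after each CONV stack
    for _ in range(K):
        layers += ["FC", "RELU"]
    layers.append("FC")  # final FC layer holds the class scores
    return layers

print(convnet_pattern(N=2, M=1, K=1))
# ['INPUT', 'CONV', 'RELU', 'CONV', 'RELU', 'POOL', 'FC', 'RELU', 'FC']
```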

Prefer a stack of small filter CONV to one large receptive field CONV layer.

Stacking convolution layers with small filter sizes apparently lets the network extract more complex features while also keeping the number of parameters to compute small.

 

Recent departures. It should be noted that the conventional paradigm of a linear list of layers has recently been challenged, in Google’s Inception architectures and also in current (state of the art) Residual Networks from Microsoft Research Asia. Both of these (see details below in case studies section) feature more intricate and different connectivity structures.

In practice: use whatever works best on ImageNet. If you’re feeling a bit of a fatigue in thinking about the architectural decisions, you’ll be pleased to know that in 90% or more of applications you should not have to worry about these. I like to summarize this point as “don’t be a hero”: Instead of rolling your own architecture for a problem, you should look at whatever architecture currently works best on ImageNet, download a pretrained model and finetune it on your data. You should rarely ever have to train a ConvNet from scratch or design one from scratch. I also made this point at the Deep Learning school.

As for how to stack the layers, it apparently suffices to download the architecture of a model that scores well on ImageNet and tweak it a little.

 

Research into which architectures work best is being done by large companies such as Google.

 

Next, I'll summarize the hyperparameters to set for each type of layer.

About the input layer

The input layer (that contains the image) should be divisible by 2 many times. Common numbers include 32 (e.g. CIFAR-10), 64, 96 (e.g. STL-10), or 224 (e.g. common ImageNet ConvNets), 384, and 512.

The number of pixels in the input image should apparently be chosen so it can be divided by 2 many times.
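A small check of what "divisible by 2 many times" means for the sizes quoted above; each halving corresponds to one 2×2, stride-2 pooling step the network can apply cleanly:

```python
# Count how many times a size can be halved before becoming odd.
def halvings(n):
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

for size in (32, 64, 96, 224, 384, 512):
    print(size, halvings(size))
# 32 -> 5, 64 -> 6, 96 -> 5, 224 -> 5, 384 -> 7, 512 -> 9
```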

About the conv layers

The conv layers should be using small filters (e.g. 3x3 or at most 5x5), using a stride of S=1, and crucially, padding the input volume with zeros in such way that the conv layer does not alter the spatial dimensions of the input. That is, when F=3, then using P=1 will retain the original size of the input. When F=5, P=2. For a general F, it can be seen that P=(F-1)/2 preserves the input size. If you must use bigger filter sizes (such as 7x7 or so), it is only common to see this on the very first conv layer that is looking at the input image.

Use small filter sizes such as 3×3 or 5×5.

Set the stride (how far the filter slides each step) to 1, and adjust the amount of zero padding so the output keeps the same spatial size as the input image.

For F=3×3, use zero padding of 1.

For F=5×5, use zero padding of 2.

Large filters such as F=7×7 are rarely used, and when they are, it is apparently only in the very first conv layer applied directly to the input image.
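These padding rules follow from the standard conv output-size formula (W − F + 2P)/S + 1; with P = (F − 1)/2 and S = 1 the spatial size is preserved. A quick check (W=224 is just an example input size):

```python
# Standard conv layer output size: (W - F + 2P) / S + 1
def conv_output_size(W, F, S=1, P=0):
    return (W - F + 2 * P) // S + 1

W = 224  # example input width/height
for F in (3, 5, 7):
    P = (F - 1) // 2  # the "same size" padding rule from the text
    print(F, P, conv_output_size(W, F, S=1, P=P))
# F=3 -> P=1, F=5 -> P=2, F=7 -> P=3; output stays 224 in every case
```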

 

Pooling Layer

The pool layers are in charge of downsampling the spatial dimensions of the input. The most common setting is to use max-pooling with 2x2 receptive fields (i.e. F=2), and with a stride of 2 (i.e. S=2). Note that this discards exactly 75% of the activations in an input volume (due to downsampling by 2 in both width and height). Another slightly less common setting is to use 3x3 receptive fields with a stride of 2. It is very uncommon to see receptive field sizes for max pooling that are larger than 3 because the pooling is then too lossy and aggressive. This usually leads to worse performance.

Using 2×2 with stride 2 seems to be standard (anything larger discards too many activations at once).

In practice, it also seems necessary to make compromises, trading off compute resources against the number of parameters.
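The downsampling numbers quoted above can be verified directly; 2×2 max pooling with stride 2 keeps 1 activation out of every 2×2 block, i.e. it discards 75% of them (a 32×32 input is just an example):

```python
# Pooling output size: (W - F) / S + 1 (no padding)
def pool_output_size(W, F=2, S=2):
    return (W - F) // S + 1

W = H = 32  # example input size
out = pool_output_size(W)
kept = (out * out) / (W * H)  # fraction of activations that survive
print(out, kept)  # 16 0.25 -> 75% of activations are discarded
```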

 

 

cs231n.github.io