Quick Review of CNN and Relevant Terms

CNN

  • Reduce the sizes of filters (keep only useful features)
  • Parameter sharing (How?)
    • A 2D filter slides across the input (convolution), reusing the same weights at every position
      • Convolution extracts features and represents them at a smaller size
        • The convolution results show the similarity at various positions (correlation among data)
  • Inference function: $Y = XW^T$
  • Training: solve for $W$ from $Y - WX = 0$ (a numerical check follows below)
    • $YX^T = WXX^T$
    • $W = YX^T(XX^T)^{-1}$, hence $W$ can be found
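A quick numerical check of this closed-form solution (a minimal sketch, assuming the linear model $Y = WX$ with the training samples stored as columns of $X$):

```python
import numpy as np

# Toy check of W = Y X^T (X X^T)^{-1}, assuming Y = W X with the
# n training samples stored as the columns of X.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 100))     # 4 input features, 100 samples
W_true = rng.standard_normal((2, 4))  # 2 outputs
Y = W_true @ X

W = Y @ X.T @ np.linalg.inv(X @ X.T)
print(np.allclose(W, W_true))         # True: W is recovered exactly
```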

Feature map width = (Width - FilterSize + 2 * Padding ) / (Stride) + 1
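A minimal sketch of this formula (the example numbers below are illustrative, not from the notes):

```python
def conv_output_size(width: int, filter_size: int, padding: int, stride: int) -> int:
    """Feature map width = (Width - FilterSize + 2*Padding) / Stride + 1."""
    return (width - filter_size + 2 * padding) // stride + 1

print(conv_output_size(227, 11, 0, 4))  # 55 (e.g. AlexNet's first conv layer)
print(conv_output_size(32, 5, 2, 1))    # 32: "same" convolution keeps the size
```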

Padding

  • Zero padding: pad the border with zeros
  • Valid convolution means no padding
  • Same convolution means padding the borders so that output size = input size

Pooling (Subsampling)

  • Often used after the convolutional layer to reduce spatial size (only width and height, not the depth)
  • Max pooling: returns the maximum value from the portion of the image covered by the kernel
  • Average pooling: returns the average value from the portion of the image covered by the kernel (both are sketched below)
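A minimal NumPy sketch of both pooling operations (assuming non-overlapping 2x2 windows, i.e. stride = kernel size):

```python
import numpy as np

def pool2d(x, k=2, mode="max"):
    """Non-overlapping k x k pooling over a 2D feature map (H, W divisible by k)."""
    h, w = x.shape
    blocks = x.reshape(h // k, k, w // k, k)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.array([[1, 2, 0, 1],
              [3, 4, 1, 0],
              [0, 1, 5, 6],
              [1, 0, 7, 8]], dtype=float)
print(pool2d(x, mode="max"))  # [[4. 1.] [1. 8.]]
print(pool2d(x, mode="avg"))  # [[2.5 0.5] [0.5 6.5]]
```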

Filters

  • An image is (Width, Height, Depth), where depth = 3 for RGB
  • We can slide a WxHx3 filter over the image and obtain an activation map, which serves as the input to the next layer
  • We use multiple filters, where each filter produces one activation map (see the shape check below)
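A quick shape check (assuming PyTorch; the filter count and sizes are illustrative):

```python
import torch
import torch.nn as nn

# 8 filters, each 5x5x3, slid over one 3-channel 32x32 image:
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5)
image = torch.randn(1, 3, 32, 32)  # (batch, depth, height, width)
maps = conv(image)
print(maps.shape)                  # torch.Size([1, 8, 28, 28]): one activation map per filter
```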

Standard CNN arch

  • Feature Extraction Stage (CNNs) => Flatten => Fully Connected Layers => Out

LeNet (LeCun, 1998)

  • Conv-Pool-Conv-Pool-Flatten-FC-FC-FC-Out (see the sketch after this list)
  • Used Tanh and Sigmoid => suffered from the vanishing-gradient problem
  • Why was LeNet not popular at the time?
    • Limited computational power
    • Weak theoretical backing: researchers favored methods with mathematical proofs, such as SVM
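A minimal PyTorch sketch of the Conv-Pool-Conv-Pool-Flatten-FC-FC-FC listing above, using the classic LeNet-5 sizes (average pooling stands in for the original subsampling layers):

```python
import torch.nn as nn

# LeNet-5-style sketch for a 32x32 grayscale input and 10 classes.
lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),   # 32x32 -> 28x28
    nn.AvgPool2d(2),                             # 28x28 -> 14x14
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),  # 14x14 -> 10x10
    nn.AvgPool2d(2),                             # 10x10 -> 5x5
    nn.Flatten(),                                # 16 * 5 * 5 = 400
    nn.Linear(400, 120), nn.Tanh(),
    nn.Linear(120, 84), nn.Tanh(),
    nn.Linear(84, 10),                           # 10 output classes
)
```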

AlexNet (2012)

  • First prominent CNN to use ReLU (instead of Tanh/Sigmoid)

  • Data Augmentation

    • Image pre-processing and cropping
    • Horizontal reflection
    • Color jittering
      • Step 1: Compute PCA on all RGB pixel values in the training set
      • Step 2: Sample a color offset along the principal components, scaling each component by a random variable drawn from a Gaussian with mean 0 and standard deviation 0.1
      • Step 3: Add the offset to all pixels of a training image (see the sketch after this list)
  • Dropout

    • Randomly drops units and their connections in each weight-update cycle.
      • Hence, every input effectively goes through a different network architecture.
        • This makes it harder for the network to memorize the training data and increases its robustness.
  • Large filter size used (filters in 1st layer are 11x11)

    • Filter size affects the number of trainable parameters, and such a large filter size is too aggressive for training
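A minimal NumPy sketch of the color-jittering steps above. Note the AlexNet paper computes the PCA once over the whole training set; for brevity this sketch computes it per image, and sigma = 0.1 follows the paper:

```python
import numpy as np

def pca_color_jitter(image, sigma=0.1):
    """AlexNet-style color jittering: `image` is float RGB with shape (H, W, 3)."""
    pixels = image.reshape(-1, 3)
    cov = np.cov(pixels - pixels.mean(axis=0), rowvar=False)  # 3x3 RGB covariance
    eigvals, eigvecs = np.linalg.eigh(cov)                    # principal components (step 1)
    alphas = np.random.normal(0.0, sigma, size=3)             # Gaussian draws, std 0.1 (step 2)
    offset = eigvecs @ (alphas * eigvals)                     # offset along the components
    return image + offset                                     # same offset for every pixel (step 3)
```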

VGGNet (2014)

  • Deeper layers
  • Smaller filters (standardized all filters to 3x3)
    • Note: the receptive field of two stacked 3x3 conv layers equals that of one 5x5 layer.
      • However, one 5x5 filter has 5*5 + 1 = 26 params (single channel, with bias),
      • while two 3x3 filters have 2(3*3 + 1) = 20 params
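Generalizing this count to $C$ input and $C$ output channels (a quick extension, not in the original comparison): one 5x5 layer needs $25C^2 + C$ parameters, while two stacked 3x3 layers need $2(9C^2 + C) = 18C^2 + 2C$, so the parameter saving, plus the extra nonlinearity in between, grows with the channel count.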

GoogLeNet (2014)

  • Inception module
    • Parallel 1x1, 3x3 and 5x5 conv filters whose outputs are concatenated along the depth axis to form a same-spatial-size output; 1x1 "bottleneck" convolutions reduce the depth first to keep the larger filters cheap (see the sketch below)
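A minimal PyTorch sketch of an Inception-style module (the branch channel counts are illustrative):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1/3x3/5x5 branches (with 1x1 bottlenecks) concatenated along
    the channel axis; padding keeps every branch at the same spatial size."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, 1)                 # plain 1x1 branch
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 32, 1),  # 1x1 bottleneck reduces depth
                                nn.Conv2d(32, 64, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),
                                nn.Conv2d(16, 32, 5, padding=2))
        self.pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(in_ch, 32, 1))
    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)

y = InceptionModule(192)(torch.randn(1, 192, 28, 28))
print(y.shape)  # torch.Size([1, 192, 28, 28]): 64+64+32+32 channels, same spatial size
```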

Deep Learning for Image Segmentation

  • Local segmentation
  • FCN
  • U-Net

Local segmentation

How is local segmentation done?

  • 1: Extract a patch from the image, then use a classification model to classify its center pixel
  • 2: Re-extract a patch centered at the next pixel, then classify again
  • 3: Repeat 1 and 2 until all pixels have been visited
  • 4: The object can then be located from the resulting per-pixel labels (see the sketch below)
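A minimal sketch of this sliding-window procedure; `classify_patch` is a hypothetical trained classifier that labels the center pixel of a patch (2D grayscale input assumed):

```python
import numpy as np

def local_segmentation(image, classify_patch, patch=65):
    """Classify every pixel by running the network once per patch (hence slow)."""
    half = patch // 2
    padded = np.pad(image, half, mode="reflect")  # so border pixels get full patches
    labels = np.zeros(image.shape, dtype=int)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            window = padded[r:r + patch, c:c + patch]
            labels[r, c] = classify_patch(window)  # steps 1-2, repeated per pixel
    return labels                                  # step 4: per-pixel label map
```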

Drawbacks of Local segmentation

  • Very slow because network is run separately for each patch
  • Patch size determines the localization accuracy (trade-off in the use of context)
    • Larger patch => lower localization accuracy, but better use of context
    • Smaller patch => higher localization accuracy, but worse use of context

FCN (Fully Convolutional Network)

  • Pixelwise prediction: trained end-to-end, pixels-to-pixels
    • Both learning and inference are performed whole-image-at-a-time
  • Supervised (Need training dataset with labels)
  • Outperforms local segmentation in both efficiency and quality
  • Can be applied to images of any resolution
    • In FCN, the input is $H \times W$ and the output is also $H \times W$

How does FCN work?

  • Uses only convolutional layers, downsampling the feature maps from $H \times W$ to $H/32 \times W/32$
    • The convolutional layers contain Conv, Pool and nonlinearity
    • If we upsample the output, we can calculate the pixelwise output (label map).
      • How to upsample the output?

Upsample an image using Convolution with Fractional Strides

  • Convolution with Fractional Strides is also known as Deconvolution, Up Convolution, Transposed Convolution

  • Insert columns and rows of zeros between neighboring columns and rows (see the sketch after this list)

    • The fractional stride indicates how many zero columns and rows are inserted
    • Stride of 1/S: enlarges the image resolution by S times
      • We can use a stride of 1/32 to turn the $H/32 \times W/32$ label map into an $H \times W$ image
        • Is this upsampling approach too aggressive?
      • We can incorporate information from previous (lower-level) feature layers
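A minimal NumPy sketch of the zero-insertion step (a learned filter would then be convolved over the result to interpolate):

```python
import numpy as np

def fractional_stride_upsample(x, s=2):
    """Insert s-1 zero rows/columns between neighbors (stride 1/s): the first step
    of a transposed convolution, enlarging the resolution by s times."""
    h, w = x.shape
    up = np.zeros((h * s, w * s), dtype=x.dtype)
    up[::s, ::s] = x
    return up

x = np.array([[1, 2],
              [3, 4]])
print(fractional_stride_upsample(x))
# [[1 0 2 0]
#  [0 0 0 0]
#  [3 0 4 0]
#  [0 0 0 0]]
```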

Drawbacks of FCN

  • Needs a lot of training pairs (~10k labeled images)
  • Result quality is not satisfactory due to blurry boundaries

U-Net

  • Built upon FCN with two main differences

  • Many feature channels in the upsampling part

    • Allows the network to propagate context information to higher-resolution layers. As a consequence, the expansive path is more or less symmetric to the contracting path, yielding a U-shaped architecture
    • Yields more precise segmentations
  • Excessive data augmentation by applying elastic deformations to the training images

    • Allows the network to learn invariance to such deformations without needing to see them in the annotated image corpus

How does U-Net work?

  • Encoder-Decoder body

  • Skip connections directly transmit information from encoder layers to the corresponding decoder layers

    • Cropping is required before the skip concatenation due to the loss of border pixels in every convolution
  • To predict the pixels in the border region of the image, the missing context is extrapolated by mirroring the input image.

  • Data augmentation applied to both input and ground-truth images together

    • E.g. random crop, flip, translate, rotate, scale, skew, etc.
    • U-Net proposed elastic deformations (see the sketch below)
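A minimal sketch of elastic deformation (assuming SciPy and 2D grayscale images; `alpha` and `sigma` are illustrative magnitude/smoothing values):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, label, alpha=34.0, sigma=4.0, seed=None):
    """Random displacement fields, smoothed with a Gaussian, applied identically
    to the input image and its ground-truth label map."""
    rng = np.random.default_rng(seed)
    shape = image.shape
    dx = gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(shape[0]), np.arange(shape[1]), indexing="ij")
    coords = [ys + dy, xs + dx]
    warped_image = map_coordinates(image, coords, order=1, mode="reflect")
    warped_label = map_coordinates(label, coords, order=0, mode="reflect")  # nearest-neighbor for labels
    return warped_image, warped_label
```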