Quick Review of CNN and Relevant Terms

CNN

  • Reduce the sizes of filters (keep only useful features)
  • Parameter sharing (How?)
    • A 2D filter slides across the input (convolution), reusing the same weights at every position
      • Convolution extracts features and represents them at a smaller size
        • The convolution results show the similarity at various positions (correlation among data)
  • Inference function: $Y = XW^T$
  • Training: solve for $W$ from $Y - WX = 0$ (a numerical check follows below)
    • $YX^T = WXX^T$
    • $W = YX^T(XX^T)^{-1}$, hence $W$ can be found
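A quick numerical check of this closed-form solution (a minimal sketch, assuming the linear model $Y = WX$ with the training samples stored as columns of $X$):

```python
import numpy as np

# Toy check of W = Y X^T (X X^T)^{-1}, assuming Y = W X with the
# n training samples stored as the columns of X.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 100))     # 4 input features, 100 samples
W_true = rng.standard_normal((2, 4))  # 2 outputs
Y = W_true @ X

W = Y @ X.T @ np.linalg.inv(X @ X.T)
print(np.allclose(W, W_true))         # True: W is recovered exactly
```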

Feature map width = (Width - FilterSize + 2 * Padding ) / (Stride) + 1
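A minimal sketch of this formula (the example numbers below are illustrative, not from the notes):

```python
def conv_output_size(width: int, filter_size: int, padding: int, stride: int) -> int:
    """Feature map width = (Width - FilterSize + 2*Padding) / Stride + 1."""
    return (width - filter_size + 2 * padding) // stride + 1

print(conv_output_size(227, 11, 0, 4))  # 55 (e.g. AlexNet's first conv layer)
print(conv_output_size(32, 5, 2, 1))    # 32: "same" convolution keeps the size
```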

Padding

  • Zero padding: pad the border with zeros
  • Valid convolution means no padding
  • Same convolution means padding the borders so that output size = input size

Pooling (Subsampling)

  • Often used after the convolutional layer to reduce spatial size (only width and height, not the depth)
  • Max pooling: returns the maximum value from the portion of the image covered by the kernel
  • Average pooling: returns the average value from the portion of the image covered by the kernel (both are sketched below)
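A minimal NumPy sketch of both pooling operations (assuming non-overlapping 2x2 windows, i.e. stride = kernel size):

```python
import numpy as np

def pool2d(x, k=2, mode="max"):
    """Non-overlapping k x k pooling over a 2D feature map (H, W divisible by k)."""
    h, w = x.shape
    blocks = x.reshape(h // k, k, w // k, k)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.array([[1, 2, 0, 1],
              [3, 4, 1, 0],
              [0, 1, 5, 6],
              [1, 0, 7, 8]], dtype=float)
print(pool2d(x, mode="max"))  # [[4. 1.] [1. 8.]]
print(pool2d(x, mode="avg"))  # [[2.5 0.5] [0.5 6.5]]
```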

Filters

  • An image is (Width, Height, Depth), where depth = 3 for RGB
  • We can slide a WxHx3 filter over the image and obtain an activation map, which serves as the input to the next layer
  • We use multiple filters, where each filter produces one activation map (see the shape check below)
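A quick shape check (assuming PyTorch; the filter count and sizes are illustrative):

```python
import torch
import torch.nn as nn

# 8 filters, each 5x5x3, slid over one 3-channel 32x32 image:
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5)
image = torch.randn(1, 3, 32, 32)  # (batch, depth, height, width)
maps = conv(image)
print(maps.shape)                  # torch.Size([1, 8, 28, 28]): one activation map per filter
```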

Standard CNN arch

  • Feature Extraction Stage (CNNs) => Flatten => Fully Connected Layers => Out

LeNet (LeCun, 1998)

  • Conv-Pool-Conv-Pool-Flatten-FC-FC-FC-Out (see the sketch after this list)
  • Used Tanh and Sigmoid => suffered from the vanishing-gradient problem
  • Why was LeNet not popular at the time?
    • Limited computational power
    • Weak theoretical backing: researchers favored methods with mathematical proofs, such as SVM
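A minimal PyTorch sketch of the Conv-Pool-Conv-Pool-Flatten-FC-FC-FC listing above, using the classic LeNet-5 sizes (average pooling stands in for the original subsampling layers):

```python
import torch.nn as nn

# LeNet-5-style sketch for a 32x32 grayscale input and 10 classes.
lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),   # 32x32 -> 28x28
    nn.AvgPool2d(2),                             # 28x28 -> 14x14
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),  # 14x14 -> 10x10
    nn.AvgPool2d(2),                             # 10x10 -> 5x5
    nn.Flatten(),                                # 16 * 5 * 5 = 400
    nn.Linear(400, 120), nn.Tanh(),
    nn.Linear(120, 84), nn.Tanh(),
    nn.Linear(84, 10),                           # 10 output classes
)
```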

AlexNet (2012)

  • First prominent CNN to use ReLU (instead of Tanh/Sigmoid)

  • Data Augmentation

    • Image pre-processing and cropping
    • Horizontal reflection
    • Color jittering
      • Step 1: Compute PCA on all RGB pixel values in the training set
      • Step 2: Sample a color offset along the principal components, scaling each component by a random variable drawn from a Gaussian with mean 0 and standard deviation 0.1
      • Step 3: Add the offset to all pixels of a training image (see the sketch after this list)
  • Dropout

    • Randomly drops units and their connections in each weight-update cycle.
      • Hence, every input effectively goes through a different network architecture.
        • This makes it harder for the network to memorize the training data and increases its robustness.
  • Large filter size used (filters in 1st layer are 11x11)

    • Filter size affects the number of trainable parameters, and such a large filter size is too aggressive for training
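A minimal NumPy sketch of the color-jittering steps above. Note the AlexNet paper computes the PCA once over the whole training set; for brevity this sketch computes it per image, and sigma = 0.1 follows the paper:

```python
import numpy as np

def pca_color_jitter(image, sigma=0.1):
    """AlexNet-style color jittering: `image` is float RGB with shape (H, W, 3)."""
    pixels = image.reshape(-1, 3)
    cov = np.cov(pixels - pixels.mean(axis=0), rowvar=False)  # 3x3 RGB covariance
    eigvals, eigvecs = np.linalg.eigh(cov)                    # principal components (step 1)
    alphas = np.random.normal(0.0, sigma, size=3)             # Gaussian draws, std 0.1 (step 2)
    offset = eigvecs @ (alphas * eigvals)                     # offset along the components
    return image + offset                                     # same offset for every pixel (step 3)
```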

VGGNet (2014)

  • Deeper layers
  • Smaller filters (standardized all filters to 3x3)
    • Note: the receptive field of two stacked 3x3 conv layers equals that of one 5x5 layer.
      • However, one 5x5 filter has 5*5 + 1 = 26 params (single channel, with bias),
      • while two 3x3 filters have 2(3*3 + 1) = 20 params
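Generalizing this count to $C$ input and $C$ output channels (a quick extension, not in the original comparison): one 5x5 layer needs $25C^2 + C$ parameters, while two stacked 3x3 layers need $2(9C^2 + C) = 18C^2 + 2C$, so the parameter saving, plus the extra nonlinearity in between, grows with the channel count.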

GoogLeNet (2014)

  • Inception module
    • Parallel 1x1, 3x3 and 5x5 conv filters whose outputs are concatenated along the depth axis to form a same-spatial-size output; 1x1 "bottleneck" convolutions reduce the depth first to keep the larger filters cheap (see the sketch below)
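A minimal PyTorch sketch of an Inception-style module (the branch channel counts are illustrative):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1/3x3/5x5 branches (with 1x1 bottlenecks) concatenated along
    the channel axis; padding keeps every branch at the same spatial size."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, 1)                 # plain 1x1 branch
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 32, 1),  # 1x1 bottleneck reduces depth
                                nn.Conv2d(32, 64, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),
                                nn.Conv2d(16, 32, 5, padding=2))
        self.pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(in_ch, 32, 1))
    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)

y = InceptionModule(192)(torch.randn(1, 192, 28, 28))
print(y.shape)  # torch.Size([1, 192, 28, 28]): 64+64+32+32 channels, same spatial size
```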

Deep Learning for Image Segmentation

  • Local segmentation
  • FCN
  • U-Net

Local segmentation

How is local segmentation done?

  • 1: Extract a patch from the image, then use a classification model to classify its center pixel
  • 2: Re-extract a patch centered at the next pixel, then classify again
  • 3: Repeat 1 and 2 until all pixels have been visited
  • 4: The object can then be located from the resulting per-pixel labels (see the sketch below)
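A minimal sketch of this sliding-window procedure; `classify_patch` is a hypothetical trained classifier that labels the center pixel of a patch (2D grayscale input assumed):

```python
import numpy as np

def local_segmentation(image, classify_patch, patch=65):
    """Classify every pixel by running the network once per patch (hence slow)."""
    half = patch // 2
    padded = np.pad(image, half, mode="reflect")  # so border pixels get full patches
    labels = np.zeros(image.shape, dtype=int)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            window = padded[r:r + patch, c:c + patch]
            labels[r, c] = classify_patch(window)  # steps 1-2, repeated per pixel
    return labels                                  # step 4: per-pixel label map
```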

Drawbacks of Local segmentation

  • Very slow because network is run separately for each patch
  • Patch size determines the localization accuracy (trade-off in the use of context)
    • Larger patch => lower localization accuracy, but better use of context
    • Smaller patch => higher localization accuracy, but worse use of context

FCN (Fully Convolutional Network)

  • Pixelwise prediction: trained end-to-end, pixels-to-pixels
    • Both learning and inference are performed whole-image-at-a-time
  • Supervised (Need training dataset with labels)
  • Outperforms local segmentation in both efficiency and quality
  • Can be applied to images of any resolution
    • In FCN, the input is $H \times W$ and the output is also $H \times W$

How does FCN work?

  • Uses only convolutional layers, downsampling the feature maps from $H \times W$ to $H/32 \times W/32$
    • The convolutional layers contain Conv, Pool and nonlinearity
    • If we upsample the output, we can calculate the pixelwise output (label map).
      • How to upsample the output?

Upsample an image using Convolution with Fractional Strides

  • Convolution with Fractional Strides is also known as Deconvolution, Up Convolution, Transposed Convolution

  • Insert columns and rows of zeros between neighboring columns and rows (see the sketch after this list)

    • The fractional stride indicates how many zero columns and rows are inserted
    • Stride of 1/S: enlarges the image resolution by S times
      • We can use a stride of 1/32 to turn the $H/32 \times W/32$ label map into an $H \times W$ image
        • Is this upsampling approach too aggressive?
      • We can incorporate information from previous (lower-level) feature layers
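A minimal NumPy sketch of the zero-insertion step (a learned filter would then be convolved over the result to interpolate):

```python
import numpy as np

def fractional_stride_upsample(x, s=2):
    """Insert s-1 zero rows/columns between neighbors (stride 1/s): the first step
    of a transposed convolution, enlarging the resolution by s times."""
    h, w = x.shape
    up = np.zeros((h * s, w * s), dtype=x.dtype)
    up[::s, ::s] = x
    return up

x = np.array([[1, 2],
              [3, 4]])
print(fractional_stride_upsample(x))
# [[1 0 2 0]
#  [0 0 0 0]
#  [3 0 4 0]
#  [0 0 0 0]]
```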

Drawbacks of FCN

  • Needs a lot of training pairs (~10k labeled images)
  • Result quality is not satisfactory due to blurry boundaries

U-Net

  • Built upon FCN with two main differences

  • Many feature channels in the upsampling part

    • Allows the network to propagate context information to higher-resolution layers. As a consequence, the expansive path is more or less symmetric to the contracting path, yielding a U-shaped architecture
    • Yields more precise segmentations
  • Excessive data augmentation by applying elastic deformations to the training images

    • Allows the network to learn invariance to such deformations without needing to see them in the annotated image corpus

How does U-Net work?

  • Encoder-Decoder body

  • Skip connections directly transmit information from encoder layers to the corresponding decoder layers

    • Cropping is required before the skip concatenation due to the loss of border pixels in every convolution
  • To predict the pixels in the border region of the image, the missing context is extrapolated by mirroring the input image.

  • Data augmentation applied to both input and ground-truth images together

    • E.g. random crop, flip, translate, rotate, scale, skew, etc.
    • U-Net proposed elastic deformations (see the sketch below)
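A minimal sketch of elastic deformation (assuming SciPy and 2D grayscale images; `alpha` and `sigma` are illustrative magnitude/smoothing values):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, label, alpha=34.0, sigma=4.0, seed=None):
    """Random displacement fields, smoothed with a Gaussian, applied identically
    to the input image and its ground-truth label map."""
    rng = np.random.default_rng(seed)
    shape = image.shape
    dx = gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(shape[0]), np.arange(shape[1]), indexing="ij")
    coords = [ys + dy, xs + dx]
    warped_image = map_coordinates(image, coords, order=1, mode="reflect")
    warped_label = map_coordinates(label, coords, order=0, mode="reflect")  # nearest-neighbor for labels
    return warped_image, warped_label
```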