Summary of iconic Image Segmentation methods
Quick Review of CNN and relevant terms
- Reduce the sizes of filters (keep only useful features)
- Parameter sharing (How?)
- A 2D filter slides over the input (convolution)
- Convolution extracts features and represents them at a smaller size
- The convolution result shows how similar the input is to the filter at each position (a correlation measure)
- Inference function:
- Training:
- Hence can be found
Feature map width = (Width - FilterSize + 2 * Padding ) / (Stride) + 1
- Zero padding: add a border of zeros around the input
- Valid convolution means no padding
- Same convolution means padding the borders so that output size = input size (with stride 1)
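The feature map width formula and the two padding modes above can be checked with a few lines of Python. This is a minimal sketch; the helper names `conv_output_size` and `same_padding` are made up for illustration, and floor division is assumed when the stride does not divide evenly.

```python
def conv_output_size(width, filter_size, padding=0, stride=1):
    """Feature map width = (Width - FilterSize + 2 * Padding) / Stride + 1."""
    span = width - filter_size + 2 * padding
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    # Assumption: floor division, so positions where the filter overhangs are dropped.
    return span // stride + 1


def same_padding(filter_size):
    """Padding that keeps output size == input size (stride 1, odd filter size)."""
    return (filter_size - 1) // 2


valid = conv_output_size(32, 5)                           # valid convolution: 28
same = conv_output_size(32, 5, padding=same_padding(5))   # same convolution: 32
strided = conv_output_size(7, 3, stride=2)                # (7 - 3) / 2 + 1 = 3
```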
Pooling (Subsampling)
- Often used after the convolutional layer to reduce spatial size (only width and height, not the depth)
- Max pooling: returns the maximum value from the portion of the image covered by the kernel
- Average pooling: returns the average value from the portion of the image covered by the kernel.
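As a small illustration of the two pooling types, the sketch below pools a single-channel feature map stored as nested lists. The function name `pool2d` and the 2x2 window with stride 2 are assumptions chosen for the example.

```python
def pool2d(image, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2D list of numbers (one channel)."""
    reduce = max if mode == "max" else (lambda window: sum(window) / len(window))
    out_h = (len(image) - size) // stride + 1
    out_w = (len(image[0]) - size) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Collect the values covered by the kernel at this position.
            window = [image[i * stride + di][j * stride + dj]
                      for di in range(size) for dj in range(size)]
            row.append(reduce(window))
        output.append(row)
    return output


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
max_pooled = pool2d(feature_map, mode="max")      # [[6, 4], [8, 9]]
avg_pooled = pool2d(feature_map, mode="average")  # [[3.75, 2.25], [4.5, 4.0]]
```

Pooling is applied to each channel independently, which is why the depth is unchanged.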
- An image has shape (Width, Height, Depth), where depth = 3 for RGB
- We can slide a WxHx3 filter over the image and obtain an activation map, which serves as the input to the next layer
- We use multiple filters, where each filter produces one activation map (one channel of the output)
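The sliding-filter idea can be sketched directly with nested loops. This is an illustrative, unoptimized version assuming stride 1 and no padding; `conv2d` and `conv_layer` are hypothetical names, and, as in most deep learning libraries, the filter is not flipped (so it is strictly a cross-correlation).

```python
def conv2d(image, filt):
    """Slide a k x k x D filter over an H x W x D image (stride 1, no padding)."""
    k = len(filt)
    depth = len(filt[0][0])
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    activation = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            # Sum of elementwise products over the window and all channels.
            for di in range(k):
                for dj in range(k):
                    for c in range(depth):
                        total += image[i + di][j + dj][c] * filt[di][dj][c]
            row.append(total)
        activation.append(row)
    return activation


def conv_layer(image, filters):
    """Each filter produces one activation map; the maps form the output depth."""
    return [conv2d(image, f) for f in filters]


image = [[[1, 1, 1] for _ in range(4)] for _ in range(4)]            # 4 x 4 x 3
ones_filter = [[[1, 1, 1] for _ in range(3)] for _ in range(3)]      # 3 x 3 x 3
twos_filter = [[[2, 2, 2] for _ in range(3)] for _ in range(3)]      # 3 x 3 x 3
maps = conv_layer(image, [ones_filter, twos_filter])                 # 2 maps of 2 x 2
```

Note that the same filter weights are reused at every position, which is the parameter sharing mentioned earlier.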
Standard CNN architecture
- Feature Extraction Stage (CNNs) => Flatten => Fully Connected Layers => Out
LeNet (LeCun, 1998)
- Conv-Pool-Conv-Pool-Flatten-FC-FC-FC-Out
- Used Tanh and Sigmoid => suffers from the vanishing gradient problem
- Why was LeNet not popular at that time?
- Limitation of computational power
- Weak theoretical background: researchers appreciated methods with mathematical proof such as SVM
AlexNet (2012)
Popularized ReLU as the activation function in deep CNNs
Data Augmentation
- Image pre-processing and cropping
- Horizontal reflection
- Color jittering
- step1: Compute PCA on all RGB pixel values in the training images
- step2: Sample a color offset along the principal components at each forward pass, using a random variable drawn from a Gaussian
- step3: Add the offset to all pixels in a training image
Dropout
- Randomly drop out connections between input and output in each weight update cycle.
- Hence, every input goes through a different network architecture.
- This makes it harder for the network to memorize the training data and increases the robustness of the network
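The random dropping described above can be sketched as follows. This assumes the common "inverted dropout" variant, which zeroes activations and rescales the survivors during training; that is a swapped-in simplification, not necessarily the exact scheme AlexNet used. The name `dropout` and the drop probability of 0.5 are illustrative.

```python
import random


def dropout(activations, drop_prob=0.5, training=True, rng=random):
    """Zero each activation with probability drop_prob (inverted dropout)."""
    if not training or drop_prob == 0.0:
        # At inference time the full network is used unchanged.
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Survivors are scaled by 1 / keep_prob so the expected value is preserved.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)                        # seeded only to make runs repeatable
layer_output = [1.0, 2.0, 3.0, 4.0]
first_pass = dropout(layer_output, rng=rng)   # one random sub-network
second_pass = dropout(layer_output, rng=rng)  # usually a different sub-network
inference = dropout(layer_output, training=False)
```

Because a new mask is sampled on every call, each training pass effectively sees a different thinned network.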
Large filter size used (filters in 1st layer are 11x11)
- Filter size affects the number of trainable parameters; a large filter adds many parameters and can be too aggressive for training
VGGNet (2014)
- Deeper layers
- Smaller filter (Standardized all filters to 3x3)
- Note that the receptive field of two stacked 3x3 conv layers equals that of one 5x5 layer.
- However, one 5x5 filter has 5*5+1 = 26 params,
- while two 3x3 filters have 2(3*3+1) = 20 params
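The parameter counts and the receptive field claim above can be verified with a short calculation. The helpers `conv_params` and `stacked_receptive_field` are hypothetical names; a single input and output channel and stride 1 are assumed, matching the arithmetic in the notes.

```python
def conv_params(filter_size, in_channels=1, out_channels=1):
    """Weights of one filter plus one bias, times the number of filters."""
    return (filter_size * filter_size * in_channels + 1) * out_channels


def stacked_receptive_field(filter_sizes):
    """Receptive field of stride-1 conv layers applied one after another."""
    field = 1
    for k in filter_sizes:
        field += k - 1   # each layer widens the field by (k - 1) pixels
    return field


one_5x5 = conv_params(5)                  # 5*5 + 1 = 26
two_3x3 = 2 * conv_params(3)              # 2 * (3*3 + 1) = 20
field = stacked_receptive_field([3, 3])   # 5, same as a single 5x5 filter
```

The stacked version also inserts an extra nonlinearity between the two layers, which a single 5x5 layer does not have.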
GoogLeNet (2014)
- Inception module
- Parallel 1x1, 3x3, and 5x5 conv filters, padded to produce outputs of the same spatial size, which are then concatenated along the depth; 1x1 convs serve as bottleneck layers that reduce the number of channels
Deep Learning for Image Segmentation
- Local segmentation
- FCN (Fully Convolutional Network)
- U-Net
Local segmentation
How is local segmentation done?
- 1: Extract a patch from the image, then use a classification model to classify it
- 2: Extract the patch centered on the next pixel, then use the classification model to classify again
- 3: Repeat 1 and 2 until all pixels have been visited
- 4: The per-patch labels together locate the object
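The four steps above amount to a sliding-window loop. In the sketch below, `classify_patch` is a hypothetical stand-in for a trained classification model (it simply thresholds the mean intensity), and zero padding at the image border is an assumption.

```python
def extract_patch(image, row, col, half):
    """Return the (2*half+1) x (2*half+1) patch centered at (row, col)."""
    height, width = len(image), len(image[0])
    # Assumption: pixels outside the image are treated as zeros.
    return [[image[r][c] if 0 <= r < height and 0 <= c < width else 0
             for c in range(col - half, col + half + 1)]
            for r in range(row - half, row + half + 1)]


def classify_patch(patch):
    """Hypothetical classifier: label 1 if the mean intensity exceeds 0.3."""
    values = [v for row in patch for v in row]
    return 1 if sum(values) / len(values) > 0.3 else 0


def local_segmentation(image, half=1, classifier=classify_patch):
    """Classify one patch per pixel; the labels form the segmentation map."""
    return [[classifier(extract_patch(image, r, c, half))
             for c in range(len(image[0]))]
            for r in range(len(image))]


image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
label_map = local_segmentation(image, half=1)
```

The classifier is called once per pixel, which makes the cost of this approach easy to see: a real network would be run `height * width` times per image.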
Drawbacks of Local segmentation
- Very slow because the network is run separately for each patch
- Patch size determines the localization accuracy (trade-off in the use of context)
- Larger patch => lower localization accuracy, but better use of context
- Smaller patch => higher localization accuracy, but worse use of context
FCN (Fully Convolutional Network)
- Pixelwise prediction: trained end-to-end, pixels-to-pixels
- Both learning and inference are performed whole-image-at-a-time
- Supervised (Need training dataset with labels)
- Outperforms local segmentation in terms of efficiency and quality
- Can be applied to images of any resolution
- In FCN, the input is an image and the output is also an image (a label for every pixel)
How does FCN work?
- Uses only convolutional layers to reduce the image to a coarse, low-resolution output
- The convolutional layers contain conv, pooling, and nonlinearity
- If we upsample the output, we can calculate the pixelwise output (label map).
- How to upsample the output?
Upsample an image using Convolution with Fractional Strides
Convolution with Fractional Strides is also known as Deconvolution, Up Convolution, Transposed Convolution
Insert columns and rows of zero between neighboring columns and rows
- The fractional stride indicates the number of zero columns and rows to be inserted (S - 1 between each pair of neighbors for a stride of 1/S)
- A stride of 1/S enlarges the image resolution by S times
- We can use a stride of 1/32 to turn the coarse label map back into an image with the dimensions of the input
- Is this upsampling approach too aggressive?
- We can incorporate information from previous (lower-level) feature layers
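The zero-insertion view of fractional strides can be sketched on a tiny single-channel example. The names `insert_zeros`, `zero_pad`, `correlate_valid`, and `fractional_stride_conv` are made up for illustration, and the kernel is fixed to all ones here, whereas in FCN it would be learned. With a k x k kernel and an H x H input this sketch produces an output of size (H - 1) * S + k.

```python
def insert_zeros(image, factor):
    """Insert (factor - 1) zero rows/columns between neighboring rows/columns."""
    height, width = len(image), len(image[0])
    out_h = (height - 1) * factor + 1
    out_w = (width - 1) * factor + 1
    stretched = [[0.0] * out_w for _ in range(out_h)]
    for r in range(height):
        for c in range(width):
            stretched[r * factor][c * factor] = float(image[r][c])
    return stretched


def zero_pad(image, pad):
    """Surround a 2D list with a border of zeros."""
    blank = [0.0] * (len(image[0]) + 2 * pad)
    rows = [[0.0] * pad + list(row) + [0.0] * pad for row in image]
    return [list(blank) for _ in range(pad)] + rows + [list(blank) for _ in range(pad)]


def correlate_valid(image, kernel):
    """Ordinary stride-1 sliding-window sum of products, no padding."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(k) for dj in range(k))
             for j in range(out_w)]
            for i in range(out_h)]


def fractional_stride_conv(image, kernel, factor):
    """Stride 1/factor: stretch with zeros, pad by k - 1, then convolve."""
    # The kernel is not flipped; for a learned or symmetric kernel this is immaterial.
    padded = zero_pad(insert_zeros(image, factor), len(kernel) - 1)
    return correlate_valid(padded, kernel)


coarse = [[1, 2],
          [3, 4]]
ones_kernel = [[1.0, 1.0],
               [1.0, 1.0]]
# Each input pixel becomes a 2x2 block:
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
upsampled = fractional_stride_conv(coarse, ones_kernel, factor=2)
```

With the all-ones kernel the result is nearest-neighbor upsampling; a learned kernel lets the network choose its own interpolation.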
Drawbacks of FCN
- Needs a lot of training pairs (~10k labeled images)
- Result quality is not satisfactory due to blurry boundaries
U-Net
Built upon FCN with two main differences
Many feature channels in the upsampling part
- Allows the network to propagate context information to higher resolution layers. Therefore, the expansive path is more or less symmetric to the contracting path, and yields a u-shaped architecture
- Yields more precise segmentations
Excessive data augmentation by applying elastic deformation to the training images
- Allows the network to learn invariance to such deformations without the need to see these transformations in the annotated image corpus
How does U-Net work?
Encoder-Decoder body
Skip connections transmit information directly from an encoder layer to the corresponding decoder layer
- Cropping is required before the encoder features are concatenated, due to the loss of border pixels in every convolution
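The crop-then-concatenate step of a skip connection might look like the sketch below, with feature maps stored as H x W x C nested lists. The names `center_crop` and `skip_connect` and the toy sizes are assumptions for illustration.

```python
def center_crop(feature_map, target_h, target_w):
    """Crop an H x W x C feature map around its center."""
    top = (len(feature_map) - target_h) // 2
    left = (len(feature_map[0]) - target_w) // 2
    return [row[left:left + target_w]
            for row in feature_map[top:top + target_h]]


def skip_connect(encoder_map, decoder_map):
    """Crop the encoder map to the decoder size, then concatenate channels."""
    target_h, target_w = len(decoder_map), len(decoder_map[0])
    cropped = center_crop(encoder_map, target_h, target_w)
    # List addition joins the channel vectors at each spatial position.
    return [[cropped[r][c] + decoder_map[r][c] for c in range(target_w)]
            for r in range(target_h)]


# Toy sizes: a 6 x 6 x 2 encoder map and a 4 x 4 x 3 decoder map.
encoder_map = [[[r, c] for c in range(6)] for r in range(6)]
decoder_map = [[[0, 0, 0] for _ in range(4)] for _ in range(4)]
merged = skip_connect(encoder_map, decoder_map)   # 4 x 4 x 5
```

The encoder map is larger because the decoder side has passed through more unpadded convolutions, each of which trims the border.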
To predict the pixels in the border region of the image, the missing context is extrapolated by mirroring the input image.
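Mirroring the input at the border can be sketched as a reflection of indices. The function name `mirror_pad` is hypothetical, and the reflection is assumed not to repeat the edge pixel itself; the pad width must be smaller than the image size.

```python
def mirror_pad(image, pad):
    """Extend a 2D list by reflecting it about its borders."""
    height, width = len(image), len(image[0])

    def reflect(index, size):
        # Map an out-of-range index back inside by mirroring at the edge.
        if index < 0:
            return -index
        if index >= size:
            return 2 * (size - 1) - index
        return index

    return [[image[reflect(r, height)][reflect(c, width)]
             for c in range(-pad, width + pad)]
            for r in range(-pad, height + pad)]


tile = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
padded = mirror_pad(tile, 1)
# [[5, 4, 5, 6, 5],
#  [2, 1, 2, 3, 2],
#  [5, 4, 5, 6, 5],
#  [8, 7, 8, 9, 8],
#  [5, 4, 5, 6, 5]]
```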
Data augmentation applied to both input and ground-truth images together
- E.g. random crop, flip, translate, rotate, scale, skew, etc.
- U-Net proposed elastic deformations
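The key point of joint augmentation is that the random parameters are sampled once and then applied to both the image and its ground-truth mask. The sketch below shows this for a random crop and a horizontal flip only; elastic deformation is not implemented here, and the names `augment_pair`, `crop`, and `horizontal_flip` are illustrative.

```python
import random


def horizontal_flip(grid):
    """Mirror a 2D list left to right."""
    return [list(reversed(row)) for row in grid]


def crop(grid, top, left, crop_h, crop_w):
    """Cut a crop_h x crop_w window whose top-left corner is (top, left)."""
    return [row[left:left + crop_w] for row in grid[top:top + crop_h]]


def augment_pair(image, mask, crop_h, crop_w, rng):
    """Sample random parameters once and apply them to image and mask alike."""
    top = rng.randint(0, len(image) - crop_h)        # randint is inclusive
    left = rng.randint(0, len(image[0]) - crop_w)
    flip = rng.random() < 0.5
    image_out = crop(image, top, left, crop_h, crop_w)
    mask_out = crop(mask, top, left, crop_h, crop_w)
    if flip:
        image_out = horizontal_flip(image_out)
        mask_out = horizontal_flip(mask_out)
    return image_out, mask_out


image = [[r * 4 + c for c in range(4)] for r in range(4)]
mask = [[1 if value >= 8 else 0 for value in row] for row in image]
image_aug, mask_aug = augment_pair(image, mask, 2, 3, random.Random(0))
```

If the image and mask were augmented with independently sampled parameters, the labels would no longer line up with the pixels they describe.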