Difference between CNN and DNN

DNN (Deep Neural Network)

  • In the DNN approach, when we deal with an image matrix, we flatten the image into a vector (for example with a flatten layer).
    • We lose the spatial information about each pixel's neighborhood.
    • Spatial information is very important.
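To make the loss of neighborhood information concrete, here is a minimal sketch in plain Python. The 4x4 "image" and the helper `flat_index` are made up for illustration, and row-major flattening is assumed.

```python
# A tiny 4x4 "image" whose pixel values are just their (row, col) labels.
height, width = 4, 4
image = [[(r, c) for c in range(width)] for r in range(height)]

# Row-major flattening: rows are laid out one after another in a single vector.
flat = [pixel for row in image for pixel in row]


def flat_index(r, c, width):
    """Position of pixel (r, c) in the flattened vector (hypothetical helper)."""
    return r * width + c


# Pixels (1, 2) and (2, 2) are direct vertical neighbours in the image...
i = flat_index(1, 2, width)
j = flat_index(2, 2, width)
print(flat[i], flat[j])
# ...but in the flattened vector they sit a full row (4 positions) apart.
print(j - i)
```

A dense layer sees only the flat vector, so nothing tells it that positions `i` and `j` were adjacent in the original image.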

A Code Example of MNIST Model using DNN (97.5% Accuracy)

import tensorflow as tf

input_size = 784         # 28 x 28 pixels, flattened into one vector
output_size = 10         # one output per digit class
hidden_layer_size = 100

# Three fully-connected hidden layers followed by a softmax output layer.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(hidden_layer_size, input_shape=(input_size,), activation='relu'),
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
    tf.keras.layers.Dense(output_size, activation='softmax')
])

model.summary()

Extra Info From CNN Explainer

CNNs utilize a special type of layer, aptly named a convolutional layer, that makes them well-positioned to learn from image and image-like data. Regarding image data, CNNs can be used for many different computer vision tasks, such as [image processing, classification, segmentation, and object detection](http://ijcsit.com/docs/Volume 7/vol7issue5/ijcsit20160705014.pdf).

CNN (Convolutional Neural Network)

  • In the CNN approach, the network solves the problem without flattening the image first.
    • Apply kernel filters to every possible position of the image (Convolution Layer).
    • Divide the output of the convolution layer into small, non-overlapping squares, then keep only the largest number from each square, as it is the strongest detail (Pooling Layer).

The CNN approach can greatly reduce the dimensionality of the problem.

  • If we convolve and pool for long enough, we can reduce the dimensions to a vector with one entry per category, matching one-hot encoded labels such as dog, cat, horse, and so on.

A Code Example of MNIST Model using CNN (99.2% Accuracy)

import tensorflow as tf

l2 = tf.keras.regularizers.l2

model = tf.keras.Sequential([
    # Four CONV -> POOL stages; 'same' padding keeps width and height
    # unchanged in each convolution, and each pooling step shrinks them.
    tf.keras.layers.Conv2D(64, 3, activation='relu', input_shape=(28, 28, 1), padding='same'),
    tf.keras.layers.MaxPool2D(strides=2),

    tf.keras.layers.Conv2D(128, 3, activation='relu', padding='same'),
    tf.keras.layers.MaxPool2D(strides=2),

    tf.keras.layers.Dropout(0.2),

    tf.keras.layers.Conv2D(256, 3, activation='relu', padding='same'),
    tf.keras.layers.MaxPool2D(strides=2),

    tf.keras.layers.Conv2D(256, 3, activation='relu', padding='same'),
    tf.keras.layers.MaxPool2D(strides=2),

    # Flatten the final volume and classify with L2-regularized dense layers.
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100, kernel_regularizer=l2(0.01), activation='relu'),
    tf.keras.layers.Dense(50, kernel_regularizer=l2(0.01), activation='relu'),
    tf.keras.layers.Dense(10, kernel_regularizer=l2(0.01), activation='softmax')
])

model.summary()

Usage of CNN

CNNs are mainly used for image recognition and other tasks that deal with visual data.

  • Spatial proximities are preserved.
  • Certain details, such as a human eye, are looked for everywhere in the photo.

These two advantages make a CNN's predictive power much higher than that of a regular NN, especially when it comes to image-related problems.

Real-Life Examples:

  • Robot Vision
  • Self-driving Cars
  • Facebook Face-Tagging
  • Face Recognition for unlocking phones
  • Detection: Breast Cancer Screening
  • Semantic Segmentation: Biomedical Image Analysis

Which Companies Use CNNs Often?

Basically all the large tech companies and many tech startups.

  • Apple
  • Google
  • Microsoft
  • Amazon
  • Tesla

The applications of CNN so far seem more or less out of reach for most people.

Overview of CNN

Here I will try to explain CNNs (Convolutional Neural Networks) as well as I can.

Unlike a regular Neural Network, the layers of a ConvNet have neurons arranged in 3 dimensions: width (x), height (y), and depth (channels).

A color image usually has 3 channels (R, G, B).

A simple ConvNet is a sequence of layers, and every layer of a ConvNet transforms one volume of activations to another through a differentiable function. We use three main types of layers to build ConvNet architectures: Convolutional Layer, Pooling Layer, and Fully-Connected Layer (exactly as seen in regular Neural Networks). We will stack these layers to form a full ConvNet architecture.

  • Input -> CONV(+ Activation Function and Padding) -> POOL -> FC(Output)

In practice, we might repeat these layers multiple times.

Convolutional Layer (CONV)

The convolutional layers are the foundation of a CNN, as they contain the learned kernels (weights), which extract features that distinguish different images from one another. The output of a convolutional layer is produced by convolving each input channel with its own kernel.

Convolution

First, we must understand how convolution is computed:

  • We need an image and a kernel filter to perform image convolution.
  • The kernel slides across the width and height of the input volume and computes the dot product between the entries of the filter and the input at each position.
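The sliding dot product described above can be sketched for a single channel in plain Python. The function name `conv2d` and the example values are illustrative only. Like most deep learning libraries, the sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d(image, kernel, stride=1):
    """Slide a square `kernel` over a 2D `image` (no padding).

    At each position, multiply the overlapping entries and sum them up.
    """
    f = len(kernel)                              # kernel is f x f
    out_h = (len(image) - f) // stride + 1
    out_w = (len(image[0]) - f) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for a in range(f):
                for b in range(f):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [1, 0],
    [0, 1],
]

feature_map = conv2d(image, kernel)
print(feature_map)   # [[2, 4, 6], [0, 2, 4], [6, 0, 2]]
```

This toy kernel responds most strongly where both the top-left and bottom-right pixels of a window are large, which is a crude way of saying it looks for diagonal structure.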

Note: In practice,

  • Our image has several color channels (depth).
  • We must also decide the spatial extent (size) of the kernel.
  • We might have more than one kernel filter.
  • Lastly, we must specify the stride with which we slide the filter.

Kernel Filter

Also called the weights.

The convolutional neuron performs an elementwise dot product with a unique kernel and the output of the previous layer’s corresponding neuron. This will yield as many intermediate results as there are unique kernels. The convolutional neuron is the result of all of the intermediate results summed together with the learned bias.
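As a rough sketch of that computation at a single output position, the snippet below pairs each input channel with its own kernel, sums the per-channel results, and adds a bias. The helper `conv_neuron` and all numbers are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def conv_neuron(patches, kernels, bias):
    """Output of one convolutional neuron at one position.

    patches[c] is the F x F window taken from input channel c,
    kernels[c] is the kernel paired with that channel.
    """
    total = bias
    for patch, kernel in zip(patches, kernels):
        # Intermediate result for this channel: elementwise product, summed.
        total += sum(
            p * k
            for patch_row, kernel_row in zip(patch, kernel)
            for p, k in zip(patch_row, kernel_row)
        )
    return total


# Hypothetical 2x2 windows from the R, G and B channels of an image.
patches = [
    [[1, 2], [3, 4]],   # R
    [[0, 1], [1, 0]],   # G
    [[2, 2], [2, 2]],   # B
]
kernels = [
    [[1, 0], [0, 1]],   # kernel for R -> 1 + 4 = 5
    [[1, 1], [1, 1]],   # kernel for G -> 0 + 1 + 1 + 0 = 2
    [[0, 1], [0, 0]],   # kernel for B -> 2
]

value = conv_neuron(patches, kernels, bias=1)
print(value)   # 5 + 2 + 2 + 1 = 10
```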

Kernel size, often also referred to as filter size, refers to the dimensions of the sliding window over the input. Choosing this hyperparameter has a massive impact on the image classification task. For example, small kernel sizes are able to extract a much larger amount of information containing highly local features from the input. A smaller kernel size also leads to a smaller reduction in layer dimensions, which allows for a deeper architecture.

Stride

We must specify the stride when we slide the filter.

When the stride is 1, we slide the filter one pixel at a time.

When the stride is n, the filter jumps n pixels at a time as we slide it around.

  • A larger stride produces spatially smaller output volumes.
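A small sketch of how the stride affects the number of filter positions along one axis. It assumes no padding, and the helper name `window_starts` is made up for illustration.

```python
def window_starts(input_size, filter_size, stride):
    """Start positions of the filter along one axis (no padding)."""
    return list(range(0, input_size - filter_size + 1, stride))


# A width-7 input with a width-3 filter:
print(window_starts(7, 3, 1))   # [0, 1, 2, 3, 4] -> 5 outputs
print(window_starts(7, 3, 2))   # [0, 2, 4]       -> 3 outputs
print(window_starts(7, 3, 3))   # [0, 3]          -> 2 outputs
```

Each start position yields one output value, so the output shrinks as the stride grows.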

Zero Padding

Zero padding is used to control the spatial size of the output volumes.

When we add zero padding to our input, the image is surrounded by a border of zeros.

This alters the size of the output.

In TensorFlow we usually don't have to choose the thickness of the zero padding ourselves: we pick a padding mode, and the suitable thickness is computed for us.

With padding 'valid', the filter window always stays inside the image, so the output is smaller than the input.

With padding 'same', enough zeros are added that, at stride 1, the output is the same size as the input.

What is the difference between ‘SAME’ and ‘VALID’ padding in tf.nn.max_pool of tensorflow?

  • valid = no padding
  • same = padding so that the output size is the same as the input size (at stride 1)
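To illustrate what the zero border looks like, here is a minimal plain-Python sketch. The helper `zero_pad` is hypothetical and is not how TensorFlow implements padding internally.

```python
def zero_pad(image, p):
    """Surround a 2D list with a border of zeros that is `p` pixels thick."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]           # top border
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)   # left and right border
    padded.extend([0] * width for _ in range(p))       # bottom border
    return padded


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
padded = zero_pad(image, 1)
for row in padded:
    print(row)

# For a 3x3 kernel with stride 1:
#   'valid' (no padding): 3 - 3 + 1 = 1 output position per axis
#   'same'  (p = 1):      5 - 3 + 1 = 3 positions, the same size as the input
```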

Activation Functions

I have already introduced activation functions here: Neural Networks and DNN Explained

Non-linearity is necessary to produce non-linear decision boundaries, so that the output cannot be written as a linear combination of the inputs. If a non-linear activation function was not present, deep CNN architectures would devolve into a single, equivalent convolutional layer, which would not perform nearly as well.
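The collapse can be seen even in a scalar toy example. The sketch below uses made-up weights and one-dimensional "layers" rather than real convolutions, but convolution is also a linear operation, so the same argument applies.

```python
def linear(x, w, b):
    return w * x + b


def relu(x):
    return max(0.0, x)


# Hypothetical weights and biases for two stacked layers.
w1, b1 = 2.0, -1.0
w2, b2 = -3.0, 0.5


def two_linear_layers(x):
    return linear(linear(x, w1, b1), w2, b2)


def collapsed_layer(x):
    # The same function written as ONE linear layer.
    return linear(x, w2 * w1, w2 * b1 + b2)


def with_relu(x):
    # A ReLU between the layers prevents the collapse.
    return linear(relu(linear(x, w1, b1)), w2, b2)


for x in [-2.0, 0.0, 0.5, 3.0]:
    print(x, two_linear_layers(x), collapsed_layer(x), with_relu(x))
```

The first two result columns always agree, while the ReLU version differs wherever the first layer's output is negative.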

The Computation of CONV Layer

The Convolutional Layer accepts a volume of size $W_{in} \times H_{in} \times D_{in}$ (Width x Height x Depth).

4 Hyperparameters are required:

  • Number of Kernel Filters ($K$)
  • Spatial Extent of Kernel Filters ($F$)
  • The Stride ($S$)
  • The amount of Zero Padding ($P$)

With the above Hyperparameters, the Convolutional Layer will produce a volume of size $W_{out} \times H_{out} \times D_{out}$:

  • $W_{out} = \frac{(W_{in} - F + 2P)}{S} + 1$
  • $H_{out} = \frac{(H_{in} - F + 2P)}{S} + 1$
  • $D_{out} = K$
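A minimal sketch of the output-size computation for a CONV layer, where W, H are the input width and height, K the number of kernels, F the kernel size, S the stride, and P the padding per side. The function name is made up, and the use of floor division when the filter does not tile the input evenly is an assumption that mirrors what common frameworks do.

```python
def conv_output_shape(w_in, h_in, k, f, s, p):
    """Output (width, height, depth) of a CONV layer.

    k: number of kernels, f: kernel size, s: stride, p: zero padding per side.
    The input depth does not affect the output shape, so it is not needed here.
    """
    w_out = (w_in - f + 2 * p) // s + 1
    h_out = (h_in - f + 2 * p) // s + 1
    return (w_out, h_out, k)


# 28x28 input, 64 kernels of size 3x3, stride 1:
print(conv_output_shape(28, 28, k=64, f=3, s=1, p=1))   # (28, 28, 64), 'same'-style padding
print(conv_output_shape(28, 28, k=64, f=3, s=1, p=0))   # (26, 26, 64), no padding
print(conv_output_shape(28, 28, k=64, f=3, s=2, p=0))   # (13, 13, 64), larger stride
```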

Pooling Layer (POOL)

The POOL layer performs a downsampling operation along the spatial dimensions (width, height).

  • The purpose of the POOL layer is to gradually decrease the spatial extent of the representation as it passes through the network.

Stride

As in the CONV layer, we must specify the stride when we slide the pooling filter.

When the stride is 1, we slide the filter one pixel at a time.

When the stride is n, the filter jumps n pixels at a time as we slide it around.

  • A larger stride produces spatially smaller output volumes.

Pooling Operation

In pooling, a filter slides across the width and height of the input volume.

It returns the max, min, or average of the values inside the filter at each position.

In max pooling, it returns the maximum number in each window.

  • With stride = 2, the filter jumps 2 pixels at a time as we slide it around.
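The pooling operation can be sketched in plain Python as follows. The function `pool2d`, the helper `average`, and the example values are illustrative only.

```python
def pool2d(feature_map, size=2, stride=2, op=max):
    """Slide a size x size window over a 2D map and reduce each window with `op`."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [
                feature_map[i * stride + a][j * stride + b]
                for a in range(size)
                for b in range(size)
            ]
            row.append(op(window))
        output.append(row)
    return output


def average(values):
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]

print(pool2d(feature_map))               # max pooling:     [[6, 5], [7, 9]]
print(pool2d(feature_map, op=min))       # min pooling:     [[1, 0], [0, 3]]
print(pool2d(feature_map, op=average))   # average pooling: [[3.5, 2.0], [2.5, 6.0]]
```

With a 2x2 window and stride 2 the windows do not overlap, and a 4x4 map is reduced to 2x2.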

Extra Info From CS231n:

In addition to max pooling, the pooling units can also perform other functions, such as average pooling or even L2-norm pooling. Average pooling was often used historically but has recently fallen out of favor compared to the max pooling operation, which has been shown to work better in practice.

Getting rid of pooling. Many people dislike the pooling operation and think that we can get away without it. For example, Striving for Simplicity: The All Convolutional Net proposes to discard the pooling layer in favor of architecture that only consists of repeated CONV layers. To reduce the size of the representation they suggest using larger stride in CONV layer once in a while. Discarding pooling layers has also been found to be important in training good generative models, such as variational autoencoders (VAEs) or generative adversarial networks (GANs). It seems likely that future architectures will feature very few to no pooling layers.

The Computation of POOL Layer

The Pooling Layer accepts a volume of size $W_{in} \times H_{in} \times D_{in}$ (Width x Height x Depth).

2 Hyperparameters are required:

  • Spatial Extent of Kernel Filters ($F$)
  • The Stride ($S$)

For Pooling layers, it is not common to pad the input using zero-padding.

With the above Hyperparameters, the Pooling Layer will produce a volume of size $W_{out} \times H_{out} \times D_{out}$:

  • $W_{out} = \frac{(W_{in} - F)}{S} + 1$
  • $H_{out} = \frac{(H_{in} - F)}{S} + 1$
  • $D_{out} = D_{in}$
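As a sketch, the CONV and POOL size formulas can be combined to trace the volume shape through the MNIST CNN shown earlier. This assumes that each 3x3 'same' convolution corresponds to a padding of 1, that each pooling layer uses a 2x2 window with stride 2, and that a window which does not fit is dropped (floor division). The helper names are made up.

```python
def conv_output_size(w_in, f, s, p):
    return (w_in - f + 2 * p) // s + 1


def pool_output_size(w_in, f, s):
    return (w_in - f) // s + 1


size, depth = 28, 1                 # MNIST input: 28 x 28 x 1
trace = [(size, size, depth)]

for filters in [64, 128, 256, 256]:
    # 3x3 convolution, stride 1, padding 1: width and height are unchanged.
    size = conv_output_size(size, f=3, s=1, p=1)
    depth = filters
    trace.append((size, size, depth))
    # 2x2 pooling, stride 2: width and height are roughly halved, depth is kept.
    size = pool_output_size(size, f=2, s=2)
    trace.append((size, size, depth))

for shape in trace:
    print(shape)
```

Under these assumptions the spatial size goes 28, 14, 7, 3, 1, so the final volume is 1 x 1 x 256 and flattening it gives a vector of 256 values.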

Flatten Layer

This layer converts a three-dimensional layer in the network into a one-dimensional vector to fit the input of a fully-connected layer for classification.

  • A Tensor will be converted into a vector.
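A minimal sketch of flattening, using a nested list as a stand-in for a small height x width x depth tensor. The values are arbitrary.

```python
def flatten(volume):
    """Flatten a height x width x depth nested list into one flat list."""
    return [value for row in volume for pixel in row for value in pixel]


# A 2 x 3 x 2 volume: 2 rows, 3 columns, 2 channels.
volume = [
    [[1, 2], [3, 4], [5, 6]],
    [[7, 8], [9, 10], [11, 12]],
]

vector = flatten(volume)
print(len(vector))   # 2 * 3 * 2 = 12
print(vector)
```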

Fully-Connected Layer (FC)

Also known as a Dense layer in regular neural networks.

The last fully-connected layer holds the output, such as the class scores.

  • We use the softmax activation function to classify these features, which requires a 1-dimensional input.
  • This is why the flatten layer is necessary.
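For reference, here is a small plain-Python sketch of the softmax function applied to a 1-dimensional vector of class scores. The scores are made up, and subtracting the maximum is a common numerical-stability trick rather than something the text above prescribes.

```python
import math


def softmax(scores):
    """Turn a 1-dimensional list of class scores into probabilities."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]   # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]             # hypothetical scores for 3 classes
probs = softmax(scores)

print(probs)                         # non-negative values that sum to 1
print(probs.index(max(probs)))       # index of the predicted class
```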

Practical Usage of CNN

Reference

DeepLizard - Machine Learning & Deep Learning Fundamentals

Deep Learning for Computer Vision (Andrej Karpathy, OpenAI)

CNN - CS231n Materials

CNN Explainer

CS230 VIP CheatSheet