ResNet and DenseNet
The Vanishing Gradient and ResNet
Problem of Vanishing Gradient
- Ideally, the deeper the neural network is, the better the performance we can obtain.
- In reality, however, as the network depth increases, the accuracy gets saturated and then degrades rapidly.
- What happens? => Vanishing / exploding gradients
- Why? The chain rule multiplies one gradient factor per layer (see the sketch after this list).
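A quick sketch of why the chain rule causes this (the notation here is assumed for illustration): for a plain $L$-layer network with activations $x_{i+1} = f_i(x_i)$, backpropagation multiplies one Jacobian per layer,

$$
\frac{\partial \mathcal{L}}{\partial x_1}
= \frac{\partial \mathcal{L}}{\partial x_L}
\prod_{i=1}^{L-1} \frac{\partial x_{i+1}}{\partial x_i}.
$$

If most factors have magnitude below 1, the product shrinks exponentially with depth (vanishing gradient); if most exceed 1, it grows exponentially (exploding gradient).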
ResNet (2015)
- A major breakthrough in training much deeper networks: by isolating the residual to be learned, the 152-layer ResNet achieved better-than-human recognition accuracy on ImageNet.
- Residual learning: skip connections (also called shortcut or jump connections)
- Avoids the vanishing-gradient problem
- A residual block contains two convolution layers and two activations.
- The residual output is taken after the two convolutions and only one activation (the second activation is applied after the addition), because the activation inside the block lets the network learn a sparse residual, which gives better generalization and is also feasible in real applications.
- ResNet applies residual learning to all layers.
- Extremely simple arrangement
- Not only useful for ResNet, but also for nearly all deep learning architectures
- If we add up all the previous residual outputs, the final output accumulates them as a sum, so the error is additive; with a direct mapping the layers are composed, so the error is multiplicative (see the derivation after this list).
- The multiplicative error is the cause of the vanishing / exploding gradient problem.
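In equations (using the standard residual-unit notation $x_{l+1} = x_l + F(x_l)$, with symbols chosen here for illustration), unrolling the additive form from block $l$ to block $L$ gives

$$
x_L = x_l + \sum_{i=l}^{L-1} F(x_i)
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial x_l}
= \frac{\partial \mathcal{L}}{\partial x_L}
\left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i) \right),
$$

so the error signal $\partial \mathcal{L} / \partial x_L$ reaches layer $l$ directly, on top of whatever the residual terms contribute. A direct mapping $x_L = g_{L-1}(\cdots g_l(x_l) \cdots)$ instead gives a pure product of per-layer Jacobians, which is the multiplicative error that can vanish or explode.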
Residual Network (ResNet)
Making a network deeper does not necessarily bring better performance because of the vanishing gradient problem.
- The main idea of ResNet is to use an identity shortcut connection that skips one or more layers.
- This trick alleviates the gradient vanishing problem, leading to networks with 100+ layers.
- If the desired mapping is H(x), it is easier to train the stacked layers to fit the residual mapping F(x) = H(x) - x and then recover H(x) = F(x) + x.
This works because the shortcut contributes one extra term to the gradient, reducing the chance of it becoming small: the error gradient can be passed directly to the lower layers.
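For a single block $H(x) = F(x) + x$, the extra term is simply the identity in the local gradient:

$$
\frac{\partial H}{\partial x} = \frac{\partial F}{\partial x} + I,
$$

so even when $\partial F / \partial x$ is small, the shortcut still passes the gradient through unchanged.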
PyTorch Example
```python
import torch
```
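Below is a minimal, self-contained sketch of a residual block matching the two-convolution, two-activation design described above. The class name and channel sizes are illustrative choices, and batch normalization (used in the published ResNet) is omitted for brevity.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with an identity (skip) connection."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))   # first conv + first activation
        out = self.conv2(out)            # second conv: this is the residual F(x)
        return self.relu(x + out)        # add the shortcut, then the second activation


# Quick check with a dummy input: the block preserves the shape, so x can be added.
block = ResidualBlock(channels=64)
x = torch.randn(1, 64, 32, 32)
print(block(x).shape)  # torch.Size([1, 64, 32, 32])
```

Because the block's output has the same shape as its input, the identity shortcut is a plain addition; in the full ResNet, a 1x1 convolution is used on the shortcut only when the number of channels changes.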
Densely Connected CNN (DenseNet)
- DenseNet further exploits the effect of shortcut connections
- The input of each layer consists of the feature maps of all earlier layers, and its output is passed to each subsequent layer.
- DenseNet not only alleviates the gradient vanishing problem, but also encourages feature reuse, i.e., the network can perform well with fewer parameters.
- Feature reuse is achieved by concatenating feature maps rather than adding them as ResNet does.
PyTorch Example
```python
import torch
```
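A minimal sketch of a dense block in the same spirit: every layer receives the concatenation of all earlier feature maps, and its output is concatenated for later layers. The `growth_rate` parameter, layer count, and plain Conv+ReLU layers are illustrative simplifications, not the exact torchvision DenseNet.

```python
import torch
import torch.nn as nn


class DenseBlock(nn.Module):
    """Each layer sees the channel-wise concatenation of all earlier feature maps."""

    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                # Input channels grow by growth_rate with every preceding layer.
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1),
                nn.ReLU(),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Concatenate all previous feature maps before feeding the next layer.
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)


# Quick check: output channels = in_channels + num_layers * growth_rate.
block = DenseBlock(in_channels=16, growth_rate=12, num_layers=4)
x = torch.randn(1, 16, 32, 32)
print(block(x).shape)  # torch.Size([1, 64, 32, 32]) since 16 + 4 * 12 = 64
```

Note the contrast with the residual block above: concatenation keeps every earlier feature map available to later layers, which is exactly the feature reuse described in the bullets.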