Paper Review - Modality Gap and Alignment in Multi-modal Contrastive Learning
Modality Gap in Multi-modal Contrastive Representation Learning
Code: https://github.com/Weixin-Liang/Modality-Gap
Paper: https://openreview.net/forum?id=S7Evzt9uit3 (NeurIPS 2022)
This paper tries to explain the modality gap, a phenomenon of the representation space of multi-modal models.
- The separation of embeddings from different data modalities (e.g., images and text) in the shared representation space of multi-modal models
```python
def visualize_gap(all_img_features, all_text_features):
    # Plot paired image/text embeddings in a shared 2D projection (sketch below).
    ...
```
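A minimal sketch of how such a visualization might be filled in, assuming `all_img_features` and `all_text_features` are arrays of paired embeddings and using PCA as a simple stand-in for a nonlinear 2D projection such as UMAP:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def visualize_gap(all_img_features, all_text_features):
    # Project paired image/text embeddings into one shared 2D space and
    # scatter-plot them; the two modalities form two separated clusters.
    feats = np.concatenate([all_img_features, all_text_features], axis=0)
    coords = PCA(n_components=2).fit_transform(feats)
    n = len(all_img_features)
    plt.scatter(coords[:n, 0], coords[:n, 1], s=8, label="image")
    plt.scatter(coords[n:, 0], coords[n:, 1], s=8, label="text")
    plt.legend()
    plt.show()
```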
- Happens across:
- various multi-modal models
- various data modalities
- various encoder architectures
- even with randomly initialized model weights
Conclusion of this paper:
- The modality gap is born at random initialization
- Contrastive learning encourages the gap during training
Note that the cone effect induces a modality gap, but the cone effect ≠ the modality gap.
Cone Effect
The Cone Effect Induces A Modality Gap
- a general inductive bias of deep neural networks
How the Cone Effect is measured
- Compute the cosine similarity between pairs of embeddings from the same modality; the average similarity measures how tightly clustered the embeddings are within that modality's representation space.
```python
# Compute average cosine similarity within one modality (L2-normalized features)
sims = all_img_features @ all_img_features.T
avg_cos = (sims.sum() - sims.trace()) / (len(sims) * (len(sims) - 1))
```
Evidence / Causes for Cone Effect
While it might seem reasonable to attribute the gap to differences in data distributions or to the different encoder architectures, these factors are not the fundamental cause of the modality gap.
- Evident from computing the cosine similarity between pairs of embeddings within a single modality
- The average cosine similarity is substantially larger than 0, indicating that the embedding space is a narrow cone
- Observed in models with random weights and even with random noise inputs
- suggesting it is not due to data distribution
- Different random initializations create different embedding cones. Since a multi-modal model consists of two separately initialized encoders, which create different cones at random initialization, this explains why the modality gap is present at initialization.
- The authors show that different random initializations of the model weights result in different cones. More specifically, the variance of an intermediate-layer output comes mostly from the model's random initialization
- Non-linear activations (e.g., ReLU) amplify the cone effect
- Average cosine similarity increases rapidly as the number of layers (with non-linearity) increases; see the toy sketch after this list
- The authors prove that each network layer increases cosine similarity: with high probability, each layer shrinks the angle between any pair of embedding vectors
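A toy sketch of this depth effect (my own minimal setup, not the paper's experiment): stack randomly initialized ReLU layers on random noise inputs and track the average pairwise cosine similarity, which should climb rapidly with depth.

```python
import torch
import torch.nn.functional as F

def avg_cosine(feats):
    # Mean pairwise cosine similarity, excluding self-similarity
    sims = F.normalize(feats, dim=-1) @ F.normalize(feats, dim=-1).T
    n = len(sims)
    return ((sims.sum() - sims.trace()) / (n * (n - 1))).item()

torch.manual_seed(0)
x = torch.randn(512, 128)                  # random noise inputs, no real data needed
for depth in range(1, 9):
    x = F.relu(torch.nn.Linear(x.shape[1], 128)(x))  # fresh randomly initialized layer
    print(f"after layer {depth}: avg cosine similarity = {avg_cosine(x):.3f}")
```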
What does not contribute to the cone effect:
- Normalization layers
How Contrastive Learning causes the modality gap
- Contrastive learning preserves the gap during optimization
- How? The embedding shift and mismatched data experiments below probe this; a minimal version of the contrastive loss they evaluate is sketched below for reference.
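Since the experiments below repeatedly evaluate this loss, here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss with temperature, assuming L2-normalized features; `clip_contrastive_loss` is a hypothetical helper name, not the paper's code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_feats, txt_feats, temperature=0.01):
    # img_feats, txt_feats: [n, d] L2-normalized embeddings; row i of each is a matched pair
    logits = img_feats @ txt_feats.T / temperature          # scaled cosine similarities
    labels = torch.arange(len(img_feats), device=logits.device)
    # Symmetric cross-entropy: match each image to its text and each text to its image
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```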
Potential Downsides of Modality Gap
- restrict the expressiveness of learned representations
- potentially decrease the downstream task performance?
- The paper actually found that increasing the modality gap can improve the downstream performance on zero-shot learning and fairness tasks
- The authors leave open whether it is desirable to have no modality gap at all
How the gap is measured
The gap vector is the difference between the two modality centers, $\Delta_{\text{gap}} = \frac{1}{n}\sum_{i=1}^{n} x_i - \frac{1}{n}\sum_{i=1}^{n} y_i$, where $x_i$ are the image embeddings and $y_i$ are the text embeddings; its norm $\lVert\Delta_{\text{gap}}\rVert$ gives the gap distance.
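A minimal sketch of this measurement (assuming `img_feats` and `txt_feats` are tensors of paired, L2-normalized embeddings):

```python
import torch

def modality_gap(img_feats, txt_feats):
    # Gap vector = difference between the two modality centers; also return its norm
    gap = img_feats.mean(dim=0) - txt_feats.mean(dim=0)
    return gap, gap.norm().item()
```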
Embedding Shift Experiment
The authors manually shifted CLIP's image embeddings and text embeddings along the gap direction to close the gap (a sketch of this shift appears after this list).
- Found that shifting toward closing the gap increases the contrastive loss
- contrastive loss actually encourages the separation of modalities
- The gap distance decreases monotonically with increasing temperature (τ)
- increasing temperature can decrease the gap
Temperature is often used as an uncertainty parameter in contrastive learning.
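A minimal sketch of the embedding-shift experiment, reusing the hypothetical `clip_contrastive_loss` helper above; the shift amount `lam` and the exact shifting scheme are simplifying assumptions rather than the paper's precise procedure.

```python
import torch.nn.functional as F

def shift_and_evaluate(img_feats, txt_feats, lam=0.5, temperature=0.01):
    # Move the two modalities toward each other along the gap direction,
    # re-normalize onto the hypersphere, and re-evaluate the contrastive loss
    gap = img_feats.mean(dim=0) - txt_feats.mean(dim=0)
    img_shifted = F.normalize(img_feats - lam * gap, dim=-1)
    txt_shifted = F.normalize(txt_feats + lam * gap, dim=-1)
    return clip_contrastive_loss(img_shifted, txt_shifted, temperature)
```

Sweeping `lam` from 0 toward values that close the gap should, per the paper's finding, increase the loss at CLIP's low temperature.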
Mismatched Data Experiment
The authors set up a simple simulated embedding experiment containing mismatched labels, and then repeated it with the mismatched labels corrected (see the toy sketch after this list).
- Found that without mismatched labels, no gap forms
- The presence of mismatched data (e.g., hard-to-differentiate examples or annotation errors) contributes to the formation of the modality gap under low temperatures.
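A toy sketch of this kind of simulation, under my own simplifying assumptions (low-dimensional embeddings optimized directly with the contrastive loss sketched above, where `mismatched=True` makes two captions identical to mimic hard-to-differentiate or mislabeled pairs); it illustrates the setup rather than reproducing the paper's exact simulation.

```python
import torch
import torch.nn.functional as F

def simulate_gap(mismatched, n=6, d=2, steps=2000, temperature=0.02, seed=0):
    torch.manual_seed(seed)
    img = torch.randn(n, d, requires_grad=True)   # free image embeddings
    txt = torch.randn(n, d, requires_grad=True)   # free text embeddings
    opt = torch.optim.Adam([img, txt], lr=0.05)

    def pair(i_raw, t_raw):
        i, t = F.normalize(i_raw, dim=-1), F.normalize(t_raw, dim=-1)
        if mismatched:
            # Captions 0 and 1 become identical: two images share one caption
            t = torch.cat([t[:1], t[:1], t[2:]], dim=0)
        return i, t

    for _ in range(steps):
        i, t = pair(img, txt)
        loss = clip_contrastive_loss(i, t, temperature)
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        i, t = pair(img, txt)
        return (i.mean(0) - t.mean(0)).norm().item()   # gap magnitude after training
```

Comparing `simulate_gap(True)` against `simulate_gap(False)` at a low temperature is the kind of contrast the paper uses to argue that mismatched data drives gap formation.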
Implications of the Modality Gap
- Table 1: Modifying the modality gap distance can impact downstream zero-shot classification performance (some good some bad).
- Table 2: Adjusting the gap distance can mitigate denigration biases in models like CLIP, leading to more fair outcomes
Alignment and Uniformity in Contrastive Learning
Alignment and Uniformity
In the context of contrastive learning, the concepts of alignment and uniformity are crucial. Alignment refers to the similarity between the representations of positive pairs, while uniformity refers to how evenly the representations are distributed in the embedding space.
- Alignment refers to the phenomenon that similar samples have similar features.
- Uniformity refers to the phenomenon that features are evenly distributed on the hypersphere in order to preserve information.
The authors propose quantifiable metrics for alignment and uniformity:
The alignment loss is defined as the expected distance between positive pairs:

$$\mathcal{L}_{\text{align}}(f; \alpha) \triangleq \mathbb{E}_{(x, y) \sim p_{\text{pos}}}\left[\lVert f(x) - f(y) \rVert_2^{\alpha}\right]$$

where $(x, y)$ are positive pairs drawn from the positive-pair distribution $p_{\text{pos}}$ and $f$ is the encoder function. In practice $\alpha = 2$, i.e., the squared L2 distance.
The uniformity loss is defined as the logarithm of the average pairwise Gaussian potential:

$$\mathcal{L}_{\text{uniform}}(f; t) \triangleq \log \mathbb{E}_{x, y \overset{\text{i.i.d.}}{\sim} p_{\text{data}}}\left[e^{-t \lVert f(x) - f(y) \rVert_2^2}\right]$$

where $x$ and $y$ are samples drawn independently from the data distribution $p_{\text{data}}$, $f$ is the encoder function, and $G_t(u, v) \triangleq e^{-t \lVert u - v \rVert_2^2}$ is the Gaussian potential kernel (with $t = 2$ as a common choice).
The authors choose the Gaussian potential kernel because the uniform distribution is the unique distribution that minimizes the expected pairwise potential.
Directly optimizing for the proposed metrics leads to comparable or better downstream task performance than contrastive learning.
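A minimal PyTorch sketch of the two losses, following the definitions above with $\alpha = 2$ and $t = 2$:

```python
import torch

def align_loss(x, y, alpha=2):
    # x, y: [n, d] L2-normalized features of positive pairs (row i matches row i)
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniform_loss(x, t=2):
    # Log of the average pairwise Gaussian potential over all distinct pairs
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```

In practice the two are combined into a weighted sum, e.g. something like `align_loss(zx, zy) + (uniform_loss(zx) + uniform_loss(zy)) / 2` over the normalized features `zx`, `zy` of the two views.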