Modality Gap in Multi-modal Contrastive Representation Learning

Code: https://github.com/Weixin-Liang/Modality-Gap

Paper: https://openreview.net/forum?id=S7Evzt9uit3 (NeurIPS 2022)

This paper tries to explain the modality gap, a phenomenon of the representation space of multi-modal models.

  • The separation of embeddings from different data modalities (e.g., images and text) in the shared representation space of multi-modal models


import numpy as np
import matplotlib.pyplot as plt

def visualize_gap(all_img_features, all_text_features):
    def svd(X, n_components=2):
        # Project the embeddings onto the top singular directions via SVD
        X = X.astype(np.float32)
        U, S, Vt = np.linalg.svd(X)
        return U[:, :n_components] * S[:n_components]

    # Stack both modalities so they are projected into the same 2D space
    features_2d = svd(np.concatenate([all_img_features, all_text_features], 0))
    n_img = len(all_img_features)
    plt.figure(figsize=(5, 5))
    # Image embeddings in red, text embeddings in blue
    plt.scatter(features_2d[:n_img, 0], features_2d[:n_img, 1], c='red')
    plt.scatter(features_2d[n_img:, 0], features_2d[n_img:, 1], c='blue')
    # Connect each image embedding to its paired text embedding
    for i in range(n_img):
        plt.plot([features_2d[i, 0], features_2d[n_img + i, 0]],
                 [features_2d[i, 1], features_2d[n_img + i, 1]],
                 c='black', alpha=0.1)

Alternatively, using UMAP instead of SVD:


import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from umap import UMAP

def visualize_umap(protein_repr, text_repr, title=""):
    # Combine the (already normalized) representations so UMAP embeds both modalities together
    combined_normalized_data = np.concatenate((protein_repr, text_repr), axis=0)

    # Compute the 2D UMAP projection
    umap = UMAP(n_components=2, random_state=42)
    umap_rep = umap.fit_transform(combined_normalized_data)

    # Split the UMAP representation back into protein and text
    umap_protein_repr = umap_rep[:protein_repr.shape[0]]
    umap_text_repr = umap_rep[protein_repr.shape[0]:]

    # Scatter plot of the UMAP embeddings
    plt.figure(figsize=(8, 6))
    sns.scatterplot(x=umap_protein_repr[:, 0], y=umap_protein_repr[:, 1],
                    s=30, alpha=0.5, color='blue', label='Protein')
    sns.scatterplot(x=umap_text_repr[:, 0], y=umap_text_repr[:, 1],
                    s=30, alpha=0.5, color='orange', label='Text')

    # Draw lines between corresponding protein/text pairs to show the gap
    for i in range(len(umap_protein_repr)):
        plt.plot([umap_protein_repr[i, 0], umap_text_repr[i, 0]],
                 [umap_protein_repr[i, 1], umap_text_repr[i, 1]],
                 color='gray', alpha=0.5)

    plt.title(title)
    plt.xlabel('UMAP Component 1')
    plt.ylabel('UMAP Component 2')
    plt.legend()
    plt.show()

  • The gap appears across:
    • various multi-modal models
    • different data modalities
    • different encoder architectures
    • even with randomly initialized model weights

Conclusion of this paper:

  • The modality gap is born at random initialization
  • Contrastive learning encourages the gap

Note that while the cone effect induces a modality gap, the cone effect is not the same thing as the modality gap.

Cone Effect

The Cone Effect Induces A Modality Gap

  • The cone effect is a general inductive bias of deep neural networks.

How the Cone Effect is measured

  • Calculating the cosine similarity between pairs of embeddings from the same modality measures how tightly clustered the embeddings are within that modality’s representation space.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Compute the pairwise cosine similarity within one modality
def pair_wise_similarity(modality_str, data, title=""):
    assert modality_str in ['protein', 'text']
    # Assumes `data` is already L2-normalized, so dot products equal cosine similarities
    normed_feature = data
    similarity = normed_feature @ normed_feature.T
    # Drop the diagonal (self-similarity is always 1)
    similarity = similarity[~np.eye(similarity.shape[0], dtype=bool)].tolist()
    print(modality_str, 'min similarity:', min(similarity))

    x_name = '{} cosine pair similarity'.format(modality_str)
    tmp_df = pd.DataFrame({x_name: similarity})
    print('mean', np.mean(similarity))
    print(tmp_df[x_name].describe())

    # Histogram of the pairwise similarities
    plt.figure(figsize=(6, 6))
    plt.title(title)
    sns.histplot(data=tmp_df, x=x_name)
    plt.xlabel('')
    plt.ylabel('')
    plt.show()
    return

Evidence / Causes for Cone Effect

While it might seem reasonable to attribute the gap to differences in data distributions or in encoder architectures, these factors are not the fundamental cause of the modality gap phenomenon.

  • It can be seen by computing the cosine similarity between pairs of embeddings within a single modality
    • The average cosine similarity is substantially larger than 0, indicating that the embedding space is a narrow cone
  • It is observed in models with random weights and even with random noise inputs
    • This suggests it is not due to the data distribution
    • Different random initializations create different embedding cones. Since a multi-modal model consists of two encoders, which form different cones at random initialization, this explains why the modality gap is present right at initialization.
    • The authors proved that different random initializations of model weights result in different cones; more specifically, the variance of an intermediate output comes mostly from the model’s random initialization
  • Non-linear activations (e.g., ReLU) amplify the cone effect
    • The average cosine similarity increases rapidly as the number of layers (with non-linearity) increases
    • The authors proved that each network layer increases cosine similarity, since each layer shrinks the angle between any pair of embedding vectors with high probability (a minimal sketch of this depth experiment follows this list)
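
A minimal sketch of this depth experiment (my own toy, not the paper's exact setup): random Gaussian noise is pushed through a stack of randomly initialized Linear + ReLU layers, and the average pairwise cosine similarity of the activations is printed as depth grows. The width (256), batch size (512), and seed are arbitrary choices.

import torch
import torch.nn as nn

# Average pairwise cosine similarity of a batch of vectors (diagonal excluded)
def avg_cosine_similarity(x):
    x = nn.functional.normalize(x, dim=1)
    sim = x @ x.T
    mask = ~torch.eye(len(x), dtype=torch.bool)
    return sim[mask].mean().item()

torch.manual_seed(0)
inputs = torch.randn(512, 64)  # random noise inputs, no real data involved
layers = []
for depth in range(1, 9):
    # Stack one more randomly initialized Linear + ReLU layer
    layers += [nn.Linear(64 if depth == 1 else 256, 256), nn.ReLU()]
    with torch.no_grad():
        h = nn.Sequential(*layers)(inputs)
    print(f'depth {depth}: avg cosine similarity = {avg_cosine_similarity(h):.3f}')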


What does not contribute to the cone effect:

  • Normalization layers

How Contrastive Learning Preserves the Modality Gap

  • Contrastive learning preserves the gap during optimization
    • How? The embedding shift and mismatched data experiments below show that closing the gap actually increases the contrastive loss.

Potential Downsides of Modality Gap

  • It may restrict the expressiveness of the learned representations
  • It may potentially decrease downstream task performance
    • The paper actually found that increasing the modality gap can improve downstream performance on zero-shot learning and fairness tasks
  • The authors could not determine whether having no modality gap at all is desirable

How the gap is measured

The gap vector is the mean of one modality’s embeddings minus the mean of the other’s:

$$\vec{\Delta}_{\text{gap}} = \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i - \frac{1}{n} \sum_{i=1}^n \mathbf{y}_i$$
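
A minimal sketch of this measurement (a hypothetical helper, not the authors' code), assuming the features are NumPy arrays that are already L2-normalized, as in CLIP:

import numpy as np

# Gap vector = difference of the two modality means; its L2 norm is the reported gap distance
def modality_gap(img_features, text_features):
    delta_gap = img_features.mean(axis=0) - text_features.mean(axis=0)
    return delta_gap, np.linalg.norm(delta_gap)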


Embedding Shift Experiment

The authors manually shifted CLIP’s image embeddings and text embeddings toward each other to close the gap:

$$\mathbf{x}_i^{\text{shift}} = \operatorname{Normalize}\left(\mathbf{x}_i - \lambda \vec{\Delta}_{\text{gap}}\right), \quad \mathbf{y}_i^{\text{shift}} = \operatorname{Normalize}\left(\mathbf{y}_i + \lambda \vec{\Delta}_{\text{gap}}\right)$$
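
A sketch of the shift itself, reusing the hypothetical modality_gap helper above; lam plays the role of λ, and each shifted embedding is re-normalized back onto the unit hypersphere:

import numpy as np

def shift_embeddings(img_features, text_features, lam):
    # Move the two modalities toward each other along the gap vector
    delta_gap, _ = modality_gap(img_features, text_features)
    x_shift = img_features - lam * delta_gap
    y_shift = text_features + lam * delta_gap
    # Re-normalize each embedding onto the unit hypersphere
    x_shift = x_shift / np.linalg.norm(x_shift, axis=1, keepdims=True)
    y_shift = y_shift / np.linalg.norm(y_shift, axis=1, keepdims=True)
    return x_shift, y_shift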

  • They found that shifting the embeddings to close the gap increases the contrastive loss
    • i.e., the contrastive loss actually encourages the separation of the modalities
  • The gap distance decreases monotonically as the temperature $\tau$ increases
    • i.e., raising the temperature shrinks the gap

Temperature is often used as an uncertainty parameter in contrastive learning.
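
To probe how the shift affects the loss, here is a sketch of the symmetric CLIP-style contrastive loss with an explicit temperature (a NumPy/SciPy re-implementation for illustration, not CLIP's own code):

import numpy as np
from scipy.special import logsumexp

def clip_contrastive_loss(img_features, text_features, temperature=0.01):
    # Matched image/text pairs sit on the diagonal of the logit matrix
    logits = img_features @ text_features.T / temperature
    # Cross-entropy in both directions (image-to-text and text-to-image)
    loss_i2t = -np.diag(logits) + logsumexp(logits, axis=1)
    loss_t2i = -np.diag(logits) + logsumexp(logits, axis=0)
    return (loss_i2t.mean() + loss_t2i.mean()) / 2

# e.g. compare the loss before and after closing the gap:
# clip_contrastive_loss(x, y) vs. clip_contrastive_loss(*shift_embeddings(x, y, 0.5))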

Mismatched Data Experiment

The authors set up a simple simulated embedding task in which some pairs were deliberately mismatched, keeping those mismatched labels fixed during optimization.

  • They found that without mismatched labels, no gap forms
  • The presence of mismatched data (e.g., hard-to-differentiate examples or annotation errors) contributes to the formation of the modality gap under low temperatures (a toy sketch of such a simulation follows this list).
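
A toy sketch of such a simulation (my own simplification, not the paper's exact setup): two sets of 2D unit-norm embeddings, parameterized by angles, are trained with the symmetric contrastive loss at a low temperature, with and without one deliberately swapped (mismatched) pair, and the resulting gap distances are compared.

import torch
import torch.nn.functional as F

def simulate_gap(mismatch, temperature=0.02, steps=2000, n=6):
    torch.manual_seed(0)
    # Parameterize each embedding by an angle so it stays on the unit circle
    theta_x = torch.randn(n, requires_grad=True)
    theta_y = torch.randn(n, requires_grad=True)
    target = torch.arange(n)
    if mismatch:
        # Swap two labels to create a mismatched pair (a swap is its own inverse,
        # so the same target works for both loss directions)
        target[0], target[1] = 1, 0
    opt = torch.optim.Adam([theta_x, theta_y], lr=0.05)
    for _ in range(steps):
        x = torch.stack([torch.cos(theta_x), torch.sin(theta_x)], dim=1)
        y = torch.stack([torch.cos(theta_y), torch.sin(theta_y)], dim=1)
        logits = x @ y.T / temperature
        loss = (F.cross_entropy(logits, target) + F.cross_entropy(logits.T, target)) / 2
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        x = torch.stack([torch.cos(theta_x), torch.sin(theta_x)], dim=1)
        y = torch.stack([torch.cos(theta_y), torch.sin(theta_y)], dim=1)
        return (x.mean(0) - y.mean(0)).norm().item()

print('gap distance with a mismatched pair:  ', simulate_gap(mismatch=True))
print('gap distance without mismatched pairs:', simulate_gap(mismatch=False))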

Implications of the Modality Gap

  • Table 1: Modifying the modality gap distance can impact downstream zero-shot classification performance (sometimes for the better, sometimes for the worse).
  • Table 2: Adjusting the gap distance can mitigate denigration biases in models like CLIP, leading to fairer outcomes.


Alignment and Uniformity in Contrastive Learning

Paper: Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere

Alignment and Uniformity

In the context of contrastive learning, the concepts of alignment and uniformity are crucial. Alignment refers to the closeness of the representations of positive pairs, while uniformity refers to how evenly the representations are distributed over the embedding space.

  • Alignment refers to the phenomenon that similar samples have similar features.
  • Uniformity refers to the phenomenon that features are evenly distributed on the hypersphere in order to preserve information.


The authors propose quantifiable metrics for alignment and uniformity:

The alignment loss is defined as the expected distance between positive pairs:

$$L_{\text{align}}(f;\alpha) \equiv \mathbb{E}_{(x,y) \sim p_{\text{pos}}}\left[\|f(x) - f(y)\|_2^{\alpha}\right], \quad \alpha > 0$$

where $(x, y)$ are positive pairs drawn from the positive-pair distribution $p_{\text{pos}}$ and $f$ is the encoder function.

In practice, the L2 distance with $\alpha = 2$ is used.

The uniformity loss is defined as the logarithm of the average pairwise Gaussian potential:

$$L_{\text{uniform}}(f;t) \equiv \log \mathbb{E}_{x,y \overset{\text{i.i.d.}}{\sim} p_{\text{data}}}\left[G_t(f(x), f(y))\right] = \log \mathbb{E}_{x,y \overset{\text{i.i.d.}}{\sim} p_{\text{data}}}\left[e^{-t\|f(x)-f(y)\|_2^2}\right], \quad t > 0$$

where $x$ and $y$ are samples drawn independently from the data distribution $p_{\text{data}}$, $f$ is the encoder function, and $G_t(u, v) \equiv e^{-t\|u-v\|_2^2} = e^{2t \cdot u^\top v - 2t}$, $t > 0$, is the Gaussian potential kernel (the second equality holds because the embeddings are normalized to the unit hypersphere).

The authors choose the Gaussian potential kernel because the uniform distribution is the unique distribution that minimizes the expected pairwise potential.
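
A sketch of the two losses in PyTorch, matching the definitions above and assuming x and y are batches of L2-normalized embeddings of positive pairs:

import torch

def align_loss(x, y, alpha=2):
    # Expected distance between positive pairs, raised to the power alpha
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniform_loss(x, t=2):
    # Log of the average pairwise Gaussian potential over the batch
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()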

Directly optimizing for the proposed metrics leads to comparable or better downstream task performance than contrastive learning.