ESM-1 (Evolutionary Scale Modeling 1)

Paper: Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences

  • Inspired by character-level language models.

This paper explores self-supervised language modeling approaches that have demonstrated state-of-the-art performance on a range of natural language processing tasks, applying them to protein data in the form of unlabeled amino acid sequences.

Since protein sequences use a small vocabulary of twenty canonical elements, the modeling problem is more similar to character-level language models than word-level models. Like natural language, protein sequences also contain long-range dependencies, motivating use of architectures that detect and model distant context.
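
As a concrete illustration of this character-level framing, here is a minimal sketch that maps a protein sequence onto a small integer vocabulary. This is not the actual ESM tokenizer, whose alphabet also includes special tokens (e.g. `<cls>`, `<pad>`, `<eos>`, `<mask>`) and non-canonical residues; it only covers the 20 canonical amino acids.

```python
# Minimal sketch of character-level tokenization for protein sequences.
# The real ESM vocabulary also contains special tokens and non-canonical residues.
CANONICAL_AA = "ACDEFGHIKLMNPQRSTVWY"
token_to_id = {aa: i for i, aa in enumerate(CANONICAL_AA)}

def tokenize(sequence: str) -> list[int]:
    """Map each residue character to an integer id."""
    return [token_to_id[aa] for aa in sequence]

print(tokenize("MKTAYIAKQR"))  # [10, 8, 16, 0, 19, 7, 0, 8, 13, 14]
```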

Method

Trained a Transformer on datasets with up to 250 million sequences drawn from the UniParc database, containing 86 billion amino acids.

  • A BERT-like Deep Transformer
  • Taking amino acid character sequences as input
  • Trained with MLM objective
  • The network is trained to predict the missing tokens from the corrupted sequence

Pre-training results (ECE, lower is better):

Model                          Params   Training  ECE
n-gram (4-gram)                -        UR50/S    17.18
LSTM (small)                   28.4M    UR50/S    14.42
LSTM (large)                   113.4M   UR50/S    13.54
Transformer (6-layer)          42.6M    UR50/S    11.79
Transformer (12-layer)         85.1M    UR50/S    10.45
Deep Transformer (34-layer)    669.2M   UR100     10.32
Deep Transformer (34-layer)    669.2M   UR50/S    8.54
Deep Transformer (34-layer)    669.2M   UR50/D    8.46
Deep Transformer, 10% data     669.2M   UR50/S    10.99
Deep Transformer, 1% data      669.2M   UR50/S    15.01
Deep Transformer, 0.1% data    669.2M   UR50/S    17.50

  • UR100: the low-diversity dataset, using the UniRef100 representative sequences
  • UR50/S: the high-diversity sparse dataset, using the UniRef50 representative sequences
  • UR50/D: the high-diversity dense dataset, sampling the UniRef100 sequences evenly across the UniRef50 clusters

UniRef is used to create the three pre-training datasets with differing levels of diversity.

ESM-1:

  • Uses SinusoidalPositionalEmbedding

ESM-1b:

  • Uses LearnedPositionalEmbedding (contrast with ESM-1's sinusoidal embedding in the sketch below)
  • Layer Norm before the outputs
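
The difference between the two embedding schemes is small in code. Below is a minimal sketch with illustrative dimensions (not the actual ESM hyperparameters): ESM-1's sinusoidal table is fixed, while ESM-1b's learned table is an ordinary trainable embedding.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(seq_len: int, dim: int) -> torch.Tensor:
    """Fixed sinusoidal positional embeddings (ESM-1 style, as in the original Transformer)."""
    pos = torch.arange(seq_len, dtype=torch.float32)[:, None]             # (L, 1)
    freq = torch.exp(-math.log(10000.0) * torch.arange(0, dim, 2) / dim)  # (dim/2,)
    emb = torch.zeros(seq_len, dim)
    emb[:, 0::2] = torch.sin(pos * freq)
    emb[:, 1::2] = torch.cos(pos * freq)
    return emb

# Learned positional embeddings (ESM-1b style): a trainable lookup table.
max_len, dim = 1024, 1280  # illustrative sizes, not the exact model settings
learned_positions = nn.Embedding(max_len, dim)
```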

MLM (Masked Language Modeling) Objective

  • ESM models are trained with a masked language modeling (MLM) objective.

$$\mathcal{L}_{\mathrm{MLM}} = \mathbb{E}_{x \sim X}\, \mathbb{E}_{M} \sum_{i \in M} -\log p\left(x_i \mid x_{/M}\right)$$

For each sequence $x$ we sample a set of indices $M$ to mask, replacing the true token at each index $i$ with the mask token.

For each masked token, we independently minimize the negative log likelihood of the true amino acid $x_i$ given the masked sequence $x_{/M}$ as context.

Intuitively, to make a prediction for a masked position, the model must identify dependencies between the masked site and the unmasked parts of the sequence.
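
A minimal PyTorch sketch of this objective is below. It assumes a `model` that maps token ids to per-position logits and uses a BERT-style 15% masking rate; the actual ESM training recipe may additionally replace some selected positions with random or unchanged tokens (as in BERT), which is omitted here.

```python
import torch
import torch.nn.functional as F

def mlm_loss(model, tokens, mask_id, mask_prob=0.15):
    """Minimal MLM sketch: corrupt a random subset of positions with the mask token
    and score the model only on those positions.
    `model` maps (batch, seq_len) token ids to (batch, seq_len, vocab) logits."""
    # Sample the masked index set M independently per position.
    mask = torch.rand(tokens.shape) < mask_prob
    corrupted = tokens.clone()
    corrupted[mask] = mask_id

    logits = model(corrupted)                       # (B, L, V)
    # Negative log likelihood of the true residues at the masked positions only.
    return F.cross_entropy(logits[mask], tokens[mask])
```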

ECE (Exponentiated Cross Entropy) Metric

  • The exponential of the model’s loss averaged per token.
    • In the case of the Transformer this is $2^{\mathcal{L}_{\mathrm{MLM}}}$.
  • The ECE metric describes the mean uncertainty of the model among its set of options for every prediction
    • It ranges from 1 for an ideal model to 25 (the number of unique amino-acid tokens in the data) for a completely random prediction; see the conversion snippet below
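
Note that if the per-token loss is computed with the natural log (as in most deep-learning frameworks), exponentiating with $e$ gives the same value as $2^{\mathcal{L}_{\mathrm{MLM}}}$ computed from a base-2 loss. A small conversion sketch:

```python
import math

def ece(mean_nll_nats: float) -> float:
    """Exponentiated cross-entropy from a per-token loss in nats.
    exp(loss_nats) == 2 ** (loss_nats / ln 2) == 2 ** loss_bits."""
    return math.exp(mean_nll_nats)

print(ece(math.log(25.0)))  # 25.0: a uniformly random guess over 25 tokens
```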

Downstream Prediction

On top of the sequence and pairwise features from the Transformer:

  • Use a depth-32 residual network (ResNet) to predict secondary structure and binary residue-residue contacts (tertiary structure); a sketch of such a contact head follows below
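
The paper's exact downstream architecture has details not reproduced here; the following is a hedged sketch of the general pattern, with assumed channel counts (1280 is the ESM-1b embedding size, 64 is arbitrary): per-residue features are combined into a pairwise map and passed through a depth-32 stack of 2D residual blocks to produce a contact logit for every residue pair.

```python
import torch
import torch.nn as nn

def pairwise_features(residue_repr: torch.Tensor) -> torch.Tensor:
    """Sketch: turn per-residue features (L, d) into a pairwise map (2d, L, L)
    by outer concatenation; the paper's exact pairwise feature construction may differ."""
    L, d = residue_repr.shape
    a = residue_repr.unsqueeze(1).expand(L, L, d)       # row features
    b = residue_repr.unsqueeze(0).expand(L, L, d)       # column features
    return torch.cat([a, b], dim=-1).permute(2, 0, 1)   # (2d, L, L)

class ResBlock2D(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))

# A depth-32 stack of residual blocks ending in a per-pair contact logit.
contact_head = nn.Sequential(
    nn.Conv2d(2 * 1280, 64, 1),            # project pair features down
    *[ResBlock2D(64) for _ in range(32)],
    nn.Conv2d(64, 1, 1),                   # binary contact logit per (i, j)
)

# Usage: x = pairwise_features(repr)            # (2*1280, L, L)
#        logits = contact_head(x.unsqueeze(0))  # (1, 1, L, L)
```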

ESM-2 (Evolutionary Scale Modeling 2) / ESMFold

Paper: Language models of protein sequences at the scale of evolution enable accurate structure prediction

Fast, end-to-end, atomic-resolution structure prediction:

  • No MSA
  • No template search for related proteins (as used in AlphaFold and RoseTTAFold)

Method

ESMFold is a fully end-to-end single-sequence structure predictor.

  • A folding head is trained on top of ESM-2 (see the usage sketch below)

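A brief usage sketch of the end-to-end prediction with the fair-esm package (`pip install "fair-esm[esmfold]"`); the entry points below follow the facebookresearch/esm repository and should be treated as assumptions if your installed version differs:

```python
import torch
import esm  # pip install "fair-esm[esmfold]"

# Load the pretrained ESMFold model (downloads weights on first use).
model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()  # a GPU is strongly recommended

# An arbitrary example amino-acid sequence (single chain, no MSA, no templates).
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"

with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)  # sequence -> atomic coordinates as PDB text

with open("prediction.pdb", "w") as f:
    f.write(pdb_string)
```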

About ESM-2

  • like ESM-1, trained with MLM (Masked Language Modeling) Objective
    • Predict the identity of randomly selected amino acids in a protein sequence by observing their context in the rest of the sequence
  • RoPE (Rotary Position Embedding) is used instead of absolute positional encoding
    • Found to improve model quality for small models; the benefit becomes insignificant as model size and training time scale up (a minimal sketch of the idea follows below)
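
A minimal sketch of the rotary idea, applied to the query/key vectors before the attention dot product; ESM-2's actual implementation details (e.g. how channels are paired) may differ.

```python
import torch

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Minimal RoPE sketch: rotate channel pairs of q/k by a position-dependent angle,
    so the attention score q.k depends on the relative offset between positions.
    x: (seq_len, num_heads, head_dim) with an even head_dim."""
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per channel pair.
    inv_freq = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]  # (L, half)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]                     # (L, 1, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

In attention, this rotation would be applied to both queries and keys; no positional information is added to the token embeddings themselves.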

What is the pair representation?

Folding Trunk

  • A stack of simplified Evoformer blocks (from AlphaFold2)
  • The major change is removing the dependence on MSAs (a schematic block is sketched below)

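A heavily simplified, schematic sketch of such a block (illustrative only, not the ESMFold code; the real blocks also include triangular multiplicative updates, triangle attention, layer norms, and dropout): a single-sequence representation attends over itself with a bias from the pair representation, and the pair representation is updated from the sequence states.

```python
import torch
import torch.nn as nn

class FoldingBlockSketch(nn.Module):
    """Schematic folding-trunk block.
    s: single-sequence representation (L, d_s); z: pair representation (L, L, d_z)."""

    def __init__(self, d_s=384, d_z=128, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_s // n_heads
        self.to_qkv = nn.Linear(d_s, 3 * d_s)
        self.pair_bias = nn.Linear(d_z, n_heads)   # pair rep biases the attention logits
        self.out = nn.Linear(d_s, d_s)
        self.seq_mlp = nn.Sequential(nn.Linear(d_s, 4 * d_s), nn.ReLU(), nn.Linear(4 * d_s, d_s))
        self.to_pair = nn.Linear(2 * d_s, d_z)     # sequence states update the pair rep

    def forward(self, s, z):
        L, _ = s.shape
        q, k, v = self.to_qkv(s).chunk(3, dim=-1)
        q = q.view(L, self.n_heads, self.d_head)
        k = k.view(L, self.n_heads, self.d_head)
        v = v.view(L, self.n_heads, self.d_head)
        # Attention over the single sequence, biased by the pair representation (no MSA axis).
        logits = torch.einsum("ihd,jhd->hij", q, k) / self.d_head ** 0.5
        logits = logits + self.pair_bias(z).permute(2, 0, 1)          # (H, L, L)
        attn = logits.softmax(dim=-1)
        s = s + self.out(torch.einsum("hij,jhd->ihd", attn, v).reshape(L, -1))
        s = s + self.seq_mlp(s)
        # Update the pair representation from an outer concatenation of sequence states.
        pair_in = torch.cat([s[:, None, :].expand(L, L, -1),
                             s[None, :, :].expand(L, L, -1)], dim=-1)
        z = z + self.to_pair(pair_in)
        return s, z
```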

Structure Module

  • Takes the output from the folding trunk and produces 3D atomic coordinates
  • An equivariant Transformer with invariant point attention (IPA), as proposed in AlphaFold2


Extra Info

HuggingFace Intro: https://huggingface.co/docs/transformers/main/en/model_doc/esm
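
A short example of loading an ESM-2 checkpoint through HuggingFace Transformers. The checkpoint name `facebook/esm2_t6_8M_UR50D` is one of the published small models; larger variants (e.g. `facebook/esm2_t33_650M_UR50D`) follow the same pattern.

```python
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(name)
model = EsmForMaskedLM.from_pretrained(name)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len + special tokens, vocab_size)

# Per-position argmax over the vocabulary (with no masking this should largely reproduce the input).
predicted_ids = logits.argmax(dim=-1)
print(tokenizer.decode(predicted_ids[0]))
```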

OpenFold: https://github.com/aqlaboratory/openfold/tree/main