ESM-1 (Evolutionary Scale Modeling 1)

Paper: Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences

  • Inspired by character-level language models.

This paper explores self-supervised language modeling approaches that have demonstrated state-of-the-art performance on a range of natural language processing tasks, applying them to protein data in the form of unlabeled amino acid sequences.

Since protein sequences use a small vocabulary of twenty canonical elements, the modeling problem is more similar to character-level language models than word-level models. Like natural language, protein sequences also contain long-range dependencies, motivating use of architectures that detect and model distant context.
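
As a concrete illustration of this character-level framing, here is a minimal sketch that maps a protein sequence onto a small integer vocabulary. This is not the actual ESM tokenizer, whose alphabet also includes special tokens (e.g. `<cls>`, `<pad>`, `<eos>`, `<mask>`) and non-canonical residues; it only covers the 20 canonical amino acids.

```python
# Minimal sketch of character-level tokenization for protein sequences.
# The real ESM vocabulary also contains special tokens and non-canonical residues.
CANONICAL_AA = "ACDEFGHIKLMNPQRSTVWY"
token_to_id = {aa: i for i, aa in enumerate(CANONICAL_AA)}

def tokenize(sequence: str) -> list[int]:
    """Map each residue character to an integer id."""
    return [token_to_id[aa] for aa in sequence]

print(tokenize("MKTAYIAKQR"))  # [10, 8, 16, 0, 19, 7, 0, 8, 13, 14]
```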

Method

Trained a Transformer on datasets with up to 250 million sequences drawn from the UniParc database, containing 86 billion amino acids.

  • A BERT-like Deep Transformer
  • Taking amino acid character sequences as input
  • Trained with MLM objective
  • The network is trained to predict the missing tokens from the corrupted sequence

Pre-training results (ECE, lower is better):

Model                          Params   Training  ECE
n-gram (4-gram)                -        UR50/S    17.18
LSTM (small)                   28.4M    UR50/S    14.42
LSTM (large)                   113.4M   UR50/S    13.54
Transformer (6-layer)          42.6M    UR50/S    11.79
Transformer (12-layer)         85.1M    UR50/S    10.45
Deep Transformer (34-layer)    669.2M   UR100     10.32
Deep Transformer (34-layer)    669.2M   UR50/S    8.54
Deep Transformer (34-layer)    669.2M   UR50/D    8.46
Deep Transformer, 10% data     669.2M   UR50/S    10.99
Deep Transformer, 1% data      669.2M   UR50/S    15.01
Deep Transformer, 0.1% data    669.2M   UR50/S    17.50

  • UR100: the low-diversity dataset, using the UniRef100 representative sequences
  • UR50/S: the high-diversity sparse dataset, using the UniRef50 representative sequences
  • UR50/D: the high-diversity dense dataset, sampling the UniRef100 sequences evenly across the UniRef50 clusters

UniRef is used to create the three pre-training datasets with differing levels of diversity.

ESM-1:

  • Uses SinusoidalPositionalEmbedding

ESM-1b:

  • Uses LearnedPositionalEmbedding (contrast with ESM-1's sinusoidal embedding in the sketch below)
  • Layer Norm before the outputs
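
The difference between the two embedding schemes is small in code. Below is a minimal sketch with illustrative dimensions (not the actual ESM hyperparameters): ESM-1's sinusoidal table is fixed, while ESM-1b's learned table is an ordinary trainable embedding.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(seq_len: int, dim: int) -> torch.Tensor:
    """Fixed sinusoidal positional embeddings (ESM-1 style, as in the original Transformer)."""
    pos = torch.arange(seq_len, dtype=torch.float32)[:, None]             # (L, 1)
    freq = torch.exp(-math.log(10000.0) * torch.arange(0, dim, 2) / dim)  # (dim/2,)
    emb = torch.zeros(seq_len, dim)
    emb[:, 0::2] = torch.sin(pos * freq)
    emb[:, 1::2] = torch.cos(pos * freq)
    return emb

# Learned positional embeddings (ESM-1b style): a trainable lookup table.
max_len, dim = 1024, 1280  # illustrative sizes, not the exact model settings
learned_positions = nn.Embedding(max_len, dim)
```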

MLM (Masked Language Modeling) Objective

  • ESM models are trained with a masked language modeling (MLM) objective.

$$\mathcal{L}_{\mathrm{MLM}} = \mathbb{E}_{x \sim X}\, \mathbb{E}_{M} \sum_{i \in M} -\log p\left(x_i \mid x_{/M}\right)$$

For each sequence $x$ we sample a set of indices $M$ to mask, replacing the true token at each index $i$ with the mask token.

For each masked token, we independently minimize the negative log likelihood of the true amino acid $x_i$ given the masked sequence $x_{/M}$ as context.

Intuitively, to make a prediction for a masked position, the model must identify dependencies between the masked site and the unmasked parts of the sequence.
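
A minimal PyTorch sketch of this objective is below. It assumes a `model` that maps token ids to per-position logits and uses a BERT-style 15% masking rate; the actual ESM training recipe may additionally replace some selected positions with random or unchanged tokens (as in BERT), which is omitted here.

```python
import torch
import torch.nn.functional as F

def mlm_loss(model, tokens, mask_id, mask_prob=0.15):
    """Minimal MLM sketch: corrupt a random subset of positions with the mask token
    and score the model only on those positions.
    `model` maps (batch, seq_len) token ids to (batch, seq_len, vocab) logits."""
    # Sample the masked index set M independently per position.
    mask = torch.rand(tokens.shape) < mask_prob
    corrupted = tokens.clone()
    corrupted[mask] = mask_id

    logits = model(corrupted)                       # (B, L, V)
    # Negative log likelihood of the true residues at the masked positions only.
    return F.cross_entropy(logits[mask], tokens[mask])
```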

ECE (Exponentiated Cross Entropy) Metric

  • The exponential of the model’s loss averaged per token.
    • In the case of the Transformer this is $2^{\mathcal{L}_{\mathrm{MLM}}}$.
  • The ECE metric describes the mean uncertainty of the model among its set of options for every prediction
    • It ranges from 1 for an ideal model to 25 (the number of unique amino-acid tokens in the data) for a completely random prediction; see the conversion snippet below
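
Note that if the per-token loss is computed with the natural log (as in most deep-learning frameworks), exponentiating with $e$ gives the same value as $2^{\mathcal{L}_{\mathrm{MLM}}}$ computed from a base-2 loss. A small conversion sketch:

```python
import math

def ece(mean_nll_nats: float) -> float:
    """Exponentiated cross-entropy from a per-token loss in nats.
    exp(loss_nats) == 2 ** (loss_nats / ln 2) == 2 ** loss_bits."""
    return math.exp(mean_nll_nats)

print(ece(math.log(25.0)))  # 25.0: a uniformly random guess over 25 tokens
```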

Downstream Prediction

On top of the sequence and pairwise features from the Transformer:

  • Use a depth-32 residual network (ResNet) to predict secondary structure and binary residue-residue contacts (tertiary structure); a sketch of such a contact head follows below
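
The paper's exact downstream architecture has details not reproduced here; the following is a hedged sketch of the general pattern, with assumed channel counts (1280 is the ESM-1b embedding size, 64 is arbitrary): per-residue features are combined into a pairwise map and passed through a depth-32 stack of 2D residual blocks to produce a contact logit for every residue pair.

```python
import torch
import torch.nn as nn

def pairwise_features(residue_repr: torch.Tensor) -> torch.Tensor:
    """Sketch: turn per-residue features (L, d) into a pairwise map (2d, L, L)
    by outer concatenation; the paper's exact pairwise feature construction may differ."""
    L, d = residue_repr.shape
    a = residue_repr.unsqueeze(1).expand(L, L, d)       # row features
    b = residue_repr.unsqueeze(0).expand(L, L, d)       # column features
    return torch.cat([a, b], dim=-1).permute(2, 0, 1)   # (2d, L, L)

class ResBlock2D(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))

# A depth-32 stack of residual blocks ending in a per-pair contact logit.
contact_head = nn.Sequential(
    nn.Conv2d(2 * 1280, 64, 1),            # project pair features down
    *[ResBlock2D(64) for _ in range(32)],
    nn.Conv2d(64, 1, 1),                   # binary contact logit per (i, j)
)

# Usage: x = pairwise_features(repr)            # (2*1280, L, L)
#        logits = contact_head(x.unsqueeze(0))  # (1, 1, L, L)
```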

ESM-2 (Evolutionary Scale Modeling 2) / ESMFold

Paper: Language models of protein sequences at the scale of evolution enable accurate structure prediction

Fast, end-to-end, atomic-resolution structure prediction:

  • No MSA
  • No template search for related proteins (as used in AlphaFold and RoseTTAFold)

Method

ESMFold is a fully end-to-end single-sequence structure predictor.

  • A folding head is trained on top of ESM-2 (see the usage sketch below)

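A brief usage sketch of the end-to-end prediction with the fair-esm package (`pip install "fair-esm[esmfold]"`); the entry points below follow the facebookresearch/esm repository and should be treated as assumptions if your installed version differs:

```python
import torch
import esm  # pip install "fair-esm[esmfold]"

# Load the pretrained ESMFold model (downloads weights on first use).
model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()  # a GPU is strongly recommended

# An arbitrary example amino-acid sequence (single chain, no MSA, no templates).
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"

with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)  # sequence -> atomic coordinates as PDB text

with open("prediction.pdb", "w") as f:
    f.write(pdb_string)
```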

About ESM-2

  • like ESM-1, trained with MLM (Masked Language Modeling) Objective
    • Predict the identity of randomly selected amino acids in a protein sequence by observing their context in the rest of the sequence
  • RoPE (Rotary Position Embedding) is used instead of absolute positional encoding
    • Found to improve model quality for small models; the benefit becomes insignificant as model size and training time scale up (a minimal sketch of the idea follows below)
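
A minimal sketch of the rotary idea, applied to the query/key vectors before the attention dot product; ESM-2's actual implementation details (e.g. how channels are paired) may differ.

```python
import torch

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Minimal RoPE sketch: rotate channel pairs of q/k by a position-dependent angle,
    so the attention score q.k depends on the relative offset between positions.
    x: (seq_len, num_heads, head_dim) with an even head_dim."""
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per channel pair.
    inv_freq = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]  # (L, half)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]                     # (L, 1, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

In attention, this rotation would be applied to both queries and keys; no positional information is added to the token embeddings themselves.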

What is the pair representation?

Folding Trunk

  • A stack of simplified Evoformer blocks (from AlphaFold2)
  • The major change is removing the dependence on MSAs (a schematic block is sketched below)

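A heavily simplified, schematic sketch of such a block (illustrative only, not the ESMFold code; the real blocks also include triangular multiplicative updates, triangle attention, layer norms, and dropout): a single-sequence representation attends over itself with a bias from the pair representation, and the pair representation is updated from the sequence states.

```python
import torch
import torch.nn as nn

class FoldingBlockSketch(nn.Module):
    """Schematic folding-trunk block.
    s: single-sequence representation (L, d_s); z: pair representation (L, L, d_z)."""

    def __init__(self, d_s=384, d_z=128, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_s // n_heads
        self.to_qkv = nn.Linear(d_s, 3 * d_s)
        self.pair_bias = nn.Linear(d_z, n_heads)   # pair rep biases the attention logits
        self.out = nn.Linear(d_s, d_s)
        self.seq_mlp = nn.Sequential(nn.Linear(d_s, 4 * d_s), nn.ReLU(), nn.Linear(4 * d_s, d_s))
        self.to_pair = nn.Linear(2 * d_s, d_z)     # sequence states update the pair rep

    def forward(self, s, z):
        L, _ = s.shape
        q, k, v = self.to_qkv(s).chunk(3, dim=-1)
        q = q.view(L, self.n_heads, self.d_head)
        k = k.view(L, self.n_heads, self.d_head)
        v = v.view(L, self.n_heads, self.d_head)
        # Attention over the single sequence, biased by the pair representation (no MSA axis).
        logits = torch.einsum("ihd,jhd->hij", q, k) / self.d_head ** 0.5
        logits = logits + self.pair_bias(z).permute(2, 0, 1)          # (H, L, L)
        attn = logits.softmax(dim=-1)
        s = s + self.out(torch.einsum("hij,jhd->ihd", attn, v).reshape(L, -1))
        s = s + self.seq_mlp(s)
        # Update the pair representation from an outer concatenation of sequence states.
        pair_in = torch.cat([s[:, None, :].expand(L, L, -1),
                             s[None, :, :].expand(L, L, -1)], dim=-1)
        z = z + self.to_pair(pair_in)
        return s, z
```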

Structure Module

  • Takes the output from the folding trunk and produces 3D atomic coordinates
  • An equivariant Transformer with invariant point attention (IPA), as proposed in AlphaFold2


Extra Info

HuggingFace Intro: https://huggingface.co/docs/transformers/main/en/model_doc/esm
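
A short example of loading an ESM-2 checkpoint through HuggingFace Transformers. The checkpoint name `facebook/esm2_t6_8M_UR50D` is one of the published small models; larger variants (e.g. `facebook/esm2_t33_650M_UR50D`) follow the same pattern.

```python
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(name)
model = EsmForMaskedLM.from_pretrained(name)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len + special tokens, vocab_size)

# Per-position argmax over the vocabulary (with no masking this should largely reproduce the input).
predicted_ids = logits.argmax(dim=-1)
print(tokenizer.decode(predicted_ids[0]))
```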

OpenFold: https://github.com/aqlaboratory/openfold/tree/main