Paper Review - Evolutionary Scale Modeling
ESM-1 (Evolutionary Scale Modeling 1)
- Inspired by character-level language models.
This paper explores self-supervised language modeling approaches that have demonstrated state-of-the-art performance on a range of natural language processing tasks, applying them to protein data in the form of unlabeled amino acid sequences.
Since protein sequences use a small vocabulary of twenty canonical elements, the modeling problem is closer to character-level language models than to word-level models. Like natural language, protein sequences also contain long-range dependencies, motivating the use of architectures that detect and model distant context.
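To make the character-level framing concrete, here is a minimal tokenization sketch; the vocabulary and special tokens are illustrative and do not reproduce ESM's actual alphabet.

```python
# Minimal sketch of character-level tokenization for protein sequences.
# The vocabulary and special tokens below are illustrative, not ESM's actual alphabet.
CANONICAL_AA = "ACDEFGHIKLMNPQRSTVWY"               # the 20 canonical amino acids
SPECIAL = ["<cls>", "<pad>", "<eos>", "<unk>", "<mask>"]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL + list(CANONICAL_AA))}

def tokenize(seq: str) -> list[int]:
    """Map each residue (a single character) to an integer id, BERT-style."""
    ids = [VOCAB["<cls>"]]
    ids += [VOCAB.get(aa, VOCAB["<unk>"]) for aa in seq.upper()]
    ids.append(VOCAB["<eos>"])
    return ids

print(tokenize("MKTAYIAKQR"))   # <cls> id, ten residue ids, <eos> id
```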
Method
Trained a Transformer on datasets of up to 250 million sequences from the UniParc database, which contains 86 billion amino acids.
- A BERT-like Deep Transformer
- Taking amino acid character sequences as input
- Trained with MLM objective
- The network is trained to predict the missing tokens from the corrupted sequence (a masking sketch follows below)
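A minimal sketch of the corruption step, reusing the illustrative vocabulary from the sketch above; the 15% rate and "always substitute the mask token" rule are simplifications of the BERT-style scheme.

```python
import random

MASK_ID = 4        # id of <mask> in the sketch vocabulary above
MASK_PROB = 0.15   # assumed BERT-style masking rate

def corrupt(token_ids: list[int]) -> tuple[list[int], list[int]]:
    """Return a corrupted copy of the sequence plus the masked positions.
    Simplification: always substitutes <mask>; the BERT scheme also sometimes
    inserts a random token or keeps the original token."""
    corrupted = list(token_ids)
    masked_positions = []
    for i in range(1, len(token_ids) - 1):          # skip <cls> / <eos>
        if random.random() < MASK_PROB:
            corrupted[i] = MASK_ID
            masked_positions.append(i)
    return corrupted, masked_positions
```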
| Model | Params | Training data | ECE |
|---|---|---|---|
| N-Gram Models | | | |
| 4-gram | – | UR50/S | 17.18 |
| LSTM | | | |
| Small | 28.4M | UR50/S | 14.42 |
| Large | 113.4M | UR50/S | 13.54 |
| Transformer | | | |
| 6-layer | 42.6M | UR50/S | 11.79 |
| 12-layer | 85.1M | UR50/S | 10.45 |
| Deep Transformer | | | |
| 34-layer | 669.2M | UR100 | 10.32 |
| 34-layer | 669.2M | UR50/S | 8.54 |
| 34-layer | 669.2M | UR50/D | 8.46 |
| Deep Transformer (reduced training data) | | | |
| 34-layer, 10% data | 669.2M | UR50/S | 10.99 |
| 34-layer, 1% data | 669.2M | UR50/S | 15.01 |
| 34-layer, 0.1% data | 669.2M | UR50/S | 17.50 |
- UR100: the low-diversity dataset uses the UniRef100 representative sequences
- UR50/S: the high-diversity sparse dataset uses the UniRef50 representative sequences
- UR50/D: the high-diversity dense dataset samples UniRef100 sequences evenly across the UniRef50 clusters (sketch below)
UniRef is used to create the three pre-training datasets with differing levels of diversity.
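As a rough illustration of the UR50/D construction, here is a sketch of sampling UniRef100 members evenly across UniRef50 clusters; the cluster-map data structure and uniform-sampling details are assumptions, not the paper's actual pipeline.

```python
import random

def sample_ur50_dense(ur50_clusters: dict[str, list[str]], n_samples: int) -> list[str]:
    """Toy UR50/D-style sampling: choose a UniRef50 cluster uniformly, then a
    UniRef100 member uniformly within it, so large clusters are not over-represented."""
    cluster_ids = list(ur50_clusters)
    samples = []
    for _ in range(n_samples):
        cid = random.choice(cluster_ids)                   # uniform over clusters
        samples.append(random.choice(ur50_clusters[cid]))  # uniform within the cluster
    return samples
```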
ESM-1:
- Uses SinusoidalPositionalEmbedding
ESM-1b:
- Uses LearnedPositionalEmbedding (see the sketch after this list)
- Layer Norm before the outputs
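A minimal sketch contrasting the two positional-encoding choices above (fixed sinusoidal table for ESM-1, trainable lookup table for ESM-1b); the dimensions and maximum length below are illustrative.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(seq_len: int, dim: int) -> torch.Tensor:
    """Fixed sinusoidal table (ESM-1 style). `dim` is assumed even."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    inv_freq = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    table = torch.zeros(seq_len, dim)
    table[:, 0::2] = torch.sin(pos * inv_freq)
    table[:, 1::2] = torch.cos(pos * inv_freq)
    return table

# ESM-1b style: a trainable lookup table, tied to a maximum sequence length.
learned_pos = nn.Embedding(num_embeddings=1024, embedding_dim=1280)  # sizes are illustrative
```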
MLM (Masked Language Modeling) Objective
- ESM models are trained with a masked language modeling (MLM) objective.
For each sequence we sample a set of indices to mask, replacing the true token at each index with the mask token.
For each masked token, we independently minimize the negative log likelihood of the true amino acid given the masked sequence as context.
Intuitively, to make a prediction for a masked position, the model must identify dependencies between the masked site and the unmasked parts of the sequence.
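Written out, the objective described above is (notation loosely follows the ESM-1 paper):

$$
\mathcal{L}_{\mathrm{MLM}} = \mathbb{E}_{x \sim X}\; \mathbb{E}_{M} \sum_{i \in M} -\log p\left(x_i \mid x_{/M}\right)
$$

where $M$ is the sampled set of masked indices and $x_{/M}$ is the sequence with the tokens at those positions replaced by the mask token.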
ECE (Exponentiated Cross Entropy) Metric
- The exponential of the model’s loss averaged per token.
- In the case of the Transformer this is the exponentiated per-token MLM loss, e.g. $e^{\mathcal{L}_{\mathrm{MLM}}}$ with the loss measured in nats.
- The ECE metric describes the mean uncertainty of the model among its set of options for every prediction, ranging from 1 for an ideal model to 25 (the number of unique amino acid tokens in the data) for a completely random prediction.
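A small sketch of computing ECE from predictions at the masked positions (PyTorch; assumes the loss is measured in nats):

```python
import torch
import torch.nn.functional as F

def exponentiated_cross_entropy(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """ECE = exp(mean per-token cross entropy): roughly the number of amino acids
    the model is still 'choosing among' at each masked position.
    logits: (num_masked_tokens, vocab_size); targets: (num_masked_tokens,)."""
    mean_nll = F.cross_entropy(logits, targets, reduction="mean")   # loss in nats
    return torch.exp(mean_nll).item()

# A uniform predictor over 25 amino-acid tokens gives ECE = 25; a perfect one gives 1.
```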
Downstream Prediction
On top of the sequence and pairwise features from the Transformer:
- Use a depth-32 residual network (ResNet) to predict secondary structure and binary residue-residue contacts (tertiary structure); a sketch follows below
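A rough sketch of such a contact head: per-residue features are tiled into an L x L pairwise map and refined by a small residual CNN. The channel width, the (much shallower than 32) depth, and the way pairwise features are formed here are assumptions for illustration; the paper also uses pairwise features derived from the Transformer itself.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One 2D residual block over the L x L pairwise feature map."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.conv2(self.act(self.conv1(x)))

class ContactHead(nn.Module):
    """Sketch of a contact predictor: per-residue features are tiled into an
    L x L pairwise map, then refined by a shallow ResNet."""
    def __init__(self, feat_dim: int, channels: int = 64, depth: int = 4):
        super().__init__()
        self.project = nn.Conv2d(2 * feat_dim, channels, kernel_size=1)
        self.blocks = nn.Sequential(*[ResBlock(channels) for _ in range(depth)])
        self.to_contact = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, seq_feats):                  # seq_feats: (B, L, feat_dim)
        B, L, D = seq_feats.shape
        rows = seq_feats.unsqueeze(2).expand(B, L, L, D)
        cols = seq_feats.unsqueeze(1).expand(B, L, L, D)
        pair = torch.cat([rows, cols], dim=-1).permute(0, 3, 1, 2)  # (B, 2D, L, L)
        x = self.blocks(self.project(pair))
        return self.to_contact(x).squeeze(1)       # (B, L, L) binary-contact logits
```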
ESM-2 (Evolutionary Scale Modeling 2) / ESMFold
Paper: Language models of protein sequences at the scale of evolution enable accurate structure prediction
Fast, end-to-end, atomic-resolution structure prediction:
- No MSA (multiple sequence alignment)
- No template search for related proteins (a step used in AlphaFold and RoseTTAFold)
Method
ESMFold is a fully end-to-end single sequence predictor.
- A folding head is trained on top of ESM-2
About ESM-2
- Like ESM-1, trained with the MLM (masked language modeling) objective
- Predicts the identity of randomly selected amino acids in a protein sequence by observing their context in the rest of the sequence
- RoPE (Rotary Position Embedding) is used instead of absolute positional encoding (see the sketch below)
- RoPE improved model quality only for small models; the improvement becomes insignificant as model size and training time scale up
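A minimal RoPE sketch: each pair of channels in a query or key vector is rotated by an angle proportional to the token position, so attention scores depend on relative offsets. The exact channel pairing and scaling used in ESM-2 may differ; this is the generic formulation.

```python
import torch

def rotary_embed(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings (RoPE) to queries or keys.
    x: (..., seq_len, dim) with dim even. Each channel pair (2i, 2i+1) is rotated
    by an angle that grows with the token's position."""
    seq_len, dim = x.shape[-2], x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()          # (seq_len, dim/2), broadcast over batch dims
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```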
Open question: what is used as the pair representation?
Folding Trunk
- A simplified Evoformer block (from AlphaFold2)
- The major change is to remove the dependence on MSAs.
Structure Module
- Takes the output of the folding trunk and outputs 3D atomic coordinates
- An equivariant Transformer with invariant point attention (IPA), as proposed in AlphaFold2 (a toy data-flow sketch follows below)
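A toy, shape-only sketch of the data flow described in this section (LM embeddings -> folding trunk -> structure module -> coordinates). None of the modules below reproduce the real ESMFold architecture; names and dimensions are placeholders.

```python
import torch
import torch.nn as nn

D_SEQ, D_PAIR, L = 64, 32, 10   # sequence-state width, pair-state width, protein length

class ToyTrunkBlock(nn.Module):
    """Stand-in for a simplified Evoformer block: updates the per-residue state
    and the L x L pair state, with no MSA track."""
    def __init__(self):
        super().__init__()
        self.seq_update = nn.Linear(D_SEQ, D_SEQ)
        self.pair_update = nn.Linear(D_PAIR, D_PAIR)

    def forward(self, seq_state, pair_state):
        return torch.relu(self.seq_update(seq_state)), torch.relu(self.pair_update(pair_state))

class ToyStructureModule(nn.Module):
    """Stand-in for the IPA-based structure module: maps the trunk output to 3D
    coordinates (here just one point per residue)."""
    def __init__(self):
        super().__init__()
        self.to_xyz = nn.Linear(D_SEQ, 3)

    def forward(self, seq_state, pair_state):
        return self.to_xyz(seq_state)               # (L, 3) coordinates

# Pipeline: language-model embeddings seed both states; no MSA or template inputs.
lm_embeddings = torch.randn(L, D_SEQ)               # would come from ESM-2
seq_state, pair_state = lm_embeddings, torch.randn(L, L, D_PAIR)
for block in [ToyTrunkBlock() for _ in range(4)]:
    seq_state, pair_state = block(seq_state, pair_state)
coords = ToyStructureModule()(seq_state, pair_state)
print(coords.shape)                                  # torch.Size([10, 3])
```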
Extra Info
HuggingFace Intro: https://huggingface.co/docs/transformers/main/en/model_doc/esm
OpenFold: https://github.com/aqlaboratory/openfold/tree/main
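A short usage sketch for loading an ESM-2 checkpoint via the HuggingFace transformers library linked above; the checkpoint name and exact API should be verified against the current documentation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "facebook/esm2_t6_8M_UR50D"                 # a small ESM-2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
inputs = tokenizer(seq, return_tensors="pt")       # residues tokenized character-wise
with torch.no_grad():
    logits = model(**inputs).logits                # (1, num_tokens, vocab_size)
print(logits.shape)
```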