Paper Review - ProteinDT
Domain Knowledge Terminologies
- Protein
  - a polymer chain built from the 20 types of amino acids
- Peptide
  - a short sequence of amino acids
- Fold
  - refers to the specific three-dimensional arrangement (shape and surface characteristics) of a polypeptide chain; crucial for the protein's function
- Ligand
  - any molecule that attaches to a specific site on a protein; when the ligand attaches, it can change how the protein works
- Affinity
  - the strength of the interaction between a ligand and its protein
- Primary Structure
  - the linear sequence of amino acids in a polypeptide chain
- Secondary Structure
  - localized conformations of the polypeptide chain, primarily alpha-helices and beta-sheets, stabilized by hydrogen bonds
- Tertiary Structure
  - the overall three-dimensional shape of a single polypeptide chain, formed by the folding and interactions of the secondary structures; this involves hydrogen bonds, disulfide bridges, hydrophobic interactions, and ionic interactions
- Quaternary Structure
  - the arrangement and interaction of multiple polypeptide chains (subunits) in a multi-subunit protein
- Residue
  - in the context of proteins, a residue refers to a single amino acid unit within the polymer chain
ProteinDT
Modalities: Protein Sequence (FASTA) and Textual Prompt
Key components of ProteinDT:
- ProteinCLAP Encoder
  - Trained on SwissProtCLAP, a text-protein pair dataset (441K pairs) from UniProt
  - ProtBERT for the protein sequence and SciBERT for the text sequence
- ProteinFacilitator (Gaussian model): text vector -> protein sequence representation
  - Trained as an augmented alignment module in addition to ProteinCLAP
- Conditional Generative Decoder
  - AR Transformer (T5) or Diffusion (ProteinDiff)
Part (d) shows the complete pipeline of ProteinDT.
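To make the data flow concrete, here is a minimal sketch of how the three components chain together at inference time. The function names and the `decoder.generate` interface are my own shorthand, not the released ProteinDT API.

```python
# Minimal text-to-protein sketch (my own naming, not the official ProteinDT API).
def text_to_protein(prompt, text_encoder, facilitator, decoder):
    z_text = text_encoder(prompt)            # SciBERT + Text2Latent: CLAP-aligned text embedding
    z_protein = facilitator(z_text)          # ProteinFacilitator: predicted protein representation
    sequence = decoder.generate(z_protein)   # T5 AR decoder or ProteinDiff: amino-acid sequence
    return sequence
```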
Training of ProteinDT
Preparing Dataset
- SwissProtCLAP (441K) -> 541,158 entries in total
- Contains a lot of very similar entries
I uploaded the dataset to https://huggingface.co/datasets/vinesmsuic/SwissProtCLAP
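To peek at the uploaded copy, something like the snippet below should work; the split name and the column layout are assumptions, so check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Pull the SwissProtCLAP mirror from the Hugging Face Hub.
# The split name ("train") and the column layout are assumptions; see the dataset card.
ds = load_dataset("vinesmsuic/SwissProtCLAP", split="train")
print(ds)      # row count and column names
print(ds[0])   # one text-protein pair
```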
ProteinCLAP Pre-training
Part (a) in the figure
- ProtBERT-BFD as the protein encoder, SciBERT as the text encoder
- 10 epochs, batch size = 9 to fine-tune both encoders + the 2 projection models
- EBM-NCE contrastive learning (ICLR 2022) is used as the loss function (sketched after this list)
- The Protein2Latent and Text2Latent models are simply linear projections
Then we can extract the Empty Sequence embedding from both projection models
- to form the pairwise representation
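A minimal sketch of an EBM-NCE-style contrastive objective over the projected embeddings, assuming dot-product scores and in-batch negatives; the paper's exact formulation and negative sampling may differ.

```python
import torch
import torch.nn.functional as F

def ebm_nce_loss(z_text, z_protein):
    """EBM-NCE-style contrastive loss (simplified sketch): positive text-protein
    pairs are pushed toward label 1, in-batch negatives toward label 0, using a
    dot-product energy and binary cross-entropy, symmetrized over both directions."""
    z_text = F.normalize(z_text, dim=-1)        # [B, D] from SciBERT + Text2Latent
    z_protein = F.normalize(z_protein, dim=-1)  # [B, D] from ProtBERT-BFD + Protein2Latent
    logits = z_text @ z_protein.T               # [B, B] pairwise scores
    labels = torch.eye(logits.size(0), device=logits.device)  # diagonal = positive pairs
    loss_t2p = F.binary_cross_entropy_with_logits(logits, labels)
    loss_p2t = F.binary_cross_entropy_with_logits(logits.T, labels)
    return 0.5 * (loss_t2p + loss_p2t)
```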
Learn the facilitator distribution (GaussianPrior)
Part (b) in the figure
Given the text representation, the facilitator predicts the protein representation
- Facilitator architecture: Linear-SiLU-Linear (same dimension in and out); a sketch follows this list
- 32 epochs, batch size = 5 to train the GaussianPrior
- MSE as the loss function
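A sketch of the Linear-SiLU-Linear facilitator and its MSE objective; the hidden dimension below is a placeholder, since the notes only say that input and output share the same dimension.

```python
import torch.nn as nn
import torch.nn.functional as F

class Facilitator(nn.Module):
    """Linear-SiLU-Linear facilitator sketch; `dim` is a placeholder value."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, z_text):
        # Map the CLAP text representation to a predicted protein representation.
        return self.net(z_text)

# Training objective for one batch of paired representations:
# loss = F.mse_loss(facilitator(z_text), z_protein)
```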
Learn the ProtT5-XL-BFD decoder distribution (or the ProteinDiff decoder)
Part (c) in the figure
Decode the protein representation back into an amino-acid sequence
- CrossEntropyLoss as the loss function (a generic sketch follows this list)
- The ProteinDiff decoder is just a few-layer MLP
- Uses an RNN / BERT score network
- 32 epochs, batch size = 8
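A generic teacher-forcing training step for the conditional autoregressive decoder, to illustrate the cross-entropy objective. The `decoder(condition=..., input_ids=...)` interface is an assumption, not the exact ProtT5-XL-BFD call.

```python
import torch.nn.functional as F

def ar_decoder_loss(decoder, z_protein, token_ids, pad_id=0):
    """Cross-entropy teacher-forcing sketch: condition on the protein
    representation and predict each next amino-acid token.
    The decoder interface below is an assumption."""
    logits = decoder(condition=z_protein, input_ids=token_ids[:, :-1])  # [B, L-1, vocab]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # flatten token positions
        token_ids[:, 1:].reshape(-1),          # next-token targets
        ignore_index=pad_id,                   # skip padding positions
    )
```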
Downstream Task: Protein Editing
Part (e) in the figure
Input:
- Initial protein sequence + target text prompt
Types of Editing:
- Structure Editing: Secondary Structure
- Stability Editing
- Peptide binding editing: Affinity
Method 1: Latent Interpolation
- intermediate latent = slerp(latent_text, latent_protein, theta)
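A small sketch of spherical linear interpolation (slerp) between the two latents; variable names match the line above, and theta in [0, 1] controls how far the edit moves from the text latent toward the protein latent. The actual implementation may normalize or weight differently.

```python
import torch
import torch.nn.functional as F

def slerp(latent_text, latent_protein, theta):
    """Spherical linear interpolation: theta=0 returns the text latent,
    theta=1 returns the protein latent, values in between trade off the two."""
    a = F.normalize(latent_text, dim=-1)
    b = F.normalize(latent_protein, dim=-1)
    # Angle between the two latents (clamped for numerical stability).
    omega = torch.arccos((a * b).sum(-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7))
    return (torch.sin((1 - theta) * omega) * latent_text
            + torch.sin(theta * omega) * latent_protein) / torch.sin(omega)

# The interpolated latent is then fed to the generative decoder to produce the edited sequence.
```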
Method 2: Latent Optimization
- learn a token-level latent code that is close to both the text and the protein sequence-level representations
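A rough sketch of what this latent optimization could look like, under my own simplifying assumptions (mean pooling over tokens and squared-distance terms); the paper's actual objective and similarity measures may differ.

```python
import torch

def optimize_token_latents(z_tokens_init, z_text, z_protein, steps=200, lr=0.05, lam=0.5):
    """Learn a token-level latent code [L, D] whose pooled summary stays close to
    both the target text embedding and the original protein embedding.
    Pooling, distances, and hyperparameters here are illustrative assumptions."""
    z = z_tokens_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        pooled = z.mean(dim=0)  # sequence-level summary of the token latents
        loss = (lam * (pooled - z_text).pow(2).sum()
                + (1 - lam) * (pooled - z_protein).pow(2).sum())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()  # decoded by the generative decoder into the edited protein sequence
```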
In fact, this protein editing simply amounts to finding the closest sample (in latent distance) in the model space given the new text condition.