Domain Knowledge Terminologies

  • Protein

    • a polymer chain built from the 20 standard types of amino acids
  • Peptide

    • a short sequence of amino acids
  • Fold

    • refers to the specific three-dimensional arrangement (shape and surface characteristics) of a polypeptide chain. Crucial for the protein’s function.
  • Ligand

    • Any molecule that attaches to a specific spot on a protein. When the ligand attaches, it can change how the protein works.
  • Affinity

    • The strength of the interaction between a ligand and its protein
  • Primary Structure

    • The linear sequence of amino acids in a polypeptide chain
  • Secondary Structure

    • Localized conformations of the polypeptide chain, primarily alpha-helices and beta-sheets, stabilized by hydrogen bonds
  • Tertiary Structure

    • The overall three-dimensional shape of a single polypeptide chain, formed by the folding and interactions of the secondary structures. This involves hydrogen bonds, disulfide bridges, hydrophobic interactions, and ionic interactions.
  • Quaternary Structure

    • The arrangement and interaction of multiple polypeptide chains (subunits) in a multi-subunit protein
  • Residue

    • In the context of proteins, a residue refers to a single amino acid unit within the polymer chain.

ProteinDT

Modalities: Protein Sequence (FASTA) and Textual Prompt

Key components of ProteinDT:

  • ProteinCLAP Encoder
    • Trained on SwissProtCLAP, a text-protein pair dataset (441K) from UniProt
  • ProtBERT for Protein seq and SciBERT for Text seq
  • ProteinFacilitator Gaussian model: Text vector -> Protein sequence representation
    • Trained as an augmented alignment module in addition to ProteinCLAP
  • Conditional Generative Decoder
    • AR Transformer (T5) or Diffusion (ProteinDiff)

Part (d) shows the complete pipeline of ProteinDT.
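
To make the data flow concrete, here is a minimal PyTorch-style sketch of the text-to-protein pipeline. Everything in it (module names, the 256-dim latent width, the stand-in encoder/decoder) is a placeholder of mine rather than the actual ProteinDT classes; it only illustrates how the three components chain together.

```python
import torch
import torch.nn as nn

DIM = 256  # placeholder latent width, not the paper's value

# Stand-ins for the real ProteinDT modules.
text_encoder = nn.Linear(768, DIM)                     # SciBERT + Text2Latent projection (stand-in)
facilitator = nn.Sequential(                           # ProteinFacilitator: text repr -> protein repr
    nn.Linear(DIM, DIM), nn.SiLU(), nn.Linear(DIM, DIM))
decode_protein = lambda z: "MKTAYIAKQR"                # conditional decoder (T5 / ProteinDiff), stand-in

text_repr = text_encoder(torch.randn(1, 768))          # 1. encode the textual prompt
protein_repr = facilitator(text_repr)                  # 2. predict the protein representation
sequence = decode_protein(protein_repr)                # 3. decode into an amino-acid sequence
print(sequence)
```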

Training of ProteinDT

Preparing Dataset

  • SwissProtCLAP (441K) -> 541,158 entries in total
    • Contains a lot of very similar entries

I uploaded the dataset to https://huggingface.co/datasets/vinesmsuic/SwissProtCLAP
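
If you want to inspect the pairs yourself, the Hugging Face datasets library can pull this mirror directly. I am not assuming particular split or column names below; print them and look.

```python
from datasets import load_dataset

# Pull the SwissProtCLAP mirror from the Hugging Face Hub.
ds = load_dataset("vinesmsuic/SwissProtCLAP")

print(ds)                          # available splits and row counts
split = next(iter(ds.values()))    # first split, without assuming its name
print(split.column_names)          # column names for the text / protein fields
print(split[0])                    # one text-protein pair
```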

ProteinCLAP Pre-training

Part (a) in the figure

  • ProtBERT-BFD as the protein encoder, SciBERT as the text encoder
  • 10 epochs, batch size = 9 to fine-tune both encoders + 2 projection models
  • EBM-NCE contrastive learning (ICLR 2022) is used as the loss function (a sketch follows this list)
  • The Protein2Latent and Text2Latent models are simply linear projections
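
For reference, EBM-NCE treats contrastive learning as binary classification: matched text-protein pairs should score high and shuffled (mismatched) pairs should score low, in both directions. Below is a minimal sketch of that objective on top of the projected representations; the dot-product energy and in-batch shuffled negatives are my simplifications, not the exact repository implementation.

```python
import torch
import torch.nn.functional as F

def ebm_nce_loss(protein_repr, text_repr):
    """EBM-NCE contrastive loss over a batch of aligned (protein, text) pairs.

    protein_repr, text_repr: (B, D) outputs of the Protein2Latent / Text2Latent
    projections. Positives are aligned rows; negatives are a random shuffle.
    """
    protein_repr = F.normalize(protein_repr, dim=-1)
    text_repr = F.normalize(text_repr, dim=-1)

    # Energy (similarity) of matched pairs and of shuffled, mismatched pairs.
    perm = torch.randperm(text_repr.size(0))
    pos = (protein_repr * text_repr).sum(dim=-1)
    neg_t = (protein_repr * text_repr[perm]).sum(dim=-1)
    neg_p = (protein_repr[perm] * text_repr).sum(dim=-1)

    # Binary NCE: push positives toward 1 and negatives toward 0, in both directions.
    loss_p2t = F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos)) \
             + F.binary_cross_entropy_with_logits(neg_t, torch.zeros_like(neg_t))
    loss_t2p = F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos)) \
             + F.binary_cross_entropy_with_logits(neg_p, torch.zeros_like(neg_p))
    return 0.5 * (loss_p2t + loss_t2p)

# Example with random embeddings (batch size 9, as in the training setup above).
loss = ebm_nce_loss(torch.randn(9, 256), torch.randn(9, 256))
print(loss.item())
```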

Then we can extract the Empty Sequence embedding from both projection models

  • To form a pairwise representation

Learn the facilitator distribution (GaussianPrior)

Part (b) in the figure
Given a text representation, the Facilitator predicts the corresponding protein representation

  • Facilitator architecture: Linear-SiLU-Linear (same dimension in and out); see the sketch after this list
  • 32 epochs, batch size = 5 to train the GaussianPrior
  • MSE as the loss function
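
A minimal sketch of the facilitator and its MSE objective, assuming a shared latent width of 256 (the real width, optimizer, and schedule come from the repository, not from me):

```python
import torch
import torch.nn as nn

class Facilitator(nn.Module):
    """Predicts a protein representation from a text representation (Linear-SiLU-Linear, same dim)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, text_repr):
        return self.net(text_repr)

dim = 256                                # placeholder latent width
facilitator = Facilitator(dim)
optimizer = torch.optim.Adam(facilitator.parameters(), lr=1e-4)

# One training step on a toy batch of paired (text, protein) representations
# produced by the frozen ProteinCLAP projection heads.
text_repr = torch.randn(5, dim)          # batch size 5, as noted above
protein_repr = torch.randn(5, dim)

pred = facilitator(text_repr)
loss = nn.functional.mse_loss(pred, protein_repr)   # MSE between predicted and true protein repr
loss.backward()
optimizer.step()
```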

Learn the ProtT5-XL-BFD Decoder distribution (or ProteinDiff Decoder)

Part (c) in the figure
Decode the protein representation back into an amino-acid sequence

  • CrossEntropyLoss as the loss function (a toy conditional-decoder sketch follows this list)
  • The ProteinDiff decoder is just an MLP with a few layers
  • Uses an RNN / BERT-style score network
  • 32 epochs, batch size = 8
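
The snippet below is not how ProtT5-XL-BFD is conditioned in the actual code; it is a toy stand-in showing the general pattern of training a representation-conditioned autoregressive decoder with CrossEntropyLoss, where the condition vector is fed in as a single-token memory for cross-attention.

```python
import torch
import torch.nn as nn

VOCAB = 25          # 20 amino acids + special tokens; placeholder size
DIM = 256           # must match the protein representation width

class ConditionalARDecoder(nn.Module):
    """Toy autoregressive decoder that cross-attends to the protein representation."""
    def __init__(self, vocab=VOCAB, dim=DIM):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens, protein_repr):
        # protein_repr: (B, DIM) -> used as a single-token "memory" for cross-attention.
        memory = protein_repr.unsqueeze(1)
        x = self.embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.decoder(x, memory, tgt_mask=mask)
        return self.head(h)

decoder = ConditionalARDecoder()
tokens = torch.randint(0, VOCAB, (8, 64))            # batch size 8, toy sequences
logits = decoder(tokens[:, :-1], torch.randn(8, DIM))
loss = nn.functional.cross_entropy(                  # teacher-forced next-token prediction
    logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
print(loss.item())
```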

Downstream Task: Protein Editing

Part (e) in the figure

Input:

  • Initial Protein seq + Target prompt

Types of Editing:

  • Structure Editing: Secondary Structure
  • Stability Editing
  • Peptide binding editing: Affinity

Method 1: Latent Interpolation

  • intermediate latent = slerp(latent_text, latent_protein, theta) (see the sketch below)
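
A small implementation sketch of slerp, where theta in [0, 1] is the interpolation weight (under this variable ordering, theta = 0 keeps the text latent and theta = 1 the protein latent; the naming is mine):

```python
import torch
import torch.nn.functional as F

def slerp(latent_text, latent_protein, theta):
    """Spherical linear interpolation between two latent vectors."""
    a = F.normalize(latent_text, dim=-1)
    b = F.normalize(latent_protein, dim=-1)
    # Angle between the two (normalized) latents, clamped for numerical safety.
    omega = torch.acos((a * b).sum(-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7))
    so = torch.sin(omega)
    return (torch.sin((1.0 - theta) * omega) / so) * latent_text \
         + (torch.sin(theta * omega) / so) * latent_protein

intermediate_latent = slerp(torch.randn(1, 256), torch.randn(1, 256), theta=0.5)
```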

Method 2: Latent Optimization

  • learn a token-level latent code that stays close to both the text representation and the protein's sequence-level representation (see the sketch below)
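
A rough sketch of what this latent optimization can look like: a token-level latent code is optimized so that its pooled version approaches the target text representation while staying near the original protein's token representations. The distance terms, mean pooling, and 0.5 weighting below are my assumptions for illustration, not the paper's exact objective.

```python
import torch

seq_len, dim = 64, 256                      # toy sizes
protein_tokens = torch.randn(seq_len, dim)  # token-level repr of the input protein (frozen)
text_repr = torch.randn(dim)                # sequence-level repr of the target prompt (frozen)

# Token-level latent code to optimize, initialized from the input protein.
z = protein_tokens.clone().requires_grad_(True)
optimizer = torch.optim.Adam([z], lr=1e-2)

for step in range(200):
    optimizer.zero_grad()
    # Stay close to the target text (via mean pooling) and to the original protein tokens.
    loss_text = torch.nn.functional.mse_loss(z.mean(dim=0), text_repr)
    loss_protein = torch.nn.functional.mse_loss(z, protein_tokens)
    loss = loss_text + 0.5 * loss_protein   # 0.5 is an arbitrary trade-off weight
    loss.backward()
    optimizer.step()

# z is then decoded back into an edited protein sequence with the generative decoder.
```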

In fact, this protein editing simply finds the sample in the model's latent space that is closest to the new text condition.