Paper Review - ProteinDT
Domain Knowledge Terminologies
- Protein
  - a polymer chain built from the 20 types of amino acids
- Peptide
  - a short sequence of amino acids
- Fold
  - refers to the specific three-dimensional arrangement (shape and surface characteristics) of a polypeptide chain; crucial for the protein's function
- Ligand
  - any molecule that attaches to a specific site on a protein; when the ligand attaches, it can change how the protein works
- Affinity
  - the strength of the interaction between a ligand and its protein
- Primary Structure
  - the linear sequence of amino acids in a polypeptide chain
- Secondary Structure
  - localized conformations of the polypeptide chain, primarily alpha-helices and beta-sheets, stabilized by hydrogen bonds
- Tertiary Structure
  - the overall three-dimensional shape of a single polypeptide chain, formed by the folding and interactions of the secondary structures; this involves hydrogen bonds, disulfide bridges, hydrophobic interactions, and ionic interactions
- Quaternary Structure
  - the arrangement and interaction of multiple polypeptide chains (subunits) in a multi-subunit protein
- Residue
  - in the context of proteins, a residue refers to a single amino acid unit within the polymer chain
ProteinDT
Modalities: Protein Sequence (FASTA) and Textual Prompt
Key components of ProteinDT:
- ProteinCLAP Encoder
  - Trained on SwissProtCLAP, a text-protein pair dataset (441K pairs) from UniProt
  - ProtBERT for the protein sequence and SciBERT for the text sequence
- ProteinFacilitator (Gaussian model): text vector -> protein sequence representation
  - Trained as an augmented alignment module in addition to ProteinCLAP
- Conditional Generative Decoder
  - AR Transformer (T5) or Diffusion (ProteinDiff)
Part (d) shows the complete pipeline of ProteinDT.
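To make the data flow concrete, here is a minimal sketch of how the three components chain together at inference time. The function names and the `decoder.generate` interface are my own shorthand, not the released ProteinDT API.

```python
# Minimal text-to-protein sketch (my own naming, not the official ProteinDT API).
def text_to_protein(prompt, text_encoder, facilitator, decoder):
    z_text = text_encoder(prompt)            # SciBERT + Text2Latent: CLAP-aligned text embedding
    z_protein = facilitator(z_text)          # ProteinFacilitator: predicted protein representation
    sequence = decoder.generate(z_protein)   # T5 AR decoder or ProteinDiff: amino-acid sequence
    return sequence
```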
Training of ProteinDT
Preparing Dataset
- SwissProtCLAP (441K) -> 541,158 entries in total
- Contains a lot of very similar entries
I uploaded the dataset to https://huggingface.co/datasets/vinesmsuic/SwissProtCLAP
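To peek at the uploaded copy, something like the snippet below should work; the split name and the column layout are assumptions, so check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Pull the SwissProtCLAP mirror from the Hugging Face Hub.
# The split name ("train") and the column layout are assumptions; see the dataset card.
ds = load_dataset("vinesmsuic/SwissProtCLAP", split="train")
print(ds)      # row count and column names
print(ds[0])   # one text-protein pair
```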
ProteinCLAP Pre-training
Part (a) in the figure
- ProtBERT-BFD as the protein encoder, SciBERT as the text encoder
- 10 epochs, batch size = 9 to fine-tune both encoders + the 2 projection models
- EBM-NCE contrastive learning (ICLR 2022) is used as the loss function (sketched after this list)
- The Protein2Latent and Text2Latent models are simply linear projections
Then we can extract the Empty Sequence embedding from both projection models
- to form the pairwise representation
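A minimal sketch of an EBM-NCE-style contrastive objective over the projected embeddings, assuming dot-product scores and in-batch negatives; the paper's exact formulation and negative sampling may differ.

```python
import torch
import torch.nn.functional as F

def ebm_nce_loss(z_text, z_protein):
    """EBM-NCE-style contrastive loss (simplified sketch): positive text-protein
    pairs are pushed toward label 1, in-batch negatives toward label 0, using a
    dot-product energy and binary cross-entropy, symmetrized over both directions."""
    z_text = F.normalize(z_text, dim=-1)        # [B, D] from SciBERT + Text2Latent
    z_protein = F.normalize(z_protein, dim=-1)  # [B, D] from ProtBERT-BFD + Protein2Latent
    logits = z_text @ z_protein.T               # [B, B] pairwise scores
    labels = torch.eye(logits.size(0), device=logits.device)  # diagonal = positive pairs
    loss_t2p = F.binary_cross_entropy_with_logits(logits, labels)
    loss_p2t = F.binary_cross_entropy_with_logits(logits.T, labels)
    return 0.5 * (loss_t2p + loss_p2t)
```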
Learn the facilitator distribution (GaussianPrior)
Part (b) in the figure
Given the text representation, the facilitator predicts the protein representation
- Facilitator architecture: Linear-SiLU-Linear (same dimension in and out); a sketch follows this list
- 32 epochs, batch size = 5 to train the GaussianPrior
- MSE as the loss function
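A sketch of the Linear-SiLU-Linear facilitator and its MSE objective; the hidden dimension below is a placeholder, since the notes only say that input and output share the same dimension.

```python
import torch.nn as nn
import torch.nn.functional as F

class Facilitator(nn.Module):
    """Linear-SiLU-Linear facilitator sketch; `dim` is a placeholder value."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, z_text):
        # Map the CLAP text representation to a predicted protein representation.
        return self.net(z_text)

# Training objective for one batch of paired representations:
# loss = F.mse_loss(facilitator(z_text), z_protein)
```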
Learn the ProtT5-XL-BFD decoder distribution (or the ProteinDiff decoder)
Part (c) in the figure
Decode the protein representation back into an amino-acid sequence
- CrossEntropyLoss as the loss function (a generic sketch follows this list)
- The ProteinDiff decoder is just a few-layer MLP
- Uses an RNN / BERT score network
- 32 epochs, batch size = 8
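A generic teacher-forcing training step for the conditional autoregressive decoder, to illustrate the cross-entropy objective. The `decoder(condition=..., input_ids=...)` interface is an assumption, not the exact ProtT5-XL-BFD call.

```python
import torch.nn.functional as F

def ar_decoder_loss(decoder, z_protein, token_ids, pad_id=0):
    """Cross-entropy teacher-forcing sketch: condition on the protein
    representation and predict each next amino-acid token.
    The decoder interface below is an assumption."""
    logits = decoder(condition=z_protein, input_ids=token_ids[:, :-1])  # [B, L-1, vocab]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # flatten token positions
        token_ids[:, 1:].reshape(-1),          # next-token targets
        ignore_index=pad_id,                   # skip padding positions
    )
```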
Downstream Task: Protein Editing
Part (e) in the figure
Input:
- Initial protein sequence + target text prompt
Types of Editing:
- Structure Editing: Secondary Structure
- Stability Editing
- Peptide binding editing: Affinity
Method 1: Latent Interpolation
- intermediate latent = slerp(latent_text, latent_protein, theta)
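A small sketch of spherical linear interpolation (slerp) between the two latents; variable names match the line above, and theta in [0, 1] controls how far the edit moves from the text latent toward the protein latent. The actual implementation may normalize or weight differently.

```python
import torch
import torch.nn.functional as F

def slerp(latent_text, latent_protein, theta):
    """Spherical linear interpolation: theta=0 returns the text latent,
    theta=1 returns the protein latent, values in between trade off the two."""
    a = F.normalize(latent_text, dim=-1)
    b = F.normalize(latent_protein, dim=-1)
    # Angle between the two latents (clamped for numerical stability).
    omega = torch.arccos((a * b).sum(-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7))
    return (torch.sin((1 - theta) * omega) * latent_text
            + torch.sin(theta * omega) * latent_protein) / torch.sin(omega)

# The interpolated latent is then fed to the generative decoder to produce the edited sequence.
```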
Method 2: Latent Optimization
- learn a token-level latent code that is close to both the text and the protein sequence-level representations
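A rough sketch of what this latent optimization could look like, under my own simplifying assumptions (mean pooling over tokens and squared-distance terms); the paper's actual objective and similarity measures may differ.

```python
import torch

def optimize_token_latents(z_tokens_init, z_text, z_protein, steps=200, lr=0.05, lam=0.5):
    """Learn a token-level latent code [L, D] whose pooled summary stays close to
    both the target text embedding and the original protein embedding.
    Pooling, distances, and hyperparameters here are illustrative assumptions."""
    z = z_tokens_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        pooled = z.mean(dim=0)  # sequence-level summary of the token latents
        loss = (lam * (pooled - z_text).pow(2).sum()
                + (1 - lam) * (pooled - z_protein).pow(2).sum())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()  # decoded by the generative decoder into the edited protein sequence
```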
In fact, this protein editing simply amounts to finding the closest sample (in latent distance) in the model space given the new text condition.