Paper Review - Pinal for De novo Protein Design
De novo Pinal (2025)
Paper: Toward De Novo Protein Design from Natural Language
Contribution
- Show that direct end-to-end text-to-sequence mapping is challenging due to the vast complexity of the protein sequence space
- T2struct model for text to protein structure translation, and SaProt-T for structure and text co-guided sequence generation
- Trained a 16B model on a huge dataset containing 1.7B text-protein pairs and 160B word tokens.
- …
Method
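Objective:
$p(s \mid t) = \sum_{c} p(c \mid t)\, p(s \mid c, t)$ (Bayes' theorem), where $t$ is the text, $c$ the structure, and $s$ the protein sequence.
A two-stage approach:
- Text -> Structure
  - T2struct to predict $p(c \mid t)$
- Text, Structure -> Protein Sequence
  - SaProt-T to predict $p(s \mid c, t)$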
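
To make the two-stage decomposition concrete, here is a toy sketch. All structure labels, sequences, and probabilities are made up for illustration and are not values from the paper. It marginalizes over candidate structures for one fixed text prompt and then samples a structure first and a sequence second.

```python
import random

# Hypothetical toy distributions for one fixed text prompt t.
# Stage 1 (the role T2struct plays): p(c | t) over candidate structures c.
p_struct_given_text = {"helix_bundle": 0.7, "beta_barrel": 0.3}

# Stage 2 (the role SaProt-T plays): p(s | c, t) over candidate sequences s.
p_seq_given_struct_text = {
    "helix_bundle": {"MKLAEE": 0.8, "GSVTVK": 0.2},
    "beta_barrel": {"MKLAEE": 0.1, "GSVTVK": 0.9},
}


def marginal_seq_prob(seq):
    """Compute p(s | t) = sum over c of p(c | t) * p(s | c, t)."""
    return sum(
        p_c * p_seq_given_struct_text[c].get(seq, 0.0)
        for c, p_c in p_struct_given_text.items()
    )


def sample_two_stage(rng):
    """Sample a structure first, then a sequence conditioned on it."""
    structs = list(p_struct_given_text)
    c = rng.choices(structs, weights=[p_struct_given_text[k] for k in structs])[0]
    seqs = list(p_seq_given_struct_text[c])
    s = rng.choices(seqs, weights=[p_seq_given_struct_text[c][k] for k in seqs])[0]
    return c, s
```

In the real model the sum over structures is intractable, so sampling in two stages is the practical route; the exact marginal is only computable here because the toy space is tiny.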
Architecture
- Text Encoder: PubMedBERT (2021)
- 3Di structural token decoder: GPT-2 architecture with 3Di token embeddings, enhanced by a text-informed layer in each block
- Text-informed layer: layer norm + cross attention + residual connection
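
The text-informed layer can be illustrated with a dependency-free sketch. This is a simplification under several assumptions: a single attention head, no learned projection matrices, and layer norm without learned scale or bias. The function names are hypothetical, and a real implementation would use a deep learning framework.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one hidden vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, text_vectors):
    """Single-head attention: each query attends over all text vectors."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in text_vectors
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, text_vectors)) for i in range(dim)]
        )
    return attended


def text_informed_layer(struct_hidden, text_hidden):
    """Layer norm, then cross attention to the text, then a residual add."""
    normed = [layer_norm(h) for h in struct_hidden]
    attended = cross_attention(normed, text_hidden)
    return [[x + a for x, a in zip(h, att)] for h, att in zip(struct_hidden, attended)]
```

The queries come from the 3Di token stream and the keys and values come from the text encoder output, which is how the text prompt can steer structure decoding at every block.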
Dataset Summary and Preparation
The dataset is composed of 1.7 billion protein-text pairs and 160 billion word tokens, sourced from:
- Swiss-Prot: A manually curated protein sequence database.
- UniRef50: A database containing more diverse protein sequences but with less annotation.
- ProTrek: Used for retrieving protein descriptions.
  - Not publicly available.
- InterProScan: For obtaining functional labels.
- AlphaFold Database: For proteins without annotations.
Preparation Steps
Pinal trains on 1.7 billion protein–text pairs, including both curated annotations and synthetically generated descriptions. The goal is to enable natural language-guided protein design, so Pinal needs protein sequences with meaningful textual descriptions of structure/function.
🧬 Overview: Data Sources
Source | Description | Purpose |
---|---|---|
Swiss-Prot | Manually curated protein database (~560K sequences) | High-quality base data |
UniRef50 | Large protein sequence database clustered at 50% identity | Broader protein diversity |
InterProScan | Tool for assigning functional annotations (keywords) | Functional labels |
AlphaFold DB | Predicted 3D protein structures | Structural diversity |
ProTrek | Used for retrieving protein descriptions. | More text-protein pairs |
Downloading Swiss-Prot
Description: A manually curated protein sequence database with high-quality annotations.
Website: UniProt
Download:
- Go to: https://www.uniprot.org/uniprotkb?query=reviewed:true
- Click Download → Choose FASTA (canonical) or XML/JSON if you need annotations.
- Or, use the command-line:
```bash
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
```
- `uniprot_sprot.fasta` only includes:
  - Protein sequence
  - Minimal description (1-line header)
- `uniprot_sprot.dat` includes:
  - More detailed information, including the curated annotations
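
To show how little the FASTA header carries, here is a small sketch that parses a UniProt-style header line with the standard library. The function name and regular expression are illustrative, and the example header in the usage note is only there to show the field layout.

```python
import re

# UniProt FASTA headers look like:
# >db|ACCESSION|ENTRY_NAME Protein name OS=Organism OX=TaxID [GN=Gene] PE=n SV=n
HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+(?P<rest>.*)$"
)


def parse_uniprot_header(header):
    """Split a UniProt FASTA header into accession, names, and organism."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"not a UniProt FASTA header: {header!r}")
    fields = match.groupdict()
    rest = fields.pop("rest")
    # The protein name is everything before the first " OS=" key.
    name, _, tail = rest.partition(" OS=")
    fields["protein_name"] = name
    # The organism runs until the next KEY= field.
    fields["organism"] = re.split(r" (?:OX|GN|PE|SV)=", tail)[0] if tail else ""
    return fields
```

For example, `parse_uniprot_header(">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2")` yields the accession, entry name, protein name, and organism, but no functional annotation, which is why the `.dat` file is needed for text descriptions.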
Downloading UniRef50
Description: Clustered database of proteins at 50% sequence identity. More diverse than Swiss-Prot.
Website: https://www.uniprot.org/uniref/
Download:
- UniRef50 protein sequences:
```bash
wget https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/uniref50.fasta.gz
```
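
Because the UniRef50 archive is large, it is usually better to stream it than to decompress it fully. The following is a minimal standard-library sketch of a streaming FASTA reader; the function name `iter_fasta` is hypothetical.

```python
import gzip


def iter_fasta(path):
    """Yield (header, sequence) pairs from a plain or gzip-compressed FASTA file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    header, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts; emit the previous one if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    # Emit the final record, which has no following header line.
    if header is not None:
        yield header, "".join(chunks)
```

Counting records then becomes `sum(1 for _ in iter_fasta("uniref50.fasta.gz"))`, which keeps only one record in memory at a time.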
Downloading InterProScan
Description: Tool/database for functional annotations: domains, families, GO terms, etc.
Website: https://www.ebi.ac.uk/interpro/download/InterProScan/
Download:
- Download and install InterProScan (requires Java, Perl, Python).
Downloading AlphaFold Protein Structure Database
Description: Predicted 3D protein structures for >200M proteins.
Website: https://alphafold.ebi.ac.uk/download
Download:
- Download by organism (e.g., Human, E. coli, etc.).
- Example command (note that this archive is the human proteome, UP000005640, not the full dataset):
```bash
wget https://ftp.ebi.ac.uk/pub/databases/alphafold/latest/UP000005640_9606_HUMAN.tar
```
- Structures are in `.pdb` or `.pdb.gz` format.
Downloading ProTrek + SwissProt-Enh / UniRef50-Enh
Description: Used for retrieving protein descriptions.
Website: https://github.com/westlake-repl/ProTrek
🛠️ Processing Steps
Component | How |
---|---|
SwissProt-Aug (4M) | Extract annotations from Swiss-Prot and rewrite using LLMs |
SwissProt-Enh (9M) | Use handcrafted prompts + LLMs to generate more text |
UniRef50-Enh (530M) | Use same LLM approach for UniRef50 proteins |
InterProScan keywords (791M) | Run InterProScan locally to extract functional labels |
AlphaFold structures | Download AlphaFold predictions for sequences |
UniRef50-ProTrek (400M) | Approximate by retrieving captions with ProTrek |
🧾 SwissProt-Annot
- Extract structured annotations from Swiss-Prot entries (e.g., function, localization).
- Use sentence templates to convert structured data into natural language, and then paraphrase it with LLM.
```python
from Bio import SwissProt
```
Output:
```
Total number of records: 572970
```
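
The extract-then-template step can be sketched without any dependencies by reading the `CC   -!- TOPIC:` comment lines of a Swiss-Prot flat-file entry directly. This is an illustrative sketch, not the paper's pipeline: the sentence templates, function names, and the toy entry are all made up.

```python
# Hypothetical sentence templates keyed by Swiss-Prot comment topic.
TEMPLATES = {
    "FUNCTION": "This protein has the following function: {text}",
    "SUBCELLULAR LOCATION": "Its subcellular location is: {text}",
}


def extract_comments(entry_text):
    """Collect 'CC   -!- TOPIC: text' comment blocks from one flat-file entry."""
    comments = {}
    topic = None
    for line in entry_text.splitlines():
        if not line.startswith("CC   "):
            topic = None
            continue
        body = line[5:]
        if body.startswith("-!- "):
            # A new comment block: split the topic from its first text line.
            topic, _, text = body[4:].partition(": ")
            comments[topic] = text.strip()
        elif topic is not None and body.startswith("    "):
            # Continuation lines are indented further under the same topic.
            comments[topic] += " " + body.strip()
    return comments


def to_sentences(comments):
    """Fill each known template with the matching annotation text."""
    return [
        template.format(text=comments[topic])
        for topic, template in TEMPLATES.items()
        if topic in comments
    ]


# Made-up entry fragment used only to show the line layout.
TOY_ENTRY = (
    "ID   TOY_ENTRY               Reviewed;          10 AA.\n"
    "CC   -!- FUNCTION: Binds a toy ligand in this\n"
    "CC       made-up example.\n"
    "CC   -!- SUBCELLULAR LOCATION: Cytoplasm.\n"
)
```

Calling `to_sentences(extract_comments(TOY_ENTRY))` produces one sentence per annotated topic; those sentences are what an LLM would then paraphrase.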
✍️ SwissProt-Aug
- Rewrite SwissProt-Annot with LLMs to make biologist-friendly descriptions.
- Result: 4M protein-text pairs
Just use an LLM to paraphrase it.
🔍 UniRef50-ProTrek
- Use ProTrek to retrieve 10 relevant natural language captions per UniRef50 protein.
- Result: 400M protein-text pairs
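
The retrieval idea can be sketched as a top-k nearest-neighbour search over caption embeddings. This does not use ProTrek's actual API: the embeddings are hypothetical placeholder vectors, and the function names are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_captions(protein_vec, caption_index, k=10):
    """Return the k captions whose embeddings are closest to the protein embedding.

    caption_index is a list of (caption_text, embedding) pairs.
    """
    ranked = sorted(
        caption_index,
        key=lambda item: cosine(protein_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]
```

With `k=10` this matches the ten captions per protein described above; at UniRef50 scale an approximate nearest-neighbour index would replace the full sort.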
📦 Final Dataset Composition
Dataset | Size | Notes |
---|---|---|
SwissProt-Aug | 4M | Comes from SwissProt-Annot via LLM rewriting |
UniRef50-ProTrek | 400M | Retrieved captions via ProTrek [Not Available] |
SwissProt/UniRef50-Enh | 539M | LLM-generated text |
InterPro/AlphaFold | 791M | Keyword-based annotations |
Total | ~1.7B | 160B text tokens, 60B amino acids |
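
As a quick arithmetic check, the component sizes in the table can be summed and compared with the stated total. The dictionary keys below simply mirror the table rows.

```python
# Component sizes in millions of protein-text pairs, copied from the table.
sizes_millions = {
    "SwissProt-Aug": 4,
    "UniRef50-ProTrek": 400,
    "SwissProt/UniRef50-Enh": 539,
    "InterPro/AlphaFold": 791,
}

total_millions = sum(sizes_millions.values())

# Average text length per pair, using the stated 160B text tokens.
tokens_per_pair = 160e9 / (total_millions * 1e6)
```

The components add up to 1,734M pairs, consistent with the rounded ~1.7B, and the token count works out to a little over 90 text tokens per pair on average.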
🧹 Splitting and Clustering
- Cluster sequences by 50% sequence identity.
- Split into:
- Training set
- Validation set (1,142 clusters, 5,165 proteins)
- Test set (572 clusters, 3,304 proteins)
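
Splitting at the cluster level means every member of a cluster lands in the same split, so no validation or test protein has a close homolog in training. Here is a minimal sketch, assuming clusters have already been computed; the random assignment and the function name are assumptions, since the notes do not say how clusters were chosen for each split.

```python
import random


def split_by_cluster(cluster_to_members, n_val, n_test, seed=0):
    """Assign whole clusters to train/val/test so splits never share a cluster."""
    cluster_ids = sorted(cluster_to_members)
    if n_val + n_test > len(cluster_ids):
        raise ValueError("not enough clusters for the requested split sizes")
    random.Random(seed).shuffle(cluster_ids)

    val_ids = cluster_ids[:n_val]
    test_ids = cluster_ids[n_val:n_val + n_test]
    train_ids = cluster_ids[n_val + n_test:]

    def gather(ids):
        # Flatten the members of the selected clusters into one list.
        return [member for cid in ids for member in cluster_to_members[cid]]

    return {"train": gather(train_ids), "val": gather(val_ids), "test": gather(test_ids)}
```

With the figures above, `n_val` would be 1,142 and `n_test` would be 572, and the protein counts in each split then follow from the cluster sizes.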
🧮 Structure Representation
- Use Foldseek to convert 3D structures into 3Di discrete tokens.
🧰 Final Training Data Format
- Text → Structure (T2Struct):
  - Input: Natural language
  - Output: Sequence of Foldseek tokens
- Structure + Text → Sequence (SaProt-T):
  - Input: Structure tokens + pooled text embedding
  - Output: Amino acid sequence
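
The two training formats can be written down as simple record types. This is a sketch with hypothetical class and field names; it relies only on the fact that Foldseek's 3Di encoding assigns one structure token per residue, so the structure string and the amino acid sequence must have equal length.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class T2StructExample:
    """Stage 1 pair: text in, 3Di structure tokens out."""
    text: str
    structure_tokens: str  # one 3Di letter per residue


@dataclass
class SaProtTExample:
    """Stage 2 triple: structure tokens plus text embedding in, sequence out."""
    structure_tokens: str
    text_embedding: List[float]  # pooled text embedding
    sequence: str

    def __post_init__(self):
        # 3Di is per-residue, so both strings must align position by position.
        if len(self.structure_tokens) != len(self.sequence):
            raise ValueError("structure tokens and sequence must have equal length")
```

A toy instance would be `SaProtTExample(structure_tokens="DVV", text_embedding=[0.1, 0.2], sequence="MKL")`, where all values are placeholders.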
Summary
Pinal’s data pipeline is a multi-source, multi-stage process:
- Use Swiss-Prot and UniRef50 for sequences.
- Annotate with LLMs and ProTrek.
- Enhance text with prompt engineering.
- Represent structure using Foldseek tokens.