De novo Pinal (2025)

Paper: Toward De Novo Protein Design from Natural Language

Contribution

  • Show that direct end-to-end text-to-sequence mapping is challenging due to the vast complexity of the protein sequence space
  • T2struct model for text-to-structure translation, and SaProt-T for structure- and text-co-guided sequence generation
  • Trained a 16B-parameter model on a dataset of 1.7B text-protein pairs and 160B word tokens.

Method

(Figure: Pinal method overview.)

Objective:

p(s | t) = p(s, c | t), where t is the text description, c the protein structure (as 3Di tokens), and s the amino-acid sequence.

p(s, c | t) = p(c | t) · p(s | c, t)  (chain rule of conditional probability)

A two-stage approach (see the sketch after this list):

  • Text -> Structure
    • T2struct predicts p(c | t)
  • Text, Structure -> Protein Sequence
    • SaProt-T predicts p(s | c, t)
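
The factorization maps directly onto a generate-then-decode loop at inference time. Below is a minimal sketch of chaining the two stages, assuming hypothetical `T2Struct` and `SaProtT` wrappers with a `sample` method; this is illustrative, not the released API.

```python
# Hypothetical wrappers standing in for the paper's two models; only the control flow matters here.

def design_proteins(text: str, t2struct, saprot_t, n_candidates: int = 4) -> list[str]:
    """Two-stage de novo design: text -> 3Di structure tokens -> amino-acid sequence."""
    candidates = []
    for _ in range(n_candidates):
        structure_tokens = t2struct.sample(text)            # stage 1: c ~ p(c | t)
        sequence = saprot_t.sample(structure_tokens, text)   # stage 2: s ~ p(s | c, t)
        candidates.append(sequence)
    return candidates
```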

Architecture

  • Text Encoder: PubMedBERT (2021)
  • 3Di structural token decoder: GPT-2 architecture with 3Di token embeddings, enhanced by a text-informed layer in each block
    • Text-informed layer: layer norm + cross-attention + residual connection (see the sketch below)
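
A minimal PyTorch sketch of what such a text-informed layer could look like; module names and dimensions are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class TextInformedLayer(nn.Module):
    """Pre-norm cross-attention from 3Di decoder states to text-encoder states, plus a residual add."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, decoder_states: torch.Tensor, text_states: torch.Tensor) -> torch.Tensor:
        # decoder_states: (batch, n_3di_tokens, d); text_states: (batch, n_text_tokens, d) from the text encoder
        attended, _ = self.cross_attn(self.norm(decoder_states), text_states, text_states)
        return decoder_states + attended  # residual connection
```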

Dataset Summary and Preparation

The dataset is composed of 1.7 billion protein-text pairs and 160 billion word tokens, sourced from:

  • Swiss-Prot: A manually curated protein sequence database.

  • UniRef50: A database containing more diverse protein sequences but with less annotation.

  • ProTrek: Used for retrieving protein descriptions.

    • The retrieved caption pairs are not publicly available.
  • InterProScan: For obtaining functional labels.

  • AlphaFold Database: For proteins without annotations.

Preparation Steps

Pinal trains on 1.7 billion protein–text pairs, including both curated annotations and synthetically generated descriptions. The goal is to enable natural language-guided protein design, so Pinal needs protein sequences with meaningful textual descriptions of structure/function.

🧬 Overview: Data Sources

| Source | Description | Purpose |
| --- | --- | --- |
| Swiss-Prot | Manually curated protein database (~560K sequences) | High-quality base data |
| UniRef50 | Large protein sequence database clustered at 50% identity | Broader protein diversity |
| InterProScan | Tool for assigning functional annotations (keywords) | Functional labels |
| AlphaFold DB | Predicted 3D protein structures | Structural diversity |
| ProTrek | Used for retrieving protein descriptions | More text-protein pairs |

Downloading Swiss-Prot

Description: A manually curated protein sequence database with high-quality annotations.

Website: UniProt

Download:

```shell
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
gunzip uniprot_sprot.dat.gz
# Or, if you want the FASTA version:
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gunzip uniprot_sprot.fasta.gz
```
  • uniprot_sprot.fasta only includes
    • Protein sequence
    • Minimal description (1-line header)
  • uniprot_sprot.dat includes
    • More detailed info

Downloading UniRef50

Description: Clustered database of proteins at 50% sequence identity. More diverse than Swiss-Prot.

Website: https://www.uniprot.org/uniref/

Download:

  • UniRef50 protein sequences:

    ```shell
    wget https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/uniref50.fasta.gz
    ```

Downloading InterProScan

Description: Tool/database for functional annotations: domains, families, GO terms, etc.

Website: https://www.ebi.ac.uk/interpro/download/InterProScan/

Download:

```shell
# Download & install (requires Java, Perl, Python)
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.66-98.0/interproscan-5.66-98.0-64-bit.tar.gz
tar -xzf interproscan-5.66-98.0-64-bit.tar.gz
cd interproscan-*
./interproscan.sh -i example.fasta -f tsv
```
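
Once InterProScan has produced a TSV, the functional keywords per protein can be collected from the description columns. A rough sketch, assuming the standard tab-separated output layout (column 6 = signature description, column 13 = InterPro description); the per-protein grouping here is my own, not the paper's exact recipe.

```python
import csv
from collections import defaultdict

def keywords_from_interproscan(tsv_path: str) -> dict[str, list[str]]:
    """Collect functional description strings per protein from an InterProScan TSV file."""
    keywords = defaultdict(set)
    with open(tsv_path) as handle:
        for row in csv.reader(handle, delimiter="\t"):
            protein_id = row[0]
            for desc in (row[5] if len(row) > 5 else "", row[12] if len(row) > 12 else ""):
                if desc and desc != "-":
                    keywords[protein_id].add(desc)
    return {pid: sorted(kws) for pid, kws in keywords.items()}
```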

Downloading AlphaFold Protein Structure Database

Description: Predicted 3D protein structures for >200M proteins.

Website: https://alphafold.ebi.ac.uk/download

Download:

  • Download by organism (e.g., Human, E. coli, etc.). For example, the human proteome:

    ```shell
    wget https://ftp.ebi.ac.uk/pub/databases/alphafold/latest/UP000005640_9606_HUMAN.tar
    ```
  • Structures are in .pdb or .pdb.gz format.

Downloading ProTrek + SwissProt-Enh / UniRef50-Enh

Description: Used for retrieving protein descriptions.

Website: https://github.com/westlake-repl/ProTrek

🛠️ Processing Steps

| Component | How |
| --- | --- |
| SwissProt-Aug (4M) | Extract annotations from Swiss-Prot and rewrite using LLMs |
| SwissProt-Enh (9M) | Use handcrafted prompts + LLMs to generate more text |
| UniRef50-Enh (530M) | Use the same LLM approach for UniRef50 proteins |
| InterProScan keywords (791M) | Run InterProScan locally to extract functional labels |
| AlphaFold structures | Download AlphaFold predictions for sequences |
| UniRef50-ProTrek (400M) | Approximate by retrieving captions (original pairs not released) |

🧾 SwissProt-Annot

  • Extract structured annotations from Swiss-Prot entries (e.g., function, localization).
  • Use sentence templates to convert structured data into natural language, then paraphrase it with an LLM (a template sketch follows the parsing example below).
```python
from Bio import SwissProt

# Counting consumes the parser's iterator, so re-open the file before iterating again.
with open("uniprot_sprot.dat") as handle:
    count = sum(1 for _ in SwissProt.parse(handle))
print(f"Total number of records: {count}")

# Inspect the first 5 records.
with open("uniprot_sprot.dat") as handle:
    for idx, record in enumerate(SwissProt.parse(handle)):
        print(record.entry_name)
        print(record.description)
        print(record.sequence)
        print(record.organism)
        print(record.comments)
        print(record.gene_name)
        if idx >= 4:
            break
```
```
Total number of records: 572970
001R_FRG3G
RecName: Full=Putative transcription factor 001R;
MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLDAKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDDSFRKIYTDLGWKFTPL
Frog virus 3 (isolate Goorha) (FV-3).
['FUNCTION: Transcription activation. {ECO:0000305}.']
[{'ORFNames': ['FV3-001R']}]
...
```
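
Given records like the one above, the structured fields can be turned into plain-English sentences before any LLM paraphrasing. A minimal sketch with made-up templates (the paper's actual templates are not shown here):

```python
from Bio import SwissProt

def record_to_text(record) -> str:
    """Render a Swiss-Prot record as a short natural-language description (illustrative templates)."""
    sentences = [f"{record.entry_name} is a protein from {record.organism.rstrip('.')}."]
    for comment in record.comments:
        if comment.startswith("FUNCTION:"):
            function = comment.removeprefix("FUNCTION:").split("{")[0].strip()
            sentences.append(f"Its known function: {function}")
        elif comment.startswith("SUBCELLULAR LOCATION:"):
            location = comment.removeprefix("SUBCELLULAR LOCATION:").split("{")[0].strip()
            sentences.append(f"It localizes to {location}")
    return " ".join(sentences)

with open("uniprot_sprot.dat") as handle:
    print(record_to_text(next(SwissProt.parse(handle))))
```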

✍️ SwissProt-Aug

  • Rewrite SwissProt-Annot with LLMs to make biologist-friendly descriptions.
  • Result: 4M protein-text pairs

Just use an LLM to paraphrase it.
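
For instance, a sketch of the paraphrasing step using the OpenAI Python client as a stand-in; the paper does not specify this provider, model, or prompt, so treat all three as placeholders.

```python
from openai import OpenAI  # placeholder provider; any instruction-tuned LLM would do

client = OpenAI()

def paraphrase_annotation(annotation_text: str) -> str:
    """Rewrite a templated Swiss-Prot annotation into a fluent, biologist-friendly description."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "Rewrite protein annotations as concise, fluent descriptions."},
            {"role": "user", "content": annotation_text},
        ],
    )
    return response.choices[0].message.content
```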

🔍 UniRef50-ProTrek

  • Use ProTrek to retrieve 10 relevant natural language captions per UniRef50 protein.
  • Result: 400M protein-text pairs
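
A purely schematic view of retrieving captions by text-protein embedding similarity; this is not ProTrek's actual API, and every name below is a placeholder.

```python
import numpy as np

def retrieve_captions(protein_embedding: np.ndarray,
                      caption_embeddings: np.ndarray,
                      captions: list[str],
                      top_k: int = 10) -> list[str]:
    """Return the top-k captions most similar to the protein embedding (cosine similarity)."""
    p = protein_embedding / np.linalg.norm(protein_embedding)
    c = caption_embeddings / np.linalg.norm(caption_embeddings, axis=1, keepdims=True)
    scores = c @ p                        # cosine similarity, one score per caption
    best = np.argsort(scores)[::-1][:top_k]
    return [captions[i] for i in best]
```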

🧠 SwissProt-Enh & UniRef50-Enh

  • Use custom prompts + LLMs to generate new captions.
  • SwissProt-Enh: 9M pairs
  • UniRef50-Enh: 530M pairs

🧪 InterProScan & AlphaFold

  • Use InterProScan to assign functional keywords to proteins.
  • Use AlphaFold to get structures for proteins without annotations.
  • Result: 791M keyword-protein pairs

```shell
interproscan.sh -i input.fasta -f tsv -dp
```

📦 Final Dataset Composition

| Dataset | Size | Notes |
| --- | --- | --- |
| SwissProt-Aug | 4M | From SwissProt-Annot via LLM rewriting |
| UniRef50-ProTrek | 400M | Retrieved captions via ProTrek [Not Available] |
| SwissProt/UniRef50-Enh | 539M | LLM-generated text |
| InterPro/AlphaFold | 791M | Keyword-based annotations |
| Total | ~1.7B | 160B text tokens, 60B amino acids |

🧹 Splitting and Clustering

  • Cluster sequences at 50% sequence identity (so homologous proteins don't leak across splits).
  • Split into:
    • Training set
    • Validation set (1,142 clusters, 5,165 proteins)
    • Test set (572 clusters, 3,304 proteins)
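
A minimal sketch of the cluster-level split, assuming `clusters` maps a cluster representative to its member protein IDs (e.g. parsed from MMseqs2 or CD-HIT output); the tool choice and exact counts here are placeholders.

```python
import random

def split_by_cluster(clusters: dict[str, list[str]],
                     n_val: int = 1142, n_test: int = 572, seed: int = 0):
    """Assign whole clusters to validation/test so homologous sequences never cross splits."""
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)
    val_ids = set(cluster_ids[:n_val])
    test_ids = set(cluster_ids[n_val:n_val + n_test])

    train, val, test = [], [], []
    for cid, members in clusters.items():
        bucket = val if cid in val_ids else test if cid in test_ids else train
        bucket.extend(members)
    return train, val, test
```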

🧮 Structure Representation

  • Use Foldseek to convert 3D structures into 3Di discrete tokens.
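
A thin wrapper around Foldseek's `structureto3didescriptor` module can do this conversion in bulk; this assumes a recent Foldseek build with that module available on PATH, and the output handling is simplified.

```python
import subprocess

def structures_to_3di(pdb_dir: str, out_tsv: str = "3di_tokens.tsv") -> str:
    """Convert a directory of PDB/mmCIF files into 3Di descriptor strings with Foldseek.

    Writes a TSV with one row per chain containing its amino-acid and 3Di sequences.
    """
    subprocess.run(
        ["foldseek", "structureto3didescriptor", pdb_dir, out_tsv],
        check=True,
    )
    return out_tsv
```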

🧰 Final Training Data Format

  • Text → Structure (T2Struct):

    • Input: Natural language
    • Output: Sequence of Foldseek tokens
  • Structure + Text → Sequence (SaProt-T):

    • Input: Structure tokens + pooled text embedding
    • Output: Amino acid sequence
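
Concretely, one way the two tasks could be laid out as (input, target) pairs; the field names and truncated example strings are invented for illustration, not taken from the paper.

```python
# Illustrative training examples for the two stages (all values are placeholders).

t2struct_example = {
    "input_text": "A small heat-stable enzyme that hydrolyzes ester bonds in lipids.",
    "target_3di_tokens": "dpqvvcvvnnpd...",   # Foldseek 3Di string, truncated
}

saprot_t_example = {
    "input_3di_tokens": "dpqvvcvvnnpd...",
    "input_text_embedding": "<pooled PubMedBERT embedding of the description>",
    "target_sequence": "MKVLAAGLLLVA...",     # amino-acid sequence, truncated
}
```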

Summary

Pinal’s data pipeline is a multi-source, multi-stage process:

  1. Use Swiss-Prot and UniRef50 for sequences.
  2. Annotate with LLMs and ProTrek.
  3. Enhance text with prompt engineering.
  4. Represent structure using Foldseek tokens.