De novo Pinal (2025)

Paper: Toward De Novo Protein Design from Natural Language

Contribution

  • Show that direct end-to-end text-to-sequence mapping is challenging due to the vast complexity of the protein sequence space
  • T2struct model for text-to-structure translation, and SaProt-T for structure- and text-co-guided sequence generation
  • Trained a 16B-parameter model on a dataset of 1.7B text-protein pairs and 160B word tokens.

Method

(Figure: Pinal method overview.)

Objective:

p(s | t) = p(s, c | t), where t is the text description, c the protein structure (as 3Di tokens), and s the amino-acid sequence.

p(s, c | t) = p(c | t) · p(s | c, t)  (chain rule of conditional probability)

A two-stage approach (see the sketch after this list):

  • Text -> Structure
    • T2struct predicts p(c | t)
  • Text, Structure -> Protein Sequence
    • SaProt-T predicts p(s | c, t)
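
The factorization maps directly onto a generate-then-decode loop at inference time. Below is a minimal sketch of chaining the two stages, assuming hypothetical `T2Struct` and `SaProtT` wrappers with a `sample` method; this is illustrative, not the released API.

```python
# Hypothetical wrappers standing in for the paper's two models; only the control flow matters here.

def design_proteins(text: str, t2struct, saprot_t, n_candidates: int = 4) -> list[str]:
    """Two-stage de novo design: text -> 3Di structure tokens -> amino-acid sequence."""
    candidates = []
    for _ in range(n_candidates):
        structure_tokens = t2struct.sample(text)            # stage 1: c ~ p(c | t)
        sequence = saprot_t.sample(structure_tokens, text)   # stage 2: s ~ p(s | c, t)
        candidates.append(sequence)
    return candidates
```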

Architecture

  • Text Encoder: PubMedBERT (2021)
  • 3Di structural token decoder: GPT-2 architecture with 3Di token embeddings, enhanced by a text-informed layer in each block
    • Text-informed layer: layer norm + cross-attention + residual connection (see the sketch below)
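
A minimal PyTorch sketch of what such a text-informed layer could look like; module names and dimensions are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class TextInformedLayer(nn.Module):
    """Pre-norm cross-attention from 3Di decoder states to text-encoder states, plus a residual add."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, decoder_states: torch.Tensor, text_states: torch.Tensor) -> torch.Tensor:
        # decoder_states: (batch, n_3di_tokens, d); text_states: (batch, n_text_tokens, d) from the text encoder
        attended, _ = self.cross_attn(self.norm(decoder_states), text_states, text_states)
        return decoder_states + attended  # residual connection
```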

Dataset Summary and Preparation

The dataset is composed of 1.7 billion protein-text pairs and 160 billion word tokens, sourced from:

  • Swiss-Prot: A manually curated protein sequence database.

  • UniRef50: A database containing more diverse protein sequences but with less annotation.

  • ProTrek: Used for retrieving protein descriptions.

    • The retrieved caption pairs are not publicly available.
  • InterProScan: For obtaining functional labels.

  • AlphaFold Database: For proteins without annotations.

Preparation Steps

Pinal trains on 1.7 billion protein–text pairs, including both curated annotations and synthetically generated descriptions. The goal is to enable natural language-guided protein design, so Pinal needs protein sequences with meaningful textual descriptions of structure/function.

🧬 Overview: Data Sources

| Source | Description | Purpose |
| --- | --- | --- |
| Swiss-Prot | Manually curated protein database (~560K sequences) | High-quality base data |
| UniRef50 | Large protein sequence database clustered at 50% identity | Broader protein diversity |
| InterProScan | Tool for assigning functional annotations (keywords) | Functional labels |
| AlphaFold DB | Predicted 3D protein structures | Structural diversity |
| ProTrek | Used for retrieving protein descriptions | More text-protein pairs |

Downloading Swiss-Prot

Description: A manually curated protein sequence database with high-quality annotations.

Website: UniProt

Download:

```shell
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
gunzip uniprot_sprot.dat.gz
# Or, if you want the FASTA version:
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gunzip uniprot_sprot.fasta.gz
```
  • uniprot_sprot.fasta only includes
    • Protein sequence
    • Minimal description (1-line header)
  • uniprot_sprot.dat includes
    • More detailed info

Downloading UniRef50

Description: Clustered database of proteins at 50% sequence identity. More diverse than Swiss-Prot.

Website: https://www.uniprot.org/uniref/

Download:

  • UniRef50 protein sequences:

    ```shell
    wget https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/uniref50.fasta.gz
    ```

Downloading InterProScan

Description: Tool/database for functional annotations: domains, families, GO terms, etc.

Website: https://www.ebi.ac.uk/interpro/download/InterProScan/

Download:

```shell
# Download & install (requires Java, Perl, Python)
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.66-98.0/interproscan-5.66-98.0-64-bit.tar.gz
tar -xzf interproscan-5.66-98.0-64-bit.tar.gz
cd interproscan-*
./interproscan.sh -i example.fasta -f tsv
```
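
Once InterProScan has produced a TSV, the functional keywords per protein can be collected from the description columns. A rough sketch, assuming the standard tab-separated output layout (column 6 = signature description, column 13 = InterPro description); the per-protein grouping here is my own, not the paper's exact recipe.

```python
import csv
from collections import defaultdict

def keywords_from_interproscan(tsv_path: str) -> dict[str, list[str]]:
    """Collect functional description strings per protein from an InterProScan TSV file."""
    keywords = defaultdict(set)
    with open(tsv_path) as handle:
        for row in csv.reader(handle, delimiter="\t"):
            protein_id = row[0]
            for desc in (row[5] if len(row) > 5 else "", row[12] if len(row) > 12 else ""):
                if desc and desc != "-":
                    keywords[protein_id].add(desc)
    return {pid: sorted(kws) for pid, kws in keywords.items()}
```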

Downloading AlphaFold Protein Structure Database

Description: Predicted 3D protein structures for >200M proteins.

Website: https://alphafold.ebi.ac.uk/download

Download:

  • Download by organism (e.g., Human, E. coli, etc.). For example, the human proteome:

    ```shell
    wget https://ftp.ebi.ac.uk/pub/databases/alphafold/latest/UP000005640_9606_HUMAN.tar
    ```
  • Structures are in .pdb or .pdb.gz format.

Downloading ProTrek + SwissProt-Enh / UniRef50-Enh

Description: Used for retrieving protein descriptions.

Website: https://github.com/westlake-repl/ProTrek

🛠️ Processing Steps

| Component | How |
| --- | --- |
| SwissProt-Aug (4M) | Extract annotations from Swiss-Prot and rewrite using LLMs |
| SwissProt-Enh (9M) | Use handcrafted prompts + LLMs to generate more text |
| UniRef50-Enh (530M) | Use the same LLM approach for UniRef50 proteins |
| InterProScan keywords (791M) | Run InterProScan locally to extract functional labels |
| AlphaFold structures | Download AlphaFold predictions for sequences |
| UniRef50-ProTrek (400M) | Approximate by retrieving captions (original pairs not released) |

🧾 SwissProt-Annot

  • Extract structured annotations from Swiss-Prot entries (e.g., function, localization).
  • Use sentence templates to convert structured data into natural language, then paraphrase it with an LLM (a template sketch follows the parsing example below).
```python
from Bio import SwissProt

# Counting consumes the parser's iterator, so re-open the file before iterating again.
with open("uniprot_sprot.dat") as handle:
    count = sum(1 for _ in SwissProt.parse(handle))
print(f"Total number of records: {count}")

# Inspect the first 5 records.
with open("uniprot_sprot.dat") as handle:
    for idx, record in enumerate(SwissProt.parse(handle)):
        print(record.entry_name)
        print(record.description)
        print(record.sequence)
        print(record.organism)
        print(record.comments)
        print(record.gene_name)
        if idx >= 4:
            break
```
```
Total number of records: 572970
001R_FRG3G
RecName: Full=Putative transcription factor 001R;
MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLDAKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDDSFRKIYTDLGWKFTPL
Frog virus 3 (isolate Goorha) (FV-3).
['FUNCTION: Transcription activation. {ECO:0000305}.']
[{'ORFNames': ['FV3-001R']}]
...
```
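
Given records like the one above, the structured fields can be turned into plain-English sentences before any LLM paraphrasing. A minimal sketch with made-up templates (the paper's actual templates are not shown here):

```python
from Bio import SwissProt

def record_to_text(record) -> str:
    """Render a Swiss-Prot record as a short natural-language description (illustrative templates)."""
    sentences = [f"{record.entry_name} is a protein from {record.organism.rstrip('.')}."]
    for comment in record.comments:
        if comment.startswith("FUNCTION:"):
            function = comment.removeprefix("FUNCTION:").split("{")[0].strip()
            sentences.append(f"Its known function: {function}")
        elif comment.startswith("SUBCELLULAR LOCATION:"):
            location = comment.removeprefix("SUBCELLULAR LOCATION:").split("{")[0].strip()
            sentences.append(f"It localizes to {location}")
    return " ".join(sentences)

with open("uniprot_sprot.dat") as handle:
    print(record_to_text(next(SwissProt.parse(handle))))
```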

✍️ SwissProt-Aug

  • Rewrite SwissProt-Annot with LLMs to make biologist-friendly descriptions.
  • Result: 4M protein-text pairs

Just use an LLM to paraphrase it.
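
For instance, a sketch of the paraphrasing step using the OpenAI Python client as a stand-in; the paper does not specify this provider, model, or prompt, so treat all three as placeholders.

```python
from openai import OpenAI  # placeholder provider; any instruction-tuned LLM would do

client = OpenAI()

def paraphrase_annotation(annotation_text: str) -> str:
    """Rewrite a templated Swiss-Prot annotation into a fluent, biologist-friendly description."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "Rewrite protein annotations as concise, fluent descriptions."},
            {"role": "user", "content": annotation_text},
        ],
    )
    return response.choices[0].message.content
```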

🔍 UniRef50-ProTrek

  • Use ProTrek to retrieve 10 relevant natural language captions per UniRef50 protein.
  • Result: 400M protein-text pairs
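
A purely schematic view of retrieving captions by text-protein embedding similarity; this is not ProTrek's actual API, and every name below is a placeholder.

```python
import numpy as np

def retrieve_captions(protein_embedding: np.ndarray,
                      caption_embeddings: np.ndarray,
                      captions: list[str],
                      top_k: int = 10) -> list[str]:
    """Return the top-k captions most similar to the protein embedding (cosine similarity)."""
    p = protein_embedding / np.linalg.norm(protein_embedding)
    c = caption_embeddings / np.linalg.norm(caption_embeddings, axis=1, keepdims=True)
    scores = c @ p                        # cosine similarity, one score per caption
    best = np.argsort(scores)[::-1][:top_k]
    return [captions[i] for i in best]
```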

🧠 SwissProt-Enh & UniRef50-Enh

  • Use custom prompts + LLMs to generate new captions.
  • SwissProt-Enh: 9M pairs
  • UniRef50-Enh: 530M pairs

🧪 InterProScan & AlphaFold

  • Use InterProScan to assign functional keywords to proteins.
  • Use AlphaFold to get structures for proteins without annotations.
  • Result: 791M keyword-protein pairs

```shell
interproscan.sh -i input.fasta -f tsv -dp
```

📦 Final Dataset Composition

| Dataset | Size | Notes |
| --- | --- | --- |
| SwissProt-Aug | 4M | From SwissProt-Annot via LLM rewriting |
| UniRef50-ProTrek | 400M | Retrieved captions via ProTrek [Not Available] |
| SwissProt/UniRef50-Enh | 539M | LLM-generated text |
| InterPro/AlphaFold | 791M | Keyword-based annotations |
| Total | ~1.7B | 160B text tokens, 60B amino acids |

🧹 Splitting and Clustering

  • Cluster sequences at 50% sequence identity (so homologous proteins don't leak across splits).
  • Split into:
    • Training set
    • Validation set (1,142 clusters, 5,165 proteins)
    • Test set (572 clusters, 3,304 proteins)
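
A minimal sketch of the cluster-level split, assuming `clusters` maps a cluster representative to its member protein IDs (e.g. parsed from MMseqs2 or CD-HIT output); the tool choice and exact counts here are placeholders.

```python
import random

def split_by_cluster(clusters: dict[str, list[str]],
                     n_val: int = 1142, n_test: int = 572, seed: int = 0):
    """Assign whole clusters to validation/test so homologous sequences never cross splits."""
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)
    val_ids = set(cluster_ids[:n_val])
    test_ids = set(cluster_ids[n_val:n_val + n_test])

    train, val, test = [], [], []
    for cid, members in clusters.items():
        bucket = val if cid in val_ids else test if cid in test_ids else train
        bucket.extend(members)
    return train, val, test
```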

🧮 Structure Representation

  • Use Foldseek to convert 3D structures into 3Di discrete tokens.
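
A thin wrapper around Foldseek's `structureto3didescriptor` module can do this conversion in bulk; this assumes a recent Foldseek build with that module available on PATH, and the output handling is simplified.

```python
import subprocess

def structures_to_3di(pdb_dir: str, out_tsv: str = "3di_tokens.tsv") -> str:
    """Convert a directory of PDB/mmCIF files into 3Di descriptor strings with Foldseek.

    Writes a TSV with one row per chain containing its amino-acid and 3Di sequences.
    """
    subprocess.run(
        ["foldseek", "structureto3didescriptor", pdb_dir, out_tsv],
        check=True,
    )
    return out_tsv
```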

🧰 Final Training Data Format

  • Text → Structure (T2Struct):

    • Input: Natural language
    • Output: Sequence of Foldseek tokens
  • Structure + Text → Sequence (SaProt-T):

    • Input: Structure tokens + pooled text embedding
    • Output: Amino acid sequence
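
Concretely, one way the two tasks could be laid out as (input, target) pairs; the field names and truncated example strings are invented for illustration, not taken from the paper.

```python
# Illustrative training examples for the two stages (all values are placeholders).

t2struct_example = {
    "input_text": "A small heat-stable enzyme that hydrolyzes ester bonds in lipids.",
    "target_3di_tokens": "dpqvvcvvnnpd...",   # Foldseek 3Di string, truncated
}

saprot_t_example = {
    "input_3di_tokens": "dpqvvcvvnnpd...",
    "input_text_embedding": "<pooled PubMedBERT embedding of the description>",
    "target_sequence": "MKVLAAGLLLVA...",     # amino-acid sequence, truncated
}
```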

Summary

Pinal’s data pipeline is a multi-source, multi-stage process:

  1. Use Swiss-Prot and UniRef50 for sequences.
  2. Annotate with LLMs and ProTrek.
  3. Enhance text with prompt engineering.
  4. Represent structure using Foldseek tokens.