Paper : https://arxiv.org/abs/1906.08230
Code : https://github.com/songlab-cal/tape

The TAPE benchmark includes the following five tasks:

  • Secondary Structure (SS) Prediction

    • A sequence-to-sequence task mapping each amino acid to a label indicating its secondary structure (helix, strand, or other).
  • Contact Prediction

    • A pairwise amino acid task that predicts whether two amino acids in a protein sequence are in contact (less than 8 angstroms apart).
  • Remote Homology Detection

    • A sequence classification task that assigns a protein sequence to a protein fold.
  • Fluorescence Landscape Prediction

    • A regression task that predicts the log-fluorescence intensity of a protein.
  • Stability Landscape Prediction

    • A regression task that predicts a protein’s stability.

Secondary Structure (SS) Prediction

  • SS is an important feature for understanding the function of a protein, especially if the protein of interest is not evolutionarily related to proteins with known structure.
  • SS prediction tools are very commonly used to create richer input features for higher-level models.

Metric: accuracy on a per-amino-acid basis on the CB513 dataset. The provided code also appears to include the TS115 and CASP12 test sets, but they do not appear to be used for the reported metric.

SS-3: three-class classification that categorizes each amino acid as either Helix, Strand, or Other. This is a more general classification of secondary structure.

SS-8: eight-class classification that breaks down the three general classes into more specific categories. For example, the Helix class is further divided into 3-turn, 4-turn, or 5-turn helix.
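
To make the relationship between SS-8, SS-3, and the per-amino-acid accuracy concrete, here is a minimal sketch. The eight-letter alphabet and the reduction table follow a common DSSP-style convention and are assumptions for illustration, not taken from the TAPE code. The helper names `ss8_to_ss3` and `per_residue_accuracy` are hypothetical, and pooling positions over all sequences is one plausible reading of "per-amino-acid".

```python
# Assumed DSSP-style 8-state letters and a common reduction to 3 states.
# Illustrative convention only; not taken from the TAPE code.
SS8_TO_SS3 = {
    "G": "H",  # 3-turn helix
    "H": "H",  # 4-turn helix
    "I": "H",  # 5-turn helix
    "E": "E",  # extended strand
    "B": "E",  # isolated bridge
    "T": "C",  # turn
    "S": "C",  # bend
    "-": "C",  # other
}


def ss8_to_ss3(labels):
    """Collapse a string of 8-state labels into 3-state labels."""
    return "".join(SS8_TO_SS3[c] for c in labels)


def per_residue_accuracy(predicted, target):
    """Fraction of correctly labelled positions, pooled over all sequences."""
    correct = 0
    total = 0
    for pred, true in zip(predicted, target):
        if len(pred) != len(true):
            raise ValueError("prediction and target lengths differ")
        correct += sum(p == t for p, t in zip(pred, true))
        total += len(true)
    return correct / total


# Toy data (made up for illustration).
target_ss8 = ["HHHHTT-EEE", "GGG-BS"]
pred_ss8 = ["HHHGTT-EEB", "GGH-ES"]

acc_ss8 = per_residue_accuracy(pred_ss8, target_ss8)
acc_ss3 = per_residue_accuracy(
    [ss8_to_ss3(s) for s in pred_ss8],
    [ss8_to_ss3(s) for s in target_ss8],
)
print(acc_ss8, acc_ss3)
```

In this toy example every SS-8 mistake stays inside the same coarse class, so the SS-3 accuracy is higher than the SS-8 accuracy. That is the sense in which SS-3 is the easier, more general task.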

Contact Prediction

  • Accurate contact maps provide powerful global information; e.g., they facilitate robust
    modeling of full 3D protein structure.
  • Of particular interest are medium- and long-range contacts, which may be as few as twelve sequence positions apart, or as many as hundreds apart.

Metric: precision of the L/5 most likely contacts (where L is the length of the protein) for medium- and long-range contacts on the ProteinNet CASP12 test set. This is a standard metric reported in CASP.
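
The metric above can be sketched in a few lines. This is not the TAPE implementation: the function name `precision_at_l_over_k` is hypothetical, the separation windows (12 to 23 positions for medium-range, 24 or more for long-range) are an assumed convention consistent with the "twelve positions" mentioned above, and L/5 is rounded down.

```python
def precision_at_l_over_k(probs, contacts, min_sep, max_sep=None, k=5):
    """Precision of the top L/k predicted contacts within a separation window.

    probs:    L x L nested list of predicted contact probabilities
    contacts: L x L nested list of booleans (True = residues are in contact)
    """
    length = len(probs)
    candidates = []
    for i in range(length):
        for j in range(i + 1, length):  # upper triangle only
            sep = j - i
            if sep < min_sep or (max_sep is not None and sep > max_sep):
                continue
            candidates.append((probs[i][j], contacts[i][j]))
    if not candidates:
        raise ValueError("no residue pairs in the requested separation range")
    # Most confident predictions first, then keep the top L/k of them.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    top = candidates[: max(1, length // k)]
    return sum(1 for _, is_contact in top if is_contact) / len(top)


# Toy protein of length 30 (made-up data), so L/5 = 6 predictions are scored.
L = 30
probs = [[0.0] * L for _ in range(L)]
contacts = [[False] * L for _ in range(L)]
for i, j in [(0, 12), (1, 14), (2, 20), (3, 29), (1, 28)]:
    contacts[i][j] = contacts[j][i] = True
scores = {
    (0, 12): 0.9, (1, 14): 0.8, (4, 17): 0.7,   # medium-range guesses
    (2, 20): 0.6, (6, 19): 0.5, (7, 21): 0.4,
    (3, 29): 0.95, (0, 25): 0.85, (1, 28): 0.3,  # long-range guesses
}
for (i, j), p in scores.items():
    probs[i][j] = probs[j][i] = p

medium = precision_at_l_over_k(probs, contacts, min_sep=12, max_sep=23)
long_range = precision_at_l_over_k(probs, contacts, min_sep=24)
print(medium, long_range)
```

Because only the top L/5 predictions are scored, the metric rewards a model for being right about its most confident pairs rather than for calibrating every entry of the contact map.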

Remote Homology Detection

  • Detection of remote homologs is of great interest in microbiology and medicine, e.g., for detection of emerging antibiotic-resistant genes and discovery of new CAS enzymes.

Metric: overall classification accuracy on the fold-level heldout set from

Fluorescence Landscape Prediction

  • For a protein of length L, the number of possible sequences m mutations away is O(L^m), a prohibitively large space for exhaustive search via experiment, even if m is modest. Moreover, due to epistasis (second- and higher-order interactions between mutations at different positions), greedy optimization approaches are unlikely to succeed. Accurate computational predictions could allow significantly more efficient exploration of the landscape, resulting in better optima. Machine learning methods have already seen some success in related protein engineering tasks.
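
To get a feel for how fast this space grows, the count can be computed exactly under simple assumptions: substitutions only, a 20-letter amino acid alphabet, and exactly m mutated positions. That gives C(L, m) * 19^m sequences, which grows like L^m for fixed m. The function name `variants_at_distance` and the example length of 238 are hypothetical choices for illustration.

```python
from math import comb


def variants_at_distance(length, m, alphabet_size=20):
    """Number of sequences exactly m substitutions away from a reference.

    Choose which m positions mutate, then pick one of the
    (alphabet_size - 1) other residues at each chosen position.
    """
    return comb(length, m) * (alphabet_size - 1) ** m


# Hypothetical protein length, chosen only for illustration.
example_length = 238
for m in range(1, 5):
    print(m, variants_at_distance(example_length, m))
```

Even at m = 3 or 4 the count is far beyond what a wet-lab screen can enumerate, which is the motivation for learning a predictor of the landscape instead.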

Metric: Spearman’s correlation on the test set.
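
Spearman's correlation is the Pearson correlation computed on ranks, so it only asks whether the model orders the proteins correctly, not whether the predicted values match in scale. A minimal standard-library sketch follows. The helper names are hypothetical, ties receive their average rank, and in practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(predicted, target):
    """Spearman's rank correlation between predictions and targets."""
    return pearson(average_ranks(predicted), average_ranks(target))


# Made-up values: the prediction swaps the order of the two largest targets.
target = [1.0, 2.0, 3.0, 5.0, 4.0]
predicted = [0.1, 0.2, 0.3, 0.4, 0.5]
rho = spearman(predicted, target)
print(rho)
```

The same computation applies to the stability task, which uses the same metric.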

Stability Landscape Prediction

  • Designing stable proteins is important to ensure, for example, that drugs are delivered
    before they are degraded. More generally, given a broad sample of protein measurements, finding better refinements of top candidates is useful for maximizing yield from expensive protein
    engineering experiments.

Metric: Spearman’s correlation on the test set.