Predicting a Protein’s Stability under a Million Mutations: Results

This paper is available on arxiv under CC 4.0 license.

Authors:

(1) Jeffrey Ouyang-Zhang, UT Austin

(2) Daniel J. Diaz, UT Austin

(3) Adam R. Klivans, UT Austin

(4) Philipp Krähenbühl, UT Austin

Table of Links

5 Results

5.1 Implementation Details

Our primary feature extractor is AlphaFold, implemented in OpenFold [3, 31]. The multiple sequence alignment (MSA) is computed using Colabfold [43]. Sequence features from the Evoformer and Structure Module are aggregated as input to the decoder. An adapter maps the backbone hidden dimension to D = 128. Our amino acid projections and single mutation decoders are implemented with a linear layer. The higher-order mutation decoder transforms the previous embedding with a 2-layer MLP. These representations are aggregated and fed into a 3-layer MLP to predict ∆∆G.

We train on the cDNA display proteolysis dataset [72], which leverages a high throughput proteolysis screen enabled by next generation sequencing (NGS) to extract noisy ∆∆G values for 100 miniproteins totaling over 100,000 single and double mutations. To evaluate generalization on unseen proteins, we remove proteins from our training set that have high sequence similarity using MMSeqs2 [70] to any protein in our evaluation benchmarks. The short protein lengths (< 100 amino acids) keep the memory requirement small for the higher-order decoder. See Section A.1 for more information about cDNA and other datasets.

We fine-tune a pre-trained backbone on single mutations for 20 epochs. Then, we finetune the model on both single and double mutations for 100 epochs using a cosine learning rate schedule with 10 warmup epochs. We use a batch size of 3 proteins due to the high memory requirements of AlphaFold. We use a learning rate of 3×10−4 and weight decay of 0.5. Training takes 6 hours on 3 A100 GPUs.

5.2 Finding Stabilizing Double Mutations

We evaluate our model’s ability to find the stabilizing mutations on cDNA2. cDNA2 is our validation split of the cDNA double mutations dataset, consisting of 18 mini-proteins totaling 22,000 double mutations. The proteins in the validation set have at most 36% homology with those in the cDNA training set. Of these mutations, only 198 or 0.8% are stabilizing with ∆∆G < −0.5 kcal/mol [8]. Like protein engineering, this scenario closely resembles the challenge of identifying a small number of stabilizing mutations amidst a much larger pool of destabilizing or neutral mutations.

We evaluate different methods on stabilizing mutations in Table 1. Stabilizing rs is the Spearman coefficient on the experimentally stabilizing subset introduced in [65]. Normalized discounted cumulative gain (nDCG) [30] measures the quality of the ranked mutation list by taking into account its experimentally validated ∆∆G and position in the list [59]. Detection Precision (DetPr) is the proportion of experimentally validated stabilizing mutations among the top K = 30 predictions. We adapt previous state-of-the-art single mutation predictors by naively adding both single mutation predictions. A detailed description of the metrics is found in Section A.2.

Our model demonstrates exceptional performance in prioritizing stabilizing double mutations over destabilizing ones, achieving a significantly higher normalized discounted cumulative gain of 0.43 compared to 0.25, as well as a superior detection precision of 0.16 compared to 0.10. Our model additionally improves classification metrics Matthews Correlation Coefficient (MCC) and Area under Precision-Recall Curve (AUC) by 0.02 and 0.03, respectively.

Table 2: Multiple Mutation Results on PTMul. Our additive baseline naively adds predicted single mutation ∆∆Gs. Our model presents a strong mutation assessor. The proteins in PTMul have at most 35% sequence similarity to those in the training set. Parenthesis is standard error across 7 runs.

To the best of our knowledge, Mutate Everything is the first work that models all double mutations in a computationally tractable manner. Figure 4 shows the runtime of several methods on a protein of 317 amino acids on an A100 GPU. The dashed line indicates the transition from evaluating single mutants to double mutants. Mutate Everything predicts ∆∆G for all single and double mutations in one pass of the model. It runs in 0.6 seconds using an ESM2 backbone, and 12.1 seconds on an AlphaFold backbone. PROSTATA [74] also uses the ESM2 backbone but takes a protein sequence and its mutated sequence as input and outputs the change in stability. PROSTATA takes 306 hours to evaluate all double mutations with a batch size of 768. On an 8 GPU node, this will take 1.5 days.

Figure 4: Runtime analysis. Our model’s runtime is constant.

5.3 Higher-order Mutation Results

We evaluate our model for predicting changes in thermodynamic stability for higher-order mutations in ProTherm. ProTherm is a database of thermodynamic (∆∆G) and thermostability (∆Tm) experimental characterization of protein mutations curated from the literature. We consider a subset of this database that contains ∆∆G values for higher-order mutations, named ProTherm Multiple (PTMul). PTMul contains 858 mutations [44]. We keep 846 experimentally validated ∆∆G for higher-order mutants, including 536 doubles and 183 triples, after removing 4 duplicate and 8 ambiguous mutations. PTMul proteins have at most 35% homology to the proteins used in training.

Table 2 shows our results on PTMul. A description of baselines is provided in Section A.3. Our additive baselines naively add the sum of single mutation ∆∆Gs. This baseline rigorously evaluates our model’s ability to learn epistasis, or the non-additive interactions when combining two mutations. Mean always predicts the global test dataset mean statistic. ESM and MSA perform poorly when comparing mutations across proteins. On ProTherm, Mutate Everything achieves the state-of-the-art performance of 0.53 Spearman correlation rs, compared to 0.50 of our additive baseline. A breakdown of PTMul performance on double mutations and more (> 2) mutations is found in Table 6.

While other methods also handle multiple mutations, they usually adopt a mutation- or protein-level train and test split. This results in data leakage since the same protein or a homolog will have mutations in both the training and testing sets, leading to inflated reported metrics [16, 18, 35, 64]. In our study, we use sequence similarity-based splitting where our training proteins have at most 35% homology to those in PTMul. To fairly evaluate generalization to new proteins, we exclude these inflated comparisons from our study.

Table 3: Comparisons on Single Mutation Thermodynamic Stability prediction in S669. We report regression and classification metrics on the forward and reverse datasets for structure-based approaches (top block) and sequence-based approaches (bottom block). r refers to Spearman correlation coefficient. Our training proteins have at most 30% sequence similarity with those in S669. Mutate Everything is state-of-the-art on S669 forward.

5.4 Single Mutation Results

We additionally validate our model’s ability to predict changes in thermodynamic stability under single mutations. We compare Mutate Everything to prior works on the newly introduced S669 dataset, which was established to address the data leakage issue and enable fair comparisons between existing methods [54]. We also evaluate the commonly studied reverse dataset setting, in which the model takes the mutant sequence as input and predicts ∆∆G for mutating back to the wild-type sequence. The experimental value is obtained by negating the original ∆∆G value going from the wild-type sequence to the mutant sequence. S669 is well curated and extensive, containing 94 proteins totaling 669 mutations [54]. S669 proteins have at most 30% sequence similarity with those in our training set to ensure separation between the training and validation sets.

Table 3 shows our comparisons with prior works. Mutate Everything is state-of-the-art on S669, achieving a Spearman correlation of 0.56, where the prior art obtained 0.53. Mutate Everything outperforms the existing sequence model by a large margin, with a 6-point gain on Spearman correlation and a 3-point improvement on AUC. Our method is even as powerful as the latest structure-based methods which have access to atom coordinates.

Mutate Everything is competitive on the reverse dataset evaluation, in which the original and mutated sequences are flipped and ∆∆G negated. Our performance drops on reverse mutations because reverse mutations are out of distribution for our model. We did not perform data augmentation to train on reversed mutations as commonly done in the literature [18, 36, 74]. We found that it was beneficial to bias our reverse predictions using ∆∆G to the amino acid found in the original sequence. We found the performance to be similar across 5 runs (< 0.01 standard error).

Table 4: Comparisons on mutation fitness prediction in ProteinGym. We report ranking and classification metrics on both a subset of ProteinGym with stability-like phenotypes and the entire ProteinGym dataset. Our stability model can generalize to all phenotypes, outperforming existing methods on stability-like phenotypes and performing similarly to existing methods otherwise. The maximum homology between our training set and ProteinGym proteins is 43%.

5.5 Generalization to other Phenotypes

We show that stability predictors generalize to other phenotypes as well. ProteinGym substitutions is a dataset of 87 proteins with 1.6 million mutation sets labeled with a generic fitness score, such as activity, growth, or expression. The mutation sets have up to 20 substitutions and 14 proteins contain higher-order mutations. The ProteinGym-Stability split contains 7 proteins and 26,000 mutations with fitness labels that are known to correlate with thermodynamic stability (thermostability, abundance, expression). ProteinGym proteins have at most 45% sequence similarity with those in our training set. Due to memory constraints, we truncate proteins to at most 2000 amino acids.

Table 4 compares our work with previous works. Our MSA baseline uses the empirical amino acid distribution in the MSA at a given position to compute a likelihood ratio. The MSA retrieval prior weights the scores using the per-position first-order statistics in a multiple sequence alignment. On ProteinGym-Stability, Mutate Everything obtains a Spearman correlation of 0.52, compared to 0.50 of the next best method, TranceptEVE. On the one protein with labeled stability, Mutate Everything obtains 0.51 vs 0.33 of Tranception with MSA retrieval [51]. On the full ProteinGym benchmark, we outperform the strongest methods by ensembling our predictions with Tranception [49]. Ensembling our method with evolutionary models by averaging scores gains 0.11 points on ProteinGym and 0.01 points on ProteinGym-Stability, suggesting that our model learns complementary information to the evolutionary models. For comparison, TransEVE ensembles an evolutionary model (Tranception) and a family-based model (EVE) for a 0.03 point gain [50]. We found the performance to be similar across 5 runs (< 0.01 standard error). An in-depth study on how protein stability related to function can be found in [26].

5.6 Ablation Studies

We ablate several architectural design choices in Table 5. Core architectural designs are evaluated on S669. Architectural designs for only multiple mutations are evaluated on cDNA2.

Backbone. We analyze the effect of different feature extractors on single mutation stability assessment in Table 5a. ESM2 [38] takes only sequence as input whereas MSA-Transformer [60] and AlphaFold [31] take the sequence and its MSA as input. Both ESM2 and MSA-Transformer are trained on masking residue modeling while AlphaFold is also trained on structure prediction. MSA input improves the stability prediction performance and AlphaFold’s architecture and training further improve stability prediction.

(d) Higher-order Modeling on cDNA2. We model the ∆∆G of the double mutation as a residual combined with the ∆∆G of its single mutation.

Table 5: Mutate Everything ablation experiments. Default Settings are marked in grey.

Backbone Optimization We try fine-tuning and freezing the AlphaFold backbone during stability finetuning in Table 5c. We find that fine-tuning the backbone improves stability prediction performance.

Aggregation The technique to aggregate mutation representations for higher-order mutations is ablated in Table 5b. To control for feature dimension size, we reduce the head dimension when aggregating with outer product and flattening. Aggregating with product and summation performs similarly on double mutations. We add the features for a natural extension to more mutations.

Higher Order Modeling In the direct prediction model, the multi-mutant head directly predicts ∆∆G. In the multiply and add models, the multi-mutant head learns how interactions among single mutations affect the higher-order mutation’s ∆∆G. In these models, the multi-mutant head output learns an additive bias or multiplicative scaling to the sum of single mutation ∆∆Gs. In Table 5d, we show that learning a bias performs the strongest among these three options.