Toward Compilation of Balanced Protein Stability Data Sets: Flattening the Delta Delta G Curve through Systematic Enrichment


JOURNAL OF CHEMICAL INFORMATION AND MODELING, vol.62, no.5, pp.1345-1355, 2022 (SCI-Expanded) identifier identifier identifier

  • Publication Type: Article / Article
  • Volume: 62 Issue: 5
  • Publication Date: 2022
  • Doi Number: 10.1021/acs.jcim.2c00054
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Applied Science & Technology Source, Compendex, Computer & Applied Sciences, EMBASE, MEDLINE
  • Page Numbers: pp.1345-1355
  • Acibadem Mehmet Ali Aydinlar University Affiliated: Yes


Often studies analyzing stability data sets and/or predictors ignore neutral mutations and use a binary classification scheme labeling only destabilizing and stabilizing mutations. Recognizing that highly concentrated neutral mutations interfere with data set quality, we have explored three protein stability data sets: S2648, PON-tstab, and the symmetric S-sym that differ in size and quality. A characteristic leptokurtic shape in the Delta Delta G distributions of all three data sets including the curated and symmetric ones was reported due to concentrated neutral mutations. To further investigate the impact of neutral mutations on Delta Delta G predictions, we have comprehensively assessed the performance of 11 predictors on the PON-tstab data set. Correlation and error analyses showed that all of the predictors performed the best on the neutral mutations, while their performance became gradually worse as the Delta Delta G of the mutations departed further from the neutral zone regardless of the direction, implying a bias toward dense mutations. To this end, after unraveling the role of concentrated neutral mutations in biases of stability data sets, we described a systematic enrichment approach to balance the Delta Delta G distributions. Before enrichment, mutations were clustered based on their biochemical and/or structural features, and then three mutations were selected from every 2 kcal/mol of each cluster. Upon implementation of this approach by distinct clustering schemes, we generated five subsets varying in size and Delta Delta G distributions. All subsets showed improved Delta Delta G and frequency distributions. We ultimately reported that the errors toward enriched subsets were higher than those toward the parent data sets, confirming the enrichment of difficult-to-predict mutations in the subsets. In summary, we elaborated the prediction bias toward a concentrated neutral zone and also implemented a rational strategy to tackle this and other forms of biases. Ultimately, this study equipping us with an extended view of shortcomings of stability data sets is a step taken toward development of an unbiased predictor.