Toward Compilation of Balanced Protein Stability Data Sets: Flattening the Delta Delta G Curve through Systematic Enrichment


Kebabci N., TİMUÇİN A. C., TİMUÇİN E.

JOURNAL OF CHEMICAL INFORMATION AND MODELING, vol.62, no.5, pp.1345-1355, 2022 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 62 Issue: 5
  • Publication Date: 2022
  • DOI: 10.1021/acs.jcim.2c00054
  • Journal Name: JOURNAL OF CHEMICAL INFORMATION AND MODELING
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Applied Science & Technology Source, Compendex, Computer & Applied Sciences, EMBASE, MEDLINE
  • Pages: pp.1345-1355
  • Affiliated with Acıbadem Mehmet Ali Aydınlar University: Yes

Abstract

Studies analyzing stability data sets and/or predictors often ignore neutral mutations and use a binary classification scheme that labels only destabilizing and stabilizing mutations. Recognizing that highly concentrated neutral mutations interfere with data set quality, we explored three protein stability data sets that differ in size and quality: S2648, PON-tstab, and the symmetric S-sym. The Delta Delta G distributions of all three data sets, including the curated and symmetric ones, showed a characteristic leptokurtic shape caused by concentrated neutral mutations. To further investigate the impact of neutral mutations on Delta Delta G predictions, we comprehensively assessed the performance of 11 predictors on the PON-tstab data set. Correlation and error analyses showed that all of the predictors performed best on the neutral mutations, and that their performance worsened gradually as the Delta Delta G of the mutations departed further from the neutral zone, regardless of direction, implying a bias toward the densely represented mutations. Having established the role of concentrated neutral mutations in the biases of stability data sets, we described a systematic enrichment approach to balance the Delta Delta G distributions. Before enrichment, mutations were clustered based on their biochemical and/or structural features, and then three mutations were selected from every 2 kcal/mol interval of each cluster. By implementing this approach with distinct clustering schemes, we generated five subsets varying in size and Delta Delta G distribution. All subsets showed improved Delta Delta G and frequency distributions. Finally, we reported that prediction errors on the enriched subsets were higher than those on the parent data sets, confirming the enrichment of difficult-to-predict mutations in the subsets.
In summary, we characterized the prediction bias toward a concentrated neutral zone and implemented a rational strategy to tackle this and other forms of bias. By providing an extended view of the shortcomings of stability data sets, this study is a step toward the development of an unbiased predictor.
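
The enrichment step described in the abstract (cluster the mutations, then keep three per 2 kcal/mol of each cluster) might be sketched as follows. This is only an illustration and not the authors' implementation. The function name `enrich`, the tuple layout, the bin alignment at 0 kcal/mol, and the random choice within a bin are all assumptions, since the abstract does not specify them. Cluster labels are taken as already computed.

```python
import math
import random
from collections import defaultdict


def enrich(mutations, bin_width=2.0, per_bin=3, seed=0):
    """Keep at most `per_bin` mutations per `bin_width` kcal/mol bin of each cluster.

    `mutations` is an iterable of (mutation_id, cluster_label, ddg) tuples.
    Assumptions (not stated in the abstract): bins are aligned at 0 kcal/mol,
    and mutations within an overfull bin are chosen uniformly at random.
    """
    rng = random.Random(seed)

    # Group mutations by (cluster, Delta Delta G bin).
    groups = defaultdict(list)
    for mut_id, cluster, ddg in mutations:
        bin_index = math.floor(ddg / bin_width)
        groups[(cluster, bin_index)].append((mut_id, cluster, ddg))

    # Keep sparse bins whole and down-sample crowded ones.
    selected = []
    for key in sorted(groups):
        members = groups[key]
        if len(members) <= per_bin:
            selected.extend(members)
        else:
            selected.extend(rng.sample(members, per_bin))
    return selected


# Hypothetical toy data: cluster "A" is crowded near the neutral zone.
toy = [(f"a{i}", "A", 0.1 * i) for i in range(10)]
toy += [("a_destab", "A", -3.5), ("b1", "B", 0.5), ("b2", "B", 4.1)]
subset = enrich(toy)
```

Capping each bin at a fixed count thins the crowded neutral zone while leaving the sparse tails intact, which is the flattening effect the abstract describes.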