INFORMATION-THEORETICAL ENTROPY AS A MEASURE OF SEQUENCE VARIABILITY


SHENKIN P., ERMAN B., MASTRANDREA L.

PROTEINS-STRUCTURE FUNCTION AND GENETICS, vol.11, no.4, pp.297-313, 1991 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 11 Issue: 4
  • Publication Date: 1991
  • DOI: 10.1002/prot.340110408
  • Journal Name: PROTEINS-STRUCTURE FUNCTION AND GENETICS
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus
  • Pages: pp.297-313
  • Affiliated with Acıbadem Mehmet Ali Aydınlar University: No

Abstract

We propose the use of the information-theoretical entropy, $S = -\sum_i p(i) \log_2 p(i)$, as a measure of variability at a given position in a set of aligned sequences. Here p(i) is the fraction of times the i-th type appears at that position. For protein sequences the sum has up to 20 terms; for nucleotide sequences, up to 4 terms; and for codon sequences, up to 61 terms. We compare S and V(S), a related measure, in detail with V(K), the traditional measure of immunoglobulin sequence variability, both in the abstract and as applied to the immunoglobulins. We conclude that S has desirable mathematical properties that V(K) lacks and has intuitive and statistical meanings that accord well with the notion of variability. We find that V(K) and the S-based measures are highly correlated for the immunoglobulins. We show by analysis of sequence data and by means of a mathematical model that this correlation is due to a strong tendency for the frequency of occurrence of amino acid types at a given position to be log-linear. It is not known whether the immunoglobulins are typical or atypical of protein families in this regard, nor is the origin of the observed rank-frequency distribution obvious, although we discuss several possible etiologies.
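
As a rough illustration of the entropy measure described in the abstract, the following Python sketch computes S for each column of a small aligned sequence set. The function names and the toy alignment are hypothetical and are not taken from the paper. For comparison, the sketch also computes a Wu-Kabat style variability; the abstract does not define V(K), so the formula used here (number of distinct types times number of sequences, divided by the count of the most common type) is an assumption based on the commonly cited definition. V(S) is not defined in the abstract and is left out.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Shannon entropy S = -sum_i p(i) log2 p(i) of one alignment column, in bits."""
    counts = Counter(column)
    total = sum(counts.values())
    # p * log2(1/p) is the same as -p * log2(p) and avoids returning -0.0.
    return sum((n / total) * log2(total / n) for n in counts.values())


def wu_kabat_variability(column):
    """Assumed Wu-Kabat form: (distinct types * sequences) / count of most common type."""
    counts = Counter(column)
    total = sum(counts.values())
    most_common_count = counts.most_common(1)[0][1]
    return len(counts) * total / most_common_count


# Hypothetical toy alignment: four aligned sequences of length four.
alignment = ["ACDA", "ACEA", "ACFG", "ACGG"]

# Transpose the rows so that each tuple holds one alignment position.
columns = list(zip(*alignment))

entropies = [column_entropy(col) for col in columns]
variabilities = [wu_kabat_variability(col) for col in columns]

for position, (s, vk) in enumerate(zip(entropies, variabilities), start=1):
    print(f"position {position}: S = {s:.3f} bits, V(K) = {vk:.1f}")
```

In this toy case a fully conserved column gives S = 0, while a column in which four types each appear once gives S = 2 bits. The sketch treats every character as a residue type; gaps and ambiguous symbols would need handling that is not shown here.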