Submitted on May 24, 2002
Revised on October 16, 2002
Accepted on December 9, 2002
Abundance and distributions of eukaryote protein simple sequences
Kim Lan Sim and Trevor P. Creamer
Molecular and Cellular Biochemistry, University of Kentucky, Lexington, Kentucky 40502-0298
Corresponding Author: tpcrea0{at}uky.edu
Protein simple sequences are a subclass of low-complexity regions of sequence that are highly enriched in one or a few residue types. Such sequences are common in transcription regulatory proteins, structural proteins, proteins involved in nucleic acid interactions, and proteins involved in mediating protein-protein interactions. Simple sequences of ten or more residues containing at least 50% of a single residue type are surveyed in this work. Both eukaryote and prokaryote proteomes are investigated, with emphasis on the eukaryotes. Very large numbers of such sequences are found in all organisms surveyed. Eukaryotes possess far more simple sequences per protein than prokaryotes. Prokaryotes display a linear relationship between the number of proteins containing simple sequences and proteome size, whereas it is not clear that such a relationship holds for eukaryotes. Strikingly, each eukaryote possesses its own unique distribution of simple sequences. Within those distributions, simple sequences enriched in certain residue types are clearly favored, whereas others are just as clearly discriminated against. The preferences observed are not correlated with residue occurrence. An analysis of classes of proteins of known function suggests that simple sequence occurrence and distribution may be related to protein function. Based upon this analysis, the number of simple sequences found in excess of that expected from a simple statistical model, and the known functional importance of numerous such sequences, it is postulated that eukaryotes have evolved not only to tolerate large numbers of simple sequences but also to require them.
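The working definition above (ten or more residues, at least half of them a single residue type) can be illustrated with a minimal sketch. The abstract does not describe the search procedure actually used, so the sliding-window scan, the merging of overlapping windows, and the names `find_simple_sequences`, `window`, and `min_fraction` are all assumptions made for illustration only.

```python
def find_simple_sequences(seq, window=10, min_fraction=0.5):
    """Return sorted (start, end, residue) tuples, with end exclusive.

    Hypothetical sketch, not the authors' method: a window of `window`
    residues counts as enriched when at least `min_fraction` of it is one
    residue type; overlapping or adjacent enriched windows for the same
    residue type are merged into one region.
    """
    regions = []
    threshold = min_fraction * window
    for residue in sorted(set(seq)):
        current = None  # [start, end) of the region being extended
        for i in range(len(seq) - window + 1):
            if seq[i:i + window].count(residue) >= threshold:
                if current is not None and i <= current[1]:
                    # Window overlaps or touches the open region: extend it.
                    current[1] = i + window
                else:
                    if current is not None:
                        regions.append((current[0], current[1], residue))
                    current = [i, i + window]
        if current is not None:
            regions.append((current[0], current[1], residue))
    return sorted(regions)


# A made-up sequence with a 12-residue glutamine run.
example = "MKT" + "Q" * 12 + "AVLGDE"
hits = find_simple_sequences(example)
```

Note that a merged region is only guaranteed to be covered by enriched windows; taken as a whole it can fall below the 50% threshold, and it includes flanking residues that share a window with the run. A stricter survey would need an explicit rule for region boundaries, which the abstract does not give.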