Frequent Triplets – Theory and simulation.
Expected values of Frequent Triplets (FTs) in random proteins as function of sequence length. Length range is up to 35,000 amino-acids, approximately the length of the longest proteins found among the proteomes of the 94 species studied (TITIN in human, and beta-helical in Chlorobium). A) Blue curve is the theoretical expected value given by the Bernoulli probability, for n = 5. Dark circles are the corresponding results of a numerical search of triplets showing perfect match to the theoretical estimation. Red circles are the numerical results for restrictive FTs defined by n = 5 and M = 2000. Inset: same data is shown up to L = 8000 for clarity. Additional black curves represent the theoretical estimation for n = 4–6. B) P-value for FT misidentification as function of length on log-scale. C) Length distribution of human proteins showing log-normal characteristics. Length of CO proteins is right-shifted (see also Text S1 -section 3, figure S6d). Further analysis based on a human “unigram” reference model is provided in Text S1 - sections 1 and 2, where the few very long proteins are analyzed in detail.