## Frequent Triplets – Theory and simulation.

Expected values of Frequent Triplets (FTs) in random proteins as a function of sequence length. Lengths range up to 35,000 amino acids, approximately the length of the longest proteins found among the proteomes of the 94 species studied (TITIN in human, and beta-helical in *Chlorobium*). A) The blue curve is the theoretical expected value given by the Bernoulli probability for *n = 5*. Dark circles are the corresponding results of a numerical search for triplets, which match the theoretical estimate. Red circles are the numerical results for restrictive FTs defined by *n = 5* and *M = 2000*. Inset: the same data shown up to *L = 8000* for clarity. Additional black curves represent the theoretical estimates for *n = 4–6*. B) *P*-value for FT misidentification as a function of length, on a log scale. C) Length distribution of human proteins, showing log-normal characteristics. The length distribution of CO proteins is right-shifted (see also Text S1, section 3, figure S6d). Further analysis based on a human “unigram” reference model is provided in Text S1, sections 1 and 2, where the few very long proteins are analyzed in detail.
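
The comparison in panel A between the Bernoulli estimate and the numerical search can be illustrated with a minimal sketch. It rests on assumptions that the caption does not state: an FT is taken to be a triplet of consecutive residues occurring at least *n* times in one sequence, random proteins are drawn uniformly from 20 amino acids, and overlapping positions are treated as independent trials. The function names are hypothetical, and the restrictive variant with *M* is omitted because *M* is not defined here.

```python
import math
import random
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumption: 20 equiprobable amino acids


def expected_frequent_triplets(length, n=5, alphabet_size=20):
    """Binomial (Bernoulli-trial) estimate of the number of distinct
    triplets that occur at least n times in a random sequence."""
    positions = length - 2              # number of overlapping triplet windows
    if positions < n:
        return 0.0
    p = 1.0 / alphabet_size ** 3        # chance that a window holds a given triplet
    below_n = sum(
        math.comb(positions, k) * p ** k * (1.0 - p) ** (positions - k)
        for k in range(n)
    )
    # Clamp at zero to absorb floating-point cancellation for short sequences.
    return alphabet_size ** 3 * max(0.0, 1.0 - below_n)


def count_frequent_triplets(sequence, n=5):
    """Count distinct overlapping triplets that appear at least n times."""
    counts = Counter(sequence[i:i + 3] for i in range(len(sequence) - 2))
    return sum(1 for c in counts.values() if c >= n)


def simulate_mean(length, n=5, trials=5, seed=0):
    """Average FT count over uniformly random sequences of a given length."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seq = "".join(rng.choices(ALPHABET, k=length))
        total += count_frequent_triplets(seq, n)
    return total / trials


if __name__ == "__main__":
    for L in (2000, 8000, 20000, 35000):
        print(L, round(expected_frequent_triplets(L), 2), simulate_mean(L))
```

Under these assumptions the simulated mean should track the estimate closely at lengths of a few thousand residues and above. The independence approximation is weakest for self-overlapping triplets such as `AAA`.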