Figure 1 | BMC Bioinformatics

Figure 1

From: Automated hierarchical classification of protein domain subfamilies based on functionally-divergent residue signatures

Schematic drawing of a contrast alignment and the corresponding probability model. Aligned sequences are assigned to either a ‘foreground’ or a ‘background’ partition (orange and gray horizontal bars, respectively). Partitioning is based on the conservation of foreground residues (blue vertical bars) that diverge from (or contrast with) the background residues at those positions (white vertical bars). Red vertical bar heights quantify the selective pressure imposed on divergent residue positions. Below this is given the logarithm of the corresponding probability distribution for the possible sequence partitions and corresponding discriminating patterns, which together serve as the random variables over which sampling occurs. $X$ is an $n \times k$ matrix representing a multiple alignment of $n$ sequences and $k$ columns; $x_{ij}$ is a 20-dimensional vector of all 0’s except for a lone ‘1’ indicating the observed residue type; $R$ is a vector indicating which rows (i.e., sequences) belong to the foreground ($R_i = 1$) or background ($R_i = 0$) partitions; $C$ is a vector indicating which columns do ($C_j = 1$) or do not ($C_j = 0$) differentiate the foreground from the background; $\Theta$ is an array of vectors representing the amino acid compositions at each column position for each partition; $\langle \cdot , \cdot \rangle$ denotes the inner product of two vectors; and $\theta_j^{\alpha} \equiv (1 - \alpha)\,\theta_j + \alpha\,\delta_{A_j}$ models the foreground composition at pattern positions, where $\theta_j \equiv (\theta_{j,1}, \ldots, \theta_{j,20})^{T}$ is the background amino acid frequency vector for column $j$, the parameter $\alpha$ specifies the expected background ‘contamination’ at pattern positions in the foreground, and $\delta_{A_j}$ is a vector that specifies the pattern residues at position $j$. At non-pattern positions, the vector $\theta_j$ corresponds to the overall (foreground and background) composition.
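The definitions above can be illustrated with a minimal NumPy sketch. It builds the one-hot residue vector, mixes the background composition with the pattern vector, and uses the inner product to read off the probability of the observed residue. The function names, the uniform background, the residue indices, and the value of alpha are all hypothetical; spreading the pattern vector's mass evenly over the pattern residues is an assumption made here for illustration, not something the caption states.

```python
import numpy as np

N_RESIDUES = 20  # size of the standard amino acid alphabet


def one_hot(residue_index: int) -> np.ndarray:
    """x_ij: all zeros except a single 1 at the observed residue type."""
    x = np.zeros(N_RESIDUES)
    x[residue_index] = 1.0
    return x


def pattern_vector(pattern_residues) -> np.ndarray:
    """delta_{A_j}: ASSUMED here to place equal mass on each pattern residue."""
    delta = np.zeros(N_RESIDUES)
    delta[list(pattern_residues)] = 1.0 / len(pattern_residues)
    return delta


def foreground_composition(theta_j, delta_aj, alpha):
    """theta_j^alpha = (1 - alpha) * theta_j + alpha * delta_{A_j}."""
    return (1.0 - alpha) * theta_j + alpha * delta_aj


# Hypothetical inputs: a uniform background and a two-residue pattern.
theta_j = np.full(N_RESIDUES, 1.0 / N_RESIDUES)
delta_aj = pattern_vector([3, 7])
theta_fg = foreground_composition(theta_j, delta_aj, alpha=0.9)

# The inner product with a one-hot vector selects the probability
# of the observed residue under each composition.
x_ij = one_hot(3)
p_fg = float(x_ij @ theta_fg)
p_bg = float(x_ij @ theta_j)
print(p_fg, p_bg)
```

Because both inputs to the mixture sum to one, the mixed composition remains a valid probability vector for any alpha between 0 and 1.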
The third through sixth terms in the equation correspond to the logarithm of the product of the prior probabilities, with $p(\alpha)$ and $p(\Theta)$ defined by the beta and product Dirichlet distributions, respectively, and with $p(R)$ and $p(C)$ defined by independent Bernoulli distributions; prior definitions are as shown (in parentheses). The log-likelihood ratio (LLR) is computed by subtracting the log-probability for a ‘null’ contrast alignment, in which all of the sequences are assigned to the background partition, from the log-probability for the observed contrast alignment.
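The LLR computation can be sketched in a simplified form. The sketch below is not the full model in the figure: it assumes the compositions and alpha are held fixed rather than sampled, and it omits every prior term, so only the likelihood part of the log-probability is compared. All names and the toy alignment are hypothetical.

```python
import numpy as np


def log_likelihood(X, R, C, theta, delta, alpha):
    """Sum of log <x_ij, theta> over all cells (likelihood terms only).

    Foreground rows at pattern columns are scored with theta_j^alpha;
    every other cell is scored with theta_j. Priors are omitted (assumption).
    """
    theta_fg = (1.0 - alpha) * theta + alpha * delta   # shape (k, 20)
    p_bg = np.einsum("ija,ja->ij", X, theta)           # <x_ij, theta_j>
    p_fg = np.einsum("ija,ja->ij", X, theta_fg)        # <x_ij, theta_j^alpha>
    use_fg = np.outer(R, C).astype(bool)               # R_i = 1 and C_j = 1
    return float(np.sum(np.log(np.where(use_fg, p_fg, p_bg))))


def llr(X, R, C, theta, delta, alpha):
    """Observed log-likelihood minus that of the null (all rows in background)."""
    null_R = np.zeros_like(R)
    return (log_likelihood(X, R, C, theta, delta, alpha)
            - log_likelihood(X, null_R, C, theta, delta, alpha))


# Toy alignment: n = 4 sequences, k = 3 columns, residues as indices 0..19.
residues = np.array([[5, 1, 2],
                     [5, 3, 4],
                     [0, 1, 2],
                     [9, 3, 4]])
X = np.eye(20)[residues]              # one-hot encoding, shape (4, 3, 20)
R = np.array([1, 1, 0, 0])            # first two rows form the foreground
C = np.array([1, 0, 0])               # only column 0 is a pattern position
theta = np.full((3, 20), 1.0 / 20)    # hypothetical uniform compositions
delta = np.zeros((3, 20))
delta[0, 5] = 1.0                     # pattern residue 5 at column 0

score = llr(X, R, C, theta, delta, alpha=0.9)
print(score)
```

Under these simplifications, only foreground cells at pattern columns contribute to the difference, so the score rises when foreground sequences carry the pattern residue and falls when they do not.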
