Protein–protein interaction site prediction by model ensembling with hybrid feature and self-attention

BMC Bioinformatics

Table 1 The feature extraction methods used in this paper

Feature	Abbreviation	Description
One-hot vector	Seq	Each residue is one of 20 amino acid types and is encoded as a 20D one-hot vector
Position-specific scoring matrix	PSSM	It represents the probabilities of the 20 amino acids occurring at each position. It is generated with the PSI-BLAST algorithm by searching NCBI’s non-redundant sequence database with three iterations and an E-value threshold of 0.001
Entropy density	Den	It represents the composition information of the protein sequence and is obtained by calculating the information entropy of the 20 amino acid residues
Physicochemical properties	PhyChem	It represents the physical and chemical attributes of different amino acid residues and is obtained by multivariate statistical analysis of 188 natural amino acid properties
Hydrophilicity and hydrophobicity index	HyIn	A larger hydrophilicity index means that the residue is more hydrophilic, and a smaller one means that it is more hydrophobic. The hydrophobicity index runs in the opposite direction
Pseudo amino acid based on K-nearest neighbors	K-PseAA	A new feature proposed in this paper that combines K-nearest neighbors with PseAA. A subsequence is formed from the targeted amino acid residue together with the residues located at most K positions before and after it, so the subsequence has length 2K + 1. The PseAA of this subsequence is then calculated and used as the K-PseAA feature of the targeted residue
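
As an illustration of two rows of Table 1, the following minimal Python sketch shows how the 20D one-hot vector (Seq) and the length 2K + 1 subsequence underlying K-PseAA might be built. It is not the authors' implementation. The residue ordering in AMINO_ACIDS, the function names one_hot and k_window, and the padding of sequence termini with "X" are all assumptions made for this example, since the table does not state how residues near the ends of a sequence are handled. The PseAA computation itself is not shown.

```python
# Hypothetical helper names; the alphabet ordering is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Return one 20D one-hot vector per residue.

    Letters outside the 20 standard amino acids map to an all-zero
    vector (an assumption, not specified in the table).
    """
    vectors = []
    for aa in sequence:
        vec = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(aa)
        if idx is not None:
            vec[idx] = 1
        vectors.append(vec)
    return vectors


def k_window(sequence, i, k, pad="X"):
    """Return the subsequence of length 2k + 1 centred on residue i.

    Positions that fall outside the sequence are filled with `pad`
    (an assumption made so that every window has the same length).
    """
    padded = pad * k + sequence + pad * k
    # Residue i sits at index i + k in the padded string, so the
    # window spans padded[i] .. padded[i + 2k] inclusive.
    return padded[i : i + 2 * k + 1]


if __name__ == "__main__":
    seq = "MKTAYIAK"
    print(one_hot(seq)[0])       # one-hot vector for 'M'
    print(k_window(seq, 3, 2))   # window around seq[3]
    print(k_window(seq, 0, 2))   # window at the N-terminus, padded
```

Under these assumptions, each window returned by k_window would then be passed to a PseAA routine to obtain the K-PseAA feature of the central residue.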

ISSN: 1471-2105