Protein–protein interaction site prediction by model ensembling with hybrid feature and self-attention

Cong, Hanhan; Liu, Hong; Cao, Yi; Liang, Cheng; Chen, Yuehui

doi:10.1186/s12859-023-05592-7

Research
Open access
Published: 05 December 2023

Protein–protein interaction site prediction by model ensembling with hybrid feature and self-attention

Hanhan Cong^1,2,
Hong Liu^1,2,
Yi Cao^3,4,
Cheng Liang¹ &
…
Yuehui Chen^3,4

BMC Bioinformatics volume 24, Article number: 456 (2023) Cite this article

1019 Accesses
Metrics details

Abstract

Background

Protein–protein interactions (PPIs) are crucial in various biological functions and cellular processes. Thus, many computational approaches have been proposed to predict PPI sites. Although significant progress has been made, these methods still have limitations in encoding the characteristics of each amino acid in sequences. Many feature extraction methods rely on the sliding window technique, which simply merges all the features of residues into a vector. The importance of some key residues may be weakened in the feature vector, leading to poor performance.

Results

We propose a novel sequence-based method for PPI sites prediction. The new network model, PPINet, contains multiple feature processing paths. For a residue, the PPINet extracts the features of the targeted residue and its context separately. These two types of features are processed by two paths in the network and combined to form a protein representation, where the two types of features are of relatively equal importance. The model ensembling technique is applied to make use of more features. The base models are trained with different features and then ensembled via stacking. In addition, a data balancing strategy is presented, by which our model can get significant improvement on highly unbalanced data.

Conclusion

The proposed method is evaluated on a fused dataset constructed from Dset186, Dset_72, and PDBset_164, as well as the public Dset_448 dataset. Compared with current state-of-the-art methods, the performance of our method is better than the others. In the most important metrics, such as AUPRC and recall, it surpasses the second-best programmer on the latter dataset by 6.9% and 4.7%, respectively. We also demonstrated that the improvement is essentially due to using the ensemble model, especially, the hybrid feature. We share our code for reproducibility and future research at https://github.com/CandiceCong/StackingPPINet.

Peer Review reports

Background

Protein–protein interactions (PPIs) play a crucial role in various biological functions and cellular processes [1], such as signal transduction, immunological recognition, metabolism [2] etc. During PPIs, some interfaces are formed at particular protein residues, called protein–protein interaction sites [3]. Therefore, identifying those sites are essential to reveal the key mechanisms of PPIs and beneficial to modern drug design [4, 5]. However, via experiments, PPI sites identification requires high-end devices and accurate manipulations, being time-consuming and expensive. As an economic and efficient alternative, computational methods [6] have been widely applied. In particular, data-driven methods can provide competitive results by leveraging machine learning and modern deep learning techniques [7,8,9,10]. Existing computational approaches can be roughly divided into partner-independent prediction [11] and partner-specific prediction [12]. In addition, according to the feature information, partner-independent prediction can be further divided into structure-based methods [13] and sequence-based methods [14]. Structure-based methods usually need structural details [15], while the structural information for many proteins is currently unavailable in the dataset. With the rapid development of high-throughput sequencing techniques, a growing number of protein sequences can be obtained, which attracts more attention for sequence-based methods [16].

Since the functions of the residues are determined by its physiochemical properties and context [17,18,19], residues are usually represented by these properties, e.g., accessible surface area [20], protein sequence composition, hydrophilic and hydrophobic index [21]. In addition, evolutionary information [22] and secondary structure information [23] are often incorporated as supplements. To model the local context, sliding window-based methods [24] are widely applied. However, the features of the residues in the window are typically treated equally, which is obviously inaccurate and harms the precise PPI site prediction [25]. Hitherto, many machine learning methods have been proposed to deal with this prediction task, including neural networks (NNs) [26], support vector machines (SVMs) [27], random forests (RF) [28], etc. ISIS [29] is a neural network predictor, which is trained on sequences profiles and structural features predicted from the sequences. SPPIDER [30] employs an SVM, neural network and linear discriminant analysis based on 19 selected features from the sequences. SPRINGS [31] uses mean cumulative hydrophobicity, relative solvent accessibility, and structural features to represent the targeted residue site, and the algorithm uses neural networks for classifier construction. DeepPPISP [32] is an end-to-end deep learning framework that combines local contextual and global sequence features to fulfill the prediction task. Although considerable progress has been achieved, the predictive performance of these methods still needs to be improved [33].

As a matter of fact, most residues in proteins are not PPI sites and thus making the data highly imbalanced [34]. The cascade random forests algorithm (CRF) [35] is first proposed to deal with the problem. It connects multiple random forests in a cascade-like manner, each of which is trained with a balanced training subset that includes all minority samples and a subset of majority samples. However, sampling of training data based on residues level might destroy the completeness of a sequence. SSWRF [36] combines an ensemble of SVMs and sample-weighted random forests to solve the imbalance issue and achieves competing performance. SLSTM utilizes a simplified long short-term memory [37] network to improve the precision of the imbalanced PPI sites. It builds a set of protein sequences, instead of single residues, to retain the entire sequential completeness of each protein. The balancing methods either increase the samples of the minority class or reduces the samples of the majority class, which partly change the data distribution.

In this paper, we proposed a novel sequence-based method for PPI sites prediction. The new network model, namely, PPINet, contains multiple feature processing paths. For a residue, the PPINet extracts the features of the targeted residue and its context separately. These two types of features are processed by two paths and combined to form a protein representation, where the two types of features are of relatively equal importance. The individual PPINets are further ensembled via stacking, by which multiple types of features can be merged. To get high quality hybrid features, the dimensions of the 2 types of the features are adjusted to be equal. Therefore, the bias caused by feature dimensionality can be eliminated during feature fusion. Moreover, a novel data balancing strategy is presented. The majority class of samples (non-interaction sites) are divided into multiple sub-datasets. Each sub-dataset is merged with the entire minority class (interaction sites) to form a balanced dataset, which is used for training. In this way, the consistency of data distributions can be maintained.

Based on the above novelty, the contributions of this paper are as follows:

(1)
A hybrid feature representation method is proposed to avoid the drawbacks of the traditional sliding window-based methods. The single targeted residue feature and the context feature based sliding window are extracted. They are processed by 2 paths in the PPINet and combined to form a hybrid feature of a protein. This idea is also extended via stacking, where multiple types of features are merged to form a full representation of a protein.
(2)
A new feature fusion method is proposed, where the feature importance is balanced. In each PPINet, 2 feature vectors are concatenated to form a hybrid feature of a protein. Before concatenation, the dimensions of them are adjusted to be equal so that they can be exploited equally by the model. Therefore, the bias caused by feature dimensionality can be eliminated.

Methods

This section describes our proposed ensemble framework (StackingPPINet) for PPI prediction. The architecture of the proposed StackingPPINet is shown in Fig. 1, which fundamentally consists of a group of base classifiers, named PPINets, and a stacking module for ensembling. A PPINet is an independent classifier which predicts whether the targeted amino acid in an input sequence segment is a PPI site. It further contains a feature forming module (FFMod), a feature aggregation network (FANet) and a predictor (PPIPred). The FFMod extracts various low-level features from the input sequence by traditional feature extraction methods. The extracted low-level features are then aggregated into a highly abstracted feature vector with fixed dimension by a FANet. Based on the aggregated feature, decisions are made by the predictor, which is a deep neural network with a binary output. In StackingPPINet architecture, multiple PPINets are first trained independently and then ensembled to enhance the performance and robustness.

The base classifier for protein–protein interaction site prediction

The base classifiers in the StackingPPINet are PPINets, whose architecture is illustrated in Fig. 2. The FFMod extracts various low-level features from the input protein sequence by traditional feature extraction methods. Specifically, ${f}_{\text{tr}}$ is the targeted residue feature for the target residue, while ${f}_{\text{ctx}}$ is the context feature, which is a series of feature vectors extracted by sliding window-based method. FANet is responsible for aggregating the context feature ${f}_{\text{ctx}}$ into a vector ${f}_{\text{agg}}$ and generating a full protein segment representation, namely ${f}_{\text{prot}}$, by concatenating ${f}_{\text{tr}}$ and ${f}_{\text{agg}}$. Finally, predictions are made by PPINet, which is a deep neural network. Details of those modules are demonstrated in the following subsections.

The feature forming module

In many existing works, the entire input sequence is converted into a fixed-dimensional feature vector, or is converted into a vector sequence, where the features are extracted separately from each residue. The features of the predicting residue are treated equally to contextual residues. When the context is extended for including more information, the importance of the targeted residue will be weakened, unintendedly harming the model performance.

To address this issue, FFMod extracts the targeted residue feature ${f}_{\text{tr}}$ and the context feature ${f}_{\text{ctx}}$ separately for a residue. For the targeted residue features, FFMod firstly extracts single features and combine them. In this paper, we use six single features, they are one-hot vector [38], position-specific scoring matrix (PSSM) [39], entropy density (Den) [40], physicochemical properties (PhyChem) [41], hydrophilicity and hydrophobicity index (HyIn) [42], and the pseudo amino acid [43] based on K-nearest neighbors (K-PseAA). And the single features are then concatenated in pairs to form combination features for the targeted residue. Finally, there are three combination features for a targeted residue. Table 1 shows the details of these features. For the context feature, it is a connection of multiple residue features in the sliding window. Since it bases sliding window, zero padding is applied for the residues at the ends of the sequence. Although sliding window-based methods can model regional features to some extent, their feature aggregations are restricted by the window size and their simple aggregation patterns. Thus, FFMod only produces low-level features of the input protein sequence, which are not sufficient for PPI sites prediction. The obtained context feature ${f}_{\text{ctx}}$ will be further processed by FANet. Since those 2 types of features are provided separately, FANet can handle them respectively and balance their relative importance, which will be introduced in the next subsection.

Table 1 The feature extraction methods used in this paper

Protein–protein interaction site prediction by model ensembling with hybrid feature and self-attention

Abstract

Background

Results

Conclusion

Background

Methods

The base classifier for protein–protein interaction site prediction

The feature forming module

The feature aggregation network

The predictor of protein–protein interaction sites

The stacking of multiple base classifiers

Benchmark datasets

Data balancing strategy

Implementation details

Results

Evaluation metrics

Performance comparison of StackingPPINet and other PPI predictors

Discussion

The improvement of using multiple balanced datasets

The improvement by stacking

The effects of different integrated rules in stacking

The effectiveness of hybrid feature

The effect of sliding window length

The performance on sequences in different lengths

The effect of multi-head attention

Discussion

Conclusions

Availability of data and materials

Abbreviations

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us