
Improving peptide-MHC class I binding prediction for unbalanced datasets

Abstract

Background

Establishment of peptide binding to Major Histocompatibility Complex class I (MHCI) is a crucial step in the development of subunit vaccines and prediction of such binding could greatly reduce costs and accelerate the experimental process of identifying immunogenic peptides. Many methods have been applied to the prediction of peptide-MHCI binding, with some achieving outstanding performance. Because of the experimental methods used to measure binding or affinity between peptides and MHCI molecules, however, available datasets are enriched for nonbinders, and thus highly unbalanced. Although there is no consensus on the ideal class distribution for training sets, extremely unbalanced datasets can be detrimental to the performance of prediction algorithms.

Results

We have developed a decision-theoretic framework to construct cost-sensitive trees to predict peptide-MHCI binding and have used them to 1) Assess the impact of the training data's class distribution on classifier accuracy, and 2) Compare resampling and cost-sensitive methods as approaches to compensate for training data imbalance. Our results confirm that highly unbalanced training sets can reduce the accuracy of classifier predictions and show that, in the peptide-MHCI binding context, resampling methods do not improve the classifier performance. In contrast, cost-sensitive methods significantly improve accuracy of decision trees. Finally, we propose the use of a training scheme that, when the training set is enriched for nonbinders, consistently improves the overall classifier accuracy compared to cost-insensitive classifiers and, in particular, increases the sensitivity of the classifiers. This method minimizes the expected classification cost for large datasets.

Conclusion

Our method consistently improves the performance of decision trees in predicting peptide-MHC class I binding by using cost-balancing techniques to compensate for the imbalance in the training dataset.

Background

Determination of binding between a peptide and the Major Histocompatibility Complex class I (MHCI) molecule is a crucial step in the development of subunit vaccines. Peptide-MHCI complexes are required for T cell activation and thus for the initiation of the adaptive immune response. Although MHCI binding does not alone determine the immunogenicity of peptides, it plays an important part, being a major bottleneck that separates immunogenic peptides from non-immunogenic ones. Hence, the ability to predict the binding between peptides and MHCI molecules would greatly reduce costs and accelerate the experimental process of identifying immunogenic peptides, which can then be used in the development of vaccines and therapies against neoplastic, infectious, and autoimmune diseases.

Our primary goal is to guide experimental research in identifying potential vaccine epitopes. In a given microbial genome, there are tens of thousands of peptides and the experimental assessment of the affinity between each peptide and an MHCI molecule represents a significant cost in terms of time and resources. The investigator has to consider the benefits of identifying binders versus the cost associated with experimentally testing nonbinders in order to decide which and how many peptides will be tested in the laboratory. This type of concern can be best addressed by the use of decision-theoretic approaches. Here we formalize such an approach to training decision trees to differentiate binders from nonbinders and show how costs that reflect this experimental tradeoff can be incorporated into the training of classifiers to increase their utility.

Myriad approaches have been applied to the prediction of peptide-MHCI binding. These methods can be divided into two broad categories: 1) MHCI structure-based methods, which use crystallized structures of MHCI molecules to develop computational models of the interaction between MHCI and peptides [1]; and 2) peptide sequence-based methods, which infer the physico-chemical preferences of a particular MHCI allele by analyzing the amino acid sequence of peptides with known affinity to it, where peptides with IC50 lower than a certain threshold, typically 500 nM [2], are classified as binders, and otherwise as nonbinders. Earlier prediction methods used the amino acid frequencies in each position of MHCI-eluted peptides to derive binding motifs and position specific scoring matrices (PSSMs). Methods of this type include SYFPEITHI [3] and BIMAS [4], which have been publicly available and used extensively by the experimental community. As the number of peptides in the MHCI databases increased, so did the number of different machine learning methods that were applied to this problem, which include artificial neural networks [5], support vector machines [6], hidden Markov models [7], Gibbs sampling [8], and classification trees [9–11].

While some of these methods achieve outstanding performance in predicting binding between peptides and certain MHCI alleles, all of them suffer from the fact that the available training data are heavily biased towards one class of peptides (either binders or nonbinders). There is a vast literature on the impact of the class distribution of training sets on the performance of prediction algorithms (for further reading see Chawla et al., 2004 [12]), and although there is no straightforward answer to the question of what the ideal class distribution of a training dataset is, it has been suggested that a balanced distribution or the estimated distribution of the target population should be used. Moreover, it is a well-known phenomenon that highly unbalanced datasets are detrimental to classifier performance. The imbalance in the peptide-MHCI binding data depends on the experimental methods used to produce them: either elution assays, in which case the dataset consists purely of binders; or binding assays, in which peptides are tested for binding or affinity to a particular MHCI allele, leading to datasets consisting mostly of nonbinders. The reason for this imbalance towards nonbinders is that binders are extremely rare in nature: it has been estimated that the proportion of peptides in a protein that will bind to a given MHC allele varies between 0.001 and 0.05 [13]. Datasets generated in different laboratories using different assays and conditions are often inconsistent with each other, and thus combining datasets can be very difficult.

Here we investigate how best to use unbalanced datasets to train algorithms for the prediction of peptide-MHCI binding. Although there is no universally agreed upon method for dealing with unbalanced data, several techniques have been proposed and have been demonstrated to improve prediction accuracy depending on the context in which they are used [12]. Elkan [14] showed how to make a standard learning algorithm yield cost-sensitive results when trained with an unbalanced dataset. One successful strategy is the use of cost-sensitive methods, in which weights are used to compensate for the imbalance between the two classes. Other methods pre-process the data to achieve a balanced class distribution. In particular, two resampling methods stand out: 1) Undersampling, where random cases of the majority class are deleted until both classes have the same number of cases; and 2) Oversampling, where random cases of the minority class are duplicated until both classes have the same number of cases. Our primary goal is to determine whether or not the accuracy of peptide-MHCI binding prediction can be improved by the use of methods that compensate for the training data imbalance, such as resampling and cost-sensitive methods. The results presented herein suggest that resampling procedures, such as undersampling and oversampling, do not consistently improve the utility of classifiers used in the context of peptide-MHCI binding. The cost-sensitive method, however, significantly improves prediction accuracy when the training data are biased towards nonbinders. These results are derived from analyses using decision trees. The underlying mathematical treatment is, however, quite general, and can be applied to any classifier capable of cost-sensitive learning, including most of the classifiers used in peptide-MHCI binding prediction.

Methods

Approach

The development of subunit vaccines is a multi-step process; at each stage, the investigators must decide whether a particular peptide warrants further investment or should be omitted from further experimentation. These decisions must be informed, either explicitly or implicitly, by consideration of the costs incurred in continuing the experiments and of the potential reward for a positive discovery. One must also estimate the probability that a given decision will be erroneous, either as a false positive (continuing to invest in a peptide that will prove to be unsuitable) or a false negative (discontinuing tests on a peptide that would have worked). Let the cost of misclassifying a binder be denoted κ2 (for type 2 error) and that for misclassifying a nonbinder, κ1. We refer to κ2 as the "real-world" cost, as it can be interpreted as the number of nonbinders an investigator is willing to test in the laboratory in order to find one binder. Finally, suppose that we can parameterize a family of classifiers with the continuous vector θ. Then the cost, K, incurred in making a decision on a peptide ϕ using the classifier T(θ) is

K(ϕ|θ) = τ+(ϕ)κ2c-(ϕ|θ) + τ-(ϕ)κ1c+(ϕ|θ), (1)

where τ+ and τ- are indicators of true class and c+(·|θ) and c-(·|θ) are indicators of the classification induced by T(θ). The expected decision cost over all peptides is

EK(θ) = πκ2ϵ2(θ) + (1 - π)κ1ϵ1(θ), (2)

where π is the proportion of binders in the population and ϵi is the expected rate of type i errors. We would like to find the classifier T* that minimizes this expected cost. In the training context, we use the "training cost function", K T (ϕ|θ), which has the same form as the decision cost described above but differs from it in that both the false negative cost λ2 and the false positive cost λ1 are now tunable parameters:

K T (ϕ|θ) = τ+(ϕ)c-(ϕ)λ2 + τ-(ϕ)c+(ϕ)λ1. (3)
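
Concretely, this training cost is a weighted count of the two error types over a labeled dataset. The following minimal Python sketch illustrates it; the function and variable names are ours, not the paper's:

```python
import numpy as np

def training_cost(y_true, y_pred, lam2, lam1=1.0):
    """Training cost K_T of Eq. 3 summed over a labeled dataset.

    y_true, y_pred -- 0/1 arrays (1 = binder, 0 = nonbinder)
    lam2           -- cost of a false negative (missed binder)
    lam1           -- cost of a false positive (mislabeled nonbinder)
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fn = np.sum((y_true == 1) & (y_pred == 0))  # tau+ * c- terms
    fp = np.sum((y_true == 0) & (y_pred == 1))  # tau- * c+ terms
    return lam2 * fn + lam1 * fp
```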

Because we only ever have access to finite datasets to train classifiers, the training cost from using such classifiers can be decomposed into two parts: the decision-making cost described in Eq. 1 and the residual, due to the deviation of the training set D from the whole population:

K T (D|θ) = n {pλ2ϵ2(θ) + (1 - p)λ1ϵ1(θ)} + ∑ϕ∈D {τ+(ϕ)λ2δc-(ϕ|θ) + τ-(ϕ)λ1δc+(ϕ|θ)}, (4)

where n is the size of the training dataset, p is its proportion of positives and the classification error is defined on positive (negative) peptides as

δc-(+)(ϕ|θ) ≡ c-(+)(ϕ|θ) - ϵ2(1)(θ) (5)

We may further abbreviate this expression to

K T (D|θ) = n {pλ2ϵ2(θ) + (1 - p)λ1ϵ1(θ)} + R(D; θ, λ). (6)

The expected decision cost described in Eq. 2 is minimized at θ̂, where

0 = (∂/∂θ) EK(θ̂) = πκ2 (∂ϵ2/∂θ)(θ̂) + (1 - π)κ1 (∂ϵ1/∂θ)(θ̂). (7)

Similarly, the training cost function is minimized at θ*, where

0 = (∂/∂θ) K T (D|θ*) = pλ2 (∂ϵ2/∂θ)(θ*) + (1 - p)λ1 (∂ϵ1/∂θ)(θ*) + (∂R/∂θ)(D; θ*, λ). (8)

Denote the value of θ that minimizes the expected decision cost by θ̂, and that which minimizes the training cost function by θ* = θ̂ + (1/n)δθ. Now differentiation and Taylor expansion yield the sufficient condition for the minimum of the training cost function to approach θ̂ as R(θ)/n → 0:

λ2B = [π/(1 - π)] [(1 - p)/p] (κ2/κ1) λ1. (9)

This expression defines what we refer to as the "balancing cost", λ2B. The basic intuition behind the balancing cost is that its use results in both classes having equal importance in the training of the classifier. It is helpful to note that: 1) As the real-world false negative cost κ2 increases, so does the balancing cost; and 2) As the proportion of positives p in the training set increases in relation to the population positive frequency π, the balancing cost decreases (see figure 1). Finally, we have

Figure 1

Theoretical relation between λ2 and EK(θ). Theoretical relationship between the training false negative cost (λ2) that minimizes the expected cost of a classifier (EK(θ)) for a given type 2 error cost (κ2). The dotted lines represent one standard deviation from the mean. Here κ1 = 1, λ1 = 1 and π = 0.5.

δθ = -(∂R/∂θ) [pλ2 ∂²ϵ2/∂θ² + (1 - p)λ1 ∂²ϵ1/∂θ²]⁻¹, (10)

with the right-hand side evaluated at θ̂, which provides a first-order correction for finite datasets. Figure 1 displays the relation between population and training-sample positive proportions and costs as described in Eq. 9, and can serve as a guideline for what weights to assign to peptides of different classes given the class distribution in the training set and the relative importance of positives versus negatives in the real-world application.
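
In code, the balancing cost of Eq. 9 is a one-line computation; the sketch below (names ours) also illustrates the simplified case discussed later, in which it reduces to (1 - p)/p:

```python
def balancing_cost(p, pi=0.5, kappa2=1.0, kappa1=1.0, lam1=1.0):
    """Balancing false negative training cost λ2B of Eq. 9.

    p      -- proportion of binders in the training set
    pi     -- proportion of binders in the target population
    kappa2 -- real-world cost of missing a binder (type 2 error)
    kappa1 -- real-world cost of testing a nonbinder (type 1 error)
    """
    return (pi / (1 - pi)) * ((1 - p) / p) * (kappa2 / kappa1) * lam1

# With pi = 0.5 and unit real-world costs this reduces to (1 - p)/p,
# e.g. a training set with 10% binders gives a balancing cost of 9.
print(balancing_cost(p=0.10))  # 9.0
```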

Datasets

The peptide binding data used to train and test the decision trees were obtained from a publicly available database published by Peters et al. [2], in which the affinity of a peptide for a particular MHCI molecule is measured by one of two assays; a peptide is classified as a binder when its IC50 is less than or equal to 500 nM, and as a nonbinder otherwise. Decision trees were constructed for each of the 35 alleles in the dataset. The cost-sensitive and resampling experiments (described below) were performed for five alleles: A0203, A1101, A3101, B0702 and B1501. The numbers of peptides in the datasets for these five alleles are shown in table 1.

Table 1 Number of binders and nonbinders in Peters et al. [2] datasets for 5 alleles.

Cost Adjustments

Seven training sets for each allele studied were generated, such that all training sets for a given allele had the same number of observations but varying proportions p of positives, namely 5%, 10%, 25%, 50%, 75%, 90% and 95%. These training sets were created as follows. First, 25% of the binders and 25% of the nonbinders were randomly selected and set aside as a testing set. The remaining 75% of the binders and of the nonbinders formed the "training superset", from which the peptides for the various training sets were sampled. The total number of peptides in each of the seven training sets was fixed and equal to the number of peptides of the minority class in the training superset. The minority class was the positive class for all five alleles that we tested. Finally, the training sets were formed by randomly sampling positive and negative peptides without replacement from the training superset until the desired class distribution was reached. The numbers of binders and nonbinders in the resulting training sets are shown in table 2.

Table 2 Training sets used in the cost-sensitive experiment
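
A minimal sketch of this sampling scheme follows; here binders and nonbinders stand for the training superset remaining after the 25% test split, and the toy peptides and function names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_training_set(binders, nonbinders, p):
    """Draw a fixed-size training set with a proportion p of binders.

    The total size equals the size of the minority class of the
    training superset; peptides are sampled without replacement.
    """
    n_total = min(len(binders), len(nonbinders))
    n_pos = int(round(p * n_total))
    n_neg = n_total - n_pos
    pos = rng.choice(binders, size=n_pos, replace=False)
    neg = rng.choice(nonbinders, size=n_neg, replace=False)
    peptides = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
    return peptides, labels

# Example: toy placeholder 9-mers, one training set per target proportion.
binders = ["SIINFEKLV"] * 100
nonbinders = ["AAAAAAAAA"] * 900
for p in (0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95):
    X, y = make_training_set(binders, nonbinders, p)
```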

The goal of this set of experiments was two-fold: 1) To investigate the relationship between class distribution and classifier performance, and 2) To learn how misclassification costs can be used to improve prediction accuracy for a given class distribution of the training set. We emphasize that our goal is not to improve upon existing computational methods, but rather to show that the performance of a single classifier can be improved with the use of cost-sensitive techniques. Misclassification costs were used as weights with the purpose of artificially changing the class distribution of the training dataset. The false negative cost (λ2) can be interpreted as the weight given to the peptides in the positive class, and similarly the false positive cost (λ1) is the weight given to the negative class. The overall scale of the training cost function (Eq. 6) is arbitrary, so we fixed λ1 = 1 and varied λ2 between 1/20 and 20 in order to investigate the relationship between costs and class distribution.
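
The paper implements these weights in its own tree-growing code; the same idea can be sketched with an off-the-shelf learner by passing the costs as class weights. An illustration (not the authors' implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins: rows are encoded peptides, labels 1 = binder, 0 = nonbinder.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 20))
y = rng.integers(0, 2, size=200)

# lam1 weights the nonbinder class, lam2 the binder class; sweeping
# lam2 from 1/20 to 20 reproduces the cost sweep described above.
lam1 = 1.0
for lam2 in np.geomspace(1 / 20, 20, num=9):
    tree = DecisionTreeClassifier(class_weight={0: lam1, 1: lam2},
                                  random_state=0)
    tree.fit(X, y)
```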

Previous work (e.g., [15]) has suggested that among the best class distributions for learning is the balanced distribution, one in which all classes are equally represented. We assume that, given an unbalanced training set, a balancing misclassification cost can be used to achieve an artificially nearly-balanced class distribution. The balancing cost λ2B, defined in Eq. 9, can be interpreted as the λ2 that weights the positive peptides so that they carry the same total weight as the negatives, and therefore compensates for the imbalance between the two classes. Consider the simplest scenario, where λ1 = 1, κ1 = 1, κ2 = 1 and π = 0.5; then the balancing cost reduces to

λ2B = (1 - p)/p.

We are particularly interested in how classifiers trained with this simplified balancing cost perform compared to the best classifiers for a given allele, as well as compared with classifiers trained with unit costs (λ1 = 1 and λ2 = 1).

Resampling

Undersampling

The undersampling method consists of randomly eliminating peptides of the majority class from the training set until both classes have the same number of examples. The training sets were constructed in a similar manner to the cost-modifying experiment. First, we set aside 25% of binders and nonbinders into the testing set. The remaining binders were put into the training set together with the same number of nonbinders, which were randomly sampled without replacement from the nonbinders training superset. One of the issues concerning undersampling is the loss of information that results from the process, which can be aggravated when particularly important elements are removed from the training set. To get around this problem, we used 10-fold cross-validation, and the results presented here are the average of the 10 experiments.

Oversampling

The oversampling method consists of randomly replicating peptides of the minority class in the training set until both classes have the same number of examples. The training sets were constructed as follows. First, we set aside 25% of binders and nonbinders into the testing set. All remaining peptides were put into the training set together with d additional peptides from the minority class, sampled with replacement, where d is the difference between the number of peptides in the training set belonging to the majority and minority classes. Hence, each peptide of the minority class is represented at least once and possibly multiple times in the training set. As with the undersampling procedure, we used 10-fold cross-validation, and the results presented here are the average of the 10 experiments.
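
Both resampling procedures are straightforward to implement; the sketches below (names ours) mirror the descriptions above:

```python
import numpy as np

rng = np.random.default_rng(0)

def undersample(X, y):
    """Randomly drop majority-class rows until both classes are equal."""
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    keep = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, keep])
    return X[idx], y[idx]

def oversample(X, y):
    """Randomly replicate minority-class rows until both classes are equal.

    Every minority example appears at least once; the d extra copies are
    drawn with replacement.
    """
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = rng.choice(minority, size=len(majority) - len(minority),
                       replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

# Toy example: 10% binders, 90% nonbinders.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 10 + [0] * 90)
Xu, yu = undersample(X, y)  # 10 binders + 10 nonbinders
Xo, yo = oversample(X, y)   # 90 binders (with repeats) + 90 nonbinders
```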

Decision trees

The present study applies tree-based models to the peptide-MHCI binding prediction problem. We have chosen decision trees for the simplicity of their interpretation and because they have not been thoroughly explored in the context of peptide-MHCI binding. Moreover, decision and classification trees have become the canonical method for comparing techniques used to deal with unbalanced datasets in the machine learning community. Finally, there seems to be a natural correspondence between the importance of the different residue positions in a peptide and the hierarchical way in which decision trees are constructed.

Tree generation

Breiman et al. [16] provide an excellent and detailed description of classification and regression trees. Briefly, given a dataset in which each object ϕ is represented by a (τ(ϕ), x(ϕ)) pair, where x(ϕ) is a vector containing attributes of the object and τ is an indicator function of the class of the object, a tree-based classifier recursively partitions the data's attribute space into sub-regions, called nodes, in which the response variable is increasingly homogeneous. These trees are created in two steps: (1) induction of a large tree; and (2) pruning of the large tree into gradually smaller subtrees (here we use cost-complexity pruning [16]). Finally, one subtree must be chosen from the sequence of subtrees generated by the pruning process. In the present study, we chose the tree that minimizes the training cost function (Eq. 6) when applied to the test set.
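
As an illustration of this selection step, scikit-learn exposes the cost-complexity pruning sequence directly; the sketch below, on toy data, keeps the subtree with the lowest weighted error on held-out data. It is analogous to, not identical with, the authors' implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = rng.integers(0, 2, size=(300, 20)), rng.integers(0, 2, size=300)
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

lam1, lam2 = 1.0, 9.0  # training costs, as in Eq. 6

def weighted_error(tree):
    """Test-set analogue of the training cost function."""
    pred = tree.predict(X_te)
    fn = np.sum((y_te == 1) & (pred == 0))
    fp = np.sum((y_te == 0) & (pred == 1))
    return lam2 * fn + lam1 * fp

# One pruned subtree per complexity parameter alpha; keep the cheapest.
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_tr, y_tr).ccp_alphas
subtrees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
            for a in alphas]
best = min(subtrees, key=weighted_error)
```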

The construction of a tree requires (1) a set of splits, which are binary questions with mutually exclusive and exhaustive outcomes used to partition the data, where the questions are posed in terms of the attributes of the objects in the dataset; and (2) a split function used to quantify the goodness of a split by measuring the change in the homogeneity of the response variable in the tree due to splitting a node into two subsets based on the given split.

Splits and split function

In the problem at hand, the training dataset consists of peptides ϕ, where τ(ϕ) is the class of the peptide (either binder or nonbinder) and x(ϕ) is the linear sequence of amino acids of the peptide, with x_j being the jth amino acid from the amino-terminal end of the peptide. The binary questions about the sequence of peptides can be phrased in several distinct ways, and each one of them generates a different class of splits, called motifs, that can be used in the construction of trees. We used motifs based on the anchor positions, which are represented by a single amino acid with a fixed position in the peptide, such that every amino acid is represented in every position of the peptide. The amino acids, in turn, can be represented in one of two ways: 1) by the traditional amino acid single-letter code, for example, alanine by "A", arginine by "R" and so forth; and 2) by their physico-chemical properties, namely molecular weight, hydropathicity, volume, isoelectric point, polarity, ability to form hydrogen bonds and chain type (aliphatic, aromatic), as previously shown [17].
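
For example, the single-letter representation corresponds to one binary feature per (position, amino acid) pair; a minimal encoder (ours, not the paper's code) is:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot_encode(peptide):
    """Position-specific one-hot encoding of a peptide.

    Each feature answers one split question of the form "is amino acid
    a at position j?". A physico-chemical encoding would instead map
    each residue to properties such as hydropathicity, volume, or
    isoelectric point.
    """
    x = np.zeros((len(peptide), len(AMINO_ACIDS)))
    for j, aa in enumerate(peptide):
        x[j, AMINO_ACIDS.index(aa)] = 1.0
    return x.ravel()

print(one_hot_encode("SIINFEKL").shape)  # (160,) for an 8-mer
```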

The split function used was the training cost described in Eq. 3.

Results

Cost Adjustments

The first goal of this set of experiments was to investigate the relationship between class distribution and classifier performance. Our results suggest that for a fixed training set size, decision trees perform best when trained with datasets of nearly balanced class distribution. Figure 2 shows the performance of classifiers trained with datasets of the same size but different class distributions and training costs for alleles A1101 and B0702 (see Additional file 1 for the results for the other three alleles). Note that as the proportion of positives in the training set increases, the false negative rate decreases and the false positive rate increases, as can be seen by the subtle shift in the curves from left to right.

Figure 2

Classifier performance vs. class distribution. Comparison of the performance of classifiers built with training sets of same size but different proportions of positives for alleles A1101 (left panel) and B0702 (right panel). Each point in a curve represents a classifier constructed with a different false negative training cost. The classifier constructed with the unit cost (λ2 = 1) in each curve is marked with a solid circle and that constructed with the balancing cost is marked with a star. The curve for the perfect classifier would lie on the dotted line. The y-axis shows the total error rate of a classifier, which is the same as the classifier cost (K) when the type 1 and type 2 misclassification costs are identical (κ1 = κ2 = 1). FNR: false negative rate. FPR: false positive rate.

Our second goal was to determine whether or not prediction accuracy of a given classifier can be improved by the use of cost-sensitive techniques and, if so, to establish the relationship between classifier performance and training costs. Our results demonstrate that misclassification costs can be used to improve prediction accuracy. In fact, for each one of the alleles we tested there was a cost λ2 that performed significantly better than the unit cost, as can be seen by the increase in AUC shown in table 3. Although our goal is not to improve upon the performance of existing methods, we also show in table 3 the AUC for 4 other methods as described in [2] for purposes of comparison.

Table 3 Comparison of classifier performance as measured by AUC

Note in figure 2 that for the training sets with a majority of nonbinders, λ2B consistently reduced the total error rate as compared to the unit cost (λ2 = 1). The impact of λ2B on the performance of classifiers trained with binder-enriched datasets was not consistent, being better than the unit cost for some classifiers and worse for others. In addition to representing an improvement over the unit cost, in a few cases λ2B coincided with the minimizing cost; that is, the most accurate classifier for a given allele and training set was the one trained with λ2B. However, in most cases the balancing cost over-compensated for the imbalance in the class distribution, such that it was larger than the minimizing cost (see figure 3).

Figure 3

Balancing cost vs. minimizing cost. Comparison of balancing cost (solid black line) and the minimizing costs (symbols) for each one of the five alleles.

We then compared the performance of trees trained with the complete dataset using either the unit cost or λ2B (the green and blue ROC curves in figure 4, respectively). The use of λ2B resulted in AUCs at least as large as those for the unit cost, such that λ2B improved the ROC curves as compared to the unit cost in the majority of the cases. One interesting feature of the use of λ2B is that it consistently shifts the ROC curve toward increasing sensitivity at the price of decreasing specificity, which is a desirable tradeoff when binders are rare. Thus, even in the cases when the increase in AUC is not substantial, the use of λ2B can still represent an improvement over the unit cost due to the shift it causes in the ROC curve.

Figure 4

Comparison of unit cost, balancing cost, undersampling and oversampling. ROC curves for alleles A1101 (left panel) and B0702 (right panel) comparing the results of trees constructed with the oversampled training set (black curve), the undersampled training set (red curve), and the full training set without training costs, that is, λ1 = λ2 = 1 (green curve) and with the balancing training cost, that is, λ1 = 1 and λ2 = (1 - p)/p (blue curve). The ROC curves were constructed by varying the threshold used to label a node from 0 to 1 and evaluating its sensitivity and specificity at each threshold.

Resampling

The results obtained using the balanced undersampled and oversampled training sets did not represent an improvement over those using the complete unbalanced training sets (see figure 4 and Additional file 2). For alleles A0203, A1101 and B0702, the ROC curves for the trees trained with the entire dataset and those trained with the re-sampled dataset were indistinguishable from one another, whereas for alleles A3101 and B1501, the use of undersampling severely damaged the accuracy of the trees.

Real-world costs versus training costs

We built decision trees with the training data described in table 1 using different values of the false negative cost (λ2), and evaluated them on a test set using the "real-world" cost κ2. We call λ̂2 the training cost that minimizes the total cost of a classifier on the test set for a given κ2. Figure 5 shows the relationship between λ̂2 and κ2. Note that although the results are relatively noisy, in general the same trend predicted by the theory can be observed in this empirical data (see figure 1). The λ̂2 increases with κ2 and, as the proportion of positives in the training set increases, the line shifts to the right, indicating that for a particular value of κ2, the suggested λ̂2 decreases.
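
Schematically, this experiment is a grid search over λ2 for each value of κ2. The sketch below uses toy data and class weights in an off-the-shelf tree in place of the paper's implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = rng.integers(0, 2, size=(400, 20)), rng.integers(0, 2, size=400)
X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]

def real_world_cost(pred, kappa2, kappa1=1.0):
    """Total decision cost of test-set predictions (cf. Eq. 1)."""
    fn = np.sum((y_te == 1) & (pred == 0))
    fp = np.sum((y_te == 0) & (pred == 1))
    return kappa2 * fn + kappa1 * fp

# For each real-world cost kappa2, scan training costs lam2 and keep the
# one whose tree is cheapest on the test set: the empirical lambda2-hat.
lam2_grid = np.geomspace(1 / 20, 20, num=15)
for kappa2 in (1, 2, 5, 10):
    costs = [real_world_cost(
                 DecisionTreeClassifier(class_weight={0: 1.0, 1: lam2},
                                        random_state=0)
                 .fit(X_tr, y_tr).predict(X_te), kappa2)
             for lam2 in lam2_grid]
    print(kappa2, lam2_grid[int(np.argmin(costs))])
```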

Figure 5

Empirical relation between λ̂2 and EK(θ). Optimal false negative training cost (λ̂2) as a function of type 2 error cost (κ2). Classifiers were trained at multiple values of λ2 and tested at κ2 (compare with figure 1). This was done for each of the five alleles and the λ2 shown in the curves are the average of the minimizing λ2 for each allele.

Discussion

Prediction of peptide-MHCI binding has great potential to accelerate and reduce the cost of subunit vaccine development. One of the issues concerning the prediction of MHC-peptide binding is that binders are much less abundant than nonbinders, and thus much harder to find experimentally. This circumstance typically leads to highly unbalanced training sets, which can hinder the performance of algorithms trained with them. In fact, such training sets lead to a significant increase of type 2 errors and thus make it more difficult still to find binders.

Our results show that highly unbalanced training sets do indeed reduce the accuracy of predictions made with decision trees and that these predictions improve as the training sets become more balanced. We have examined three approaches that aim at improving classifier accuracy by compensating for the imbalance in the class distribution of the training sets: undersampling, oversampling and a cost-sensitive method. Overall, resampling did not improve the performance of the decision trees. In fact, in several cases classifiers trained with undersampled training sets performed much worse than those trained with the full dataset. This could have been caused by the loss of information relevant to the training process. For this reason, undersampling methods may only be appropriately used with datasets in which the majority class contains a lot of redundancy, in which circumstance undersampling has been shown to outperform other random resampling methods in four distinct datasets [18]. Another potential drawback of undersampling, and in broader terms of random resampling methods, is that they may yield noisy results due to the variability introduced in the process by the randomness of the sampling procedure.

In contrast to undersampling, using misclassification costs as a means to artificially counterbalance data bias led to significant improvements in the performance of the decision trees in the majority of the cases. Although cost-sensitive procedures do not add any extra information to the training set, they seem to be more advantageous than random resampling techniques because they do not cause loss of information, as undersampling does, and do not have the extra variability introduced by the random sampling process. Several other studies have shown cost-modifying methods to be advantageous. For example, Japkowicz and Stephen [19] performed a systematic comparison of these methods in both artificially-generated and real-world domains, showing that cost-modifying methods yield better results than resampling techniques. Fundamentally, the cost-sensitive method described here can be straightforwardly applied to any classifier that is trained using datasets that include both classes of peptides, binders and nonbinders. For instance, the individual weights of a weight matrix can be derived by minimizing the cost function (Eq. 3) over these weights. The indicator function c-(+)(ϕ) can be defined by whether the score function is above or below a given threshold, where the scoring function is typically the sum of the scores of each amino acid in each position of a peptide. Similarly, this cost function can be incorporated into a neural network by differentially weighting the output depending on the class of the training example, allowing it to be used in the learning process by the backpropagation procedure [20, 21]. Likewise, for support vector machines, the cost function can be implemented through the definition of the "soft margin" [22], allowing the SVM to misclassify more examples of one class than examples of the other class.
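
For instance, in scikit-learn's soft-margin SVM, per-class weights scale the penalty parameter C, which is one concrete way to realize the asymmetry just described (an illustration with our chosen weights, not the paper's setup):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))    # encoded peptides (toy stand-in)
y = rng.integers(0, 2, size=200)  # 1 = binder, 0 = nonbinder

# Weighting the binder class by lam2 multiplies its misclassification
# penalty C, so the optimizer tolerates more false positives than
# false negatives.
lam1, lam2 = 1.0, 9.0
svm = SVC(kernel="rbf", class_weight={0: lam1, 1: lam2}).fit(X, y)
```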

In addition to showing that peptide-MHCI binding predictions can be improved by the use of cost-sensitive decision trees, we have investigated the use of the balancing cost, λ2B, as a rule of thumb to train classifiers. We have shown that although λ2B is not always the λ2 that minimizes the total cost of the classifier, it consistently outperforms the unit cost (λ2 = λ1) when the training set is enriched for nonbinders.

Moreover, we have shown that the use of λ2B shifts the ROC curves towards areas of higher sensitivity relative to ROC curves generated with the unit cost, which can be highly desirable in situations such as epitope discovery projects.

Thus, although the relationship between training costs and class imbalance is relatively noisy, and further studies should be conducted before a complete guideline can be given for what training costs to use with a particular peptide-MHCI binding dataset, our results allow us to suggest that a balancing cost should be used for datasets enriched for nonbinders, and the unit cost for binder-enriched training sets.

Conclusion

The vaccine development process is costly and time-consuming, requiring decisions to be made at each step, and it therefore lends itself naturally to the decision-theoretic approach we have described here. In particular, at the epitope discovery stage, there are real costs associated with the risk of missing a positive and with the experimental verification of nonbinders. Here we have described a decision-theoretic framework for the prediction of peptide-MHCI binding and have provided a guideline on how to incorporate real-world costs together with misclassification costs at the training level in order to maximize prediction accuracy and push it in the desired direction.

References

  1. Zhang C, Anderson A, DeLisi C: Structural principles that govern the peptide-binding motifs of class I MHC molecules. J Mol Biol 1998, 281(5):929–47. 10.1006/jmbi.1998.1982


  2. Peters B, Bui HH, Frankild S, Nielson M, Lundegaard C, Kostem E, Basch D, Lamberth K, Harndahl M, Fleri W, Wilson SS, Sidney J, Lund O, Buus S, Sette A: A community resource benchmarking predictions of peptide binding to MHC-I molecules. PLoS Comput Biol 2006, 2(6):e65. 10.1371/journal.pcbi.0020065


  3. Rammensee H, Bachmann J, Emmerich NP, Bachor OA, Stevanovic S: SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics 1999, 50(3–4):213–9. 10.1007/s002510050595


  4. Parker KC, Bednarek MA, Coligan JE: Scheme for ranking potential HLA-A2 binding peptides based on independent binding of individual peptide side-chains. J Immunol 1994, 152: 163–75.


  5. Gulukota K, Sidney J, Sette A, DeLisi C: Two complementary methods for predicting peptides binding major histocompatibility complex molecules. J Mol Biol 1997, 267(5):1258–67. 10.1006/jmbi.1997.0937


  6. Donnes P, Elofsson A: Prediction of MHC class I binding peptides, using SVMHC. BMC Bioinformatics 2002, 3: 25. 10.1186/1471-2105-3-25


  7. Yu K, Petrovsky N, Schonbach C, Koh JY, Brusic V: Methods for prediction of peptide binding to MHC molecules: a comparative study. Mol Med 2002, 8(3):137–48.


  8. Nielsen M, Lundegaard C, Worning P, Hvid CS, Lamberth K, Buus S, Brunak S, Lund O: Improved prediction of MHC class I and class II epitopes using a novel Gibbs sampling approach. Bioinformatics 2004, 20(9):1388–97. 10.1093/bioinformatics/bth100


  9. Segal MR, Cummings MP, Hubbard AE: Relating amino acid sequence to phenotype: analysis of peptide-binding data. Biometrics 2001, 57(2):632–42. 10.1111/j.0006-341X.2001.00632.x


  10. Zhu S, Udaka K, Sidney J, Sette A, Aoki-Kinoshita KF, Mamitsuka H: Improving MHC binding peptide prediction by incorporating binding data of auxiliary MHC molecules. Bioinformatics 2006, 22(13):1648–55. 10.1093/bioinformatics/btl141


  11. Peters B, Tong W, Sidney J, Sette A, Weng Z: Examining the independent binding assumption for binding of peptide epitopes to MHC-I molecules. Bioinformatics 2003, 19(14):1765–72. 10.1093/bioinformatics/btg247


  12. Chawla NV, Japkowicz N, Kotcz A: Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 2004, 6: 1–6. 10.1145/1007730.1007733


  13. Brusic V, Zeleznikow J: Computational binding assays of antigenic peptides. Letters in Peptide Science 1999, 6: 313–324.


  14. Elkan C: The Foundations of Cost-Sensitive Learning. IJCAI 2001, 973–978.


  15. Weiss GM, Provost FJ: Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction. J Artif Intell Res (JAIR) 2003, 19: 315–354.


  16. Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and regression trees. Wadsworth statistics/probability series. New York, N.Y.: Chapman and Hall; 1993.


  17. Ray S, Kepler T: Amino acid biophysical properties in the statistical prediction of peptide-MHC class I binding. Immunome Research 2007, 3: 9. [http://www.immunome-research.com/content/3/1/9] 10.1186/1745-7580-3-9


  18. Drummond C, Holte R: C4.5, Class Imbalance, and Cost-Sensitivity: Why Under-Sampling beats Over-Sampling. Proceedings of the International Conference on Machine Learning (ICML 2003) Workshop on Learning from Imbalanced Data Sets II 2003.


  19. Japkowicz N, Stephen S: The class imbalance problem: A systematic study. Intelligent Data Analysis 2002, 6: 429–449.


  20. Kukar M, Kononenko I: Cost-Sensitive Learning with Neural Networks. European Conference on Artificial Intelligence 1998, 445–449. [http://citeseer.ist.psu.edu/kukar98costsensitive.html]


  21. Zhou ZH, Liu XY: Training Cost-Sensitive Neural Networks with Methods Addressing the Class Imbalance Problem. IEEE Transactions on Knowledge and Data Engineering 2006, 18: 63–77. 10.1109/TKDE.2006.17


  22. Brefeld U, Geibel P, Wysotzki F: Support Vector Machines with Example Dependent Costs. Lecture Notes in Computer Science 2003, 23–34.



Acknowledgements

We thank Cliburn Chan for insightful discussions, Yongting Cai for sharing experimental data used in our pilot studies and Kent Weinhold for his leadership and guidance. This work was supported by the Large Scale Antibody & T cell Epitope Discovery Program (Kent Weinhold, PI), under the NIH grant N01-A1-400822.

Author information

Corresponding author

Correspondence to Thomas B Kepler.

Additional information

Authors' contributions

APS performed the computational experiments, analyzed the data and wrote the manuscript. TBK supervised the study and wrote the manuscript. GDT supervised the pilot experimental studies and helped revise the manuscript.

Electronic supplementary material


Additional file 1: Classifier performance vs class distribution for alleles A0203, A3101 and B1501. Comparison of the performance of classifiers built with training sets of same size but different proportions of positives for alleles A0203, A3101 and B1501 (compare to figure 2). Each point in a curve represents a classifier constructed with a different false negative training cost. The classifier constructed with the unit cost (λ2 = 1) in each curve is marked with a solid circle and that constructed with the balancing cost is marked with a star. The curve for the perfect classifier would lie on the dotted line. The y-axis shows the total error rate of a classifier, which is the same as the classifier cost (K) when the type 1 and type 2 misclassification costs are identical (κ1 = κ2 = 1). FNR: false negative rate. FPR: false positive rate. (PDF 7 KB)


Additional file 2: Comparison of unit cost, balancing cost, undersampling and oversampling for alleles A0203, A3101 and B1501. ROC curves for alleles A0203, A3101 and B1501 comparing the results of trees constructed with the oversampled training set (black curve), the undersampled training set (red curve), and the full training set without training costs, that is, λ1 = λ2 = 1 (green curve), and with the balancing training cost, that is, λ1 = 1 and λ2 = (1 - p)/p (blue curve). Compare to figure 4. The ROC curves were constructed by varying the threshold used to label a node from 0 to 1 and evaluating its sensitivity and specificity at each threshold. (PDF 10 KB)


Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This Open Access article is distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Cite this article

Sales, A.P., Tomaras, G.D. & Kepler, T.B. Improving peptide-MHC class I binding prediction for unbalanced datasets. BMC Bioinformatics 9, 385 (2008). https://doi.org/10.1186/1471-2105-9-385
