NCMHap: a novel method for haplotype reconstruction based on Neutrosophic c-means clustering

Zamani, Fatemeh; Olyaee, Mohammad Hossein; Khanteymoori, Alireza

doi:10.1186/s12859-020-03775-0

Methodology article
Open access
Published: 22 October 2020

Fatemeh Zamani¹,
Mohammad Hossein Olyaee² &
Alireza Khanteymoori¹

BMC Bioinformatics volume 21, Article number: 475 (2020)

Abstract

Background

The single individual haplotype problem refers to reconstructing the haplotypes of an individual from a set of input fragments sequenced from a given chromosome. Solving this problem is an important task in computational biology, with applications in the pharmaceutical industry, clinical decision-making, and the study of genetic diseases. The problem is known to be NP-hard. Although several methods have been proposed, most of them perform poorly on noisy input fragments. Designing a method that is both accurate and scalable therefore remains a challenging task.

Results

In this paper, we introduce a method named NCMHap, which uses the Neutrosophic c-means (NCM) clustering algorithm. NCM can effectively detect noise and outliers in the input data and reduce their effect on the clustering process. The proposed method was evaluated on several benchmark datasets. Comparison with existing methods indicates that the results are encouraging when the NCM parameters are tuned suitably. In particular, as the amount of noise increases, NCMHap outperforms the compared methods.

Conclusion

The proposed method was validated on simulated and real datasets. The results support applying NCMHap to datasets whose fragments contain a large number of gaps and a high level of noise.

Background

The human genome shows some degree of inter-individual and inter-population variation, which makes it an appropriate target for rigorous functional genomic analysis [1, 2]. Recent cost-effective next-generation sequencing (NGS) technologies have produced a huge number of individual human genome sequences [3]. More than 99% of the genome is identical across humans, so the vast differences among people arise from less than 1% of variation [4, 5]. Single nucleotide polymorphisms (SNPs) are the most common of these genetic variations. A sequence of SNPs that co-occur on a specific chromosome is called a haplotype. Diploid species such as humans carry two copies of each chromosome. Since each haplotype is derived from one copy of a chromosome, there are two haplotypes per chromosome.

Haplotypes provide more information than individual SNPs, which is valuable for investigating the relationship between genetic variations and complex diseases [6], studying human history [7], providing personalized medicine [8], and studying biological mechanisms [9].

Although obtaining haplotypes is an important task, direct experimental analysis of haplotypes is labor-intensive, expensive, and restricted to local haplotypes. In practice, human haplotype data are provided as sequencing reads (fragments). Given the importance of detecting genetic variations and the limitations of molecular approaches, obtaining haplotype information from these numerous fragments may have profound effects on different aspects of medicine and molecular biology [10,11,12,13]. The availability of the fragments makes it possible to assemble haplotypes in a process referred to as single individual haplotyping (SIH) [14], which is performed in silico (computer-aided) using statistical and computational approaches.

For this purpose, the region of interest on the chromosome is sequenced several times, yielding a number of fragments. Due to the limitations of sequencing methods, the fragments contain errors and gaps. Errors arise from incorrectly determined allele values, whereas gaps correspond to allele positions that were determined with low confidence. SIH attempts to assign each fragment to the correct chromosome copy and then detects and corrects the errors to reconstruct the haplotypes. Several models have been proposed for this problem, chief among them minimum SNP removal (MSR) [14], minimum fragment removal (MFR) [14], and minimum error correction (MEC) [15]. Among these, MEC is the most efficient and has been applied in several approaches [16,17,18,19]. The aim of this model is to find and correct the errors with the minimum number of letter changes in the input fragments. All of these models have been proved to be NP-hard [14]. Most current methods construct a weighted graph in which each fragment corresponds to a vertex and the weight of each edge represents the similarity between the fragments it connects. Depending on the model, each method then transforms this graph into a bipartite graph; in the MEC model, for example, this is done by deleting the smallest number of conflicting edges. AROHap [19] and FCMHap [20] are two recent methods that address the problem under the MEC model. The first uses the asexual reproduction optimization (ARO) algorithm to improve a fitness function designed around the MEC model. The second exploits the Fuzzy c-means (FCM) clustering algorithm to improve the initial haplotypes iteratively; it divides the input fragments into two groups and obtains the haplotypes as the centers of the clusters.

However, some popular methods such as MCMC [21] and HapCUT [16] build the graph in a different way. These methods start with a set of arbitrary sequences as initial haplotypes and improve them step by step with respect to the input fragments. They build a similar weighted graph, but with SNPs rather than fragments as the vertices. Two SNPs are connected if they are covered by at least one input fragment. The weight of each edge describes the consistency with the corresponding positions in the current haplotypes. Although this model efficiently describes the consistency of the current haplotypes with the input fragments, gaps and noise may lead to inaccurate weights [22].

In this paper, we propose a fast and accurate haplotype reconstruction method named NCMHap, which involves two steps. First, a weighted fuzzy conflict graph is built in which each node corresponds to an input fragment and the weight of each edge measures the incompatibility between the corresponding fragments. Removing the smallest number of conflicting edges according to the MEC model and bi-partitioning the input fragments yields an initial fragment clustering. Next, to decrease the effect of noise and outliers on these clusters, the Neutrosophic c-means (NCM) clustering method is applied. By assigning a coefficient to each input fragment, NCM can reduce the effect of noise on the clustering process. The performance of the proposed method is validated on both simulated and real datasets. According to the results, with appropriate values for the NCM parameters, our method reconstructs haplotypes that are close to optimal.

Results

In this section, the performance of NCMHap is evaluated on simulated and publicly available real datasets.

Setting the parameters

The proposed method was implemented in MATLAB, and all experiments were run on an Intel Core i5 processor at 2.7 GHz with 8 GB of RAM. Among the parameters, m and $\varepsilon$ are shared with fuzzy c-means clustering and are usually set to 2 and $10^{ - 5}$, respectively. The other parameters, $\delta$, $w_{1}$, $w_{2}$, and $w_{3}$, were set to 25, 0.7, 0.2, and 0.1, respectively, after tuning by trial and error. To this end, following Guo and Sengur [23], a grid search over the trade-off constant $\delta$ in {5, 10, 15,…, 30} and over $w_{1}$, $w_{2}$, and $w_{3}$ in {0.1, 0.2, 0.3,…, 0.9} was performed to find the best results. As in previous works [16, 19, 22, 24,25,26,27], the reconstruction rate (RR) is used to evaluate the quality of the obtained haplotypes.
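The parameter grid described above can be enumerated with a few lines of Python. This is only a sketch of the search space, not the authors' MATLAB tuning code. The function name `parameter_grid` is hypothetical, and the option of keeping only weight triples with $w_{1} + w_{2} + w_{3} = 1$ is an assumption: the selected values (0.7, 0.2, 0.1) satisfy it, but the text does not state the constraint.

```python
from itertools import product

# Candidate values for delta: {5, 10, ..., 30}, as stated in the text.
DELTAS = range(5, 31, 5)
# Weights are handled in tenths (1..9 stands for 0.1..0.9) to avoid
# floating-point round-off when testing the sum.
TENTHS = range(1, 10)

def parameter_grid(require_unit_sum=True):
    """Enumerate (delta, w1, w2, w3) candidates for the grid search.

    require_unit_sum is an ASSUMPTION: when True, only weight triples
    with w1 + w2 + w3 = 1 are kept.
    """
    grid = []
    for delta, a, b, c in product(DELTAS, TENTHS, TENTHS, TENTHS):
        if require_unit_sum and a + b + c != 10:
            continue
        grid.append((delta, a / 10, b / 10, c / 10))
    return grid

grid = parameter_grid()
```

Each candidate would then be used to run the method and compared by reconstruction rate; how the candidates were scored in the original study is not described in the text.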

Competitor methods

In this experiment, NCMHap is compared with a set of state-of-the-art and well-known methods. Key points about these competitors are as follows:

1. H-PoP [26] clusters the DNA reads into k groups such that the reads in each cluster are at minimum distance from one another while being far from the reads of the other clusters. Moreover, it exploits genotype information to improve the reconstructed haplotypes.
2. SCGD [28] is a heuristic method that models SIH as a low-rank matrix factorization problem and presents a modified gradient descent algorithm to solve it.
3. FastHap [25] is an iterative method that models the similarities between the input fragments with a weighted fuzzy conflict graph.
4. FCMHap [20] uses the Fuzzy c-means clustering method to divide the input fragments into two groups with minimum MEC measure.
5. HGHap [22] exploits the hypergraph model to describe the similarities between the input fragments more precisely.
6. AROHap [19] is a nature-inspired method that uses asexual reproduction optimization to cluster the input fragments with the best MEC score.
7. ALTHap [27] is an iterative algorithm that formulates the haplotype assembly problem as a sparse tensor decomposition.
8. HRCH [29] uses a chaotic viewpoint to reconstruct haplotypes. To this end, the obtained haplotypes are mapped to coordinate series by applying chaos game representation. Then, the positions with low confidence are improved using a local projection.

Simulated data

To evaluate the performance of the proposed method, experiments were first carried out on a widely used dataset known as Geraci’s dataset [30]. It was provided by the International HapMap Project, which is based on 22 chromosomes of 269 different individuals.

The individuals were drawn from Japan (JPT), China (HCB), Nigeria (YR), and Utah (CEU). The main parameters are haplotype length (l), coverage (c), and error rate (e), with $l \in \left\{ {100, 350, 700} \right\}$, $c \in \left\{ {3, 5, 8, 10} \right\}$ and $e \in \left\{ {0.1, 0.2, 0.3} \right\}$. For each combination of these parameters there are 100 instances.

Since the proposed method involves two steps, it is useful to evaluate the influence of each step independently. For this purpose, the initial clustering, the NCM algorithm, and NCMHap were executed separately on Geraci’s dataset. The results for haplotypes of length 100, 350, and 700 are listed in Tables 1, 2 and 3, respectively. The first two columns of these tables give the error rate e and the coverage c, respectively. In each table, the NCM column reports the results obtained when NCM starts from a random initial guess for each cluster center.

Table 1 The average reconstruction rate over 100 instances with length 100


Table 2 The average reconstruction rate over 100 instances with length 350


Table 3 The average reconstruction rate over 100 instances with length 700


As the last column of Tables 1, 2 and 3 shows, the synergy of these steps achieves promising results that outperform the other cases.

Figures 1, 2 and 3 compare the RRs obtained by NCMHap and the benchmark algorithms on Geraci’s dataset for haplotypes of length 100, 350, and 700, respectively. Each figure is a heatmap in which the color of each row ranges from green (the minimum RR) to red (the maximum RR). Each heatmap cell is the average over 100 data samples.

The heatmap of Fig. 1 shows that the proposed method provides high-quality results that are comparable with those of the other approaches. In particular, the proposed method outperforms the SCGD, FastHap, FCMHap, and AROHap algorithms for all parameter settings.

As can be seen in Fig. 2, increasing the fragment length improves the quality of the obtained haplotypes. In particular, as the amount of noise increases, NCMHap preserves the quality of the reconstructed haplotypes better than the other approaches and in most cases outperforms the benchmark methods.

Finally, as shown in Fig. 3, for input fragments of length 700, NCMHap achieves better reconstruction rates than all other algorithms except in one setting. The obtained RR values are listed in Additional file 1: Tables S1–S3.

These results demonstrate that the proposed method performs well on long input fragments. Increasing the fragment length and the coverage enables the proposed method to compute the similarity between fragments more precisely. Longer input fragments also help to identify outliers and reduce their effect more accurately.

Neutrosophic c-means clustering is an extension of the Fuzzy c-means method, and NCMHap, like FastHap, uses a weighted fuzzy conflict graph to model the similarity between the input fragments. Its performance is therefore compared against FCMHap and FastHap on long haplotype blocks with a large amount of noise. Figure 4 shows the quality of the results for haplotypes of length $700$ and error rate $e \ge 0.2$.

The results show that the proposed method compares favorably with these methods on input fragments with a high error rate.

Experimental data

For further investigation, we tested our method on a real dataset provided by the 1000 Genomes Project [31]. The data belong to the individual NA12878 [32], which is frequently used to evaluate SIH methods. The trio-phased variant calls from the GATK resource bundle [33] were used as the true haplotypes. The heatmap in Fig. 5 illustrates the reconstruction rate of the proposed method as well as H-PoP [26], SCGD [28], FastHap [25], HGHap [22], AROHap [19], ALTHap [27], and HRCH [29]. The results show that our method achieves the highest or second-highest RR for most of the chromosomes.

The results on both simulated and experimental datasets demonstrate that the proposed method provides promising reconstructed haplotypes from low-quality sequencing data. Moreover, in the worst case, NCMHap solves the problem in less than 3 min, a runtime that is competitive with the existing approaches. The running times of the competitor methods are reported in Additional file 1: Tables S5–S8.

Discussion

Haplotypes could have a profound impact on personalized medicine and can also be used to study human evolutionary history. Haplotype assembly consists of assembling a pair of haplotypes from a large number of an individual's aligned DNA sequence fragments. However, the quality of the reconstructed haplotypes is often poor because of the sparsity of the sequenced fragments and the noise they contain. NCMHap reconstructs the haplotypes based on the Neutrosophic c-means (NCM) clustering algorithm.

By evaluating NCMHap on both simulated and real datasets, we found that the proposed approach can effectively cope with noise in the input fragments and provides promising results compared with current methods.

To increase the convergence speed of NCM and improve the accuracy of the results, a weighted fuzzy conflict graph is constructed as a pre-processing step, where the nodes correspond to the fragments and each edge represents the similarity of the corresponding fragments. Partitioning the graph and clustering the input fragments yields an initial pair of haplotypes, which is fed to the next step.

According to the obtained results, NCMHap provides comparable performance while offering reasonable execution speed. Moreover, as the length of the input fragments increases, it can outperform other methods in terms of reconstruction rate. By using NCM, the proposed method can more accurately identify long noisy input fragments as outliers and decrease their effect on the reconstruction of the haplotypes.

It should be noted that the performance of the proposed method relies on the initialization of the NCM parameters. Consequently, these parameters should be tuned appropriately.

Moreover, although NCMHap already performs well compared with other existing methods, it can only be applied to diploid organisms. Further research is therefore needed to reconstruct haplotypes for polyploid organisms.

Conclusion

In this paper, we presented a method based on the Neutrosophic c-means (NCM) clustering algorithm for the haplotype assembly problem. Time complexity and the handling of datasets with high error rates are the main challenges for existing methods. To improve the convergence speed of NCM, the proposed method consists of two phases. First, the input fragments are divided into two partitions based on their similarities. Second, the resulting bi-partition is used to initialize the centers of the NCM clustering method, which improves the speed of convergence and decreases the number of iterations. On simulated and real datasets, the proposed method provides promising performance in terms of reconstruction rate and running time compared with current methods. Moreover, the results demonstrate that the proposed method is effective at reconstructing haplotypes from data with a high error rate.

As demonstrated in a series of recent publications (see, e.g., [22, 34,35,36,37]), user-friendly and publicly accessible web servers significantly enhance the impact of new prediction methods [26]. In future work, we shall therefore make efforts to provide a web server for the method presented in this paper. The source code of NCMHap is freely available at https://github.com/Fatemeh-Zamani/NCMHap.

Methods

Problem formulation

As can be seen in Fig. 6, $X_{m \times n}$ is a SNP matrix in which each row corresponds to an input fragment of length n. Since in most cases there are two alleles at each SNP site, the major and minor alleles are represented by 0 and 1, respectively, for simplicity. If a SNP value cannot be determined with enough confidence, it is indicated by ‘−’.

Let $f_{i}$ and $f_{j}$ be two arbitrary input fragments. The Hamming distance (HD) measures how much they differ:

$$HD\left( {f_{i}, f_{j} } \right) = \mathop \sum \limits_{k = 1}^{n} D\left( {f_{ik}, f_{jk} } \right)$$

(1)

$$D\left( {a,b} \right) = \begin{cases} 1 & \text{if } a \ne \text{'-'},\ b \ne \text{'-'},\ \text{and } a \ne b \\ 0 & \text{otherwise} \end{cases}$$

(2)

Here, $f_{i}$ and $f_{j}$ are compatible if $HD = 0$; otherwise they are in conflict. In other words, when $HD\left( {f_{i}, f_{j} } \right)$ equals zero, the fragments can be assumed to originate from the same chromosome copy; otherwise, they belong to different chromosome copies or some of their positions are corrupted by noise. To solve the problem, the fragments of the SNP matrix must be divided into two clusters such that the elements of each cluster become compatible with the minimum number of letter flips, i.e. the MEC measure is minimized. The center of each cluster then gives the corresponding haplotype. Figure 6 illustrates haplotype reconstruction in a diploid genome: X is the SNP matrix, which is divided into two parts, and $H = \{ h_{1}, h_{2} \}$ contains the reconstructed haplotype of each cluster.
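Eqs. (1) and (2) and the MEC measure can be illustrated with a short Python sketch. The fragments and haplotypes are made-up toy strings, the function names are hypothetical, and the MEC score is written in its common form (each fragment is charged its distance to the closer of the two haplotypes), which is an assumption about the exact scoring used here.

```python
GAP = "-"

def d(a, b):
    """Eq. (2): 1 if both alleles are called and differ, otherwise 0."""
    return 1 if a != GAP and b != GAP and a != b else 0

def hamming(f, g):
    """Eq. (1): number of conflicting positions between two fragments."""
    if len(f) != len(g):
        raise ValueError("fragments must have the same length")
    return sum(d(a, b) for a, b in zip(f, g))

def mec(fragments, h1, h2):
    """MEC score (assumed form): sum over fragments of the distance
    to the closer haplotype."""
    return sum(min(hamming(f, h1), hamming(f, h2)) for f in fragments)

# Toy SNP matrix rows; the last fragment carries one erroneous allele.
fragments = ["01-0", "0110", "10-1", "1-01", "0100"]
h1, h2 = "0110", "1001"
score = mec(fragments, h1, h2)
```

In this toy example the first four fragments are compatible with one of the two haplotypes, and the last one needs a single letter flip, so one correction is charged in total.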

To evaluate the quality of the obtained haplotypes, the reconstruction rate (RR) [38] and the MEC score are two useful measures. Let $\hat{H}$ and $H$ denote the reconstructed and the original haplotypes, respectively. The RR describes the similarity between $\hat{H}$ and $H$ and is computed as follows.

$$RR\left( {\hat{H}, H} \right) = 1 - \frac{{\min \left( {HD\left( {\hat{h}_{1} ,h_{1} } \right) + HD\left( {\hat{h}_{2} ,h_{2} } \right),\; HD\left( {\hat{h}_{1} ,h_{2} } \right) + HD\left( {\hat{h}_{2} ,h_{1} } \right)} \right)}}{2n}$$

(3)
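A minimal Python sketch of Eq. (3) follows. It assumes that both haplotype pairs are fully called binary strings of equal length (no gaps), so a plain position-wise comparison is enough; the function names and the toy haplotypes are hypothetical.

```python
def mismatches(a, b):
    """Position-wise mismatch count between two gap-free haplotypes."""
    return sum(x != y for x, y in zip(a, b))

def reconstruction_rate(h_hat, h_true):
    """Eq. (3): best of the two possible pairings of reconstructed
    and original haplotypes, normalized by 2n."""
    (r1, r2), (t1, t2) = h_hat, h_true
    n = len(t1)
    straight = mismatches(r1, t1) + mismatches(r2, t2)
    swapped = mismatches(r1, t2) + mismatches(r2, t1)
    return 1 - min(straight, swapped) / (2 * n)

# Toy example: the reconstructed pair is listed in swapped order and
# contains a single wrong allele.
rr = reconstruction_rate(("10010", "01111"), ("01101", "10010"))
```

Taking the minimum over both pairings makes the measure independent of the order in which the two reconstructed haplotypes are reported. In the toy example, one wrong allele out of ten positions gives an RR of 0.9.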

Neutrosophic c-means (NCM) algorithm

As stated previously, fragment clustering is an important phase of haplotype assembly, and the large amount of noise and gaps in the input fragments makes this phase challenging. To perform it efficiently, we use the Neutrosophic c-means (NCM) clustering algorithm. For each data point, the algorithm simultaneously computes its degrees of membership in the determinate and indeterminate clusters [23, 40]. Outliers and noisy data are assigned to the indeterminate clusters, so the NCM algorithm can detect them and, through its membership functions, decrease their undesirable effect on the clustering process. To this end, the NCM algorithm iteratively minimizes the objective function given in Eq. (4), whereby the cluster centers are determined with the least error and the clustering accuracy is improved.

$$J\left( {T,I,F,C} \right) = \mathop \sum \limits_{i = 1}^{N} \mathop \sum \limits_{j = 1}^{C} \left( {w_{1} T_{ij} } \right)^{m} \left\| {x_{i} - c_{j} } \right\|^{2} + \mathop \sum \limits_{i = 1}^{N} \left( {w_{2} I_{i} } \right)^{m} \left\| {x_{i} - \overline{c}_{{i{ }max}} } \right\|^{2} + \mathop \sum \limits_{i = 1}^{N} \delta^{2} \left( {w_{3} F_{i} } \right)^{m}$$

(4)

$$\overline{c}_{{i{ }max}} = \frac{{c_{{p_{i} }} + c_{{q_{i} }} }}{2}$$

(5)

$$p_{i} = \mathop {\text{arg max}}\limits_{j = 1,2, \ldots ,C} \left( {T_{ij} } \right)$$

(6)

$$q_{i} = \mathop {\text{arg max}}\limits_{{j \ne p_{i} ,\; j = 1,2, \ldots ,C}} \left( {T_{ij} } \right)$$

(7)

In the above relations, $T_{ij}$ is the degree of membership of data point $x_{i}$ in determinate cluster j, $I_{i}$ is its degree of membership in the boundary (indeterminate) cluster, and $F_{i}$ is its degree of membership in the outlier set. N is the number of data points, C is the number of clusters, $w_{1}$, $w_{2}$, and $w_{3}$ are weighting factors, m is the fuzzification constant, $x_{i}$ is a data point, and $\delta$ is a parameter that controls the number of objects considered as outliers. $\overline{c}_{i\,max}$ is calculated for each data point according to Eq. (5). It is used to determine the value of $I_{i}$ precisely, because the degree of indeterminacy of a data point depends on the two determinate clusters in which it has the largest memberships, whose indices $p_{i}$ and $q_{i}$ are given by Eqs. (6) and (7). The cluster centers $c_{j}$ and the memberships $T_{ij}$, $I_{i}$, and $F_{i}$ are updated at each iteration step k by Eqs. (8–11), respectively.

$$c_{j} = \frac{{\mathop \sum \nolimits_{i = 1}^{N} \left( {w_{1} T_{ij} } \right)^{m} x_{i} }}{{\mathop \sum \nolimits_{i = 1}^{N} \left( {w_{1} T_{ij} } \right)^{m} }}$$

(8)

$$T_{ij} = \frac{K}{{w_{1} }}\left\| {x_{i} - c_{j} } \right\|^{{ - \frac{2}{m - 1}}}$$

(9)

$$I_{i} = \frac{K}{{w_{2} }}\left\| {x_{i} - \overline{c}_{{i{ }max}} } \right\|^{{ - \frac{2}{m - 1}}}$$

(10)

$$F_{i} = \frac{K}{{w_{3} }}\delta^{{ - \frac{2}{m - 1}}}$$

(11)
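The update rules in Eqs. (8–11) can be illustrated with a short Python sketch. It is not the authors' MATLAB implementation. Two points are assumptions. First, the constant K, which the text does not define, is taken to be the normalizer that makes $\sum_{j} T_{ij} + I_{i} + F_{i} = 1$ for each data point, as in the NCM formulation of Guo and Sengur [23]. Second, the indices $p_{i}$ and $q_{i}$ are chosen as the two nearest centers, which gives the same ranking as the largest $T_{ij}$ because $T_{ij}$ decreases with distance when m > 1. The function name `ncm_step` and the toy data are hypothetical, and the data points are assumed to be numeric vectors.

```python
import math

def ncm_step(X, centers, w=(0.7, 0.2, 0.1), delta=25.0, m=2.0, tiny=1e-12):
    """One NCM update: memberships T, I, F (Eqs. 9-11), then centers (Eq. 8)."""
    w1, w2, w3 = w
    expo = -2.0 / (m - 1.0)
    T, I, F = [], [], []
    for x in X:
        # Distances to every center, floored to avoid division by zero.
        dist = [max(math.dist(x, c), tiny) for c in centers]
        # ASSUMPTION: the two nearest centers play the role of p_i and q_i.
        order = sorted(range(len(centers)), key=lambda j: dist[j])
        p, q = order[0], order[1]
        c_bar = [(a + b) / 2.0 for a, b in zip(centers[p], centers[q])]  # Eq. (5)
        d_bar = max(math.dist(x, c_bar), tiny)
        t = [dj ** expo / w1 for dj in dist]
        i_val = d_bar ** expo / w2
        f_val = delta ** expo / w3
        # ASSUMPTION: K normalizes the memberships of x to sum to one.
        K = 1.0 / (sum(t) + i_val + f_val)
        T.append([K * v for v in t])
        I.append(K * i_val)
        F.append(K * f_val)
    dim = len(X[0])
    new_centers = []
    for j in range(len(centers)):
        wts = [(w1 * T[i][j]) ** m for i in range(len(X))]
        total = sum(wts)
        new_centers.append(
            [sum(wt * x[k] for wt, x in zip(wts, X)) / total for k in range(dim)]
        )
    return T, I, F, new_centers

# Toy data: two tight groups plus one point halfway between them.
X = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0], [5.0, 5.5]]
centers = [[0.0, 0.5], [10.0, 10.5]]
T, I, F, new_centers = ncm_step(X, centers)
```

In the toy data the last point lies midway between the two centers, so almost all of its membership goes to the indeterminate term $I_{i}$ and it contributes very little to the updated centers in Eq. (8).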

NCMHap method

As can be seen in Fig. 7, the proposed method involves two main steps. First, to obtain an initial clustering of the input fragments, a weighted graph called the fuzzy conflict graph is constructed from the SNP matrix. In this graph, fragments are the vertices, and the weight of each edge is the normalized Hamming distance (NHD) between the corresponding fragments, given by:

$$NHD\left( {f_{i} ,f_{j} } \right) = \frac{1}{{S_{ij} }}\mathop \sum \limits_{k = 1}^{n} D\left( {f_{ik} ,f_{jk} } \right)$$

(12)

In the above relation, $f_{i}$ and $f_{j}$ are two fragments of X, and $S_{ij}$ denotes the number of columns (SNPs) of X that are covered by either $f_{i}$ or $f_{j}$. $S_{ij}$ is a normalization factor that scales the distance between the two fragments to the range from 0 to 1, and n is the number of SNPs.
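Eq. (12) can be sketched in Python as follows. The function name `nhd` and the example fragments are hypothetical. Returning 0.0 for a pair in which no column is covered by either fragment is an arbitrary convention chosen for this sketch, since the text does not cover that case.

```python
GAP = "-"

def nhd(f, g):
    """Eq. (12): conflicts divided by S_ij, the number of columns
    covered by either fragment."""
    conflicts = sum(
        1 for a, b in zip(f, g) if a != GAP and b != GAP and a != b
    )
    covered = sum(1 for a, b in zip(f, g) if a != GAP or b != GAP)
    # ASSUMPTION: a pair with no covered column gets distance 0.0.
    return conflicts / covered if covered else 0.0

# Five columns are covered by at least one fragment; the fragments
# overlap in two columns and disagree in one of them.
weight = nhd("01-0-", "0-110")
```

With these two fragments the single conflict is divided by five covered columns, giving an edge weight of 0.2.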

After the graph is constructed, edges with a weight of 0.5 are removed because they do not provide sufficient information about the clustering of the fragments they connect.

Next, the edge with the highest weight in the resulting graph is found, and the nodes (fragments) it connects are assigned to different clusters (i.e. C₁ and C₂). Then, iteratively, for each cluster (C_i, i = 1,2), the node with the highest distance from that cluster is found and assigned to the opposite cluster. This step is repeated until all nodes have been assigned to a cluster.
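The initial bi-partitioning can be sketched in Python under several assumptions, since the text leaves some details open. Here the distance of a node from a cluster is taken to be its mean edge weight to the current members of that cluster, the removal of edges with weight 0.5 is omitted, and a guard (not described in the text) places a node in the opposite cluster only when it really is farther from the cluster under consideration. The function name `initial_bipartition` and the toy weight matrix are hypothetical.

```python
def initial_bipartition(W):
    """Greedy two-way split of fragments from a symmetric weight matrix W."""
    n = len(W)
    # Seed: the endpoints of the heaviest edge go to different clusters.
    pairs = ((a, b) for a in range(n) for b in range(a + 1, n))
    i, j = max(pairs, key=lambda e: W[e[0]][e[1]])
    clusters = [[i], [j]]
    unassigned = set(range(n)) - {i, j}

    def mean_dist(u, members):
        # ASSUMPTION: node-to-cluster distance is the mean edge weight.
        return sum(W[u][v] for v in members) / len(members)

    while unassigned:
        for c in (0, 1):
            if not unassigned:
                break
            # Node farthest from cluster c among the unassigned ones.
            u = max(sorted(unassigned), key=lambda v: mean_dist(v, clusters[c]))
            # ASSUMPTION (guard): use the opposite cluster only if u is
            # at least as far from cluster c as from the opposite one.
            far = mean_dist(u, clusters[c]) >= mean_dist(u, clusters[1 - c])
            clusters[1 - c if far else c].append(u)
            unassigned.remove(u)
    return clusters

# Toy graph: fragments 0-2 come from one copy, fragments 3-4 from the other.
group = [0, 0, 0, 1, 1]
W = [[0.0 if group[a] == group[b] else 1.0 for b in range(5)] for a in range(5)]
clusters = initial_bipartition(W)
```

Without the guard, the last remaining node would be sent to the opposite cluster even when it is closest to the cluster being examined, which is why the sketch adds that check.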

In the second phase, the initial clustering is passed to the NCM algorithm, and the center of each cluster is used as an initial center. This initialization improves the convergence speed of NCM. The algorithm weighs the impact of each fragment on the clustering using the three membership functions introduced above; it can thereby reduce the impact of noise and outliers on the clustering process and consequently increase the clustering accuracy. Clustering proceeds by iteratively minimizing the objective function: in each iteration, the cluster centers and the membership degrees of the determinate and indeterminate clusters are updated by Eqs. (8–11). The iteration continues until the difference between the cluster centers at two successive iterations falls below $\varepsilon$. Finally, the centers of the obtained clusters form the set of reconstructed haplotypes.
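The outer loop of the second phase can be sketched as follows, assuming that the iteration stops once the centers move by less than $\varepsilon$ between two successive iterations. Measuring that movement as the largest Euclidean shift of any center, and turning a real-valued center into a binary haplotype by thresholding at 0.5, are assumptions of this sketch. The names `iterate_until_stable` and `to_haplotype` are hypothetical, and the `update` argument stands in for one NCM update of the centers; a simple stand-in update is used here so that the loop can be run on its own.

```python
import math

def iterate_until_stable(update, centers, eps=1e-5, max_iter=1000):
    """Apply `update` until no center moves by eps or more."""
    iters = 0
    for iters in range(1, max_iter + 1):
        new_centers = update(centers)
        # ASSUMPTION: change is measured as the largest center shift.
        shift = max(math.dist(a, b) for a, b in zip(new_centers, centers))
        centers = new_centers
        if shift < eps:
            break
    return centers, iters

def to_haplotype(center, threshold=0.5):
    """ASSUMPTION: round each coordinate of a center to an allele."""
    return "".join("1" if v >= threshold else "0" for v in center)

# Stand-in update: every center moves halfway toward a fixed target.
targets = [[0.9, 0.1, 0.8], [0.1, 0.9, 0.2]]

def toy_update(cs):
    return [[(v + t) / 2.0 for v, t in zip(c, tgt)] for c, tgt in zip(cs, targets)]

final_centers, iters = iterate_until_stable(toy_update, [[0.5] * 3, [0.5] * 3])
haplotypes = [to_haplotype(c) for c in final_centers]
```

Starting from centers supplied by an initial clustering rather than from a random guess would shorten this loop, which is the motivation the text gives for the first phase.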

Availability of data and materials

The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request. Moreover, the source code is available in: https://github.com/Fatemeh-Zamani/NCMHap.

Abbreviations

SIH: Single individual haplotype
NCM: Neutrosophic c-means
NGS: Next generation sequencing
SNP: Single nucleotide polymorphism
MSR: Minimum SNP removal
MFR: Minimum fragment removal
MEC: Minimum error correction
ARO: Asexual reproduction optimization
FCM: Fuzzy c-means
RR: Reconstruction rate
NHD: Normalized Hamming distance

References

Jorde LB, Wooding SP. Genetic variation, classification and “race.” Nat Genet. 2004;36(11s):S28.
Article CAS Google Scholar
Schneider JA, Pungliya MS, Choi JY, Jiang R, Sun XJ, Salisbury BA, Stephens JC. DNA variability of human genes. Mech Ageing Dev. 2003;124(1):17–25.
Article CAS Google Scholar
Snyder MW, Adey A, Kitzman JO, Shendure J. Haplotype-resolved genome sequencing: experimental methods and applications. Nat Rev Genet. 2015;16(6):344–58.
Article CAS Google Scholar
Hoehe MR, Köpke K, Wendel B, Rohde K, Flachmeier C, Kidd KK, Berrettini WH, Church GM. Sequence variability and candidate gene analysis in complex disease: association of µ opioid receptor gene variation with substance dependence. Hum Mol Genet. 2000;9(19):2895–908.
Article CAS Google Scholar
Terwilliger JD, Weiss KM. Linkage disequilibrium mapping of complex disease: fantasy or reality? Curr Opin Biotechnol. 1998;9(6):578–94.
Article CAS Google Scholar
Tewhey R, Bansal V, Torkamani A, Topol EJ, Schork NJ. The importance of phase information for human genomics. Nat Rev Genet. 2011;12(3):215.
Article CAS Google Scholar
Green RE, Krause J, Briggs AW, Maricic T, Stenzel U, Kircher M, Patterson N, Li H, Zhai W, Fritz MH-Y. A draft sequence of the Neandertal genome. Science. 2010;328(5979):710–22.
Article CAS Google Scholar
Shastry BS. SNPs and haplotypes: genetic markers for disease and drug response. Int J Mol Med. 2003;11(3):379–82.
CAS PubMed Google Scholar
Adey A, Burton JN, Kitzman JO, Hiatt JB, Lewis AP, Martin BK, Qiu R, Lee C, Shendure J. The haplotype-resolved genome and epigenome of the aneuploid HeLa cancer cell line. Nature. 2013;500(7461):207.
Article CAS Google Scholar
Douglas JA, Boehnke M, Gillanders E, Trent JM, Gruber SB. Experimentally-derived haplotypes substantially increase the efficiency of linkage disequilibrium studies. Nat Genet. 2001;28(4):361.
Article CAS Google Scholar
Liu N, Zhang K, Zhao H. Haplotype-association analysis. Adv Genet. 2008;60:335–405.
Article Google Scholar
Ruano G, Kidd KK. Direct haplotyping of chromosomal segments from multiple heterozygotes via allele-specific PCR amplification. Nucleic Acids Res. 1989;17(20):8392.
Article CAS Google Scholar
Ruano G, Kidd KK, Stephens JC. Haplotype of multiple polymorphisms resolved by enzymatic amplification of single DNA molecules. Proc Natl Acad Sci. 1990;87(16):6296–300.
Article CAS Google Scholar
Lancia G, Bafna V, Istrail S, Lippert R, Schwartz R. SNPs problems, complexity, and algorithms. In: European symposium on algorithms. Springer; 2001. p. 182–193.
Lippert R, Schwartz R, Lancia G, Istrail S. Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem. Brief Bioinform. 2002;3(1):23–31.
Article CAS Google Scholar
Bansal V, Bafna V. HapCUT: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics. 2008;24(16):i153–9.
Article Google Scholar
Qian W, Yang Y, Yang N, Li C. Particle swarm optimization for SNP haplotype reconstruction problem. Appl Math Comput. 2008;196(1):266–72.
Google Scholar
Wang T-C, Taheri J, Zomaya AY. Using genetic algorithm in reconstructing single individual haplotype with minimum error correction. J Biomed Inform. 2012;45(5):922–30.
Article Google Scholar
Olyaee M-H, Khanteymoori A. AROHap: an effective algorithm for single individual haplotype reconstruction based on asexual reproduction optimization. Comput Biol Chem. 2018;72:1–10.
Article CAS Google Scholar
Olyaee MH, Khanteymoori A. Fuzzy c-means clustering for SNP haplotype reconstruction problem.
Bansal V, Halpern AL, Axelrod N, Bafna V. An MCMC algorithm for haplotype assembly from whole-genome sequence data. Genome Res. 2008;18(8):1336–46.
Article CAS Google Scholar
Chen X, Peng Q, Han L, Zhong T, Xu T. An effective haplotype assembly algorithm based on hypergraph partitioning. J Theor Biol. 2014;358:85–92.
Article Google Scholar
Guo Y, Sengur A. NCM: Neutrosophic c-means clustering algorithm. Pattern Recognit. 2015;48(8):2710–24.
Article Google Scholar
Berger E, Yorukoglu D, Peng J, Berger B. Haptree: a novel Bayesian framework for single individual polyplotyping using NGS data. PLoS Comput Biol. 2014;10(3):e1003502.
Article Google Scholar
Mazrouee S, Wang W. FastHap: fast and accurate single individual haplotype reconstruction using fuzzy conflict graphs. Bioinformatics. 2014;30(17):i371–8.
Article CAS Google Scholar
Xie M, Wu Q, Wang J, Jiang T. H-PoP and H-PoPG: Heuristic partitioning algorithms for single individual haplotyping of polyploids. Bioinformatics. 2016;32(24):3735–44.
Hashemi A, Zhu B, Vikalo H. Sparse tensor decomposition for haplotype assembly of diploids and polyploids. BMC Genom. 2018;19(4):191.
Cai C, Sanghavi S, Vikalo H. Structured low-rank matrix factorization for haplotype assembly. IEEE J Sel Top Signal Process. 2016;10(4):647–57.
Olyaee MH, Khanteymoori AR, Khalifeh K. A chaotic viewpoint-based approach to solve haplotype assembly using hypergraph model. bioRxiv 10.1101/2020.09.29.318907.
Geraci F. A comparison of several algorithms for the single individual SNP haplotyping reconstruction problem. Bioinformatics. 2010;26(18):2217–25.
1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061.
Gibbs R, Belmont J, Hardenbol P, Willis T, Yu F, Yang H, Ch’ang L, Huang W, Liu B, Shen Y. The international HapMap project. Nature. 2003;426(6968):789–96.
DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, Del Angel G, Rivas MA, Hanna M. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43(5):491.
Liu Z, Xiao X, Qiu W-R, Chou K-C. iDNA-Methyl: Identifying DNA methylation sites via pseudo trinucleotide composition. Anal Biochem. 2015;474:69–77.
Jia J, Liu Z, Xiao X, Liu B, Chou K-C. iPPI-Esml: an ensemble classifier for identifying the interactions of proteins by incorporating their physicochemical properties and wavelet transforms into PseAAC. J Theor Biol. 2015;377:47–56.
Ding H, Deng E-Z, Yuan L-F, Liu L, Lin H, Chen W, Chou K-C. iCTX-type: a sequence-based predictor for identifying the types of conotoxins in targeting ion channels. BioMed Res Int. 2014;2014:286419. https://0-doi-org.brum.beds.ac.uk/10.1155/2014/286419.
Chen W, Feng P-M, Deng E-Z, Lin H, Chou K-C. iTIS-PseTNC: a sequence-based predictor for identifying translation initiation site in human genes using pseudo trinucleotide composition. Anal Biochem. 2014;462:76–83.
Wang R-S, Wu L-Y, Li Z-P, Zhang X-S. Haplotype reconstruction from SNP fragments by minimum error correction. Bioinformatics. 2005;21(10):2456–62.
Rhee J-K, Li H, Joung J-G, Hwang K-B, Zhang B-T, Shin S-Y. Survey of computational haplotype determination methods for single individual. Genes Genom. 2016;38(1):1–12.
Akbulut Y, Şengür A, Guo Y, Polat K. KNCM: Kernel neutrosophic c-means clustering. Appl Soft Comput. 2017;52:714–24.

Acknowledgements

We gratefully acknowledge Dr. Khosrow Khalifeh for his valuable suggestions.

Funding

No funding.

Author information

Authors and Affiliations

Department of Computer Engineering, University of Zanjan, Zanjan, Iran
Fatemeh Zamani & Alireza Khanteymoori
Department of Computer Engineering, Faculty of Engineering, University of Gonabad, Gonabad, Iran
Mohammad Hossein Olyaee

Authors

Fatemeh Zamani
Mohammad Hossein Olyaee
Alireza Khanteymoori

Contributions

A.R.K., M.H.O. and F.Z. designed the research; F.Z. and M.H.O. collected the data; F.Z. and M.H.O. wrote and ran the computer programs; A.R.K., M.H.O. and F.Z. analyzed and interpreted the results; F.Z. and M.H.O. wrote the first version of the manuscript; A.R.K. and M.H.O. revised and edited the manuscript. All authors have read and approved the manuscript.

Corresponding author

Correspondence to Alireza Khanteymoori.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests in relation to this study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1:

Table S1. Performance comparison of NCMHap and other methods on Geraci's dataset with haplotype block length l = 100. Each entry is the average over 100 data samples.
Table S2. Performance comparison of NCMHap and other methods on Geraci's dataset with haplotype block length l = 350. Each entry is the average over 100 data samples.
Table S3. Performance comparison of NCMHap and other methods on Geraci's dataset with haplotype block length l = 700. Each entry is the average over 100 data samples.
Table S4. Reconstruction rate of the proposed method, H-PoP, SCGD, FastHap, HGHap, AROHap, FCMHap, ALTHap, and HRCH on the experimental NA12878 dataset provided by the 1000 Genomes Project.
Table S5. Average running time (in seconds) of NCMHap and other methods on Geraci's dataset with haplotype block length l = 100.
Table S6. Average running time (in seconds) of NCMHap and other methods on Geraci's dataset with haplotype block length l = 350.
Table S7. Average running time (in seconds) of NCMHap and other methods on Geraci's dataset with haplotype block length l = 700.
Table S8. Average running time (in seconds) of the proposed method, H-PoP, SCGD, FastHap, HGHap, AROHap, FCMHap, ALTHap, and HRCH on the experimental NA12878 dataset provided by the 1000 Genomes Project.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Zamani, F., Olyaee, M.H. & Khanteymoori, A. NCMHap: a novel method for haplotype reconstruction based on Neutrosophic c-means clustering. BMC Bioinformatics 21, 475 (2020). https://0-doi-org.brum.beds.ac.uk/10.1186/s12859-020-03775-0

Received: 27 April 2020
Accepted: 22 September 2020
Published: 22 October 2020
DOI: https://0-doi-org.brum.beds.ac.uk/10.1186/s12859-020-03775-0
