 Software
 Open Access
 Published:
BLAMM: BLASbased algorithm for finding position weight matrix occurrences in DNA sequences on CPUs and GPUs
BMC Bioinformatics volume 21, Article number: 81 (2020)
Abstract
Background
The identification of all matches of a large set of position weight matrices (PWMs) in long DNA sequences requires significant computational resources for which a number of efficient yet complex algorithms have been proposed.
Results
We propose BLAMM, a simple and efficient tool inspired by high performance computing techniques. The workload is expressed in terms of matrixmatrix products that are evaluated with high efficiency using optimized BLAS library implementations. The algorithm is easy to parallelize and implement on CPUs and GPUs and has a runtime that is independent of the selected pvalue. In terms of singlecore performance, it is competitive with stateoftheart software for PWM matching while being much more efficient when using multithreading. Additionally, BLAMM requires negligible memory. For example, both strands of the entire human genome can be scanned for 1404 PWMs in the JASPAR database in 13 min with a pvalue of 10^{−4} using a 36core machine. On a dual GPU system, the same task can be performed in under 5 min.
Conclusions
BLAMM is an efficient tool for identifying PWM matches in large DNA sequences. Its C++ source code is available under the GNU General Public License Version 3 at https://github.com/biointec/blamm.
Background
Position weight matrices (PWM), also referred to as a positionspecific scoring matrices (PSSM), are used to model short, biologically relevant sequence patterns such as transcription factor binding sites [1]. PWMs offer more flexibility than consensus patterns as they can allow variation at each position in the pattern. Databases such as JASPAR [2], UniPROBE [3] and TRANSFAC [4] host large collections of curated PWMs.
Position weight matrices are generated from an alignment of functionally related sequences. In this work we focus on DNA sequences. Assuming independence between positions, these alignments can be summarized in a 4×m position frequency matrix (PFM), with m the length of the alignment, where each matrix element PFM (i,j) represents the frequency of character i (0=‘A’; 1=‘C’; 2=‘G’; 3=‘T’) at position j in the alignment. By computing, at each position, the relative occurrence of each nucleotide, the position probability matrix (PPM) is derived:
where α≥0 is a pseudocount that acts as a smoothing parameter to avoid zero probabilities [5]. Finally, when computing the logarithm (base2) of the ratio of the PPM elements and the corresponding background nucleotide probability b_{i}, the PWM is obtained:
Each element PWM(i,j) thus represents the loglikelihood ratio of observing character i at position j in a functional site compared with a random sequence. Figure 1 shows an example of a PFM, PPM and PWM and its corresponding sequence logo visualization [6].
Given a sequence of length m, the PWM score is obtained by summing over the PWM loglikelihood ratios at m positions, each time selecting the appropriate PWM element, i.e., the one that corresponds to the nucleotide in the sequence. Higher PWM scores express a better correspondence to the PWM model and thus a higher likelihood that the sequence represents a functional site.
In this work, we focus on the PWM matching problem: given an input sequence of length n (with typically n≫m), identify all matches or occurrences of a PWM, i.e. all subsequences for which the PWM score exceeds a userdefined threshold. These matches are then putative functional sites. The problem can be generalized to the case where the matches of multiple PWMs need to be identified, referred to as the multiple PWM matching problem. For long input sequences and/or a large number of PWMs, PWM matching is a computeintensive problem that may require a large runtime. We briefly review the most important algorithmic approaches and refer to [7] for a more detailed overview.
A simple bruteforce algorithm evaluates the PWM score at each possible starting position of the input sequence(s) and has a time complexity of O(nm). A simple improvement, called lookahead scoring, is to stop the computation of the PWM score as soon as it has been determined that the score threshold can no longer possibly be reached [8]. In permuted lookahead scoring, the PWM score is evaluated in such order that potential early termination is established as soon as possible. More advanced techniques for PWM matching usually involve generalizations of algorithms developed for exact pattern matching. They can be categorized as either online algorithms, that rely on the preprocessing of the PWM search matrices, or algorithms based on index structures, that rely on the preprocessing of the input sequence(s). In [9], PWM matches are identified using a depthfirst traversal up to depth m of a suffix tree representation of the input sequence(s). This means that PWM score computations of repeated subsequences only need to be performed once and that computations can be partly reused between subsequences that share the same prefix. The use of lookahead scoring allows for the identification of subtrees that contain no PWM matches and that can be discarded from the search procedure. Similarly, the more spaceefficient enhanced suffix array (ESA) [10] index structure is used in PoSSuMsearch [11, 12]. As index structures require O(n) memory, which can be costly in practice, much research has been devoted to online algorithms as well. In [13], index tables are constructed that contain precomputed (partial) PWM scores for all possible words or a short, fixed length. Additionally, potential similarity between multiple PWMs is exploited to accelerate the multiple PWM matching problem. In [14], the shiftadd algorithm has been adopted for PWM matching. Similarly, in [15], the MorrisPratt and KnuthMorrisPratt algorithms [16] are generalized to PWM matching. Finally, in [17], the AhoCorasick, filtration and superalphabet techniques developed for exact string matching are generalized to PWM matching, and further extended to the multiple PWM matching problem [18] and to higherorder PWM matching [19]. These algorithms are collectively implemented in the MOODS software package [20]. The use of graphics processing unit (GPU) architectures is investigated in [21]. The bruteforce algorithm is parallelized over the different starting positions of the input sequence, as well as over the different PWMs, thus realizing important performance gains. Additionally, similarly to [13], index tables with precomputed partial PWM scores are used for further accelerate the search process.
These more complex PWM matching methods accelerate the (multiple) PWM matching problem by using various algorithmic techniques to avoid redundant computations and/or eliminate parts of the search space that are guaranteed not to contain PWM matches. The degree to which they can be successful largely depends on the PWM score threshold that is selected. When a high threshold is selected, the PWM matching problem resembles exact pattern matching for which efficient algorithms with O(n+m) time complexity exist. In contrast, for relaxed PWM score thresholds, the fraction of the search space that can be eliminated is small, and the time complexity eventually approaches the O(nm) complexity of the bruteforce algorithm. As m takes a value between 5 and 15 for most practical applications, complex PWM matching algorithms may attain a speedup of approximately one order of magnitude over implementations of the bruteforce algorithm.
In this work, we propose an alternative methodology to accelerate the PWM matching problem that is inspired by High Performance Computing techniques [22]. Rather than reducing the search space, we adopt the bruteforce algorithm and reduce its runtime by expressing the PWM matching problem in terms of matrixmatrix products (MMP). MMPs can be evaluated with very high efficiency on modern CPUs for two reasons. First, they leverage SIMD (Single Instruction, Multiple Data) instructions that allow the same operation (multiplication, addition) to be performed on multiple data elements simultaneously. Second, MMPs can be implemented such that they maximally exploit spatial and temporal locality of reference, thus ensuring that most data accesses are satisfied from the CPU’s cache memory. As such, MMPs are among a select class of algorithms that can be evaluated with a performance that approaches the theoretical peak performance of the CPU. The latter is typically two orders of magnitude higher than what is effectively obtained by scalar (i.e., nonSIMD), compilergenerated code. High performance MMPs are provided through Basic Linear Algebra Subroutines (BLAS) [23] library implementations, for which most CPU vendors offer highly optimized implementations. Alternatively, opensource implementations such as ATLAS [24] or GotoBLAS [25] can be considered. The proposed method, named BLAMM (BLASAccelerated Motif Matching) inherits the advantages of the bruteforce algorithm: it requires very little RAM, has a runtime that is independent of the selected PWM threshold(s) and is easy to implement and parallelize. We evaluated BLAMM on three recent generations of Intel CPU architectures and compared its performance with a naive implementation of the bruteforce algorithm, MOODS (stateoftheart online algorithm) and PoSSuMsearch (stateoftheart index structure based algorithm) for different data sets and PWM threshold settings. Finally, we demonstrate BLAMM can easily leverage the computing power of graphics processing unit (GPU) architectures. To this end, we use the cuBLAS library [26] to efficiently perform the MMPs on the GPU rather than on the GPU. The performance of the GPU version of BLAMM is compared with TFMCUDA, a stateoftheart GPU implementation.
This paper is an extended version of the proceedings paper [27]. When comparing current benchmark results with the ones initially presented in [27], it should be understood that the performance of BLAMM in between both publications has improved significantly through various algorithmic and implementation improvements.
Implementation
We consider the multiple PWM matching problem over a DNA alphabet. Starting from the bruteforce algorithm, we express the evaluation of the PWM score at each possible starting position of the input sequence(s) in terms of matrixmatrix products (MMP). The procedure involves three matrices:
A pattern matrix P that contains all of the PWMs.
A sequence matrix S that encodes (part of) the DNA input sequence(s) using only elements 0 and 1.
A result matrix R that is computed as R=sub(S)∗P and that contains the PWM scores of all PWMs at some positions in the input sequence(s). The routine sub(.) denotes that a submatrix of S is used.
We now describe each matrix in detail. Figure 2 provides an overview of the algorithm.
Pattern matrix P
The pattern matrix P is built once and remains constant during the execution of the algorithm. It has dimensions 4m×c where c denotes the total number of PWMs and m=max_{j}(m^{j}) refers to the maximum PWM length with m^{j} the length of PWM ^{j}. Every column of P corresponds to a single PWM. The values in a column of P are obtained by unrolling the values of the corresponding PWM. For PWMs shorter than m characters, trailing zeros are appended to the corresponding column in P. Formally:
for all 0≤j<c.
In case PWM occurrences on both strands of the input sequence(s) need to be identified, c additional columns are added to matrix P that represent the reversecomplement of each PWM.
Sequence matrix S
The sequence matrix S has dimensions h×4(w+m−1) where h and w can be arbitrarily chosen ≥1 and where m again represents the maximum PWM length. The matrix S is used to encode (part of) the input sequence(s) S^{DNA} of exactly hw+m−1 nucleotides. First, the string S^{DNA} is converted into an array S^{enc} of 4(hw+m−1) zeros and ones by simply replacing character A by 1000; C by 0100; G by 0010; and T by 0001. Formally:
for all 0≤i<4(hw+m−1). The matrix S is constructed from this temporary array as follows:
for all 0≤i<h and 0≤j<4(w+m−1).
Every row in S contains a contiguous subarray of S^{enc} and thus encodes a substring of S^{DNA}. The rightmost 4(m−1) elements of row i are identical to the leftmost 4(m−1) elements of row i+1. In other words, subsequent rows of S encode overlapping substrings of S^{DNA} with an overlap of m−1 characters.
Result matrix R
The result matrix R has dimensions h×c and is computed as the matrixmatrix product of a submatrix of S and matrix P. Given an offset o with 0≤o<w, R^{o} is computed as follows:
where the notation S(:, [4o,4(o+m)[) refers to the h×4m submatrix of S where the first column of the submatrix corresponds to the column with index 4o in S. Every element in R^{o} is thus computed as the dot product of (part of) a row in S and a column in P. The elements of S (zeros and ones) are multiplied with the elements of the PWM and thus generate the terms that, when added, correspond to the PWM score. As such, element R^{o}(i,j) contains the score for PWM ^{j} at position iw+o in S^{DNA}.
Algorithm 1 provides pseudocode for the entire workflow. In the outer forloop, a portion of the input sequence(s) of length hw+m−1 is read into S^{DNA}. In the inner forloop, the PWM scores are exhaustively computed for all c PWMs at the hw first positions of S^{DNA}. Therefore, the S^{DNA} strings at consecutive outer forloop iterations overlap by m−1 nucleotides.
The time complexity of Algorithm 1 is given by:
where n_{occ} denotes the total number of motif occurrences identified. This further simplifies to O(nmc) which is the same time complexity as the bruteforce algorithm. Note that the time complexity is indeed independent of the choice of parameters h and w.
The space complexity is given by \(O(mc + h(w+m) + hc + n^{\text {buf}}_{occ})\), i.e. the space to store the three matrices and where \(n^{\text {buf}}_{occ}\) denotes the maximum number of occurrences that are buffered in memory before they are spilled to disk (userdefined).
Note that in case the input data consists of multiple DNA sequences, these sequences can be concatenated when generating S^{DNA}. With minimal extra bookkeeping, one can prevent the reporting of occurrences that span adjacent DNA sequences.
Implementation details and performance considerations
The algorithm was implemented in C++ with support for multithreading using C++11 threads. The workload is easily split in independent subtasks by parallelizing the outer forloop in Algorithm 1. Under the EREWPRAM (Exclusive Read, Exclusive Write, Parallel RandomAccess Machine) model, the time complexity using p parallel processes is given by:
where the O(n) term refers to data input and the O(n_{occ}) term refers to data output. Concurrent disk access is indeed prohibited through the use of a mutualexclusion (mutex) synchronization primitive. The ≤ sign in (10) reflects the fact that disk access and computations can overlap. As all datastructures involved are duplicated per thread, the memory complexity is given by \(pO(mc + h(w+m) + hc + n^{\text {buf}}_{occ})\). This seems reasonable as the matrices require little memory in practice. The threadlocal duplication of the readonly matrix P is done to avoid performance issues on nonuniform memory access (NUMA) architectures.
The BLAS sgemm routine [23] is used to evaluate the MMPs using singleprecision computations. The runtime of BLAMM is largely governed by the time required to evaluate these MMPs and hence the quality of the BLAS library implementation.
Even though parameters h and w that govern the dimensions of matrix S can be arbitrarily chosen, for performance reasons, they should not be too small (as evaluating MMPs with small matrices is not very efficient) nor too large (to avoid excessive memory requirements). In our implementation, we set h=1000 and w=250 such that matrix S corresponds to a 1000×[1000+4(m−1)] matrix.
Note that BLAS routines have full support to specify submatrix ranges without any need to explicitly copy these submatrices onto separate data structures.
Finally, note that in [27], we described the algorithm in using the transpose matrices R, P and S. Even though both approaches are mathematically equivalent, we found that current representation has a higher performance in practice. This is likely due to the fact that, when using a columnmajor matrix layout (as is natural in BLAS), matrix sub ^{o}(S) is represented as a linear array and stridedmemory access is avoided.
GPU version
Graphics processing units (GPUs) contain massively parallel processors able to perform certain tasks with higher efficiency than generalpurpose CPUs. Using the cuBLAS [26] library, one can evaluate the MMPs on a GPU thus boosting the performance of BLAMM. To this end, the matrices P and S must be copied to the GPU memory. The constant pattern matrix P is copied only once, while a new sequence matrix S is copied during each outer forloop iteration in Algorithm 1. To avoid having to copy the entire result matrix R^{o} from GPU memory to regular RAM after each MMP in the inner forloop iteration, a kernel was developed in the CUDA programming language to identify the matrix indices (i,j) for which R^{o}(i,j) exceeds the threshold score for PWM ^{j}. This task, known as stream compaction, is also executed on the GPU itself. Only the set of indices (i,j) that correspond to actual PWM matches is copied from GPU to system RAM, thus minimizing the volume of the data that has to be transferred. The CPU itself is only responsible for preparing the S matrices and converting the indices (i,j) to formatted output that is written to disk. Again, by parallelizing over the outer forloop in Algorithm 1, BLAMM supports the use of multiple GPU devices simultaneously.
The programming efforts required to enable GPU support are minimal as most tasks are handled by CUDA library calls (copying data between CPU and GPU, calling cublasSgemm,...). The only exception is the stream compaction kernel itself which consists of 7 lines of CUDA code.
Partitioning of the Pmatrix
The total number of floating point operations in BLAMM depends on the size of the input sequence(s) n, the total number of PWMs c and the length of the longest PWM m. Indeed, recall that PWMs with a length shorter than m are represented as a column in P by adding trailing zeros. In case many PWMs have a length that is substantially shorter than m, matrix P may contain a large fraction of zero elements. In turn, this may create significant overhead during the evaluation of the MMPs.
This overhead can easily be reduced by sorting the columns in P according to length of the PWM they represent and subsequently partitioning P into a number of tiles. The evaluation of R^{o}=sub^{o}(S)∗P can then be computed as a number of smaller MMPs as follows:.
where the interval [c_{i},c_{i+1}[ corresponds to a subset of the columns in R and P and where m_{i} denotes the maximum PWM length in that range. When m_{i}<m overhead is reduced.
The idea is illustrated in Fig. 3 for the JASPAR dataset (see description below). The pattern matrix P represents 1404 PWMs with lengths between 5 and 30 and initially contains 62% zero elements (Fig. 3a). This means that more than half of the floating point operations executed during the MMP are redundant. Matrix P is divided into two tiles along some column index c_{i} (Fig. 3b). The leftmost tile then contains, at the bottom, a submatrix with only zero elements that can be discarded when evaluating the MMP. The column index c_{i} is selected such that the area of this zero submatrix is maximal. The process of matrix subdividing is applied recursively to the two tiles individually. The procedure is stopped when the area of the zero submatrix that can be discarded is smaller than a userdefined parameter. After two rounds of subdividing, four tiles are obtained (Fig. 3c) and 90% of the zero elements are discarded from P. Figure 3d shows the relative performance gains obtained by subdividing matrix P. Different CPU architectures (discussed further) benefit to varying degrees (40% to 18% reduction in runtime). The GPU architecture did not benefit from partitioning P (6% increase in runtime). The reason is that even though the total number of floating point operations is reduced, the process of tiling also yields smaller matrices for which the evaluation of the MMPs is less efficient. Current tiling algorithm appears to provide a good tradeoff between reduction of zero fill while maintaining fairly large tiles. Note that this tiling algorithm is more efficient than the one we originally described in [27] where P was partitioned in slices with uniform width.
Benchmark results
The performance of BLAMM was benchmarked against i) a naive implementation of the bruteforce algorithm, ii) MOODS [20] as a stateoftheart online algorithm, iii) PoSSuMsearch [11] as a stateoftheart index structure based implementation and iv) TFMCUDA [21], a stateoftheart GPUaccelerated implementation. The naive implementation simply scans for PWM occurrences using three nested forloops: one forloop over the input sequence(s), a second forloop over the different PWMs and a third forloop to compute the PWM score. It has the same code quality standards as BLAMM to ensure a fair comparison.
A total of 1404 position frequency matrices with length between 5 and 30 were downloaded from the JASPAR CORE database [28]. The human genome reference genome (HG38) was retrieved from the GATK Resource Bundle [29]. Part of the benchmarks were run only on chromosome 1 (230 Mbp). For all tools, both strands of the DNA sequences were scanned for PWM occurrences. In the case of BLAMM, this is achieved by including the reverse complements of the PWM matrices as columns in matrix P. Thus effectively, 2808 PWM matrices were used in total.
Each tool computes the PWM score thresholds from a userdefined pvalue. Smaller pvalues give rise to higher thresholds and vice versa. Like MOODS, PoSSuMsearch and TFMCUDA, BLAMM implements a dynamic programming algorithm to convert pvalues to PWM score thresholds [30, 31]. Note that all tools use exact algorithms and should, in principle, produce identical results. Minor differences in output are due to i) slight differences in the process of converting pvalues to PWM thresholds, ii) slight differences in the way pseudocounts are implemented, iii) differences in the way nonACGT characters in the input sequences are handled. BLAMM’s CPU and GPU versions produce results identical to the naive bruteforce implementation, regardless of the number of threads used. The exact command line arguments for each tool are listed in Additional file 1: Section S1.
The benchmarks were run on three generations of Intel CPU architectures, referred to as node A, B and C, respectively (see Table 1 for details). With each generation, the number of CPU cores increases (16, 24, 36, resp.) while the clock frequency shows a slight decrease (2.6, 2.5 and 2.3 GHz, resp.). Particularly relevant for BLAMM, however, are the SIMD capabilities of the different CPUs. Node A contains CPUs that support Advanced Vector Extensions (AVX) that operate on 256bit AVX registers and that can thus contain 8 single precision (SP) floating point numbers. Each core is able to generate 16 single precision (SP) floating point operations (FLOPs) per CPU cycle: one 8wide AVX addition and one 8wide AVX multiplication. The CPUs in node B support AVX2 instructions which include fused multiplyadd (FMA) operations that effectively perform a multiplication and addition in a single instruction. Hence, these CPU cores are able to generate 32 SP FLOPs per CPU cycle: two 8wide FMA instructions. Finally, the CPU in node C has support for AVX512 instructions which operate on 512bit AVX registers that can hold 32 SP numbers again doubling the theoretical peak performance to 64 SP FLOPs per cycle. Table 1 also contains the specifications of a node with two nVidia 1080 Ti GPUs. In this case, as most of the workload is performed by the GPUs, the other system specifications (no. of CPU cores, SIMD capabilities) have only a marginal influence on the runtime.
The C++ source code of BLAMM was compiled against the Intel Math Kernel Library (MKL) version 2018.1.163 which implements optimized BLAS routines for Intel CPUs. Depending on the CPU’s capabilities (AVX, AVX2, AVX512), it automatically selects the most appropriate implementation. In all cases, multithreading within the MKL was disabled. In other words, individual calls to sgemm were always executed in a singlethreaded manner but multiple calls to sgemm are issued by different threads concurrently. The CUDA code was compiled with the nvcc compiler and linked against cuBLAS from the CUDA SDK version 8.0.
When performing the benchmarks with fewer threads than CPU cores, the remaining CPU cores were idle. Runtime (wall clock time) and peak resident memory use were measured using the Linux /usr/bin/time v tool.
Table 2 shows the benchmark results of the bruteforce, MOODS, PoSSuMsearch, TFMCUDA and BLAMM algorithm on node C (36 CPU cores with AVX512 support) when searching the occurrences of 1404 JASPAR PWMs on both strands of human chromosome 1 for two different pvalues: 10^{−5} and 10^{−4}. The bruteforce algorithm, PoSSuMsearch (PWM matching module) and BLAMM support multithreading and were run using 1, 4, 16 and 36 CPU cores respectively. For those cases, the parallel speedup (acceleration factor w.r.t. the singlethreaded run) and parallel efficiency (ratio of the parallel speedup and the no. of CPU cores used) are provided. PoSSuM requires an enhanced suffix array (ESA) to be constructed prior to the actual PWM matching; this runtime is reported separately and needs to be performed only once, independent of the pvalue. The PoSSuM ESA construction step as well as MOODS do not have multithreading support. A more extensive version of this table is provided in Additional file 1: Table S3. Similar tables for node A and B are provided in Additional file 1: Tables S1 and S2, respectively.
Despite being the slowest, the naive implementation of the bruteforce algorithm has only negligible memory use and shows nearly perfect scaling behavior. Additionally, its runtime is nearly independent of the pvalue that is used. The minor differences can be attributed to the larger volume of output that has to be written to disk when more relaxed pvalues are used. MOODS shows excellent runtime performance, taking into account that the software uses only a single CPUcore. However, its memory use is high, exceeding 50 GByte for a pvalue of 10^{−4}. Additionally, both the runtime and memory use depend on the selection of the pvalue: more relaxed (i.e., higher) pvalue settings result in additional resource requirements. The same is true for PoSSuMsearch: more relaxed pvalue settings render it more difficult to eliminate parts of the search space that are guaranteed not to contain PWM occurrences. PoSSuMsearch has a runtime that is comparable to MOODS, albeit with a lower memory use. The software does not benefit much from multithreading with a parallel speedup of at most 2 despite the availability of 36 CPU cores. BLAMM shares the benefits of the bruteforce algorithm: a low memory footprint, a runtime that is largely independent of the selected pvalue, and very good multithreading scaling behavior with a parallel speedup exceeding 27 using 36 CPU cores. When comparing the singlecore performance of BLAMM with MOODS and PoSSuMsearch, BLAMM is somewhat slower: it requires 2 to 4 CPU cores to achieve the same runtime. However, when allowed to make use of all available CPU cores, it is 6.2× to 13.7× faster than MOODS, while using only a fraction of the memory. When run on the dual GPU system, runtime is further reduced and PWM occurrences are identified in only 25s. Compared with TFMCUDA, the GPU version of BLAMM is more than four times faster (singleGPU results). This again illustrates the high performance one can obtain when evaluating MMPs.
Table 3 shows the benchmark results when applying the algorithms to the entire human genome. Again, Additional file 1: Tables S4 and S5 contain the benchmark results on nodes A and B, respectively. The runtime of MOODS ranges from 36 min to over 3 h, depending on the pvalue while the memory use ranges from 17.5 GB to 79 GB. Because of these high memory requirements, MOODS cannot be run on nodes A and B for this configuration. Due to the much larger size of the input sequence, memory requirements of PoSSuM also increase significantly. It takes over three hours to construct the ESA index structure and an additional 11 min to over three hours to identify the PWM matches. In contrast, BLAMM has a runtime that is nearly constant (13 min) and requires very little memory. On the GPU system, the runtimes of BLAMM are further reduced to under 5 min. Note that we were unable to run TFMCUDA on this dataset, likely due to GPU memory restrictions.
Figure 4 shows the relative performance gains of the individual tools from newer CPU architectures. For each tool individually, we normalized its runtime (using chr. 1 and a pvalue = 10^{−5}) to the runtime obtained on the oldest CPU architecture (node A). Despite the decrease in CPU clock frequency, all tools benefit from newer architectures, albeit to varying degrees. Due to the introduction of AVX2 and AVX512 in nodes B and C, respectively, BLAMM is able to benefit the most in terms of singlecore performance. The availability of more powerful SIMD instructions in newer architectures automatically translates to faster MMPs and hence, lower runtimes. Also when considering the multithreaded case, BLAMM benefits the most. In that case, we observe that performance benefits not only from newer SIMD instructions, but also from an increasing number of CPU cores.
Discussion
The PWM matching problem is a form of imprecise string matching. Under stringent settings (low pvalue and hence, a high PWM score threshold), the problem approaches that of exact string matching for which efficient algorithms have been developed, both online and based on index structures. They reduce the complexity from O(nm) for a bruteforce search to O(n+m). These ideas have been extended to PWM matching and implemented in tools such as MOODS and PoSSuMsearch. Nevertheless, for more relaxed pvalues, these algorithms lose their ability to apriori eliminate large parts of the search space and thus require more computational resources. In that respect, the proposed BLAMM algorithm has certain advantages:
The runtime does not depend on the selected pvalue. A small increase in runtime can be observed when relaxing the pvalue, however, this is due to a higher number of PWM occurrences that have to be written to disk.
The memory use of BLAMM is negligible and again independent of the selected pvalue. Per CPU core, memory is required for the (relatively small) S, P and R matrices and to buffer occurrences before spilling them to disk. All data structures involved are threadlocal and hence, BLAMM makes efficient use of multithreading up to a high number of CPU cores, even on nonuniform memory architectures (NUMA).
The runtime of BLAMM is governed largely by the time required to evaluate the MMPs and hence, the quality of the BLAS implementation. Optimized BLAS libraries are provided by almost all CPU vendors. This enables BLAMM to make full use of current and future SIMD capabilities of CPUs, without needing to modify the source code of BLAMM itself. We have demonstrated this for AVX, AVX2 and AVX512. As many applications in High Performance Computing and Artificial Intelligence benefit from improved BLAS performance, CPU vendors will surely continue along this path of developments in the future. For example, halfprecision floating point computations are increasingly supported and might further improve BLAMM’s performance.
The algorithm and its implementation are very simple.
The algorithm is easily portable to GPUs and/or other coprocessors or hardware accelerators. For GPUs, this achieved using the cuBLAS library to enable MMPs to be evaluated on massively parallel processors such as GPUs, thus offloading the CPU and boosting performance.
Conclusions
We proposed and described BLAMM as a simple yet effective algorithm to identify position weight matrix occurrences in DNA sequences. The BLAMM algorithm is based on the bruteforce algorithm in the sense that it exhaustively computes all PWM scores for all PWMs at each starting position of the input sequence(s). However, it does so with much higher efficiency by expressing all computations through matrixmatrix products (MMPs). It is wellknown that MMPs are among a select class of algorithms that can be evaluated on modern, cachebased CPUs with a performance that approaches the theoretical peak performance of the CPU. Highly optimized MMP implementations are provided by CPU vendors through the BLAS library. BLAMM outperforms a naive implementation of the bruteforce algorithm by a factor of 13 to 17. Additionally, it inherits the advantages of the bruteforce algorithm, namely its simplicity, its negligible memory use, its ability to make very efficient use of multithreading and the fact that runtime and memory use are (largely) independent of the selected pvalue or PWM score thresholds. Compared with stateoftheart software package such as MOODS and PoSSuMsearch which implement more sophisticated search algorithms, BLAMM is a factor 2 to 4 slower when considering singlecore performance. When using multithreading, BLAMM can be an order of magnitude faster than MOODS and PoSSuMsearch, while requiring only a fraction of their memory use. Finally, we presented an implementation of BLAMM on a GPU system that outperforms all CPUbased algorithms, as well as an existing GPU implementation by a large margin.
Availability and requirements
Project name: BLAMM – BLASAccelerated Motif Matching Project home page:https://github.com/biointec/blammOperating system: Platform independent Programming language: C++11 Other requirements: BLAS library, CUDA optional License: GNU GPL v3 Any restrictions to use by nonacademics: none
Abbreviations
 ATLAS:

Automatically tuned linear algebra software
 AVX:

Advanced vector extensions
 BLAMM:

BLASaccelerated motif matching
 BLAS:

Basic linear algebra subroutines
 CUDA:

Compute unified device architecture
 EREWPRAM:

Exclusive read exclusive write parallel randomaccess machine
 FLOPs:

Floating point operations
 FMA:

Fused multiplyadd
 GPU:

Graphics processing unit
 HPC:

High performance computing
 MKL:

Math kernel library
 MMP:

Matrixmatrix product
 MOODS:

Motif occurrence detection suite
 PFM:

Position frequency matrix
 PPM:

Position probability matrix
 PSSM:

Position specific scoring matrix
 PWM:

Position weight matrix
 SIMD:

Single instruction multiple data
References
 1
Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000; 16(1):16–23. https://0doiorg.brum.beds.ac.uk/10.1093/bioinformatics/16.1.16.
 2
Khan A, Fornes O, Stigliani A, Gheorghe M, CastroMondragon JA, van der Lee R, Bessy A, Chèneby J, Kulkarni SR, Tan G, Baranasic D, Arenillas DJ, Sandelin A, Vandepoele K, Lenhard B, Ballester B, Wasserman WW, Parcy F, Mathelier A. JASPAR 2018: update of the openaccess database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 2018; 46(D1):260–6. https://doi.org/10.1093/nar/gkx1126.
 3
Newburger DE, Bulyk ML. UniPROBE: an online database of protein binding microarray data on proteinDNA interactions. Nucleic Acids Res. 2009; 37(Database issue):77–82. https://0doiorg.brum.beds.ac.uk/10.1093/nar/gkn660.
 4
Wingender E, Dietze P, Karas H, Knüppel R. TRANSFAC: A database on transcription factors and their DNA binding sites. Nucleic Acids Res. 1996; 24(1):238–41. https://0doiorg.brum.beds.ac.uk/10.1093/nar/24.1.238.
 5
Nishida K, Frith MC, Nakai K. Pseudocounts for transcription factor binding sites. Nucleic Acids Res. 2009; 37(3):939–44. https://0doiorg.brum.beds.ac.uk/10.1093/nar/gkn1019.
 6
Schneider TD, Stephens RM. Sequence logos: A new way to display consensus sequences. Nucleic Acids Res. 1990; 18:6097–6100.
 7
Pizzi C, Ukkonen E. Fast profile matching algorithms – A survey. Theor Comput Sci. 2008; 395(2–3):137–57. https://0doiorg.brum.beds.ac.uk/10.1016/j.tcs.2008.01.015.
 8
Wu TD, NevillManning CG, Brutlag DL. Fast probabilistic analysis of sequence function using scoring matrices. Bioinformatics. 2000; 16(3):233–44. https://0doiorg.brum.beds.ac.uk/10.1093/bioinformatics/16.3.233.
 9
Dorohonceanu B, NevillManning CG. Accelerating protein classification using suffix trees. In: Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, August 1923, 2000, La Jolla / San Diego, CA, USA: 2000. p. 128–33. http://www.aaai.org/Library/ISMB/2000/ismb00013.php. Accessed July 2018.
 10
Abouelhoda MI, Kurtz S, Ohlebusch E. Replacing suffix trees with enhanced suffix arrays. J Discret Algoritm. 2004; 2(1):53–86. https://0doiorg.brum.beds.ac.uk/10.1016/S15708667(03)000650. The 9th International Symposium on String Processing and Information Retrieval.
 11
Beckstette M, Strothmann D, Homann R, Giegerich R, Kurtz S. PoSSuMsearch: Fast and sensitive matching of position specific scoring matrices using enhanced suffix arrays. In: Proc. of the German Conference on Bioinformatics, Volume P53, GI Lecture Notes in Informatics. 2004. p. 53–64.
 12
Beckstette M, Homann R, Giegerich R, Kurtz S. Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics. 2006; 7(1):389. https://0doiorg.brum.beds.ac.uk/10.1186/147121057389.
 13
Liefooghe A, Touzet H, Varré JS. Large scale matching for position weight matrices In: Lewenstein M, Valiente G, editors. Combinatorial Pattern Matching. Berlin: Springer: 2006. p. 401–412.
 14
Salmela L, Tarhio J. Algorithms for weighted matching In: Ziviani N, BaezaYates R, editors. String Processing and Information Retrieval. Berlin: Springer: 2007. p. 276–286.
 15
Liefooghe A, Touzet H, Varré JS. Selfoverlapping occurrences and KnuthMorrisPratt algorithm for weighted matching. In: 3rd International Conference on Language and Automata Theory and Applications, vol. 5457. Tarragona: 2009. p. 481–92. https://hal.inria.fr/inria00365411.
 16
Knuth D, Morris JJr, Pratt V. Fast pattern matching in strings. SIAM J Comput. 1977; 6(2):323–50. https://0doiorg.brum.beds.ac.uk/10.1137/0206024. http://arxiv.org/abs/https://0doiorg.brum.beds.ac.uk/10.1137/0206024.
 17
Pizzi C, Rastas P, Ukkonen E. Fast search algorithms for position specific scoring matrices. In: Proceedings of the 1st International Conference on Bioinformatics Research and Development, BIRD’07. Berlin: Springer: 2007. p. 239–50. http://0dl.acm.org.brum.beds.ac.uk/citation.cfm?id=1762370.1762395.
 18
Pizzi C, Rastas P, Ukkonen E. Finding significant matches of position weight matrices in linear time. IEEE/ACM Trans Comput Biol Bioinforma. 2011; 8(1):69–79. https://0doiorg.brum.beds.ac.uk/10.1109/TCBB.2009.35.
 19
Korhonen JH, Palin K, Taipale J, Ukkonen E. Fast motif matching revisited: highorder PWMs, SNPs and indels. Bioinforma (Oxf Engl). 2016. https://0doiorg.brum.beds.ac.uk/10.1093/bioinformatics/btw683.
 20
Korhonen J, Martinmäki P, Pizzi C, Rastas P, Ukkonen E. MOODS: fast search for position weight matrix matches in DNA sequences. Bioinformatics. 2009; 25(23):3181–2. https://0doiorg.brum.beds.ac.uk/10.1093/bioinformatics/btp554.
 21
Giraud M, Varré JS. Parallel position weight matrices algorithms. Parallel Comput. 2011; 37(8):466–78. https://0doiorg.brum.beds.ac.uk/10.1016/j.parco.2010.10.001.
 22
Schmidt B, Hildebrandt A. Nextgeneration sequencing: Big Data meets High Performance Computing. Drug Disc Today. 2017; 22(4):712–7. https://0doiorg.brum.beds.ac.uk/10.1016/j.drudis.2017.01.014.
 23
Dongarra JJ, Du Croz J, Hammarling S, Duff IS. A set of level 3 basic linear algebra subprograms. ACM Trans Math Softw. 1990; 16(1):1–17. https://0doiorg.brum.beds.ac.uk/10.1145/77626.79170.
 24
Whaley RC, Dongarra JJ. Automatically tuned linear algebra software. In: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, SC ’98. Washington, DC: IEEE Comput Soc: 1998. p. 1–27. http://0dl.acm.org.brum.beds.ac.uk/citation.cfm?id=509058.509096.
 25
Goto K, Geijn RAvd. Anatomy of highperformance matrix multiplication. ACM Trans Math Softw. 2008; 34(3):12–11225. https://0doiorg.brum.beds.ac.uk/10.1145/1356052.1356053.
 26
Cook S. CUDA Programming: A Developer’s Guide to Parallel Computing with GPUs, 1st edn.San Francisco: Morgan Kaufmann Publishers Inc.; 2013.
 27
Fostier J. A BLASbased algorithm for finding position weight matrix occurrences in DNA sequences on CPUs and GPUs. In: Bioinformatics and Biomedical Engineering, Lecture Notes in Bioinformatics. Cham: Springer: 2018. p. 439–49.
 28
Mathelier A, Fornes O, Arenillas DJ, Chen CYY, Denay G, Lee J, Shi W, Shyr C, Tan G, WorsleyHunt R, Zhang AW, Parcy F, Lenhard B, Sandelin A, Wasserman WW. JASPAR 2016: a major expansion and update of the openaccess database of transcription factor binding profiles. Nucleic Acids Res. 2016; 44(D1). https://0doiorg.brum.beds.ac.uk/10.1093/nar/gkv1176.
 29
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. The Genome Analysis Toolkit: a MapReduce framework for analyzing nextgeneration DNA sequencing data. Genome Res. 2010; 20(9):1297–303.
 30
Touzet H, Varre JS. Efficient and accurate Pvalue computation for Position Weight Matrices. Algoritm Mol Biol. 2007; 2(1). https://0doiorg.brum.beds.ac.uk/10.1186/17487188215.
 31
Zhang J, Jiang B, Li M, Tromp J, Zhang X, Zhang MQ. Computing exact Pvalues for DNA motifs. Bioinformatics. 2007; 23(5):531–537. https://0doiorg.brum.beds.ac.uk/10.1093/bioinformatics/btl662.
 32
Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence logo generator. Genome Res. 2004; 14:1188–90. https://0doiorg.brum.beds.ac.uk/10.1101/gr.849004.
Acknowledgements
The computational resources (Stevin Supercomputer Infrastructure) and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by Ghent University, FWO and the Flemish Government – department EWI. JF would like to acknowledge Klaas Vandepoele and Subhada Kulkarni for testing the software and providing useful suggestions.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 21 Supplement 2, 2020: Selected articles from the 6th International WorkConference on Bioinformatics and Biomedical Engineering. The full contents of the supplement are available online at URL.
Funding
JF was supported by Ghent University  imec. Publication costs are funded by Ghent University.
Author information
Affiliations
Contributions
JF conceived the method, implemented the software, performed benchmarks and wrote the manuscript. The author read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
The author declares that he has no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
BLAMM: BLASbased Algorithm for Finding Position Weight Matrix Occurrences in DNA sequences on CPUs and GPUs.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Fostier, J. BLAMM: BLASbased algorithm for finding position weight matrix occurrences in DNA sequences on CPUs and GPUs. BMC Bioinformatics 21, 81 (2020). https://0doiorg.brum.beds.ac.uk/10.1186/s1285902033486
Published:
DOI: https://0doiorg.brum.beds.ac.uk/10.1186/s1285902033486
Keywords
 Position weight matrix (PWM)
 High performance computing (HPC)
 Basic linear algebra subprograms (BLAS)
 Graphics processing units (GPUs)