Skip to main content

Table 2 Procedure of EMBF algorithm.

From: Short read DNA fragment anchoring algorithm

Input: Block length L, n bps reference sequence S, m bps query sequence T, miss match error threshold M, gap error threshold G.

Let B1 = ⌊n/L⌋, B2 = ⌊m/L⌋, E = MAX(M, G);

1. For offset = 1 to L do

   1.1 Divide S [offset, n] into L bps blocks, as Soffset = {soffset,1,..., soffset, B1};

   1.2 Convert Soffsetto frequency vectors, as ESoffset = {esoffset,1,..., esoffset, B1};

2. For offset = 1 to L do

   2.1 Sequentially choose B2 blocks from ESoffset, and set the start position as p; Using all possible combinations to get B2-E blocks. And combine them as ADDR variable. Set the remaining E blocks as r;

   2.2 Mapping pair (ADDR,(r, p)) into a hash map M, and chaining possible conflicts;

   2.3 Iteratively scan ESoffset for next B2 blocks in ESoffset;

3 First level filtering process

   3.1 Divide T into L bps blocks, as T = {t1, t2,..., tB2} and convert them to frequency vector as ET = {et1, et2,..., etB2};

   3.2 Choose B2-E blocks from ET and combining them as ADDR variable, set the remaining E blocks as t;

   3.3 Query ADDR in M and pass all returned results as R = {(r1, p1), (r2, p2),...... } to step 4;

4 For i = 1 to | R| do

   if ECD(t, ri) < E then record pi;

Output: All recorded p i from step 4.

  1. The step 1~2 in EMBF are pre-processing steps where a two-level index structure was constructed. Index entry addresses are generated according to different combination of blocks, and require L*n/L*C(m/L,2) = nm2/L2 operations in total. The computing overheads to generate ADDR could be set as constant c, so the total pre-processing costs to build the index is cnm2/L2 ≈ O(n). Step 3 is the first level filtering phase with constant computing cost. The output result count is related to the length of index seed and the size of reference sequence. The second level filtering work is processed in step 4 with time complexity of O(rm/L), where r is the average output count of step 3, see results section for accurate evaluation of the value r. So by excluding the pre-processing steps, the timing complexity of EMBF is O(rm/L) << O(n). The space complexity could be interpreted as memory space used to implement the two-level index structure (see results section for detailed analysis). In order to fit first index into fast storage device to achieve best performance, we could adjust the size of reference sequence and the length of index seed to fine tune the index size and access overheads.