Short read DNA fragment anchoring algorithm

BMC Bioinformatics

Table 2 Procedure of EMBF algorithm.

Input: Block length L, n bps reference sequence S, m bps query sequence T, miss match error threshold M, gap error threshold G.
Let B1 = ⌊n/L⌋, B2 = ⌊m/L⌋, E = MAX(M, G);
1. For offset = 1 to L do
1.1 Divide S [offset, n] into L bps blocks, as S_offset = {s_offset,1,..., s_{offset, B1}};
1.2 Convert S_offsetto frequency vectors, as ES_offset = {es_offset,1,..., es_{offset, B1}};
2. For offset = 1 to L do
2.1 Sequentially choose B2 blocks from ES_offset, and set the start position as p; Using all possible combinations to get B2-E blocks. And combine them as ADDR variable. Set the remaining E blocks as r;
2.2 Mapping pair (ADDR,(r, p)) into a hash map M, and chaining possible conflicts;
2.3 Iteratively scan ES_offset for next B2 blocks in ES_offset;
3 First level filtering process
3.1 Divide T into L bps blocks, as T = {t₁, t₂,..., t_B2} and convert them to frequency vector as ET = {et₁, et₂,..., et_B2};
3.2 Choose B2-E blocks from ET and combining them as ADDR variable, set the remaining E blocks as t;
3.3 Query ADDR in M and pass all returned results as R = {(r₁, p₁), (r₂, p₂),...... } to step 4;
4 For i = 1 to \| R\| do
if ECD(t, r_i) < E then record p_i;
Output: All recorded p _i from step 4.

The step 1~2 in EMBF are pre-processing steps where a two-level index structure was constructed. Index entry addresses are generated according to different combination of blocks, and require L*n/L*C(m/L,2) = nm²/L² operations in total. The computing overheads to generate ADDR could be set as constant c, so the total pre-processing costs to build the index is cnm²/L² ≈ O(n). Step 3 is the first level filtering phase with constant computing cost. The output result count is related to the length of index seed and the size of reference sequence. The second level filtering work is processed in step 4 with time complexity of O(rm/L), where r is the average output count of step 3, see results section for accurate evaluation of the value r. So by excluding the pre-processing steps, the timing complexity of EMBF is O(rm/L) << O(n). The space complexity could be interpreted as memory space used to implement the two-level index structure (see results section for detailed analysis). In order to fit first index into fast storage device to achieve best performance, we could adjust the size of reference sequence and the length of index seed to fine tune the index size and access overheads.

ISSN: 1471-2105