Boyer-Moore Explained
=====================
Look at the situation below, the pattern is matched in the first three characters but
a mismatch occurs in character c
... t[k] t[k+5] t[k+8] ... <- text
c X Y Z
Y Z d X Y Z X Y Z <- pattern
Depending on the value of c, a different shift would be used.
Case 1: c = 'R' and d is not 'R'
Since c='R' doesn't appear in pattern, shift pattern beyond c. (Bad Character Shift)
... t[k] t[k+5] t[k+8] ... <- text
c X Y Z
Y Z W X Y Z X Y Z <- pattern
Case 2: c = 'R' and d = 'R'
Since 'R' appears in the pattern to the left of the mismatch, shift 3 to align c and d. (Bad
Character Shift)
... t[k] t[k+5] t[k+8] ... <- text
c X Y Z
Y Z d X Y Z X Y Z <- pattern
What is the Good Suffix Shift?
If you have matched a suffix in the pattern and a mismatch occurs at the next character p in the
pattern, the good suffix shift is the amount to shift the pattern so that the same suffix matches
as before but this occurrence of the suffix is preceded by a different character q not equal to p.
In the example below, the suffix that matches is XYZ it is preceded by Z which mismatches with c.
We shift the pattern so that XYZ still matches but the character that precedes the suffix d is not
equal to 'Z'.
Case 3: c = 'W' and d is not Z (looking for recurring suffix in pattern
preceded by character different than 'Z').
Shift the pattern 3 positions so d lines up with c in the pattern. (Good Suffix Shift)
... t[k] t[k+5] t[k+8] ... <- text
c X Y Z
Y Z d X Y Z X Y Z <- pattern
What is the Good Suffix Shift continued?
If it is impossible to rematch the suffix where the character that preceds the suffix is
not equal to p, then find the longest prefix of the pattern that matches a suffix of the pattern.
In the next example, when attempting to rematch the suffix XYZ, we find that any rematch of the
suffix is preceded by the same character p = 'Z' where the mismatch occurred. So, look for the
longest prefix of the pattern that matches a suffix 'XY'.
Case 4: d = 'Z' (if no recurring pattern exists preceded by a different
character than 'Z' find the longest prefix of pattern that matches a suffix of pattern)
Shift pattern by 4 to line up the longest prefix with a corresponding suffix). (Good Suffix Shift)
... t[k] t[k+5] t[k+8] ... <- text
c X Y Z
Y Z d X Y Z X Y Z <- pattern
----------------------------------------------------------------------------------------
Let the pattern P of length m be represented as P = p(m) ... p(2) p(1)
Example: P = "abaa", m = 4
4 3 2 1
P = a b a a
Bad Character Shift Table
=========================
Let c be any character in the alphabet, and B represent the Bad Character Shift Table.
Define the minimum of the empty set to be m, that is, min{ } = m by convention.
Define B[c] = min{ k>=1 : p(k+1) = c }
Let's compute the Bad Character Shift Table for P = "abaa".
B['a'] = min{1, 3} = 1, [why 1 and 3? since p(1+1) = p(2) = 'a' and p(3+1) = p(4) = 'a']
B['b'] = min{2} = 2 [why 2? since p(2+1) = p(3) = 'b']
B['x'] = min{ } = m = 4 [why empty? no k exists where p(k+1) = 'x']
B[any character not 'a' or 'b'] = 4
Good Suffix Table
=================
Let G represent the Good Suffix Table where i = 1, 2, ..., m are valid indicies.
G[i] = min{k >= 1 : [p(min{m,i+k-1}..p(k+1)] matches exactly with a suffix of [p(i-1).. p(1)]
when i>=2 and
p(i+k) != p(i) when i+k <= m}
Example: P = "abaa", m = 4
4 3 2 1
P = a b a a
Case 1: i = 1
=============
G[1] = min{ k >= 1 : p(1+k) != p(1) when 1+k <= 4}
= min{ 2 } = 2 [why 2? since p(2+1) = 'b' != 'a' = p(1)]
G[2] = min{ k >= 1 : [p(k+1)..p(k+1)] matches a suffix of [p(1)..p(1)] = 'a' and
p(2+k) != p(2) = 'a' when 2+k <= 4}
= min{1, ...} = 1 [why 1? since [p(2)..p(2)] = 'a' matches suffix of "a" and
p(2+1) = 'b' != 'a' = p(2)]
G[3] = min{ k >= 1 : [p(k+2)..p(k+1)] matches a suffix of [p(2)..p(1)] = "aa" and
p(3+k) != p(3) = 'b' when 3+k <= 4}
= min{3, ...} = 3 [why 3? First show why not 1 or 2. If k=1,
[p(k+2)..p(k+1)] = [p(3)..p(2)] = "ba" not a suffix of "aa".
Similarly, if k = 2,
[p(k+2)..p(k+1)] = [p(4)..p(3)] = "ab" not a suffix of "aa".
If k=3,
[p(m=4)..p(k+1)] = [p(4)..p(4)] = "a" (run off pattern)
But "a" is a suffix of "aa" and
p(3+k) != p(3) vacuously true since run off pattern.]
G[4] = min{ k >= 1 : [p(k+3)..p(k+1)] matches a suffix of [p(3)..p(1)] = "baa" and
p(4+k) != p(4) = 'a' when 4+k <= 4}
= min{3, ...} = 3 [why 3? First show why not 1 or 2. If k=1,
[p(k+3)..p(k+1)] = [p(4)..p(4)] ="aba" not a suffix of "baa".
Similarly, if k = 2,
[p(m=4)..p(k+1)] = [p(4)..p(2)]= "ab" (run off pattern)
not a suffix of "aa".
If k=3,
[p(m=4)..p(k+1)] = [p(4)..p(4)] = "a" (run off pattern)
But "a" is a suffix of "baa" and
p(4+k) != p(4) vacuously true since run off pattern.]
Boyer-Moore Shift
=================
If a mismatch occurs in position i of pattern P and the mismatch character in the text
is c, shift by the max{G[i] and B[c] - i + 1}.
Example:
P = abaa
Text = abababaxaaaaaxaabbaaxbaabaa
012345678901234567890123456
abababaxaaaaaxaabbaaxbaabaa
abaa
4321
^
mismatch at position i = 1 of pattern at 'b' in text
shift = max{ G[1], B['b'] - 1 + 1} = max{2, 2} = 2
012345678901234567890123456
abababaxaaaaaxaabbaaxbaabaa
abaa
4321
^
mismatch at position i = 1 of pattern at 'b' in text
shift = max{ G[1], B['b'] - 1 + 1} = max{2, 2} = 2
012345678901234567890123456
abababaxaaaaaxaabbaaxbaabaa
abaa
4321
^
mismatch at position i = 1 of pattern at 'x' in text
shift = max{ G[1], B['x'] - 1 + 1} = max{2, 4} = 4
012345678901234567890123456
abababaxaaaaaxaabbaaxbaabaa
abaa
4321
^
mismatch at position i = 3 of pattern at 'a' in text
shift = max{ G[3], B['a'] - 3 + 1} = max{3, -1} = 3
012345678901234567890123456
abababaxaaaaaxaabbaaxbaabaa
abaa
4321
^
mismatch at position i = 2 of pattern at 'x' in text
shift = max{ G[2], B['x'] - 2 + 1} = max{1, 3} = 3
012345678901234567890123456
abababaxaaaaaxaabbaaxbaabaa
abaa
4321
^
mismatch at position i = 1 of pattern at 'b' in text
shift = max{ G[1], B['b'] - 1 + 1} = max{2, 2} = 2
012345678901234567890123456
abababaxaaaaaxaabbaaxbaabaa
abaa
4321
^
mismatch at position i = 4 of pattern at 'b' in text
shift = max{ G[4], B['b'] - 4 + 1} = max{3, -1} = 3
012345678901234567890123456
abababaxaaaaaxaabbaaxbaabaa
abaa
4321
^
mismatch at position i = 2 of pattern at 'b' in text
shift = max{ G[2], B['b'] - 2 + 1} = max{1, 1} = 1
012345678901234567890123456
abababaxaaaaaxaabbaaxbaabaa
abaa
4321
^
mismatch at position i = 4 of pattern at 'x' in text
shift = max{ G[4], B['x'] - 4 + 1} = max{3, 1} = 3
012345678901234567890123456
abababaxaaaaaxaabbaaxbaabaa
abaa
4321 pattern found
---------------------------------------------------------------------------------------
Exercise: Compute the B[] and G[] tables for the pattern "GCAGAGAG". To see the answer,
run Boyer.class for this pattern. Note: read the table generated by the program in
reverse order. That is, index 7 down to index 0 is the same as looking at G[1] up to G[8]
described above.