Your Ad Here
Boyer-Moore Explained
=====================
Look at the situation below, the pattern is matched in the first three characters but
a mismatch occurs in character c

  ... t[k]               t[k+5]      t[k+8] ...       <- text
                           c   X   Y   Z
       Y   Z   d   X   Y   Z   X   Y   Z              <- pattern


Depending on the value of c, a different shift would be used.

Case 1: c = 'R' and d is not 'R'
Since c='R' doesn't appear in pattern, shift pattern beyond c. (Bad Character Shift)

  ... t[k]               t[k+5]      t[k+8] ...                               <- text
                           c   X   Y   Z
                               Y   Z   W   X   Y   Z   X   Y   Z              <- pattern


Case 2: c = 'R' and d = 'R'
Since 'R' appears in the pattern to the left of the mismatch, shift 3 to align c and d. (Bad
Character Shift)

  ... t[k]               t[k+5]      t[k+8] ...                   <- text
                           c   X   Y   Z
                   Y   Z   d   X   Y   Z   X   Y   Z              <- pattern


What is the Good Suffix Shift?
If you have matched a suffix in the pattern and a mismatch occurs at the next character p in the 
pattern, the good suffix shift is the amount to shift the pattern so that the same suffix matches
as before but this occurrence of the suffix is preceded by a different character q not equal to p.
In the example below, the suffix that matches is XYZ it is preceded by Z which mismatches with c.
We shift the pattern so that XYZ still matches but the character that precedes the suffix d is not
equal to 'Z'.


Case 3: c = 'W' and d is not Z (looking for recurring suffix in pattern 
preceded by character different than 'Z').

Shift the pattern 3 positions so d lines up with c in the pattern. (Good Suffix Shift)

  ... t[k]               t[k+5]      t[k+8] ...                   <- text
                           c   X   Y   Z
                   Y   Z   d   X   Y   Z   X   Y   Z              <- pattern


What is the Good Suffix Shift continued?
If it is impossible to rematch the suffix where the character that preceds the suffix is
not equal to p, then find the longest prefix of the pattern that matches a suffix of the pattern.
In the next example, when attempting to rematch the suffix XYZ, we find that any rematch of the
suffix is preceded by the same character p = 'Z' where the mismatch occurred. So, look for the
longest prefix of the pattern that matches a suffix 'XY'.

Case 4:  d = 'Z' (if no recurring pattern exists preceded by a different 
character than 'Z' find the longest prefix of pattern that matches a suffix of pattern)
Shift pattern by 4 to line up the longest prefix with a corresponding suffix). (Good Suffix Shift)

  ... t[k]               t[k+5]      t[k+8] ...                                   <- text
                           c   X   Y   Z
                                   Y   Z   d   X   Y   Z   X   Y   Z              <- pattern

  ----------------------------------------------------------------------------------------
Let the pattern P of length m be represented as P = p(m) ... p(2) p(1)

Example: P = "abaa", m = 4
               
              4 3 2 1
         P =  a b a a
         
Bad Character Shift Table
=========================
Let c be any character in the alphabet, and B represent the Bad Character Shift Table.

Define the minimum of the empty set to be m, that is,  min{ } = m by convention.
Define B[c] = min{ k>=1 : p(k+1) = c }

Let's compute the Bad Character Shift Table for P = "abaa".

B['a'] = min{1, 3} = 1,     [why 1 and 3? since p(1+1) = p(2) = 'a' and p(3+1) = p(4) = 'a']

B['b'] = min{2} = 2         [why 2? since p(2+1) = p(3) = 'b']

B['x'] = min{ } = m = 4     [why empty? no k exists where p(k+1) = 'x']

B[any character not 'a' or 'b'] = 4


Good Suffix Table
=================

Let G represent the Good Suffix Table where i = 1, 2, ..., m are valid indicies.

G[i] = min{k >= 1 : [p(min{m,i+k-1}..p(k+1)] matches exactly with a suffix of [p(i-1).. p(1)] 
                     when i>=2 and
                    
                     p(i+k) != p(i) when i+k <= m}
                     
                     
Example:  P = "abaa", m = 4
               
              4 3 2 1
         P =  a b a a
         
  Case 1: i = 1
  =============
  G[1] = min{ k >= 1 : p(1+k) != p(1) when 1+k <= 4}
  
       = min{ 2 } = 2           [why 2? since p(2+1) = 'b' != 'a' = p(1)]
       
       
       
  G[2] = min{ k >= 1 : [p(k+1)..p(k+1)] matches a suffix of [p(1)..p(1)] = 'a' and 
                        p(2+k) != p(2) = 'a'   when 2+k <= 4}
                       
       = min{1, ...}  = 1      [why 1? since [p(2)..p(2)] = 'a' matches suffix of "a" and
                                        p(2+1) = 'b' != 'a' = p(2)]
                                        
                                        
   G[3] = min{ k >= 1 : [p(k+2)..p(k+1)] matches a suffix of [p(2)..p(1)] = "aa" and 
                         p(3+k) != p(3) = 'b'   when 3+k <= 4} 
                         
                         
        = min{3, ...} = 3       [why 3? First show why not 1 or 2. If k=1,      
                                   [p(k+2)..p(k+1)] = [p(3)..p(2)] = "ba" not a suffix of "aa".
                                   Similarly, if k = 2, 
                                   [p(k+2)..p(k+1)] = [p(4)..p(3)] = "ab" not a suffix of "aa".
                                        
                                   If k=3,
                                   [p(m=4)..p(k+1)] = [p(4)..p(4)] = "a" (run off pattern)
                                   But "a" is a suffix of "aa" and
                                   p(3+k) != p(3) vacuously true since run off pattern.]
                                        
    
    G[4] = min{ k >= 1 : [p(k+3)..p(k+1)] matches a suffix of [p(3)..p(1)] = "baa" and 
                         p(4+k) != p(4) = 'a'   when 4+k <= 4}                                      

         = min{3, ...} = 3      [why 3? First show why not 1 or 2. If k=1,      
                                   [p(k+3)..p(k+1)] = [p(4)..p(4)] ="aba" not a suffix of "baa".
                                   Similarly, if k = 2, 
                                   [p(m=4)..p(k+1)] = [p(4)..p(2)]= "ab" (run off pattern) 
                                   not a suffix of "aa".
                                        
                                   If k=3,
                                   [p(m=4)..p(k+1)] = [p(4)..p(4)] = "a" (run off pattern)
                                   But "a" is a suffix of "baa" and
                                   p(4+k) != p(4) vacuously true since run off pattern.]
                    
      
    Boyer-Moore Shift
    =================
    If a mismatch occurs in position i of pattern P and the mismatch character in the text
    is c, shift by the max{G[i] and B[c] - i + 1}.
    
    
    Example:
    
    P = abaa    
    Text = abababaxaaaaaxaabbaaxbaabaa
    
    012345678901234567890123456
    abababaxaaaaaxaabbaaxbaabaa
    abaa
    4321 
       ^
       mismatch at position i = 1 of pattern at 'b' in text
       shift = max{ G[1], B['b'] - 1 + 1} = max{2, 2} = 2
       
       
    012345678901234567890123456
    abababaxaaaaaxaabbaaxbaabaa
      abaa
      4321 
         ^
         mismatch at position i = 1 of pattern at 'b' in text
         shift = max{ G[1], B['b'] - 1 + 1} = max{2, 2} = 2
    
    012345678901234567890123456
    abababaxaaaaaxaabbaaxbaabaa
        abaa
        4321 
           ^
           mismatch at position i = 1 of pattern at 'x' in text
           shift = max{ G[1], B['x'] - 1 + 1} = max{2, 4} = 4
    
    012345678901234567890123456
    abababaxaaaaaxaabbaaxbaabaa
            abaa
            4321 
             ^
             mismatch at position i = 3 of pattern at 'a' in text
             shift = max{ G[3], B['a'] - 3 + 1} = max{3, -1} = 3
             
    012345678901234567890123456
    abababaxaaaaaxaabbaaxbaabaa
               abaa
               4321 
                 ^
                 mismatch at position i = 2 of pattern at 'x' in text
                 shift = max{ G[2], B['x'] - 2 + 1} = max{1, 3} = 3
             
             
    012345678901234567890123456
    abababaxaaaaaxaabbaaxbaabaa
                  abaa
                  4321 
                     ^
                     mismatch at position i = 1 of pattern at 'b' in text
                     shift = max{ G[1], B['b'] - 1 + 1} = max{2, 2} = 2
                     
    012345678901234567890123456
    abababaxaaaaaxaabbaaxbaabaa
                    abaa
                    4321 
                    ^
                     mismatch at position i = 4 of pattern at 'b' in text
                     shift = max{ G[4], B['b'] - 4 + 1} = max{3, -1} = 3
                     
    012345678901234567890123456
    abababaxaaaaaxaabbaaxbaabaa
                       abaa
                       4321 
                         ^
                         mismatch at position i = 2 of pattern at 'b' in text
                        shift = max{ G[2], B['b'] - 2 + 1} = max{1, 1} = 1                      
                       
    012345678901234567890123456
    abababaxaaaaaxaabbaaxbaabaa
                        abaa
                        4321                       
                        ^
                        mismatch at position i = 4 of pattern at 'x' in text
                        shift = max{ G[4], B['x'] - 4 + 1} = max{3, 1} = 3
                        
                        
    012345678901234567890123456
    abababaxaaaaaxaabbaaxbaabaa
                           abaa
                           4321   pattern found                       
    
    ---------------------------------------------------------------------------------------
    
    Exercise: Compute the B[] and G[] tables for the pattern "GCAGAGAG". To see the answer,
    run Boyer.class for this pattern. Note: read the table generated by the program in
    reverse order. That is, index 7 down to index 0 is the same as looking at G[1] up to G[8]
    described above.