Department of Computer Science, University of Chile, Casilla 2777, Santiago, Chile

String searching is an important component of many problems, including text editing, data retrieval, and symbol manipulation. Even when indices are built to search large amounts of text, string searching is still useful in an information retrieval system: for example, to filter potential matches or to locate the retrieval terms that will be highlighted in the output.

The string searching or string matching problem consists of finding all occurrences (or the first occurrence) of a pattern in a text, where the pattern and the text are strings over some alphabet. We are interested in reporting all the occurrences. It is well known that to search for a pattern of length *m* in a text of length *n* (where *n* > *m*) the search time is *O*(*n*) in the worst case (for fixed *m*). Moreover, in the worst case, at least *n* - *m* + 1 characters must be inspected. This result is due to Rivest (1977). However, for different algorithms the constant in the linear term can be very different. For example, in the worst case, the constant multiple in the naive algorithm is *m*, whereas for the Knuth-Morris-Pratt (1977) algorithm it is two.

We use the following notation:

*m*: the length of the pattern (string)

*n*: the length of the text (*n* > *m*)

*c*: the size of the underlying alphabet

The expected number of matches of a random pattern of length *m* in a random text of length *n* is (*n* - *m* + 1)/*c*^*m*, since each of the *n* - *m* + 1 possible alignments matches with probability *c*^-*m*.

The naive, or brute force, algorithm is the simplest string matching method. The idea consists of trying to match every substring of length *m* in the text with the pattern (see Figure 10.1).

naivesearch( text, n, pat, m ) /* Search pat[1..m] in text[1..n] */
char text[], pat[];
int n, m;
{
    int i, j, k, lim;

    lim = n-m+1;
    for( i = 1; i <= lim; i++ ) /* Search */
    {
        k = i;
        for( j=1; j<=m && text[k] == pat[j]; j++ ) k++;
        if( j > m ) Report_match_at_position( i );
    }
}

On average, only a few characters of the text are inspected at each position before a mismatch occurs, so the expected number of comparisons is linear in *n* with a small constant (roughly *c*/(*c* - 1) comparisons per text character). This is drastically different from the worst case *mn*.

The classic Knuth, Morris, and Pratt (1977) algorithm, discovered in 1970, is the first algorithm for which the constant factor in the linear term, in the worst case, does not depend on the length of the pattern. It is based on preprocessing the pattern in time *O*(*m*). In fact, the number of character comparisons performed by this algorithm (search time only) is bounded by 2*n* in the worst case.

The preprocessing consists of computing a table *next*, where

*next*[*j*] = max{*i* | (*pattern*[*k*] = *pattern*[*j* - *i* + *k*] for *k* = 1, . . . , *i* - 1) and *pattern*[*i*] ≠ *pattern*[*j*]}

with *next*[*j*] = 0 when no such *i* exists.

*Example 1* The *next* table for the pattern abracadabra (position *m* + 1 = 12 holds the value used to resume the search after a match) is

j       : 1 2 3 4 5 6 7 8 9 10 11 12
pattern : a b r a c a d a b r  a
next[j] : 0 1 1 0 2 0 2 0 1 1  0  5

kmpsearch( text, n, pat, m ) /* Search pat[1..m] in text[1..n] */
char text[], pat[];
int n, m;
{
    int j, k, resume;
    int next[MAX_PATTERN_SIZE];

    pat[m+1] = CHARACTER_NOT_IN_THE_TEXT;
    initnext( pat, m+1, next ); /* Preprocess pattern */
    resume = next[m+1];
    j = k = 1;
    do { /* Search */
        if( j==0 || text[k]==pat[j] )
        {
            k++; j++;
        }
        else j = next[j];
        if( j > m )
        {
            Report_match_at_position( k-m );
            j = resume;
        }
    } while( k <= n );
    pat[m+1] = END_OF_STRING;
}

initnext( pat, m, next ) /* Preprocess pattern of length m */
char pat[];
int m, next[];
{
    int i, j;

    i = 1; j = next[1] = 0;
    do
    {
        if( j == 0 || pat[i] == pat[j] )
        {
            i++; j++;
            if( pat[i] != pat[j] ) next[i] = j;
            else next[i] = next[j];
        }
        else j = next[j];
    } while( i <= m );
}

Also in 1977, the other classic algorithm was published by Boyer and Moore (1977). Their main idea is to compare the pattern against the text from right to left. With this scheme, not every character of the text has to be inspected, so on average the search is sublinear. When a mismatch occurs, the match heuristic shifts the pattern to the nearest position that will:

**1. **match *all* the characters previously matched, and

**2.** bring a *different* character to the position in the text that caused the mismatch.

*Example 2* The *dd* hat table for the pattern abracadabra is

j        :  1  2  3  4  5  6  7  8  9 10 11
pattern  :  a  b  r  a  c  a  d  a  b  r  a
ddhat[j] : 17 16 15 14 13 12 11 13 12  4  1

*d*[*x*] = min{*s* | *s* = *m* or (0 ≤ *s* < *m* and *pattern*[*m* - *s*] = *x*)}

bmsearch( text, n, pat, m ) /* Search pat[1..m] in text[1..n] */
char text[], pat[];
int n, m;
{
    int k, j, skip;
    int dd[MAX_PATTERN_SIZE], d[MAX_ALPHABET_SIZE];

    initd( pat, m, d ); /* Preprocess the pattern */
    initdd( pat, m, dd );
    k = m; skip = dd[1] + 1;
    while( k <= n ) /* Search */
    {
        j = m;
        while( j>0 && text[k] == pat[j] )
        {
            j--; k--;
        }
        if( j == 0 )
        {
            Report_match_at_position( k+1 );
            k += skip;
        }
        else k += max( d[text[k]], dd[j] );
    }
}

*Example 3* The *d* table for the pattern abracadabra is

d['a'] = 0 d['b'] = 2 d['c'] = 6 d['d'] = 4 d['r'] = 1

and the value for any other character is 11.

initd( pat, m, d ) /* Preprocess pattern of length m : d table */
char pat[];
int m, d[];
{
    int k;

    for( k=0; k < MAX_ALPHABET_SIZE; k++ ) d[k] = m;
    for( k=1; k<=m; k++ ) d[pat[k]] = m-k;
}

initdd( pat, m, dd ) /* Preprocess pattern of length m : dd hat table */
char pat[];
int m, dd[];
{
    int j, k, t, t1, q, q1;
    int f[MAX_PATTERN_SIZE+1];

    for( k=1; k<=m; k++ ) dd[k] = 2*m-k;
    for( j=m, t=m+1; j > 0; j--, t-- ) /* set up the dd hat table */
    {
        f[j] = t;
        while( t <= m && pat[j] != pat[t] )
        {
            dd[t] = min( dd[t], m-j );
            t = f[t];
        }
    }
    q = t; t = m + 1 - q; q1 = 1; /* Rytter's correction */
    for( j=1, t1=0; j <= t; t1++, j++ )
    {
        f[j] = t1;
        while( t1 >= 1 && pat[j] != pat[t1] ) t1 = f[t1];
    }
    while( q < m )
    {
        for( k=q1; k<=q; k++ ) dd[k] = min( dd[k], m+q-k );
        q1 = q + 1; q = q + t - f[t]; t = f[t];
    }
}

A simplified version of the Boyer-Moore algorithm (simplified-Boyer-Moore, or SBM, algorithm) is obtained by using only the occurrence heuristic. The main reason behind this simplification is that, in practice, patterns are not periodic. Also, the extra space needed decreases from *O*(*m* + *c*) to *O*(*c*). That is, the space depends only on the size of the alphabet (almost always fixed) and not on the length of the pattern (variable). For the same reason, it does not make sense to write a simplified version that uses Galil's improvement because we need *O*(*m*) space to compute the length of the overlapping characters. Of course, the worst case is now *O*(*mn*), but it will be faster on the average.

Horspool (1980) presented a simplification of the Boyer-Moore algorithm, and based on empirical results showed that this simpler version is as good as the original Boyer-Moore algorithm. Moreover, the same results show that for almost all pattern lengths this algorithm is better than algorithms that use a hardware instruction to find the occurrence of a designated character.

*d*[*x*] = min{*s* | *s* = *m* or (1 ≤ *s* < *m* and *pattern*[*m* - *s*] = *x*)}

*Example 4* The *d* table of Horspool's algorithm for the pattern abracadabra is

d['a'] = 3 d['b'] = 2 d['c'] = 6 d['d'] = 4 d['r'] = 1

and the value for any other character is 11.

bmhsearch( text, n, pat, m ) /* Search pat[1..m] in text[1..n] */
char text[], pat[];
int n, m;
{
    int d[MAX_ALPHABET_SIZE], i, j, k, lim;

    for( k=0; k<MAX_ALPHABET_SIZE; k++ ) d[k] = m+1; /* Preprocessing */
    for( k=1; k<=m; k++ ) d[pat[k]] = m+1-k;
    pat[m+1] = CHARACTER_NOT_IN_THE_TEXT; /* To avoid having code */
                                          /* for special case n-k+1=m */
    lim = n-m+1;
    for( k=1; k <= lim; k += d[text[k+m]] ) /* Searching */
    {
        i=k;
        for( j=1; text[i] == pat[j]; j++ ) i++;
        if( j == m+1 ) Report_match_at_position( k );
    }
    /* restore pat[m+1] if necessary */
}

The main idea is to represent the state of the search as a number, and is due to Baeza-Yates and Gonnet (1989). Each search step costs a small number of arithmetic and logical operations, provided that the numbers are large enough to represent all possible states of the search. Hence, for small patterns (no longer than the word size of the machine) we have an *O*(*n*) time algorithm using *O*(|Σ|) extra space and *O*(*m* + |Σ|) preprocessing time, where Σ denotes the alphabet.

The main properties of the shift-or algorithm are:

Real time: the time delay to process one text character is bounded by a constant.

No buffering: the text does not need to be stored.

To update the state after reading a new character in the text, we compute

state = (state << 1) or T[curr char]

where << denotes the shift left operation.

The definition of the table *T* is as follows: bit *i* of *T*[*x*] (counting from the right, starting at 1) is 0 if and only if *pattern*[*i*] = *x*. For the pattern ababc we have

T[a] = 11010 T[b] = 10101 T[c] = 01111 T[d] = 11111

and searching the text abdabababc produces the following states:

text : a     b     d     a     b     a     b     a     b     c
T[x] : 11010 10101 11111 11010 10101 11010 10101 11010 10101 01111
state: 11110 11101 11111 11110 11101 11010 10101 11010 10101 01111

The leftmost bit of the state becomes 0 after reading the last character, signaling a match ending there: the pattern occurs at position 6.

MAXSYM: size of the alphabet. For example, 128 for ASCII code.

WORD: word size in bits (32 in our case).

B: number of bits per individual state; in this case, one.

sosearch( text, n, pat, m ) /* Search pat[1..m] in text[1..n] */
register char *text;
char pat[];
int n, m;
{
    register char *end;
    register unsigned int state, lim;
    unsigned int T[MAXSYM], i, j;
    char *start;

    if( m > WORD )
        Abort( "Use pat size <= word size" );
    for( i=0; i<MAXSYM; i++ ) T[i] = ~0; /* Preprocessing */
    for( lim=0, j=1, i=1; i<=m; lim |= j, j <<= B, i++ )
        T[pat[i]] &= ~j;
    lim = ~(lim >> B);
    text++; end = text+n+1; /* Search */
    state = ~0; /* Initial state */
    for( start=text; text < end; text++ )
    {
        state = (state << B) | T[*text]; /* Next state */
        if( state < lim ) Report_match_at_position( text-start-m+2 );
    }
}

The search loop can be made faster by reverting to a simple scan for the first character of the pattern whenever the state returns to its initial value (no partial match is alive):

initial = ~0; first = pat[1]; start = text; /* Search */
do {
    state = initial;
    do {
        state = (state << B) | T[*text]; /* Next state */
        if( state < lim ) Report_match_at_position( text-start-m+2 );
        text++;
    } while( state != initial );
    while( text < end && *text != first ) /* Scan */
        text++;
} while( text < end );

The shift-or algorithm can be extended to patterns in which each position matches, instead of a single character:

a set of characters (for example, match a vowel),

a "don't care" symbol (match any character), or

the complement of a set of characters.

A different approach to string searching is to use hashing techniques, as suggested by Harrison (1971). All that is necessary is to compute the signature function of each possible *m*-character substring in the text and check if it is equal to the signature function of the pattern.

rksearch( text, n, pat, m ) /* Search pat[1..m] in text[1..n] */
char text[], pat[]; /* (0 < m <= n) */
int n, m;
{
    int h1, h2, dM, i, j;

    dM = 1;
    for( i=1; i<m; i++ ) dM = (dM << D) % Q; /* Compute the signature */
    h1 = h2 = 0;                             /* of the pattern and of */
    for( i=1; i<=m; i++ )                    /* the beginning of the */
    {                                        /* text */
        h1 = ((h1 << D) + pat[i] ) % Q;
        h2 = ((h2 << D) + text[i] ) % Q;
    }
    for( i = 1; i <= n-m+1; i++ ) /* Search */
    {
        if( h1 == h2 ) /* Potential match */
        {
            for( j=1; j<=m && text[i-1+j] == pat[j]; j++ ); /* check */
            if( j > m ) /* true match */
                Report_match_at_position( i );
        }
        h2 = (h2 + (Q << D) - text[i]*dM ) % Q; /* update the signature */
        h2 = ((h2 << D) + text[i+m] ) % Q;      /* of the text */
    }
}

With these changes, the evaluation of the signature at every step (see Figure 10.10) is

h2 = h2*D - text[i]*dM + text[i+m]; /* update the signature value */

BAEZA-YATES, R. 1989b. "Improved String Searching." *Software-Practice and Experience*, 19(3), 257-71.

BAEZA-YATES, R., and G. GONNET. 1989. "A New Approach to Text Searching," in *Proc. 12th ACM SIGIR Conference*, pp. 168-75.

BOYER, R., and S. MOORE. 1977. "A Fast String Searching Algorithm." *CACM,* 20, 762-72.

FALOUTSOS, C. 1985. "Access Methods for Text." *ACM Computing Surveys*, 17, 49-74.

GALIL, Z. 1981. "String Matching in Real Time." *JACM*, 28, 134-49.

GALIL, Z., and J. SEIFERAS. 1983. "Time-Space-Optimal String Matching." *JCSS*, 26, 280-94.

HARRISON, M. 1971. "Implementation of the Substring Test by Hashing." *CACM*, 14, 777-79.

HOLLAAR, L. 1979. "Text Retrieval Computers." *IEEE Computer*, 12, 40-50.

HORSPOOL, R. N. 1980. "Practical Fast Searching in Strings." *Software-Practice and Experience*, 10, 501-06.

IYENGAR, S., and V. ALIA. 1980. "A String Search Algorithm." *Appl. Math. Comput*., 6, 123-31.

KNUTH, D., J. MORRIS, and V. PRATT. 1977. "Fast Pattern Matching in Strings." *SIAM J. on Computing*, 6, 323-50.

MCCREIGHT, E. 1976. "A Space-Economical Suffix Tree Construction Algorithm." *JACM*, 23, 262-72.

MEYER, B. 1985. "Incremental String Matching." *Inf. Proc. Letters*, 21, 219-27.

RIVEST, R. 1977. "On the Worst-Case Behavior of String-Searching Algorithms." *SIAM J. on Computing*, 6, 669-74.

SEDGEWICK, R. 1983. *Algorithms*. Reading, Mass.: Addison-Wesley.

SUNDAY, D. 1990. "A Very Fast Substring Search Algorithm." *CACM*, 33(8), 132-42.

TAKAOKA, T. 1986. "An On-Line Pattern Matching Algorithm." *Inf. Proc. Letters*, 22, 329-30.

WEINER, P. 1973. "Linear Pattern Matching Algorithm," in *FOCS*, vol. 14, pp. 1-11.