Dept. of Computer Science, ETH, Zurich, Switzerland

Depto. de Ciencias de la Computación, Universidad de Chile, Casilla 2777, Santiago, Chile

Centre for the New OED and Text Research, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1

We survey new indices for text, with emphasis on PAT arrays (also called suffix arrays). A PAT array is an index based on a new model of text that does not use the concept of word and does not need to know the structure of the text.

Text searching methods may be classified as lexicographical indices (indices that are sorted), clustering techniques, and indices based on hashing. In this chapter we discuss two new lexicographical indices for text, called PAT trees and PAT arrays. Our aim is to build an index whose size is similar to, or smaller than, the size of the text itself.

Briefly, the traditional model of text used in information retrieval is that of a *set of documents*. Each document is assigned a list of *keywords* (attributes), with optional relevance *weights* associated to each keyword. This model is oriented to library applications, which it serves quite well. For more general applications, it has some problems, namely:

A basic structure is assumed (documents and words). This may be reasonable for many applications, but not for others.

Keywords must be extracted from the text (this is called "indexing"). This task is nontrivial and error-prone, whether it is done by a person or automatically by a computer.

Queries are restricted to keywords.

In contrast, the model used in this chapter assumes no structure in the text, although if a structure exists, it can be exploited.

This paper describes PAT trees and PAT arrays. PAT arrays are an efficient implementation of PAT trees, and support a query language more powerful than do traditional structures based on keywords and Boolean operations. PAT arrays were independently discovered by Gonnet (1987) and Manber and Myers (1990). Gonnet used them for the implementation of a fast text searching system, PAT^{TM} (Gonnet 1987; Fawcett 1989), used with the *Oxford English Dictionary* (*OED*). Manber and Myers' motivation was searching in large genetic databases. We will explain how to build and how to search PAT arrays.

The PAT tree is a data structure that allows very efficient searching with preprocessing. This section describes the PAT data structure, how to do some text searches, and algorithms to build two of its possible implementations. This structure was originally described by Gonnet (1983) in the paper "Unstructured Data Bases." In 1985 it was implemented and later used in conjunction with the computerization of the *Oxford English Dictionary*. The name of the implementation, the PAT^{TM} system, has become well known in its own right, as a software package for very fast string searching.

In what follows, we will use a very simple model of text. Our text, or text database, will consist of a single (possibly very long) array of characters, numbered sequentially from one onward. Whether the text is already presented as such, or whether it merely can be viewed as such, is not relevant: to apply our algorithms it is sufficient to be able to view the entire text as an array of characters. Each position in this array defines a *semi-infinite string*, or *sistring*: the sequence of characters that starts at that position and extends to the right as far as needed for any comparison.

Text Once upon a time, in a far away land . . .

sistring 1 Once upon a time . . .

sistring 2 nce upon a time . . .

sistring 8 on a time, in a . . .

sistring 11 a time, in a far . . .

sistring 22 a far away land . . .

For example, the above sistrings compare as follows: sistring 22 is less than sistring 11 (since "a far" precedes "a time"), and sistring 2 is less than sistring 8 (since "nce" precedes "on").
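As a small illustration (a sketch of our own, not part of the original system), sistrings can be modeled in Python as ordinary string slices, and comparing them is plain lexicographical comparison:

```python
# A sistring (semi-infinite string) starting at position i is the text read
# from position i onward.  Positions are 1-based, as in the examples above.
text = "Once upon a time, in a far away land"

def sistring(i):
    """Return the sistring beginning at 1-based position i."""
    return text[i - 1:]

# sistring(22) is "a far away land", which precedes sistring(11),
# "a time, in a far away land", under lexicographical comparison.
```

In a real index only the starting positions are stored; the slices here are materialized purely for illustration.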

A PAT tree is a Patricia tree (Morrison 1968; Knuth 1973; Flajolet and Sedgewick 1986; and Gonnet 1988) constructed over all the possible sistrings of a text. A Patricia tree is a digital tree where the individual bits of the keys are used to decide on the branching. A zero bit will cause a branch to the left subtree, a one bit will cause a branch to the right subtree. Hence Patricia trees are binary digital trees. In addition, Patricia trees have in each internal node an indication of which bit of the query is to be used for branching. This may be given by an absolute bit position, or by a count of the number of bits to *skip*. This allows internal nodes with single descendants to be eliminated, and thus all internal nodes of the tree produce a useful branching, that is, both subtrees are non-null. Patricia trees are very similar to compact suffix trees or compact position trees (Aho et al. 1974).

So far we have assumed that every position in the text is indexed, that is, the Patricia tree has *n* external nodes, one for each position in the text. For some types of search this is desirable, but in some other cases not all points are necessary or even desirable to index. For example, if we are interested in word and phrase searches, then only those sistrings that are at the beginning of words (about 20% of the total for common English text) are necessary. The decision of how many sistrings to include in the tree is application dependent, and will be a trade-off between size of the index and search requirements.
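To make the word-based trade-off concrete, here is a minimal Python sketch (the word-boundary rule `\w+` is our own choice, not prescribed by the text) that keeps only the index points at the beginning of words:

```python
import re

text = "Once upon a time, in a far away land"

# Index only sistrings that begin at a word: one index point per word,
# rather than one index point per character position (1-based positions).
index_points = [m.start() + 1 for m in re.finditer(r"\w+", text)]

# For common English text this keeps roughly one point in five,
# i.e., about 20% of all positions.
```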

In this section we will describe some of the algorithms for text searching when we have a PAT tree of our text.

Notice that every subtree of the PAT tree has all the sistrings with a given prefix, by construction. Then prefix searching in a PAT tree consists of searching the prefix in the tree up to the point where we exhaust the prefix or up to the point where we reach an external node. At this point we need to verify whether we could have skipped bits. This is done with a single comparison of any of the sistrings in the subtree (considering an external node as a subtree of size one). If this comparison is successful, then all the sistrings in the subtree (which share the common prefix) are the answer; otherwise there are no sistrings in the answer.

We define a proximity search as finding all places where a string *S*_{1} is at most a fixed (given by the user) number of characters away from another string *S*_{2}. The simplest algorithm for this type of query is to search for *S*_{1} and *S*_{2}. Then, sort by position the smaller of the two answers. Finally, we traverse the unsorted answer set, searching every position in the sorted set, and checking if the distance between positions (and order, if we always want *S*_{1} before *S*_{2}) satisfies the proximity condition. If *m*_{1} and *m*_{2} (*m*_{1} < *m*_{2}) are the respective answer set sizes, this algorithm requires *O*((*m*_{1} + *m*_{2}) log *m*_{1}) time. This is not very appealing if *m*_{1} or *m*_{2} is *O*(*n*). For the latter case (when one of the strings *S*_{1} or *S*_{2} is small), better solutions based on PAT arrays have been devised by Manber and Baeza-Yates (1991).
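The algorithm above can be sketched in Python as follows; `proximity` is a hypothetical helper that takes the two answer sets as lists of character positions and reports the pairs in which *S*_{1} occurs at most *k* characters before *S*_{2}:

```python
from bisect import bisect_left, bisect_right

def proximity(pos1, pos2, k):
    """Report pairs (p1, p2) with p1 a position of S1, p2 a position of
    S2, and 0 <= p2 - p1 <= k.  The smaller answer set is sorted, and
    each position of the larger set probes it by binary search."""
    small_is_s1 = len(pos1) <= len(pos2)
    small, large = (pos1, pos2) if small_is_s1 else (pos2, pos1)
    small = sorted(small)
    result = []
    for p in large:
        if small_is_s1:
            # positions q of S1 with p - k <= q <= p
            lo, hi = bisect_left(small, p - k), bisect_right(small, p)
        else:
            # positions q of S2 with p <= q <= p + k
            lo, hi = bisect_left(small, p), bisect_right(small, p + k)
        for q in small[lo:hi]:
            result.append((q, p) if small_is_s1 else (p, q))
    return sorted(result)
```

Sorting the smaller set costs O(m1 log m1) and each of the m2 probes costs O(log m1), matching the stated bound.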

Searching for all the strings within a certain range of values (lexicographical range) can be done equally efficiently. More precisely, range searching is defined as searching for all strings that lexicographically compare between two given strings. For example, the range "abc" . . "acc" will contain strings like "abracadabra," "acacia," "aboriginal," but not "abacus" or "acrimonious."
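As a toy, in-memory analogue of range searching over a sorted sistring array (the word list and the sentinel trick are our own illustration, not the system's implementation):

```python
import re
from bisect import bisect_left, bisect_right

text = "acacia abracadabra abacus aboriginal acrimonious"

# A PAT-style array over word-start sistrings, sorted lexicographically.
points = sorted((m.start() for m in re.finditer(r"\w+", text)),
                key=lambda i: text[i:])
keys = [text[i:] for i in points]

# Range search "abc" .. "acc": binary search for both endpoints.  The
# sentinel "\xff" makes the upper bound include any sistring whose
# first three characters are exactly "acc".
left = bisect_left(keys, "abc")
right = bisect_right(keys, "acc" + "\xff")
answer = sorted(text[i:].split()[0] for i in points[left:right])
```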

The longest repetition of a text is defined as the match between two different positions of a text where this match is the longest (the greatest number of characters) in the entire text. For a given text, the longest repetition is given by the tallest internal node in the PAT tree; that is, the tallest internal node gives a pair of sistrings that match for the greatest number of characters. Here, tallest refers not only to the shape of the tree but must also take the skipped bits into account. For a given text, the longest repetition can be found while building the tree, and it is a constant; that is, it will not change unless we change the tree (that is, the text).
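In the array representation the same idea becomes: the longest repetition is the longest common prefix of two lexicographically adjacent sistrings. A minimal Python sketch (quadratic, for illustration only):

```python
def longest_repetition(text):
    """Longest substring occurring at two different positions: the
    longest common prefix of adjacent entries of the sorted sistring
    (suffix) array."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    best = ""
    for a, b in zip(sa, sa[1:]):
        # longest common prefix of the two adjacent sistrings
        k = 0
        while (a + k < len(text) and b + k < len(text)
               and text[a + k] == text[b + k]):
            k += 1
        if k > len(best):
            best = text[a:a + k]
    return best
```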

This type of search has great practical interest, but is slightly difficult to describe. By "most significant" or "most frequent" we mean the most frequently occurring strings within the text database. For example, finding the "most frequent" trigram is finding a sequence of three letters that appears most often within our text.
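In a PAT tree or array, the frequency of a string is just the size of the subtree (or interval) it labels; a simple Python stand-in counts the n-grams directly:

```python
from collections import Counter

def most_frequent_ngram(text, n):
    """Return the most frequent n-character string and its count.  With
    a PAT array this would be the largest interval of sistrings sharing
    an n-character prefix; here we simply count every n-gram."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return counts.most_common(1)[0]
```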

The main steps of the algorithm due to Baeza-Yates and Gonnet (1989) are:

**a.** Convert the regular expression passed as a query into a minimized deterministic finite automaton (DFA), which may take exponential space/time with respect to the query size but is independent of the size of the text (Hopcroft and Ullman 1979).

**b.** Convert the character DFA into a binary DFA using any suitable binary encoding of the input alphabet; each state will then have at most two outgoing transitions, one labeled 0 and one labeled 1.

**c.** Simulate the binary DFA on the binary digital trie of all sistrings of the text, using the same binary encoding as in step b. That is, associate the root of the tree with the initial state and, for any internal node associated with state i, associate its left descendant with state j if there is a transition from i to j on a 0 bit, and associate its right descendant with state k if there is a transition from i to k on a 1 bit (see Figure 5.3).

The implementation of PAT trees using conventional Patricia trees is the obvious one, except for some implementation details which cannot be overlooked, as they would increase the size of the index or its accessing time quite dramatically. Since this type of index is typically built over very large text databases, and the size of the index tree is linear in the size of the text, an economical implementation is mandatory.

The main ideas to solve (or alleviate) both problems are

**a. **bucketing of external nodes, and

**b. **mapping the tree onto the disk using supernodes.

The previous implementation has a parameter, the external node bucket size, *b*. When a search reaches a bucket, we have to scan all the external nodes in the bucket to determine which, if any, satisfy the search. If the bucket is too large, these costs become prohibitive.

There is a straightforward argument that shows that these arrays contain most of the information we had in the Patricia tree at the cost of a factor of log_{2} *n.* The argument simply says that for any interval in the array which contains all the external nodes that would be in a subtree, in log_{2} *n* comparisons in the worst case we can divide the interval according to the next bit which is different. The sizes of the subtrees are trivially obtained from the limits of any portion of the array, so the only information that is missing is the longest-repetition bit, which is not possible to represent without an additional structure. Any operation on a Patricia tree can be simulated in *O(*log *n*) accesses. Figure 5.4 shows this structure.

It turns out that it is not necessary to simulate the Patricia tree for prefix and range searching, and we obtain an algorithm which is *O*(log *n*) instead of *O*(log^{2} *n*) for these operations. Actually, prefix searching and range searching become more uniform: both can be implemented by doing an indirect binary search over the array, with the results of the comparisons being less than, equal (or included, in the case of range searching), and greater than. In this way, the searching takes at most 2 log_{2} *n* - 1 comparisons and 4 log_{2} *n* disk accesses (in the worst case).
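A sketch of the indirect binary search (0-based positions; `prefix_search` is a hypothetical helper name). Comparisons look at only the first `len(prefix)` characters of each sistring, so each probe decides less than, equal, or greater than:

```python
def prefix_search(text, pat, prefix):
    """Indirect binary search over the PAT array `pat` (positions sorted
    by their sistrings).  Returns (low, high) such that pat[low:high]
    are exactly the positions whose sistrings start with `prefix`."""
    def key(i):
        return text[i:i + len(prefix)]

    lo, hi = 0, len(pat)          # find the leftmost sistring >= prefix
    while lo < hi:
        mid = (lo + hi) // 2
        if key(pat[mid]) < prefix:
            lo = mid + 1
        else:
            hi = mid
    left = lo

    hi = len(pat)                 # find the leftmost sistring > prefix
    while lo < hi:
        mid = (lo + hi) // 2
        if key(pat[mid]) <= prefix:
            lo = mid + 1
        else:
            hi = mid
    return left, lo
```

Two binary searches of about log_{2} *n* probes each match the stated 2 log_{2} *n* - 1 comparison bound up to a constant.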

If a portion of the text is small enough to fit in main memory together with its PAT array, then this process can be done very efficiently as it is equivalent to string sorting. Note that here we are talking about main memory; if paging is used to simulate a larger memory, the random access patterns over the text will certainly cause severe memory thrashing.

Quicksort is an appropriate algorithm for this building phase since it has an almost sequential pattern of access over the sorted file. To make the most of main memory, we can keep all of the index array on external storage and apply external quicksort (Gonnet 1984, section 4.4.5) indirectly over the text. In this case it is possible to build an index for any text which, together with the program, can fit in main memory. With today's memory sizes, this is not a case to ignore. This is the algorithm of choice for small files, and also as a building block for other algorithms.
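For a text that fits in main memory, the whole construction reduces to sorting the index points by their sistrings. A deliberately naive Python sketch (the `text[i:]` key materializes each suffix, which a real implementation would avoid):

```python
def build_pat_array(text):
    """Build a PAT array (suffix array) for an in-memory text by sorting
    the sistring starting positions lexicographically (0-based)."""
    return sorted(range(len(text)), key=lambda i: text[i:])
```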

A second case that can be solved efficiently is the case of merging two indices (to produce a single one) when the text plus twice the index of one of them fits in main memory. This algorithm is not trivial and deserves a short explanation. The text of the small file together with a PAT array for the small file (of size *n*_{1}) plus an integer array of size *n*_{1} + 1 are kept in main memory. The integer array is used to count how many sistrings of the big file fall between each pair of index points in the small file (see Figure 5.5). To do this counting, the large file is read sequentially and each sistring is searched in the PAT array of the small file until it is located between a pair of points in the index. The corresponding counter is incremented. This step will require *O(n*_{2} log *n*_{1}) comparisons and *O(n*_{2}) characters to be read sequentially. Once the counting is finished, the merging takes place by reading the PAT array of the large file and inserting the PAT array of the small file guided by the counts (see Figure 5.6). This will require a sequential reading of *n*_{1}+ *n*_{2} words. In total, this algorithm performs a linear amount of sequential input and output and *O(n*_{2} log *n*_{1}) internal work, and its behavior is not only acceptable but exceptionally good.
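The counting-then-merging scheme can be sketched as follows (a toy, fully in-memory version; in the real algorithm the large file's sistrings are streamed from disk in both passes, and the function name is our own):

```python
from bisect import bisect_right

def merge_pat_arrays(small_text, small_pat, large_sistrings, large_pat):
    """Merge a small PAT array (whose text fits in memory) with a large one.
    Pass 1: count how many sistrings of the large file fall between each
    pair of index points of the small file (n1 + 1 counters).
    Pass 2: interleave the two sorted arrays guided by the counts."""
    keys = [small_text[i:] for i in small_pat]      # sorted, in memory
    counts = [0] * (len(small_pat) + 1)
    for s in large_sistrings:                       # sequential scan
        counts[bisect_right(keys, s)] += 1

    merged, j = [], 0
    for gap, entry in zip(counts, list(small_pat) + [None]):
        merged.extend(large_pat[j:j + gap])         # large-file entries
        j += gap
        if entry is not None:                       # then a small-file entry
            merged.append(entry)
    return merged
```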

The fact that comparing sistrings through random access to disk is so expensive is a consequence of two phenomena. First, the reading itself requires some (relatively slow) physical movement. Second, the amount of text actually used is a small fraction of the text made available by an I/O operation.

**c.** a list of satisfied requests

We have seen an algorithm for merging a large file against a small file, small meaning that it fits, with its index, in memory. We may also need to merge two or more large indices. We can do this by reading the text associated with each key and comparing. If we have *n* keys and *m* files, and use a heap to organize the current keys for each file being merged, this gives us *O*(*n* log *m*) comparisons. More importantly, it gives us *n* random disk accesses to fetch key values from the text. If *n* is 100,000,000 and we can do 50 random disk accesses per second, then this approach will take 2,000,000 seconds or about 23 days to merge the files. A second problem inherent in sistrings is that we do not know how many characters will be necessary to compare the keys. This will be addressed later.
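The heap-driven merge itself is standard; what dominates is the random access needed to fetch each key from the text. A sketch with in-memory runs:

```python
import heapq

def multiway_merge(runs):
    """Merge m sorted runs with a heap: O(n log m) comparisons.  In the
    external setting, producing each key means a random disk access into
    the text, which is what makes this approach so slow for large n."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        item, i, j = heapq.heappop(heap)
        out.append(item)
        if j + 1 < len(runs[i]):                # advance within run i
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out
```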

*memory size × number of files = temporary disk space*

Another improvement in this situation can be made by noticing that lexicographically adjacent keys will share common prefixes of some length. We can use "stemming" to reduce the data written out for each key, increasing the number of keys that can be merged in each pass.
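This prefix-sharing idea is essentially front coding: store, for each key, the length of the prefix shared with its predecessor plus the differing tail. A minimal sketch (function names are our own):

```python
def front_code(keys):
    """Compress a sorted list of keys: each entry becomes (length of the
    prefix shared with the previous key, remaining tail)."""
    coded, prev = [], ""
    for k in keys:
        p = 0
        while p < min(len(k), len(prev)) and k[p] == prev[p]:
            p += 1
        coded.append((p, k[p:]))
        prev = k
    return coded

def front_decode(coded):
    """Invert front_code, recovering the original key list."""
    keys, prev = [], ""
    for p, tail in coded:
        prev = prev[:p] + tail
        keys.append(prev)
    return keys
```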

FLAJOLET, P., and R. SEDGEWICK. 1986. "Digital Search Trees Revisited." *SIAM J. Computing*, 15: 748-67.

GONNET, G. 1984. *Handbook of Algorithms and Data Structures*. London: Addison-Wesley.

HOPCROFT, J., and J. ULLMAN. 1979. *Introduction to Automata Theory, Languages, and Computation*. Reading, Mass.: Addison-Wesley.