Depto. de Ciencias de la Computación, Universidad de Chile, Casilla 2777, Santiago, Chile

Information retrieval (IR) is a multidisciplinary field. In this chapter we study data structures and algorithms used in the implementation of IR systems. In this sense, many contributions from theoretical computer science have practical and regular use in IR systems.

The first section covers some basic concepts: strings, regular expressions, and finite automata. In section 2.3 we look at the three classical foundations for structuring data in IR: search trees, hashing, and digital trees. We give the main performance measures of each structure and the associated trade-offs. In section 2.4 we attempt to classify IR algorithms based on their actions. We distinguish three main classes of algorithms and give examples of their use: retrieval, indexing, and filtering algorithms.

The presentation level is introductory, and assumes some programming knowledge as well as some theoretical computer science background. We do not include code because it is given in most standard textbooks. For good C or Pascal code we suggest the Handbook of Algorithms and Data Structures by Gonnet and Baeza-Yates (1991).

We use Σ to denote the *alphabet* (a set of symbols). We say that the alphabet is *finite* if there exists a bound on the size of the alphabet, denoted by |Σ|. Otherwise, if we do not know a priori a bound on the alphabet size, we say that the alphabet is *arbitrary*. A *string* over an alphabet Σ is a finite-length sequence of symbols from Σ. The *empty string* (ε) is the string with no symbols. If *x* and *y* are strings, *xy* denotes the *concatenation* of *x* and *y*. If *w* = *xyz* is a string, then *x* is a *prefix*, and *z* a *suffix*, of *w*. The length of a string *x* (|*x*|) is the number of symbols of *x*. Any contiguous sequence of letters *y* from a string is called a *substring*. If the letters do not have to be contiguous, we say that *y* is a *subsequence*.

When manipulating strings, we need to know how similar two strings are. For this purpose, several *similarity measures* have been defined. Each similarity model is defined by a distance function *d*, which for any strings *s*_{1}, *s*_{2}, and *s*_{3} satisfies the following properties:

d(s_{1}, s_{1}) = 0,  d(s_{1}, s_{2}) ≥ 0,  d(s_{1}, s_{3}) ≤ d(s_{1}, s_{2}) + d(s_{2}, s_{3})

The two main distance functions are as follows:

The **Hamming distance** is defined over strings of the same length. The function *d* is defined as the number of symbols in the same position that are different (number of mismatches). For example, *d*(*text*, *that*) = 2.
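As a minimal sketch (the function name is ours, not from the chapter), the Hamming distance can be computed by counting mismatching positions:

```python
def hamming(s1: str, s2: str) -> int:
    """Number of positions where two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(a != b for a, b in zip(s1, s2))

print(hamming("text", "that"))  # → 2
```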

The **edit distance** is defined as the minimal number of symbols that must be inserted, deleted, or substituted to transform a string *s*_{1} into *s*_{2}. Clearly, *d*(*s*_{1}, *s*_{2}) ≥ |length(*s*_{1}) - length(*s*_{2})|. For example, *d*(*text*, *tax*) = 2.
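The edit distance is commonly computed with the classic dynamic-programming scheme; a sketch under our own naming, keeping only one row of the distance table:

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein edit distance via dynamic programming (one-row variant)."""
    m, n = len(s1), len(s2)
    dist = list(range(n + 1))        # dist[j] = distance between s1[:i] and s2[:j]
    for i in range(1, m + 1):
        prev, dist[0] = dist[0], i   # prev holds the diagonal cell
        for j in range(1, n + 1):
            cur = dist[j]
            if s1[i - 1] == s2[j - 1]:
                dist[j] = prev       # symbols match: no edit needed
            else:                    # substitution, deletion, or insertion
                dist[j] = 1 + min(prev, dist[j], dist[j - 1])
            prev = cur
    return dist[n]

print(edit_distance("text", "tax"))  # → 2
```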

We use the usual definition of regular expressions (REs for short), defined by the operations of concatenation, union (+), and star or Kleene closure (*) (Hopcroft and Ullman 1979). A *language* over an alphabet Σ is a set of strings over Σ. Let *L*_{1} and *L*_{2} be two languages. The language {*xy* | *x* ∈ *L*_{1} and *y* ∈ *L*_{2}} is called the *concatenation* of *L*_{1} and *L*_{2}.

We use *L*(*r*) to represent the *set of strings* in the language denoted by the regular expression *r*. The *regular expressions* over Σ and the languages that they denote (*regular sets* or *regular languages*) are defined recursively as follows:

∅ is a regular expression and denotes the empty set.

ε (the empty string) is a regular expression and denotes the set {ε}.

For each symbol *a* in Σ, *a* is a regular expression and denotes the set {*a*}.

We also use the following notational extensions: Σ to denote any symbol from the alphabet (when the ambiguity is clearly resolvable by context).

*r*? to denote zero or one occurrence of *r* (that is, *r*? = ε + *r*).

[*a*_{1} . . *a*_{m}] to denote a range of symbols from Σ. For this we need an order in Σ.

*r*^{k} to denote *rr* . . . *r* (*k* times; finite closure).

All the examples given here arise from the Oxford English Dictionary:

**1.** All citations to an author with prefix Scot followed within at most 80 arbitrary characters by a work with prefix Kenilw or Discov:

<A>Scot Σ^{≤80} <W>(Kenilw + Discov)

where < > are characters in the OED text that denote tags (A for author, W for work).

**2. **All "bl" tags (lemma in bold) containing a single word consisting of lowercase alphabetical only:

<bl>[a..z]*</bl>

**3.** All first citations accredited to Shakespeare between 1610 and 1611:

<EQ>(<LQ>)?<Q><D> 161(0+1)</D> <A>Shak

**4.** All references to author W. Scott:

<A>((Sirb)? W)?bScottb?</A>

where *b* denotes a literal space.
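The extended operators above have direct counterparts in conventional regex engines. As a rough illustration using Python's re module (whose syntax differs slightly from the chapter's notation), example 2 could be written as:

```python
import re

# Example 2 from the text, "<bl>[a..z]*</bl>", in re syntax:
# symbol ranges are written [a-z]; the tags are matched literally.
pattern = re.compile(r"<bl>[a-z]*</bl>")

print(bool(pattern.fullmatch("<bl>lemma</bl>")))   # → True
print(bool(pattern.fullmatch("<bl>Lemma</bl>")))   # → False (uppercase L)

# The other extensions map directly: r? keeps the same syntax,
# and the finite closure r^k is written r{k}.
assert re.fullmatch(r"(16){2}", "1616")
```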

A finite automaton is a mathematical model of a system. The automaton can be in any one of a finite number of states and is driven from state to state by a sequence of discrete inputs. Figure 2.1 depicts an automaton reading its input from a tape.

Formally, a finite automaton (FA) is defined by a tuple (*Q*, Σ, δ, *q*_{0}, *F*), where

*Q* is a finite set of *states*,

Σ is the input alphabet,

δ is the *transition function*, mapping a state and a symbol to a state (or to a set of states, in the nondeterministic case),

*q*_{0} ∈ *Q* is the *initial* state, and

*F* ⊆ *Q* is the set of *final* states.

A finite automaton starts in state *q*_{0} reading the input symbols from a tape. In one move, the FA in state *q* and reading symbol *a* enters state(s) δ(*q*, *a*), and moves the reading head one position to the right. If δ(*q*, *a*) ∈ *F*, we say that the FA has accepted the string written on its input tape up to the last symbol read. If δ(*q*, *a*) has a unique value for every *q* and *a*, we say that the FA is *deterministic* (DFA); otherwise we say that it is *nondeterministic* (NFA).
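A DFA is easy to simulate directly, since δ(q, a) is a single state. A minimal sketch (the function and the example automaton are ours): the transition function is a table, and the string is accepted if the run ends in a final state.

```python
def dfa_accepts(delta, q0, final, text):
    """Run a deterministic finite automaton over `text`.

    delta: dict mapping (state, symbol) -> state (the function delta)
    q0:    initial state
    final: set of final (accepting) states F
    """
    q = q0
    for a in text:
        if (q, a) not in delta:      # no transition defined: reject
            return False
        q = delta[(q, a)]
    return q in final

# Hypothetical DFA accepting binary strings with an even number of 1s.
delta = {(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 0}
print(dfa_accepts(delta, 0, {0}, "1001"))  # → True
print(dfa_accepts(delta, 0, {0}, "1011"))  # → False
```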

In this section we cover three basic data structures used to organize data: search trees, digital trees, and hashing. They are used not only for storing text in secondary memory, but also as components in searching algorithms (especially digital trees). We do not describe arrays, because they are a well-known structure that can be used to implement static search tables, bit vectors for set manipulation, suffix arrays (Chapter 5), and so on.

Some examples of their use in subsequent chapters of this book are:

**Search trees:** for optical disk files (Chapter 6), prefix B-trees (Chapter 3), stoplists (Chapter 7).

**Hashing:** hashing itself (Chapter 13), string searching (Chapter 10), associative retrieval, Boolean operations (Chapters 12 and 15), optical disk file structures (Chapter 6), signature files (Chapter 4), stoplists (Chapter 7).

**Digital trees:** string searching (Chapter 10), suffix trees (Chapter 5).

The most well-known search tree is the binary search tree. Each internal node contains a key; the left subtree stores all keys smaller than the parent key, while the right subtree stores all keys larger than the parent key. Binary search trees are adequate for main memory. However, for secondary memory, multiway search trees are better, because internal nodes are bigger. In particular, we describe a special class of balanced multiway search trees called B-trees.
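A minimal binary search tree sketch (names and structure are ours) showing the invariant just described: smaller keys go left, larger keys go right.

```python
class Node:
    """Binary search tree node: smaller keys go left, larger keys go right."""
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    """Insert a key, preserving the search-tree ordering; duplicates ignored."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def search(root, key):
    """Follow one root-to-leaf path, comparing against each node's key."""
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root is not None

root = None
for k in ["hash", "tree", "trie", "b-tree"]:
    root = insert(root, k)
print(search(root, "trie"), search(root, "heap"))  # → True False
```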

A B-tree of order *m* is defined as follows:

The root has between 2 and 2m keys, while all other internal nodes have between m and 2m keys.

All leaves are at the same depth.

Updates are done bottom-up. To insert a new record, we search for the insertion point. If there is not enough space in the corresponding leaf, we split it, and we promote a key to the previous level. The algorithm is applied recursively, up to the root, if necessary; in that case, the height of the tree increases by one. Splits provide a minimal storage utilization of 50 percent. Therefore, the height of the tree is at most approximately log_{m+1} *n*, where *n* is the number of keys.

To improve storage utilization, several overflow techniques exist. Some of them are:

B*-trees: in case of overflow, we first check whether a neighboring node has space. If so, a subset of the keys is shifted, avoiding the split. With this technique, 66 percent minimal storage utilization is provided. The main disadvantage is that updates are more expensive (Bayer and McCreight 1972; Knuth 1973).

Partial expansions: buckets of different sizes are used. If an overflow occurs, a bucket is expanded (if possible), or split. Using two bucket sizes of relative ratio 2/3, 66 percent minimal and 81 percent average storage utilization is achieved (Lomet 1987; Baeza-Yates and Larson 1989). This technique does not deteriorate update time.

Adaptive splits: two bucket sizes of relative ratios 1/2 are used. However, splits are not symmetric (balanced), and they depend on the insertion point. This technique achieves 77 percent average storage utilization and is robust against nonuniform distributions (low variance) (Baeza-Yates 1990).

A *hashing function* *h*(*x*) maps a key *x* to an integer in a given range (for example, 0 to *m* - 1). Hashing functions are designed to produce values uniformly distributed over the given range. For a good discussion of choosing hashing functions, see Ullman (1972), Knuth (1973), and Knott (1975). The hashing value is also called a *signature*.

A hashing function is used to map a set of keys to slots in a *hashing table*. If the hashing function gives the same slot for two different keys, we say that we have a *collision*. Hashing techniques mainly differ in how collisions are handled. There are two classes of collision resolution schemas: *open addressing* and *overflow addressing*.

In open addressing (Peterson 1957), the collided key is "rehashed" into the table by computing a new index value. The most widely used technique in this class is *double hashing*, which uses a second hashing function (Bell and Kaman 1970; Guibas and Szemeredi 1978). The main limitation of this technique is that when the table becomes full, some kind of reorganization must be done. Figure 2.4 shows a hashing table of size 13, and the insertion of a key using the hashing function *h*(*x*) = *x* mod 13 (this is only an example, and we do not recommend using this hashing function!).
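A sketch of double hashing using the example function h(x) = x mod 13; the secondary function h2 is our own illustrative choice (it must never be zero, and since 13 is prime any value in 1..12 steps through the whole table):

```python
TABLE_SIZE = 13                      # matches the example h(x) = x mod 13

def h1(x):
    return x % TABLE_SIZE

def h2(x):
    # Secondary hash for the probe step; nonzero by construction.
    return 1 + (x % (TABLE_SIZE - 1))

def insert(table, x):
    """Insert key x by double hashing; returns the slot used."""
    for i in range(TABLE_SIZE):
        slot = (h1(x) + i * h2(x)) % TABLE_SIZE
        if table[slot] is None:
            table[slot] = x
            return slot
    raise RuntimeError("table full: reorganization needed")

table = [None] * TABLE_SIZE
for key in (27, 14, 40):             # all three collide at slot 1
    insert(table, key)
print(table.index(27), table.index(14), table.index(40))  # → 1 4 6
```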

Because hashing "randomizes" the location of keys, a sequential scan in lexicographical order is not possible. Thus, ordered scanning or range searches are very expensive. More details on hashing can be found in Chapter 13.

Hashing schemes have also been used for secondary memory. The main difference is that tables have to grow dynamically as the number of keys increases. The main techniques are *extendible hashing* which uses hashing on two levels: a directory and a bucket level (Fagin et al. 1979), and linear hashing which uses an overflow area, and grows in a predetermined way (Litwin 1980; Larson 1980; Larson and Kajla 1984). For the case of textual databases, a special technique called *signature files* (Faloutsos 1987) is used most frequently. This technique is covered in detail in Chapter 4 of this book.

To improve search time on B-trees, and to allow range searches in hashing schemes, several hybrid methods have been devised. Among them, we should mention the *bounded disorder* method (Litwin and Lomet 1987), where B^{+}-tree buckets are organized as hashing tables.

Efficient prefix searching can be done using indices. One of the best indices for prefix searching is a binary digital tree or binary trie constructed from a set of substrings of the text. This data structure is used in several algorithms.

*Tries* are recursive tree structures that use the digital decomposition of strings to represent a set of strings and to direct the searching. Tries were invented by de la Briandais (1959) and the name was suggested by Fredkin (1960), from information re*trie*val. If the alphabet is ordered, we have a lexicographically ordered tree. The root of the trie uses the first character, the children of the root use the second character, and so on. If the remaining subtrie contains only one string, that string's identity is stored in an external node.
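A minimal trie sketch (our own representation, using nested dictionaries keyed by character), showing how the digital decomposition directs both insertion and prefix searching:

```python
def trie_insert(trie, word):
    """Insert a word into a dict-of-dicts trie; '$' marks end of word."""
    node = trie
    for ch in word:                  # descend one character per level
        node = node.setdefault(ch, {})
    node["$"] = True

def prefix_search(trie, prefix):
    """Return True if some stored word starts with `prefix`."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = {}
for w in ["text", "trie", "tree"]:
    trie_insert(trie, w)
print(prefix_search(trie, "tr"), prefix_search(trie, "ha"))  # → True False
```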

The *height* of a trie is the number of nodes in the longest path from the root to an external node. The length of any path from the root to an external node is bounded by the height of the trie. On average, the height of a trie is logarithmic for any square-integrable probability distribution (Devroye 1982). For a random uniform distribution (Regnier 1981), we have

*E*(*height*) ≈ 2 log_{2} *n*

for a binary trie containing *n* strings.

A *Patricia tree* (Morrison 1968) is a trie with the additional constraint that single-descendant nodes are eliminated. The name is an acronym for "Practical Algorithm To Retrieve Information Coded In Alphanumeric." A counter is kept in each node to indicate which is the next bit to inspect. Figure 2.6 shows the Patricia tree corresponding to the binary trie in Figure 2.5.

A trie built using the substrings (suffixes) of a string is also called a *suffix tree* (McCreight [1976] or Aho et al. [1974]). A variation of these is the *position tree* (Weiner 1973). Similarly, a Patricia tree built from the suffixes of a string is called a *compact suffix tree*.

It is hard to classify IR algorithms, and to draw a line between each type of application. However, we can identify three main types of algorithms, which are described below.

The main class of algorithms in IR is retrieval algorithms, that is, algorithms that extract information from a textual database. We can distinguish two types of retrieval algorithms, according to how much extra memory we need:

**Sequential scanning of the text:** the extra memory is in the worst case a function of the query size, not of the database size.

**Indexed text:** an index of the text is available and can be used to speed up the search; the extra memory is proportional to the database size.

The output of a query may be:

**1.** Whether the pattern occurs in the text.

**2.** The number of occurrences of the pattern in the text.

**3.** All the locations where the pattern occurs (the set of all possible values of *m*).

In general, the complexities of these problems are different.

In this class of algorithms, the text is the input and a processed or filtered version of the text is the output. This is a typical transformation in IR; for example, to reduce the size of a text, and/or standardize it to simplify searching.

The most common filtering/processing operations are:

Common words removed using a list of stopwords. This operation is discussed in Chapter 7.

Uppercase letters transformed to lowercase letters.

Special symbols removed and sequences of multiple spaces reduced to one space.

Numbers and dates transformed to a standard format (Gonnet 1987).

Spelling variants transformed using Soundex-like methods (Knuth 1973).

Word stemming (removing suffixes and/or prefixes). This is the topic of Chapter 8.
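Several of the operations above can be chained into one pass over the text. A small sketch (the stoplist is a tiny illustrative sample, and the regular expressions are our own) covering case folding, removal of special symbols, space normalization, and stopword removal:

```python
import re

STOPWORDS = {"the", "of", "and", "a", "to"}    # tiny illustrative stoplist

def filter_text(text):
    """Chain simple filtering operations over the input text."""
    text = text.lower()                        # uppercase -> lowercase
    text = re.sub(r"[^a-z0-9 ]", " ", text)    # drop special symbols
    text = re.sub(r" +", " ", text).strip()    # collapse multiple spaces
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)

print(filter_text("The Art of  Computer-Programming!"))
# → "art computer programming"
```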

The usual meaning of indexing is to build a data structure that will allow quick searching of the text, as we mentioned previously. There are many classes of indices, based on different retrieval approaches. For example, we have inverted files (Chapter 3), signature files (Chapter 4), tries (Chapter 5), and so on, as we have seen in the previous section. Almost all types of indices are based on some kind of tree or hashing. Perhaps the main exceptions are clustered data structures (this kind of indexing is called *clustering*), which are covered in Chapter 16, and the Directed Acyclic Word Graph (DAWG) of the text, which represents all possible subwords of the text using a linear amount of space (Blumer et al. 1985), and is based on finite automata theory.

Usually, before indexing, the text is filtered. Figure 2.7 shows the complete process for the text.

BAYER, R., and K. UNTERAUER. 1977. "Prefix B-trees." *ACM TODS*, 2(1), 11-26.

BELL, J., and C. KAMAN. 1970. "The Linear Quotient Hash Code." *CACM*, 13(11), 675-77.

DEVROYE, L. 1982. "A Note on the Average Depth of Tries." *Computing*, 28, 367-71.

FREDKIN, E. 1960. "Trie Memory." *CACM*, 3, 490-99.

GUIBAS, L., and E. SZEMEREDI. 1978. "The Analysis of Double Hashing." *JCSS*, 16(2), 226-74.

HOPCROFT, J., and J. ULLMAN. 1979. *Introduction to Automata Theory*. Reading, Mass.: Addison-Wesley.

KNOTT, G. D. 1975. "Hashing Functions." *Computer Journal*, 18(3), 265-78.

LARSON, P.-A. 1980. "Linear Hashing with Partial Expansions," in *VLDB*, vol. 6, pp. 224-32, Montreal.

MCCREIGHT, E. 1976. "A Space-Economical Suffix Tree Construction Algorithm." *JACM*, 23, 262-72.

PETERSON, W. 1957. "Addressing for Random-Access Storage." *IBM J. Res. Development*, 1(4), 130-46.

ULLMAN, J. 1972. "A Note on the Efficiency of Hashing Functions." *JACM*, 19(3), 569-75.

WEINER, P. 1973. "Linear Pattern Matching Algorithm," in *FOCS*, vol. 14, pp. 1-11.