Thinking Machines Corporation, 245 First Street, Cambridge, Massachusetts

Data parallel computers, such as the Connection Machine CM-2, can provide interactive access to text databases containing tens, hundreds, or even thousands of gigabytes of data. This chapter starts by presenting a brief overview of data parallel computing, a performance model of the CM-2, and a model of the workload involved in searching text databases. The remainder of the chapter discusses various algorithms used in information retrieval and gives performance estimates based on the data and processing models just presented. First, three algorithms are introduced for determining the N highest scores in a list of M scored documents. Next, the parallel signature file representation is described. Two document scoring algorithms are fully described; a sketch of a boolean query algorithm is also presented. The discussion of signatures concludes with consideration of false hit rates, data compression, secondary/tertiary storage, and the circumstances under which signatures should be considered. The final major section discusses inverted file methods. Two methods, parallel inverted files and partitioned posting files, are considered in detail. Finally, issues relating to secondary storage are briefly considered.

The algorithms presented in this chapter utilize the *data parallel* computing model proposed by Hillis and Steele (1986). In this model, there is a single program that controls a large number of processing elements, each of which will be performing the same operation at any moment. Parallelism is expressed by the creation of parallel data structures and the invocation of parallel operators. The model is made manifest in *data parallel programming languages*, such as the C* language developed by Thinking Machines Corporation (1990). The body of this section presents some basic data structures, operators, and notations which will be required to understand the algorithms presented in the remainder of the chapter. The section will conclude with a concise performance model for one implementation of the data parallel model (the Connection Machine model CM-2). More details on the architecture of this machine are provided by Hillis (1985) and Thinking Machines Corporation (1987).

C* includes all the usual C data structures and operations. These are called *scalar variables* and *scalar operations*.

P_1          8  6  3  4  9  1  2  0
P_2          7 14  8 29 17 34  1  9

P_array[0]   4 38 17 87 30 38 90 81
P_array[1]  37  3 56 39 89 10 10 38
P_array[2]   1 83 79 85 13 87 38 61

P_z = P_x * P_y;

This might result in the following data store:

P_x  1 1 1 2 2 2 3 3
P_y  1 2 3 1 2 3 1 2
---------------------
P_z  1 2 3 2 4 6 3 6

where (P_1 < P_2)
    P_min = P_1;
else
    P_min = P_2;

Everything mentioned up to this point involves the simple extension of scalar data structures and operations to vector data structures and element-wise vector operations. We will now consider some basic operations that involve operating on data spread across multiple positions; these are collectively referred to as *nonlocal operations.*

The simplest of these operations are the *global reduction* operations. These operations compute cumulative sums, cumulative minima/maxima, and cumulative bitwise AND/ORs across all active positions in a shape. The following unary operators are used to stand for the global reduction operators:

+= Cumulative sum

&= Cumulative bitwise AND

|= Cumulative bitwise OR

>?= Cumulative maximum

<?= Cumulative minimum

For example, the send operation [P_i]P_y = P_x moves each value of P_x to the position of P_y given by P_i:

P_x  5 0 6 4 1 7 3 2
P_i  7 4 1 2 5 0 6 3
---------------------
P_y  7 6 4 2 0 1 3 5

= Send with overwrite (arbitrary choice)

+= Send with add

&= Send with bitwise AND

|= Send with bitwise OR

<?= Send with minimum

>?= Send with maximum

These are binary forms of the global reduce operations introduced above.

^{3}These are referred to as the *send-reduce* operators.

P_x 2 0 1 2 4 3 2 1

scan_with_add (P_x) 2 2 3 5 9 12 14 15

Optionally, a Boolean flag (called a *segment flag*) may be supplied. Wherever this flag is equal to 1, it causes the running total to be reset to 0. For example:

P_x 2 0 1 2 4 3 2 1

B_s 1 0 0 0 1 0 1 0

scan_with_add(P_x, B_s) 2 2 3 5 4 7 2 3

We will now consider the performance of one parallel computer, the Connection Machine model CM-2. In scalar C, the various primitive operators such as + and = have fairly uniform time requirements. On parallel computers, however, different operators may have vastly differing time requirements. For example, adding two parallel variables is a purely local operation, and is very fast. Parallel left-indexing, on the other hand, involves moving data from one processor to another, and is two orders of magnitude slower.

In addition, any realization of this model must take into account the fact that a given machine has a finite number of processing elements and, if a shape becomes large enough, several positions will map to the same physical processor. We call the ratio of the number of positions in a shape to the number of physical processors the *virtual processing ratio* (VP ratio). As the VP ratio increases, each processor must do the work of several and, as a first approximation, a linear increase in running time will be observed.

The following symbols will be used:

N_procs   The number of physical processors

r         The VP ratio

sum_array(P_array)

{

P_result = 0;

for (i = 0; i < N; i++)

P_result += P_array[i];

return (+= P_result);

}

P = S

loop (N)

P += P

(+= P)

Operation   Calls   Time per Call
---------------------------------
P = S       1       3 + 15r
P += P      n       3 + 28r
(+= P)      1       137 + 70r
---------------------------------
Total (140 + 3n) + (85 + 28n)r

The following timing equations characterize the performance of the CM-2.^{4}

^{4}Throughout this chapter, all times are in microseconds unless noted otherwise.

Operator        Time        r = 1   Comments
---------------------------------------------------------
B = S           3 + 3r      6
B $= B          3 + 3r      6
B $$ B          3 + 3r      6
where(B)        8 + 2r      10
S = [S]P        16          16
[S]P = S        16          16
P = S           3 + 15r     18
P += S          3 + 28r     31
P += P          3 + 28r     31
P[P] = P        11 + 60r    71
P = P[P]        11 + 60r    71
P == S          18 + 67r    85      Same time for <=, etc.
(>?= P)         137 + 70r   207     Same time for +=, etc.
scan_with_or    632 + 56r   688
scan_with_add   740 + 170r  910
[P]P = P        2159r       2159    Same time for +=, etc.
[P]P[P] = P     2159r       2159    Same time for +=, etc.

18.3 A MODEL OF THE RETRIEVAL TASK

Retrieval consists of (1) scoring documents, and (2) determining which documents received the highest scores. This second step--ranking--will be considered first. Any of the ranking algorithms discussed below may be used in combination with any of the scoring algorithms which will be discussed later. It should be noted that, while scoring is probably the more interesting part of the retrieval process, ranking may be a large portion of the overall compute cost, and ranking algorithms are as deserving of careful design as are scoring algorithms.

The second case, assuming a VP ratio of one is used, requires an array of N_rows = N_docs/N_procs cells per processor.

We will assume there is a fast method for converting a parallel variable at a VP ratio of *r* to a parallel array having *N _{rows}* cells per processor. Such a function is, in fact, provided on the Connection Machine; it requires essentially zero time.

Many parallel computing systems provide a parallel ranking routine which, given a parallel integer, returns 0 for the largest integer, 1 for the next largest, and so forth.^{5} This may be used to solve the problem quite directly: one finds the rank of every score, then sends the score and document identifier to the position indexed by its rank. The first *N _{ret}* values are then read out.

^{5}On the CM-2 this routine takes time 30004*r*.

P_score     83 98  1 38 78 37 17 55
------------------------------------
rank         1  0  7  4  2  5  6  3
After send  98 83 78 55 38 37 17  1

rank_system(dest, P_doc_score, P_doc_id)
{
    P_rank = rank(P_doc_score);
    [P_rank]P_doc_score = P_doc_score;
    [P_rank]P_doc_id = P_doc_id;
    for (i = 0; i < N_RET; i++)
    {
        dest[i].score = [i]P_doc_score;
        dest[i].id = [i]P_doc_id;
    }
}

P = rank ()

[P]P = P

[P]P = P

loop (N_RET)

{

S = [S]P

S = [S]P

}

Its timing characteristics are:

Operation   Calls    Time per Call
----------------------------------
rank()      1        30004r
[P]P = P    2        2159r
S = [S]P    2N_ret   16
----------------------------------
Total 32N_ret + 34322r

Substituting the VP ratio r = N_docs/N_procs gives a time of 32N_ret + 34322(N_docs/N_procs).

The system-supplied ranking function does much more work than is really required: it ranks all *N _{docs}* scores, rather than finding only the *N _{ret}* highest scores actually needed.

P_score 83 98 1 38 78 37 17 55

P_score 83 -1 1 38 78 37 17 55

On the next iteration, 83 will be the highest-ranking score. The algorithm is as follows:

rank_iterative(dest, P_doc_score, P_doc_id)
{
    for (i = 0; i < N_RET; i++)
    {
        best_score = (>?= P_doc_score);
        where (P_doc_score == best_score)
        {
            position = (<?= P_position);
            dest[i].score = [position]P_doc_score;
            dest[i].id = [position]P_doc_id;
            [position]P_doc_score = -1;
        }
    }
}

loop (N_RET)

{

S = ( >?= P)

where (P == S)

{

S = ( <?= P)

S = [S]P

S = [S]P

[S]P = S

}

}

Operation   Calls    Time per Call
----------------------------------
(>?= P)     2N_ret   137 + 70r
P == S      N_ret    18 + 67r
where       N_ret    8 + 2r
S = [S]P    2N_ret   16
[S]P = S    N_ret    16
----------------------------------
Total 348N_ret + 209N_ret r

Substituting the VP ratio r = N_docs/N_procs gives a time of 348N_ret + 209N_ret(N_docs/N_procs).

The following algorithm, due to Jim Hutchinson (1988), improves on iterative extraction. Hutchinson's algorithm starts with an array of N_rows = N_docs/N_procs rows of scores, each viewed at a VP ratio of 1.

P_scores [0] 88 16 87 10 94 4 21 11

P_scores [1] 90 17 83 30 37 39 42 17

P_scores [2] 48 43 10 62 4 12 10 9

P_scores [3] 83 98 1 38 78 37 17 55

The maximum of each row is first extracted into P_best, leaving -1 in its place:

P_scores [0] 88 16 87 10 -1 4 21 11

P_scores [1] -1 17 83 30 37 39 42 17

P_scores [2] 48 43 10 -1 4 12 10 9

P_scores [3] 83 -1 1 38 78 37 17 55

--------------------------------------------

P_best 94 90 62 98 -1 -1 -1 -1

We then extract the best of the best (in this case 98):

P_scores [0] 88 16 87 10 94 4 21 11

P_scores [1] -1 17 83 30 37 39 42 17

P_scores [2] 48 43 10 -1 4 12 10 9

P_scores [3] 83 -1 1 38 78 37 17 55

--------------------------------------------

P_best 94 90 62 -1 -1 -1 -1 -1

and replenish it from the appropriate row (3 in this case).

P_scores [0] 88 16 87 10 94 4 21 11

P_scores [1] -1 17 83 30 37 39 42 17

P_scores [2] 48 43 10 -1 4 12 10 9

P_scores [3] -1 -1 1 38 78 37 17 55

--------------------------------------------

P_best 94 90 62 83 -1 -1 -1 -1

extract_step(P_best_score, P_best_id, P_scores, P_ids, row)
{
    max_score = (>?= P_scores[row]);
    where (P_scores[row] == max_score)
    {
        position = (<?= P_position);
        [row]P_best_score = [position]P_scores[row];
        [row]P_best_id = [position]P_ids[row];
        [position]P_scores[row] = -1;
    }
}

S = (>?= P)
where (P == S)
{
    S = (<?= P)
    [S]P = [S]P
    [S]P = [S]P
    [S]P = S
}

Operation   Calls   Time per Call
---------------------------------
(>?= P)     2       207
where       1       10
P == S      1       85
[S]P = S    3       16
S = [S]P    2       16
---------------------------------
Total 589

Given this extraction step subroutine, one can easily implement Hutchinson's algorithm:

rank_hutchinson(dest, P_scores, P_ids)
{
    P_best_score = -1;
    P_best_id = 0;
    for (row = 0; row < N_ROWS; row++)
        extract_step(P_best_score, P_best_id, P_scores, P_ids, row);
    for (i = 0; i < N_RET; i++)
    {
        best_of_best = (>?= P_best_score);
        where (P_best_score == best_of_best)
        {
            position = (<?= P_position);
            dest[i].score = [position]P_best_score;
            dest[i].id = [position]P_best_id;
            [position]P_best_score = -1;
        }
        extract_step(P_best_score, P_best_id, P_scores, P_ids, position);
    }
}

P = S

P = S

loop(N_ROWS)

extract_step ()

loop(N_RET)

{

S = (>?= P)

where(P == S)

{

S = (<?= P)

S = [S]P

S = [S]P

[S]P = S

}

extract_step()

}

Operation     Calls            Time per Call
---------------------------------------------
(>?= P)       2N_ret           207
P == S        N_ret            85
where         N_ret            10
S = [S]P      2N_ret           16
[S]P = S      N_ret            16
extract_step  N_rows + N_ret   589
---------------------------------------------
Total 1149N_ret + 589N_rows

Substituting in the value of N_rows = N_docs/N_procs we arrive at 1149N_ret + 589(N_docs/N_procs).

We have examined three ranking algorithms in this section: the system-defined ranking algorithm, iterative extraction, and Hutchinson's algorithm. Their times are as follows:

|D|       N_docs       System     Iterative  Hutchinson
--------------------------------------------------------
1 GB      200 x 10^3   138 ms     24 ms      25 ms
10 GB     2 x 10^6     1064 ms    137 ms     41 ms
100 GB    20 x 10^6    10503 ms   1286 ms    203 ms
1000 GB   200 x 10^6   104751 ms  12764 ms   1821 ms

The first scoring method to be considered here is based on *parallel signature files.* This file structure has been described by Stanfill and Kahle (1986, 1988, 1990a) and by Pogue and Willett (1987). This method is an adaptation of the overlap encoding techniques discussed in this book.

Overlap encoded signatures are a data structure that may be quickly probed for the presence of a word. A difficulty associated with this data structure is that the probe will sometimes return *present* when it should not. This is variously referred to as a *false hit* or a *false drop*. Adjusting the encoding parameters can reduce, but never eliminate, this possibility. Depending on the probability of such a false hit, signatures may be used in two manners. First, it is possible to use signatures as a filtering mechanism, requiring a two-phase search in which phase 1 probes a signature file for possible matches and phase 2 re-evaluates the query against the full text of documents accepted by phase 1. Second, if the false hit rate is sufficiently low, it is possible to use signatures in a single phase system. We will choose our signature parameters in anticipation of the second case but, if the former is desired, the results shown below may still be applied.

An overlap encoding scheme is defined by the following parameters:

SSize of signature in bits_{bits}

SWeight of word signatures_{weight}

SNumber of words to be inserted in each signature_{words}

H(_{j}T) A set of_{i}Shash functions_{weight}

Unless otherwise specified, the following values will be used:

S_bits     4096

S_weight   10

S_words    120

create_signature(B_signature, words)
{
    for (i = 0; i < S_BITS; i++)
        B_signature[i] = 0;
    for (i = 0; i < S_WORDS; i++)
        for (j = 0; j < S_WEIGHT; j++)
            B_signature[hash(j, words[i])] = 1;
}

The timing characteristics of this algorithm will not be presented.

To test a signature for the presence of a word, all S* _{weight}* hash functions are applied to it and the corresponding bits of the signature are ANDed together. A result of 0 is interpreted as *absent*; a result of 1 is interpreted as *present* (though possibly falsely so).

probe_signature(B_signature, word)
{
    B_result = 1;
    for (i = 0; i < S_WEIGHT; i++)
        B_result &= B_signature[hash(i, word)];
    return B_result;
}

B = S

loop(S_WEIGHT)

B &= B

Operation   Calls      Time per Call
------------------------------------
B = S       1          3 + 3r
B &= B      S_weight   3 + 3r
------------------------------------
Total 3(1 + S_weight)(1 + r)

Total for S_weight = 10: 33 + 33r

The number of signatures in a database is then

We can now compute the average time per query term:

Documents containing more than S* _{words}* words must be split into multiple signatures. These signatures can then be placed in consecutive positions, and flag bits used to indicate the first and last positions for each document. For example, given the following set of documents:

______________________________________________________________

| | | Still another |

| This is the initial | This is yet another | document taking |

| document | document | yet more space |

| | | than the others |

|_____________________|_____________________|__________________|

we might arrive at signatures divided as follows:

B_signature  [This is the] [initial document] [This is yet] [another document]
             [Still another document] [taking yet more] [space than the] [others]
------------------------------------------------------------------------------
B_first       1 0 1 0 1 0 0 0
B_last        0 1 0 1 0 0 0 1

Probing such a divided database proceeds by probing each signature individually, then combining the per-signature results for each document with scan_with_or, using B_first as the segment flag:

B_signature  [This is the] [initial document] [This is yet] [another document]
             [Still another document] [taking yet more] [space than the] [others]
------------------------------------------------------------------------------
B_first        1 0 1 0 1 0 0 0
B_last         0 1 0 1 0 0 0 1
probe("yet")   0 0 1 0 0 1 0 0
scan_with_or   0 0 1 1 0 1 1 1

The algorithm for this is as follows:

probe_document(B_signature, B_first, word)
{
    B_local = probe_signature(B_signature, word);
    B_result = scan_with_or(B_local, B_first);
    return B_result;
}

B = probe_signature()

B = scan_with_or()

Operation        Calls   Time per Call
--------------------------------------
probe_signature  1       33 + 33r
scan_with_or     1       632 + 56r
--------------------------------------
Total 665 + 89r

Using the above building blocks, it is fairly simple to construct a scoring algorithm. In this algorithm a query consists of an array of *terms*. Each term consists of a word and a weight. The score for a document is the sum of the weights of the words it contains. It may be implemented thus:

score_document(B_signature, B_first, terms)
{
    P_score = 0;
    for (i = 0; i < N_TERMS; i++)
    {
        B_probe = probe_document(B_signature, B_first, terms[i].word);
        where (B_probe)
            P_score += terms[i].weight;
    }
    return P_score;
}

P = S

loop (N_TERMS)

{

B = probe_document()

where(B)

P += S

}

Its timing characteristics are as follows:

Operation       Calls     Time per Call
---------------------------------------
P = S           1         3 + 15r
probe_document  N_terms   665 + 89r
where           N_terms   8 + 2r
P += S          N_terms   3 + 28r
---------------------------------------
Total (3 + 676N_terms) + (15 + 119N_terms)r

Boolean queries may be processed by recursively walking the query tree, combining the results of probe_document with parallel boolean operations:

query(B_signature, B_first, term)
{
    arg0 = term->args[0];
    arg1 = term->args[1];
    switch (term->connective)
    {
        case AND:  return query(B_signature, B_first, arg0) &&
                          query(B_signature, B_first, arg1);
        case OR:   return query(B_signature, B_first, arg0) ||
                          query(B_signature, B_first, arg1);
        case NOT:  return !query(B_signature, B_first, arg0);
        case WORD: return probe_document(B_signature, B_first, arg0);
    }
}

The bulk of the time in the signature scoring algorithm is taken up by the probe_document operation. The bulk of the time for that operation, in turn, is taken up by the scan_with_or operation. This operation is performed once per query term. We should then seek to pull the operation outside the query-term loop. This may be done by (1) computing the score for each signature independently, then (2) summing the scores at the end. This is accomplished by the following routine:

score_document(B_signature, B_first, terms)
{
    P_score = 0;
    for (i = 0; i < N_TERMS; i++)
    {
        B_probe = probe_signature(B_signature, terms[i].word);
        where (B_probe)
            P_score += terms[i].weight;
    }
    P_score = scan_with_add(P_score, B_first);
    return P_score;
}

P = S;

loop (N_TERMS)

{

B = probe_signature ();

where (B)

P += S;

}

P = scan_with_add ();

The timing characteristics are as follows:

Operation        Calls     Time per Call
----------------------------------------
P = S            1         3 + 15r
probe_signature  N_terms   33 + 33r
where            N_terms   8 + 2r
P += S           N_terms   3 + 28r
scan_with_add    1         740 + 170r
----------------------------------------
Total (743 + 44N_terms) + (185 + 63N_terms)r

Comparing the two scoring algorithms, we see:

Basic Algorithm     (3 + 676N_terms) + (15 + 119N_terms)r

Improved Algorithm  (743 + 44N_terms) + (185 + 63N_terms)r

The final step in executing a query is to rank the documents using one of the algorithms noted in the previous section. Those algorithms assumed, however, that every position contained a document score. The signature algorithm leaves us with only the last position of each document containing a score. Use of the previously explained algorithms thus requires some slight adaptation. The simplest such adaptation is to pad the scores out with -1. In addition, if Hutchinson's ranking algorithm is to be used, it will be necessary to force the system to view a parallel score variable at a high VP ratio as an array of scores at a VP ratio of 1; the details are beyond the scope of this discussion.

Taking into account the VP ratio used in signature scoring, the ranking time will be:

Substituting the standard values for *N _{terms}*,

The times for various sizes of database, on a machine with 65,536 processors, are as follows:

|D|       N_docs       Score    Rank     Total
-----------------------------------------------
1 GB      200 x 10^3   9 ms     28 ms    37 ms
10 GB     2 x 10^6     74 ms    75 ms    149 ms
100 GB    20 x 10^6    723 ms   545 ms   1268 ms
1000 GB   200 x 10^6   7215 ms  5236 ms  12451 ms

It is possible that a signature file will not fit in primary storage, either because it is not possible to configure a machine with sufficient memory or because the expense of doing so is unjustified. In such cases it is necessary that the signature file reside on either secondary or tertiary storage. Such a file can then be searched by repetitively (1) transferring signatures from secondary storage to memory, (2) using the above signature-based algorithms to score the documents, and (3) storing the scores in a parallel array. When the full database has been passed through memory, any of the above ranking algorithms may be invoked to find the best matches. The algorithms described above need to be modified, but the compute time should be unchanged. There will, however, be the added expense of reading the signature file into primary memory. If *R _{IO}* is the I/O rate in megabytes per second, the time needed to read a signature file of S megabytes will be S/R_IO seconds.

|D|       I/O Time
-------------------
1 GB      2 sec
10 GB     15 sec
100 GB    150 sec
1000 GB   1500 sec

|D|       I/O Time   Search Time (100 queries)   Total
-------------------------------------------------------
1 GB      2 sec      4 sec                       6 sec
10 GB     15 sec     15 sec                      30 sec
100 GB    150 sec    127 sec                     277 sec
1000 GB   1500 sec   1245 sec                    2745 sec

This has not, in practice, proved an attractive search method.

It is guaranteed that, if a word is inserted into a signature, probing for it will return *present*. It is possible, however, for a probe to return *present* for a word that was never inserted. This is referred to variously as a *false drop* or a *false hit*. The probability of a false hit depends on the size of the signature, the number of hash codes, and the number of bits set in the table. The number of bits actually set depends, in turn, on the number of words inserted into the table. The following approximation has proved useful:

P_false = (1 - e^(-S_weight S_words / S_bits))^S_weight

^{6}The compression factor is defined as the ratio of the signature file size to the full text.

S_words  Signatures/MB  Compression  P_false        False hits/GB
------------------------------------------------------------------
40       1540           77%          4.87 x 10^-11  7.50 x 10^-5
80       820            42%          3.09 x 10^-8   2.50 x 10^-2
120      580            30%          1.12 x 10^-6   6.48 x 10^-1
160      460            24%          1.25 x 10^-5   5.75 x 10^0
200      388            20%          7.41 x 10^-5   2.88 x 10^1
240      340            17%          2.94 x 10^-4   1.00 x 10^2
280      306            16%          8.88 x 10^-4   2.72 x 10^2
320      280            14%          2.20 x 10^-3   6.15 x 10^2

S_words  S_bits  P_false
---------------------------
80       2731    1.1163 x 10^-6
120      4096    1.1169 x 10^-6
160      5461    1.1172 x 10^-6

S_words  S_bits  Compression
-----------------------------
60       2048    27%
120      4096    30%
240      8192    35%

An inverted file is a data structure that, for every word in the source file, contains a list of the documents in which it occurs. For example, the following source file:

______________________________________________________________

| | | Still another |

| This is the initial | This is yet another | document taking |

| document | document | yet more space |

| | | than the others |

|_____________________|_____________________|__________________|

has the following inverted index:

another 1 2

document 0 1 2

initial 0

is 0 1

more 2

others 2

space 2

still 2

taking 2

than 2

the 0 2

this 0 1

yet 1 2

Each element of an inverted index is called a posting, and minimally consists of a document identifier. Postings may contain additional information needed to support the search method being implemented. For example, if document-term weighting is used, each posting must contain a weight. In the event that a term occurs multiple times in a document, the implementer must decide whether to generate a single posting or multiple postings. For IR schemes based on document-term weighting, the former is preferred; for schemes based on proximity operations, the latter is most useful.

The parallel inverted file structure proposed by Stanfill, Thau, and Waltz (1989) is a straightforward adaptation of the conventional serial inverted file structure. A parallel inverted file is a parallel array of postings such that the postings for a given word occupy contiguous positions within a contiguous series of rows, plus an index structure indicating the start row, end row, start position, and end position of the block of postings for each word. For example, given the database and inverted file shown above, the following parallel inverted file would result:

Postings

----------

1 2 0 1

2 0 0 1

2 2 2 2

2 2 0 2

0 1 1 2

Index

-----------------------------------------

Word First First Last Last

Row Position Row Position

-----------------------------------------

another 0 0 0 1

document 0 2 1 0

initial 1 1 1 1

is 1 2 1 3

more 2 0 2 0

others 2 1 2 1

space 2 2 2 2

still 2 3 2 3

taking 3 0 3 0

than 3 1 3 1

the 3 2 3 3

this 4 0 4 1

yet 4 2 4 3

P_i     The number of postings for term T_i

R_i     The number of rows in which postings for T_i occur

R̄       The average number of rows per query term

(r, p)  A row-position pair

and the number of rows occupied by *T _{i}* will be approximately P_i/N_procs + 1.

Also from the distribution model, *f*(*Q*) = *Z*. This gives us:

The scoring algorithm for parallel inverted files involves using both left- and right-indexing to increment a score accumulator. We start by creating an array of score registers, such as is used by Hutchinson's ranking algorithm. Each document is assigned a row and a position within that row; for example, document *i* might be mapped to position *i* mod *N _{procs}*, row ⌊*i*/*N _{procs}*⌋. Each posting is then modified so that, rather than containing a document identifier, it contains the row and position to which it will be sent. The Send with add operation is then used to add a weight to the score accumulator. The algorithm is as follows:

score_term(P_scores, P_postings, term)
{
    for (row = term.start_row; row <= term.end_row; row++)
    {
        if (row == term.start_row)
            start_position = term.start_position;
        else
            start_position = 0;
        if (row == term.end_row)
            end_position = term.end_position;
        else
            end_position = N_PROCS-1;
        where ((start_position <= P_position) &&
               (P_position <= end_position))
        {
            P_dest_pos = P_postings[row].dest_pos;
            P_dest_row = P_postings[row].dest_row;
            [P_dest_pos]P_scores[P_dest_row] += term.weight;
        }
    }
}

loop(R_BAR)
    where ((S <= P) && (P <= S))   /* also 1 B && B operation */
        [P]P[P] += S

Taking into account the timing characteristics of these operations and the value of R̄, we arrive at the following times:

|D|       Time/Term  10 Terms  Rank     Total
----------------------------------------------
1 GB      2 ms       25 ms     25 ms    50 ms
10 GB     3 ms       34 ms     41 ms    75 ms
100 GB    13 ms      131 ms    203 ms   334 ms
1000 GB   110 ms     1097 ms   1821 ms  2918 ms

Up to now, the algorithms we have discussed support only binary document models, in which the only information encoded in the database is whether a given term appears in a document or not. The parallel inverted file structure can also support document term weighting, in which each posting incorporates a weighting factor; this weighting factor measures the strength of usage of a term within the document. The following variant on the query execution algorithm is then used:

score_weighted_term(P_scores, P_postings, term)
{
    for (row = term.start_row; row <= term.end_row; row++)
    {
        if (row == term.start_row)
            start_position = term.start_position;
        else
            start_position = 0;
        if (row == term.end_row)
            end_position = term.end_position;
        else
            end_position = N_PROCS-1;
        where ((start_position <= P_position) &&
               (P_position <= end_position))
        {
            P_dest_pos = P_postings[row].dest_pos;
            P_dest_row = P_postings[row].dest_row;
            P_weight = term.weight * P_postings[row].weight;
            [P_dest_pos]P_scores[P_dest_row] += P_weight;
        }
    }
}

This algorithm requires only slightly more time than the unweighted version.

18.7 PARTITIONED POSTING FILES

One major advantage of inverted files is that it is possible to query them without loading the entire file into memory. The algorithms shown above have assumed that the section of the file required to process a given query is already in memory. While a full discussion of the evolving field of I/O systems for parallel computing is beyond the scope of this chapter, a brief presentation is in order. In the final analysis, most of what is known about I/O systems with large numbers of disks (e.g., mainframe computers) will probably hold true for parallel systems.

I/O systems for parallel computers are typically built from large arrays of simple disks. For example, the CM-2 supports a disk array called the Data Vault^{TM}, which contains 32 data disks plus 8 ECC disks.^{8} It may be thought of as a single disk drive with an average latency of 200 milliseconds and a transfer rate of 25 MB/second. Up to 8 Data Vaults may be simultaneously active, yielding a transfer rate of up to 200 MB/second. This access method achieves very high transfer rates, but does not yield many I/O's per second; this can be crippling for all but the very largest databases. Consider, for example, a 64K processor Connection Machine with 8 disk arrays operating in single transfer mode. Assume we are using the partitioned posting file representation, and that each posting requires 4 bytes of storage. The storage required by each partition is then 4FN_procs.

^{8}Data Vault is a trademark of Thinking Machines Corporation

|D|       Seek     Transfer  Score
-----------------------------------
1 GB      200 ms   5 ms      1 ms
10 GB     200 ms   5 ms      1 ms
100 GB    200 ms   11 ms     2 ms
1000 GB   200 ms   65 ms     9 ms

Under these circumstances, the system is severely seek-bound for all but the very largest databases.

Fortunately, the disk arrays contain buried in them the possibility of solving the problem. Each Data Vault has 32 disks embedded in it; a system with 8 disk arrays thus has a total of 256 disks. It is possible to access these disks independently. Under this I/O model, each disk transfers a block of data into the memories of all processors. The latency is still 200 milliseconds, but 256 blocks of data will be transferred rather than 1. This has the capability of greatly reducing the impact of seek times on system performance.

HILLIS, D. (1985). *The Connection Machine*. Cambridge, MA: MIT Press.

HUTCHINSON, J. (1988). Personal communication.

STONE, H. (1987). Parallel querying of large databases: A case study. *Computer*, *20*(10), 11-21.