Software Productivity Consortium

This chapter presents an overview of Boolean operations, which are one means of expressing queries in information retrieval systems. The concepts of Boolean operations are introduced, and two implementations based on sets are given. One implementation uses bit vectors; the other, hashing. The relative performance characteristics of the approaches are shown.

Information retrieval systems manage documents. These differ from the data stored in traditional database management systems in that they have much less structure. The data are not organized into neat little units with uniform content, where an integer may be expected at a particular location or a string at another. Little may be reliably determined beyond the existence of a document and its boundaries. Of course, information may be derived from a document (such as inverted indices--see Chapter 3). However, such information is invisible to the user of an information retrieval system. He or she sees only a document, with no particular structure save units derived from natural languages, such as words, sentences, or paragraphs.

select Name, Age from Person where Age > 25

Queries on documents in information retrieval systems, then, are limited to operations involving character strings, at best with some knowledge of sentence structure. A typical query might be, "Give me the names of all documents containing sentences with the words 'computer' and 'science'." Despite its seeming simplicity, such a query can be very costly. In a database management system, one might optimize this query by working from sorted data, and using search algorithms that capitalize on such order. An information retrieval system, however, cannot reorder a document's text. A search for a word within a document may therefore take time linearly proportional to the size of the document. As mentioned, the situation can be improved through the use of indices, but the use of a partial index can limit the flexibility of an information retrieval system, and a full index is often prohibitively expensive.

Boolean expressions are formed from user queries. In some information retrieval systems, the user enters them directly. In more sophisticated systems, the user enters natural language, which the system converts into Boolean expressions. In any event, they are then evaluated to determine the results of the query. This section will show how Boolean expressions are built from queries, and how those expressions are translated into sets.

Find all documents containing "information".

Find all documents that do not contain "information".

is represented as the Boolean expression:

not information

Find all documents containing "information" and "retrieval".

Find all documents containing "information"

or "retrieval" (or both).

Find all documents containing "information"

or "retrieval", but not both.

Find all documents containing "information", "retrieval",

or not containing both "retrieval" and "science".

This translates into the following Boolean expression:

(information and retrieval) or not (retrieval and science)

Parentheses are often helpful to avoid ambiguity.

**1. ***U - D*_{1} is the set of all documents not containing *P*_{1} (not).

**2. ***D*_{1} *D*_{2} is the set of all documents containing both *P*_{1} and *P*_{2} (and).

**3. ***D*_{1} *D*_{2} is the set of all documents containing either *P*_{1} or *P*_{2} (or).

The expression "information and retrieval" is:

{doc_{1}, doc_{3}} {doc_{1}, doc_{2}, doc_{4}} = {doc_{1}}

whereas "information or retrieval" is:

{*doc*_{1}, *doc*_{3}} {*doc*_{1}, *doc*_{2}, *doc*_{4}} = {*doc*_{1}, *doc*_{2}, *doc*_{3}, *doc*_{4}}

{information and retrieval} or not {retrieval and science}

- ({*doc*_{1}, *doc*_{2}} {*doc*_{2}, *doc*_{4}, *doc*_{5}})

= {*doc*_{2}} {*doc*_{1}, *doc*_{3}, *doc*_{4}, *doc*_{5}}

= {*doc*_{1}, *doc*_{2}, *doc*_{3}, *doc*_{4}, *doc*_{5}}

Document Terms

---------------------------------------------------

doc_{1}algorithm, information, retrieval

doc_{2}retrieval, science

doc_{3}algorithm, information, science

doc_{4}pattern, retrieval, science

doc_{5}science, algorithm

Before discussing how to implement sets, it is necessary to define them more precisely. Sets are a familiar concept, and implementations for them may be found in many data structures textbooks. However, no two textbooks seem to have precisely the same operations, and the semantics of the operations vary from book to book (Does an operation modify its inputs? Is it a procedure or a function?). A little time spent discussing the meaning of a set as used in this chapter will avoid confusion.

A set is a homogeneous, unordered collection of elements. In programming language terms, a set has an associated "element data type," and all elements of the set must be of this type. Each element data type has a *key*. For a given set, no two elements ever simultaneously possess the same key. An element data type may also specify other data that will be included along with the element. These data values need not be unique among all elements of a set. If the key constitutes the entire value of a set element, however, then by definition all values in a set differ.

To specify a set, one must provide an element data type *E*. This specifies the types of elements that may populate a given instance of a set. Some programming languages, such as Ada, permit programmers to enforce the constraint that a set contain homogeneous data. In C, the programming language used here, this constraint is unenforceable. However, only a little care is required to ensure that sets are in fact homogeneous.

The first order of business is to declare a data type for a set, and for elements of the set:

typedef <...> set;

typedef <...> elementType;

#define TRUE 1

#define FALSE 0

typedef int boolean;

v = "abc";

Insert(s, v);

v[1] = 'x';

a. The application will agree not to modify values that are used in sets.

b. The specification of a set can include a function that creates a copy of a datum.

Copy (s, s);

Unite(a, b, b);

void Clear(set *s); /* Make s contain zero elements. */

void Insert(set *s, /* Modify s to contain e, if */

elementType e); /* it does not already. */

void Delete(set *s, /* Remove e from s. */

elementType e);

void Unite(set *s1, set *s2, /* Make s3 the union of sets */

set *s3); /* s1 and s2. */

void Intersect(set *s1, set *s2, /* Make s3 the intersection */

set *s3); /* of sets s1 and s2. */

void Subtract(set *s1, set *s2, /* Make s3 the difference */

set *s3); /* of sets s1 and s2. */

boolean Empty(set *s); /* Return TRUE iff s has */

/* zero elements. */

boolean Member(set *s, /* Return TRUE iff e is a */

elementType e); /* member of s. */

void Copy(set *source, /* Make "destination" the */

set *destination); /* same set as "source". */

void Iterate(set *s, /* Call f(e) once for every */

boolean (*f)()); /* element e in set s. */

boolean Error_Occurred(); /* Return TRUE iff the last */

/* operation caused an error. */

boolean PrintElement(i)

elementType i;

{

printf("The value is %d\n", i);

return TRUE;

}

The following statement would then print all elements of a set S of integers:

Iterate(S, PrintElement);

Intersect (s1, s2, s3);

if ( Error_Occurred () ) {

...

}

The interface given in Figure 12.2 can provide an information retrieval system with operations on sets of various types. These operations may be used by many parts of the system. Their packaging must make this expedient. Information retrieval systems are large and consist of many subsystems. The code for a typical information retrieval system will probably be spread throughout many files. The set operations must be packaged such that each subsystem--that is, set of files--needing access to sets may have them in the simplest possible way, involving no duplication of code.

An implementation of a set would therefore be packaged as follows:

#include "set. h"

The routines' use will be illustrated by using them to solve the query given at the end of section 12.2. Doing so requires the use of six sets--three for each of the patterns, plus three to hold intermediate values and the results. Figure 12.3 gives the code. The issue of how the sets are initialized is ignored here; it is the subject of Chapter 3. That would involve calls to the Insert routine, one for each document to be inserted in a set.

#include "set.h"

SolveQuery ()

{

set info_docs, retr_docs, sci_docs;

set t1, t2, t3;

Clear (&info_docs);

Clear (&retr_docs);

Clear (&sci_docs);

Clear (&t1);

Clear (&t2);

Clear (&t3);

... Find the documents containing "information",

"retrieval" and "science", and put their names

into info_docs, retr_docs, and sci_docs,

respectively ...

Intersect(&info_docs, &retr_docs, &t1);

Intersect(&retr_docs, &sci_docs, &t2);

Unite(&t1, &retr_docs, &t3);

Subtract(&t3, &t2, &t1);

/* Set t1 contains the result of the query. */

}

Before considering specific techniques for implementing sets, it is worth mentioning the trade-offs one may encounter. Space and time considerations are as applicable in set algorithms as in any other area. However, the sets used in information retrieval can vary greatly in size, indeed by many orders of magnitude. Small documents, or ones being queried in the context of a particular domain, may have only a few terms. Then again, some documents will have tens of thousands of unique index terms on which users may wish to search. Sets for the former may easily be kept in memory and manipulated using fixed-space algorithms. Sets for the latter may require secondary memory implementation strategies. The same information retrieval system may well access both types of documents. One implementation approach to sets might not be enough.

In any case, large rather than small sets are likely to be the exception rather than the rule. Furthermore, the amount of information on one's system seldom shrinks. The ever-increasing amount of knowledge has made gigabyte information bases commonplace. Terabyte systems are not uncommon. Indeed, they will probably be the rule as global networks proliferate. Planning for a small, stable information base is unwise, then, and will lead to unacceptable performance. Every ounce of computing power must therefore be squeezed out of the algorithms that implement Boolean operations. The ones given here attempt to do so. They sometimes sacrifice clarity for speed, but a reader familiar with the C programming language should be able to figure them out without undue effort.

Bit vectors are a simple approach to implementing sets. They provide an extremely fast implementation that is particularly suited to sets where the element type has a small domain of possible values. The representation is not compact, as will be seen, but may be appropriate if the number of documents is small.

The approach using bit vectors is to represent a set as a sequence of bits, with one bit for each possible element in the domain of the element type. It is assumed that each value in the element's domain maps to a unique bit. If a bit's value is 1, then the element represented by that bit is in the set; if 0, the element is not in the set. These values map to the TRUE and FALSE constants of the Boolean data type, which provides a clearer understanding of the bit's purpose (TRUE means an element is in a set, FALSE means it is not) than do the digits.

( vector [ (i-1) /8] >> (i-1)%8 ) & 01

for ( i = 0; i < length(s1); i++ )

s3[i] = s1[i]&s2[i];

assuming that length (s) returns the number of bytes needed to represent set s.

void Create(int lower, int upper, set *s);

#define MAX_ELEMENTS 256

#define WORDSIZE 8

typedef struct {

int lower;

int upper;

char bits[MAX_ELEMENTS\WORDSIZE];

} set;

The implementation for Create is:

void Create(lower, upper, s)

int lower, upper;

set *s;

{

if ( lower > upper

|| (upper - lower) >= MAX_ELEMENTS ) {

error_occurred = TRUE;

return;

}

s->lower = lower;

s->upper = upper;

error_occurred = FALSE;

}

static boolean error_occurred;

boolean Error_Occurred()

{

return error_occurred;

}

Note that create does not initialize the set's contents; that is, the set is not guaranteed empty.

void Clear(s)

set *s;

{

register int i, Number_Of_Bytes;

Number_Of_Bytes = (s->upper - s-lower)/WORDSIZE + 1;

for ( i = 0; i < Number_Of_Bytes; i++ )

s->bits[i] = 0;

error_occurred = FALSE;

}

Insertion and deletion of elements are handled by setting the appropriate bit to 1 and 0, respectively. Figure 12.4 shows the code to do so. The routines mask the byte containing the appropriate bit with a bit string that turns the bit on or off, respectively. The bit string for Insert consists of a string of 0's, with a 1 at the appropriate bit; or'ing this string with a byte with result in a byte unchanged except for (possibly) the bit at the location of the 1 in the bit string. The bit string for Delete consists of a string of 1'S, with a 0 at the appropriate bit; and'ing this string with a byte will result in a byte unchanged except for (possibly) the bit at the location of the 0. C's bit-oriented operators make this simple.

void Insert(s, e)

set *s;

elementType e;

{

if ( e < s->lower || e > s->upper ) {

error_occurred = TRUE;

return;

}

s->bits[(e-s->lower)/WORDSIZE] |= 01 << ( (e-s->lower)%WORDSIZE);

error_occurred = FALSE;

}

void Delete(s, e)

set *s;

elementType e;

{

if ( e < s->lower || e > s->upper ) {

error_occurred = TRUE;

return;

}

s->bits[(e-s->lower)/WORDSIZE] &= ~(01 << ((e-s->lower)%WORDSIZE));

error_occurred = FALSE;

}

boolean Empty(s)

set *s;

{

register int i, Number_Of_Bytes;

error_occurred = FALSE;

Number_Of_Bytes = (s->upper - s->lower)/WORDSIZE + 1;

for ( i = 0; i < Number_Of_Bytes; i++ )

if ( s->bits[i] )

return FALSE;

return TRUE;

}

boolean Member(s, e)

set *s;

elementType e;

{

if ( error_occurred = (e < s->lower || e < s->upper) )

return FALSE;

return (s->bits[(e - s->lower)/WORDSIZE] >>

(e - s->lower)%WORDSIZE) & 01;

}

#define equivalent(s1, s2) \

((s1)->lower==(s2)->lower && (s1)->upper==(s2)->upper)

Figures 12.6-12.8 show the code.

void Unite(s1, s2, s3)

set *s1, *s2;

set *s3;

{

register int i, Number_Of_Bytes;

if ( ! (equivalent(s1, s2) && equivalent(s2, s3)) ) {

error_occurred = TRUE;

return;

}

Number_Of_Bytes = (s1->upper - s1->lower)/WORDSIZE + 1;

for ( i = 0 ; i Number_Of_Bytes; i++ )

s3->bits[i] = (s1->bits[i] | s2->bits[i]);

error_occurred = FALSE;

}

Applications sometimes need to copy sets. Using bit vectors, this is straightforward. The bits array must be copied byte by byte, since C does not permit assignment of entire arrays:

void Copy(source, destination)

set *source,*destination;

{

register int i;

if ( ! equivalent(source, destination) ) {

error_occurred = TRUE;

return;

}

for ( i = 0; i < MAX_ELEMENTS/WORDSIZE; i++ )

destination-bits[i] = source->bits[i];

error_occurred = FALSE;

}

void Intersect(s1, s2, s3)

set *s1, *s2;

set *s3;

{

register int i, Number_Of_Bytes;

if ( ! (equivalent(s1, s2) && equivalent(s2, s3)) ) {

error_occurred = TRUE;

return;

}

Number_Of_Bytes = (s1->upper - s1->lower)/WORDSIZE + 1;

for ( i = 0 ; i < Number_Of_Bytes; i++ )

s3->bits[i] = (s1->bits[i] & s2->bits[i]);

error_occurred = FALSE;

}

void Subtract(s1, s2, s3)

set *s1, *s2;

set *s3;

{

register int i, Number_Of_Bytes;

if ( ! (equivalent(s1, s2) && equivalent(s2, s3)) ) {

error_occurred = TRUE;

return;

}

Number_Of_Bytes = (s1->upper - s1->lower)/WORDSIZE + 1;

for ( i = 0 ; i < Number_Of_Bytes; i++ )

s3->bits[i] = s1->bits[i] & ~ (s2->bits[i]);

error_occurred = FALSE;

}

The final operation is iteration. This operation uses C's dereferencing features to execute an application-specified function. Figure 12.9 shows the code. The function f receives each element--not the bit, but the actual value. This requires converting the bit position into the value, the inverse mapping from that required for other operations. The function is expected to return a Boolean value. If this value is TRUE, another element will be passed to f; if FALSE, the iteration will cease. This lets applications exert some control over the iteration.

void Iterate(s, f)

set *s;

boolean (*f) ();

{

register int i, j, Number_Of_Bytes;

Number_Of_Bytes = (s->upper - s->lower)/WORDSIZE + 1;

error_occurred = FALSE;

for ( i = 0; i < Number_Of_Bytes; i++ )

for ( j = 0; j < WORDSIZE; j++ )

if ( (s->bits[i] >> j) % 2 )

if ( ! (*f)(j + i*WORDSIZE + s->lower) )

return;

}

If a large number of documents must be searched, or if representing their names using integers is not convenient, then bit vectors are unacceptable. This section will show how to use *hashing*, another common set implementation technique. The approach is not as fast as bit vectors, but it makes far more efficient use of space. In fact, whereas bit vectors explicitly restrict the maximum number of elements, hashing permits sets of arbitrary size. The trade-off is in speed: performance is inversely proportional to a set's cardinality. However, good performance characteristics can be achieved if one has some idea in advance of the average size of the sets that will be manipulated, a quantity that is usually available.

typedef char element_Type;

typedef struct be_str {

elementType datum;

struct be_str *next_datum;

} bucket_element;

typedef struct {

int Number_Of_Buckets;

bucket_element **buckets;

int (*hashing_function)();

boolean (*comparator)();

} set;

The creation routine is given in Figure 12.10. Note that its interface is different from the bit-vector approach, reflecting the different information needed to maintain and manipulate the sets. A "Comparator" routine must also be provided. This routine, given two values of type elementType, should return TRUE if they are equal, and FALSE if they are not. Such a routine is needed for hash tables containing strings or dynamically allocated data, since C's "= =" operator does not understand such structures.

void Create(Number_Of_Buckets, Hashing_Function, Comparator, s)

int Number_Of_Buckets;

int (*Hashing_Function)();

boolean (*Comparator)();

set *s;

{

register unsigned int Bucket_Array_Size;

register int i;

if ( Number_Of_Buckets <= 0 ) {

error_occurred = TRUE;

return;

}

s->Number_Of_Buckets = Number_Of_Buckets;

s->hashing_function = Hashing_Function;

s->comparator = Comparator;

Bucket_Array_Size = sizeof(bucket_element) * Number_Of_Buckets;

s->buckets = (bucket_element **)malloc(Bucket_Array_Size);

if ( error_occurred = (s->buckets == NULL) )

return;

for ( i = 0; i < Number_Of_Buckets; i++ )

s->buckets[i] = (bucket_element *)NULL;

}

void Clear(s)

set *s;

{

register int i;

register bucket_element *b, *next_b;

for ( i = 0; i < s->Number_Of_Buckets; i++ )

if ( s->buckets[i] != NULL ) {

b = s->buckets[i];

while ( b != NULL ) {

next_b = b->next_datum;

free( (char *)b );

b = next_b;

}

s->buckets[i] = NULL;

}

error_occurred = FALSE;

}

#define hash(s,e) (abs ((* ((s)->hashing_function)) (e)) %

(s)->Number_Of_Buckets)

Insertion and deletion into a hash table involve linked-list operations. Figures 12.11 and 12.12 contain one implementation. This version of Insert ( ) inserts an element by traversing the list associated with the bucket to which it maps. If the element is not already in the list, it is inserted at the list's head. Insertion at the head is common because, in practice, elements added recently tend to be accessed most often. Data sets for which this does not hold could use an implementation where the element is added at the end of the list. Note that application-supplied hashing function is equivalent to* â* (*v*), not *h* (*v*), that is, its range is all integers rather than all integers between a certain bucket size. This simplifies dynamic resizing of hash tables (an improvement discussed in Chapter 13)*.*

void Insert(s, e)

set *s;

elementType e;

{

register bucket_element *b, *New_Element;

register int bucket;

error_occurred = FALSE;

bucket = hash(s,e);

for ( b = s->buckets[bucket]; b != NULL ; b = b->next_datum )

if ( (*(s->comparator))(b->datum, e) )

return;

New_Element = (bucket_element *)malloc(sizeof (bucket_element));

if ( New_Element == NULL ) {

error_occurred = TRUE;

return;

}

New_Element->datum = e;

New_Element->next_datum = s-->buckets[bucket];

s->buckets[bucket] = New_Element;

}

void Delete(s, e)

set *s;

elementType e;

{

register bucket_element *b, *previous;

register int bucket;

error_occurred = FALSE;

bucket = hash(s, e);

if ( (b = s->buckets[bucket]) == NULL )

return;

if ( (*(s->comparator))(b->datum, e) )

s->buckets[bucket] = b->next_datum;

else {

while ( b->next_datum != NULL ) {

if ( (*(s->comparator))(b->datum, e) )

break;

previous = b;

b = b->next_datum;

}

if ( b == NULL )

return;

previous->next_datum = b->next_datum;

}

free ( (char *)b );

}

boolean Empty(s)

set *s;

{

register int i;

error_occurred = FALSE;

for ( i = 0; i < s->Number_Of_Buckets; i++ )

if ( s->buckets[i] != NULL )

return FALSE;

return TRUE;

}

boolean Member (s, e)

set *s;

elementType e;

{

register bucket_element *b;

register int bucket;

error_occurred = FALSE;

bucket = hash(s, e);

for ( b = s->buckets[bucket]; b != NULL ; b = b->next_datum )

if ( (*(s->comparator))(b->datum, e) )

return TRUE;

return FALSE;

}

void Unite(s1, s2, s3)

set *s1, *s2;

set *s3;

{

register int i;

register bucket_element *b;

Copy(s1, s3);

if ( Error_Occurred() )

return;

for ( i = 0; i < s2->Number_Of_Buckets; i++ ) {

if ( s2->buckets[i] == NULL )

continue;

for ( b = s2->buckets[i]; b != NULL; b = b->next_datum )

if ( ! Member(s3, b->datum) ) {

Insert(s3, b->datum);

if ( Error_Occurred() )

return;

}

}

error_occurred = FALSE;

}

void Intersect(s1, s2, s3)

set *s1, *s2;

set *s3;

{

register int i;

register bucket_element *b;

Clear(s3);

for ( i = 0; i < s1->Number_Of_Buckets; i++ ) {

if ( s1->buckets[i] == NULL )

continue;

for ( b = s1->buckets[i]; b != NULL; b = b->next_datum )

if ( Member(s2, b->datum) ) {

Insert(s3, b->datum);

if ( Error_Occurred() )

return;

}

}

error_occurred = FALSE;

}

void Subtract (s1, s2, s3)

set *s1, *s2;

set *s3;

{

register int i;

register bucket_element *b;

Clear(s3);

for ( i = 0; i < s1->Number_Of_Buckets; i++ ) {

if ( s1->buckets[i] == NULL )

continue;

for ( b = s1->buckets[i]; b != NULL; b = b->next_datum )

if ( ! Member(s2, b->datum) ) {

Insert(s3, b->datum);

if ( Error_Occurred() )

return;

}

}

error_occurred = FALSE;

}

Copying a hash table involves copying the bucket array, plus all lists in the buckets. The code to do so is in Figure 12.17. The routine does *not* simply copy the lists; the data from one must be rehashed into the other, since the two may not have identical bucket array lengths or hash functions. Even if the tables are equivalent in these respects, they will not be identical after the copying operation: the order of elements within the buckets is reversed. This will only show up during iteration. Since the order of elements during iteration is not defined, the difference is irrelevant.

To aid the reader in understanding the relative merits of the two implementations, this section presents an analysis of the time and space required for the two set implementation techniques. The analysis for each is straightforward. The operations of most concern are Insert, Unite, Intersect, and Subtract, since it is presumed that most applications will spend the majority of their time manipulating sets rather than creating them. For the sake of completeness, however, all routines will be analyzed.

void Copy(source, destination)

set *source, *destination;

{

register int i h;

register bucket_element *e, *b;

Clear (destination);

for ( i = o; i < source->Number_Of_Buckets; i++ ) {

if ( source->buckets[i] == NULL )

continue;

for ( b = source->buckets[i]; b != NULL; b = b->next_datum ) {

h = hash(destination, b->datum);

e = (bucket_element *)malloc(sizeof (bucket_element));

if ( e == NULL ) {

error_occurred = TRUE;

return;

}

e->datum = b->datum;

e->next_datum = destination->buckets[h];

destination->buckets[h] = e;

}

}

error_occurred = FALSE;

}

void Iterate(s, f)

set *s;

boolean (*f) ();

{

register int i;

register bucket_element *b;

error_occurred = FALSE;

for ( i = 0; i < s->Number_Of_Buckets; i++ )

for ( b = s->buckets[i]; b != NULL; b = b->next_datum )

if ( ! (*f) (b->datum) )

return;

}

The space requirements for bit vectors are somewhat worse. They are also constant, but potentially quite large. In the implementation given here, MAX_ELEMENTS/WORDSIZE+2 sizeof (int) bytes will be needed to store any bit-vector set. As noted previously, this can easily be rather large.

The behavior of hashing algorithms is not quite so predictable. It depends on the randomness of the hashing function on the data being stored. Assuming that the hashing function is "reasonably" random, worst-case and expected behavior are not difficult to derive. Worst-case will be presented first. The hashing function is usually chosen to be *O*(*C*) (nonlinear functions defeat the advantages of hashing). The time required once the hashing function is computed will be proportional to the number of elements in the bucket. At worst, all elements will be in a single bucket. If so, then insertion, deletion, and membership tests all require *O*(*N*), where *N* is the number of elements in the set. The union, intersection, and difference routines have significantly poorer worst-case behavior. Consider the Unite operation on two sets that each have all elements in a single bucket. The Member() test requires *O*(*N*) steps for a single invocation. Since the inner for-loop iterates through *N* elements, the complexity of the loop is *O*(*N*^{2}). This logic also applies to Intersect and Subtract.

Hashing Bit Vectors

Worst-Case Average Worst-Case Average

-------------------------------------------------------------------

Insert/Delete/MemberO(N)O(C)0(C)O(C)

Unite/Intersect/SubtractO(N^{2})O(N)O(W)0(W)

Create/EmptyO(B)O(B)O(W)0(W)

CopyO(N)O(N)0(W)0(W)

*B* = number of buckets in hash table

*W* = number of words in bit string

Bit Vectors Hash Tables

Domain Number of Elements Domain Number of Elements

Size 64 1024 2Size 64 1024 2^{15 }^{15}

--------------------------- ----------------------------

256 32 -- -- 256 1198 4078 99,310

2^{15 }4096 4096 4096 2^{15 }1262 5102 132,078

2^{31 }2^{28 }2^{28}2^{28}2^{31 }1390 7100 197,614

SEDGEWICK, R. 1990. *Algorithms in C*. Reading, Mass: Addison-Wesley.