Systems Analyst, Acumenics Research and Technology, 9990 Lee Highway, Suite 580, Fairfax, VA 22030

The classical interpretation of the Boolean operators in an information retrieval system is in general too strict. A standard Boolean query rarely comes close to retrieving all and only those documents which are relevant to a query. Many models have been proposed with the aim of softening the interpretation of the Boolean operators in order to improve the precision and recall of the search results. This chapter discusses three such models: the Mixed Min and Max (MMM), the Paice, and the P-norm models. The MMM and Paice models are essentially variations of the classical fuzzy-set model, while the P-norm scheme is a distance-based approach. Our experimental results indicate that each of the above models provides better performance than the classical Boolean model in terms of retrieval effectiveness.

This model is based on the concept of fuzzy sets proposed by Zadeh (1965). In fuzzy-set theory, an element has a varying degree of membership, say* d _{A}*, to a given set

d_{A}_{B}= min(,_{dA})_{db}

d_{A}_{B}= max(,_{dA})_{db}

Q= (_{or}A_{1}or A_{2}or. . .or A) and_{n}

Q= (_{and}A_{1}and A_{2}and . . . and A)_{n},

the query-document similarity in the MMM model is computed as follows:

SlM(Q,_{or}D)= C1 * max(_{or}d1,_{A}d2, . . . ,_{A}d) +_{An}C2 * min(_{or}d1,_{A}d2, . . . ,_{A}d)_{An}

SlM(Q,_{and}D)= C1 * min(_{and}d1,_{A}d2, . . . ,_{A}d) +_{An}C_{and2}* max(d1,_{A}d2 . . . ,_{A}d)_{An}

The model proposed by Paice (1984) is also based on fuzzy-set theory. It is similar to the MMM model in that it assumes that there is a fuzzy set associated with each index term, and the document weight of a document with respect to an index term represents the degree of membership of the document with respect to the fuzzy set associated with that index term. However, while the MMM model considers only the maximum and the minimum document weights for the index terms while calculating the similarity, the Paice model takes into account all of the document weights.

Q= (_{or}A_{1}orA_{2}or...A) and_{n}

Q= (_{and}A_{1}andA_{2}and...andA)_{n}

the query-document similarity in the Paice model is computed as follows:

Besides allowing document weights for index terms, the P-norm model also allows query terms to have weights. In the P-norm model, a document D with weights *d _{A}*1,

In the P-norm model, the generalized queries are of the form:

Q= (_{or p}A_{1},a_{1})or(_{P}A_{2},a_{2})or..._{P}or(_{P}A,_{n}a), and_{n}

Q= (A_{and p}_{1},a_{1})and(_{P}A_{2},a_{2})and..._{P }and(_{P }A,_{n}a)._{n}

The query-document similarity for the P-norm model is calculated using the following formulae:

As in any other information retrieval system, we need to convert the given set of documents and queries into a standard format. We first discuss the data structures used in our implementation, and then follow up with a discussion of the actual algorithms.

There are several important data structures for both the document and query representations.

typedef struct

{

float doc_wts[NUM_ITERM]; /*document weight vector*/

} DOC_STRUCT

We assume that each query has an id associated with it, and is in the form of a tree with the operators as internal nodes and index terms as the leaves (see Figure 15.1). The data structure used to store a query is as follows:

typedef struct

{ long query_id; /* query id */

TREE_NODE *beg_node _array; /* array of query nodes */

ITERM_TUPLE *beg_iterm_array; /* array of index terms */

OP_TUPLE *beg_op_array; /* array of operators */

} QUERY_STRUCT

** **an array of the nodes in the query tree

The tree-node structure is as follows:

typedef struct

{

int iterm_op_index; /* index of index-term/operator for

leaf/non-leaf node resp. */

short child_index; /* index of the left child of the

current node */

short sibling_index; /* index of right sibling of the

current node */

} TREE_NODE /* structure of a node in the query tree */

The data structure that stores information about an index term is as follows:

typedef struct

{

long iterm_num; /* unique ID given to index-term */

float iterm_wt; /* weight of the index-term */

} ITERM_TUPLE /* structure of an index term */

The data structure for an operator is as follows:

typedef struct

{

int op_type; /* type ot operator -

AND_OP / OR_OP / NOT_OP */

long op_coeff; /* coefficient of the operator -

= 0.0 for NOT_OP

= C_or1 / C _and1 for MMM model,

= r value for Paice model, and

= P value for P-norm model */

float op_wt; /* wt. of the operator */

} OP_TUPLE /* structure of an operator */

If the MMM model is being used, then op_coeff is the C* _{or}*1 / C

Given a query and a set of documents in the standard form, we use the following algorithm to retrieve and rank documents:

For each document

Find similarity;

Sort similarities;

Display document / similarity data in descending order.

We update the values of two variables, *maximum *and* minimum* (both of type double), where

maximum= max(d1_{A}d2,...,_{A}d), and_{An}

minimum= min(d1_{A}d2,...,_{A}d)._{An}

We update the values of two variables, *numerator* and *denominator* (both of type double), where

For the MMM model, we can calculate the similarity for the node by using the following formula:

SIM(Q,D) = C*_{op}maximum+ (1 -C) *_{op}minimum,

SIM(Q,D)= 0.7 * 0.8 + (1 - 0.7) * 0.5 = 0.71

SIM(Q,D) = 0.7^{0}*0.8 + 0.7^{1}*0.6 + 0.7^{2}*0.5/0.7^{0}+ 0.7^{1 }+ 0.7^{2}

= 1.465/2.19 = 0.6689

SIM(Q)_{or,}D=(numerator/denominator)^{1/p_value}

SIM(Q)_{and,}D=1 - (numerator/denominator)^{1/p_value}

Index Term Numerator Denominator

1 A 0 + 0.5^{2}* 0.5^{2}0 + 0.5^{2}

= 0.0625 = 0.25

2 B 0.0625 + 0.5^{2}* 0.8^{2 }0.25 + 0.5^{2}

= 0.2225 = 0.50

3 C 0.225 + 0.5^{2}* 0.6^{2 }0.50 + 0.5^{2}

= 0.3125 = 0.75.

SIM(Q,_{or}D) = (0.3125/0.75)^{1/2}= 0.6455

The C programs that implement these algorithms are given in the appendix. The routine *cal_sim() *is called with the array containing the document weights which are of type float as the first parameter, the query vector which is of type QUERY_STRUCT as the second parameter, and the third argument sim_type which determines the type of similarity being computed, that is, MMM, Paice, or P-norm. Based on the value of the third parameter, one of the following sets of routines is called for further computations:

*update*_numer_denom*( ) &* *p_norm_sim( )*: P-norm model

Numerous experiments performed with the extended Boolean models have shown that they give much better performance than the standard Boolean model as reported by Lee & Fox (1988). Table 15.1 gives the percentage improvement in the average precision that these models showed over the standard Boolean model for three test collections: the CISI, CACM, and the INSPEC collections. As can be seen in the table, all three models show substantial improvement over the standard Boolean model. In fact, for the CACM and the INSPEC collections, all three models show more than 100 percent improvement.

Scheme CISI CACM INSPEC

------------------- ------------------ ------------------

Best Average Best Average Best Average

Precision Rank Precision Rank Precision Rank

-------------------------------------------------------------------

P-norm 79 1 106 2 210 1

Paice 77 2 104 3 206 2

MMM 68 3 109 1 195 3

The P-norm model is computationally expensive because of the number of exponentiation operations that it requires. The Paice model requires one to sort the similarities of the children. This adds to its computational overhead. The MMM model clearly has the least computational overhead of the three models.

COOPER, W. S. 1988. "Getting Beyond Boole." *Information Processing & Management* 24(3), 243-48.

ZADEH, L. A. 1965. "Fuzzy Sets," *Information and Control,* 8, 338-53.

/**

** Compute query-document similarity using MMM, Paice and P-norm

** formulas

**/

#include <stdio.h>

#include <math.h>

#define UNDEF-l

#define NOT_OP -2

#define AND_OP -3

#define OR_OP -4

#define MMM_SIM 1

#define PAICE_SIM 2

#define P_NORM_SIM 3

#define P_INFINITY 0.0 /* Used only for P-norm model */

typedef struct

{

long iterm_num; /* Unique ID given to the index-term */

float iterm_wt; /* Weight of index-term (P-norm queries only) */

} ITERM_TUPLE; /* Structure of an index-term in a query */

typedef struct

{

int op_type; /* operator type - AND_OP, OP_OP, NOT_OP */

float op_coeff; /* Coefficient of the operator

= 0.0 for NOT_OP,

= C_or1/C_and1 for MMM model,

= r value for Paice model, and

= P value for P-norm model */

float op_wt; /* Wt. of operator - (P-norm queries only) */

} OP_TUPLE; /* Structure of an operator in a query */

typedef struct

{

long iterm_op_index; /* If child_index = UNDEF then concept

index, else operator index */

short child_index; /* Current node's left child index */

short sibling_index; /* Current node's right sibling index */

} TREE_NODE; /* Structure of a node in the query tree */

typedef struct

{

long query_id; /* Query id */

TREE_NODE *beg_node_array; /* Array of query nodes */

ITERM_TUPLE *beg_iterm_array; /* Array of index terms */

OP-TUPLE *beg_op_array; /* Array of operators */

} QUERY-STRUCT; /* Structure of a query */

#define NUM_ITERMS 1000 /* Max. no. of index-terms

allowed in a document */

/***************************************************************************

double

calc_sim ( doc_wts, query_vec, sim_type )

Returns: The MMM / Paice / P-norm similarity between the given

document and query.

Purpose: To compute:

MMM_SIM (Doc, Query ) if sim_type = MMM_SIM, or

PAICE_SIM (Doc, Query ) if sim_type = PAICE_SIM, or

P_NORM_SIM (Doc, Query ) if sim_type = P_NORM_SIM,

Plan: Call the routine to calculate the similarity by passing

to it the index of the root of the query tree (= zero).

***************************************************************************/

double

calc_sim ( doc_wts, query_vec, sim_type )

float doc_wts[NUM_ITERMS]; /* In: doc. weights vector */

register QUERY_STRUCT *query_vec; /* In: Given Query vector */

short sim_type; /* In: Similarity type flag */

{

double calc_tree_sim ();

if ( ( sim_type != MMM_SIM ) && ( sim_type != PAICE-SIM ) &&

( sim_type != P_NORM_SIM ) )

fprintf ( stderr, "calc_sim: illegal similarity\

type %d, query %d\n", sim_type,

query_vec-> query_id );

return ( calc_tree_sim (doc_wts, query_vec, 0, sim_type ) );

}

/***************************************************************************

static double

calc_tree_sim ( doc_wts, query_vec, root_index, sim_type )

Returns: The similarity between the given document and the

subtree of the query tree with its root at ROOT_INDEX

Purpose: To compute: SIM (Doc, Query-subtree)

Plan: If the root of the subtree is an index-term then

similarity is the document wt. for that index-term

else, if it is an operator then,

if the operator == NOT_OP then

(1) Calculate the similarity of the child of

the current node (CHILD_VALUE)

(2) Return (1.0 - CHILD_VALUE)

else if the operator == AND_OP or OR_OP then

(1) Calculate the similarity considering

each child of the current node

(2) If (SIM_TYPE == MMM_SIM) then

Find the maximum and the minimum of

these similarities,

else if (SIM_TYPE == PAICE_SIM) then

Store this similarity in the array

child_sims,

else if (SIM_TYPE == P_NORM_SIM) then

Find the numerator and denominator

to be used in the formula

else

Print "invalid sim_type".

(3) Use the appropriate similarity computation

formula to compute the similarity for

the current node.

***************************************************************************/

static double

calc_tree_sim ( doc_wts, query_vec, root_index, sim_type)

float doc_wts [NUM_ITERMS]; /* In: Document weights */

register QUERY_STRUCT *query_vec; /* In: Query vector */

short root_index; /* In: Index of the root of the

subtree to be evaluated */

short sim_type; /* In: Similarity type flag */

{

register TREE_NODE *root_ptr; /* Addr. of the root of the

subtree being considered */

register TREE_NODE *child_ptr; /* Addr. of a child of the root

of subtree being considered */

register OP_TUPLE *op_ptr; /* Addr. of current operator */

register ITERM_TUPLE *iterm_ptr; /* Addr. of current index-term */

long doc_index; /* Index of the weight of the

current index-term within

DOC-WTS array */

short child_index; /* Index of the child of the root

within the set of tree nodes */

double child_value; /* Sim. of child's subtree */

/* Declarations for MMM model */

double maximum; /* Maximum (children's sim.) */

double minimum; /* Minimum (children's sim.) */

double mmm_sim (); /* Func. computing MMM sim. */

/* Declarations for Paice model */

double child_sims[NUM_ITERMS]; /* Array to store sim. of all

the children of the root */

int curr_i; /* Number of sim. stored in

child_sims */

double paice_sim (); /* Func. computing Paice sim. */

/* Declarations for P-norm model */

double child_wt; /* Wt. of the child */

double numerator; /* Numerator in sim. formula */

double denominator; /* Denominator in sim. formula */

void update_numer_denom(); /* Func. updating value of */

/* numerator & denominator */

double p_norm_sim (); /* Func. computing P-norm sim. */

double calc_tree_sim(); /* Recursive call for each subtree */

/* Addr. of first node of the tree + index of the current node

= the addr. of the root of the subtree under consideration */

root_ptr = query_vec->beg_node_array + root_index;

if UNDEF == root_ptr->child_index)

{

/* If node is an index-term, then return its doc-wt. Index in */

/* the doc_wts array is the ITERM_NUM of current index-term */

iterm_ptr = query_vec->beg_iterm_array + root_ptr->iterm_op_index;

doc-index = iterm_ptr->iterm_num;

return ( (double) doc_wts [doc_index] );

}

else

{ /* If current node is an operator, then compute its sim. */

op_ptr = query_vec->beg_op_array + root_ptr->iterm_op_index;

if ( ( op_ptr->op_type != NOT_OP ) &&

( op_ptr->op_type != OR_OP ) &&

( op_ptr->op_type != AND_OP ) )

{ /* if neither NOT, OR or AND */

fprintf ( stderr, "calc_tree sim:\

illegal operator type %d, query %d.\n",

op_ptr->op_type, query_vec->query_id);

return ( (double) UNDEF );

}

switch (op_ptr->op_type)

{

case (NOT_OP): /* If the operator is NOT_OP */

if (UNDEF != (query_vec->beg_node_array +

root_ptr->child_index)->sibling_index)

{ /* NOT_OP operator can have only one child */

fprintf(stderr,

"calc_tree_sim: NOT operator has more\

than one child.\n');

return ( (double) UNDEF);

}

/* if only child, return (1.0 - similarity of child) */

if ((double) UNDEF == (child_value =

calc_tree_sim ( doc_wts, query_vec,

root_ptr->child_index, sim_type)))

return ( (double) UNDEF );

return ( 1.0-child_value );

break;

case (OR_OP):

case (AND_OP): /* If the operator is OR_OP or AND_OP */

maximum = -99999.0;

minimum = 99999.0; /* Init. for MMM model */

curr_i = -1; /* Init. for Paice model */

numerator = 0.0;

denominator = 0.0; /* Init. for P-norm model */

/* Start with the first child of the current node, *

* consider each of its siblings, until none left */

for (child_index = root_ptr->child_index;

UNDEF != child_index;

child_index = (query_vec->beg_node_array +

child_index)->sibling_index )

{

if ( (double)UNDEF == (child_value =

calc_tree_sim ( doc_wts, query_vec, child_index,

sim_type)))

return ((double UNDEF);

switch (sim_type)

{

case (MMM_SIM): /* update max and mim */

maximum = (child_value > maximum) ? child_value

: maximum;

minimum = (child_value < minimum) ? child_value

: minimum;

break;

case (PAICE_SIM):

curr_i = curr_i + 1;

child_sims [curr_i] = child_value;

break;

case (P_NORM_SIM): /* Find the wt. of the child */

child_ptr = query_vec->beg_node_array +

root_ptr->child_index;

if (UNDEF == child_ptr->child_index)

child_wt = (query_vec->beg_iterm_array +

child_ptr->iterm_op_index)->iterm_wt;

else

child_wt = (query_vec->beg_op_array +

child_ptr->iterm_op_index)->op_wt;

update_numer_denom ( child_value, child_wt,

op_ptr->op_type, op_ptr->op_coeff,

&numerator, &denominator );

break;

} /* switch - sim_type */

} /* for */

/* After considering all the children, compute the *

* sim. of current subtree with appropriate formula */

if ( sim_type == MMM_SIM )

return ( mmm_sim ( op_ptr->op_coeff,

op_ptr->op_type, maximum, minimum));

else

if ( sim_type == PAICE_SIM )

return (paice_sim (op_ptr->op_coeff,

op_ptr->op_type, child_sims, curr_i ));

else

if ( sim_type == P_NORM_SIM )

return ( p_norm_sim (op_ptr->op_coeff,

op_ptr->op_type, numerator, denominator ) );

} /* switch - op_ptr->op_type */

} /* else */

}

/***************************************************************************

double

mmm_sim (coeff, type, maximum, minimum)

Returns: The MMM similarity

Purpose: To calculate the MMM similarity using

MAXIMUM and MINIMUM.

SIM(Qor, D) = coeff * MAXIMUM + (1-coeff) * MINIMUM

SIM(Qand, D) = coeff * MINIMUM + (1-coeff) * MAXIMUM

Plan: Depending on the type of the operator use the

appropriate formula

***************************************************************************/

double

mmm_sim ( op_coeff, op_type, maximum, minimum )

register float op_coeff; /* In: Value of the coefficient */

int op_type; /* In: Type of operator */

double maximum; /* In: Maximum of the similarities */

double minimum; /* In: Minimum of the similarities */

{

if ( op_type == OR_OP )

return ( (op_coeff * maximum + (1 - op_coeff) * minimum));

else

if ( op_type == AND_OP )

return ( (op_coeff * minimum + (1 - op_coeff) * maximum));

}

/***************************************************************************

double

paice_sim (op_coeff, op_type, child_sims, num_i)

Returns: The Paice similarity

Purpose: To calculate the Paice similarity using the

Paice similarity computation formula

Plan: Sort the array in ascending order

If the operator is OR_OP then:

numerator = SUM (child_sims[i] * op_coeff/\(num_i - i) ) ;

for i = 0 to num_i

denominator = SUM (op_coeff/\(num_i-i));

for i = 0 to num_i

else if the operator is AND_OP then:

numerator = SUM (child_sims[i] *op_coeff/\(i) );

for i = 0 to num_i

denominator = SUM (op_coeff/\(i) );

for i = 0 to num_i

***************************************************************************/

double

paice_sim ( op_coeff, op_type, child_sims, num_i )

register float op_coeff; /* In: Coefficient r */

int op_type; /* In: Type of operator */

double child_sims [NUM_ITERMS]; /* In: Array containing sim. */

int num_i; /* In: No. of elements in

child_sim */

{

int i;

double numerator;

double denominator;

double power;

void qsort (); /* Quick Sort Func. in C lib. */

int comp_double (); /* Func. used by qsort */

qsort ( (char *) child_sims, num_i,

sizeof (double), comp_double );

numerator = 0;

denominator = 0;

if ( op_type == OR_OP )

{

for (i = 0; i <=num_i; i++)

{

power = pow((double) op_coeff, (double)num_i-i);

numerator = numerator + power*child_sims[i];

denominator = denominator + power;

}

return (numerator/denominator);

}

else

if(op_type == AND_OP)

{

for (i = 0; i <= num_i; i++)

{

power = pow((double) op_coeff, (double)i);

numerator = numerator + power * child_sims[i];

denominator = denominator + power;

}

return ( numerator / denominator );

}

}

/***************************************************************************

int

comp_double ( d1, d2 );

Returns: int

Purpose: Compares d1 and d2 and returns -1, 0, or 1.

***************************************************************************/

int

comp_double (d1, d2)

double *d1, *d2;

{

if (*d1 < *d2)

return (-1);

else

if (*d1 == *d2)

return (0);

else

return (1);

}

/***************************************************************************

void

update_numer_denom (value, weight, op_type, p_value,

numerator, denominator)

Returns: Void

Purpose: Update the values of NUMERATOR and DENOMINATOR

Plan: If P_value == P_INFINITY then

if VALUE > NUMERATOR then

NUMERATOR = VALUE

DENOMINATOR = 1

else

if OP_TYPE == OR_OP then

NUMERATOR = NUMERATOR + weight/\P_value * value/\P_value

DENOMINATOR = DENOMINATOR + weight/\P_value

if OP_TYPE == AND_OP then

NUMERATOR = NUMERATOR + weight/\P_value *

(1 - value) /\P_value

DENOMINATOR = DENOMINATOR + weight/\P_value

***************************************************************************/

void

update_numer_denom ( value, weight, op_type, p_value,

numerator, denominator )

double value; /* In: The sim. of the child */

double weight; /* In: The query weight */

register float p_value; /* In: The p_value */

int op_type; /* In: The type of the operator */

register double *numerator; /* Out: Numerator in p-norm sim.

calculation formula */

register double *denominator; /* Out: Denominator in p-norm sim.

calculation formula */

{

double power

double pow ()

if (p_value == P_INFINITY)

{

*denominator = 1;

if (value > *numerator)

*numerator = value;

}

else

switch (op_type)

{

case (OR_OP):

power = pow (weight, p_value);

*numerator = *numerator + power * pow(value,p_value);

*denominator = *denominator + power;

break;

case (AND_OP):

power = pow (weight, p_value );

*numerator = *numerator + power * pow(1-value,p_value);

*denominator = *denominator + power;

break;

}

}

/***************************************************************************

double

p_norm_sim (p_value,op_type,numerator,denominator)

Returns: The P-norm similarity

Purpose: To calculate the P-norm similarity using the

P-norm similarity computation formula

Plan: Depending on the type of the operator use the

appropriate formula

SIM(Q (P_INFINITY) , D) = NUMERATOR

SIM (Qor_P, D) = (NUMERATOR / DENOMINATOR) /\ (1/P)

SIM (Qand_P, D) = 1 - (NUMERATOR / DENOMINATOR) /\ (1/P)

***************************************************************************/

double

p_norm_sim ( p_value, op_type numerator, denominator)

register float p_value; /* In: P value */

int op_type; /* In: Type of operator */

double numerator; /* In: Numerator */

double denominator; /* In: Denominator */

{

double pow ();

if (p_value == P_INFINITY)

return (numerator);

else

if (op_type == OR_OP)

return (pow (numerator / denominator, 1/p_value) );

else

if (op_type == AND_OP)

return (1 - pow (numerator / denominator, 1/p_value) );

}