- Three illustrations of the critical effect on program behavior of data structure selection are presented
- Huffman coding, a text compression and coding technique, is used to show
  - good algorithm development
  - the benefits of good data structure selection
- Optimal binary search trees, a related topic, are useful for storing static collections of information to be randomly accessed and traversed in sorted order; they demonstrate
  - the use of recursion
  - how to develop efficient programs based on recursion
- Nearly optimal binary search trees are considered to emphasize program development

Many important techniques are available for the compression of text. Text compression is important when dealing with large collections of information, since internal memory is scarce and access to external memory is slow. Huffman coding is an elegant method that can be used profitably when characters do not appear with equal frequency in a text. This is the case with the letters of the alphabet in English text.

The application of "simple" and balanced binary search trees for record storage and retrieval by key was demonstrated in Chapter 9. As shown there, "simple" binary search trees provide excellent average search times with random input when keys are accessed equally often. Also, AVL trees guarantee excellent worst-case search time. We shall see in this chapter that *optimal* binary search trees are useful when the keys have known, even unequal, frequencies, and when the trees, once grown, remain unchanged. Words in English text occur with known and unequal frequencies. An optimal binary search tree could be applied, for example, as a dictionary storing the 500 most frequently used English words, along with their definitions and their common misspellings. A search of the tree would then yield the definitions and misspellings. Once built, this tree would remain unchanged.

An important measure associated with binary trees is the weighted path length. This measure underlies the entire development in this chapter. Suppose a numeric weight is assigned to each node of a binary tree *T*. The meaning of these weights is dependent on the intended application. When the application of the binary tree is to text compression or text encoding, only the weights assigned to terminal nodes have meaning. Each terminal node corresponds to a different character that can appear in the text. The weight assigned to the node represents the relative number of occurrences of the character in the text. When the binary tree is used for record storage, each node stores a record with a particular key. The weight assigned to a node represents the frequency of search requests for the key corresponding to the node. These weights must be estimated or determined empirically for each specific application.

The **weighted path length of a node in** *T* is the product of its depth and its assigned weight. The **weighted path length of** *T*, *W*(*T*), is the sum of the weighted path lengths of all nodes of *T*. As an example, consider the binary tree in Figure 12.1(a), with assigned weights indicated for each node. Its weighted path length is

(1 × 3) + (2 × 17) + (2 × 10) + (3 × 15) + (3 × 0) + (3 × 20) + (4 × 31) + (4 × 4)

or 302. This was calculated directly from the definition of the weighted path length.

Another way to determine the weighted path length is to determine first the value of each node of the tree. The *value* of a node is the sum of the assigned weights of all nodes in the subtree with that node as root. Thus the root of a tree is assigned a value that is the sum of all the weights of the tree. The example would have values assigned to each node as indicated in Figure 12.1(b). The weighted path length of the tree is then calculated by adding up the values of the nodes. Thus, from Figure 12.1(b), the weighted path length is

100 + 67 + 30 + 15 + 35 + 20 + 31 + 4 = 302
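Both calculations can be sketched in code. The following is a minimal illustration (the `Node` class and function names are ours, not the text's); depths are counted from 1 at the root, as in the example, and the tree built below has the same weights at the same depths as the example tree, so both methods give 302.

```python
class Node:
    def __init__(self, weight, left=None, right=None):
        self.weight = weight
        self.left = left
        self.right = right

def wpl_by_depth(node, depth=1):
    """Method 1: sum of (depth x weight) over all nodes -- the definition."""
    if node is None:
        return 0
    return (depth * node.weight
            + wpl_by_depth(node.left, depth + 1)
            + wpl_by_depth(node.right, depth + 1))

def subtree_value(node):
    """The 'value' of a node: sum of the weights in its subtree."""
    if node is None:
        return 0
    return node.weight + subtree_value(node.left) + subtree_value(node.right)

def wpl_by_values(node):
    """Method 2: sum of the values of all nodes (recomputed naively here)."""
    if node is None:
        return 0
    return subtree_value(node) + wpl_by_values(node.left) + wpl_by_values(node.right)

# Weights 3, 17, 10, 15, 0, 20, 31, 4 at depths 1, 2, 2, 3, 3, 3, 4, 4.
t = Node(3,
         Node(17, Node(15), Node(0)),
         Node(10, None, Node(20, Node(31), Node(4))))
```

The two methods must always agree: counting each node's weight once per ancestor (including itself) is the same as summing every subtree's total weight.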

With this background, let us turn to Huffman coding itself. First we will look at how information may be coded based on a binary tree and how it is coded in computers. Consider a binary tree with *n* terminal nodes. Assign a zero to every left successor branch and a one to every right successor branch of the tree. *Encode* each terminal node with the sequence of 0's and 1's encountered when following the path from the root to the terminal node. Since this path is unique, the sequence or code assigned to each node must be unique. For example, in binary tree *T*1 (shown in Figure 12.2) the twenty-six letters of the alphabet and the space character were assigned to the terminal nodes. Thus the code for the leftmost terminal node (*N*) in Figure 12.2 is the sequence 00000. Any character, word, or information associated with a terminal node is encoded with the unique code for that node.
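This path-based encoding can be sketched as follows. The tree representation and names here are ours, not the text's; the sketch assumes a full binary tree, in which every internal node has both successors.

```python
class Node:
    def __init__(self, char=None, left=None, right=None):
        self.char = char          # set only at terminal nodes
        self.left = left
        self.right = right

def build_codes(node, path="", table=None):
    """Record, for each terminal node, the 0/1 path from the root to it."""
    if table is None:
        table = {}
    if node.left is None and node.right is None:    # terminal node
        table[node.char] = path
        return table
    build_codes(node.left, path + "0", table)       # 0 on every left branch
    build_codes(node.right, path + "1", table)      # 1 on every right branch
    return table

# A tiny tree: each code is unique because each root-to-leaf path is unique.
t = Node(left=Node(left=Node('A'), right=Node('B')), right=Node('C'))
```

For this tree, `build_codes(t)` assigns `'00'` to A, `'01'` to B, and `'1'` to C.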


The tree *T*1 generates a *variable-length code*, since it does not assign the same length sequences of 0's and 1's to each character.

The character set of a computer is the alphabet of characters that it can accept. For example, not all computers accept braces (that is, {}). Two important standard character sets are ASCII (American Standard Code for Information Interchange) and EBCDIC (Extended Binary Coded Decimal Interchange Code). Both use fixed-length codes. ASCII uses a seven-bit code and EBCDIC uses eight bits. An *n*-bit fixed-length code can distinguish 2^{n} characters, and a sequence of such codes is decoded by simply taking successive groups of *n* bits.

Using a variable-length code, a text in which the characters occur with different frequencies can be compressed. Shorter-length sequences would be assigned to characters used more frequently, and longer-length sequences to characters used less frequently. This is the idea behind the Morse code. Morse, in fact, estimated the relative frequencies of characters by looking at the frequency with which printer's boxes containing the different characters of the alphabet needed to be refilled.

To decode a sequence of 0's and 1's:

1. Start at the root of the tree.

2. Read the next bit of the sequence; branch to the left successor on a 0 and to the right successor on a 1.

3. When a terminal node is reached, output the character associated with it.

4. Start again at the root to decode the remainder of the sequence in the same way.
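The decoding procedure can be sketched directly. The tuple representation of trees here is ours (`('leaf', char)` for terminal nodes, `('node', left, right)` for internal nodes), not the text's.

```python
def decode(bits, root):
    """Decode a string of 0's and 1's by repeatedly walking root-to-leaf
    paths, restarting at the root after each character is emitted."""
    out, node = [], root
    for b in bits:
        node = node[1] if b == '0' else node[2]   # left on 0, right on 1
        if node[0] == 'leaf':                     # terminal node reached
            out.append(node[1])
            node = root                           # start again at the root
    return ''.join(out)

# A: 00, B: 01, C: 1 in this small tree.
t = ('node', ('node', ('leaf', 'A'), ('leaf', 'B')), ('leaf', 'C'))
decode('00011', t)  # → 'ABC'
```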

The relative frequencies of the twenty-seven characters (the space character and the letters A through Z) are:

space  A   B   C   D   E    F   G   H   I   J   K   L
 186   64  13  22  32  103  21  15  47  57  1   5   32

 M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z
 20  57  63  15  1   48  51  80  23  8   18  1   16  1

These relative frequencies are used to assign weights to the nodes of a tree such as *T*1. Each internal node is assigned weight 0, and each terminal node is assigned the relative frequency of the character associated with it. Thus, node *S* of *T*1 is assigned weight 51. The *average amount of compression* is measured by the weighted path length of the tree. In this way, the amount of compression achieved with different codes can be compared, since each tree has associated with it a unique code.

The programmer's goal should be to find a binary tree, called the **optimal binary tree**, which has minimum weighted path length when the characters of the alphabet are stored at its *n* terminal nodes. Incidentally, the references here are to characters associated with the terminal nodes, but more generally they can be words or messages. A straightforward algorithm for finding this optimal binary tree is to generate each binary tree with *n* terminal nodes, associate the alphabetic characters with these *n* nodes, and calculate the weighted path length of the tree. A tree that yields the minimal weighted path length is optimal. Unfortunately, this algorithm is feasible only for small *n*. If *n* = 13, there are more than 10^{12} such trees, and the number of trees increases by a factor of almost 4 each time *n* increases by 1. We have to find a better way.

An optimal binary tree generates what is called a *Huffman code*. Fortunately, the exhaustive search just described is not needed to find it; the tree can be constructed as follows.

The algorithm chooses the combination of current trees that yields a minimal value for the combined trees. It is in this sense that it is a "greedy" algorithm. It doesn't consider future ramifications of the current choice of subtrees to combine but takes the current two "best" subtrees. In this instance, best means "of smallest value." The algorithm makes a "local" best or greedy decision. It is an example of the *greedy* method of algorithm design, which involves making choices on the basis of the immediate best alternative. This may lead to later choices being constrained to such poor alternatives as to lead to an overall suboptimal result. Being less greedy now may lead to better choices later, which in turn may yield better overall results. Hence it is not obvious that taking the best possible next combination, as done here, must lead to an optimal tree.
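The greedy combining step can be sketched with a heap of current trees, so that the two trees of smallest value are always accessible. The tuple tree representation and names are ours; the accumulated sum of combined values gives the weighted path length, as noted later in the chapter.

```python
import heapq
from itertools import count

def huffman(weights):
    """Build a Huffman tree from a dict mapping character -> weight.
    Returns (tree, weighted path length); trees are ('leaf', char)
    or ('node', left, right) tuples."""
    tie = count()          # tie-breaker so heapq never compares tree tuples
    heap = [(w, next(tie), ('leaf', c)) for c, w in weights.items()]
    heapq.heapify(heap)
    wpl = 0
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # the two current trees of
        w2, _, t2 = heapq.heappop(heap)   # smallest value ("best" subtrees)
        wpl += w1 + w2                    # accumulate the combined value
        heapq.heappush(heap, (w1 + w2, next(tie), ('node', t1, t2)))
    return heap[0][2], wpl
```

For the weights 1, 2, 3, 4, 5 of Exercise 3, the combines are 1+2 = 3, 3+3 = 6, 4+5 = 9, and 6+9 = 15, so the accumulated sum gives weighted path length 33.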

Let us now consider how the binary tree constructed by the Huffman algorithm should be stored in memory. To this end we consider an example that shows that even though the weighted path length is as small as possible for a given set of weights, the depth of the tree can still be great.

Suppose that the first *n* Fibonacci numbers are assigned as weights to characters of an alphabet represented as terminal nodes in a tree. The task is to generate a Huffman code.

For those readers who would like a proof of the Huffman construction, the following is an interesting example of the use of mathematical induction applied to trees. It can easily be adapted to prove that a recursive program for the Huffman algorithm is correct.

So far we have been considering binary trees with nonzero weights assigned only to terminal nodes. Information (a character or message) has been associated only with their terminal nodes. We now consider *extended binary search trees*, which have keys stored at their internal nodes.

An obvious way to find an optimal binary search tree is to generate each possible binary search tree for the keys, calculate its weighted path length, and keep that tree with the smallest weighted path length. This search through all possible solutions is not feasible except for small *n*, since the number of such trees grows exponentially with *n*. A feasible alternative would be a recursive algorithm based on the structure inherent in optimal binary search trees. Such an algorithm can be developed in a way similar to that used for the construction of Fibonacci trees in Chapter 7.

Denote the weights assigned to the *n* stored keys by β's and the weights assigned to the terminal nodes by α's. The weight assigned to key *k _{i}* is β_{i}.

Finally the algorithm can be stated.

For each *k _{k}* as root, *k* = 1, 2, . . . , *n*, find obst(1, k-1) and obst(k+1, n).

Find a *k* that minimizes [*w*(1, *k* - 1) + *w*(*k* + 1, *n*)] over *k* = 1, 2, . . . , *n*.

Obst(1, n) is given by the tree with root *k _{k}*, obst(1, k-1) as its left subtree, and obst(k+1, n) as its right subtree.

The weighted path length of obst(1, n) is *w*(1, *n*), where

*w*(1, *n*) = *sw*(1, *n*) + *w*(1, *k* - 1) + *w*(*k* + 1, *n*).
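This recursive structure can be sketched directly with memoization; the function names and indexing are ours. Here sw(i, j) denotes the total weight of obst(i, j), with w(i, i-1) = 0 for empty trees; alpha holds the n+1 terminal-node weights and beta the key weights (beta[0] is an unused placeholder so that keys are numbered from 1).

```python
from functools import lru_cache

def obst_wpl(alpha, beta):
    """w(1, n): weighted path length of the optimal binary search tree."""
    n = len(beta) - 1

    def sw(i, j):
        # Total weight of obst(i, j): alpha_{i-1} plus each beta_m + alpha_m.
        return alpha[i - 1] + sum(beta[m] + alpha[m] for m in range(i, j + 1))

    @lru_cache(maxsize=None)
    def w(i, j):
        if j < i:                       # empty tree: w(i, i-1) = 0
            return 0
        # Try each k_k as root; left subtree obst(i, k-1), right obst(k+1, j).
        return sw(i, j) + min(w(i, k - 1) + w(k + 1, j)
                              for k in range(i, j + 1))

    return w(1, n)
```

Memoization matters because, as the table of replicated components below shows, the same subproblems recur many times in the plain recursion.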

Replicated Component              Occurrences
----------------------------------
(α_{2}, β_{3}, α_{3}, β_{4}, α_{4})        2
(α_{0}, β_{1}, α_{1}, β_{2}, α_{2})        2
(α_{1}, β_{2}, α_{2}, β_{3}, α_{3})        2
(α_{0}, β_{1}, α_{1})                      4
(α_{1}, β_{2}, α_{2})                      5
(α_{2}, β_{3}, α_{3})                      5
(α_{3}, β_{4}, α_{4})                      4
(α_{0})                                    4
(α_{1})                                    3
(α_{2})                                    4
(α_{3})                                    3
(α_{4})                                    4

(α_{i-1}) and (α_{i}, β_{i+1}, . . . , β_{j}, α_{j}), i.e., obst(i, i-1) and obst(i+1, j)

(α_{i-1}, β_{i}, α_{i}) and (α_{i+1}, β_{i+2}, . . . , β_{j}, α_{j}), i.e., obst(i, i) and obst(i+2, j)

(α_{i-1}, β_{i}, . . . , β_{j-1}, α_{j-1}) and (α_{j}), i.e., obst(i, j-1) and obst(j+1, j)

This algorithm requires *O*(*n*^{3}) execution time and *O*(*n*^{2}) storage.

The bottom-up algorithm, incorporating the new minimization limits, may now be written.

1. For each *i* = 0, 1, . . . , *n*:

   a. Set *w*(*i* + 1, *i*) to 0 and *sw*(*i* + 1, *i*) to α_{i}.

   b. For each *j* = *i* + 1, . . . , *n*, set *sw*(*i* + 1, *j*) to *sw*(*i* + 1, *j* - 1) + β_{j} + α_{j}.

2. For each *j* = 1, 2, . . . , *n*:

   a. Set *w*(*j*, *j*) to *sw*(*j*, *j*) and *r*(*j*, *j*) to *j*.

(This initializes all obst's containing 0 keys and 1 key.)
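A bottom-up sketch following these steps is given below; the array bounds and loop structure are our reconstruction. The table r(i, j) records the root of obst(i, j), so the optimal tree itself can be rebuilt from r after the tables are filled.

```python
def obst(alpha, beta):
    """Bottom-up optimal binary search tree tables.
    alpha[0..n] are terminal weights; beta[1..n] are key weights
    (beta[0] is an unused placeholder).  Returns the w and r tables."""
    n = len(beta) - 1
    sw = [[0] * (n + 1) for _ in range(n + 2)]
    w = [[0] * (n + 1) for _ in range(n + 2)]
    r = [[0] * (n + 1) for _ in range(n + 2)]
    # Step 1: sw(i+1, i) = alpha_i, w(i+1, i) = 0 (already 0), then
    # accumulate sw(i+1, j) = sw(i+1, j-1) + beta_j + alpha_j.
    for i in range(n + 1):
        sw[i + 1][i] = alpha[i]
        for j in range(i + 1, n + 1):
            sw[i + 1][j] = sw[i + 1][j - 1] + beta[j] + alpha[j]
    # Trees with 1, 2, ..., n keys; length 1 gives w(j, j) = sw(j, j), r(j, j) = j.
    for length in range(1, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            best, k = min((w[i][k - 1] + w[k + 1][j], k)
                          for k in range(i, j + 1))
            w[i][j] = sw[i][j] + best
            r[i][j] = k
    return w, r
```

Applied to the weights of Exercise 12 (α's 10, 6, 8, 4, 2 and β's 2, 5, 7, 6), w(1, 4) and r(1, 4) give the optimal weighted path length and root.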

To date, no one knows how to construct the *optimal* binary search tree using less time or storage in the general case of nonzero α's and β's. For this reason it is important to develop algorithms that produce good, but not necessarily optimal, solutions with less time and storage being required. Algorithms have been found that construct **nearly optimal binary search trees**. The weighted path lengths of such trees cannot differ by more than a constant factor from the optimal value.

One such algorithm produces *greedy binary search trees*. These trees are constructed by an algorithm using the greedy method. Unlike the Huffman algorithm, which is also based on the greedy method, they are not guaranteed to be optimal, but they are guaranteed to be nearly optimal. Other distinct, nearly optimal, tree constructions are found in Mehlhorn [1975, 1977] and in Horibe and Nemetz [1979]. A comparison of all these and other algorithms for the generation of nearly optimal binary search trees appears in Korsh [1982]. The proof that the following greedy method yields nearly optimal trees is in Korsh [1981].

To start the development, we use the α's and β's of Example 12.3 and array them as shown in Figure 12.14(a). In an inorder traversal of *any* binary search tree with the six keys, the weights assigned to the accessed nodes would occur in the order α_{0}, β_{1}, α_{1}, β_{2}, α_{2}, β_{3}, and so on. This sequence looks like

β_{1} β_{2} β_{3} β_{4} β_{5} β_{6}
 10    3    9    2    0    10

α_{0} α_{1} α_{2} α_{3} α_{4} α_{5} α_{6}
  5    6    4    4    3    8    0

The *greedy* tree thus constructed by always taking a minimal triple is shown in Figure 12.15. Its weighted path length can be found by keeping an accumulated sum, as in the Huffman algorithm. In this case the greedy algorithm actually yields the optimal tree.

Just as in the Huffman tree construction, if we are not careful we will end up with an *O*(*n*^{2}) time implementation. Instead, we can use a heap similar to that for the Huffman construction to obtain an *O*(*n* lg *n*) time requirement and *O*(*n*) storage. However, we can do even better.

We start with a list-structure illustrated in Figure 12.16. Each record represents a potential triple to be combined in the construction of a greedy tree. Traverse the list starting at the left and search for a *locally* minimal triple -- that is, a triple whose value does not exceed the value of its successor triple. No triple to the left of this locally minimal triple can be combined before it in any greedy tree construction. In addition, any future combining of locally minimal triples to its right cannot cause the resultant tree to fail to be greedy. In other words, combining the minimal triple always allows a greedy tree to be constructed with the combined minimal triple as a subtree. Combine the locally minimal triple (say it is the *i*th), and obtain the new list-structure shown in Figure 12.16(b).
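A simple quadratic sketch that follows the greedy-tree definition directly (repeatedly combine a minimal triple) is shown below. The representation and names are ours, and this version scans for a globally minimal triple each time; the list-structure just described improves this by combining *locally* minimal triples, leading to the better implementation.

```python
def greedy_bst(alpha, beta, keys):
    """Greedy binary search tree by repeatedly combining a minimal triple.
    alpha: n+1 terminal weights; beta: n key weights; keys: the n keys.
    Returns (tree, weighted path length); trees are (key, left, right)
    tuples with None for terminal nodes."""
    gaps = [(w, None) for w in alpha]       # current (value, subtree) pairs
    pending = list(zip(beta, keys))         # keys not yet placed in a subtree
    wpl = 0
    while pending:
        # Triple i is (gap i, key i, gap i+1); take one of minimal value.
        i = min(range(len(pending)),
                key=lambda t: gaps[t][0] + pending[t][0] + gaps[t + 1][0])
        bw, key = pending.pop(i)
        lv, lt = gaps[i]
        rv, rt = gaps.pop(i + 1)
        value = lv + bw + rv
        wpl += value          # accumulated sum, as in the Huffman algorithm
        gaps[i] = (value, (key, lt, rt))    # combined triple becomes one gap
    return gaps[0][1], wpl
```

For the weights of Exercise 12 (α's 10, 6, 8, 4, 2 and β's 2, 5, 7, 6), this combines triples of values 12, 18, 27, and 50 in turn.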

Consider the twenty-seven characters with relative frequencies given in Section 12.3 as weights. Order them as space < *A* < *B* < *C* < . . . < *Z*. Suppose we want to construct an extended binary *search* tree with these characters associated with the terminal nodes. We take the weights given there as the α's of this problem and take the β's to be zero. The extended binary tree will then generate what is called an *alphabetic code*. The optimal binary search tree will have weighted path length 5,200, and the greedy binary search tree will have weighted path length 5,307. Recall that the Huffman tree had weighted path length 5,124. Requiring use of a binary search tree rather than a binary tree results in an increase relative to 5,124 of only (5,200 - 5,124)/5,124 or 1.5 percent. The greedy construction yields an increase relative to 5,200 of (5,307 - 5,200)/5,200 or 2 percent.

**1. a.** Using its definition, calculate the weighted path length of the following tree. The weight of a node appears at the node.

**2. **Write a process function to turn a postorder traversal into a function for determining the weighted path length of a binary tree *T* based on the method of Exercise 1(b).

**3. **Determine the Huffman code (tree) for weights 1, 2, 3, 4, 5.

**4. **Determine the weighted path length for your solution to Exercise 3.

**5. **The Huffman tree for *n* weights will always have (*n* - 1) internal nodes and *n* terminal nodes. This is because it is a *full* binary tree. That is, every node has either zero or exactly two successors. Prove, by induction on the total number of nodes of a full binary tree, that this is true.

**6. **Write a nonrecursive function, huffman, with parameters n, weights, tree, and code. Weights is an array containing n weights, code is an array returned containing the n codes (sequences of 0's and 1's) assigned to each weight by the Huffman code, and tree points to the root of the corresponding Huffman tree upon return. Your function should take *O*(*n* lg *n*) time.

**7. **Write a recursive version of the function of Exercise 6.

**8. **Simulate the proof (by induction on *n*) that the Huffman tree yields a binary tree with minimal weighted path length for *n* weights for Exercise 3.

**9. **Write a function decode with parameters n, alpha, and tree. Alpha is a sequence of n 0's and 1's, and tree points to a Huffman tree for the twenty-seven characters of Section 12.3 (determined by their weights). Decode is to decode the sequence of n 0's and 1's into the corresponding characters. The sequence represents a coded version of a message composed of the characters.

**10. **Suppose you know the Huffman tree for the twenty-seven characters of Section 12.3 (determined by their weights). Write a function encode to encode a message composed of characters into the Huffman code. To do this you might consider using the following data structures:

Discuss the relative advantages and disadvantages for each of these in the function encode.

**11. **Write a function encode to encode a message in a Huffman code.

**12. **Let α_{0} = 10, α_{1} = 6, α_{2} = 8, α_{3} = 4, α_{4} = 2, β_{1} = 2, β_{2} = 5, β_{3} = 7, β_{4} = 6. Determine the optimal binary search tree by generating all possible binary search trees storing the four keys with weights 2, 5, 7, and 6, and calculating their weighted path lengths.

**13. **The recursive approach of Section 12.4.1 is based on the idea that an optimal binary search tree must have an optimal left and an optimal right subtree. Can you use this idea to show that some of the trees you considered in Exercise 12 could have been ignored in searching for the optimal tree?

**14. **Simulate the bottom-up algorithm of Section 12.4.1 to reproduce the arrays of Figure 12.11.

**15. **Why do we want to find nearly optimal binary search trees rather than optimal binary search trees for large numbers of keys?

**16. **Find the greedy binary search tree for the weights of Exercise 12 using the definition of a greedy tree. Determine the relative difference between its weighted path length and the optimal value.

**17. **Find the greedy binary search tree for the weights of Exercise 12 using the implementation of the greedy tree construction. Is its weighted path length the same as the greedy tree of Exercise 16?

**18. **Write a nonrecursive function greedy to implement the greedy tree construction of Section 12.5.3.

**19. **Give a recursive definition of a greedy binary search tree.

**20. **Write a recursive function greedy to implement the greedy tree construction of Section 12.5.3.

**21. **Another method of constructing a nearly optimal binary search tree is to select the key to be placed at the root to be a key that minimizes the difference between the weights of its left and right subtrees. Having obtained the root, apply this procedure recursively to construct its left and right subtrees. This is a top-down construction, as opposed to the greedy tree construction, which is bottom-up. What tree does it generate for the weights of Exercise 12 and for the example of Section 12.5.2?

**22. **Write a recursive function for the construction method of Exercise 21. How much time does your function require?

**23. **Can you find a way to implement the procedure of Exercise 21 in *O*(*n*) time, where *n* is the number of internal keys to be stored?

**24. **What will a greedy tree look like if the *n* β's are zero and the *n* + 1 α's are given by the first *n* + 1 Fibonacci numbers? Give two answers--one when the greedy tree definition is used, and one when the *O*(*n*) implementation is used.

**1. a.** Write programs to find and output optimal and greedy trees and their weighted path lengths.

**2. a.** Write a function to read a file composed of the twenty-six letters of the alphabet, plus the blank character, the comma, and the period; produce an output file that is the encoded version of the input file. The Huffman code is to be used, with weights based on the number of times each character appears in the input text.

**c. **Run your program for part (b) with the table implemented as

ii. a hash table using double hashing