Huffman Encoding
Before
we proceed to encoding, we need to construct a Huffman tree for the input text.
The algorithm for the construction of the Huffman tree is described and illustrated
with an example below. I have also implemented this method in HuffmanCode.java.
Even though you are not writing this method, I recommend you read through the
description and study the implementation details in HuffmanCode.java.
For our input text, “who are you”, we have the following table (the first row is the space character):

Character   Frequency
(space)     2
a           1
e           1
h           1
o           2
r           1
u           1
w           1
y           1
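Counting these frequencies takes a single pass over the input text. Below is a minimal sketch; the class and method names are mine, not taken from HuffmanCode.java:

```java
import java.util.Map;
import java.util.TreeMap;

public class FrequencyCount {
    // Count how often each character occurs in the text.
    static Map<Character, Integer> countFrequencies(String text) {
        Map<Character, Integer> freq = new TreeMap<>();
        for (char c : text.toCharArray()) {
            freq.merge(c, 1, Integer::sum);  // add 1, or start at 1 if absent
        }
        return freq;
    }

    public static void main(String[] args) {
        // "who are you" contains two spaces and two o's; every other character once.
        System.out.println(countFrequencies("who are you"));
    }
}
```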
We will create a binary tree (Tree.java) for each character. To begin with, we will have 9 binary trees, each with a single node (the root). In the root of every binary tree, we will store the frequency and the value of the character. These nine binary trees will be stored in a Priority Queue of type Tree.
A Priority Queue is a data structure in which, regardless of the order of insertion, the element with the smallest value is always removed first. Typically, a minimum binary heap is utilized to implement a Priority Queue (refer to the unit on binary heaps). Hence, the binary trees will be removed from the queue in ascending order of their frequencies. The algorithm below utilizes a greedy strategy to build the Huffman tree. A greedy algorithmic strategy makes the locally optimal choice at every step with the ultimate aim of finding the globally optimal solution.
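Java’s built-in java.util.PriorityQueue behaves exactly this way, using a minimum binary heap under the hood. A quick illustration:

```java
import java.util.PriorityQueue;

public class PQDemo {
    // Insert values in any order; poll() always removes the smallest first.
    static String drainAscending(int... values) {
        PriorityQueue<Integer> pq = new PriorityQueue<>();  // min-heap by default
        for (int v : values) pq.add(v);
        StringBuilder order = new StringBuilder();
        while (!pq.isEmpty()) {
            if (order.length() > 0) order.append(' ');
            order.append(pq.poll());
        }
        return order.toString();
    }

    public static void main(String[] args) {
        System.out.println(drainAscending(5, 1, 3));  // 1 3 5
    }
}
```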
We will now execute the following steps:
1) Remove the two binary trees with the smallest frequencies from the queue. Let’s call them A and B.
2) Create a new binary tree, C, and connect A and B as the left and right children of C.
3) The frequency value in C will be the sum of the frequency values in A and B.
4) The character value in C will be the concatenation of the character values in A and B.
5) Insert C back into the queue.
We will execute Steps 1) – 5) until we are left with a single binary tree in the queue. The step-by-step output for the above example is shown in the link below:
· https://nhrec.org/~schinnab/Huffman_Encoding_Example.txt
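The loop in Steps 1) – 5) can be sketched with java.util.PriorityQueue as below. The Node class is a hypothetical stand-in for the course’s Tree.java, and tie-breaking among equal frequencies is arbitrary, so individual codes may differ from the linked example even though the total encoded length comes out the same:

```java
import java.util.PriorityQueue;

public class HuffmanBuild {
    // Hypothetical stand-in for the course's Tree.java (not shown here).
    static class Node {
        String chars;   // character value(s) stored at this node
        int freq;       // combined frequency
        Node left, right;
        Node(String chars, int freq, Node left, Node right) {
            this.chars = chars; this.freq = freq; this.left = left; this.right = right;
        }
    }

    // Steps 1) - 5): repeatedly merge the two lowest-frequency trees.
    static Node buildTree(String text) {
        PriorityQueue<Node> pq = new PriorityQueue<>((a, b) -> a.freq - b.freq);
        // One single-node tree per distinct character.
        text.chars().distinct().forEach(c -> {
            int f = (int) text.chars().filter(x -> x == c).count();
            pq.add(new Node(String.valueOf((char) c), f, null, null));
        });
        while (pq.size() > 1) {
            Node a = pq.poll();                 // step 1: remove the two smallest
            Node b = pq.poll();
            // steps 2-4: new parent with summed frequency and concatenated characters
            Node c = new Node(a.chars + b.chars, a.freq + b.freq, a, b);
            pq.add(c);                          // step 5: reinsert
        }
        return pq.poll();
    }

    public static void main(String[] args) {
        Node root = buildTree("who are you");
        // The root's frequency is the text length; its value holds every distinct character.
        System.out.println(root.freq + " " + root.chars.length());
    }
}
```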
Pseudocode for the “codeGen” method in Tree.java:
Once we have completed the construction of the Huffman tree as shown in the above link, we can observe that all the characters in the input text are stored in the leaves. At this point, we are ready to compute the binary code for every character in the input text. Note that the binary codes are simply strings which contain 1’s and 0’s as their values.
To compute the binary code for a character, C, in the input text, we will follow the algorithm below:
1) Create a StringBuilder variable, S, to store the binary code for C. It starts off empty.
2) Start at the root and check whether its left or right child has C as one of the characters in its value:
a. If the left child contains C, append a “0” to S and move to the left child; otherwise, append a “1” to S and move to the right child.
3) Repeat Step 2) from the current node until you reach the leaf node that contains C as its only value.
4) Return S.
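The algorithm above can be sketched as follows. The Node class here is a hypothetical stand-in for Tree.java, and the tiny hand-built tree in main exists only to exercise the method:

```java
public class CodeGen {
    // Hypothetical stand-in for the course's Tree.java (not shown here).
    static class Node {
        String chars;   // character value(s) stored at this node
        Node left, right;
        Node(String chars, Node left, Node right) {
            this.chars = chars; this.left = left; this.right = right;
        }
    }

    // Steps 1) - 4): walk from the root toward the leaf holding c,
    // appending "0" for each left turn and "1" for each right turn.
    static String codeGen(Node root, char c) {
        StringBuilder s = new StringBuilder();              // step 1
        Node cur = root;
        while (cur.left != null || cur.right != null) {     // step 3: until a leaf
            if (cur.left != null && cur.left.chars.indexOf(c) >= 0) {
                s.append('0');                              // step 2a: left child holds c
                cur = cur.left;
            } else {
                s.append('1');                              // otherwise go right
                cur = cur.right;
            }
        }
        return s.toString();                                // step 4
    }

    public static void main(String[] args) {
        // A tiny hand-built tree, ((a, e), o), just for illustration.
        Node ae = new Node("ae", new Node("a", null, null), new Node("e", null, null));
        Node root = new Node("aeo", ae, new Node("o", null, null));
        System.out.println(codeGen(root, 'a'));  // 00
        System.out.println(codeGen(root, 'o'));  // 1
    }
}
```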
Following the above algorithm yields the following binary codes for the characters in the input text (the first row is the space character):

Character   Code   Frequency
(space)     111    2
a           000    1
e           001    1
h           100    1
o           01     2
r           1011   1
u           1010   1
w           1101   1
y           1100   1
We can now scan through our input text, perform a look-up in the above table, and encode every single character. The input text, “who are you”, will be encoded as below:
11011000111100010110011111100011010
There are a total of 35 bits (4 bytes + 3 bits) in the above encoding. This is a substantial savings compared to allocating a byte for every character in the input text, which would require 11 bytes (88 bits).
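The scan-and-look-up step can be sketched as below; the code table is hard-coded from the worked example above:

```java
import java.util.HashMap;
import java.util.Map;

public class Encode {
    // Code table taken from the worked example above.
    static final Map<Character, String> CODES = new HashMap<>();
    static {
        CODES.put(' ', "111");  CODES.put('a', "000"); CODES.put('e', "001");
        CODES.put('h', "100");  CODES.put('o', "01");  CODES.put('r', "1011");
        CODES.put('u', "1010"); CODES.put('w', "1101"); CODES.put('y', "1100");
    }

    // Look up every character and concatenate its code.
    static String encode(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            out.append(CODES.get(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String bits = encode("who are you");
        System.out.println(bits);            // 35 bits total
        System.out.println(bits.length());
    }
}
```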
Observe that the higher the frequency of a character, the shorter its binary code. Why? In Steps 1) through 5) of the tree-construction algorithm, we always merge the two binary trees with the least frequency. For this reason, the characters with the least frequency will end up farthest from the root and hence will have longer codes. The characters with the highest frequency will end up closer to the root and will have shorter codes. In summary, Huffman’s algorithm, as explained above, is able to generate shorter codes for the most frequent characters at the expense of longer codes for the least frequent characters.