Huffman Encoding
Before
we proceed to encoding, we need to construct a Huffman tree for the input text.
The algorithm for the construction of the Huffman tree is described and illustrated
with an example below. I have also implemented this method in HuffmanCode.java.
Even though you are not writing this method, I recommend you read through the
description and study the implementation details in HuffmanCode.java.
For our input text, “who are you”, we have the following table (the first row is the space character):

Character   Frequency
(space)     2
a           1
e           1
h           1
o           2
r           1
u           1
w           1
y           1
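Counting these frequencies takes a single pass over the input text. Below is a minimal sketch; the class and method names are mine, not taken from HuffmanCode.java:

```java
import java.util.Map;
import java.util.TreeMap;

public class FrequencyCount {
    // Count how often each character occurs in the text.
    static Map<Character, Integer> countFrequencies(String text) {
        Map<Character, Integer> freq = new TreeMap<>();
        for (char c : text.toCharArray()) {
            freq.merge(c, 1, Integer::sum);  // add 1, or start at 1 if absent
        }
        return freq;
    }

    public static void main(String[] args) {
        // "who are you" contains two spaces and two o's; every other character once.
        System.out.println(countFrequencies("who are you"));
    }
}
```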
We will create a binary tree (Tree.java) for each character. To begin with, we will have 9 binary trees, each with a single node (the root). In the root of every binary tree, we will store the frequency and the value of the character. These nine binary trees will be stored in a Priority Queue of type Tree.
A Priority Queue is a data structure in which, regardless of the order of insertion, the element with the smallest value is always removed first. Typically, a minimum binary heap is utilized to implement a Priority Queue (refer to the unit on binary heaps). Hence, the binary trees will be removed from the queue in ascending order of their frequencies. The algorithm below utilizes a greedy strategy to build the Huffman tree. A greedy algorithmic strategy makes the locally optimal choice at every step with the ultimate aim of finding the globally optimal solution.
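Java’s built-in java.util.PriorityQueue behaves exactly this way, using a minimum binary heap under the hood. A quick illustration:

```java
import java.util.PriorityQueue;

public class PQDemo {
    // Insert values in any order; poll() always removes the smallest first.
    static String drainAscending(int... values) {
        PriorityQueue<Integer> pq = new PriorityQueue<>();  // min-heap by default
        for (int v : values) pq.add(v);
        StringBuilder order = new StringBuilder();
        while (!pq.isEmpty()) {
            if (order.length() > 0) order.append(' ');
            order.append(pq.poll());
        }
        return order.toString();
    }

    public static void main(String[] args) {
        System.out.println(drainAscending(5, 1, 3));  // 1 3 5
    }
}
```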
We will now execute the following steps:
1) Remove the two binary trees with the smallest frequencies from the queue. Let’s call them A and B.
2) Create a new binary tree, C, and connect A and B as the left and right children of C.
3) The frequency value in C will be the sum of the frequency values in A and B.
4) The character value in C will be the concatenation of the character values in A and B.
5) Insert C back into the queue.
We will execute Steps 1) – 5) until we are left with a single binary tree in the queue. The step-by-step output for the above example is shown in the link below:
· https://nhrec.org/~schinnab/Huffman_Encoding_Example.txt
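The loop in Steps 1) – 5) can be sketched with java.util.PriorityQueue as below. The Node class is a hypothetical stand-in for the course’s Tree.java, and tie-breaking among equal frequencies is arbitrary, so individual codes may differ from the linked example even though the total encoded length comes out the same:

```java
import java.util.PriorityQueue;

public class HuffmanBuild {
    // Hypothetical stand-in for the course's Tree.java (not shown here).
    static class Node {
        String chars;   // character value(s) stored at this node
        int freq;       // combined frequency
        Node left, right;
        Node(String chars, int freq, Node left, Node right) {
            this.chars = chars; this.freq = freq; this.left = left; this.right = right;
        }
    }

    // Steps 1) - 5): repeatedly merge the two lowest-frequency trees.
    static Node buildTree(String text) {
        PriorityQueue<Node> pq = new PriorityQueue<>((a, b) -> a.freq - b.freq);
        // One single-node tree per distinct character.
        text.chars().distinct().forEach(c -> {
            int f = (int) text.chars().filter(x -> x == c).count();
            pq.add(new Node(String.valueOf((char) c), f, null, null));
        });
        while (pq.size() > 1) {
            Node a = pq.poll();                 // step 1: remove the two smallest
            Node b = pq.poll();
            // steps 2-4: new parent with summed frequency and concatenated characters
            Node c = new Node(a.chars + b.chars, a.freq + b.freq, a, b);
            pq.add(c);                          // step 5: reinsert
        }
        return pq.poll();
    }

    public static void main(String[] args) {
        Node root = buildTree("who are you");
        // The root's frequency is the text length; its value holds every distinct character.
        System.out.println(root.freq + " " + root.chars.length());
    }
}
```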
Pseudocode for the “codeGen” method in Tree.java:
Once we have completed the construction of the Huffman tree as shown in the above link, we can observe that all the characters in the input text are stored in the leaves. At this point, we are ready to compute the binary code for every character in the input text. Note that the binary codes are simply strings which contain 1’s and 0’s as their values.
To compute the binary code for a character, C, in the input text, we will follow the algorithm below:
1) Create a StringBuilder variable, S, to store the binary code for C. It starts off empty.
2) Start at the root and check whether its left or right child has C as one of the characters in its value:
a. If the left child contains C, append a “0” to S and move to the left child; otherwise, append a “1” to S and move to the right child.
3) Repeat Step 2) from the current node until you reach the leaf node that contains C as its only value.
4) Return S.
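The algorithm above can be sketched as follows. The Node class here is a hypothetical stand-in for Tree.java, and the tiny hand-built tree in main exists only to exercise the method:

```java
public class CodeGen {
    // Hypothetical stand-in for the course's Tree.java (not shown here).
    static class Node {
        String chars;   // character value(s) stored at this node
        Node left, right;
        Node(String chars, Node left, Node right) {
            this.chars = chars; this.left = left; this.right = right;
        }
    }

    // Steps 1) - 4): walk from the root toward the leaf holding c,
    // appending "0" for each left turn and "1" for each right turn.
    static String codeGen(Node root, char c) {
        StringBuilder s = new StringBuilder();              // step 1
        Node cur = root;
        while (cur.left != null || cur.right != null) {     // step 3: until a leaf
            if (cur.left != null && cur.left.chars.indexOf(c) >= 0) {
                s.append('0');                              // step 2a: left child holds c
                cur = cur.left;
            } else {
                s.append('1');                              // otherwise go right
                cur = cur.right;
            }
        }
        return s.toString();                                // step 4
    }

    public static void main(String[] args) {
        // A tiny hand-built tree, ((a, e), o), just for illustration.
        Node ae = new Node("ae", new Node("a", null, null), new Node("e", null, null));
        Node root = new Node("aeo", ae, new Node("o", null, null));
        System.out.println(codeGen(root, 'a'));  // 00
        System.out.println(codeGen(root, 'o'));  // 1
    }
}
```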
Following the above algorithm yields the following binary codes for the characters in the input text (the first row is the space character):

Character   Code   Frequency
(space)     111    2
a           000    1
e           001    1
h           100    1
o           01     2
r           1011   1
u           1010   1
w           1101   1
y           1100   1
We can now scan through our input text, perform a look-up in the above table, and encode every single character. The input text, “who are you”, will be encoded as below:
11011000111100010110011111100011010
There are a total of 35 bits (4 bytes + 3 bits) in the above encoding. This is a substantial savings compared to allocating a byte for every character in the input text, which would require 11 bytes (88 bits).
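The scan-and-look-up step can be sketched as below; the code table is hard-coded from the worked example above:

```java
import java.util.HashMap;
import java.util.Map;

public class Encode {
    // Code table taken from the worked example above.
    static final Map<Character, String> CODES = new HashMap<>();
    static {
        CODES.put(' ', "111");  CODES.put('a', "000"); CODES.put('e', "001");
        CODES.put('h', "100");  CODES.put('o', "01");  CODES.put('r', "1011");
        CODES.put('u', "1010"); CODES.put('w', "1101"); CODES.put('y', "1100");
    }

    // Look up every character and concatenate its code.
    static String encode(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            out.append(CODES.get(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String bits = encode("who are you");
        System.out.println(bits);            // 35 bits total
        System.out.println(bits.length());
    }
}
```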
Observe that the higher the frequency of a character, the shorter its binary code. Why? In Steps 1) through 5) of the tree-construction algorithm, we always merge the two binary trees with the least frequency. For this reason, the characters with the least frequency will end up farthest from the root and hence will have longer codes. The characters with the highest frequency will end up closer to the root and will have shorter codes. In summary, Huffman’s algorithm, as explained above, is able to generate shorter codes for the most frequent characters at the expense of longer codes for the least frequent characters.