Huffman Compression Algorithm

Sources:

· Source : http://people.cs.pitt.edu/~kirk/cs1501/animations/Huffman.html

Problem: Implement the Huffman encoding and decoding algorithms for an input line of text.

Background: Files on a computer occupy a significant amount of space if they are stored without being compressed. Typically, when a file is saved on a computer, an encoding algorithm will compress the file to a fraction of its original size. When the file is opened for viewing, a decoding algorithm will decode all the encoding and open the file to its original size.

Let’s say we have a text file that contains the following text:

· who are you

In order to store the above text, we could allocate one byte of memory for encoding each character. A byte is an integer variable that spans from 0 to 255. There are eight bits in a byte. These bits are either 0’s or 1’s. If all the bits in a byte variable are 0’s, the value of a byte is 0 (least value). If all the bits in a byte variable are 1, the value of a byte is 255 (highest value). All other combinations of 0s and 1’s will yield values between 0 and 255. The ASCII values for the characters in a typical keyboard also range from 0 to 255.One way of encoding the above text is to simply encode each character to its ASCII value. That would require a total of 88 bytes of storage (11 characters, 1 byte for each character) Can we do better?

Huffman encoding combines the following ideas to encode a line of text:

1) In a typical line of text, certain characters repeat more often than others.

2) Rather than encode every character with eight bits, we could encode characters with varying sizes of bits (say, 2 or 9) depending on its frequency. We would want a lower bit length for characters that repeat the most often and higher bit lengths for characters that repeat less often. Why can’t we have lower bit lengths for all characters? This is because of the way the Huffman algorithm operates and will be clear once you see the pseudocode and program output (refer to the Huffman Encoding link below).

3) To summarize, our initial strategy of allocating the same bit length (one byte) for encoding each character without any reference to its frequency is akin to a brute-force strategy. By encoding smaller bit lengths for the higher frequency characters, we are able to save considerable storage space.

The links below illustrate the Huffman algorithm with an example and describe the pseudocode:

1) Huffman Encoding

2) Huffman Decoding

Running time: O (n log n)

Space Complexity: O (n)

Java Implementation:

· Node.java

· Tree.java

· HuffmaCode.java

Program Output