Huffman Compression Algorithm
Sources:
·
Source : http://people.cs.pitt.edu/~kirk/cs1501/animations/Huffman.html
Problem: Implement the Huffman
encoding and decoding algorithms for an input line of text.
Background: Files on a computer occupy a
significant amount of space if they are stored without being compressed.
Typically, when a file is saved on a computer, an encoding algorithm will
compress the file to a fraction of its original size. When the file is opened
for viewing, a decoding algorithm will decode all the encoding and open the
file to its original size.
Let’s
say we have a text file that contains the following text:
· who are you
In
order to store the above text, we could allocate one byte of memory for
encoding each character. A byte is an integer variable that spans from 0 to
255. There are eight bits in a byte. These bits are either 0’s or 1’s. If all
the bits in a byte variable are 0’s, the value of a byte is 0 (least value). If
all the bits in a byte variable are 1, the value of a byte is 255 (highest
value). All other combinations of 0s and 1’s will yield values between 0 and
255. The ASCII values for the characters in a typical keyboard also range from
0 to 255.One way of encoding the above text is to simply encode each character
to its ASCII value. That would require a total of 88 bytes of storage (11
characters, 1 byte for each character) Can we do better?
Huffman
encoding combines the following ideas to encode a line of text:
1) In a typical line of text, certain
characters repeat more often than others.
2) Rather than encode every character with
eight bits, we could encode characters with varying sizes of bits (say, 2 or 9)
depending on its frequency. We would want a lower bit length for characters
that repeat the most often and higher bit lengths for characters that repeat
less often. Why can’t we have lower bit lengths for all characters? This is
because of the way the Huffman algorithm operates and will be clear once you
see the pseudocode and program output (refer to the Huffman Encoding link
below).
3) To summarize, our initial strategy of
allocating the same bit length (one byte) for encoding each character without
any reference to its frequency is akin to a brute-force strategy. By encoding
smaller bit lengths for the higher frequency characters, we are able to save
considerable storage space.
The links below
illustrate the Huffman algorithm with an example and describe the pseudocode:
Running time: O (n log n)
Space Complexity: O
(n)
Java Implementation: