|
Huffman coding uses a variable length code for each of the elements within the data. This normally involves analyzing the data to determine the probability of elements it. The most probable elements are coded with a few bits and the least probable coded with a greater number of bits. This could be done on a character-by-character basis, in a text file, or could be achieved on a byte-by-byte basis for other files.
The following example relates to characters. First, the textual data is scanned to deter-mine the number of occurrences of a given letter. For example:
|
Letter
|
'b'
|
'c'
|
'e'
|
'i'
|
'o'
|
'p'
|
|
No. of occurances
|
12
|
3
|
57
|
51
|
33
|
20
|
Next the characters are arranged in order of their number of occurrences, such as:
|
Letter
|
'e'
|
'i'
|
'o'
|
'p'
|
'b'
|
'c'
|
|
No. of occurances
|
57
|
51
|
33
|
20
|
12
|
3
|
After this the two least probable characters are assigned either a 0 or a 1. Figure 1 shows that the least probable ('c') has been assigned a 0 and the next least probable ('b') has been assigned a 1. The addition of the number of occurrences for these is then taken into the next column and the occurrence values are again arranged in descending order (that is, 57, 51, 33, 20 and 15). As with the first column, the least probable occurrence is assigned a 0 and the next least probable occurrence is assigned a 1. This continues until the last column. When complete, the Huffman-coded values are read from left to right and the bits are listed from right to left.
Figure 1 Huffman coding example
The final coding will be:
|
Letter
|
Binary code
|
|
'e'
|
11
|
|
'i'
|
10
|
|
'o'
|
00
|
|
'p'
|
011
|
|
'b'
|
0101
|
|
'c'
|
0100
|
The great advantage of Huffman coding is that, although each character is coded with a different number of bits, the receiver will automatically determine the character whatever their order. For example if a 1 is followed by a 1 then the received character is an 'e'. If it is then followed by two 0s then it is an 'o'. Here is an example:
11000110100100110100
will be decoded as:
'e' 'o' 'p' 'c' 'i' 'p' 'c'
When transmitting or storing Huffman-coded data, the coding table needs to be stored with the data (if the table is generated dynamically). It is generally a good compression technique but it does not take into account higher order associations between characters. For example, the character 'q' is normally followed by the character 'u' (apart from words such as Iraq). An efficient coding scheme for text would be to encode a single character 'q' with a longer bit sequence than a 'qu' sequence.
In a previous example we used soccer matches as an example of how data could be compressed. In a small sample of UK soccer matches the following resulted:
0 goals - 21 times; 1 goals -34 times; 2 goals- 15 times; 3 goals- 14 times; 4 goals - 5 times; 5 goals - 2 times; 6 goals - 1 time.
We could then order them as follows:
1 goals - 34 times; 0 goals - 21 times; 2 goals - 15 times; 3 goals - 14 times; 4 goals - 5 times; 5 goals - 2 times; 6 goals - 1 time.
This is obviously a small sample, and there are thus no codes for 7 goals or more. With a larger sample, there would be an associated number of occurrences. Figure 2 shows the resulting Huffman coding for these results. Thus, for example, a binary value of 01 will represent zero goals scored, and so on. This code could be combined with a table of values that represent each of the soccer teams. So, if we have 256 different teams (from 0 to 255), and use 00000000b to represent Mulchester, and 00000001b to represent Madchester, then the result:
Mulchester 2 Madchester 0
Would be coded as:
00000000000000000101
where the bold digits represent the score. If the next match between the two teams resulted in a score of 4-1 then the code would be:
0000000010010000000111
Notice that the number of bits used to code the score can vary in size, but as we use a Huffman code we can automatically detect the number of bits that it has.
Figure 2: Huffman coding for goals scored
|