Compress - Huffman Coding

The Huffman algorithm converts characters into bit strings using a binary tree containing all possible characters. We are not concerned with the construction of the tree initially. We simply have the prerequisit that the tree contains all characters in the input list. The Huffman code for a character may be obtained by traversing the tree, where a left branch is choosen we have the bit 0, if a right branch is taken the bit is 1.

The algorithm given here does not describe how to build the tree - however given a Huffman tree ht which includes our entire character set we define:
fun compress s = code ht (explode s);
fun decompress l = implode(decode ht l);

The task of proving this compression method sound is reduced to proving that:

decode ht (code ht l) = l
for l such that (isin ht) forall l
isin and forall are define below.

Huffman trees

A tree may be defined in ML as follows:

datatype 'a binTree = tip of 'a 
|                      /\ of 'a binTree * 'a binTree;

infix /\;

The above tree, T may be defined as follows:

val T = tip"e" /\((tip "b" /\ tip "s")/\(tip "a" /\(tip "z" /\ tip "x")));

There are a couple of useful functions that must be defined. The function isin takes a tree and a character and returns true if the character occurs in the tree. The function forall takes a predicate (a function from an item to a boolean) and a list - it returns true if the predicate is true for all items in the list. With these two we can express the precondition of the function, namely that all characters in the list to be compressed are in the tree.

(isin T) forall l
Where
fun isin (tip a) x  = a=x
|   isin (t1/\t2) x = (isin t1 x) orelse (isin t2 x);
and
infix forall;
fun f forall nil = true
|   f forall(h::t) = (f h) andalso (f forall t);

Coding and decoding characters

To code a single character into a bit list we use the function codec

fun codec (tip c) x  = nil
|   codec (t1/\t2) x = if isin t1 x then 0::codec t1 x
                                    else 1::codec t2 x;

To decode we accept a tree and a list, when the character has been found we return the rest of list.

fun decodec (tip c) l      = (c,l)
|   decodec (t1/\_) (0::t) = decodec t1 t
|   decodec (_/\t2) (1::t) = decodec t2 t;

Claim for codec, decodec

isin tree c => decodec tree (code tree c)@l = (c,l)

Proof by structural induction on the tree for any character c and list l:

Base case:

tree is (tip x) for any x

we note that x=c as isin (tip x) c

LHS decodec (tip c) (codec (tip c) c)@l
                 = decodec (tip c) nil@l 
                 = decodec (tip c) l 
                 = (c,l)
Step:

tree is t1/\t2 where isin (t1/\t2) c is true

To start we assume that isin t1 c (otherwise we may be sure of isin t2 c)

LHS decodec (t1/\t2) (codec (t1/\t2) c)@l
                    = decodec (t1/\t2) (0::codec t1 c)@l
                    = decodec (t1/\t2) 0::((codec t1 c)@l)
                    = decodec t1 (codec t1 c)@l
                    = (c,l)

The inductive hypothesis requires isin t1 c

The argument for isin t2 c is all but identical.

Coding and decoding lists

The above functions generalise to lists in a natural way. Given some tree containing every member of the list l.

fun code nil = nil
|   code (h::t) = (codec tree h) @ (code t);
fun decode nil = nil
|   decode l = let val (c,t) = decodec tree l
               in c::decode t end;

(isin tree) forall l => decode code l = l

Proof by induction on l

Base, l is nil

LHS decode code nil = decode nil
                    = nil

Step

LHS decode code (h::l) = decode (codec tree h)@(code l)
                       = c::(decode t)
                       where (c,t) = decodec tree (codec tree h)@(code l)
                                   = (h,code l)                by lemma c
                       = h::(decode code l)
                       = h::l                                  by IH
There are a few point that have been skipped over.

The second reduction:

decode (codec tree h)@(code l) = c::(decode t) is only valid if (codec tree h)@(code l) != nil this turns out to be true only if the tree contains more than one item. The theorem is false for a tree of one item only.

Both lemma c and the IH have membership prerequesits - examination shows that these are satisfied.