Homework 6: Huffman compression

Due: 2020-05-13 at 11:00

Preliminaries

First, find a partner. You’re allowed to work by yourself, but I highly recommend working with a partner. Click on the assignment link. One partner should create a new team. The second partner should click the link and choose the appropriate team. (Please don’t choose the wrong team, there’s a maximum of two people and if you join the wrong one, you’ll prevent the correct person from joining.)

Once you have accepted the assignment and created/joined a team, you can clone the repository on clyde and begin working. But before you do, read the entire assignment.

Be sure to ask any questions on Piazza.

Submission

To submit your homework, you must commit and push to GitHub before the deadline.

The README.md should contain

You should not include any compiled files (either object files, the library, or the programs).

Part 0

You’re going to write a Huffman encoder/decoder that can be used for simple file compression/decompression. This homework is trickier than the previous ones; however, it builds upon several of the previous ones.

The construction proceeds in parts. It’s not required that you complete them in order; however, tests for later parts depend on earlier parts.

Take a look through the code. There are multiple directories. For ease of exposition, I will refer to the repository directory as repo. The code is structured as a library, libhuffman, and two programs encode and decode which depend on the library.

The first directory to examine is repo/libhuffman which is where all of the library code and tests reside. repo/libhuffman/include contains the header huffman.h which is what the two applications will include. The remainder of the library code is in source and header files in repo/libhuffman and unit tests are in repo/libhuffman/tests.

Tests for encode and decode reside in repo/tests.

In this assignment, don’t make any assumptions about how many bits comprise a byte. Use CHAR_BIT.

File formats

The input to the encode program is a file which you’ll treat as a sequence of CHAR_BIT bytes. The end result of encoding a file is a (hopefully) compressed output file. Unlike the input, the compressed file is a stream of bits (that are packed into bytes).

The compressed file starts with a canonical representation of the Huffman trie. To do this, you will do a pre-order traversal of the trie using a 0-bit to indicate that it is an internal node and therefore has left and right children, or a 1-bit to indicate that it is a leaf node. Immediately following the 1-bit you will write the CHAR_BIT bits from most to least significant that make up the value of the character at that location in the tree.

Immediately following the pre-order traversal of the tree, you will write the code word representing EOF. (You will later use that to find the leaf that represents EOF and correct the value there.)

After the code word for EOF, you should output the code words that are needed to represent the input file. You’ll have to buffer the bits until you get CHAR_BIT of them and then output it. (The most significant bit is the first bit, and then they progress downward.)

Finally, write the code word for EOF to mark the end of the file followed by as many 0 bits as necessary to fill out the final byte.

Encoding

The process for encoding involves two passes over the data. The first pass is for counting how often each character appears in the file and the second is for performing the encoding and writing the output.

You’re going to build the trie by constructing a list of trees and iteratively combining them until a single tree remains. To do this, you’re going to use a node that has a pointer to the next tree in the list and also two pointers for the left and right children. See repo/libhuffman/huffman_list.h.

The steps are the following.

  1. Read the whole data and count how often each character appears;
  2. Create a linked consisting of a single node with a count of 1 and a ch of EOF (see repo/libhuffman/huffman_list.h);
  3. For each character with a nonzero count from the file, create a new node with the corresponding ch and count and insert it into the list as you would do with insertion sort. Insert before nodes with equal counts.
  4. While there are two or more nodes in the list, remove the first two nodes from the list and create a new node with the two nodes as the left and right children. Insert the node into the list, maintaining sorted order as in step 3.

Once there is only a single tree in the list, it is the Huffman trie.

The code word for each byte, and for EOF, can be recovered by walking from the root to each leaf. Going left corresponds to a 0 bit and going right corresponds to a 1 bit.

Sample run

Input:

cheese

Frequency Counts:

-1     EOF     1
10     \n      1
99     c       1
101    e       3
104    h       1
115    s       1

Linked List (initial):

s(1) -> h(1) -> c(1) -> \n(1) -> EOF(1) -> e(3)

First pass:

c(1) -> \n(1) -> EOF(1) -> (2) -> e(3)
                          /   \
                      s(1)     h(1)    

Second pass:

EOF(1) -> (2)   -->   (2) -> e(3)
         /   \       /   \
      c(1)  \n(1)  s(1)   h(1)    

Third pass:

   (2)  ------>  (3)  ---> e(3)
  /   \         /   \
s(1)   h(1) EOF(1)  (2)
                   /   \
                 c(1) \n(1)

Fourth pass:

e(3) -----------> (5)
               /       \
           (2)           (3)
          /   \         /   \
        s(1)   h(1) EOF(1)  (2)
                           /   \
                         c(1) \n(1)

Fifth (and final) pass:

         (8)
    /           \
e(3)              (5)
               /       \
           (2)           (3)
          /   \         /   \
        s(1)   h(1) EOF(1)  (2)
                           /   \
                         c(1) \n(1)

Code words:

char      code words
----      ----------
 -1  EOF  110
 10  \n   1111
 99  c    1110
101  e    0
104  h    101
115  s    100

Trie representation (with added spaces for clarity):

0 1 01100101 0 0 1 01110011 1 01101000 0 1 11111111 0 1 01100011 1 00001010

EOF:

110

Remainder of file (spaces added, includes EOF):

1110 101 0 0 100 0 1111 110

Remainder is padding to make it a full char:

000000

You can also use some Unix tools to examine your output files:

File passed through xxd:

0000000: 594b 9da1 ff58 e15b a91f 80              YK...X.[...

File passed through xxd -b (bits):

0000000: 01011001 01001011 10011101 10100001 11111111 01011000  YK...X
0000006: 11100001 01011011 10101001 00011111 10000000           .[...

Decoding

To decode the file, read in the pre-order traversal of the tree, assembling it as you go. A 0 bit indicates an internal node which has both a left and right child. A 1 bit indicates it is a leaf and the next CHAR_BIT bits represent the value at that node from MSB to LSB. (I found a recursive function to work nicely for this.)

Now you need to fix the value of EOF in the tree. Switch over to a bitwise read/tree traversal routine where 0 indicates to go left and 1 indicates to go right. Once you hit the first leaf, you now have the location for the actual EOF marker and you should update the value there accordingly.

Now you continue with a bitwise read/tree traversal routine and use those to determine if you should go left on 0 or right on 1 in the tree. Once you reach a leaf, you should be at a letter. Output it and move back to the root. When you reach the EOF code word you should stop reading/printing. Nothing is printed for the EOF code word.

Part 1. Counting

This is the easiest part. Look in repo/libhuffman/counts.[hc]. Implement the function by reading characters and updating the counts in the passed in array. This is similar to homework 2. Make sure repo/libhuffman/tests/test_counts succeeds before moving on.

Part 2. Bit stream

Look in repo/libhuffman/bit_stream.[hc]. Implement the functions in these files to read and write bits to the underlying FILE * stream.

When writing bits, buffer the bits until CHAR_BIT bits have been added at which point a byte should be written to the underlying stream.

When reading bits, read bits one byte at a time from the underlying stream and buffer them.

You should implement bit_stream_put() and bit_stream_get() first and write the rest of the functions in terms of those.

Make sure you correctly handle EOF being returned from fputc(3)/fgetc(3) as well as returning the appropriate values from the functions.

Make sure repo/libhuffman/tests/test_bit_stream passes before moving on.

Part 3. Huffman list

Implement the basic functions for creating and freeing huffman_nodes. In addition, implement the insertion function which inserts a node into a linked list in sorted order.

Make sure repo/libhuffman/tests/test_list passes before moving on.

Part 4. Encoding

You will need parts 1–3 working before this so make sure you implement those first.

Read the descriptions of the functions in repo/libhuffman/huffman_encode.h and repo/libhuffman/include/huffman.h carefully and implement the functions in repo/libhuffman/huffman_encode.c. Make sure that if any of the functions fails, any allocated memory is correctly freed.

When implementing huffman_codewords, you’re probably going to want to perform a recursive tree-traversal, passing the codewords and lengths arrays as well as building up a code word (and its length). At each leaf, the tree-traversal should set the appropriate entry in codewords and lengths.

Make sure repo/libhuffman/tests/test_encode passes.

Part 5. Decoding

You will need parts 2 and 3 working before this so make sure you implement those first.

Read the descriptions of the functions in repo/libhuffman/huffman_decode.h and repo/libhuffman/include/huffman.h carefully and implement the functions in repo/libhuffman/huffman_decode.c. Make sure that if any of the functions fails, any allocated memory is correctly freed.

Make sure repo/libhuffman/tests/test_decode passes.

You can run all of the library tests by running

$ make check

from the repo/libhuffman directory.

Part 6. Programs

You’ll need to have parts 4 and 5 finished before you can test these.

Implement repo/encode and repo/decode. They should each take two arguments.

The Makefile has -Ilibhuffman/include in the CFLAGS so make sure you’re including huffman.h appropriately. (If you don’t recall what -I means, check out the gcc(1) man page.)

$ ./encode
Usage: ./encode input output
$ ./decode
Usage: ./decode input output

Make sure they exit with zero on success and nonzero on error. You should print out an error message on stderr if there is an error such as not being able to open a file or when you can’t decompress.

Make sure all of the tests pass by running

$ make check

from the repo directory.