Homework 6: Huffman compression
Due: 2020-05-13 at 11:00
Preliminaries
First, find a partner. You’re allowed to work by yourself, but I highly recommend working with a partner. Click on the assignment link. One partner should create a new team. The second partner should click the link and choose the appropriate team. (Please don’t choose the wrong team, there’s a maximum of two people and if you join the wrong one, you’ll prevent the correct person from joining.)
Once you have accepted the assignment and created/joined a team, you can clone the repository on clyde and begin working. But before you do, read the entire assignment.
Be sure to ask any questions on Piazza.
Submission
To submit your homework, you must commit and push to GitHub before the deadline.
The README.md
should contain
- The names of both partners (or just your name if you worked alone…but please don’t work alone if you can manage it).
You should not include any compiled files (either object files, the library, or the programs).
Part 0
You’re going to write a Huffman encoder/decoder that can be used for simple file compression/decompression. This homework is trickier than the previous ones; however, it builds upon several of the previous ones.
The construction proceeds in parts. It’s not required that you complete them in order; however, tests for later parts depend on earlier parts.
Take a look through the code. There are multiple directories. For ease of exposition, I will refer to the repository directory as repo
. The code is structured as a library, libhuffman
, and two programs encode
and decode
which depend on the library.
The first directory to examine is repo/libhuffman
which is where all of the library code and tests reside. repo/libhuffman/include
contains the header huffman.h
which is what the two applications will include. The remainder of the library code is in source and header files in repo/libhuffman
and unit tests are in repo/libhuffman/tests
.
Tests for encode
and decode
reside in repo/tests
.
In this assignment, don’t make any assumptions about how many bits comprise a byte. Use CHAR_BIT
.
The input to the encode
program is a file which you’ll treat as a sequence of CHAR_BIT
bytes. The end result of encoding a file is a (hopefully) compressed output file. Unlike the input, the compressed file is a stream of bits (that are packed into bytes).
The compressed file starts with a canonical representation of the Huffman trie. To do this, you will do a pre-order traversal of the trie using a 0-bit to indicate that it is an internal node and therefore has left and right children, or a 1-bit to indicate that it is a leaf node. Immediately following the 1-bit you will write the CHAR_BIT
bits from most to least significant that make up the value of the character at that location in the tree.
Immediately following the pre-order traversal of the tree, you will write the code word representing EOF
. (You will later use that to find the leaf that represents EOF
and correct the value there.)
After the code word for EOF
, you should output the code words that are needed to represent the input file. You’ll have to buffer the bits until you get CHAR_BIT
of them and then output it. (The most significant bit is the first bit, and then they progress downward.)
Finally, write the code word for EOF
to mark the end of the file followed by as many 0
bits as necessary to fill out the final byte.
Encoding
The process for encoding involves two passes over the data. The first pass is for counting how often each character appears in the file and the second is for performing the encoding and writing the output.
You’re going to build the trie by constructing a list of trees and iteratively combining them until a single tree remains. To do this, you’re going to use a node that has a pointer to the next tree in the list and also two pointers for the left and right children. See repo/libhuffman/huffman_list.h
.
The steps are the following.
- Read the whole data and count how often each character appears;
- Create a linked consisting of a single node with a
count
of 1 and a ch
of EOF
(see repo/libhuffman/huffman_list.h
); - For each character with a nonzero count from the file, create a new node with the corresponding
ch
and count
and insert it into the list as you would do with insertion sort. Insert before nodes with equal count
s. - While there are two or more nodes in the list, remove the first two nodes from the list and create a new node with the two nodes as the left and right children. Insert the node into the list, maintaining sorted order as in step 3.
Once there is only a single tree in the list, it is the Huffman trie.
The code word for each byte, and for EOF
, can be recovered by walking from the root to each leaf. Going left corresponds to a 0
bit and going right corresponds to a 1
bit.
Sample run
Input:
Frequency Counts:
-1 EOF 1
10 \n 1
99 c 1
101 e 3
104 h 1
115 s 1
Linked List (initial):
s(1) -> h(1) -> c(1) -> \n(1) -> EOF(1) -> e(3)
First pass:
c(1) -> \n(1) -> EOF(1) -> (2) -> e(3)
/ \
s(1) h(1)
Second pass:
EOF(1) -> (2) --> (2) -> e(3)
/ \ / \
c(1) \n(1) s(1) h(1)
Third pass:
(2) ------> (3) ---> e(3)
/ \ / \
s(1) h(1) EOF(1) (2)
/ \
c(1) \n(1)
Fourth pass:
e(3) -----------> (5)
/ \
(2) (3)
/ \ / \
s(1) h(1) EOF(1) (2)
/ \
c(1) \n(1)
Fifth (and final) pass:
(8)
/ \
e(3) (5)
/ \
(2) (3)
/ \ / \
s(1) h(1) EOF(1) (2)
/ \
c(1) \n(1)
Code words:
char code words
---- ----------
-1 EOF 110
10 \n 1111
99 c 1110
101 e 0
104 h 101
115 s 100
Trie representation (with added spaces for clarity):
0 1 01100101 0 0 1 01110011 1 01101000 0 1 11111111 0 1 01100011 1 00001010
EOF:
Remainder of file (spaces added, includes EOF
):
1110 101 0 0 100 0 1111 110
Remainder is padding to make it a full char:
You can also use some Unix tools to examine your output files:
File passed through xxd
:
0000000: 594b 9da1 ff58 e15b a91f 80 YK...X.[...
File passed through xxd -b
(bits):
0000000: 01011001 01001011 10011101 10100001 11111111 01011000 YK...X
0000006: 11100001 01011011 10101001 00011111 10000000 .[...
Decoding
To decode the file, read in the pre-order traversal of the tree, assembling it as you go. A 0
bit indicates an internal node which has both a left and right child. A 1
bit indicates it is a leaf and the next CHAR_BIT
bits represent the value at that node from MSB to LSB. (I found a recursive function to work nicely for this.)
Now you need to fix the value of EOF
in the tree. Switch over to a bitwise read/tree traversal routine where 0
indicates to go left and 1
indicates to go right. Once you hit the first leaf, you now have the location for the actual EOF
marker and you should update the value there accordingly.
Now you continue with a bitwise read/tree traversal routine and use those to determine if you should go left on 0
or right on 1
in the tree. Once you reach a leaf, you should be at a letter. Output it and move back to the root. When you reach the EOF
code word you should stop reading/printing. Nothing is printed for the EOF
code word.
Part 1. Counting
This is the easiest part. Look in repo/libhuffman/counts.[hc]
. Implement the function by reading characters and updating the counts in the passed in array. This is similar to homework 2. Make sure repo/libhuffman/tests/test_counts
succeeds before moving on.
Part 2. Bit stream
Look in repo/libhuffman/bit_stream.[hc]
. Implement the functions in these files to read and write bits to the underlying FILE *
stream.
When writing bits, buffer the bits until CHAR_BIT
bits have been added at which point a byte should be written to the underlying stream.
When reading bits, read bits one byte at a time from the underlying stream and buffer them.
You should implement bit_stream_put()
and bit_stream_get()
first and write the rest of the functions in terms of those.
Make sure you correctly handle EOF
being returned from fputc(3)
/fgetc(3)
as well as returning the appropriate values from the functions.
Make sure repo/libhuffman/tests/test_bit_stream
passes before moving on.
Part 3. Huffman list
Implement the basic functions for creating and freeing huffman_node
s. In addition, implement the insertion function which inserts a node into a linked list in sorted order.
Make sure repo/libhuffman/tests/test_list
passes before moving on.
Part 4. Encoding
You will need parts 1–3 working before this so make sure you implement those first.
Read the descriptions of the functions in repo/libhuffman/huffman_encode.h
and repo/libhuffman/include/huffman.h
carefully and implement the functions in repo/libhuffman/huffman_encode.c
. Make sure that if any of the functions fails, any allocated memory is correctly freed.
When implementing huffman_codewords
, you’re probably going to want to perform a recursive tree-traversal, passing the codewords
and lengths
arrays as well as building up a code word (and its length). At each leaf, the tree-traversal should set the appropriate entry in codewords
and lengths
.
Make sure repo/libhuffman/tests/test_encode
passes.
Part 5. Decoding
You will need parts 2 and 3 working before this so make sure you implement those first.
Read the descriptions of the functions in repo/libhuffman/huffman_decode.h
and repo/libhuffman/include/huffman.h
carefully and implement the functions in repo/libhuffman/huffman_decode.c
. Make sure that if any of the functions fails, any allocated memory is correctly freed.
Make sure repo/libhuffman/tests/test_decode
passes.
You can run all of the library tests by running
from the repo/libhuffman
directory.
Part 6. Programs
You’ll need to have parts 4 and 5 finished before you can test these.
Implement repo/encode
and repo/decode
. They should each take two arguments.
The Makefile
has -Ilibhuffman/include
in the CFLAGS
so make sure you’re including huffman.h
appropriately. (If you don’t recall what -I
means, check out the gcc(1)
man page.)
$ ./encode
Usage: ./encode input output
$ ./decode
Usage: ./decode input output
Make sure they exit with zero on success and nonzero on error. You should print out an error message on stderr
if there is an error such as not being able to open a file or when you can’t decompress.
Make sure all of the tests pass by running
from the repo
directory.