Homework 2: Bit by bit

Due: 2019-10-06 at 23:59

Preliminaries

First, find a partner. You’re allowed to work by yourself, but I highly recommend working with a partner. Click on the assignment link. One partner should create a new team. The second partner should click the link and choose the appropriate team. (Please don’t choose the wrong team, there’s a maximum of two people and if you join the wrong one, you’ll prevent the correct person from joining.)

Once you have accepted the assignment and created/joined a team, you can clone the repository on clyde and begin working. But before you do, read the entire assignment and be sure to check out the expected coding style.

Be sure to ask any questions on Piazza.

Coding style

You should run clang-format on each of the C source and header files you write. This will format all of your files consistently.

If you use NeoVim or Vim as your editor, you can include the line (called a mode line)

// vim: set sw=2 sts=2 ts=8 et:

at the bottom of each of your files to force Vim to indent by 2 spaces and to ensure that tabs will insert spaces. You can set options in your ~/.vimrc file, creating one if necessary. For example, on clyde, I have the simple ~/.vimrc (note that this is slightly expanded from the one given in the homework 1 write up to set the values for C files).

set background=dark
filetype plugin indent on
autocmd FileType sh setlocal shiftwidth=2 softtabstop=2 tabstop=8 expandtab
autocmd FileType c setlocal shiftwidth=2 softtabstop=2 tabstop=8 expandtab

The first line tells Vim to use colors suitable for a terminal with a dark background. The second line tells Vim to use file-type aware indenting. The third line tells Vim to set those options for shell script files. And the fourth line does the same thing, but for C files. See the Vim wiki for more details.

If you use emacs, you’re kind of on your own. Feel free to ask on Piazza, search StackOverflow, and read the Emacs Wiki.

Same with Nano. This might be useful.

Run time errors and return values

For each of the parts below that ask you to print out an error message, this message should be printed to stderr. You can use code like this to print to stderr.

fprintf(stderr, "%d is invalid\n", foo);

Any errors should cause the program to exit with a nonzero value (1 is a pretty good choice). Programs that run successfully should exit with value 0. You can use exit(value) to exit with a particular value. (Read the man page for exit(3) to see which header you need to include.

You can combine these two operations using the errx(3) function, see the corresponding man page for usage information.

Code warnings and errors

Your code must compile without error or warning when compiled with -std=c11 -Wall.

Submission

To submit your homework, you must commit and push to GitHub before the deadline.

Your repository should contain the following files

├── bits.c
├── bits.h
├── decode_bits.c
├── encode_bits.c
├── frequency.c
├── getnum.c
├── getnum.h
├── Makefile
├── README
├── tests
│   ├── common.sh
│   ├── run_tests
│   ├── test_decode_bits
│   ├── test_encode_bits
│   ├── test_frequency
│   ├── test_tobinary
│   ├── test_todecimal
│   ├── test_tohex
│   └── test_tooctal
├── tobinary.c
├── todecimal.c
├── tohex.c
├── tooctal.c
└── .travis.yml

It may also a .gitignore file which tells Git to ignore files matching patterns in your working directory (like *.o, for example).

Any additional files you have added to your repository should be removed from the master branch. (You’re free to make other branches, if you desire, but make sure master contains the version of the code you want graded.)

The README should contain

The names of both partners (or just your name if you worked alone…but please don’t work alone if you can manage it).
An estimate of the amount of time it took to complete each program.
Any known bugs or incomplete functions.
Any interesting design decisions you’d like to share.

Each of your source files with a main function should contain a comment at the top of the file that contains usage information plus a description of what the program does.

Example.

// Usage: ./frequency < input_file
// 
// Frequency will ...

Part 0. Makefile, testing, and Travis CI (15 points)

Makefile

The assignment code comes with a bare-bones Makefile. You need to expand upon it as you work through the following parts. Use a pattern rule to compile .o files from the corresponding .c files. Make sure each .o file depends on the appropriate header files and each program you write depends on the appropriate .o files.

The all target should depend on all of the programs you write.

The clean target should delete all of the programs you write and all of the object files. (Using *.o in clean is perfectly acceptable.)

Here is the output of running $ make and $ make clean in the homework solutions.

steve@clyde:~/hw2-solutions$ make
clang -std=c11 -Wall -c -o frequency.o frequency.c
clang -o frequency frequency.o
clang -std=c11 -Wall -c -o encode_bits.o encode_bits.c
clang -std=c11 -Wall -c -o bits.o bits.c
clang -o encode_bits encode_bits.o bits.o
clang -std=c11 -Wall -c -o decode_bits.o decode_bits.c
clang -o decode_bits decode_bits.o bits.o
clang -std=c11 -Wall -c -o todecimal.o todecimal.c
clang -std=c11 -Wall -c -o getnum.o getnum.c
clang -o todecimal todecimal.o getnum.o
clang -std=c11 -Wall -c -o tobinary.o tobinary.c
clang -o tobinary tobinary.o getnum.o
clang -std=c11 -Wall -c -o tohex.o tohex.c
clang -o tohex tohex.o getnum.o
clang -std=c11 -Wall -c -o tooctal.o tooctal.c
clang -o tooctal tooctal.o getnum.o
steve@clyde:~/hw2-solutions$ make clean
rm -f frequency encode_bits decode_bits todecimal tobinary tohex tooctal *.o

When you’re done, your output should look similar.

Testing

The assignment code also comes with a set of testing scripts in the tests directory. You can run them by running $ make check.

There are a handful of tests for Parts 1 and 2 and a whole bunch of tests for tobinary in Part 3.

You should add additional tests test_frequency, test_encode_bits, and test_decode_bits.

In addition, you should add three new test scripts, test_todecimal, test_tooctal, and test_tohex to the tests directory to test those programs. I recommend you copy test_tobinary and make the appropriate modifications to the expected output. Those need not be as extensive as the tests in test_tobinary, but I found the tests extremely helpful when developing my own solutions. I recommend you write tests before you start writing any code.

As you work through the parts below, run $ make check frequently to make sure that you haven’t broken any code you previously had working.

Travis CI

This homework also uses the Travis CI continuous integration service. Each time you push your code to GitHub, Travis will download the latest version, build your program and then run the tests.

You should be able to see the status of your latest commit by logging in to Travis CI using your GitHub account. You can also see the latest status from Travis by clicking on the 1 branch link near the top of the GitHub page for your repository which has URL https://github.com/systems-programming/${repo_name}/branches. You should see either a green check mark if all tests passed or a red X if one or more tests failed.

There’s nothing additional you need to do with Travis for this assignment.

Part 1. Frequency analysis (15 points)

In this part, we will start using arrays to contain frequencies of characters and print a summary of a text’s character distribution.

In C, an array is defined as

int arr[length];

where length is either a variable or a constant. Typically, we will declare a constant like

#define ARRAY_LENGTH 256

and use ARRAY_LENGTH for the length. (You might want to pick a more descriptive name than ARRAY_LENGTH.)

Your assignment is to create a program, frequency, that computes the count and frequency of all letters a–z.

Your program should read characters one at a time from stdin using the getchar(3) function until EOF is returned to indicate the end of file. (Remember, the 3 in getchar(3) means that the function is described in section 3 of the manual, so $ man 3 getchar.) Use isalpha(3) on each character to test if it is a letter a–z or A–Z. If not, ignore the character and move to the next one. If it is a letter, then use tolower(3) to convert it to lowercase.

Use an array of size 26 to hold the count of each character. (Hint, if c is a lower case character, then c - 'a' is an integer in the set $\{0,1,\ldots,25\}$ .)

After all input has been read (so getchar() returns EOF), print out, in tabular form, the letter, the number of times that it has appeared (the count), and the percentage of all letters that this letter represents (the frequency).

Following this table, print out which is the most frequent and least frequent letter. If there are multiple letters that are most or least frequent, you should print them all out in alphabetical order.

Two example runs follow where hamlet.txt can be downloaded from Project Gutenberg via $ curl -o hamlet.txt https://www.gutenberg.org/files/1524/1524-0.txt.

$ ./frequency <hamlet.txt
Character    Count    Frequency (%)
a            11324            7.549
b             2110            1.407
c             3410            2.273
d             5675            3.783
e            17565           11.709
f             3124            2.083
g             2849            1.899
h             9245            6.163
i            10005            6.670
j              200            0.133
k             1415            0.943
l             6862            4.574
m             4644            3.096
n             9739            6.492
o            12856            8.570
p             2461            1.641
q              230            0.153
r             9139            6.092
s             9449            6.299
t            14065            9.376
u             4983            3.322
v             1347            0.898
w             3425            2.283
x              206            0.137
y             3554            2.369
z              127            0.085
Most frequent: e
Least frequent: z
$ echo 'Hello CSCI 241!' | ./frequency
Character    Count    Frequency (%)
a                0            0.000
b                0            0.000
c                2           22.222
d                0            0.000
e                1           11.111
f                0            0.000
g                0            0.000
h                1           11.111
i                1           11.111
j                0            0.000
k                0            0.000
l                2           22.222
m                0            0.000
n                0            0.000
o                1           11.111
p                0            0.000
q                0            0.000
r                0            0.000
s                1           11.111
t                0            0.000
u                0            0.000
v                0            0.000
w                0            0.000
x                0            0.000
y                0            0.000
z                0            0.000
Most frequent: c l
Least frequent: a b d f g j k m n p q r t u v w x y z

Your output should exactly match the output given above.

Make sure your code can handle no letters at all in the input $ ./frequency </dev/null. (The counts should be 0 and the frequencies 0.000.)

Hints:

You may want to use the following call to printf to produce output in the appropriate form.
```
printf("%c         %8d    %13.3f\n", c, count, freq);
```
For computing the least frequently occurring character, you may want to use INT_MAX which is defined in the standard library header limits.h which is the maximum value an int can hold.
When computing the frequency for a given character that appeared count times, you’ll want to compute 100.0 * count / total where total is the total number of letters in the input, not the total number of characters.

Part 2. Convert to binary and back (20 points)

In this part, you will be creating 2 programs. encode_bits which will generate the “binary” representation of a file and decode_bits which will take that representation and convert it back to the original format.

encode_bits

Create a program called encode_bits. This program should use getchar(3) to read in characters one at a time and then call print_bits() (see below) to output that character as a sequence of ‘1’ and ‘0’ characters. It should stop on EOF.

After all characters have been sent to print_bits(), output a newline. Together with print_bits(), this should ensure that every line of output ends with a newline and that every line of output will contain 70 0 or 1 characters (except for the last line which may contain fewer).

decode_bits

Create a program called decode_bits. This program should use getchar(3) to read in characters one at a time and then call decode_bits() (see below) to output that sequence of ‘1’ and ‘0’ characters as actual characters. It should stop on EOF.

bits.c and bits.h

Create a file called bits.c that contains the definitions of the two functions described below and a header file bits.h that contains a guard against multiple inclusion and function prototypes.

void print_bits(int ch)

Takes the character ch and outputs its value in binary format with all leading zeros and with the MSB first. For example, the letter A has a value of 0x41, and should be output as 01000001. A newline character has a value of 0x0a and should be output as 00001010.
You should not assume the number of bits that are in a char, instead use the constant CHAR_BIT from the standard library header limits.h.
While printing the bits, if the current line of output has 70 characters (one per bit printed by print_bits()), print a newline before printing the next 0 or 1.

void decode_bits(int ch)

If the character is whitespace, skip it. (You might want to use isspace(3).)
If the character is a 1 or a 0, you should add it into a global variable (numerically, shifting the current contents appropriately). Once you’ve seen CHAR_BIT bits, you should print the corresponding character out.
If the character is not white space and not 0 or 1, you should print an error message to stderr, and exit with a nonzero value (see the instructions at the top of the write up about error messages).

Here is some sample output.

$ echo "Printing bit by bit" | ./encode_bits
0101000001110010011010010110111001110100011010010110111001100111001000
0001100010011010010111010000100000011000100111100100100000011000100110
10010111010000001010
$ echo "Printing bit by bit" | ./encode_bits | ./decode_bits
Printing bit by bit
$ echo "bad input!" | ./decode_bits
decode_bits: Character 'b' is not '0' or '1'
$ echo $?
1

Hints:

You’ll probably want to use static local variables in both print_bits() and decode_bits() to decide when to print newlines (print_bits()) or the character (decode_bits()).
Make sure encode_bits doesn’t start by outputting a newline or end with a blank line (unless the input was empty).

Part 3. Base conversion (50 points)

For this part, you will be creating a function to read in a signed integer value and storing the result in a long integer variable. You will also be creating four short programs that will use that function to read in integers and output them in one of four different formats: binary, decimal, octal, or hexadecimal.

Reading in a number

Create a file called getnum.c and another called getnum.h. In getnum.c you will create the function getnum() that is used by your other programs.

long getnum(void)

Read in a signed integer in one of 4 formats described below and then return the value. Skip leading whitespace and stop reading in a number on the next occurrence of whitespace or EOF. All formats optionally begin with a - to indicate negative value. Positive values do not have this (i.e., no + for positive).

Binary – begins with a leading 0b and then a sequence of 0 and 1 characters.
Octal – begins with a leading 0 followed by 0 or more digits from 0–7.
Decimal – begins with a digit from 1–9 followed by zero or more digits from 0–9.
Hexadecimal – begins with a leading 0x followed by 1 or more digits from 0–9, a–f, or A–F.

Note that this means that a single 0 is considered octal. It may be helpful to consult this railroad diagram.

If EOF is encountered before any digits, then set a global variable bool at_eof; to true and return. Otherwise, at_eof should be false when getnum() returns.

If the integer read is valid (meaning its value is at least LONG_MIN and at most LONG_MAX) and has one of the four prescribed formats, then set a global variable bool valid_num; to true and return the value. LONG_MIN and LONG_MAX are defined in the standard library header limits.h.

If the integer read is invalid, either because it doesn’t fit into a long or because it doesn’t have one of the formats, then

skip to the next whitespace in the input or EOF;
set valid_num to false; and
return.

You might find it useful to “unread” a character. You can do so using ungetc(ch, stdin); where ch is the character you just read. Note that you can only un-read one character at a time until you read in a new character and you may not “un-read” EOF.

Printing out a number

You will then create four short programs that will read in a sequence of numbers and then output them one per line in the specified format. All four will be using sign-magnitude format. (If negative, print out the sign and then the rest as if it were positive.)

tobinary - outputs the signed value in binary with a leading 0b. Unless the integer is 0, the output should not contain any leading 0s (after the 0b).
todecimal - outputs the signed value in base 10 with no leading 0 characters, unless the integer is 0 in which case it should output 0.
tooctal - outputs the signed value in base 8 with a single leading 0.
tohex - outputs the signed value in base 16 (uppercase) with a leading 0x. Unless the integer is 0, the output should not contain any leading 0s (after the 0x).

For example, if the input is 0, then the output should be, 0b0, 0, 0, and 0x0, respectively. If the input is 1, the output should be 0b1, 1, 01, and 0x1, respectively. If the input is -2, the output should be -0b10, -2, -02, and -0x2, respectively.

If the integer read is invalid, output INVALID to standard out (not stderr!).

Each program should loop until no more integers remain.

Hints:

You will definitely want to have more functions than just the one required in this part and the four main() functions. You can also have other functions or global flags that can be used by the toBASENAME programs.
Checking if an integer will “overflow”—that is, checking if adding the next digit to the number makes it larger than LONG_MAX (if positive) or smaller than LONG_MIN (if negative) is tricky!
Consider base 10 as an example. If the integer being read is nonnegative (i.e., it didn’t start with -), value is the current value of the integer and next is the value of the next digit to add, then value = 10 * value + next; gives us the new value. Unfortunately, we perform that computation and then check if value > LONG_MAX because, by definition, LONG_MAX is the largest value a long can hold. Instead, we need to first check that value <= (LONG_MAX - next) / 10. If that is not true, then our number will overflow and is thus invalid.
Similarly, if we’re reading a negative value, then value = 10 * value - next; is our update so we need to first check that value >= (LONG_MIN + next) / 10. (Pay close attention to the signs.)
Other bases are similar. I recommend you implement a function like
```
// Returns true if adding the next digit (with value next) results in a value
// in the range of supported values for a long.
static bool in_range(int base, bool negative, long result, int next) {
  /* ... */
}
```
containing that logic and use that to determine if adding the next digit leaves you with a valid long or not.
Printing decimal values is the easiest (because you can use printf() directly) and you should definitely implement todecimal first. Printing binary is by far the hardest.
For tooctal, tohex, and tobinary, there are two values in particular that are slightly tricky to implement: 0 and LONG_MIN. For 0, I recommend checking if the value is 0 and then just printing it out in the appropriate format as described above.
For every negative number value other than LONG_MIN, you can print a - followed by printing the positive number -value. Unfortunately, -LONG_MIN causes an overflow due to two’s complement. Fortunately, there’s a work around!
```
unsigned long uval;
if (value == LONG_MIN) {
  uval = LONG_MAX;
  uval += -(LONG_MAX + LONG_MIN);
} else if (value < 0) {
  uval = -value;
} else {
  uval = value;
}
```
Now you only have to deal with printing the positive value uval.
The reason this works is on a two’s complement machine, LONG_MAX + LONG_MIN will be $-1$ so the result is one more than LONG_MAX, but as an unsigned long. (On a sign and magnitude machine—which C supports!—LONG_MIN is precisely -LONG_MAX, so LONG_MAX + LONG_MIN is 0 which gives the correct result. If you’re willing to ignore sign and magnitude machines, you could also just write uval = LONG_MAX + 1UL; and it’d have the same effect. Do it whichever way you like but you definitely cannot do uval = LONG_MAX + 1;.)
At no point in time should your program segfault or generate errors when running under valgrind. We haven’t talked about valgrind yet, but you can run $ valgrind ./frequency <input_file (and similar for the others) and it will run your program and check for memory errors.
Be sure your program can handle both upper and lower case in hex. You’ll probably want to use some of the functions in ctype.h. Check out isxdigit(3) and the related functions mentioned in that man page.
Be sure to handle the special case of single ‘0’ character.
Be sure to handle the case where EOF immediately follows a valid integer.
Don’t forget to add these targets into your Makefile.