Projects for CS3500

Project 1, DFAs. 12 points.

Input: a string s.

Output: a DFA that takes as input a string t and determines if s occurs as a substring of t.

Testing: your code also has to demonstrate that your DFA works correctly.
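
As a starting point, here is a minimal sketch in Python of one way to build such a DFA. It is an illustration under assumptions, not the required solution: the names build_substring_dfa and dfa_accepts are hypothetical, the alphabet is passed in explicitly, and every character of t is assumed to belong to that alphabet. State q records that the longest prefix of s which is a suffix of the input read so far has length q.

```python
def build_substring_dfa(s, alphabet):
    """Return (delta, accept) for a DFA with states 0..len(s).

    delta[q][c] is the next state; state len(s) is accepting and absorbing.
    The construction is deliberately naive (not the fastest possible).
    """
    m = len(s)
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        for c in alphabet:
            if q == m:
                delta[q][c] = m  # s has already been seen: stay accepting
                continue
            # Find the longest k such that s[:k] is a suffix of s[:q] + c.
            seen = s[:q] + c
            k = min(m, q + 1)
            while k > 0 and not seen.endswith(s[:k]):
                k -= 1
            delta[q][c] = k
    return delta, m


def dfa_accepts(delta, accept, t):
    """Run the DFA on t; assumes every character of t is in the alphabet."""
    q = 0
    for c in t:
        q = delta[q][c]
    return q == accept
```

One way to demonstrate correctness is to compare dfa_accepts against a direct substring check on every string over a small alphabet up to some length.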

Project 2, CFGs. Due Nov. 10. Points: this project either replaces test 1 or counts for 15 points, whichever is better.

Input: a grammar G in Chomsky normal form.

Output: a program that takes as input a string s, and determines if G generates s. Make this program run as efficiently as you can.

Grading: if you invent your own algorithm, or you speed up the CKY algorithm (page 241 of Sipser), you get full credit. A straight implementation of CKY gets half credit.
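
For reference, a straight CKY implementation (the half-credit baseline) might look like the following Python sketch. The grammar representation is an assumption made for illustration: terminal_rules maps each terminal to the set of variables A with a rule A -> terminal, binary_rules is a list of triples (A, B, C) for rules A -> BC, and generates_empty says whether the start variable has a rule for the empty string.

```python
def cky(s, start, terminal_rules, binary_rules, generates_empty=False):
    """Return True if the CNF grammar generates s (plain cubic-time CKY)."""
    n = len(s)
    if n == 0:
        return generates_empty
    # table[i][j] holds the variables that generate s[i..j] (inclusive).
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(s):
        table[i][i] = set(terminal_rules.get(ch, ()))
    for length in range(2, n + 1):          # length of the substring
        for i in range(n - length + 1):     # start position
            j = i + length - 1              # end position
            for k in range(i, j):           # split point
                for a, b, c in binary_rules:
                    if b in table[i][k] and c in table[k + 1][j]:
                        table[i][j].add(a)
    return start in table[0][n - 1]


# Example grammar for { a^n b^n : n >= 1 }:
#   S -> AB | AT,  T -> SB,  A -> a,  B -> b
example_terminals = {"a": {"A"}, "b": {"B"}}
example_binary = [("S", "A", "B"), ("S", "A", "T"), ("T", "S", "B")]
```

With this example grammar, cky("aabb", "S", example_terminals, example_binary) should report that the string is generated, while "aab" should be rejected.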

Project 3, Heaps. Due Nov. 18. 10 points.

Speed up heapsort by using the faster extract-max shown in class. The faster extract-max pretends that the leaf moved to the root has an infinitely "bad" (small) value until it reaches a leaf position; then it moves the element upwards if necessary. Test your implementation on big cases (at least 10^6 elements) and empirically estimate the improvement in the number of comparisons. If you can, also compare CPU usage.
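
The idea can be sketched in Python as follows. This is one possible reading of the faster extract-max, not necessarily the exact version shown in class; the names heapsort_comparisons and compare are hypothetical. Both variants build the heap the same way, so any difference in the comparison count comes from the extraction phase.

```python
import random


def heapsort_comparisons(values, fast):
    """Heapsort a copy of values; return (sorted list, number of key comparisons)."""
    a = list(values)
    count = 0

    def less(x, y):
        nonlocal count
        count += 1
        return x < y

    def sift_down(i, size):
        # Standard version: up to two comparisons per level.
        x = a[i]
        while 2 * i + 1 < size:
            child = 2 * i + 1
            if child + 1 < size and less(a[child], a[child + 1]):
                child += 1
            if not less(x, a[child]):
                break
            a[i] = a[child]
            i = child
        a[i] = x

    def sift_down_fast(i, size):
        # Treat x as minus infinity: promote the larger child at every level
        # (one comparison per level) until the hole reaches a leaf.
        x, root = a[i], i
        while 2 * i + 1 < size:
            child = 2 * i + 1
            if child + 1 < size and less(a[child], a[child + 1]):
                child += 1
            a[i] = a[child]
            i = child
        # Then move x upwards from the leaf as far as necessary.
        while i > root:
            parent = (i - 1) // 2
            if not less(a[parent], x):
                break
            a[i] = a[parent]
            i = parent
        a[i] = x

    n = len(a)
    for i in range(n // 2 - 1, -1, -1):   # build the max-heap
        sift_down(i, n)
    sift = sift_down_fast if fast else sift_down
    for end in range(n - 1, 0, -1):       # repeated extract-max
        a[0], a[end] = a[end], a[0]
        sift(0, end)
    return a, count


def compare(n, seed=0):
    """Return (standard, faster) comparison counts on n random keys."""
    rng = random.Random(seed)
    data = [rng.random() for _ in range(n)]
    return (heapsort_comparisons(data, fast=False)[1],
            heapsort_comparisons(data, fast=True)[1])
```

Calling compare(n) for several values of n gives the two comparison counts side by side; wrapping the calls with time.process_time() is one way to compare CPU usage as well.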

Project 4, Selection Algorithm. Due Nov. 29. 12 points.

1) Implement the randomized selection algorithm, and verify that it takes O(n) time on average, when all keys are distinct. You will have to use several different values of n to do this. Since only ordinal values matter, you can safely assume the values are 1,2,3,...,n.

2) Demonstrate with computational tests that the algorithm does not take O(n) time on average when keys are not distinct.

3) Modify the algorithm so that it takes O(n) time whether or not the keys are distinct, and verify with computational tests. Try to make your modification slight -- a one line patch if possible :)
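
For parts 1 and 2, a baseline might look like the Python sketch below. It assumes the usual textbook scheme (random pivot, then a partition that sends keys less than or equal to the pivot to the left), which may differ in detail from the version you were given; the name randomized_select and the comparison counter are illustrative. The loop is iterative so that bad cases do not hit Python's recursion limit. The modification for part 3 is deliberately left out.

```python
import random


def randomized_select(values, k, rng=random):
    """Return (k-th smallest key with k counted from 1, number of key comparisons)."""
    a = list(values)
    lo, hi = 0, len(a) - 1
    comparisons = 0
    while True:
        if lo == hi:
            return a[lo], comparisons
        # Choose a random pivot and move it to the end of the range.
        p = rng.randint(lo, hi)
        a[p], a[hi] = a[hi], a[p]
        pivot = a[hi]
        i = lo - 1
        for j in range(lo, hi):
            comparisons += 1
            if a[j] <= pivot:
                i += 1
                a[i], a[j] = a[j], a[i]
        a[i + 1], a[hi] = a[hi], a[i + 1]
        q = i + 1              # final position of the pivot
        rank = q - lo + 1      # rank of the pivot within a[lo..hi]
        if k == rank:
            return a[q], comparisons
        if k < rank:
            hi = q - 1
        else:
            k -= rank
            lo = q + 1
```

Averaging the comparison count over many runs for several values of n, once with the keys 1,2,3,...,n and once with many repeated keys, should make the difference in growth visible.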

Project 5, Huffman codes. Due Dec. 6. 20 points total.

1) (5 points) Implement Huffman coding in O(n log n) time using binary heaps. This is described in your textbook.

Input: a set of characters with frequency counts.

Output: a full binary tree giving an optimal prefix encoding.

2) (5 points) Implement an encoder and decoder, and use them to test (1) on three samples of English text. The space, comma, period, colon, etc. should be characters, but linefeeds, etc. are not. How well does the optimal encoding for the first text perform on the second text? Can you find a single encoding that gives good compression for each of the three texts?
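
Parts (1) and (2) can be sketched in Python roughly as follows. This is a minimal illustration, not a reference solution: it uses the standard heapq module as the binary heap, represents internal nodes as (left, right) tuples with single characters at the leaves, writes the encoded text as a string of "0" and "1" characters rather than packed bits, and assumes at least two distinct characters. All function names are hypothetical.

```python
import heapq
from itertools import count


def huffman_tree(freq):
    """freq maps character -> count; returns a tree of nested (left, right) tuples."""
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), ch) for ch, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    return heap[0][2]


def code_table(tree, prefix=""):
    """Map each character to its bit string ("0" = left, "1" = right)."""
    if not isinstance(tree, tuple):
        return {tree: prefix}
    left, right = tree
    table = code_table(left, prefix + "0")
    table.update(code_table(right, prefix + "1"))
    return table


def encode(text, table):
    return "".join(table[ch] for ch in text)


def decode(bits, tree):
    out, node = [], tree
    for b in bits:
        node = node[0] if b == "0" else node[1]
        if not isinstance(node, tuple):   # reached a leaf
            out.append(node)
            node = tree
    return "".join(out)
```

To measure how well one text's code performs on another, build the table from the first text's counts (for example with collections.Counter) and compare len(encode(second_text, table)) against the length obtained with the second text's own table; this only works if every character of the second text also appears in the first.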

3) (10 points) Explore the replacement of strings of characters as follows: find the K (try K=10) most frequently occurring strings of characters of length 2 to 3 (use a program to do this). Scan the text, replacing these strings with new characters. The original characters, together with your new characters, form an extended character set. In the same pass through the text, get a frequency count for the extended character set. Find the Huffman code, compress, and decompress to the original character set. How well does this perform on your English texts? Notice that you use a prefix encoding, so decompression is easy. However, you have to make some decisions to do the compression. I recommend a simple left-to-right procedure that looks ahead two characters and matches the longest string possible without backtracking. For example, if the alphabet is a,b,c, and the common strings are aa, aaa, ab, then the text aaaabaaab breaks into aaa ab aaa b.
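
The counting step and the recommended left-to-right procedure can be sketched in Python like this. The function names are hypothetical, and the sketch returns the pieces as ordinary strings; mapping each common string to a new character of the extended set is left to you.

```python
from collections import Counter


def top_strings(text, k, min_len=2, max_len=3):
    """Return the k most frequent substrings of length min_len..max_len."""
    counts = Counter(
        text[i:i + length]
        for length in range(min_len, max_len + 1)
        for i in range(len(text) - length + 1)
    )
    return [s for s, _ in counts.most_common(k)]


def greedy_tokenize(text, common, max_len=3):
    """Split text left to right, always taking the longest common string
    that starts at the current position, with no backtracking."""
    common = set(common)
    tokens = []
    i = 0
    while i < len(text):
        for length in range(max_len, 1, -1):
            piece = text[i:i + length]
            if len(piece) == length and piece in common:
                break
        else:
            piece = text[i]   # no common string starts here: single character
        tokens.append(piece)
        i += len(piece)
    return tokens


print(greedy_tokenize("aaaabaaab", ["aa", "aaa", "ab"]))
# prints ['aaa', 'ab', 'aaa', 'b']
```

Counting the tokens returned by greedy_tokenize (for example with collections.Counter) gives the frequency counts for the extended character set in the same pass.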

Project 6, O(n) vs. O(n log n). Due Nov. 29. 5 points.

Implement build-heap and build-heap' (defined in problem 7-1 at the end of the chapter). Show that the first is Theta(n) and the second is Theta(n log n) by running both functions on varying size instances.
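
A minimal Python sketch of the two constructions with comparison counters is given below. It assumes that build-heap' is the variant that builds the heap by inserting the keys one at a time, and the function names are hypothetical. Note that the Theta(n log n) bound for build-heap' is a worst-case bound: keys given in increasing order force every insertion to climb to the root, whereas on random input the average number of comparisons per insertion tends to stay small, so the choice of test instances matters.

```python
def build_heap(values):
    """Bottom-up max-heap construction; returns (heap, key comparisons)."""
    a = list(values)
    n = len(a)
    comparisons = 0
    for start in range(n // 2 - 1, -1, -1):
        i = start
        while True:
            largest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n:
                    comparisons += 1
                    if a[child] > a[largest]:
                        largest = child
            if largest == i:
                break
            a[i], a[largest] = a[largest], a[i]
            i = largest
    return a, comparisons


def build_heap_by_insertion(values):
    """Max-heap built by repeated insertion; returns (heap, key comparisons)."""
    a = []
    comparisons = 0
    for v in values:
        a.append(v)
        i = len(a) - 1
        while i > 0:                      # sift the new key up
            parent = (i - 1) // 2
            comparisons += 1
            if a[parent] >= a[i]:
                break
            a[parent], a[i] = a[i], a[parent]
            i = parent
    return a, comparisons
```

Running both functions on range(n) for n = 2^10, 2^12, 2^14, ... and printing comparisons / n should show a roughly constant ratio for the first function and a ratio that grows like log n for the second.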