CSC 310 ASSIGNMENT #2 SOLUTION. Radford M. Neal, March 2002. This solution follows the specifications in the assignment handout, with the addition that I implemented a "-H" option, as an alternative to "-h". When "-H" is specified, there are FOUR contexts relating to position within an HTML document: 0 for outside a directive, 1 for inside a directive but outside double-quotes, 2 for inside double-quotes before a colon, and 4 for inside double-quotes after a colon. The hope is that this will improve compression by allowing different statistics for the URL type (eg, "http:") and the file name part of the URL. There are source files for the three programs, as specified in the assignment handout, called html-encode.c, html-decode.c, and html-freq.c. There is also a file of procedures used by all these programs, called html-contexts.c , and a header file, html.h. The programs were tested for correctness with the "chk" command file, and compression performance was evaluated with the "tst" command file. The test results are in the file "tst-results". A plot showing how the method with -h, -p, and a statistics file compares with gzip is in tst-plot.ps (or tst-plot.pdf). The compression performance of the methods varies quite a bit with the length of the file, and to some extent with the contents of the file. Without a statistics file, html-encode often does worse when the options specified create many contexts, especially when the file is short. The -h option alone (3 contexts) is almost always better than no options (1 context), the exceptions being two short files. Using -h alone is also better than -p (256 contexts), or both -h and -p (768 contexts, except for the two longest files. With a statistics file, using both -p and -h is better than any other combination of options, even for the shortests files, with a single exception for which -p alone is better. A quite consistent pattern is seen when comparing the compression performance of gzip with that of html-encode using -p and -h and a statistics file. For short files, html-encode tends to compress better than gzip (by up a factor of 1.25), whereas for long files, gzip tends to compress better (up to a factor of 1.9). This pattern may be explained by two factors. First, gzip is not able to use the training data used the create the statistics file used by html-encode. Because of this, gzip has a disadvantage on short files where its dictionary will be based on little previous text. Modifying gzip to start with a dictionary that is based on the training files would help improve its performance. Second, html-encode is not able to exploit higher-order dependencies between characters, except via the specialized contexts produced with -h. Because of this, html-encode has a disadvantage with respect to gzip once gzip has learned a good dictionary. Modifying html-encode to use higher-order contexts, as is done with PPM, would help improve its performance. Finally, the modification to use four HTML contexts, with -H, was tried out in combination with -p and with a statistics file. This modification usually improved compression slightly in comparison with using -h, by up to a factor of 1.024 (ie, about a 2.4% improvement). For only one file was -H worse (by a slight amount). For one other file, the compressed file sizes were the same, and for several others the improvement with -H was very small. This modification therefore seems to be beneficial, but it does not produce a major improvement.