CSC 310 ASSIGNMENT #2 SOLUTION.

Radford M. Neal, March 2002.


This solution follows the specifications in the assignment handout,
with the addition that I implemented a "-H" option, as an alternative
to "-h".  When "-H" is specified, there are FOUR contexts relating to
position within an HTML document: 0 for outside a directive, 1 for
inside a directive but outside double-quotes, 2 for inside
double-quotes before a colon, and 4 for inside double-quotes after a
colon.  The hope is that this will improve compression by allowing
different statistics for the URL type (eg, "http:") and the file name
part of the URL.

There are source files for the three programs, as specified in the
assignment handout, called html-encode.c, html-decode.c, and
html-freq.c.  There is also a file of procedures used by all these
programs, called html-contexts.c , and a header file, html.h.

The programs were tested for correctness with the "chk" command file,
and compression performance was evaluated with the "tst" command file.
The test results are in the file "tst-results".  A plot showing how
the method with -h, -p, and a statistics file compares with gzip is in
tst-plot.ps (or tst-plot.pdf).

The compression performance of the methods varies quite a bit with the
length of the file, and to some extent with the contents of the file.
Without a statistics file, html-encode often does worse when the
options specified create many contexts, especially when the file is
short.  The -h option alone (3 contexts) is almost always better than
no options (1 context), the exceptions being two short files.  Using
-h alone is also better than -p (256 contexts), or both -h and -p (768
contexts, except for the two longest files.  With a statistics file,
using both -p and -h is better than any other combination of options,
even for the shortests files, with a single exception for which -p
alone is better.

A quite consistent pattern is seen when comparing the compression
performance of gzip with that of html-encode using -p and -h and a
statistics file.  For short files, html-encode tends to compress
better than gzip (by up a factor of 1.25), whereas for long files,
gzip tends to compress better (up to a factor of 1.9).  This pattern
may be explained by two factors.  First, gzip is not able to use the
training data used the create the statistics file used by html-encode.
Because of this, gzip has a disadvantage on short files where its
dictionary will be based on little previous text.  Modifying gzip to
start with a dictionary that is based on the training files would help
improve its performance.  Second, html-encode is not able to exploit
higher-order dependencies between characters, except via the
specialized contexts produced with -h.  Because of this, html-encode
has a disadvantage with respect to gzip once gzip has learned a good
dictionary.  Modifying html-encode to use higher-order contexts, as is
done with PPM, would help improve its performance.

Finally, the modification to use four HTML contexts, with -H, was
tried out in combination with -p and with a statistics file.  This
modification usually improved compression slightly in comparison with
using -h, by up to a factor of 1.024 (ie, about a 2.4% improvement).
For only one file was -H worse (by a slight amount).  For one other
file, the compressed file sizes were the same, and for several others
the improvement with -H was very small.  This modification therefore
seems to be beneficial, but it does not produce a major improvement.