===========================================================================
CSC 310H                     Lecture 11 Summary                   Fall 2007
===========================================================================

------
Topics
------

 -  Assumptions about probabilities.
 -  Guess probabilities.
 -  Compute probabilities.
 -  Adaptively estimate probabilities using counts, Laplace model.
 -  Markov sources.
 -  Markov source + Laplace model + Arithmetic coding.
 -  Reference: sections 2.6, 6.2.

-------------------------------
Assumptions about probabilities
-------------------------------

 -  So far, we assumed that the source generates symbols:

     1. with known probabilities, and

     2. independently of one another.

 -  These assumptions are not practical.

 -  For coding to be efficient, we need a good probabilistic model of the
    source.

 -  For coding to be correct, the encoder and decoder need to use the same
    probabilities.

-------------------
Guess probabilities
-------------------

Ideas:

 -  What if we simply guess the symbol probabilities as q_1, .., q_I, when
    in fact they are p_1, .., p_I? We still assume independence for now.

 -  The entropy of the source is really sum_{j=1}^I{p_j*log(1/p_j)}.

 -  If we were able to obtain a perfect symbol code for the guessed
    distribution (the q_i's would have to be negative powers of 2), then
    a_i would get a codeword of length log(1/q_i).

 -  Thus, the average codeword length would be sum_{j=1}^I{p_j*log(1/q_j)}.

 -  The difference between these quantities, sum_{j=1}^I{p_j*log(p_j/q_j)},
    is called the relative entropy between distributions p and q.

 -  This quantity is 0 if and only if p=q. Thus, when we guess the wrong
    distribution, we pay at least this difference between the average
    encoding length and the entropy (see the sketch below).
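
As an illustration, here is a minimal Python sketch (the distributions p
and q below are made up for the example) checking that the average codeword
length under the guessed distribution exceeds the entropy by exactly the
relative entropy:

    import math

    def entropy(p):
        # H(p) = sum_j p_j * log2(1/p_j)
        return sum(pj * math.log2(1.0 / pj) for pj in p if pj > 0)

    def avg_length(p, q):
        # Average codeword length when a_j gets a codeword of length
        # log2(1/q_j) but actually occurs with probability p_j.
        return sum(pj * math.log2(1.0 / qj) for pj, qj in zip(p, q) if pj > 0)

    def relative_entropy(p, q):
        # D(p||q) = sum_j p_j * log2(p_j/q_j)
        return sum(pj * math.log2(pj / qj) for pj, qj in zip(p, q) if pj > 0)

    p = [0.5, 0.25, 0.25]   # true symbol probabilities (made up)
    q = [0.25, 0.25, 0.5]   # guessed symbol probabilities (made up)
    assert abs(avg_length(p, q) - (entropy(p) + relative_entropy(p, q))) < 1e-9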

Reference: section 2.6.

---------------------
Compute probabilities
---------------------

Ideas:

 -  The encoder could go through the entire string and compute the
    probabilities with which each symbol occurs, then encode using these.

 -  The decoder must be aware of these probabilities, so they must be
    communicated as well, e.g., as a header of the file containing the
    actual encoding (a short sketch appears after this list).

 -  Theoretically, this code cannot be optimal because it is not complete.
    Say we have a bit string of length n with k 1's in it. If the encoder
    sends k, the count of 1 bits in the string, and then encodes the
    string using symbol probabilities k/n for 1's and (n-k)/n for 0's, the
    code is not complete because, after seeing k, it will never be the case
    that we receive the encoding of a bit string with k-1 1's.

 -  However, this scheme might work well, especially if the number of bits
    required to send the symbol counts is much less than the size of the
    message itself.
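
As a rough sketch of the two-pass idea (the example string is made up; the
header format is left abstract): first count the symbols, then derive the
probabilities that both encoder and decoder will use.

    from collections import Counter

    def count_and_estimate(message):
        # First pass: count each symbol; the counts would be sent as a header.
        counts = Counter(message)
        n = len(message)
        # Second pass would encode 'message' using these probabilities.
        probs = {sym: c / n for sym, c in counts.items()}
        return counts, probs

    counts, probs = count_and_estimate("abracadabra")
    # The decoder reads the counts from the header and recomputes 'probs',
    # so both sides feed identical probabilities to the entropy coder.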

-------------------------------------------------------------
Adaptively estimate probabilities using counts, Laplace model
-------------------------------------------------------------

Ideas:

 -  A way to avoid sending over the probabilities is to estimate them
    using counts of occurrences up to the point where we are encoding. In
    this case, the decoder will have access to this information, so it can
    use the same probabilities as the encoder.

 -  Given a sequence of coin tosses of length n, in which heads came up k
    times, we can encode the (n+1)-st toss using Pr[heads] = k/n and
    Pr[tails] = (n-k)/n.

 -  A slight problem occurs when some of the symbol counts are 0. To avoid
    it, we will increase all counts by 1.

 -  If for all i, f_i is the count of occurrences of symbol a_i so far, we
    give symbol a_k probability p_k = (f_k + 1)/(sum_{j=1}^I{f_j + 1}).

 -  Arithmetic coding can easily be adapted to change probabilities in this
    way.

 -  Not clear how to implement this using Huffman coding.

 -  If we use an end-of-string marker #, we can give it a constant
    probability p_#, and perform the Laplace estimation on the remaining
    (1-p_#) of the probability space.

 -  One can check (see the sketch below) that the probability of a
    particular N-symbol string X is

        1/((N+I-1)!/(I-1)!) * prod_{j=1}^I{n_j!}

    where n_j is the number of occurrences of symbol a_j in X.
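
Here is a small sketch of the Laplace model (ignoring the end-of-string
marker; the alphabet and test string are made up): it multiplies the
adaptive estimates symbol by symbol and checks that the result matches the
closed-form expression above.

    from math import factorial, isclose

    def laplace_string_probability(string, alphabet):
        # Multiply the adaptive estimates p_k = (f_k + 1)/sum_j (f_j + 1),
        # updating the counts after each symbol.
        counts = {a: 0 for a in alphabet}
        prob = 1.0
        for sym in string:
            total = sum(counts[a] + 1 for a in alphabet)
            prob *= (counts[sym] + 1) / total
            counts[sym] += 1
        return prob

    def closed_form(string, alphabet):
        # 1/((N+I-1)!/(I-1)!) * prod_j n_j!
        N, I = len(string), len(alphabet)
        prod = 1
        for a in alphabet:
            prod *= factorial(string.count(a))
        return prod * factorial(I - 1) / factorial(N + I - 1)

    s, alpha = "abba", "ab"
    assert isclose(laplace_string_probability(s, alpha), closed_form(s, alpha))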

Reference: end of section 6.2.

--------------
Markov sources
--------------

Idea:

 -  So far, we assumed symbols are generated independently of one another.
    This might work for strings of coin tosses, but in general it is rarely
    the case.

Definition:

    A K-th order Markov source is one in which the probabilities for
    generating the next symbol depend on the preceding K symbols.

 -  The probability that a specific N-symbol string X is generated by a
    Markov source of order 2 is

        Pr[X] = Pr[X_1 = a_{i_1}] *
                Pr[X_2 = a_{i_2}|X_1 = a_{i_1}] *
                Pr[X_3 = a_{i_3}|X_2 = a_{i_2} and X_1 = a_{i_1}] *
                ..
                Pr[X_N = a_{i_N}|X_{N-1} = a_{i_{N-1}} and X_{N-2} = a_{i_{N-2}}]

 -  Using M(i,j,k) = Pr[X_l = a_k|X_{l-1} = a_j and X_{l-2} = a_i],

        Pr[X] = Pr[X_1 = a_{i_1}] *
                Pr[X_2 = a_{i_2}|X_1 = a_{i_1}] *
                M(i_1, i_2, i_3) *
                ..
                M(i_{N-2}, i_{N-1}, i_N)

 -  We call M(i,j,k) the probability of symbol a_k in context (a_i a_j).

 -  We need to handle the beginning of the string differently. For example,
    imagine that the missing context symbols are a special character
    meaning before-start-of-string.

 -  If these probabilities are not known, and in general they won't be,
    they can be estimated with the Laplace model as before (a sketch
    follows this list). Let F(i,j,k) be the count of occurrences of a_k in
    context (a_i a_j). Then use

        M(i,j,k) = (F(i,j,k) + 1)/sum_{k'=1}^I{F(i,j,k') + 1}
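
A sketch of this estimate for K=2 (function names and the example string
are made up): F counts how often each symbol follows each two-symbol
context, and M converts the counts into Laplace-smoothed probabilities.

    from collections import defaultdict

    def markov2_laplace(string, alphabet):
        # F[(i, j)][k] = number of times symbol k followed context (i, j).
        F = defaultdict(lambda: defaultdict(int))
        for pos in range(2, len(string)):
            context = (string[pos - 2], string[pos - 1])
            F[context][string[pos]] += 1

        def M(i, j, k):
            # Laplace estimate of Pr[X_l = k | X_{l-1} = j and X_{l-2} = i].
            ctx = F[(i, j)]
            total = sum(ctx[a] + 1 for a in alphabet)
            return (ctx[k] + 1) / total

        return M

    M = markov2_laplace("abababba", "ab")
    print(M("a", "b", "a"))   # estimated Pr[next symbol = a | context (a b)]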

-------------------------------------------------
Markov source + Laplace model + Arithmetic coding
-------------------------------------------------

Idea:

 -  Complete (quite potent) coding solution: for some fixed K, use a source
    model of order K, estimate symbol probabilities using the Laplace
    model, and encode symbols using Arithmetic coding (a sketch of the
    model side appears at the end of this section).

 -  Performance depends on the choice of K.

 -  large K:

     -  more space needed for keeping track of possible contexts

      -  counts (and thus probabilities) are more accurate in the long run

 -  small K:

     -  less space

     -  counts for various contexts get populated faster when the number of
        contexts is small

      -  the model doesn't learn as much in the long run
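
As a closing sketch (an assumed interface, not the course's reference
implementation): an adaptive order-K model that an arithmetic coder could
query for the next-symbol distribution, with both encoder and decoder
calling update() after each symbol so their estimates stay in sync.

    from collections import defaultdict

    class AdaptiveMarkovModel:
        """Order-K Markov model with Laplace-smoothed adaptive counts."""

        def __init__(self, alphabet, K):
            self.alphabet = list(alphabet)
            self.K = K
            self.counts = defaultdict(lambda: defaultdict(int))
            self.context = ("^",) * K   # "^" stands for before-start-of-string

        def next_symbol_probs(self):
            # Laplace estimate for every symbol in the current context.
            ctx = self.counts[self.context]
            total = sum(ctx[a] + 1 for a in self.alphabet)
            return {a: (ctx[a] + 1) / total for a in self.alphabet}

        def update(self, symbol):
            # Record the symbol and slide the context window forward.
            self.counts[self.context][symbol] += 1
            self.context = (self.context + (symbol,))[-self.K:]

    model = AdaptiveMarkovModel("ab", K=2)
    for sym in "abba":
        probs = model.next_symbol_probs()   # an arithmetic coder would use these
        model.update(sym)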