===========================================================================
CSC 310H                     Lecture 10 Summary                   Fall 2007
===========================================================================

------
Topics
------

 -  Doubling the interval around 1/2.
 -  Arithmetic Coding, take 3.
 -  How many final bits?
 -  Buffer decoding.
 -  Integer implementation.
 -  References: section 6.2, lectures 7-8 from 2006 course by prof Roweis.

--------------------------------
Doubling the interval around 1/2
--------------------------------

Idea:

 -  We do not want v-u to become small, in order to prevent floating point
    arithmetic errors.

 -  In our algorithm, [u,v) might become smaller and smaller, yet always
    including the midpoint 1/2.

 -  The moment [u,v) is included in [1/4, 3/4) = [0.01, 0.11), we know that
    if we later discover that the next bit to transmit is going to be 1,
    after the expansion, [u,v) will be contained in [0.0, 0.1), so the
    next bit will be 0. Similarly, if the next bit turns out to be 0, we
    know that the one following that will be a 1.

 -  There might be several of these expansions in a row.

 -  We use this observation as follows. We keep a counter c of "opposite"
    bits that we have to output immediately after the next bit.

-------------------------
Arithmetic coding, take 3
-------------------------

Algo:

    u <- 0
    v <- 1
    c <- 0
    for every symbol a_{i_k} in turn
        r <- v-u
        u <- u + r*sum_{j=1}^{i_k-1}{p_j}
        v <- v + r*p_{i_k}
        while (u >= 1/2) or (v <= 1/2) or ((u >= 1/4) and (v <= 3/4))
            if u >= 1/2
                send 1
                send c 0s
                c <- 0
                u <- 2*(u-1/2)
                v <- 2*(v-1/2)
            else if v <= 1/2
                send 0
                send c 1s
                c <- 0
                u <- 2*u
                v <- 2*v
            else if ((u >= 1/4) and (v <= 3/4))
                c <- c+1
                u <- 2*(u-1/4)
                v <- 2*(v-1/4)
            end if
        end while
    end for
    send enough final bits to specify a number inside [u,v)

--------------------
How many final bits?
--------------------

 -  With the current scheme, it will always be the case that either
    [0, 1/2) or [1/2, 3/4) is completely contained inside [u, v). Thus, at
    most 2 final bits are needed to specify a number inside [u, v): either
    01 or 10.

---------------
Buffer decoding
---------------

Idea:

 -  We again keep an interval of interest [u,v).

 -  We perform decoding by loading a buffer b of size k with bits from the
    message we receive.

 -  We identify the next letter to output by looking where inside [u,v) the
    2^{-k} interval int(b) is located.

 -  We expand the interval [u,v) as in the encoding part, and every time
    we perform an expansion, we discard one bit from the buffer and we read
    in another input bit.

 -  NOTE: THE BIT WE DISCARD IS DIFFERENT DEPENDING ON THE EXPANSION.

-----------------------------------
Arithmetic Coding, take 3, decoding
-----------------------------------

Algo:

    [u,v) <- [0,1)
    read k bits into buffer b
    do
        divide [u,v) into subintervals
        find i so that int(b) is a subset of the subinterval of a_i
        output a_i
        [u,v) <- subinterval corresponding to a_i
        while (u >= 1/2) or (v <= 1/2) or ((u >= 1/4) and (v <= 3/4))
            if u >= 1/2
                [u,v) <- [2*(u-1/2),2*(v-1/2))
                discard leftmost bit in b (it should be a 1 anyhow)
                shift contents of b to the left
                read in next input bit in the rightmost position of b
            else if v <= 1/2
                [u,v) <- [2*u,2*v)
                discard leftmost bit in b (it should be a 0 anyhow)
                shift contents of b to the left
                read in next input bit in the rightmost position of b
            else if (u >= 1/4) and (v <= 3/4)
                [u,v) <- [2*(u-1/4),2*(v-1/4))
                discard 2nd leftmost bit of b (should be opposite of leftmost)
                shift contents of b to the left but keeping leftmost bit
                read in next input bit in the rightmost position of b
            end if
        end while
    end do

Note:

 -  Whenever there are no bits left to read in the encoding, use 0s.

 -  Must make sure it is possible to select next symbol a_i using current
    buffer b of size k.

----------------------
Integer implementation
----------------------

Ideas:

 -  We want to avoid using floating point arithmetic. Instead, use
    m-bit integers everywhere. How big should m be?

 -  We can estimate probabilities with integer counts. Suppose that
    {a_1, .., a_I} occur with counts {f_1, .., f_I}. For all i, define
    F_i = sum_{j=1}^i{f_j}. Then, we can use f_i/F_I instead of p_i.

 -  Use integers u' and v' instead of u and v with the intended meaning
    that [u',v') represents the interval [u'/2^m, (v'+1)/2^m).

 -  Adding the 1 to v' allows us to represent the endpoint v=1 using only
    m bits. Without the +1, we would have needed m+1 bits: 2^m itself is a
    1 followed by m (not m-1) 0s.

 -  The smallest the interval [u', v') is going to get is (2^m - 1)/4. We
    must make sure none of the symbols get an empty subinterval. The
    smallest probability can be 1/F_I (if there are symbols with a count of
    0, bump them up to 1 to prevent problems). Thus, we want

        (2^m - 1)/4 * 1/F_I >= 1

        m >= log(1 + 4*F_I)

 -  Setting m = 2 + ceil(log(F_I + 1)) is enough.

---------------------------
How to terminate the string
---------------------------

 -  We need a way to be able to tell the decoder to stop.

 -  Use a special symbol #, that stands for end-of-transmission.

 -  Give it a constant probability, say p_# = .1, and split the remaining
    probability space (1-p_#) among the other symbols.