Tutorial 4 for CSC 310

For tomorrow's tutorial, I'd like you to get them thinking about the idea that a model of a source is a way of assigning probabilities to sequences of symbols. The motivation is that if we think a sequence (ie, a message) has probability p, then we should encode that sequence using log2(1/p) bits, since that is its Shannon information content. Conversely, if we are using some compression scheme that encodes message m in l_m bits, then we are implicitly making the assumption that message m has probability 2^(-l_m).

Previously in this course, we assumed that (a) symbols in a sequence are independent, and (b) we know what the probabilities are for the different symbols. Let's stick (sort of, see below) with (a) for the moment, but consider what happens when (b) isn't true.

As a concrete example, consider encoding a sequence of 0's and 1's of length ten. For example, we might want to encode the sequence 0111011111. If we assume independence and know the probability of a 1 (call that p_1), then what is the probability of this sequence? (Answer: p_1^8 (1-p_1)^2)

But what if we don't know p_1? Can we say anything useful about the probabilities?

Suppose we decide we can't say anything useful - that we might as well consider ALL sequences to be equally likely. How should we encode a sequence in that case? (Answer: just use the obvious method, encoding each sequence in 10 bits.) Clearly, we can't do this if we hope to compress the data.

Is there anything we know about these probabilities, even if we don't know p_1? Hint: Can we say anything about the relative probabilities of the sequences 0111011111, 1110111101, and 0011111111? (Answer: they all have the same probability, whatever it is.) In fact, all sequences with the same number of 1's will have the same probability, regardless of what p_1 is.

Since all sequences with the same number of 1's are equally likely, all we REALLY don't know is how likely the different numbers of 1's are. Let's call the number of 1's in the message S.

For instance, suppose we knew that there is a 10% chance that 8 of the 10 symbols will be 1's - ie, that S=8. What is the probability of the sequence 0111011111? (Answer: it's 0.1 divided by the number of sequences in which 8 of 10 symbols are 1's, which is "10 choose 8" = "10 choose 2" = (10*9)/(1*2) = 45)

So what would be a good choice for a distribution for S? For starters, what is the distribution of S when we assume all sequences are equally likely (an assumption which we saw produced no compression)? (Answer: it's the binomial(10,1/2) distribution, with the probability of k 1's being (10 choose k)/2^10.) What's the shape of this distribution? (Answer: it strongly favours values near 5, giving much smaller probability to extreme values like 0 or 10.)

We need some other distribution if we want to compress. What would be a likely candidate? (Get them to suggest a uniform distribution, which in this case is over the integers from 0 to 10.)

If we assume a uniform distribution for S - ie, that P(S=k) = 1/11 for k=0,1,...,10 - then what is the probability of the sequence 0111011111? (Answer: (1/11)*(1/45))

So far so good, but this doesn't seem to be of much practical use, since we'd have to compute huge numbers of probabilities for sequences in order to encode directly based on these probabilities.
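If you want a quick way to check these numbers (or to show them at the board), here's a small Python sketch of the "uniform distribution for S" model. It only computes the probability of one given sequence, not a whole code, and the function name is just for illustration:

    from math import comb, log2

    def prob_uniform_count(seq):
        # P(seq) when S (the number of 1's) is uniform on 0..n and all
        # sequences with the same value of S are equally likely.
        n = len(seq)
        k = seq.count('1')
        return (1 / (n + 1)) * (1 / comb(n, k))

    p = prob_uniform_count("0111011111")
    print(p)            # (1/11)*(1/45) = 1/495
    print(log2(1/p))    # about 8.95 bits, vs 10 bits for the "obvious" code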
Let's consider a "kludge" instead: Predict that the probability of each symbol being a 1 is equal to (n_1 + 1) / (n_0 + n_1 + 2), where n_0 is the number of EARLIER symbols that are 0's and n_1 is the number of earlier symbols that are 1's. We basically estimate the probability of a 1 its frequency in earlier symbols, except we add 1 to the counts of boths 0's and 1's to avoid either probability being zero. With this scheme, what's the probability of the sequence 0111011111? Answer: We need to multiply the probabilities of successive symbols together, with each probability being based on the counts of earlier symbols. The result: 0 1 1 1 0 1 1 1 1 1 (1/2) (1/3) (2/4) (3/5) (2/6) (4/7) (5/8) (6/9) (7/10) (8/11) which equals (1/11) (2!8!/10!) = (1/11)(1/45). Note that this is the same probability as this sequence had under the previous scheme! Is this true in general. (Answer: yes) So we see that we can use the previous scheme in practice after all, since we could easily use these successive probabilities in conjunction with arithmetic coding. And we see that the second (equivalent) scheme maybe isn't such a kludge. Note: In a sense, the symbols making up a sequence are no longer independent with this scheme, since their probabilities depend on earlier symbols. However, this dependence is merely the result of us not knowing p_1; it doesn't mean that we think that one symbol influences another in any physical sense. Technically, although the symbols in the sequence aren't independent, the probability distribution over sequences IS "exchangeable", meaning that the order of symbols doesn't matter. This is the counterpart to independence when we don't know the probabilities. In the unlikely event that you still have time: Consider a modified scheme in which the probability of a 1 is 1/(n+2) if all n previous symbols are 0's, and (n+1)/(n+2) if all previous n symbols are 1's. If some 0's and some 1's have been seen earlier, then we estimate probabilities the frequences (without adding 1), so the probability of a 1 is n_1/(n_0+n_1). This scheme also avoids probabilities of 0 or 1. Does this scheme produce an exchangeable distribution? Answer: No. The sequence 011 has probability (1/2)(1/3)(1/2) = 1/12, whereas the sequence 110 has probability (1/2)(2/3)(1/4) = 1/24. Does this matter?