Here are some project ideas for CSC 2541. You're welcome (indeed, encouraged) to come up with your own idea, of course. I'll add to this list as ideas occur to me.

If more than one person/group wants to do the same idea from below, I'll talk to them to see which has a more suitable background, etc. I might also suggest that two people get together and do it as a group. Another possibility is that one idea could produce two projects, if two or more people/groups come up with somewhat different aspects of the idea to focus on.

For all these ideas, if they turn out to be interesting, it would be worth writing a paper on them. For the course project, however, it is enough to do a preliminary assessment - seeing whether the basic idea is promising. It's OK if it turns out to not be promising, as long as you did a good job demonstrating that.

1) MCMC for mixture models without component indicators.

The MCMC methods I presented for mixtures include in the state, for each observation, an indicator of which mixture component it comes from. These indicators are updated during the MCMC run. This is what is commonly done, perhaps because it sometimes leads to nice simple Gibbs sampling algorithms. However, for finite mixtures, one can sum the probabilities of a data point coming from each component, and use the likelihood found this way to compute the posterior probability (up to an unknown normalizing factor) for a state consisting of only the mixing proportions and the parameters of each mixture component. One could then use Metropolis, slice sampling, or other MCMC updates to sample for this state.

The project would investigate whether this works better or worse than standard methods using component indicators. There are many possible variations, of course, so this isn't a straightforward assessment. One could also try to think of a way of handling infinite mixtures this way, or at least handling mixtures with a large number of components efficiently when many components are not actually used.
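As a starting point for idea 1, here is a minimal sketch of what sampling without indicators might look like for a one-dimensional Gaussian mixture. Everything specific in it is an assumption made for illustration, not something fixed by the project description: the function names, the softmax/log parameterization of the mixing proportions and standard deviations, the N(0, 10^2) prior on each unconstrained coordinate, and the choice of plain random-walk Metropolis.

```python
import math
import random


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def unpack(state, k):
    """Split the unconstrained state into (log weights, means, sds)."""
    logits, means, log_sds = state[:k], state[k:2 * k], state[2 * k:]
    log_norm = log_sum_exp(logits)
    log_weights = [a - log_norm for a in logits]  # softmax, in log form
    sds = [math.exp(s) for s in log_sds]
    return log_weights, means, sds


def log_posterior(state, data, k):
    """Log posterior, up to a constant, with the indicators summed out."""
    log_weights, means, sds = unpack(state, k)
    # Illustrative prior (an assumption): N(0, 10^2) on each coordinate.
    lp = sum(-0.5 * (v / 10.0) ** 2 for v in state)
    for x in data:
        terms = [
            lw - math.log(sd) - 0.5 * math.log(2 * math.pi)
            - 0.5 * ((x - mu) / sd) ** 2
            for lw, mu, sd in zip(log_weights, means, sds)
        ]
        lp += log_sum_exp(terms)  # log of sum_j pi_j N(x | mu_j, sd_j^2)
    return lp


def metropolis(data, k, n_iter, step=0.1, seed=0):
    """Random-walk Metropolis; returns (samples, acceptance rate)."""
    rng = random.Random(seed)
    state = [0.0] * k + [rng.gauss(0, 1) for _ in range(k)] + [0.0] * k
    current = log_posterior(state, data, k)
    samples, accepted = [], 0
    for _ in range(n_iter):
        proposal = [v + rng.gauss(0, step) for v in state]
        proposed = log_posterior(proposal, data, k)
        if rng.random() < math.exp(min(0.0, proposed - current)):
            state, current = proposal, proposed
            accepted += 1
        samples.append(unpack(state, k))
    return samples, accepted / n_iter
```

Calling metropolis(data, 2, 5000) on a small list of numbers gives a chain of (log weights, means, sds) triples. A real comparison with indicator-based Gibbs sampling would need tuned step sizes, a sensible prior, and some treatment of label switching, none of which this sketch attempts.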
2) Density estimation by regression.

One can always write a joint density for x1, x2, ..., xn as

    P(x1,x2,...,xn) = P(x1) P(x2|x1) P(x3|x1,x2) ... P(xn|x1,...,x(n-1))

We could model the factor P(x1) as a univariate Gaussian, with unknown mean and variance, and each subsequent factor by a regression model of xk on x1,...,x(k-1), with Gaussian residuals of constant variance. If the regression models are linear in the predictors, this produces a multivariate Gaussian model for x1,x2,...,xn. We could instead use a regression model that allows for non-linear relationships of xk to x1,...,x(k-1), such as one based on Gaussian processes, which we will get to shortly, and which can be seen as the infinite basis function models of short exercise 4.

One could try doing this with non-Bayesian methods, but a Bayesian approach may be particularly suitable, since there seems to be a lot of potential for overfitting here. One problem is selecting the order of variables, which could make a big difference (e.g., since P(x1) is modeled as Gaussian, we want x1 to have a Gaussian distribution). Randomly rotating the coordinate system could also be considered.

3) Behaviour of finite or infinite mixture models as the number of variables increases.

We saw in lectures how one can set up a Bayesian mixture model to have sensible behaviour as the number of components goes to infinity. What happens as the number of variables goes to infinity? Knowing this could be informative about performance on lots of bioinformatics problems with tens of thousands of genes. The answer will presumably depend on what the increasing number of variables are like. One could look at

    a) All variables are actually independent and have distributions in the family used for the mixture components, so one mixture component is all that's needed.

    b) Some fixed number of variables have distributions that need more than one component to model, or have dependencies that need to be modeled using more than one component. An unlimited number of other variables are independent and have distributions that can be modeled by one component, as for (a).

    c) All variables are dependent. There are many ways they could be dependent, of course...

You might investigate these questions entirely theoretically, or by doing numerical experiments, or both.
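For numerical experiments on idea 3, one would first need synthetic data of each kind. The sketch below shows one way the three situations (a), (b), and (c) might be simulated. The function name simulate, its arguments, the cluster centres at -2 and +2, and the single shared latent factor used for (c) are all illustrative assumptions. In particular, (c) could be set up in many other ways.

```python
import random


def simulate(scenario, n_cases, n_vars, n_structured=2, seed=0):
    """Return n_cases rows of n_vars values for scenario 'a', 'b', or 'c'."""
    rng = random.Random(seed)
    k = min(n_structured, n_vars)  # number of variables with structure in (b)
    rows = []
    for _ in range(n_cases):
        if scenario == "a":
            # All variables independent N(0, 1): one component suffices.
            row = [rng.gauss(0, 1) for _ in range(n_vars)]
        elif scenario == "b":
            # The first k variables share a two-cluster label (assumed
            # centres -2 and +2); the rest are independent N(0, 1).
            centre = rng.choice([-2.0, 2.0])
            row = [rng.gauss(centre, 1) for _ in range(k)]
            row += [rng.gauss(0, 1) for _ in range(n_vars - k)]
        elif scenario == "c":
            # Every variable depends on one shared latent factor, which is
            # just one of many possible forms of dependence.
            factor = rng.gauss(0, 1)
            row = [factor + rng.gauss(0, 1) for _ in range(n_vars)]
        else:
            raise ValueError("scenario must be 'a', 'b', or 'c'")
        rows.append(row)
    return rows
```

One could then fit the same mixture model to simulate("b", 100, p) for increasing p and watch how the posterior over the number of components in use changes as the uninformative variables accumulate.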