Here are some project ideas for CSC 2541. You're welcome (indeed,
encouraged) to come up with your own idea, of course. I'll add to
this list as ideas occur to me.
If more than one person/group wants to do the same idea from below,
I'll talk to them to see which has a more suitable background, etc.
I might also suggest that two people get together and do it as a
group. Another possibility is that one idea could produce two
projects, if two or more people/groups come up with somewhat different
aspects of the idea to focus on.
For all these ideas, if they turn out to be interesting, it would be
worth writing a paper on them. For the course project, however, it is
enough to do a preliminary assessment - seeing whether the basic idea
is promising. It's OK if it turns out to not be promising, as long as
you did a good job demonstrating that.
1) MCMC for mixture models without component indicators. The MCMC
methods I presented for mixtures include in the state, for each
observation, an indicator of which mixture component it comes from,
and these indicators are updated during the MCMC run. This is what is commonly
done, perhaps because it sometimes leads to nice simple Gibbs
sampling algorithms.
However, for finite mixtures, one can sum the probabilities of a
data point coming from each component, and use the likelihood
found this way to compute the posterior probability (up to an unknown
factor) for a state consisting of only the mixing proportions
and the parameters of each mixture component. One could then use Metropolis,
slice sampling, or other MCMC updates to sample for this state.
The project would investigate whether this works better or worse
than standard methods using component indicators. There are
many possible variations, of course, so this isn't a straightforward
assessment. One could also try to think of a way of handling
infinite mixtures this way, or at least handling mixtures with a
large number of components efficiently when many components are not
actually used.
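
As a rough illustration of the indicator-free approach in idea (1), here is a
minimal Python sketch for a one-dimensional Gaussian mixture. Everything
specific in it is an assumption made for the example, not something fixed by
the project description: the function names, the unconstrained
parameterization (mixing logits, means, log standard deviations), the
N(0, 10^2) prior on each element of the state, and the use of simple
random-walk Metropolis updates.

```python
import numpy as np

def log_sum_exp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def log_posterior(theta, x, K):
    """Unnormalized log posterior with the component indicators summed out.
    theta packs K mixing logits, K means, and K log standard deviations."""
    logits, mu, log_sd = theta[:K], theta[K:2 * K], theta[2 * K:]
    log_w = logits - log_sum_exp(logits, axis=0)      # log mixing proportions
    # log N(x_i | mu_k, sd_k^2) for every data point i and component k
    z = (x[:, None] - mu[None, :]) / np.exp(log_sd)[None, :]
    log_comp = -0.5 * z ** 2 - log_sd[None, :] - 0.5 * np.log(2 * np.pi)
    # Sum over components inside the log, then over data points outside it
    log_lik = np.sum(log_sum_exp(log_w[None, :] + log_comp, axis=1))
    # Assumed prior (illustrative only): independent N(0, 10^2) on each element
    log_prior = -0.5 * np.sum((theta / 10.0) ** 2)
    return log_lik + log_prior

def metropolis(log_post, theta0, n_iter, step, rng):
    """Random-walk Metropolis on the whole state; returns samples and acceptance rate."""
    theta = np.array(theta0, dtype=float)
    lp = log_post(theta)
    samples = np.empty((n_iter, theta.size))
    accepted = 0
    for t in range(n_iter):
        prop = theta + step * rng.standard_normal(theta.size)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:      # Metropolis acceptance test
            theta, lp = prop, lp_prop
            accepted += 1
        samples[t] = theta
    return samples, accepted / n_iter
```

One would call this as, say,
metropolis(lambda th: log_posterior(th, x, K), theta0, 5000, 0.1,
np.random.default_rng(0)), and then compare its mixing with that of a sampler
that keeps the indicators in the state. The step size 0.1 is arbitrary and
would need tuning.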
2) Density estimation by regression. One can always write a joint
density for x1, x2, ..., xn as
P(x1,x2,...,xn) = P(x1)P(x2|x1)P(x3|x1,x2)...P(xn|x1,...,x(n-1))
We could model the factor P(x1) as a univariate Gaussian, with
unknown mean and variance, and each subsequent factor by a regression
model of xk on x1,...,x(k-1), with Gaussian residuals of constant
variance. If the regression models are linear in the predictors,
this produces a multivariate Gaussian model for x1,x2,...,xn. We
could instead use a regression model that allows for non-linear
relationships of xk to x1,...,x(k-1), such as one based on Gaussian
processes, which we will get to shortly, and which can be seen as
the infinite basis function models of short exercise 4.
One could try doing this with non-Bayesian methods, but a Bayesian
approach may be particularly suitable, since there seems to be a
lot of potential for overfitting here.
One problem is selecting the order of variables, which could make
a big difference (e.g., since P(x1) is modeled as Gaussian, we want
x1 to be a variable whose distribution is close to Gaussian).
Randomly rotating the coordinate system could also be considered.
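
To make the factorization in idea (2) concrete, here is a small Python sketch
of the linear special case. It is deliberately non-Bayesian (least squares
with plug-in maximum likelihood variances), so it only illustrates the
structure; the function names are made up for the example. With these
estimates the result should coincide with a multivariate Gaussian fitted by
maximum likelihood, which gives a handy correctness check. A non-linear
version would replace the least-squares step with, for instance, a Gaussian
process regression.

```python
import numpy as np

def fit_chain(X):
    """Fit xk on an intercept and x1..x(k-1) by least squares, for each k.
    Returns a list of (coefficients, residual variance) pairs, one per factor."""
    n, d = X.shape
    factors = []
    for k in range(d):
        # Predictors for the k-th factor: a column of ones, then the earlier variables
        A = np.hstack([np.ones((n, 1)), X[:, :k]])
        beta, *_ = np.linalg.lstsq(A, X[:, k], rcond=None)
        resid = X[:, k] - A @ beta
        factors.append((beta, np.mean(resid ** 2)))   # ML estimate of the variance
    return factors

def log_density(factors, x):
    """Log joint density at one point x: the sum of the log conditional densities."""
    total = 0.0
    for k, (beta, var) in enumerate(factors):
        mean = beta[0] + beta[1:] @ x[:k]             # regression prediction for xk
        total += -0.5 * np.log(2 * np.pi * var) - 0.5 * (x[k] - mean) ** 2 / var
    return total
```

Permuting the columns of X before calling fit_chain is one simple way to
explore how much the ordering of variables matters once the regressions are
no longer linear.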
3) Behaviour of finite or infinite mixture models as the number of
variables increases. We saw in lectures how one can set up a
Bayesian mixture model to have sensible behaviour as the number
of components goes to infinity. What happens as the number of
variables goes to infinity? Knowing this could be informative
about performance on lots of bioinformatics problems with tens
of thousands of genes.
The answer will presumably depend on what the increasing number
of variables are like. One could look at
a) All variables are actually independent and have distributions
in the family used for the mixture components, so one mixture
component is all that's needed.
b) Some fixed number of variables have distributions that need
more than one component to model, or have dependencies that
need to be modeled using more than one component. An unlimited
number of other variables are independent and have distributions
that can be modeled by one component, as for (a).
c) All variables are dependent. There are many ways they could
be dependent, of course...
You might investigate these questions entirely theoretically, or
by doing numerical experiments, or both.
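
For the numerical route in idea (3), a data generator along the following
lines might be a starting point. It is only a sketch under assumptions of my
own choosing: standard Gaussian variables, a two-component structure shared
by the first few variables, and a hypothetical "separation" parameter for how
far apart the two components are. Setting p_mixture to zero gives scenario
(a); holding p_mixture fixed while p grows gives one version of scenario (b).

```python
import numpy as np

def simulate(n, p, p_mixture, rng, separation=3.0):
    """Generate n cases with p variables.
    The first p_mixture variables are shifted by a hidden two-valued component
    label shared within each case; the remaining variables are independent N(0,1)."""
    z = rng.integers(0, 2, size=n)                    # hidden component of each case
    X = rng.standard_normal((n, p))
    X[:, :p_mixture] += separation * z[:, None]       # shift only the first p_mixture columns
    return X, z
```

One could then fit the mixture model to data sets with increasing p and see
how the posterior distribution of the number of components in use changes.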