MCMC methods for mixture models typically use a state that includes indicators for which component each training case came from. If conjugate priors are used, one can then do Gibbs sampling. An alternative is to let the state consist only of the mixing proportions and the parameters of the component distributions, with the component indicators summed over. Gibbs sampling will not be possible, but the Metropolis algorithm, slice sampling, Hamiltonian Monte Carlo methods, and others should be possible. How well does this approach work compared with the conventional approach using component indicators? This project is now taken.
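The key object in the alternative approach is the likelihood with the component indicators summed over analytically. A minimal sketch for a univariate Gaussian mixture (the function name and vectorized layout here are illustrative choices, not part of the project statement) — this marginal log-likelihood can then be plugged into Metropolis, slice sampling, or HMC updates for the mixing proportions and component parameters:

```python
import numpy as np

def mixture_loglik(x, pi, mu, sigma):
    """Log-likelihood of data x under a univariate Gaussian mixture,
    with the component indicators summed over analytically:
        p(x_i) = sum_k pi_k N(x_i | mu_k, sigma_k^2)."""
    # log N(x_i | mu_k, sigma_k) for every case i and component k
    log_comp = (-0.5 * np.log(2 * np.pi * sigma**2)
                - (x[:, None] - mu)**2 / (2 * sigma**2))
    # add log mixing proportions, then log-sum-exp over components
    log_terms = np.log(pi) + log_comp
    m = log_terms.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0]
                        + np.log(np.exp(log_terms - m).sum(axis=1))))
```

Since the indicators never appear in the state, each evaluation costs O(nK) but the sampler moves in a much lower-dimensional space than the conventional Gibbs sampler with indicators.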
Logistic regression models class probabilities as a function of the vector of input variables, x, by

P(Y=1|X=x) = 1/[1+exp(-(a+bx))]

where the scalar a and vector b are parameters to be learned from the training data (bx is the scalar product). This model can lead to probabilities that are arbitrarily close to 0 or 1. We might think this is unrealistic - for example, if we think that there is some possibility that the class recorded is just an error. We could modify the model to avoid probabilities near 0 or 1 as follows:

P(Y=1|X=x) = c + [1-c-d]/[1+exp(-(a+bx))]

where c and d are additional parameters, which might be fixed on the basis of our knowledge of the problem, or might be learned from the data. The minimum probability is now c and the maximum is 1-d.
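The modified model is straightforward to compute. A small sketch (function name and the illustrative values of c and d are assumptions for the example):

```python
import math

def bounded_logistic_prob(x, a, b, c, d):
    """P(Y=1|X=x) = c + (1-c-d) / (1 + exp(-(a + b.x))),
    so the probability is confined to the interval [c, 1-d]."""
    s = a + sum(bj * xj for bj, xj in zip(b, x))  # a + bx
    return c + (1 - c - d) / (1 + math.exp(-s))
```

Setting c = d = 0 recovers ordinary logistic regression; with c, d > 0 even a case far from the decision boundary keeps a nonzero probability of either class, which is what protects against recording errors.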
The aim of this project would be to examine how well this works in a Bayesian framework. You might compare with how well the model works using maximum likelihood estimates. You might try to see if the model improves performance on interesting tasks like digit recognition or classification using DNA microarray data. This project is now taken.
As noted above, logistic regression models class probabilities as a function of the vector of input variables, x, by

P(Y=1|X=x) = 1/[1+exp(-(a+bx))]

where the scalar a and vector b are parameters to be learned from the training data (bx is the scalar product).
Suppose that the input variables, x, are high-dimensional (eg, 1000 dimensions), but in the training and test cases, the x values lie close to (but not exactly on) some lower-dimensional subspace (eg, 10 dimensions). What effect does this have when using Bayesian inference for this model? In particular, how does it affect the ability to predict the class in test cases, and how does it affect the correctness of the model's idea of how uncertain these predictions are?
You might start by using a simple multivariate Gaussian prior for b, with covariance matrix cI, where c is an unknown hyperparameter to be inferred from the data. You can first test this model on data generated with a value for b drawn from this prior (and with x values generated to lie near a lower-dimensional subspace). You can then try testing on data in which the class depends only on the projection of x onto the subspace, not on the (small) distance of x from this subspace. Can one do a better job for such data by using a prior for b that depends on the observed x?
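Generating inputs that lie near a lower-dimensional subspace can be done by drawing coordinates in a random subspace and adding small isotropic noise. A sketch under assumed parameter names (the function and its defaults are illustrative, not prescribed by the project):

```python
import numpy as np

def near_subspace_data(n, dim, sub, noise, seed=0):
    """Generate n input vectors in dim dimensions lying close to a
    random sub-dimensional linear subspace: each x is a point in the
    subspace plus small isotropic Gaussian noise of scale `noise`."""
    rng = np.random.default_rng(seed)
    # orthonormal basis for a random sub-dimensional subspace
    basis, _ = np.linalg.qr(rng.standard_normal((dim, sub)))
    z = rng.standard_normal((n, sub))          # coordinates within the subspace
    x = z @ basis.T + noise * rng.standard_normal((n, dim))
    return x, basis
```

The returned basis lets you construct the two kinds of targets described above: classes that depend on b'x for a b drawn from the prior, versus classes that depend only on the projection (x @ basis) onto the subspace.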
For this project, a combination of empirical and theoretical work might be best.
Consider a classification problem with no noise - ie, in which the input space can be divided into regions such that points within one region have only one class. Hard linear classifiers can be seen in this way, with there being two regions, separated by a line. There's no unique way of generalizing this to more than one region. One possibility is to specify a set of points in the input space, and define what are called "Voronoi" regions, each of which consists of all points closer to one of these points than to any other of them. If we then associate a class with each such region, we will have a classifier. There could be any number of classes (eg, just two, the same number as the number of regions, or more than two but fewer than the number of regions).
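Given the points and their associated classes, classification amounts to a nearest-point lookup, since x falls in the Voronoi region of whichever point is closest. A minimal sketch (function name is illustrative):

```python
import numpy as np

def voronoi_classify(points, labels, x):
    """Classify x by the class label attached to the nearest of the
    given points - ie, by which Voronoi region x falls in."""
    d = np.sum((points - np.asarray(x))**2, axis=1)  # squared distances
    return labels[int(np.argmin(d))]
```

In the Bayesian version, the points and labels would be drawn from a prior, and predictions for x would average this classifier over the posterior rather than use a single fixed set of points.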
In this project, you would try a Bayesian approach to this, in which the locations of the points and their associated classes come from some prior distribution. You could try to see whether letting the number of points (and hence regions) go to infinity makes sense, for some prior. You could also compare with other approaches, such as some sort of generalization of "maximum margin" ideas. This project is now taken.