Steve's Explanation of MEMMs

MEMM stands for Maximum Entropy Markov Models, which are a variation on the traditional Hidden Markov Models (HMMs). These models attempts to characterize a string of tokens (such as words in a sentence, or sound fragments in a speech signal) as a most likely set of transitions through a Markov model, which is a special finite state machine. For the sake of simplicity, I assume that each of the states corresponds to a conceptual stage in the sentence or speech signal, and each state can emit certain tokens (i.e. words or speech fragments). The most likely path through the HMM or MEMM would be defined as the one that is most likely to generate the observed sequence of tokens.

Maximum Entropy

Entropy is a measure of the randomness of a system. Maximum entropy is therefore the push towards the greatest randomness, under the given contraints. Because of this, applying maximum entropy principles to a problem attempts to generate the most general models possible, under certain circumstances.

The idea behind maximum entropy models is that instead of trying to train a model to simply emit the tokens from the training data, one can instead create a set of boolean features, and then train a model to exhibit these features in the same proportions that they are found in the training data. Examples of features for sentences would be if there is a date listed in the sentence, if the sentence is over five words in length, if it contains a fragment in quotes, et cetera.

In this case, the features would be the constraints described above, and maximum entropy will try to generate the most general model possible, given the specified features/constraints. The reason why training on features is sometimes better than training on output tokens is because sometimes the features are more interesting than the individual tokens, and in the case of smaller training sets, it can be difficult to train a strong model with sparse data. However, a smaller set of features can be easily extracted from small datasets, and this more populated data will produce better training as a result.

How It's Done

So assuming that one has training data and a set of feature functions that will determine whether a token has a certain feature or not, one can create a Markov model with its initial transition probabilities set to arbitrary values. These state transition probablilities would be proportional to a product of feature parameters, one for each feature exhibited by that state. The current model is tested against the training data to determine the likelihood of the different paths through the model. Since certain features occur with the combination of certain states and output tokens, the probability of encountering a given feature from the current model can be calculated as the probability of encountering those states that exhibit that feature.

If there is a mismatch between a feature exhibited by the model and the feature exhibited by the training set, the parameter for that feature is adjusted to compensate. If the feature occurs in greater proportions in the model than in the training set, the transition probabilities are lowered for all states that exhibit that feature. This is done by lowering the parameter corresponding to that feature (remembering that the transition probability for a state is proportional the the product of all its feature parameters).

Once all the feature parameters are updated, all the transition probabilities should be improved from before, and the model is tested and adjusted repeatedly until the features exhibited by the model converge to those of the training set.


This explanation is derived from my interpretation of:

  • Maximum Entropy Markov Models for Information Extraction and Segmentation by Andrew McCallum, Dayne Freitag and Fernando Pereira, Proceedings of the Seventeenth International Conference on Machine Learning 2000
  • Adam Berger's Brief Maxent Tutorial, http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/tutorial.html