Steve's Explanation of MEMMs
MEMM stands for Maximum Entropy Markov Model, a variation on the
traditional Hidden Markov Model (HMM). These models attempt to characterize
a string of tokens (such as words in a sentence, or sound fragments in a
speech signal) as the most likely sequence of transitions through a Markov
model, a special kind of finite state machine. For the sake of simplicity, I
assume that each state corresponds to a conceptual stage in the sentence or
speech signal, and that each state can emit certain tokens (i.e. words or
speech fragments). The most likely path through the HMM or MEMM is then the
one most likely to generate the observed sequence of tokens.
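To make "most likely path" concrete, here is a minimal Viterbi sketch in Python. The two states, the words, and all the probabilities are toy numbers invented for illustration, not anything from the paper:

```python
# Minimal Viterbi search for the most likely state path through a
# small HMM. States, transitions, and emissions are hypothetical.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence."""
    # best[s] = (prob of the best path ending in s, that path)
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for tok in obs[1:]:
        new_best = {}
        for s in states:
            prob, path = max(
                (best[prev][0] * trans_p[prev][s] * emit_p[s][tok],
                 best[prev][1])
                for prev in states)
            new_best[s] = (prob, path + [s])
        best = new_best
    return max(best.values())

# Toy two-state example (made-up numbers).
states = ["Noun", "Verb"]
start_p = {"Noun": 0.6, "Verb": 0.4}
trans_p = {"Noun": {"Noun": 0.3, "Verb": 0.7},
           "Verb": {"Noun": 0.8, "Verb": 0.2}}
emit_p = {"Noun": {"dog": 0.9, "runs": 0.1},
          "Verb": {"dog": 0.2, "runs": 0.8}}

prob, path = viterbi(["dog", "runs"], states, start_p, trans_p, emit_p)
# path is ["Noun", "Verb"]
```

The search keeps, for each state, only the single best path ending there, which is what makes it tractable compared to enumerating every path.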
Maximum Entropy
Entropy is a measure of the randomness of a system. Maximum entropy is
therefore the push towards the greatest randomness, subject to the given
constraints. Because of this, applying maximum entropy principles to a
problem produces the most general model possible under those constraints.
The idea behind maximum entropy models is that instead of trying to train a
model to simply emit the tokens from the training data, one can instead create
a set of boolean features, and then train a model to exhibit these features in
the same proportions in which they are found in the training data. Examples
of features for sentences would be whether the sentence mentions a date,
whether it is over five words long, whether it contains a quoted fragment,
et cetera.
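Boolean feature functions of the kind just described might be sketched in Python as follows; the regular expression and the thresholds are illustrative assumptions:

```python
# Hypothetical boolean feature functions over sentences.
import re

def has_date(sentence):
    # Crude illustrative check for a date written like "3/14/2000".
    return bool(re.search(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b", sentence))

def longer_than_five_words(sentence):
    return len(sentence.split()) > 5

def has_quoted_fragment(sentence):
    return sentence.count('"') >= 2

features = [has_date, longer_than_five_words, has_quoted_fragment]
s = 'On 3/14/2000 she said "hello" to the whole crowd.'
fired = [f(s) for f in features]
# fired is [True, True, True]
```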
In this case, the features are the constraints described above, and maximum
entropy tries to generate the most general model possible given the
specified features/constraints. The reason why training on features is
sometimes better than training on output tokens is that the features are
often more informative than the individual tokens, and with smaller training
sets it can be difficult to train a strong model from sparse data. A small
set of features, however, can be reliably estimated even from a small
dataset, and these denser statistics produce better training as a result.
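As a concrete toy illustration of the feature proportions the model is trained to match, here is how one might measure them over a tiny, made-up training set; the sentences and feature functions are assumptions for the example:

```python
# Measure what proportion of a (toy) training set exhibits each
# boolean feature -- the targets a maxent model is trained to match.

def longer_than_five_words(s):
    return len(s.split()) > 5

def contains_digit(s):
    return any(ch.isdigit() for ch in s)

training = [
    "The meeting is on 5 May",
    "See you soon",
    "This sentence has more than five words in it",
]

features = {"long": longer_than_five_words, "digit": contains_digit}
empirical = {name: sum(f(s) for s in training) / len(training)
             for name, f in features.items()}
# empirical is {"long": 2/3, "digit": 1/3}
```

Even with only three sentences, each feature proportion is estimated from every example, whereas per-token statistics would be spread much more thinly.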
How It's Done
So assuming that one has training data and a set of feature functions that
determine whether a token exhibits a certain feature or not, one can create
a Markov model with its initial transition probabilities set to arbitrary
values. These state transition probabilities are proportional to a product
of feature parameters, one for each feature exhibited by that state. The
current model is tested against the training data to determine the
likelihood of the different paths through the model. Since certain features
occur only with certain combinations of states and output tokens, the
probability of encountering a given feature under the current model can be
calculated as the probability of encountering those states that exhibit
that feature.
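The two ideas above can be sketched in a few lines of Python; the states, feature functions, and parameter values are hypothetical:

```python
# 1) A transition probability proportional to a product of feature
#    parameters, one per feature the (token, state) pair exhibits.
def trans_prob(obs, states, lam, feats):
    def score(s):
        p = 1.0
        for a, f in enumerate(feats):
            if f(obs, s):
                p *= lam[a]
        return p
    z = sum(score(s) for s in states)  # normalize over states
    return {s: score(s) / z for s in states}

feats = [lambda o, s: o == "May" and s == "date",
         lambda o, s: s == "other"]
lam = [4.0, 2.0]
probs = trans_prob("May", ["date", "other"], lam, feats)
# probs["date"] is 4/(4+2) = 2/3, probs["other"] is 1/3

# 2) The model's probability of a feature: the probability mass on the
#    states that exhibit it.
p_date_feature = sum(p for s, p in probs.items() if feats[0]("May", s))
# equals probs["date"], i.e. 2/3
```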
If there is a mismatch between a feature's proportion in the model and its
proportion in the training set, the parameter for that feature is adjusted
to compensate. If the feature occurs in greater proportion in the model
than in the training set, the transition probabilities are lowered for all
states that exhibit that feature. This is done by lowering the parameter
corresponding to that feature (remembering that the transition probability
for a state is proportional to the product of all its feature parameters).
Once all the feature parameters are updated, all the transition
probabilities should be improved, and the model is tested and adjusted
repeatedly until the features exhibited by the model converge to the
proportions found in the training set.
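The whole loop can be sketched in the style of Generalized Iterative Scaling (GIS), which uses exactly this kind of multiplicative parameter update. The data, the feature functions, and the constant C below are illustrative assumptions, not the paper's exact procedure:

```python
# GIS-style training sketch: match model feature proportions to the
# training set's by multiplicatively adjusting feature parameters.

STATES = ["A", "B"]
# Toy training data: (observed token, state that emitted it).
data = [("x", "A"), ("x", "A"), ("x", "B"),
        ("y", "B"), ("y", "B"), ("y", "A")]

# Boolean feature functions f(observation, state).
feats = [lambda o, s: o == "x" and s == "A",
         lambda o, s: o == "y" and s == "B"]
C = len(feats)            # GIS constant: at most C features fire at once
lam = [1.0] * len(feats)  # one multiplicative parameter per feature

def prob(o, s):
    """P(state s | observation o), proportional to the product of the
    parameters of the features that fire on (o, s)."""
    def score(t):
        p = 1.0
        for a, f in enumerate(feats):
            if f(o, t):
                p *= lam[a]
        return p
    return score(s) / sum(score(t) for t in STATES)

# Proportion of the training data exhibiting each feature.
emp = [sum(f(o, s) for o, s in data) / len(data) for f in feats]

for _ in range(200):
    # Expected proportion of each feature under the current model.
    mod = [sum(prob(o, t) * f(o, t) for o, _ in data for t in STATES)
           / len(data) for f in feats]
    # Raise a parameter when the training data exhibits its feature
    # more often than the model does; lower it when less often.
    lam = [p * (e / m) ** (1.0 / C) if m > 0 else p
           for p, e, m in zip(lam, emp, mod)]
```

After convergence the model reproduces the training proportions: here two of the three "x" examples pair with state "A", so prob("x", "A") settles near 2/3.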
This explanation is derived from my interpretation of:
"Maximum Entropy Markov Models for Information Extraction and Segmentation"
by Andrew McCallum, Dayne Freitag and Fernando Pereira, Proceedings of the
Seventeenth International Conference on Machine Learning (ICML 2000).
Adam Berger's brief maxent tutorial, http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/tutorial.html