THE BLAG.
1:
Research has shown that many aspects of human perception are
Bayesian. Humans accomplish many tasks the way a Bayesian system
would: that is, the mistakes humans make are similar, qualitatively
and quantitatively, to the mistakes the system makes.
By looking at the priors and the likelihoods of these Bayesian models,
we can learn something about the assumptions made by the brain. We
also get to answer the question "how does the brain work."
But here's a dumb question: why is the brain Bayesian? Is there a
special reason for it? Is the Bayesian nature of our brain encoded
in our DNA? And if so, why?
The answer, I claim, is "obvious". For any problem where we have some
inputs X and we try to get an output Y that minimizes some loss, the
optimal predictor is equivalent to a very specific Bayesian model. If
we assume that each observation is i.i.d., then the data can be
described by the world's distribution P(X,Y). That which we wish to
infer has a prior P(Y), and it is corrupted by a certain likelihood
P(X|Y) to produce X. In this setting, the posterior distribution
P(Y|X) contains all the information X has about Y. This is
"Bayesian". And here is our argument: since the optimal system is
Bayesian, by continuity, nearly-optimal systems are nearly-Bayesian.
In particular, if the "X" variable consists of images and audio
(eyes and ears), then our system will do the right Bayesian thing
regarding the fusion of the sources.
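As a minimal sketch of what "the right Bayesian thing" means for fusing two sources (all numbers below are assumed for illustration): take a latent Y with a standard normal prior and two conditionally independent noisy readings of it. The posterior combines the readings by precision-weighting, and we can check the closed form against a brute-force application of Bayes' rule on a grid.

```python
# Sketch of Bayesian sensor fusion (values assumed for illustration):
# Y ~ N(0,1), and two independent readings x1 = Y + noise(v1),
# x2 = Y + noise(v2). The posterior precision is the sum of the
# precisions; we verify the closed form against a grid computation.
import numpy as np

v1, v2 = 0.5, 2.0          # per-sensor noise variances (assumed)
x1, x2 = 1.0, 0.2          # the two observed readings (assumed)

# Closed form: precisions add, means are precision-weighted.
prec = 1.0 + 1.0 / v1 + 1.0 / v2
mean = (x1 / v1 + x2 / v2) / prec

# Brute force: discretize Y and apply Bayes' rule directly.
y = np.linspace(-6, 6, 20_001)
log_post = -0.5 * (y**2 + (x1 - y)**2 / v1 + (x2 - y)**2 / v2)
w = np.exp(log_post - log_post.max())
grid_mean = (y * w).sum() / w.sum()

print(mean, grid_mean)     # the two estimates agree
```

Note how the noisier sensor (v2) contributes less to the fused estimate; that automatic down-weighting is exactly the behavior observed in human cue-combination experiments.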
That's why it makes sense for our DNA to build Bayesian brains. But
since any optimal system is Bayesian, the brain could obtain "Bayesian
capabilities" by plain loss minimization. Specifically, if we observe
X and we wish to infer Y, and we have lots of training data (X,Y) from
the true distribution P, and if we learn a really good function X->Y,
then this function will exhibit all the right Bayesian properties. We
could even talk about its likelihood and posterior.
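The claim that plain loss minimization recovers the Bayesian answer can be checked in the simplest possible setting (all distributions and numbers below are assumed for illustration): Y drawn from a standard normal prior, X a noisy observation of Y. Under squared loss the Bayes-optimal predictor is the posterior mean E[Y|X] = X / (1 + s²), and a least-squares fit of Y from X on enough samples recovers the same shrinkage coefficient.

```python
# Sketch (assumed setup): Y ~ N(0,1) is the latent quantity,
# X = Y + N(0, s2) the noisy observation. The Bayes-optimal predictor
# under squared loss is E[Y|X] = X / (1 + s2). A plain least-squares
# fit of the function X -> Y recovers the same coefficient.
import numpy as np

rng = np.random.default_rng(0)
s2 = 0.5                                  # observation noise variance (assumed)
n = 100_000
y = rng.standard_normal(n)                # draws from the prior P(Y)
x = y + rng.normal(0.0, np.sqrt(s2), n)   # corrupted by the likelihood P(X|Y)

# Least-squares slope for predicting y from x (no intercept; both centered).
slope = (x @ y) / (x @ x)

bayes_slope = 1.0 / (1.0 + s2)            # coefficient of the posterior mean
print(slope, bayes_slope)                 # nearly identical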
2:
When should a learning system be Bayesian?
There is a very simple answer to that question. If there is lots of
data, there is little benefit to a Bayesian model -- a traditional
parametric model will do just as well, with less work.
A Bayesian model shines when we have a very small dataset (20 training
cases) and where we have quite concrete intuitions about how this
data is produced. If our intuitions are correct, the resulting
Bayesian model will make a very good use of this scarce data. Things
like medical trials where volunteers aren't plenty.
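A tiny sketch of that small-data regime, with all numbers assumed for illustration: a trial with 20 volunteers and 14 responders. The plain parametric estimate uses only the counts; a Beta prior encoding a concrete intuition ("treatments of this kind tend to respond around 50%") pulls the estimate toward that belief, which matters precisely because the dataset is so small.

```python
# Beta-Binomial sketch of the small-data regime (numbers assumed).
n, k = 20, 14                  # 20 training cases, 14 successes
a, b = 5.0, 5.0                # Beta(5, 5) prior centered at 0.5 (assumed)

mle = k / n                            # plain parametric estimate
post_mean = (a + k) / (a + b + n)      # Beta-Binomial posterior mean

print(mle, post_mean)   # 0.7 vs ~0.633: the prior tempers the scarce data
```

With 2,000 cases instead of 20, the two estimates would nearly coincide, which is the large-data point above: the prior stops mattering.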
But there is one large-data regime where Bayesian models will likely
be used in one way or another: tasks involving huge numbers of classes
will necessarily have very many sparse classes, which will require
powerful generalization using something more Bayesian. Like object
recognition. Some object categories are plain sparse and nothing can
be done about that.
Collaborative filtering is an example of such a situation: any model
will remain uncertain about rare users and rare movies, and similarly
for the sparse categories in large image collections.
3:
What do we mean when we say that a problem is "hard"?
We say that computer vision is hard. Calculating the partition
function in large, interesting models is also hard. Rumor has it
that proving P is not NP is hard as well (but I don't really
know--never tried). The game of Go is also hard.
What do we really mean by these claims?
There is only one meaning: many people have tried and failed so
far. Vision is hard only because a good vision system does not exist
yet. We often say that Go is hard "because of its large state space",
but that's only part of its hardness. The size of its state space is
relevant at all only because our best approaches, to a large extent,
brute-force their way to a good move, and such approaches don't like
large state spaces. There are other games with even larger state
spaces (for example, strategy computer games), but that alone
doesn't make them difficult.
There is one imprecision in many learning papers: computing the
partition function in a powerful model is hard not because it is
written as a sum of exponentially many terms, but because there is no
efficient algorithm that can compute it (and, in all likelihood, there
never will be one). In some models, this exponential sum can be
calculated using algorithmic trickery, but most models don't have this
luck. Indeed, we really don't care whether something is a sum of
exponentially many terms if there is an algorithm that can get the
answer without evaluating most of them.
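A concrete sketch of such trickery, using a deliberately easy model (the fields below are assumed values): for n independent spins with energies E(s) = -Σᵢ hᵢsᵢ, the partition function is a sum over 2ⁿ configurations, yet independence lets it factorize into a product of n two-term sums. The exponential sum and the factorized product agree exactly; hardness only appears when the terms couple and no such factorization exists.

```python
# Partition function of n independent spins (fields assumed):
# Z = sum over all 2^n states of exp(sum_i h_i * s_i), which
# factorizes as prod_i (e^{h_i} + e^{-h_i}) = prod_i 2*cosh(h_i).
import itertools, math

h = [0.3, -1.2, 0.7, 0.05]     # per-spin fields (assumed values)

# Brute force: enumerate all 2^n configurations.
Z_brute = sum(
    math.exp(sum(hi * si for hi, si in zip(h, s)))
    for s in itertools.product([-1, 1], repeat=len(h))
)

# Algorithmic trickery: independence lets the sum factorize.
Z_fact = math.prod(2 * math.cosh(hi) for hi in h)

print(Z_brute, Z_fact)   # identical up to floating point
```

Add pairwise couplings between the spins and the product form disappears; that is the regime where the exponential sum genuinely must be approximated.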
4:
Life's error rate.
-------------------------
Knowing, or predicting, the consequences of our actions is something
we cannot do (we aren't smart enough, but if we were, we'd be living
in a world populated with much smarter entities, so predicting the
consequences of our actions would be as impossible as it is now).
That's why experience is so valuable. Experience is the equivalent of
knowing the future before it happens, although admittedly in a limited
domain. An experienced person knows how things turn out in similar
situations, and this knowledge allows them to make much better
decisions. For example, after starting 2 businesses, starting the
third one is probably going to be easier due to the smaller
uncertainty. Uncertainty is really scary because as long as our brain
believes that there's a non-negligible chance of a complete disaster,
it's "irrational" to act, unless we are not afraid of complete
disasters.
So experience is useful, and it wouldn't hurt to get it as quickly as
possible. An easy way to ensure that we constantly grow our
experience, is by always making mistakes. By constantly making
mistakes, a person can be sure that they are doing things they are not
absolutely certain of. That's how we push our envelopes.
5:
The meaning of life. I claim to completely know the answer to this one
(it's simple too), but it won't fit in the margin.
6:
Personality.
We all know where babies come from but what about older people? How
can it be that adults often completely fail to understand and
sympathize with their teenage children? The only explanation is that
they've become different persons. Sure, the persons are similar, but
they are different enough. They have different experiences, different
desires, different ambitions, different tastes. Perhaps their
personalities too are different. So although an adult and their former
teenage-self have many things in common, their differences make them
into legitimately different persons.
So the answer is that adults "grow out" of teenagers who slowly vanish
into nonexistence. In particular, it implies that even if we lived
forever, our current self would be a temporary phenomenon that would
slowly fade into a different being with different views and tastes. If
we really lived forever, since the "set of distinct personalities" is
essentially finite, we would eventually "be" the same personality
any (every) number of times!
7:
Writing about stuff and having a diary.
There is a claim that writing helps us understand a subject. If we are
confused, we should write a clear essay about the topic, and chances
are we'll become less confused. It turns out a similar thing is true
of papers. If we write them, we really understand what it is that we
do and which additional experiments need to be done, in case we have
some uncertainty over it.