* One can speak of overfitting in an unsupervised setting if the description of the data found by the algorithm is actually more explicit than the data itself. By that I mean, e.g., an embedding into a higher-dimensional space, or a clustering of k elements into n > k groups. Likewise, a mere enumeration of the input data is an overfitted representation, if we assume that there is more structure to the data than simply "it's exactly this list of elements and nothing else." Maybe this can even be formalized in an information-theoretic setting, in the sense that the entropy (or description length) of the new encoding of the data should be less than that of the data itself.

* There also seems to be a problem of discrimination, i.e. if the found solution is too general, I'd probably also think of it as overfitted. E.g. if the output of some algorithm were simply the statement "there is data", I'd consider it a case of overfitting, because it has found a meaningless tautology. Maybe, in that sense, the problem can be framed in terms of the scientific method (or at least Popper's falsification criterion): if the output of an unsupervised algorithm does not lead to a falsifiable hypothesis, it is overfitted, either because it is too vague or too specific.
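The description-length idea in the first point can be sketched as a two-part (MDL-style) code: pay bits to describe the model, plus bits to describe the data given the model. This is only a toy illustration I made up — the `two_part_cost` helper and the integer-range "clusters" are invented for the example, not a standard algorithm:

```python
import math

def two_part_cost(data, clusters, alphabet=256):
    # Two-part MDL-style cost in bits: bits to describe the model
    # (each cluster as an integer range over the alphabet) plus, per
    # point, bits to name its cluster and its position inside the range.
    model_bits = 2 * len(clusters) * math.log2(alphabet)
    data_bits = 0.0
    for x in data:
        lo, hi = next((lo, hi) for (lo, hi) in clusters if lo <= x <= hi)
        data_bits += math.log2(len(clusters)) + math.log2(hi - lo + 1)
    return model_bits + data_bits

data = [3, 4, 5] * 4 + [200, 201, 202] * 4            # 24 bytes with obvious structure
raw = len(data) * math.log2(256)                      # mere enumeration: 192 bits
good = two_part_cost(data, [(3, 5), (200, 202)])      # two tight clusters
overfit = two_part_cost(data, [(x, x) for x in sorted(set(data))])  # one cluster per value
```

Here the two-cluster description compresses the data (cost below plain enumeration), while the one-cluster-per-value description — the analogue of memorizing the list — pays so much for the model that it loses the advantage, which matches the intuition that an over-explicit description is overfitted.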

On a side note: I actually think this process can often be observed in mathematics: sometimes people 'discover' structures or generalize concepts to a point where they are so general that they no longer lead to any meaningful theory (or only to little fruitful theory). Take, e.g., the definition of a stochastic process. Mathematically speaking, it is a perfectly fine definition in itself, but I don't know of any meaningful theorem that says anything about the family of all stochastic processes. Usually you need at least some additional structure, e.g. the Markov or martingale property. The same can probably be said about semigroups in algebra, or a ring (of sets) in measure theory. But that is probably a discussion for another day.

Many thanks for your comments. I agree that one can deal with uncertainty by marginalizing out certain hyperparameters, and it works well if you assume that you have a "correct" prior.

Maybe this can also be viewed as a bias-variance tradeoff. Without a prior, your solution would be unbiased but have high variance. On the other hand, incorporating the prior results in a biased solution with smaller variance. I think that is why one should be careful when choosing a prior, because the solution will be biased toward whatever prior you predefine.
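A quick simulation of this tradeoff in the standard conjugate-normal setup — the specific numbers (prior N(0, 1), noise sd 4, samples of size 5) are arbitrary choices for illustration:

```python
import random
import statistics

random.seed(0)
true_mu, sigma = 2.0, 4.0      # data-generating mean and (known) noise sd
prior_mu, tau = 0.0, 1.0       # prior on the mean: N(prior_mu, tau^2)
n = 5                          # small sample, so the prior has real influence

# Conjugate posterior mean = precision-weighted average of data and prior.
w = (n / sigma**2) / (n / sigma**2 + 1 / tau**2)

mle, bayes = [], []
for _ in range(5000):
    xbar = statistics.fmean(random.gauss(true_mu, sigma) for _ in range(n))
    mle.append(xbar)                             # unbiased, high variance
    bayes.append(w * xbar + (1 - w) * prior_mu)  # shrunk toward the prior

# Across repeated datasets the Bayes estimates spread far less, but their
# average sits well away from true_mu, pulled toward prior_mu.
```

Since the prior here disagrees with the truth (prior_mu = 0 vs. true_mu = 2), the variance reduction is bought with a visible bias toward the prior — the point about choosing priors carefully.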

Krik

You can interpret the prior as a way to regularise your learning problem. To me, it's simply more transparent to encode your assumptions about the problem in the form of probability distributions than in other ways. It is very helpful to have a common language for discussing the assumptions that went into a model. This is not always the case when your assumptions are encoded in the way you define your learning algorithm, regularisation, etc.
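As a concrete, well-known instance of prior-as-regulariser: a zero-mean Gaussian prior on a regression weight yields exactly an L2/ridge penalty with strength sigma²/tau². A minimal one-dimensional sketch (true slope 3; the noise and prior scales are arbitrary picks):

```python
import random

random.seed(1)
sigma, tau = 1.0, 0.5     # noise sd and prior sd on the weight (assumed)
xs = [i / 10 for i in range(1, 21)]
ys = [3.0 * x + random.gauss(0, sigma) for x in xs]

sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

w_mle = sxy / sxx                 # ordinary least squares, no prior
lam = sigma**2 / tau**2           # ridge penalty implied by the prior
w_map = sxy / (sxx + lam)         # MAP estimate under a N(0, tau^2) prior

# w_map is exactly the ridge solution with penalty lam: the Gaussian
# prior and the L2 regulariser are the same assumption, written twice.
```

The prior shrinks the estimate toward zero, just as the penalty term would — the two encodings of the assumption are interchangeable, but the probabilistic one states it explicitly.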

Firstly, let's assume that your prior is 'true': then your Bayesian analysis performs well, and you reach your goal of inferring the posterior.

But if you are not certain about your prior, you can put a hyperparameter on it, with its own distribution describing your uncertainty about the prior. Then you simply marginalise it out.
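A tiny Monte-Carlo sketch of what marginalising buys you — the particular normals are an invented example: put a hyperprior on an uncertain prior mean and integrate it out, which widens the effective prior from sd 1 to sd sqrt(1 + 4) = sqrt(5).

```python
import random
import statistics

random.seed(2)
N = 20000

# Prior with a fixed, "certain" mean: theta ~ N(0, 1).
fixed = [random.gauss(0.0, 1.0) for _ in range(N)]

# Uncertain prior mean: hyperprior m ~ N(0, 2^2), then theta | m ~ N(m, 1).
# Sampling m first and then theta marginalises m out by Monte Carlo.
marginal = []
for _ in range(N):
    m = random.gauss(0.0, 2.0)
    marginal.append(random.gauss(m, 1.0))

# The marginal prior is wider (sd near sqrt(5)): the extra spread is
# exactly the stated uncertainty about the prior itself.
```

So instead of committing to one possibly wrong prior, you get a single, honestly wider prior that carries your uncertainty about it.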

That's just off the top of my head, what do you think?

Greetings from London!

Tomek

I will be very happy to discuss it in detail if you have an idea. Thanks.
