Recently, I have been struggling to understand something, which is related to the notion of "overfitting". In learning theory, It usually refers to the situation in which one try to infer a general concept (e.g., regressor or classifier) from finite observations (aka data). The concept is said to overfit the observation if it performs well on the set of observations, but performs worse on the previously unseen observations. The ability to generalize to the future observation is known as a "generalization" ability of the concept we have inferred. In practice, we would prefer the concept that not only performs well on the current data, but also do so on the data we might observe in the future (we generally assume that the data come from the same distribution).

When I think of overfitting, it is unavoidable to refer to "generalisation". In fact, as we can see above, we give the definition of overfitting based on the generalization ability of the concept. The notion of overfitting is also closely related to the notion of ill-posedness in the inverse problem. The is in fact the motivation of the regularisation problem we often encounter in the machine learning.

In the past few weeks, I have to deal with estimation problem. In principle, it is different from regression or classification problem(and regularization problem as well). However, there seems to be a connection between estimation problem and regularization problem, that still puzzle me. In estimation theory, the problem is mostly unsupervised, so it is not clear how to define "overfitting" in this case. Can we look at overfitting based on something beyond generalisation?

So the question I would like to ask in this post is "what else can we see/consider as overfitting?" If you have good examples, please feel free to leave comments.

That is a very interesting question. Out of my head, two concepts arise:

* One can speak of overfitting in an unsupervised setting, if the description of the data found by the algorthim is actually more explicit than the data in the first place. By that, I mean e.g. an embedding in a higher dimensional space or clustering of k elements in n > k groups. Likewise, a mere enumeration of the input data is an overfitted representation, if we assume that there is more structure to the data than simply "it's exactly this list of elements and nothing else." Maybe this can be even formalized some more in a information theoretic setting, in a sense such that the entropy of the new code of the data is less than the current entropy or something like that.

* There also seems to be a problem of discrimination, i.e. if the found solution is to general I'd probably think of it also as overfitted. E.g. if the output of some algorithm would state that "there is data", I'd consider it a case of overfitting, because it found a meaningless tautology. Maybe, in that sense the problem can be thought of as the scientific method (or at least Popper's falsification theory): if the output of an unsupervised algorithm does not lead to a falsifiable hypothesis, it is overfitted, either because it is to vague or to specific.

On a side node: I actually think this process can often be observed in mathematics: Sometimes people 'discover' structures or generalize concepts to a point where it is so general, that it does not lead to any meaningful theory anymore (or only to little fruitful theory). Take e.g. the definition of a stochastic process. Mathematically speaking it is a perfectly fine definition in itself, but I don't know of any meaningful theorem that says anything about the family of all stochastic processes. Usually you need at least some additional structure e.g. Markov or Martingale or something like that. The same can probably be said about semigroups in algebra or a ring (of sets) in measure theory. But that is probably a discussion for another day.