Category Archives: Machine Learning

NIPS2016 Exhibition of Rejects

I finally have the courage to restart my blog after a long, long time. This time, I want to advertise a new experiment for Neural Information Processing Systems (NIPS 2016) known as the "exhibition of rejects", which aims to highlight some of the papers rejected from NIPS 2016, in a spirit similar to the Salon des Refusés. Bob Williamson has kindly agreed to host the exhibition. The list of papers is available at

The Palais de l'Industrie, where the exhibition took place (source: Wikipedia)

Also, I have to say that it is my pleasure to serve as one of the workflow managers for NIPS 2016. I'll try to write about my experience with this in another post.


Causality in Machine Learning

Early this week, Bernhard and I started to discuss the future direction of our research. It is quite difficult to decide because, on the one hand, I think there are still a number of open questions along the lines of kernel mean embedding and its applications. Kernel methods have become one of the most important tools in machine learning, and I am certain they will remain so. On the other hand, this might be a good opportunity to learn something new.

One of the possibilities that we discussed is "causal inference". Causal inference has been one of the main research directions of our department in Tuebingen (see for people who work in this area and their contributions). I have to admit that this topic is new to me. I have very little knowledge of causal inference, which is why I am quite excited about it.

In a sense, the goal of causal inference is rather different from that of standard statistical inference. In statistical inference, given random variables X and Y, the goal is to discover association patterns between them, as encoded in the joint distribution P(X,Y). Causal inference, on the other hand, aims to discover the causal relations between X and Y, i.e., whether X causes Y, Y causes X, or there is a common cause of X and Y. Since revealing a causal relation normally involves an intervention on one of the variables, it is not obvious how to do so with non-experimental data. Moreover, there is an issue of identifiability, i.e., several causal models could have generated the same P(X,Y). As a result, certain assumptions about the model are necessary.
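To convince myself of the identifiability issue, here is a minimal sketch using a toy linear-Gaussian model. The model, the function names and the parameter values are all my own assumptions for illustration; they are not taken from any particular paper. A zero-mean Gaussian joint distribution is fully determined by its covariance, so if "X causes Y" and "Y causes X" give the same covariance, the two models produce the same P(X,Y).

```python
import math


def cov_x_causes_y(a, noise_var):
    """(Var X, Cov XY, Var Y) when X ~ N(0, 1) and Y = a*X + N(0, noise_var)."""
    var_x = 1.0
    var_y = a * a * var_x + noise_var
    cov_xy = a * var_x
    return (var_x, cov_xy, var_y)


def cov_y_causes_x(b, var_y, noise_var):
    """(Var X, Cov XY, Var Y) when Y ~ N(0, var_y) and X = b*Y + N(0, noise_var)."""
    var_x = b * b * var_y + noise_var
    cov_xy = b * var_y
    return (var_x, cov_xy, var_y)


def reverse_model(a, noise_var):
    """Parameters of the 'Y causes X' model that mimics the 'X causes Y' model."""
    var_y = a * a + noise_var
    b = a / var_y                   # regression coefficient of X on Y
    rev_noise = noise_var / var_y   # residual variance of X given Y
    return b, var_y, rev_noise


# Hypothetical parameter values, chosen only for the demonstration.
forward = cov_x_causes_y(2.0, 0.5)
backward = cov_y_causes_x(*reverse_model(2.0, 0.5))
same_joint = all(math.isclose(f, g) for f, g in zip(forward, backward))
print("forward :", forward)
print("backward:", backward)
print("same joint distribution:", same_joint)
```

In this toy case the two causal directions cannot be told apart from P(X,Y) alone, which is exactly why extra assumptions about the model are needed.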

Supernova Classification

It has been a long week. We had the NIPS deadline on Friday. Fortunately, we managed to submit the papers. Let's keep our fingers crossed! I was very fortunate to receive very constructive comments on my NIPS paper from Ross Fedely. David Hogg also gave last-minute comments which helped improve the paper further (and special thanks go to Bernhard for that). Here is a picture of us working toward the deadline:

2013-05-31 16.48.36

After submitting the papers, we hung out in the park with many people, including Rob Fergus, to celebrate our submissions.

Right. The main topic of this post is supernova classification. Early this week, Bernhard and I had a quick meeting with astronomy people from CCPP (Center for Cosmology and Particle Physics). They are working on the problem of supernova classification (identifying the type of a supernova from its spectrum), and are interested in applying machine learning techniques to solve this problem. Briefly, the main challenge is that a supernova itself changes over time. That is, it can belong to a different type depending on when it is observed. Another challenge is that the dataset is small, usually on the order of a hundred objects.

According to Wikipedia, a supernova is an energetic explosion of a star. The explosion can be triggered either by the reignition of nuclear fusion in a degenerate star or by the collapse of the core of a massive star. Either way, a massive amount of energy is generated. Interestingly, the expanding shock waves of supernova explosions can trigger the formation of new stars.

Supernovae are important in cosmology because the maximum intensities of their explosions can be used as "standard candles". Briefly, they help astronomers measure astronomical distances.

One of the previous works used the correlation between an object's spectrum and a set of templates to identify its type. I will read the paper over the weekend and see if we can build something better than simple correlation.
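Before reading the paper, here is a minimal sketch of what I imagine a correlation-based template classifier looks like. Everything in it is my own assumption: the function names, the labels "type_A" and "type_B", and the made-up flux values are hypothetical and are not taken from the paper or from real spectra. It simply picks the template with the highest Pearson correlation with the observed spectrum.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    du = [x - mean_u for x in u]
    dv = [x - mean_v for x in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return num / den if den > 0 else 0.0


def classify_by_template(spectrum, templates):
    """Return the best-matching label and the correlation score per template."""
    scores = {label: pearson(spectrum, t) for label, t in templates.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical templates on a common wavelength grid (made-up numbers).
templates = {
    "type_A": [1.0, 0.8, 0.3, 0.7, 1.0],
    "type_B": [0.2, 0.5, 1.0, 0.5, 0.2],
}
observed = [0.9, 0.7, 0.4, 0.8, 1.1]

label, scores = classify_by_template(observed, templates)
print(label, scores)
```

A sketch like this ignores the two challenges mentioned above: it has no notion of when the object was observed, and it does not learn anything from the (small) labelled dataset.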


Recently, I have been struggling to understand something related to the notion of "overfitting". In learning theory, it usually refers to the situation in which one tries to infer a general concept (e.g., a regressor or classifier) from finite observations (aka data). The concept is said to overfit the observations if it performs well on the set of observations, but performs worse on previously unseen observations. The ability to perform well on future observations is known as the "generalization" ability of the concept we have inferred. In practice, we would prefer a concept that not only performs well on the current data, but also does so on the data we might observe in the future (we generally assume that the data come from the same distribution).
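As a concrete illustration of this definition, here is a minimal sketch on synthetic data. The data-generating process (a noisy line), the two predictors and all names are assumptions I made up for the example. One predictor simply memorizes the training set (it returns the label of the nearest training point), while the other fits a straight line by least squares.

```python
import random


def make_data(n, rng, noise=0.5):
    """Draw n points from the assumed model y = 2x + Gaussian noise."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least-squares fit of a straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def fit_memorizer(xs, ys):
    """Predict the label of the nearest training point (memorizes the data)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(predict, xs, ys):
    """Mean squared error of a predictor on a dataset."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
train_x, train_y = make_data(30, rng)
test_x, test_y = make_data(1000, rng)

line = fit_line(train_x, train_y)
memo = fit_memorizer(train_x, train_y)

print("line      train/test MSE:", mse(line, train_x, train_y), mse(line, test_x, test_y))
print("memorizer train/test MSE:", mse(memo, train_x, train_y), mse(memo, test_x, test_y))
```

The memorizer reproduces the training labels exactly, so its training error is zero, yet its error on fresh data from the same distribution is clearly larger. That gap between observed and unseen performance is what the definition above calls overfitting.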

When I think of overfitting, it is unavoidable to refer to "generalization". In fact, as we can see above, the definition of overfitting is based on the generalization ability of the concept. The notion of overfitting is also closely related to the notion of ill-posedness in inverse problems. This is in fact the motivation for the regularization problems we often encounter in machine learning.

In the past few weeks, I have had to deal with an estimation problem. In principle, it is different from a regression or classification problem (and from a regularization problem as well). However, there seems to be a connection between estimation problems and regularization problems that still puzzles me. In estimation theory, the problem is mostly unsupervised, so it is not clear how to define "overfitting" in this case. Can we look at overfitting based on something beyond generalization?

So the question I would like to ask in this post is "what else can we see/consider as overfitting?" If you have good examples, please feel free to leave comments.

Does incorporating prior cause additional uncertainty?

I have recently been thinking about a question that I had a long time ago. This question arose during a discussion at the astro-imaging workshop in Switzerland. If I remember correctly, the discussion revolved around two schools of thought on how to model astronomical images. The Frequentist school of thought was primarily supported by Stefan Harmeling. Christian Schule and Rob Fergus, on the other hand, represented the Bayesian school of thought. Other people who took part in the discussion included Bernhard Schölkopf, David Hogg, Dillip Khrisnan, Michael Hirsch, etc.

In short, the story goes like this:

Stefan believed that he could solve the problem by directly formulating an objective function and optimizing it. The message here, as I understood it, was to avoid any prior knowledge. I believe there is more to his point of view on this problem, but for the sake of brevity, I will skip it as it is not the main topic of this post.

On the other hand, Christian and Rob had a slightly different point of view. They believed one should incorporate "prior information". They pointed out that the prior is key to modelling astronomical images. Similarly, there is more to the story, but I will skip it.

As an observer, I agreed with all three of them. Using only Stefan's objective function, I think he could find a reasonably good solution. Similarly, Christian and Rob might be able to find a better solution with a "reasonably right" prior. The question is: which approach should I use?

This question essentially arises before you actually solve the problem. Christian and Rob may have a good prior which can possibly help obtain better solutions than Stefan's approach. But as an observer who does not know anything about the prior, it seems that I need to deal with another source of uncertainty: is the prior actually a good one? The claim that the prior leads to a better solution may no longer hold if one has a bad prior.
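To make the good-prior versus bad-prior concern concrete, here is a minimal sketch in a much simpler setting than astronomical imaging: estimating the mean of a Gaussian with known variance. The setting, the names and the numbers are all my own assumptions. The prior-free estimator is the sample mean, and the prior-based estimator is the posterior mean under a Gaussian prior N(prior_mean, prior_var). For both, the expected squared error at a given true mean has a closed form.

```python
def risk_sample_mean(noise_var, n):
    """Expected squared error of the sample mean of n observations."""
    return noise_var / n


def risk_posterior_mean(true_mean, prior_mean, prior_var, noise_var, n):
    """Expected squared error of the Gaussian posterior mean at true_mean.

    The posterior mean is w * sample_mean + (1 - w) * prior_mean, so its
    risk is a variance term plus a squared-bias term.
    """
    w = n * prior_var / (n * prior_var + noise_var)
    variance = w * w * noise_var / n
    bias_sq = ((1.0 - w) * (prior_mean - true_mean)) ** 2
    return variance + bias_sq


# Hypothetical numbers, chosen only for the demonstration.
true_mean, noise_var, n, prior_var = 0.0, 1.0, 5, 0.5

no_prior = risk_sample_mean(noise_var, n)
good_prior = risk_posterior_mean(true_mean, 0.0, prior_var, noise_var, n)
bad_prior = risk_posterior_mean(true_mean, 3.0, prior_var, noise_var, n)

print("no prior  :", no_prior)
print("good prior:", good_prior)
print("bad prior :", bad_prior)
```

In this toy case a prior centred near the truth beats the prior-free estimator, while a badly centred prior is worse than using no prior at all. Since the true mean is unknown in practice, one cannot tell in advance which of the two cases applies.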

In summary, I would like to know the answer to the following questions:

  1. Does incorporating a prior actually cause more uncertainty about the problem we are trying to solve?
  2. If so, is it then harder to solve a problem with a prior than without one?
  3. Statistically speaking, how do most statisticians deal with this uncertainty?

Feel free to leave a comment if you have one. Thanks.