NIPS2016 Exhibition of Rejects

I finally have the courage to restart my blog after a long, long time. This time, I want to advertise a new experiment for Neural Information Processing Systems (NIPS 2016) known as the "exhibition of rejects", which aims to highlight some of the NIPS2016 rejected papers in a spirit similar to the Salon des Refusés. Bob Williamson has very kindly agreed to host the exhibition. The list of papers is available at

The Palais de l'Industrie where the exhibition took place (source : Wikipedia)


Also, it is my pleasure to serve as one of the workflow managers for NIPS2016. I'll try to write about my experience in another post.



On Thursday and Friday, Bernhard, Rob, and I discussed one of Rob's projects on exoplanet detection. He has been working on it for a while, and we will see if the current technique can be improved using more sophisticated machine learning techniques.

An extrasolar planet, or exoplanet, is basically a planet outside the Solar System. As far as I understand, the ultimate goal is to discover Earth 2.0: extrasolar planets that orbit in the habitable zone, where it is possible for liquid water to exist on the surface. The detection of exoplanets itself is very difficult, let alone extracting the molecular composition of the planets, because planets are extremely faint compared to their parent stars.

On Friday morning, I also met Ralf Herbrich, who is currently a director of machine learning science at Amazon. We didn't talk much, but I guess I will meet him again at UAI2013.

This basically concludes my trip to NYC. I will be at ICML (Atlanta, Georgia) next week and looking forward to meeting many renowned machine learning people.

Causality in Machine Learning

Early this week, Bernhard and I started discussing the future direction of our research. It is quite difficult to decide because, on the one hand, I think there are still a number of open questions along the lines of kernel mean embedding and its applications. Kernel methods have become one of the most important tools in machine learning, and I am certain they will remain so. On the other hand, this might be a good opportunity to learn something new.

One of the possibilities we discussed is "causal inference". Causal inference has been one of the main research directions of our department in Tuebingen (see our department page for the people who work in this area and their contributions). I have to admit that this topic is new to me. I have very little knowledge about causal inference, which is why I am quite excited about it.

In a sense, the goal of causal inference is rather different from that of standard statistical inference. In statistical inference, given random variables X and Y, the goal is to discover association patterns between them encoded in the joint distribution P(X,Y). Causal inference, on the other hand, aims to discover the causal relation between X and Y, i.e., whether X causes Y, Y causes X, or there is a common cause of X and Y. Since revealing a causal relation involves an intervention on one of the variables, it is not trivial to do so with non-experimental data. Moreover, there is an issue of identifiability: several causal models could have generated the same P(X,Y). As a result, certain assumptions about the model are necessary.
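The identifiability issue is easy to see in the linear-Gaussian case: a model in which X causes Y and a model in which Y causes X can induce exactly the same joint distribution. Here is a minimal simulation (my own toy example, not taken from any particular paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Model A: X causes Y, with Y = 0.5 * X + noise
x_a = rng.normal(0.0, 1.0, n)
y_a = 0.5 * x_a + rng.normal(0.0, np.sqrt(0.75), n)

# Model B: Y causes X, with X = 0.5 * Y + noise
y_b = rng.normal(0.0, 1.0, n)
x_b = 0.5 * y_b + rng.normal(0.0, np.sqrt(0.75), n)

# Both models induce (approximately) the same joint distribution:
# a bivariate Gaussian with unit variances and correlation 0.5
cov_a = np.cov(x_a, y_a)
cov_b = np.cov(x_b, y_b)
print(cov_a)
print(cov_b)
```

No amount of observational data can distinguish the two models here; this is exactly why assumptions such as non-Gaussian noise or nonlinear mechanisms are needed.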

Supernova Classification

It has been a long week. We had the NIPS deadline on Friday. Fortunately, we managed to submit the papers. Let's keep our fingers crossed! I was very fortunate to receive very constructive comments on my NIPS paper from Ross Fadely. David Hogg also gave last-minute comments which helped improve the paper further (and special thanks go to Bernhard for that). Here is a picture of us working toward the deadline:


After submitting the papers, we hung out with many people, including Rob Fergus, in the park to celebrate our submissions.

Right. The main topic of this post is supernova classification. Early this week, Bernhard and I had a quick meeting with astronomers from CCPP (Center for Cosmology and Particle Physics). They are working on the problem of supernova classification (identifying the type of a supernova from its spectrum) and are interested in applying machine learning techniques to this problem. Briefly, the main challenge is that the supernova itself changes over time; it can appear to belong to a different type depending on when it is observed. Another challenge is that the datasets are small, usually on the order of hundreds of examples.

According to Wikipedia, a supernova is an energetic explosion of a star. The explosion can be triggered either by the reignition of nuclear fusion in a degenerate star or by the collapse of the core of a massive star. Either way, a massive amount of energy is released. Interestingly, the expanding shock waves of supernova explosions can trigger the formation of new stars.

Supernovae are important in cosmology because the peak intensities of their explosions can be used as "standard candles", which help astronomers measure astronomical distances.

One previous work used the correlation between an object's spectrum and a set of templates to identify its type. I will read the paper over the weekend and see if we can build something better than simple correlation.
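To make the idea concrete, here is a minimal sketch of template-based classification. The template shapes and labels below are made up for illustration; a real pipeline would also have to handle redshift, calibration, and the observation epoch:

```python
import numpy as np

def classify_by_template(spectrum, templates):
    """Return the label of the template with the highest
    Pearson correlation to the observed spectrum."""
    s = (spectrum - spectrum.mean()) / spectrum.std()
    scores = {}
    for label, template in templates.items():
        t = (template - template.mean()) / template.std()
        scores[label] = float(np.dot(s, t) / len(s))
    return max(scores, key=scores.get), scores

# Toy "template spectra": two distinct smooth shapes (made up)
grid = np.linspace(0.0, 1.0, 200)
templates = {
    "Type Ia": np.exp(-((grid - 0.3) ** 2) / 0.01),
    "Type II": np.exp(-((grid - 0.7) ** 2) / 0.01),
}

# A noisy observation generated from the "Type Ia" template
rng = np.random.default_rng(0)
observed = templates["Type Ia"] + rng.normal(0.0, 0.1, grid.size)
label, scores = classify_by_template(observed, templates)
print(label)
```

The classifier simply picks the template whose correlation with the observation is largest, which is essentially what "simple correlation" means here.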

ML trip to New York

Washington Square Park

Hi readers, I am now in New York, the city that never sleeps and one of the cities where many great minds in science reside (many great machine learners also live here).

It is good to be here. I will take this opportunity to interact with people working in different fields, such as astrophysics, particle physics, and computer vision, and hopefully learn something new.

The primary goal of this trip is to visit my advisor, Prof. Bernhard Schölkopf, who is visiting NYU for three months, and to finish our NIPS paper. Another goal is to continue a collaboration with David Hogg and Jo Bovy on quasar target selection and see if we can extend the collaboration in another direction.

The week started on Monday, when Dustin Lang from CMU visited David and Bernhard for three days to work on image denoising. I am very impressed by how much work they got done in three days. Bernhard also told me about the idea of inferring the CCD sensitivity from image patches, which I find very interesting. Dustin also took us to Etsy, the company where one of his friends works. It's a website for selling handmade goods. We had a quick tour inside the company, and the office has quite a relaxed atmosphere.

While everyone was busy, I tried my best to finish the first draft of our NIPS paper. It's now in its final shape.

Think Coffee

On Friday, we hung out with Will Freeman from MIT. I met Will at the astro-imaging workshop in Switzerland. We spent the whole morning together with Bernhard, David, Rob, Ross, etc., discussing random stuff. Will then gave a talk in the afternoon about his work on image/motion amplification. It's very cool stuff.

I am now looking forward to another exciting week.

ICML 2013

It is very exciting to see so many interesting papers at ICML this year (see the list of accepted papers). It is also good to see that several papers are co-authored by AGBS members.

This year, I have been involved in two ICML papers, both of which are in the area of kernel methods and transfer learning. The first paper is

Domain Generalization via Invariant Feature Representation
K. Muandet (MPI-IS), D. Balduzzi (ETH Zurich), and B. Schoelkopf (MPI-IS)

As opposed to domain adaptation, where one usually assumes that data from the target domain are available during training, domain generalization drops that assumption: information is collected from several source domains during training, and the learned model must generalize to an unseen target domain at test time. The paper is already available online (see the link above).
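The paper learns an invariant feature representation; as a much cruder stand-in, here is a toy sketch of the leave-one-domain-out setting (entirely my own construction, not the paper's algorithm), where simply discarding a domain-varying feature plays the role of an invariant representation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Each "domain" shares the class-discriminative structure in feature 0,
# but shifts feature 1 by a domain-specific amount (a nuisance).
def make_domain(shift, n=200):
    y = rng.integers(0, 2, n)
    x = rng.normal(0.0, 1.0, (n, 2))
    x[:, 0] += np.where(y == 1, 2.0, -2.0)  # invariant signal
    x[:, 1] += shift                        # domain-specific nuisance
    return x, y

domains = [make_domain(s) for s in (-5.0, 0.0, 5.0)]

# Leave-one-domain-out evaluation: train on two domains, test on the
# held-out one, using only the invariant feature.
accuracies = []
for held_out in range(3):
    train = [d for i, d in enumerate(domains) if i != held_out]
    x_tr = np.concatenate([x[:, 0] for x, _ in train])
    y_tr = np.concatenate([y for _, y in train])
    x_te, y_te = domains[held_out]
    mu0, mu1 = x_tr[y_tr == 0].mean(), x_tr[y_tr == 1].mean()
    # Nearest-class-mean classifier on the invariant feature
    pred = (np.abs(x_te[:, 0] - mu1) < np.abs(x_te[:, 0] - mu0)).astype(int)
    accuracies.append(float((pred == y_te).mean()))
print(accuracies)
```

A classifier trained on the raw features would be thrown off by the domain-specific shift; restricting attention to what is stable across source domains is the basic intuition behind domain generalization.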

The second paper is

Domain Adaptation under Target and Conditional Shift
K. Zhang (MPI-IS), B. Schoelkopf (MPI-IS), K. Muandet (MPI-IS), and Z. Wang (MPI-IS)

This work investigates the domain adaptation problem when the conditional distribution also changes, as opposed to the previous setting in which only the marginal distribution can change. We make use of insights from causality to solve this problem. The paper will be available soon.



Recently, I have been struggling to understand something related to the notion of "overfitting". In learning theory, overfitting usually refers to the situation in which one tries to infer a general concept (e.g., a regressor or classifier) from finite observations (aka data). The concept is said to overfit the observations if it performs well on the set of observations but poorly on previously unseen ones. The ability to perform well on future observations is known as the "generalization" ability of the inferred concept. In practice, we prefer a concept that not only performs well on the current data but also does so on data we might observe in the future (we generally assume that the data come from the same distribution).
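The textbook illustration is polynomial regression: a high-degree polynomial can drive the training error to essentially zero while the test error explodes. A small sketch (degrees, sample sizes, and noise level chosen arbitrarily):

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# 15 noisy observations of an underlying smooth function
x_train = np.linspace(0.0, 1.0, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.3, x_train.size)
x_test = np.linspace(0.0, 1.0, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0.0, 0.3, x_test.size)

def train_test_mse(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    p = Polynomial.fit(x_train, y_train, degree)
    train_mse = float(np.mean((p(x_train) - y_train) ** 2))
    test_mse = float(np.mean((p(x_test) - y_test) ** 2))
    return train_mse, test_mse

tr_mod, te_mod = train_test_mse(3)     # moderate capacity
tr_over, te_over = train_test_mse(14)  # can interpolate all 15 points
print(tr_mod, te_mod)
print(tr_over, te_over)
```

The degree-14 polynomial fits the noise in the training set almost perfectly, which is exactly what hurts it on the unseen test points.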

When I think of overfitting, it is hard not to think of "generalization". In fact, as we can see above, we defined overfitting in terms of the generalization ability of the concept. The notion of overfitting is also closely related to the notion of ill-posedness in inverse problems. This is in fact the motivation for the regularization techniques we often encounter in machine learning.

In the past few weeks, I have had to deal with an estimation problem. In principle, it is different from regression or classification problems (and regularization problems as well). However, there seems to be a connection between estimation and regularization that still puzzles me. In estimation theory, the problem is mostly unsupervised, so it is not clear how to define "overfitting" in this case. Can we look at overfitting based on something beyond generalization?

So the question I would like to ask in this post is: "what else can we see/consider as overfitting?" If you have good examples, please feel free to leave comments.

Japan Visit

I am currently visiting Prof. Kenji Fukumizu at the Institute of Statistical Mathematics in Tokyo, Japan, where I will be spending most of my time working. Since I arrived last week, Kenji and I have already produced some interesting results in our joint work on kernel mean embeddings of distributions. Hopefully, I can keep myself consistently productive over the next few weeks.

Apart from work, I also had some trips to Tachikawa and downtown Tokyo, despite the fact that the weather was not on my side. These included a trip to Shinjuku (walking around the area and enjoying the Japanese lifestyle), Asakusa, and the Tokyo Skytree. The weather is getting better this week, so I hope to have a wonderful trip this weekend.

Although this post is not really about machine learning, I will keep posting about what I learn while I am here.

Does incorporating a prior cause additional uncertainty?

I have recently been thinking about a question that I first had a long time ago. The question arose during a discussion at the astro-imaging workshop in Switzerland. If I remember correctly, the discussion split along two schools of thought on how to model astronomical images. The frequentist school was primarily represented by Stefan Harmeling. Christian Schuler and Rob Fergus, on the other hand, represented the Bayesian school. Other people at the discussion included Bernhard Schölkopf, David Hogg, Dilip Krishnan, Michael Hirsch, etc.

In short, the story goes like this:

Stefan believed that he could solve the problem by directly formulating an objective function and optimizing it. The message, as I understood it, was to avoid any prior knowledge. I believe there is more to his point of view, but for the sake of brevity I will skip it, as it is not the main topic here.

Christian and Rob, on the other hand, had a slightly different point of view. They believed one should incorporate "prior information", and pointed out that the prior is key to modelling astronomical images. Again, there is more to the story, but I will skip it.

As an observer, I agreed with all three of them. Using only his objective function, I think Stefan could find a reasonably good solution. Similarly, Christian and Rob might be able to find a better solution with a "reasonably right" prior. The question is: which approach should I use?

This question essentially arises before you actually solve the problem. Christian and Rob may have a good prior that helps them obtain better solutions than Stefan's approach would. But as an observer who knows nothing about the prior, it seems that I need to deal with another source of uncertainty: is the prior actually a good one? The statement above may no longer hold if one has a bad prior.

In summary, I would like to know the answer to the following questions:

  1. Does incorporating a prior actually cause more uncertainty about the problem we are trying to solve?
  2. If so, is it then harder to solve a problem with a prior as opposed to without one?
  3. Statistically speaking, how do most statisticians deal with this uncertainty?

Feel free to leave comments if you have one. Thanks.

SVM, SMM, and the kernel trick

Today I gave a talk (and led an informal discussion) on the fundamental concepts of the support vector machine, the support measure machine, and the kernel trick at the Center for Cosmology and Particle Physics (CCPP), NYU. Most of the audience were astronomers who knew very little about SVMs and kernel methods, but they seemed to grasp the concepts very quickly. I am quite impressed.

The highlight of the day was the discussion on the benefits of the kernel trick. David Hogg (NYU), who organized the talk for me, pointed out many interesting insights into kernels and how one can apply this technique in astronomy. In fact, he seemed very excited about the idea of the kernel trick. David also pointed out that the distance metric we used in the SMM for quasar target selection looks like the chi-square distance, which is nice because this arises naturally when comparing two Gaussian distributions. Moreover, Jonathan Goodman (NYU), who is a mathematician, gave some insights into kernel functions and Mercer's theorem. He was also curious about the difference between an SMM on distributions and an SVM on infinitely many samples drawn from those distributions, which was one of the most fundamental questions we addressed in our NIPS2012 paper.
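For readers new to kernels on distributions: the basic building block is the expected kernel K(P, Q) = E_{x~P, y~Q}[k(x, y)], which has a closed form when both the input distributions and the kernel are Gaussian. A quick one-dimensional sanity check (the means, variances, and bandwidth below are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(x, y, sigma2):
    """Gaussian RBF kernel with bandwidth sigma2."""
    return np.exp(-((x - y) ** 2) / (2.0 * sigma2))

def expected_rbf_gaussians(mu1, s1, mu2, s2, sigma2):
    """Closed-form E[k(x, y)] for x ~ N(mu1, s1^2), y ~ N(mu2, s2^2)."""
    v = sigma2 + s1**2 + s2**2
    return np.sqrt(sigma2 / v) * np.exp(-((mu1 - mu2) ** 2) / (2.0 * v))

# Monte Carlo estimate of the same quantity from independent samples
n = 200_000
x = rng.normal(0.0, 1.0, n)   # samples from P = N(0, 1)
y = rng.normal(1.0, 0.5, n)   # samples from Q = N(1, 0.5^2)
mc = float(rbf(x, y, 1.0).mean())
cf = float(expected_rbf_gaussians(0.0, 1.0, 1.0, 0.5, 1.0))
print(mc, cf)
```

This agreement between the closed form and the sample average is the intuition behind the SMM-versus-infinite-sample-SVM question: the kernel on distributions is the limit of averaging the kernel over samples.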

At the end of the talk, I explained very briefly how one can use the SMM for quasar target selection. Quasar target selection is essentially a classification problem in which one is interested in detecting quasars, which look very much like stars.