Distribution Output Learning

My current research interest is in developing machine learning techniques for probability distributions. That is, instead of using samples as training data, we have probability distributions as training samples (these probability distributions may be seen as random measures drawn from some unknown distribution). In support measure machines (SMMs), the training samples are (\mathbb{P}_1,y_1),\ldots,(\mathbb{P}_n,y_n), where \mathbb{P}_i denotes a probability distribution over some input space \mathcal{X}. In the simplest case, y_i\in\{-1,+1\}, i.e., a classification problem whose inputs are probability distributions.
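
To give a feeling for how one can compute with distributions at all, here is a small sketch (not the actual SMM implementation) of estimating the inner product between the kernel mean embeddings of two distributions from samples. The Gaussian RBF kernel, the bandwidth, and the synthetic sample sizes are all arbitrary choices for illustration:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian RBF kernel matrix between the rows of X and Y."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

def mean_embedding_kernel(Xi, Xj, gamma=1.0):
    """Empirical estimate of <mu_Pi, mu_Pj> = E_{x~Pi, x'~Pj} k(x, x'),
    where the distributions Pi, Pj are represented by samples Xi, Xj."""
    return rbf_kernel(Xi, Xj, gamma).mean()

rng = np.random.default_rng(0)
P1 = rng.normal(0.0, 1.0, size=(200, 2))   # samples from one distribution
P2 = rng.normal(0.0, 1.0, size=(200, 2))   # samples from the same law
P3 = rng.normal(3.0, 1.0, size=(200, 2))   # samples from a shifted law

# Distributions with the same law have nearly identical embeddings, so
# <mu_P1, mu_P2> is larger than <mu_P1, mu_P3> for the shifted distribution.
print(mean_embedding_kernel(P1, P2) > mean_embedding_kernel(P1, P3))  # True
```

Once distributions are represented this way, standard kernel machines can operate on the resulting Gram matrix directly.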

Today, I will talk briefly about one of my ideas on learning from probability distributions, called "distribution output learning". As the name suggests, we consider the learning problem in which the output is a probability distribution. That is, the training sample in this case is (x_1,\mathbb{P}_1),\ldots,(x_n,\mathbb{P}_n), where x_i\in\mathcal{X} and \mathbb{P}_i is a probability distribution defined over some output space \mathcal{Y}. Note that the input space \mathcal{X} may itself be a space of probability distributions, but to simplify the problem we will focus on \mathcal{X}=\mathbb{R}^d.

Why is it useful to have such a framework? Are there any applications that support this idea? These are important questions we need to answer before putting real effort into constructing the learning algorithm. As motivation, consider the following examples.

  1. Preference prediction -- one may look at a "preference" as a probability distribution (or positive measure) over a set of objects (either discrete or continuous, depending on how you represent these objects). I will call it the "preference distribution". If one object is preferred over another, the probability associated with that object will be relatively higher. Therefore, in recommender systems, we can view the set of products purchased by a customer as draws from the preference distribution. Given the purchase histories of several customers, one may want to construct an "algorithm" that can predict the "preference" of a new customer, so that products can be recommended according to the predicted preference. Note that in this case we are predicting the "distribution".
  2. Multi-class prediction -- multi-class classification is a very important problem in machine learning, and there has been much research in this direction. Generally speaking, the main aim of multi-class classification is to estimate \mathbb{P}(Y|X=x) given a measurement x. The conditional probability \mathbb{P}(Y|X=x) is a distribution over \mathcal{Y} and varies as the measurement x changes. Therefore, one can view this problem as a "distribution output learning" problem.
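
To make the second example concrete, here is a minimal sketch on made-up synthetic data that estimates \mathbb{P}(Y|X=x) as the label histogram of the k nearest training points. It is a crude estimator, but a genuine instance of predicting a distribution rather than a single label:

```python
import numpy as np

def knn_label_distribution(X, y, x_query, k=10, n_classes=3):
    """Estimate P(Y | X = x_query) by the label histogram of the
    k nearest training points: the prediction is a distribution."""
    d = np.linalg.norm(X - x_query, axis=1)
    nearest = np.argsort(d)[:k]
    hist = np.bincount(y[nearest], minlength=n_classes)
    return hist / hist.sum()

rng = np.random.default_rng(1)
# three well-separated classes in R^2 (synthetic, for illustration only)
X = np.concatenate([rng.normal(c, 0.3, size=(50, 2)) for c in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 50)

p = knn_label_distribution(X, y, np.array([0.0, 0.0]))
print(p)  # a probability vector concentrated on class 0
```

A query between two class centers would instead yield a distribution with mass split across the neighboring classes, which is exactly the kind of output the framework above is about.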

Although I have a rough idea on how to perform a prediction algorithmically, there are some theoretical questions that I would like to investigate further such as:

  1. How do we define a distribution over the space of probability measures (or measures in general)? This may seem trivial at first glance, but there are some technical issues here that need to be investigated further.
  2. What characterizes the universal kernels for distribution output learning?
  3. What does the generalization bound look like?

This framework is of course closely related to "structured output learning". Many researchers have paid attention to structured output learning in the past few years, and they have proposed many different approaches to this problem. In other words, this will keep me busy for a while.

Thanks for reading,

Support Measure Machines

Our paper has been accepted at NIPS this year.

Learning from Distributions via Support Measure Machines (Spotlight)
K. Muandet, K. Fukumizu, F. Dinuzzo, B. Schoelkopf

Abstract This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in a straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analysis of SMMs provides several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (Flex-SVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework.

This is joint work with Kenji Fukumizu (The Institute for Statistical Mathematics, Japan), Francesco Dinuzzo (MPI-IS), and my supervisor Prof. Bernhard Schoelkopf (MPI-IS). Parts of this work were done while Kenji was visiting us in summer 2011.

The arXiv manuscript can be found here. It is not the most up-to-date version, but it will give a basic idea of the work.

Generalized Kernel Trick

If you are a machine learner working on something related to kernel methods, I am sure you are familiar with the so-called kernel trick, which is fundamental to most kernel-based learning machines. The equation below gives a formal definition of the kernel trick:

 \langle\phi(x),\phi(y)\rangle_{\mathcal{H}} = k(x,y)

That is, the inner product between the feature maps \phi(x) and \phi(y) can be written in terms of some positive semidefinite function k. This allows one to replace the inner product with a kernel evaluation, so there is no need to compute \phi(x) explicitly. Similar to the standard kernel trick, the generalized version can be written as

 \langle\mathcal{T}\phi(x),\phi(y)\rangle_{\mathcal{H}} = [\mathcal{T}k(x,\cdot)](y)

where \mathcal{T} is an operator in \mathcal{L}(\mathcal{H}). Note that the generalized kernel trick reduces to the standard kernel trick when \mathcal{T}=\mathcal{I}, where \mathcal{I} is the identity operator. Kadri et al. (2012) showed that this trick holds for any implicit mapping \phi of a Mercer kernel and any self-adjoint operator \mathcal{T}. This trick is particularly useful when deriving learning algorithms for structured output learning.
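Both identities can be checked numerically in a finite-dimensional RKHS. The sketch below uses the polynomial kernel k(x,y) = (x . y)^2 on R^2, whose explicit feature map is three-dimensional, and a hand-picked self-adjoint operator of the form T f = a f + b <f, k(z,.)> k(z,.) (identity plus a rank-one term); the points x, y, z and the coefficients a, b are arbitrary:

```python
import numpy as np

def phi(x):
    """Explicit feature map of the polynomial kernel k(x, y) = (x . y)^2 on R^2."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2.0) * x1 * x2])

def k(x, y):
    return float(np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
z = np.array([0.5, 1.5])

# Standard kernel trick: <phi(x), phi(y)> equals k(x, y).
standard_lhs = float(phi(x) @ phi(y))
standard_rhs = k(x, y)

# A self-adjoint operator T f = a f + b <f, k(z, .)> k(z, .);
# in the feature space it is represented by the symmetric matrix:
a, b = 2.0, 0.7
T = a * np.eye(3) + b * np.outer(phi(z), phi(z))

# Generalized kernel trick: <T phi(x), phi(y)> equals [T k(x, .)](y).
# The right-hand side needs only kernel evaluations, since
# T k(x, .) = a k(x, .) + b k(z, x) k(z, .).
general_lhs = float((T @ phi(x)) @ phi(y))
general_rhs = a * k(x, y) + b * k(z, x) * k(z, y)

print(np.isclose(standard_lhs, standard_rhs))  # True
print(np.isclose(general_lhs, general_rhs))    # True
```

The rank-one operator is chosen precisely because its action on k(x,.) can be written with kernel evaluations alone, which is what makes the right-hand side of the generalized trick computable without the feature map.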


Baryon Acoustic Oscillations

I have been working on the quasar target selection problem for a while. Essentially, this is a classification problem in which one wants to identify objects in the sky as quasars or stars based on their flux measurements. The problem is easy in the low-redshift range because there is a clear separation between quasars and stellar objects, but in the medium- and high-redshift ranges, quasar target selection becomes more difficult. For z>2.2, objects must be targeted down to g=22 mag, where the photometric measurement uncertainty becomes substantial. Moreover, at z = 2.8, the quasar and stellar loci cross in color space.
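
As a toy illustration of what "classification in color space" means here, the sketch below separates two synthetic classes with a nearest-centroid rule. The class-conditional color means and spreads are invented for illustration, not real survey values, and a rule this simple is exactly what breaks down where the loci cross:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical two-dimensional colors (e.g. u-g, g-r) for two classes;
# the means and scatter below are made up for illustration.
stars   = rng.normal([1.2, 0.5], 0.2, size=(300, 2))
quasars = rng.normal([0.3, 0.1], 0.2, size=(300, 2))
X = np.vstack([stars, quasars])
y = np.array([0] * 300 + [1] * 300)          # 0 = star, 1 = quasar

# Nearest-centroid rule: assign each object to the closer class mean.
centroids = np.array([X[y == c].mean(axis=0) for c in (0, 1)])

def classify(colors):
    d = np.linalg.norm(colors[None, :] - centroids, axis=1)
    return int(np.argmin(d))

print(classify(np.array([0.25, 0.05])))  # 1: quasar-like colors
```

When the two loci overlap, the centroid distances become nearly equal and the hard decision is essentially arbitrary, which is why the medium- and high-redshift regimes call for more careful probabilistic methods.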

Despite the challenges of the problem itself, it is very important to me to understand why such distant objects are worth detecting at all. So I did some research and came up with a simple explanation.

Shortly after the Big Bang, the cosmic plasma composed of photons and baryons was excited by initial perturbations. Initially, radiation pressure keeps the photon-baryon plasma tightly coupled. Each perturbation launches a sound wave that moves outward until the Universe becomes neutral at redshift ~1000. Once the Universe has cooled enough, protons capture electrons to form neutral hydrogen, which also decouples the photons from the baryons. Photons continue to stream away, leading to the dramatic acoustic oscillations seen in cosmic microwave background anisotropy data. The baryons, on the other hand, remain in place, leaving the baryon peak stalled at about 150 comoving Mpc. This causes a small excess in the number of pairs of galaxies separated by that distance. These features are often referred to as baryon acoustic oscillations (BAO). BAO connect the growth of cosmic structure with the overall expansion of the Universe: observing them helps cosmologists measure the expansion history of the Universe and thereby probe cosmic dark energy.

In principle, BAO can also be observed in all forms of cosmic structure, including the distribution of the intergalactic medium as probed by the Lyman-alpha forest (LAF). The LAF can be seen in the spectra of high-redshift quasars. To detect BAO in the LAF, one may cross-correlate absorption spectra in widely separated quasar pairs. This has previously been impossible due to a lack of sufficient data. Therefore, detecting a sufficiently large number of high-redshift quasars becomes substantially important.
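
The cross-correlation idea itself is simple to sketch. The toy example below fabricates two "absorption spectra" that share a common fluctuation (standing in for correlated intergalactic structure) plus one uncorrelated sightline; all lengths and amplitudes are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy absorption spectra along three sightlines: the first two share a
# common fluctuation, the third is independent noise.
shared = rng.normal(size=500)
spec_a = shared + 0.5 * rng.normal(size=500)
spec_b = shared + 0.5 * rng.normal(size=500)
spec_c = rng.normal(size=500)

def xcorr(u, v):
    """Normalized zero-lag cross-correlation of two mean-subtracted spectra."""
    u = u - u.mean()
    v = v - v.mean()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(xcorr(spec_a, spec_b))  # substantially positive: correlated sightlines
print(xcorr(spec_a, spec_c))  # near zero: independent sightlines
```

The real measurement is of course far more involved (correlation as a function of 3-d separation, continuum fitting, noise weighting), but the statistic at its core is this kind of cross-correlation, which is why large numbers of quasar pairs are needed.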

After working in this direction for a while, I have a feeling that machine learning in astronomy has not been explored much. There might be some open problems that one can tackle from a machine learning point of view. I also drew inspiration from a talk by David Hogg.

Empirical Inference Journal Club

I have now been with the Department of Empirical Inference at the Max Planck Institute for Intelligent Systems for a year. I've learnt and experienced a lot during the course of my PhD.

Apart from getting my own research projects done, I have always been enthusiastic about learning new things and broadening and deepening my knowledge. So I've been reading widely on many different topics, ranging from econometrics to cosmology. Understanding the theoretical aspects of machine learning is, I think, very important, but understanding its role in real-world applications is even more important.

Reading lots of papers, of course, already gives me a big picture of where machine learning stands in the scientific community. However, it lacks social context: I would also like to know what other people think about it.

So I have recently set up a journal club called the Empirical Inference Journal Club, with the strong hope that it will provide such a platform for students and postdocs in the department to share their knowledge on particular topics related to empirical inference. People in the department have organized reading groups on different topics before, but to my knowledge they ran for a short period of time and then stopped.

I am committed to keeping this journal club running. Of course, it means some extra work, but I think it's worthwhile. After three weeks of the journal club, things seem to be going smoothly. I hope more people will join and contribute to it.

We always have two options: accepting things the way they are or having enough courage to change them.

Hello world!

After many attempts at starting an academic blog, I have finally succeeded.

Primarily, I will try to write regularly about my ongoing work and the ideas that I have during the day. I hope it will be somewhat helpful both to myself and to other people reading my blog.