# Distribution Output Learning

My current research interest is in developing machine learning techniques for probability distributions. That is, instead of using individual data points as training examples, we use probability distributions as training examples (these distributions may be seen as random measures drawn from some unknown distribution). In support measure machines (SMMs), the training sample is $(\mathbb{P}_1,y_1),\ldots,(\mathbb{P}_n,y_n)$, where $\mathbb{P}_i$ denotes a probability distribution over some input space $\mathcal{X}$. In the simplest case, $y_i\in\{-1,+1\}$, i.e., a classification problem where the input is a probability distribution.

Today, I will talk briefly about one of my ideas on learning from probability distributions, called "distribution output learning". As its name suggests, we consider the learning problem in which the output is a probability distribution. That is, the training sample in this case is $(x_1,\mathbb{P}_1),\ldots,(x_n,\mathbb{P}_n)$, where $x_i\in\mathcal{X}$ and $\mathbb{P}_i$ is a probability distribution defined over some output space $\mathcal{Y}$. Note that the input space $\mathcal{X}$ may itself be a space of probability distributions, but to simplify the problem we will focus on $\mathcal{X}=\mathbb{R}^d$.

Why is it useful to have such a framework? Are there any applications that support this idea? These are important questions to answer before putting effort into constructing a learning algorithm. To give some motivation, consider the following examples.

1. Preference prediction -- one may look at a "preference" as a probability distribution (or positive measure) over a set of objects (either discrete or continuous, depending on how these objects are represented). I will call it the "preference distribution". If one object is preferred over another, the probability associated with it will be relatively higher. Therefore, in a recommendation system, we can view the set of products purchased by a customer as draws from that customer's preference distribution. Given the purchase histories of several customers, one may want to construct an algorithm that predicts the preference of a new customer, so that products can be recommended according to the predicted preference. Note that in this case we are predicting a distribution.
2. Multi-class prediction -- multi-class classification is a very important problem in machine learning, and there has been a lot of research in this direction. Generally speaking, the main aim of multi-class classification is to estimate $\mathbb{P}(Y|X=x)$ given a measurement $x$. The conditional probability $\mathbb{P}(Y|X=x)$ is a distribution over $\mathcal{Y}$ that varies as the measurement $x$ changes. Therefore, one can look at this problem as a "distribution output learning" problem.
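
To make the setting concrete, here is a minimal Python sketch for the case where $\mathcal{Y}$ is a finite set of $K$ elements, so that each $\mathbb{P}_i$ is a probability vector. It uses a kernel-weighted average (a Nadaraya-Watson smoother) of the training distributions. This is only an illustration under my own assumptions, not the algorithm I have in mind: the function names, the Gaussian kernel, the bandwidth, and the toy data are all hypothetical.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between rows of a (m, d) and rows of b (n, d)."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def predict_distribution(x_train, p_train, x_new, bandwidth=1.0):
    """Kernel-weighted average of the training distributions.

    x_train: (n, d) inputs.
    p_train: (n, K) array whose rows are probability vectors over K outcomes.
    x_new:   (m, d) query inputs.
    Returns an (m, K) array of predicted probability vectors.
    """
    weights = gaussian_kernel(x_new, x_train, bandwidth)  # (m, n), nonnegative
    weights /= weights.sum(axis=1, keepdims=True)         # each row sums to 1
    return weights @ p_train                              # convex combination


# Hypothetical toy data: three inputs on the real line, three outcomes.
x_train = np.array([[0.0], [1.0], [2.0]])
p_train = np.array([[0.8, 0.1, 0.1],
                    [0.2, 0.6, 0.2],
                    [0.1, 0.1, 0.8]])
x_new = np.array([[0.0], [1.5]])
p_hat = predict_distribution(x_train, p_train, x_new, bandwidth=0.5)
print(p_hat)
```

Because each prediction is a convex combination of the training distributions, it is automatically nonnegative and sums to one. A plain least-squares regression onto probability vectors would not guarantee this, which hints at why the structure of the output space matters.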

Although I have a rough idea of how to perform prediction algorithmically, there are some theoretical questions that I would like to investigate further, such as:

1. How do we define a distribution over the space of probability measures (or measures in general)? This may seem trivial at first glance, but there are some technical issues here that need to be investigated further.
2. What characterizes the universal kernels for distribution output learning?
3. What does the generalization bound look like?

This framework is of course closely related to "structured output learning". Many researchers have paid attention to structured output learning in the past few years, and they have proposed many different approaches to tackle it. In short, this will keep me busy for a while.

Krikamol

# Support Measure Machines

We have a paper at NIPS this year.

Learning from Distributions via Support Measure Machines (Spotlight)
K. Muandet, K. Fukumizu, F. Dinuzzo, B. Schoelkopf

Abstract: This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provide several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (Flex-SVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework.
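
To illustrate the mean embedding idea from the abstract, here is a minimal Python sketch. It assumes each distribution is available only through a finite sample, and it estimates the inner product of two mean embeddings by averaging a Gaussian kernel over all pairs of points. The function names, the kernel choice, the bandwidth, and the toy data are my assumptions for illustration; this is not the code used in the paper.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between rows of a (m, d) and rows of b (n, d)."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def mean_embedding_kernel(sample_p, sample_q, bandwidth=1.0):
    """Estimate <mu_P, mu_Q> in the RKHS as the average of k(x, z) over all pairs."""
    return gaussian_kernel(sample_p, sample_q, bandwidth).mean()


def gram_matrix(samples, bandwidth=1.0):
    """Gram matrix between distributions, each given as an (m_i, d) sample."""
    n = len(samples)
    gram = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            value = mean_embedding_kernel(samples[i], samples[j], bandwidth)
            gram[i, j] = gram[j, i] = value
    return gram


# Hypothetical toy data: four Gaussian samples, two centred at -2 and two at +2.
rng = np.random.default_rng(0)
samples = [rng.normal(loc=m, scale=1.0, size=(50, 2)) for m in (-2.0, -2.0, 2.0, 2.0)]
gram = gram_matrix(samples)
print(np.round(gram, 3))
```

The resulting matrix is symmetric and positive semidefinite, so it can be handed to any kernel method that accepts a precomputed Gram matrix, for example scikit-learn's `SVC(kernel="precomputed")`. Taking the inner product of the embeddings directly, as done here, is only the simplest possible kernel between distributions.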

This is joint work with Kenji Fukumizu (The Institute for Statistical Mathematics, Japan), Francesco Dinuzzo (MPI-IS), and my supervisor Prof. Bernhard Schoelkopf (MPI-IS). Parts of this work were done while Kenji was visiting us in summer 2011.

The arXiv manuscript can be found here. It is not the most up-to-date version, but it gives the basic idea of this work.