# Distribution Output Learning

My current research interest is in developing machine learning techniques for probability distributions. That is, instead of using samples as training data, we have probability distributions as training samples (these probability distributions may be seen as random measures drawn from some unknown distribution). In support measure machines (SMMs), the training sample is $(\mathbb{P}_1,y_1),\ldots,(\mathbb{P}_n,y_n)$ where $\mathbb{P}_i$ denotes a probability distribution over some input space $\mathcal{X}$. In the simplest case, $y_i\in\{-1,+1\}$, i.e., a classification problem where the input is a probability distribution.

Today, I will talk briefly about one of my ideas on learning from probability distributions called "distribution output learning". As its name suggests, we consider the learning problem in which the output is a probability distribution. That is, the training sample in this case is $(x_1,\mathbb{P}_1),\ldots,(x_n,\mathbb{P}_n)$ where $x_i\in\mathcal{X}$ and $\mathbb{P}_i$ is a probability distribution defined over some output space $\mathcal{Y}$. Note that the input space $\mathcal{X}$ may as well be a space of probability distributions, but to simplify the problem we will focus on $\mathcal{X}=\mathbb{R}^d$.
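
One way to make this setup concrete (a sketch under assumptions, not a settled formulation) is to pick a discrepancy $D$ between probability distributions on $\mathcal{Y}$ and to look for a map from inputs to distributions that fits the training sample. The symbols $D$, $\mathcal{F}$, $\hat{f}$, and $\mathcal{P}(\mathcal{Y})$ (the set of probability distributions on $\mathcal{Y}$) are introduced here purely for illustration, and the choice of $D$ and of the hypothesis class $\mathcal{F}$ is left open:

$$\hat{f} \in \arg\min_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} D\big(f(x_i),\mathbb{P}_i\big), \qquad f:\mathcal{X}\to\mathcal{P}(\mathcal{Y}).$$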

Why is it useful to have such a framework? Are there any applications that support this idea? These are important questions to answer before putting real effort into constructing a learning algorithm. To give some motivation, consider the following examples.

1. Preference prediction -- one may look at a "preference" as a probability distribution (or positive measure) over a set of objects (either discrete or continuous, depending on how you represent these objects). I will call it the "preference distribution". If one object is preferred over another, the probability associated with that object will be relatively higher. Therefore, in a recommendation system, we can look at the set of products purchased by a customer as draws from that customer's preference distribution. Given the purchase histories of several customers, one may want to construct an "algorithm" that can predict the "preference" of a new customer so that products can be recommended according to the predicted preference. Note that in this case we are predicting the "distribution".
2. Multi-class prediction -- multi-class classification is a very important problem in machine learning, and there has been a great deal of research in this direction. Generally speaking, the main aim of multi-class classification is to estimate $\mathbb{P}(Y|X=x)$ given a measurement $x$. The conditional probability $\mathbb{P}(Y|X=x)$ is a distribution over $\mathcal{Y}$ that varies as the measurement $x$ changes. Therefore, one can look at this problem as a "distribution output learning" problem.
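
To illustrate what "predicting a distribution" can look like when $\mathcal{Y}$ is a finite set (as in both examples above), here is a minimal Python sketch. It is not the algorithm proposed in this post: it swaps in a simple kernel-weighted average (a Nadaraya-Watson style smoother) as a stand-in. The function names `gaussian_kernel` and `predict_distribution`, the `bandwidth` parameter, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between rows of a (m, d) and rows of b (n, d)."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def predict_distribution(x_train, p_train, x_query, bandwidth=1.0):
    """Predict a distribution at each query point (illustrative stand-in).

    x_train: (n, d) inputs x_i.
    p_train: (n, k) array; row i is the distribution P_i over k outcomes.
    x_query: (m, d) new inputs.
    Returns an (m, k) array. Each row is a convex combination of the rows
    of p_train, so it is non-negative and sums to one.
    """
    weights = gaussian_kernel(x_query, x_train, bandwidth)
    # Normalize the weights per query point. This assumes each query is
    # close enough to the training data that the weights do not underflow.
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ p_train

# Synthetic data (an assumption for the demo): P_i is the softmax of a
# linear function of x_i, over k = 3 outcomes.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(50, 2))
logits = x_train @ np.array([[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]])
p_train = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

p_hat = predict_distribution(x_train, p_train, np.zeros((1, 2)))
print(p_hat)
```

Averaging is convenient here because the output is automatically a valid probability vector. For continuous $\mathcal{Y}$, or for richer hypothesis classes, that guarantee no longer comes for free.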

Although I have a rough idea of how to perform the prediction algorithmically, there are some theoretical questions that I would like to investigate further, such as:

1. How do we define a distribution over the space of probability measures (or measures in general)? This may seem trivial at first glance, but there are some technical issues here that need to be investigated further.
2. What characterizes the universal kernels for distribution output learning?
3. What does the generalization bound look like?

This framework is of course closely related to "structured output learning". Many researchers have paid attention to structured output learning in the past few years, and they have proposed many different approaches to tackle the problem. In other words, this will keep me busy for a while.