natural language processing blog: 2013/04

Sparsity has been all the rage for a couple of years now. The standard notion of "sparse" vector u is that the number of non-zeros in u is small. This is simply the l_0 norm of u, ||u||_0. This norm is well studied, known to be non-convex, and often relaxed to the l_1 norm of u, ||u||_1: the sum of absolute values. (Which has the nice property of being the "tightest" convex approximation to l_0.)

In some circumstances, it might not be that most of u is zero, but simply that most of u is some fixed scalar constant a. The "non-constant" norm of u would be something like "the number of components that are not equal to some a" or, in more mathy terms, "min_a ||u-a||_0". I first came across it this in the Differentiable sparse coding paper by Bradley and Bagnell (NIPS 2008), where they claim physicists call it "Maximum entropy on the mean" if you're using a KL distance, but I don't know if there's a name for it in the l_0 sense, so I'l call it "l_0 from the mode" which seems to capture the right notion.

In many cases, we want to use a sparse norm to define a notion of a sparse distance function between two vectors u and v. For instance d_0(u,v) = ||u-v||_0 would be a reasonable notion of a sparse difference between two vectors.

Now, let's suppose that u and v are probability distributions over a finite event space, so that u_i is the probability of event i. In this case, both u and v belong to the simplex: u_i >= 0 for all i, and sum_i u_i = 1 (which is equivalent to saying ||u||_1 = 1, given the first condition).

It seems natural to ask ourselves whether there is a reasonable notion of sparse difference between two probability distributions. This is something I've thought about, but not come up with a satistfactory answer. The "natural" thing would be to define d_0(p,q) = ||p-q||_0, where I've switched from u/v to p/q to emphasize that these are distributions. This is certainly reasonable in some sense, but does not correspond to how I usually think of probability distributions, essentially because it ignores the "normalization" effect.

As a simple example, let's suppose that I generate a data set by rolling a dice with sides A-F. Suppose it comes up A five times, B twice, C once and F twice. Then my maximum likelihood estimate for p would be [.5, .2, .1, 0, 0, .2]. If I had a different dataset, in which exactly the same thing happened except I got one roll of D in addition to all the other rolls, I would estimate the distribution here as q = [.45, .18, .09, .09, 0, .18], which differs everywhere except the "E" position. The d_0 distance between these two distributions would be 4, which intuitively seems "wrong.

One possible way around this would be to allow us to rescale the distributions before computing the l_0 distance, something like d(p,q) = min_(a,b > 1) || ap - bq ||_0. When talking about l_0 as the underlying norm, this can be simplified to min_(0 < a < 1) || ap - (1-a)q ||_0. Note that the inequalities on "a" are necessary. With equalities, you could "erase" all of p (or q) by setting a to zero (or 1), which is definitely not desirable. Fortunately, the optimal "a" can be computed in linear time in the dimensionality: it's just the value that maximizes the number of i for which ap_i = (1-a)q_i. This definitlon captures the example above and gives a distance of "1" (corresponding to a = mode(q / (p+q)) = 10/21 = 0.4797).

We can attempt to make this definition more formal by invoking the notion of exponential families. Recall that an expfam distribution has the form log p(x;t) = t*f(x) - A(t) + B(x), where t are the natural parameters, f is the vector of sufficient statistics, * denotes dot product, and A() is the log partition function. A quick observation is that depending on how f looks, there might be lots of t that give rise to the same probability distribution. If we guarantee that the fs are always linearly independent, then this representation is called "minimal"; otherwise it is "overcomplete." (Formally, it is overcomplete if there exists a non-zero vector a for which a*f(x) is constant for all x.)

For a categorical distribution, an overcomplete representation would be a six dimensional indicator of which side of the die came up, and the natural parameters would be the (log of the) ps and qs shown above (it's overcomplete because one component is completely determined by the other 5). For example, the example p above would have exponential form with t=log[.5/.8, .2/.8, .1/.8, 0, 0] and log partition function A(t) = log 1.8. (See wikipedia's variant 3 for categorical distributions.) and q would have t=log[.45/.82, .18/.82, .09/.82, .09/.82, 0] with A(t) = log 1.82.

Supposing that we're working with a minimal representation, what I think this amounts to is that we want the difference in natural parameters to be sparse in the "l_0 from the mode" sense -- the difference is equal to a constant. In the above example, the difference in sufficient statistics is equal to 0.1301 in the first three components, -Infinity in the fourth, and something like zero in the last (who knows -- NaN?) which gives a distance of one (if you discard the last one).

The problem here is that there is not a unique minimal representation; I could also have chosen t=log[.2/.5, .1/.5, 0, 0, .2/.5] and A=log 1.5 for p and t=log[.18/.55 .09/.55 .09/.55 0/.55 .18/.55] and A=log 1.55 for q. In this case, the mode of the difference is 0.2007 (in the first, second and last components) there's one -Infinity and one NaN. So the answer works out the same, at least if you know how to deal with the NaNs.

There's an obviously needed lemma here: that any minimal representation will give you the same distance. I haven't thought nearly long enough about why this should be true (i.e., I've thought about if for about 30 seconds :P). There's also a lot of other questions: what if you relax this notion of sparsity to something more l_1-like? What other properties does it have? And of course is there a better notion of sparse difference of probability distributions for this or other circumstances?

natural language processing blog

30 April 2013

What is a sparse difference in probability distributions?