Sparsity has been all the rage for a couple of years now. The standard
notion of "sparse" vector u is that the number of non-zeros in u is
small. This is simply the l_0 norm of u, ||u||_0. This norm is well
studied, known to be non-convex, and often relaxed to the l_1 norm of
u, ||u||_1: the sum of absolute values. (Which has the nice property
of being the "tightest" convex approximation to l_0.)

In some circumstances, it might not be that most of u is zero, but
simply that most of u is some fixed scalar constant a. The
"non-constant" norm of u would be something like "the number of
components that are not equal to some a" or, in more mathy terms,
"min_a ||u-a||_0". I first came across it this in the Differentiable
sparse coding paper by Bradley and Bagnell (NIPS 2008), where they
claim physicists call it "Maximum entropy on the mean" if you're using
a KL distance, but I don't know if there's a name for it in the l_0
sense, so I'l call it "l_0 from the mode" which seems to capture the
right notion.

In many cases, we want to use a sparse norm to define a notion of a
sparse distance function between two vectors u and v. For instance
d_0(u,v) = ||u-v||_0 would be a reasonable notion of a sparse
difference between two vectors.

Now, let's suppose that u and v are probability distributions over a
finite event space, so that u_i is the probability of event i. In this
case, both u and v belong to the simplex: u_i >= 0 for all i, and
sum_i u_i = 1 (which is equivalent to saying ||u||_1 = 1, given the
first condition).

It seems natural to ask ourselves whether there is a reasonable notion
of sparse difference between two probability distributions. This is
something I've thought about, but not come up with a satistfactory
answer. The "natural" thing would be to define d_0(p,q) = ||p-q||_0,
where I've switched from u/v to p/q to emphasize that these are
distributions. This is certainly reasonable in some sense, but does
not correspond to how I usually think of probability distributions,
essentially because it ignores the "normalization" effect.

As a simple example, let's suppose that I generate a data set by
rolling a dice with sides A-F. Suppose it comes up A five times, B
twice, C once and F twice. Then my maximum likelihood estimate for p
would be [.5, .2, .1, 0, 0, .2]. If I had a different dataset, in
which exactly the same thing happened except I got one roll of D in
addition to all the other rolls, I would estimate the distribution
here as q = [.45, .18, .09, .09, 0, .18], which differs everywhere
except the "E" position. The d_0 distance between these two
distributions would be 4, which intuitively seems "wrong.

One possible way around this would be to allow us to rescale the
distributions before computing the l_0 distance, something like d(p,q)
= min_(a,b > 1) || ap - bq ||_0. When talking about l_0 as the
underlying norm, this can be simplified to min_(0 < a < 1) || ap -
(1-a)q ||_0. Note that the inequalities on "a" are necessary. With
equalities, you could "erase" all of p (or q) by setting a to zero (or
1), which is definitely not desirable. Fortunately, the optimal "a"
can be computed in linear time in the dimensionality: it's just the
value that maximizes the number of i for which ap_i = (1-a)q_i. This
definitlon captures the example above and gives a distance of "1"
(corresponding to a = mode(q / (p+q)) = 10/21 = 0.4797).

We can attempt to make this definition more formal by invoking the
notion of exponential families. Recall that an expfam distribution has
the form log p(x;t) = t*f(x) - A(t) + B(x), where t are the natural
parameters, f is the vector of sufficient statistics, * denotes dot
product, and A() is the log partition function. A quick observation is
that depending on how f looks, there might be lots of t that give rise
to the same probability distribution. If we guarantee that the fs are
always linearly independent, then this representation is called
"minimal"; otherwise it is "overcomplete." (Formally, it is
overcomplete if there exists a non-zero vector a for which a*f(x) is
constant for all x.)

For a categorical distribution, an overcomplete representation would
be a six dimensional indicator of which side of the die came up, and
the natural parameters would be the (log of the) ps and qs shown above
(it's overcomplete because one component is completely determined by
the other 5). For example, the example p above would have exponential
form with t=log[.5/.8, .2/.8, .1/.8, 0, 0] and log partition function
A(t) = log 1.8. (See wikipedia's
variant 3 for categorical distributions.) and q would have
t=log[.45/.82, .18/.82, .09/.82, .09/.82, 0] with A(t) = log 1.82.

Supposing that we're working with a minimal representation, what I
think this amounts to is that we want the difference in natural
parameters to be sparse in the "l_0 from the mode" sense -- the
difference is equal to a constant. In the above example, the
difference in sufficient statistics is equal to 0.1301 in the first
three components, -Infinity in the fourth, and something like zero in
the last (who knows -- NaN?) which gives a distance of one (if you
discard the last one).

The problem here is that there is not a unique minimal representation;
I could also have chosen t=log[.2/.5, .1/.5, 0, 0, .2/.5] and A=log
1.5 for p and t=log[.18/.55 .09/.55 .09/.55 0/.55 .18/.55] and A=log
1.55 for q. In this case, the mode of the difference is 0.2007 (in the
first, second and last components) there's one -Infinity and one
NaN. So the answer works out the same, at least if you know how to
deal with the NaNs.

There's an obviously needed lemma here: that any minimal
representation will give you the same distance. I haven't thought
nearly long enough about why this should be true (i.e., I've thought
about if for about 30 seconds :P). There's also a lot of other
questions: what if you relax this notion of sparsity to something more
l_1-like? What other properties does it have? And of course is there a
better notion of sparse difference of probability distributions for this or other circumstances?

Some Westlake quotes

9 hours ago