12 December 2011

It's that magical time of year...

By which I mean NIPS, and the attendant exhaustion of 14-hour days.  (P.s., if you're at NIPS, see the meta-comment I have at the end of this post, even if you skip the main body :P.)

Today I went to two tutorials: one by Yee Whye Teh and Peter Orbanz (who is starting shortly at Columbia) on non-parametric Bayesian stuff, and one by Naftali Tishby on information theory in learning and control.  They were streamed online; I'm sure the videos will show up at some point on some web page, but I can't find them right now (incidentally, and shamelessly, I think NAACL should have video tutorials, too -- actually my dear friend Chris wants that too, and since Kevin Knight has already promised that ACL approves all my blog posts, I suppose I can only additionally promise that I will do everything I can to keep at least a handful of MT papers appearing in each NAACL despite the fact that no one really works on it anymore :P).  Then there were spotlights followed by posters, passed (as well as sat down) hors d'oeuvres, free wine/sangria/beer/etc, and friends and colleagues.

The first half of the Teh/Orbanz tutorial is roughly what I would categorize as "NP Bayes 101" -- stuff that everyone should know, with the addition of some pointers to results about consistency, rates of convergence of the posterior, etc.  The second half included a lot of stuff that's recently become interesting, in particular topics like completely random measures, coagulation/fragmentation processes, and the connection between gamma processes (an example of a completely random measure) and Dirichlet processes (which we all know and love/hate).

One of the more interesting things toward the end was what I would roughly characterize as variants of the de Finetti theorem on exchangeable objects.  What follows is from memory, so please forgive errors: you can look it up in the tutorial.  De Finetti's theorem states that if p(X1, X2, ..., Xn, ...) is exchangeable, then p has a representation as a mixture model, with (perhaps) infinite-dimensional mixing coefficients.  This is a fairly well-known result, and was apparently part of the initial reason Bayesians started looking into non-parametrics.
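For concreteness, the standard statement for an infinite exchangeable sequence is that there exists a distribution mu over probability measures theta such that, conditioned on theta, the observations are i.i.d.:

```latex
p(x_1, \ldots, x_n) \;=\; \int \prod_{i=1}^{n} \theta(x_i) \, \mu(d\theta)
```

The "mixing coefficients" mu live over an (in general) infinite-dimensional space of measures, which is exactly where the non-parametrics come in.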

The generalizations (due to people like Kingman, Pitman, Aldous, etc...) are basically what happens for other types of data (i.e., other than just exchangeable).  For instance, if a sequence of data is block-exchangeable (think of a time-series, which is obviously not exchangeable, but for which you could conceivably cut it into a bunch of contiguous pieces and these pieces would be exchangeable) then it has a representation as a mixture of Markov chains.  For graph-structured data, if the nodes are exchangeable (i.e., all that matters is the pattern of edges, not precisely which nodes they happen to connect), then this also has a mixture parameterization, though I've forgotten the details.
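If I'm remembering the Markov case right (I believe this is the Diaconis/Freedman result), the representation looks something like a mixture over transition kernels Q, eliding the initial distribution:

```latex
p(x_1, \ldots, x_n) \;=\; \int \prod_{i=2}^{n} Q(x_{i-1}, x_i) \, \mu(dQ)
```

That is, conditioned on the draw of Q, the data is an ordinary Markov chain.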

The Tishby tutorial started off with some very interesting connections between information theory, statistics, and machine learning, essentially from the point of view of hypothesis testing.  The first half of the tutorial centered around information bottleneck, which is a very beautiful idea. You should all go read about it if you don't know it already.
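For reference, the bottleneck objective from the original Tishby/Pereira/Bialek formulation: compress X into a representation T that stays predictive of a relevance variable Y,

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y)
```

where beta controls the trade-off between compression and relevance.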

What actually really struck me was a comment that Tishby made somewhat off-hand, and I'm wondering if anyone can help me out with a reference.  The statement has to do with the question "why KL?"  His answer had two parts.  For the first part, consider mutual information (which is closely related to KL).  MI has the property that if "X -> Y -> Z" is a Markov chain, then the amount of information that Y gives you about Z is at most the amount of information that X gives you about Z.  In other words, if you think of Y as a "processed" version of X, then this processing cannot give you more information.  This property is more general than just MI, and I believe anything that obeys it is a Csiszar divergence.  The second part is the part that I'm not so sure of.  It originated with the observation that if you have a product and take a log, you get an additive term.  This is really nice because you can apply results like the central limit theorem to this additive term.  (Many of the results in the first half of his tutorial hinged on this additivity.)  The claim was something like: the only divergences that have this additivity are Bregman divergences.  (This is not obvious to me, and it's actually not entirely obvious what the right definition of additivity is, so if someone wants to help out, please do so!)  But the connection is that MI and KL are the "intersection" of Bregman divergences and Csiszar divergences.  In other words, if you want the decreasing-information property and you want the additivity property, then you MUST use information-theoretic measures.
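The decreasing-information property is easy to sanity-check numerically.  Here's a tiny sketch with a made-up binary chain X -> Y -> Z (the distributions are my own invention, nothing from the tutorial):

```python
# Numerical check of the data-processing inequality for mutual information
# on a Markov chain X -> Y -> Z.  All distributions below are hypothetical.
from math import log2
from itertools import product

px = {0: 0.5, 1: 0.5}                                # source distribution
py_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}    # noisy channel X -> Y
pz_y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}    # further processing Y -> Z

# the joint p(x, y, z) factorizes along the chain
pxyz = {(x, y, z): px[x] * py_x[x][y] * pz_y[y][z]
        for x, y, z in product((0, 1), repeat=3)}

def mi(pair):
    """Mutual information (in bits) between two coordinates of the joint."""
    joint, m0, m1 = {}, {}, {}
    for (x, y, z), p in pxyz.items():
        key = {('x', 'y'): (x, y), ('y', 'z'): (y, z), ('x', 'z'): (x, z)}[pair]
        joint[key] = joint.get(key, 0.0) + p          # marginalize out 3rd var
    for (a, b), p in joint.items():
        m0[a] = m0.get(a, 0.0) + p                    # marginal of first coord
        m1[b] = m1.get(b, 0.0) + p                    # marginal of second coord
    return sum(p * log2(p / (m0[a] * m1[b]))
               for (a, b), p in joint.items() if p > 0)

ixy, iyz, ixz = mi(('x', 'y')), mi(('y', 'z')), mi(('x', 'z'))
print(f"I(X;Y)={ixy:.4f}  I(Y;Z)={iyz:.4f}  I(X;Z)={ixz:.4f}")
# processing never adds information: I(X;Z) <= min(I(X;Y), I(Y;Z))
assert ixz <= min(ixy, iyz) + 1e-12
```

Swapping in any other channels for py_x and pz_y leaves the final assertion true, which is the whole point of the property.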

I confess that roughly the middle third of the talk went above my head, but I did learn about an interesting connection between Gaussian information bottleneck and CCA: basically they're the same, up to a trimming of the eigenvalues.  This is in a 2005 JMLR paper by Amir Globerson and others.  In the context of this, actually, Tishby made a very offhand comment that I couldn't quite parse as either a theorem or a hope.  Basically the context was that when working with Gaussian-distributed random variables, you can do information bottleneck "easily," but that it's hard for other distributions.  So what do we do?  We do a kernel mapping into a high-dimensional space (they use an RBF kernel) where the data will look "more Gaussian."  As I said, I couldn't quite parse whether this is "where the data will provably look more Gaussian" or "where we hope that maybe by dumb luck the data will look more Gaussian" or something in between.  If anyone knows the answer, again, I'd love to know.  And if you're here at NIPS and can answer either of these two questions to my satisfaction, I'll buy you a glass of wine (or beer, but why would you want beer? :P).

Anyway, that's my report for day one of NIPS!

p.s. I made the silly decision of taking a flight from Granada to Madrid at 7a on Monday 19 Dec.  This is way too early to take a bus, and I really don't want to take a bus Sunday night.  Therefore, I will probably take a cab.  I think it will be about 90 euros.  If you also were silly and booked early morning travel on Monday and would like to share said cab, please email me (me AT hal3 DOT name).

2 comments:

Suresh Venkatasubramanian said...

I'm confused about a few things: a Csiszar divergence (or f-divergence) is a divergence, so how do you relate it to MI?  Is it in the sense of the "Csiszar MI" being the average f-divergence to some mean value (a la Bregman information)?

On the second point, what does it mean to say that a Bregman divergence is "additive" ?

Mark said...

Mutual information can be expressed as the KL divergence between a distribution and the product of its marginals (i.e., how "far away" a distribution is from a similar one with independent variables).

I guess for an arbitrary f-divergence you could compute a similar quantity. For example, if P(X,Y) is a distribution and D_f is an f-divergence, you could define the MI w.r.t. D_f between X and Y by I_f(X,Y) = D_f(P(X,Y), P(X)P(Y)).

Perhaps by "additive" it was meant that a Bregman divergence is convex in one of its arguments?