22 January 2008

An NLPer in the Algorithmists Court

I just returned from SODA (the Symposium on Discrete Algorithms). Obviously (I suppose), I didn't have a paper there, but I was interested in learning new things. Moreover, it was in SF which is a short cheap flight and admits free housing by crashing with one of my very close friends. My fellow professorial newbie, Suresh Venkatasubramanian, ran a seminar this past Fall on approximate high-dimensional geometry that I attended. This is a topic that is excruciatingly close to many areas of machine learning, and I suspect that there will be a lot of cross-polination in the next few years. I'll try to get around at some point to post about things I learned in the seminar. Some are more ML-related, some more NLP. Anyway, I wanted to (a) see what a theory conference was like and (b) learn about the latest, greatest techniques. (Sadly, I had to miss the last day because I teach early Tuesday mornings.)

Below are a few of the papers I saw that I liked enough to put them on my "to read" list. I'm not quite sure what my qualification for "liked" is, so take this with a huge grain of salt. If you want more authoritative opinions, see either Suresh's blog or one of the other theory blogs. I also warn you guys that most of these things really have nothing to do with NLP.

The first invited talk was by Bonnie Berger, talking about computational biology. Most of the topics were pretty standard: RNA/protein structure prediction, protein-protein interaction graphcs, etc. There are some surface connections to NLP (eg., some other people use stochastic CFGs to do structure prediction (though she does not). The twist that she is taking is to try to solve these problems simultaneously for more than two organisms (typically: yeast, worm, fly, mouse, human; these are the organisms for which we have the most data). The rational is based on the hypothesis that if a structure is biologically important, then it is (roughly) conserved across species. I feel like a lot of linguistics is based on roughly the same hypothesis, and this is probably a topic I'll try to blog about in the near(ish) future.

The second invited talk was by Persi Diaconis, who I've now seen give two talks and I just love him (I wish I had been able to take probability from him as an undergrad). At a very high level, his talk was a bit of evangelizing functional combinatorics, which is essentially a paradigm for understanding combinatorial problems that unifies a large body of disparate problems in terms of the analysis of symmetric polynomials (which have, themselves, a rich theory). The particular example he gave was a relationship between carrying (like, what you do when you add numbers with pen and paper) and shuffling. I confess I learned a lot more about shuffling (which seems to be a bit of a love for Perci), but it was still pretty interesting. Not enough for me to learn functional combinatorics, but still interesting.

One of my favorite papers was Fast dimension reduction using Rademacher series on dual BCH codes by Nir Ailon and Edo Liberty. Johnson-Lindenstrauss gives us a method of embedding N points in D dimensions into K<<D dimensions that preserve l2 distances with high probability, using random projection matrices (lots of follow on work, too). One problem is that these methods require storing a D*K matrix and each projection takes D*K operations. The result of this paper is that we can get the same results using much smaller space and time for each projection (or, in certain boundary cases, a matched space/time complexity). The key idea is to force oneself to use a random diagonal K-matrix composed with "something else" that gives us the desired property. The paper analyzes what properties the "something else" must have, and then constructs such a (deterministic) matrix based on Fourier codes. I thought it was both interesting theoretically and liketly to be quite practical (though time will tell on the latter).

Another paper I really liked was Declaring independence via the sketching of sketches by Piotr Indyk and Andrew McGregor. The problem they are tackling is trying to determing whether two variables are correlated, when the variables come in pairs in a stream. (For those who don't know streaming, think of it as an algorithm that gets one data point at a time and has to do something quickly using small space.) The two quantities that are relevant are the joint distribution over the pairs, and the product-of-marginals distribution over the pairs. If these are very similar, the two items in the pair are likely to be independent. They measure distance in three ways: Euclidean (l2) distance, variational (l1) distance, and mutual information. For l2, the have a very clever analysis that is very straightforward to understand if you know anything about sketching (it's basically a natural extension of previous results by Indyk). They are able to get pretty good results. They have similar results for the other metrics. Again, I can already think of a handful of possible applications of this stuff. It also leaves open an interesting question: what if you get k-tuples (instead of pairs) and you want to extract the M-most-correlated pairs. I suspect you can do this efficiently using a combination of this result with known results on frequent item sets. This would have immediate application to Bayes net structure learning, among other things.

Venkatesan Guruswami, James Lee and Alexander Razborov had a paper on Almost Euclidean subspaces of l_1^N via expander codes (I can't find a link because Google doesn't like latex codes) that I thought was both interesting and very well presented. This paper is a bit harder for me to get across (much less understand!). At a high level, their observation is that for a lot of metric embedding problems, the only known way to get a good embedding (i.e., one that preserves distances or norms) is to use a randomize construction (and then prove that it happens with high probability). Their result is a deterministic, constructive embedding matrix for embedding l1 into l2 (under some qualifications). The other plus of this method is that the embedding matrix is sparse, which means that the image of the vectors under the embedding ar sparse. There are also interesting connections to compressed sensing, which are mentioned in the paper.

There were other papers that I liked, but that I'm not going to try to summarize. Timothy Chan had a impressive (if somewhat ugly) result On the bichormatic k-set problem, which is effectively the problem of trying to figure out how many bad halfspace classifiers there are for labeled data (translated into machine learning lingo). Hubert Chan, Anupam Gupta and Kunal Talwar had a nice result on Ultra-low-dimensional embeddings for doubling metrics (I could only find a previous version that appeared in a NIPS workshop) that shows that if your metric doesn't have nearly uniform volume (characterized by the doubling dimension) then you can embed into Euclidean space with low distortion (this is somewhat surprising: classic results of Bourgain show that in general this is not possible). Despite the niceness of their result, the thing I found most interesting in the talk was a reference to a construction due to Satish Rao on Small distortion and volume preserving embeddings for planar and Euclidean metrics that they basically are extending.

Overall, I enjoyed the conference, despite getting pretty lost in a bunch of the talks. I wish I could have stayed for the last day, since there are a handful of papers that I think would probably be interesting. But I suppose that's what this 1300 page proceedings they dumped on me is for. In particular, the ones that look interesting from titles and skimming are: Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm (actually, I've read the TR version of this paper and it's very good); Sampling algorithms and coresets for l_p regression (only considers the N>>D case, which is a bit limiting) and Linked decompositions of networks and power of choice in Polya urns, which has interesting connections to (shockingly) Polya urn schemes. In particular, they show that if you have a Polya urn where in each step you sample two urns proportional to their occupancy, but then insert into the smaller of the two, then you get a balanced distribution. This is in contrast to typically Polya urn schemes that show up in Dirichlet processes and Pitman-Yor processes where you get power law distributions. (The "power of two" references is to hashing strategies where you hash by two independent hash functions and insert into the lesser occupied of the two bins.)

(p.s., I apologize if I misquoted the exact results in any of these papers; most of this post is based on memory and very short notes I took.)


Igor said...


Thanks for the summary.

The paper by Venkatesan Guruswami, James Lee and Alexander Razborov is on Jame's Lee's Page:



Kevin Duh said...

I love this post! (Though I don't understand much of the technical descriptions yet). *This* post is on my To-Read(again) list. Please keep it coming. :)

Anonymous said...

Hi Hal,

Nice post! Just a quick comment: the l2 result is a natural extension of the previous result by *Alon-Matias-Szegedy*, not Indyk (I wish I could take credit for that wonderful paper though :)



Anonymous said...

Nice blog. I just discovered your blog and i enjoy a lot about your recommendation" how to start NLP". i am so far from US exactly in the border between Iran and Iraq and so much interested in NLP. right now Manning and Schutz.
Best wished

Anonymous said...

Hi Hal,

Very nice post. Thanks for the good job.


Anonymous said...

酒店經紀PRETTY GIRL 台北酒店經紀人 ,禮服店 酒店兼差PRETTY GIRL酒店公關 酒店小姐 彩色爆米花酒店兼職,酒店工作 彩色爆米花酒店經紀, 酒店上班,酒店工作 PRETTY GIRL酒店喝酒酒店上班 彩色爆米花台北酒店酒店小姐 PRETTY GIRL酒店上班酒店打工PRETTY GIRL酒店打工酒店經紀 彩色爆米花

gamefan12 said...

This is a topic that is excruciatingly close to many areas of machine learning, and I suspect that there will be a lot of cross-polination in the next few years. I'll try to get around at some point to post about things I learned in the seminar.
boca raton sedation dentist