I just returned from SODA (the Symposium on Discrete Algorithms). Obviously (I suppose), I didn't have a paper there, but I was interested in learning new things. Moreover, it was in SF which is a short cheap flight and admits free housing by crashing with one of my very close friends. My fellow professorial newbie, Suresh Venkatasubramanian, ran a seminar this past Fall on approximate high-dimensional geometry that I attended. This is a topic that is excruciatingly close to many areas of machine learning, and I suspect that there will be a lot of cross-polination in the next few years. I'll try to get around at some point to post about things I learned in the seminar. Some are more ML-related, some more NLP. Anyway, I wanted to (a) see what a theory conference was like and (b) learn about the latest, greatest techniques. (Sadly, I had to miss the last day because I teach early Tuesday mornings.)

Below are a few of the papers I saw that I liked enough to put them on my "to read" list. I'm not quite sure what my qualification for "liked" is, so take this with a huge grain of salt. If you want more authoritative opinions, see either Suresh's blog or one of the other theory blogs. I also warn you guys that most of these things really have nothing to do with NLP.

The first invited talk was by Bonnie Berger, talking about computational biology. Most of the topics were pretty standard: RNA/protein structure prediction, protein-protein interaction graphcs, etc. There are some surface connections to NLP (eg., some other people use stochastic CFGs to do structure prediction (though she does not). The twist that she is taking is to try to solve these problems simultaneously for more than two organisms (typically: yeast, worm, fly, mouse, human; these are the organisms for which we have the most data). The rational is based on the hypothesis that if a structure is biologically important, then it is (roughly) conserved across species. I feel like a lot of linguistics is based on roughly the same hypothesis, and this is probably a topic I'll try to blog about in the near(ish) future.

The second invited talk was by Persi Diaconis, who I've now seen give two talks and I just love him (I wish I had been able to take probability from him as an undergrad). At a very high level, his talk was a bit of evangelizing functional combinatorics, which is essentially a paradigm for understanding combinatorial problems that unifies a large body of disparate problems in terms of the analysis of symmetric polynomials (which have, themselves, a rich theory). The particular example he gave was a relationship between carrying (like, what you do when you add numbers with pen and paper) and shuffling. I confess I learned a lot more about shuffling (which seems to be a bit of a love for Perci), but it was still pretty interesting. Not enough for me to learn functional combinatorics, but still interesting.

One of my favorite papers was Fast dimension reduction using Rademacher series on dual BCH codes by Nir Ailon and Edo Liberty. Johnson-Lindenstrauss gives us a method of embedding N points in D dimensions into K<<D dimensions that preserve l2 distances with high probability, using random projection matrices (lots of follow on work, too). One problem is that these methods require storing a D*K matrix and each projection takes D*K operations. The result of this paper is that we can get the same results using much smaller space and time for each projection (or, in certain boundary cases, a matched space/time complexity). The key idea is to force oneself to use a random diagonal K-matrix composed with "something else" that gives us the desired property. The paper analyzes what properties the "something else" must have, and then constructs such a (deterministic) matrix based on Fourier codes. I thought it was both interesting theoretically and liketly to be quite practical (though time will tell on the latter).

Another paper I really liked was Declaring independence via the sketching of sketches by Piotr Indyk and Andrew McGregor. The problem they are tackling is trying to determing whether two variables are correlated, when the variables come in pairs in a stream. (For those who don't know streaming, think of it as an algorithm that gets one data point at a time and has to do something quickly using small space.) The two quantities that are relevant are the joint distribution over the pairs, and the product-of-marginals distribution over the pairs. If these are very similar, the two items in the pair are likely to be independent. They measure distance in three ways: Euclidean (l2) distance, variational (l1) distance, and mutual information. For l2, the have a very clever analysis that is very straightforward to understand if you know anything about sketching (it's basically a natural extension of previous results by Indyk). They are able to get pretty good results. They have similar results for the other metrics. Again, I can already think of a handful of possible applications of this stuff. It also leaves open an interesting question: what if you get k-tuples (instead of pairs) and you want to extract the M-most-correlated pairs. I suspect you can do this efficiently using a combination of this result with known results on frequent item sets. This would have immediate application to Bayes net structure learning, among other things.

Venkatesan Guruswami, James Lee and Alexander Razborov had a paper on *Almost Euclidean subspaces of l_1^N via expander codes* (I can't find a link because Google doesn't like latex codes) that I thought was both interesting and very well presented. This paper is a bit harder for me to get across (much less understand!). At a high level, their observation is that for a lot of metric embedding problems, the only known way to get a good embedding (i.e., one that preserves distances or norms) is to use a randomize construction (and then prove that it happens with high probability). Their result is a *deterministic, constructive* embedding matrix for embedding l1 into l2 (under some qualifications). The other plus of this method is that the embedding matrix is *sparse*, which means that the image of the vectors under the embedding ar sparse. There are also interesting connections to compressed sensing, which are mentioned in the paper.

There were other papers that I liked, but that I'm not going to try to summarize. Timothy Chan had a impressive (if somewhat ugly) result On the bichormatic k-set problem, which is effectively the problem of trying to figure out how many *bad* halfspace classifiers there are for labeled data (translated into machine learning lingo). Hubert Chan, Anupam Gupta and Kunal Talwar had a nice result on Ultra-low-dimensional embeddings for doubling metrics (I could only find a previous version that appeared in a NIPS workshop) that shows that if your metric doesn't have nearly uniform volume (characterized by the doubling dimension) then you *can* embed into Euclidean space with low distortion (this is somewhat surprising: classic results of Bourgain show that in general this is not possible). Despite the niceness of their result, the thing I found most interesting in the talk was a reference to a construction due to Satish Rao on Small distortion and volume preserving embeddings for planar and Euclidean metrics that they basically are extending.

Overall, I enjoyed the conference, despite getting pretty lost in a bunch of the talks. I wish I could have stayed for the last day, since there are a handful of papers that I think would probably be interesting. But I suppose that's what this 1300 page proceedings they dumped on me is for. In particular, the ones that look interesting from titles and skimming are: Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm (actually, I've read the TR version of this paper and it's very good); Sampling algorithms and coresets for l_p regression (only considers the N>>D case, which is a bit limiting) and Linked decompositions of networks and power of choice in Polya urns, which has interesting connections to (shockingly) Polya urn schemes. In particular, they show that if you have a Polya urn where in each step you sample *two* urns proportional to their occupancy, but then insert into the *smaller* of the two, then you get a balanced distribution. This is in contrast to typically Polya urn schemes that show up in Dirichlet processes and Pitman-Yor processes where you get power law distributions. (The "power of two" references is to hashing strategies where you hash by two independent hash functions and insert into the lesser occupied of the two bins.)

(p.s., I apologize if I misquoted the exact results in any of these papers; most of this post is based on memory and very short notes I took.)

## 22 January 2008

### An NLPer in the Algorithmists Court

Posted by hal at 1/22/2008 01:38:00 PM

Labels: algorithms, papers, problems

Subscribe to:
Post Comments (Atom)

## 13 comments:

Hal,

Thanks for the summary.

The paper by Venkatesan Guruswami, James Lee and Alexander Razborov is on Jame's Lee's Page:

http://www.cs.washington.edu/homes/jrl/papers/spread.pdf

Cheers,

Igor.

http://nuit-blanche.blogspot.com

I love this post! (Though I don't understand much of the technical descriptions yet). *This* post is on my To-Read(again) list. Please keep it coming. :)

Hi Hal,

Nice post! Just a quick comment: the l2 result is a natural extension of the previous result by *Alon-Matias-Szegedy*, not Indyk (I wish I could take credit for that wonderful paper though :)

Cheers,

Piotr

Hi

Nice blog. I just discovered your blog and i enjoy a lot about your recommendation" how to start NLP". i am so far from US exactly in the border between Iran and Iraq and so much interested in NLP. right now Manning and Schutz.

Best wished

Hassan

Hi Hal,

Very nice post. Thanks for the good job.

-Reza

It is the knight noah which makes me very happy these days, my brother says knight gold is his favorite games gold he likes, he usually knight online gold to start his game and most of the time he will win the knight online noah back and give me some cheap knight gold to play the game.

酒店經紀PRETTY GIRL 台北酒店經紀人 ,禮服店 酒店兼差PRETTY GIRL酒店公關 酒店小姐 彩色爆米花酒店兼職,酒店工作 彩色爆米花酒店經紀, 酒店上班,酒店工作 PRETTY GIRL酒店喝酒酒店上班 彩色爆米花台北酒店酒店小姐 PRETTY GIRL酒店上班酒店打工PRETTY GIRL酒店打工酒店經紀 彩色爆米花

艾葳酒店經紀公司提供專業的酒店經紀, 酒店上班小姐,八大行業,酒店兼職,傳播妹,或者想要打工兼差、打工,兼差,八大行業,酒店兼職,想去酒店上班, 日式酒店,制服酒店,ktv酒店,禮服店,整天穿得水水漂漂的,還是想去制服店當日領上班小姐,水水們如果想要擁有打工工作、晚上兼差工作、兼差打工、假日兼職、兼職工作、酒店兼差、兼差、打工兼差、日領工作、晚上兼差工作、酒店工作、酒店上班、酒店打工、兼職、兼差、兼差工作、酒店上班等,想了解酒店相關工作和特種行業內容,想兼職工作日領、假日兼職、兼差打工、或晚班兼職想擁有鋼琴酒吧又有保障的工作嗎???又可以現領請找專業又有保障的艾葳酒店經紀公司!

艾葳酒店經紀是合法的公司工作環境高雅時尚，無業績壓力，無脫秀無喝酒壓力，高層次會員制客源，工作輕鬆，可日領、現領。

一般的酒店經紀只會在水水們第一次上班和領薪水時出現而已，對水水們的上班安全一點保障都沒有！艾葳酒店經紀公司的水水們上班時全程媽咪作陪，不需擔心！只提供最優質的酒店上班,酒店上班,酒店打工環境、上班條件給水水們。心動嗎!? 趕快來填寫你的酒店上班履歷表

水水們妳有缺現領、有兼職、缺錢便服店的煩腦嗎?想到日本留學缺錢嗎?妳是傳播妹??想要擁有高時薪又輕鬆的賺錢,酒店和,假日打工,假日兼職賺錢的機會嗎??想實現夢想卻又缺錢沒錢嗎!??

艾葳酒店台北酒店經紀招兵買馬!!徵專業的酒店打工,想要去酒店的水水,想要短期日領,酒店日領,禮服酒店,制服店,酒店經紀,ktv酒店,便服店,酒店工作,禮服店,酒店小姐,酒店經紀人,

等相關服務 幫您快速的實現您的夢想~!!

This is a topic that is excruciatingly close to many areas of machine learning, and I suspect that there will be a lot of cross-polination in the next few years. I'll try to get around at some point to post about things I learned in the seminar.

boca raton sedation dentist

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..

sesli sohbetsesli chatkamerali sohbetseslisohbetsesli sohbet sitelerisesli chat siteleriseslichatsesli sohpetseslisohbet.comsesli chatsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet

seslisohbetsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..

sesli sohbet

seslisohbet

sesli chat

seslichat

sesli sohbet sitesi

sesli chat sitesi

sesli sohpet

kamerali sohbet

kamerali chat

webcam sohbet

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..

seslisohbet

seslichat

sesli sohbet

sesli chat

sesli

sesli site

görünlütü sohbet

görüntülü chat

kameralı sohbet

kameralı chat

sesli sohbet siteleri

sesli chat siteleri

görüntülü sohbet siteleri

görüntülü chat siteleri

kameralı sohbet siteleri

canlı sohbet

sesli muhabbet

görüntülü muhabbet

kameralı muhabbet

seslidunya

seslisehir

sesli sex

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..

seslisohbet

seslichat

sesli sohbet

sesli chat

sesli

sesli site

görünlütü sohbet

görüntülü chat

kameralı sohbet

kameralı chat

sesli sohbet siteleri

sesli chat siteleri

sesli muhabbet siteleri

görüntülü sohbet siteleri

görüntülü chat siteleri

görüntülü muhabbet siteleri

kameralı sohbet siteleri

kameralı chat siteleri

kameralı muhabbet siteleri

canlı sohbet

sesli muhabbet

görüntülü muhabbet

kameralı muhabbet

birsesver

birses

seslidunya

seslisehir

sesli sex

Post a Comment