22 January 2008

An NLPer in the Algorithmists Court

I just returned from SODA (the Symposium on Discrete Algorithms). Obviously (I suppose), I didn't have a paper there, but I was interested in learning new things. Moreover, it was in SF, which is a short, cheap flight and admits free housing by crashing with one of my very close friends. My fellow professorial newbie, Suresh Venkatasubramanian, ran a seminar this past Fall on approximate high-dimensional geometry that I attended. This is a topic that is excruciatingly close to many areas of machine learning, and I suspect that there will be a lot of cross-pollination in the next few years. I'll try to get around at some point to posting about things I learned in the seminar. Some are more ML-related, some more NLP-related. Anyway, I wanted to (a) see what a theory conference was like and (b) learn about the latest, greatest techniques. (Sadly, I had to miss the last day because I teach early Tuesday mornings.)

Below are a few of the papers I saw that I liked enough to put on my "to read" list. I'm not quite sure what my criterion for "liked" is, so take this with a huge grain of salt. If you want more authoritative opinions, see either Suresh's blog or one of the other theory blogs. I also warn you guys that most of these things really have nothing to do with NLP.

The first invited talk was by Bonnie Berger, talking about computational biology. Most of the topics were pretty standard: RNA/protein structure prediction, protein-protein interaction graphs, etc. There are some surface connections to NLP (e.g., some other people use stochastic CFGs to do structure prediction, though she does not). The twist she is taking is to try to solve these problems simultaneously for more than two organisms (typically: yeast, worm, fly, mouse, human; these are the organisms for which we have the most data). The rationale is based on the hypothesis that if a structure is biologically important, then it is (roughly) conserved across species. I feel like a lot of linguistics is based on roughly the same hypothesis, and this is probably a topic I'll try to blog about in the near(ish) future.

The second invited talk was by Persi Diaconis, who I've now seen give two talks and I just love him (I wish I had been able to take probability from him as an undergrad). At a very high level, his talk was a bit of evangelizing for functional combinatorics, which is essentially a paradigm for understanding combinatorial problems that unifies a large body of disparate problems in terms of the analysis of symmetric polynomials (which have, themselves, a rich theory). The particular example he gave was a relationship between carrying (like, what you do when you add numbers with pen and paper) and shuffling. I confess I learned a lot more about shuffling (which seems to be a bit of a love for Persi), but it was still pretty interesting. Not enough for me to learn functional combinatorics, but still interesting.

One of my favorite papers was Fast dimension reduction using Rademacher series on dual BCH codes by Nir Ailon and Edo Liberty. Johnson-Lindenstrauss gives us a method of embedding N points in D dimensions into K<<D dimensions that preserves l2 distances with high probability, using random projection matrices (there's lots of follow-on work, too). One problem is that these methods require storing a D*K matrix and each projection takes D*K operations. The result of this paper is that we can get the same guarantees using much less space and time per projection (or, in certain boundary cases, a matched space/time complexity). The key idea is to force oneself to use a random diagonal matrix composed with "something else" that gives us the desired property. The paper analyzes what properties the "something else" must have, and then constructs such a (deterministic) matrix based on Fourier codes. I thought it was both interesting theoretically and likely to be quite practical (though time will tell on the latter).
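
To make the basic setup concrete, here is a minimal numpy sketch of vanilla Johnson-Lindenstrauss projection (the dense-matrix baseline the paper improves on), with a comment marking where the Ailon-Liberty construction differs. All names here are mine, and this illustrates only the generic JL idea, not their algorithm.

    import numpy as np

    def jl_project(X, k, seed=0):
        # Vanilla JL: project rows of X (n x d) to k dims with a dense Gaussian
        # matrix scaled so that squared norms are preserved in expectation.
        # Ailon & Liberty instead compose a random +/-1 diagonal matrix with a
        # structured (Fourier/BCH-type) transform, which is faster and smaller.
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        R = rng.standard_normal((d, k)) / np.sqrt(k)
        return X @ R

    # quick sanity check: pairwise l2 distances are roughly preserved
    X = np.random.default_rng(1).standard_normal((20, 5000))
    Y = jl_project(X, k=400)
    print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(Y[0] - Y[1]))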

Another paper I really liked was Declaring independence via the sketching of sketches by Piotr Indyk and Andrew McGregor. The problem they are tackling is determining whether two variables are correlated, when the variables come in pairs in a stream. (For those who don't know streaming, think of it as an algorithm that gets one data point at a time and has to do something quickly using small space.) The two quantities that are relevant are the joint distribution over the pairs and the product-of-marginals distribution over the pairs. If these are very similar, the two items in the pair are likely to be independent. They measure distance in three ways: Euclidean (l2) distance, variational (l1) distance, and mutual information. For l2, they have a very clever analysis that is very straightforward to understand if you know anything about sketching (it's basically a natural extension of previous results by Indyk), and they are able to get pretty good bounds. They have similar results for the other metrics. Again, I can already think of a handful of possible applications of this stuff. It also leaves open an interesting question: what if you get k-tuples (instead of pairs) and you want to extract the M most-correlated pairs? I suspect you can do this efficiently by combining this result with known results on frequent item sets. This would have immediate application to Bayes net structure learning, among other things.
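
As a concrete (and heavily simplified) illustration of the l2 flavor of this result, here is a toy AMS-style estimator of the squared l2 distance between the empirical joint and the product of empirical marginals for a collection of pairs. I use fully independent random signs and multiple passes for clarity; part of the point of the paper is what you can still prove when these are replaced by small 4-wise independent hash functions maintained in a single pass. Names and constants are mine.

    import random
    from collections import defaultdict

    def l2_independence_estimate(pairs, reps=200, seed=0):
        # Estimates || joint - product-of-marginals ||_2^2 for a list of (a, b)
        # pairs. Each repetition draws random +/-1 signs for a-values and
        # b-values; z below is an unbiased sketch of P(a,b) - P(a)P(b), so
        # E[z^2] equals the squared l2 distance. (A real one-pass streaming
        # version maintains all repetitions in parallel over the stream.)
        rng = random.Random(seed)
        estimates = []
        n = len(pairs)
        for _ in range(reps):
            s = defaultdict(lambda: rng.choice((-1, 1)))  # sign per a-value
            t = defaultdict(lambda: rng.choice((-1, 1)))  # sign per b-value
            sj = sa = sb = 0
            for a, b in pairs:
                sj += s[a] * t[b]
                sa += s[a]
                sb += t[b]
            z = sj / n - (sa / n) * (sb / n)
            estimates.append(z * z)
        return sum(estimates) / len(estimates)

    # independent coordinates should give an estimate near zero
    data = [(random.randint(0, 9), random.randint(0, 9)) for _ in range(5000)]
    print(l2_independence_estimate(data))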

Venkatesan Guruswami, James Lee and Alexander Razborov had a paper on Almost Euclidean subspaces of l_1^N via expander codes (I can't find a link because Google doesn't like latex codes) that I thought was both interesting and very well presented. This paper is a bit harder for me to get across (much less understand!). At a high level, their observation is that for a lot of metric embedding problems, the only known way to get a good embedding (i.e., one that preserves distances or norms) is to use a randomized construction (and then prove that it succeeds with high probability). Their result is a deterministic, constructive embedding matrix for embedding l1 into l2 (under some qualifications). The other plus of this method is that the embedding matrix is sparse, which means that the images of the vectors under the embedding are sparse. There are also interesting connections to compressed sensing, which are mentioned in the paper.

There were other papers that I liked but am not going to try to summarize. Timothy Chan had an impressive (if somewhat ugly) result On the bichromatic k-set problem, which is effectively the problem of trying to figure out how many bad halfspace classifiers there are for labeled data (translated into machine learning lingo). Hubert Chan, Anupam Gupta and Kunal Talwar had a nice result on Ultra-low-dimensional embeddings for doubling metrics (I could only find a previous version that appeared in a NIPS workshop) that shows that if your metric doesn't have nearly uniform volume (characterized by the doubling dimension), then you can embed into Euclidean space with low distortion (this is somewhat surprising: classic results of Bourgain show that in general this is not possible). Despite the niceness of their result, the thing I found most interesting in the talk was a reference to a construction due to Satish Rao on Small distortion and volume preserving embeddings for planar and Euclidean metrics that they are basically extending.

Overall, I enjoyed the conference, despite getting pretty lost in a bunch of the talks. I wish I could have stayed for the last day, since there are a handful of papers that I think would probably be interesting. But I suppose that's what the 1300-page proceedings they dumped on me is for. In particular, the ones that look interesting from titles and skimming are: Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm (actually, I've read the TR version of this paper and it's very good); Sampling algorithms and coresets for l_p regression (which only considers the N>>D case, which is a bit limiting); and Linked decompositions of networks and power of choice in Polya urns, which has interesting connections to (shockingly) Polya urn schemes. In particular, they show that if you have a Polya urn where in each step you sample two urns proportional to their occupancy, but then insert into the smaller of the two, you get a balanced distribution. This is in contrast to the typical Polya urn schemes that show up in Dirichlet processes and Pitman-Yor processes, where you get power-law distributions. (The "power of two" reference is to hashing strategies where you hash by two independent hash functions and insert into the less occupied of the two bins.)
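
The two-choice urn effect is easy to see in simulation. Below is a quick sketch of my reading of the setup (a fixed set of urns, with the two-choice scheme sampling both urns proportional to occupancy and adding the new ball to the smaller one); the parameters are arbitrary.

    import random

    def polya(n_urns=10, steps=20000, two_choice=False, seed=0):
        # Standard Polya urn: pick one urn proportional to its occupancy and
        # add a ball to it (rich get richer, typically very uneven occupancies).
        # Two-choice variant: pick two urns, each proportional to occupancy,
        # and add the ball to the smaller of the two (stays nearly balanced).
        rng = random.Random(seed)
        counts = [1] * n_urns
        for _ in range(steps):
            i = rng.choices(range(n_urns), weights=counts)[0]
            if two_choice:
                j = rng.choices(range(n_urns), weights=counts)[0]
                i = i if counts[i] <= counts[j] else j
            counts[i] += 1
        return sorted(counts, reverse=True)

    print(polya(two_choice=False))  # heavily skewed
    print(polya(two_choice=True))   # nearly uniform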

(p.s., I apologize if I misquoted the exact results in any of these papers; most of this post is based on memory and very short notes I took.)

19 January 2008

We're hiring machine learning/robotics...

Just a quick note because this is a bit off-topic for this blog, but I wanted to let you all know that this year we're trying to hire a machine learning/robotics person. If you do machine learning that is in any way robotics-related, I'd encourage you to apply. (I suppose the same holds the other way around, but I probably wouldn't be quite as excited.) The "deadline for full consideration" is listed as Jan 15, which has passed, but if you apply by the end of the month, things should be fine. In the interest of full disclosure, we're also hiring in graphics, systems and sensor networks.

17 January 2008

Syntax-based and syntax-informed translation

Happy new year, all... Apologies for being remiss about posting recently. (Can I say "apologies about my remission"?) This post is a bit more of a review of what's out there, rather than trying out new ideas. Citations are not meant to get anywhere close to full coverage.

There are a handful of ways of viewing the syntactic MT problem, essentially delineated by which side(s) of the translation the tree appears on, and whether the tree is induced entirely from data or comes from a treebank. There's additionally the constituent- versus dependency-tree issue.

  1. String-to-tree (constituent). Here, we have a string on the input and a tree on the output. In Chinese-to-English translation, this would mean that at translation time, we're essentially trying to parse Chinese into an English tree by means of local reordering. This was essentially what Yamada and Knight were doing back in 2001 and 2002. In order to train such a model, we need (a) parallel data and (b) a parser on the target language. Translation is typically done by some variant of CKY.
  2. Tree-to-string (constituent). Here, we map an input tree to an output string. In C2E, this means that at translation time, we first parse the Chinese, then reorder and flatten it out into an English string. This is essentially what the JHU workshop on adding syntactic features to phrase-based translation was trying to do (though in somewhat of a weak way), and also what Liang Huang has been doing lately. In order to train, we need (a) parallel data and (b) a parser on the source side. Translation can be done any way you want, though once parsing is done, it's usually a lot easier than running CKY. (A toy sketch of this parse-reorder-flatten process appears just after this list.)
  3. Tree-to-tree (constituent). Here, we map an input tree to an output tree. In C2E, this means that at translation time, we first parse the Chinese, then reorder the tree and translate subtrees/leaves into English. Koehn and Collins have worked on this problem. In order to train, we need (a) parallel data, (b) a parser on the source side and (c) a parser on the target side. Translation is similar to Tree-to-string.
  4. String-to-tree-to-string (constituent). I'm not sure if there's agreed-upon terminology for this task, but the idea here is to translate without having a parser on either side. At translation time, we essentially parse and translate simultaneously (rather like string-to-tree). But at training time, we don't have access to source trees: we have to induce them from just (a) parallel data. This is typified by Wu's inversion transduction grammars and more recently by David Chiang's Hiero system (plus others).
  5. Any of the above but on dependency trees. To my knowledge, the only groups that are seriously working on this are Microsoft (see, e.g., Chris Quirk) and our friends at Charles University.
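
To make the tree-to-string idea from item 2 concrete, here is a toy transfer sketch: the source parse is given, and translation recursively applies hand-written reordering rules plus a word-for-word lexicon to flatten the tree into a target string. The grammar, rules and lexicon below are invented for illustration; real systems extract such rules automatically from aligned parallel data.

    # Toy tree-to-string transfer. The source tree is given as nested tuples;
    # translation flattens it by recursively applying reordering rules and a
    # word-level dictionary. Everything here is made up for illustration.

    LEXICON = {"ta": "he", "du": "reads", "shu": "books"}

    def transfer(node):
        if isinstance(node, str):              # leaf: dictionary lookup
            return [LEXICON.get(node, node)]
        label, children = node[0], node[1:]
        if label == "VP":                      # pretend source VPs are verb-final:
            np, v = children                   # (VP NP V) -> reorder to V NP
            return transfer(v) + transfer(np)
        return [w for c in children for w in transfer(c)]

    src_tree = ("S", ("NP", "ta"), ("VP", ("NP", "shu"), ("V", "du")))
    print(" ".join(transfer(src_tree)))        # -> "he reads books"
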
Below is my attempt to categorize the pros and cons of these different approaches along a wide variety of metrics. The metrics are: resource usage (how much labeled data do we need?), training ease (computationally), testing ease (computationally), ngram LM (is it easy to include a target ngram language model?), syntactic LM (is it easy to include a target syntactic language model?), minimal assumptions (does the model make strong assumptions about, for instance, the degree of matchy-ness between source and target?), and extensibility (is it easy to add additional features to the input side, such as NE info, without breaking things?).


                S2T    T2S    T2T    S2T2S
resource        med    med    bad    good
train-ease      good   good   good   bad
test-ease       bad    good   good   bad
ngram-lm        med    good   med    good
syntax-lm       good   bad    good   bad
assumptions     good   good   bad    good
extensibility   med    good   good   bad

I didn't include the dependency-tree versions in this table because the answer essentially depends on what the underlying model is, and then it remains the same as in the constituent case. The only caveat is that syntactic language modeling with dependency trees is not as well proven as with constituency trees.

As far as syntactic language modeling goes, it is pretty much easy exactly when you are producing trees on the target side; otherwise, not. Tree-to-tree gets the unique distinction of being bad with respect to assumptions, essentially because you have to do tree matching, which is hard to do well. You could argue that certain forms of S2T2S should be here too (Wellington, Turian and Melamed had a paper at ACL 2006 on the assumptions that ITG makes).

In terms of resource usage, T2T is obviously the worst and S2T2S is obviously the best. Whether S2T or T2S is better depends on the language pair.

There's of course the issue of which of these works best. For this, I don't really know (I'm sure people involved in MTeval and GALE can chime in here, though of course these competitions are always "unfair" in the sense that they also measure engineering ability).

At this point, I'd like to relate this back to the recent discussion of Translation out of English. One thing to notice is that when translating into English, S2T has the advantage that all we need is an English parser (there are a few of those lying around if you search hard enough), whereas T2S requires a foreign-language parser. However, when translating from English, T2S starts looking a lot more attractive because we have good English parsers. Actually, we have a lot of good annotations for English (NE coref, for instance; discourse parsing to some degree; etc.). From this perspective, the relative ease of extensibility of T2S models begins to look pretty attractive. That is to say, if I were actually to work on translation out of English myself, I would probably seriously consider T2S models. Of course, if you're translating from X to Y, where neither X nor Y is English, then you're kind of hosed and might prefer S2T2S.