26 March 2008

N-best lists and duplication

One thing that continually irks me about (some varieties of) n-best lists is duplicates. Duplicates typically arise due to an "arg max" approximation in a structured search process. In other words, we have some model that decomposes into p(out, hid | in). Here, "in" is the input, "out" is the desired output and "hid" is some hidden/latent variable. For example, in phrase-based MT, "in" is French, "out" is English and "hid" might be the phrasal segmentation. The problem that arises is that we typically do not compute p(out | in) by summing over hidden states; instead, we try to find the "out"/"hid" pair that maximizes the joint probability p(out, hid | in).

Now, the first problem with this is that we're not actually after the out/hid pair that maximizes the joint; we're after the out that maximizes the marginal p(out | in) = sum_hid p(out, hid | in). But this is largely a separate issue.

The issue that I care about is that when we look at an N-best list, we get the top N out/hid pairs, rather than the top N outputs. This leads to "duplication" in the N-best list, where many of the top N are actually the same output, but with different latent variables.
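
To make the duplication issue concrete, here is a minimal sketch (with made-up data; the triple format is just an assumption, not tied to any particular decoder's output) of collapsing an n-best list of (output, latent, log-probability) entries into unique outputs by summing the joint score over the latent variable:

```python
import math
from collections import defaultdict

def collapse_nbest(nbest):
    """Collapse (output, hidden, log_prob) triples into unique outputs,
    approximating the marginal by log-sum-exp over the latent variable."""
    marginal = defaultdict(lambda: float("-inf"))
    for out, hid, logp in nbest:
        m = marginal[out]
        hi, lo = max(m, logp), min(m, logp)
        # log-sum-exp of the running total and the new joint score
        marginal[out] = hi + math.log1p(math.exp(lo - hi)) if lo > float("-inf") else hi
    return sorted(marginal.items(), key=lambda kv: -kv[1])

# toy example: three n-best entries, but only two unique outputs
nbest = [("the cat sat", "seg1", -2.0),
         ("the cat sat", "seg2", -2.3),
         ("a cat sat",   "seg1", -2.1)]
print(collapse_nbest(nbest))
```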

One might ask: is this a big problem? I suspect so. Below is a figure that shows how many unique outputs we get in a top-5000 list from phrase-based translation (French to English). There are 2000 sentences being translated (along the x-axis) and each produces some number of unique outputs (along the y-axis). The sentences are sorted by the number of uniques they produce. So the ones on the left produce almost all 5000 unique outputs, but the ones on the right produce only one or two unique outputs. (Both axes are log-scaled.)
What we observe is that the median number of unique outputs in the top-5000 list is only 56! The mean is only 178. Over 10% of them have ten or fewer uniques. Only six have over 2500 uniques.

This does not bode well for any system that intends to pipeline the N-best list into some other module, whether that is a straightforward reranker or something more complex, for instance a summarization system or search system.

One may ask: does the number of unique outputs correlate with the input sentence length? The answer is: sort of. The following figure plots input sentence length along the x-axis and number of unique translations along the y-axis.

We see that there is some weak correlation. Pretty much all of the examples that lead to large numbers of uniques are short sentences (length 10 or less). "Average" sentences of length 30-50 seem to produce roughly 250 uniques, and long sentences (>100) tend to produce only a handful of outputs. Since average sentences are, in many ways, the most interesting, this is a bit frustrating. I don't particularly care about getting lots of translations for 10-word-long sentences because they're kind of boring anyway.

So what can be done about this? Here are a few things that come to mind, though I'm not sure any is a complete solution:

  • Generate more than 5000 total and hope to get more than 50 unique. Sure, this works (at the expense of computation and memory), but a 1% return is really quite bad. If we could even get, say, a 10% or 20% return I would be much much happier.
  • Instead of using n-best lists, generate samples from the posterior (here, I assume that marginalization is still too much to ask for). Probably you'd want to take the 1-best as well, since there's no guarantee that the MAP would show up in a finite sample. I'm also not sure we know how to do this (efficiently) for arbitrary models.
  • Try to optimize the n-best list for diversity, instead of simply optimality. You can write this down, but it seems hard to implement and may be computationally less pleasant than just generating 50,000 total.
  • Improve hypothesis recombination inside the decoder. Here, my worry would be that it wouldn't be until the final step when we could combine everything, in which case this boils down to the "generate 50,000" approach.
I'm sure there are other ways, but these are what come to mind. How do other people deal with this? What size n-best lists are popular?

22 March 2008

ICML/UAI/COLT Workshops Posted

See here for the current list. They include: Nonparametric Bayes (woohoo!), machine learning and music, Bayesian modeling applications, prior knowledge for text and language processing, sparse optimization and variable selection, as well as stand-alone workshops on the reinforcement learning competition and mining and learning with graphs.

Because I'm one of the organizers, I'd like to call attention to the Prior knowledge for text and language processing workshop. We'd definitely like submissions on any of the following topics:

  • Prior knowledge for language modeling, parsing, translation
  • Topic modeling for document analysis and retrieval
  • Parametric and non-parametric Bayesian models in NLP
  • Graphical models embodying structural knowledge of texts
  • Complex features/kernels that incorporate linguistic knowledge; kernels built from generative models
  • Limitations of purely data-driven learning techniques for text and language applications; performance gains due to incorporation of prior knowledge
  • Typology of different forms of prior knowledge for NLP (knowledge embodied in generative Bayesian models, in MDL models, in ILP/logical models, in linguistic features, in representational frameworks, in grammatical rules…)
  • Formal principles for combining rule-based and data-based approaches to NLP
  • Linguistic science and cognitive models as sources of prior knowledge
Yes, I know that's a shameless plug, but do you really expect better from me?!

19 March 2008

What to do with a million summaries?

Let's pretend.

Let's pretend that someone gave you one million document/summary pairs. If you like single-document, pretend they're single-document; if you like multi-document, pretend they're multi-document.

For those of us who work on summarization, this seems like it would be a pretty cool gift. Most of the corpora we're used to using have, at best, a few hundred such pairs. Some (eg., for headline generation) have more, but then I didn't allow you to pretend that these were headlines (I can get you billions of these, if you really want them).

So here's the question I pose: what would you do with a million summaries?

I actually have a hard time answering this question. Sure, we have whatever our favorite sentence extraction method is. If it's learned at all, it's probably learned over a dozen or so features: position, length, similarity to query (if a query exists), similarity to document centroid, etc. This would probably be optimized against some automatic measure like one of the many flavors of Rouge. Fitting a dozen parameters on a corpus of 100 examples is probably pretty reasonable, and we have some results that suggest that we've gone about as far as we can go with sentence extraction (at least with respect to the DUC corpora); see, for instance, section 6.6 of my thesis. There, we see that we're pretty much matching oracle performance at sentence extraction on DUC04 data when evaluated using Rouge (I've independently seen other people present similar results, so I think it's replicable). (Yes, I'm aware there are caveats here: the use of Rouge, the fixation on one corpus, etc.)

But now we have a million doc/sum pairs. Fitting a dozen parameters on a million examples seems a bit wasteful. It also seems a bit wasteful to continue to focus on sentence extraction in this case. Why? Well, first, we've already "solved" this problem (the quotes indicate the caveats hinted at above). Second, I have a seriously hard time thinking of too many more high level features that I could possibly tune (my best entry ever into DUC---I think we came in second or third, depending on the scoring---had about 25, many of which ended up getting very very small weights).

So, being me, my next thought is: do word alignment on the summaries, like what they do in machine translation. It turns out that somebody has already tried this, with a reasonable amount of success. In all seriousness, if I were to try something like this again, I think I would throw out the "phrase" issue and deal with words; probably also consider throwing out the HMM issue and do something akin to Model 1. The key difference is that I would continue to include the additional features, like stem-identity, WordNet, etc. I might also throw in some word clustering just for fun.
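
For concreteness, here is a minimal sketch of the Model 1 flavor of that idea: plain EM over word-to-word (document word generating summary word) probabilities, with no phrases, no HMM, no NULL word, and none of the extra features (stem identity, WordNet, clusters) mentioned above. The toy data is made up:

```python
from collections import defaultdict

def model1_em(pairs, iterations=10):
    """IBM Model 1 style EM: pairs is a list of (document_words, summary_words)
    tuples; returns t[(s, d)], the probability of summary word s given
    document word d."""
    doc_vocab = {d for doc, _ in pairs for d in doc}
    t = defaultdict(lambda: 1.0 / max(len(doc_vocab), 1))  # uniform initialization
    for _ in range(iterations):
        count, total = defaultdict(float), defaultdict(float)
        for doc, summ in pairs:               # E-step: expected alignment counts
            for s in summ:
                z = sum(t[(s, d)] for d in doc)
                for d in doc:
                    c = t[(s, d)] / z
                    count[(s, d)] += c
                    total[d] += c
        for (s, d), c in count.items():       # M-step: renormalize
            t[(s, d)] = c / total[d]
    return t

# toy usage: two (document, summary) pairs
pairs = [("the senate passed the bill".split(), "senate passes bill".split()),
         ("the bill failed".split(), "bill fails".split())]
t = model1_em(pairs)
print(t[("bill", "bill")], t[("bill", "senate")])
```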

So let's say that the alignments worked. Now what? We could decode, of course, by intersecting the learned alignment model with a language model. I think this would be a really bad idea, essentially because I don't think there's enough information in the alignments to actually produce new summaries; just enough to get reasonable alignments.

So now we've got a million document/summary pairs that have been aligned. What now?

You could say "learn to create abstracts", but I'm actually not particularly thrilled with this idea, either. (Why? Well, there's a long story, but the basic problem is that if you ask humans to write summaries, they're lazy. What this means is that they do a lot of word copying, at least up to the stem. If you look in the alignments paper, there are some results that say that over half of the words in the summary are aligned to identical words (stems) in the document, even with a human doing the alignment. What this means is that if you're using a lexically-based scoring method, like Rouge, odds are against you if you ever change a word because chances are the human writing the summary didn't change it.)

You could suggest trying to learn to do compression, which is probably what I'd look at most seriously. But I also don't think we have a really good understanding of this. In the Searn paper, we show how to use Searn to compute compressions, but to be honest it's really slow and I don't think it's really scalable to 1 million doc/sum pairs. But I suppose I would probably start with something like that.

But that's what I would do. What would you do?

(Incidentally, if you want to ask me: "do you have a million summaries?" the answer is "no." I have about 120,000. But there are some complications with this data. Maybe I'll post about it in the near future.)

12 March 2008

ACL papers up

As Chris Brew pointed out, ACL accepts have been posted. In keeping with tradition, here are some of the top key words (stemmed, with stop words, including things like "model" or "based", removed); a rough sketch of how one might compute such counts appears after the list:

  • translat (16)
  • learn (14)
  • word (12)
  • gener (12)
  • evalu (10)
  • unsupervis (9)
  • pars (9)
  • machin (9)
  • phrase (8)
  • languag (8)
  • grammar (8)
  • depend (8)
  • automat (8)
  • segment (7)
  • web (6)
  • text (6)
  • supervis (6)
  • semant (6)
  • question (6)
  • featur (6)
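
For the curious, the counting itself is nothing fancy; something like the following sketch does the job (the stop list and the crude suffix-stripping "stemmer" here are stand-ins for illustration, not what was actually used):

```python
import re
from collections import Counter

# hypothetical stop list, padded with near-content-free words like "model"/"based"
STOP = {"a", "an", "and", "for", "of", "the", "to", "with", "in", "on",
        "using", "via", "model", "models", "based", "approach"}

def crude_stem(word):
    """A very crude suffix stripper; a real Porter stemmer would do better."""
    for suffix in ("ation", "ing", "ion", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def top_keywords(titles, k=20):
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z]+", title.lower())
        counts.update(crude_stem(w) for w in words if w not in STOP)
    return counts.most_common(k)

print(top_keywords(["Unsupervised Learning of Word Segmentation",
                    "Phrase-Based Machine Translation with Syntax"]))
```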
I'm happy to say that a cursory glance at the list turns up at least a handful of topics that I don't consider the normal ACL fodder. I'm also happy to say that, as chair for the summarization track, we had a number of excellent summarization papers this year.

20 February 2008

ACL 2008 Workshops and Tutorials

The list of workshops and tutorials for ACL is now up. There are actually two separate workshops on MT! I'm particularly excited to see that BioNLP is continuing.

10 February 2008

Kernels, distances and strings

Kernels and distances are closely related. Given a kernel K(x,z), this induces an RKHS H such that K(x,z) = <f(x), f(z)>, where f is the mapping from the input space to H, and <.,.> is the dot product in H. Dot products induce norms in the obvious way: ||f(x)||^2 = K(x,x). This, in turn, induces an obvious distance metric: d(x,z)^2 = K(x,x) + K(z,z) - 2K(x,z).

On the other hand, we also know how to turn distance metrics into kernels. If (X,d) is a metric space (i.e., X is a space and d is a metric on X), then K(x,z)=exp(-d(x,z)) is positive semi-definite when x,z are drawn from X.
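
As a quick illustration of both constructions (and of iterating them, which is the second question below), here is a minimal sketch with an RBF kernel standing in for K:

```python
import math

def kernel_to_distance(K, x, z):
    """Distance induced by a kernel: d(x,z)^2 = K(x,x) + K(z,z) - 2 K(x,z)."""
    return math.sqrt(max(K(x, x) + K(z, z) - 2 * K(x, z), 0.0))

def distance_to_kernel(d, x, z):
    """Kernel induced by a metric: K(x,z) = exp(-d(x,z))."""
    return math.exp(-d(x, z))

# start from an RBF kernel on scalars...
rbf = lambda x, z: math.exp(-(x - z) ** 2)
# ...induce a distance from it, then iterate the construction once
d  = lambda x, z: kernel_to_distance(rbf, x, z)
k2 = lambda x, z: distance_to_kernel(d, x, z)
print(rbf(0.0, 1.0), d(0.0, 1.0), k2(0.0, 1.0))
```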

Now, there are three questions that arise. The first is: in the distance->kernel setting, what does the RKHS "look like"? I think the sense here is that it's basically the same as the RKHS induced by the Gaussian kernel: it's an infinite-dimensional feature space where different dimensions look like distances to some point. In practice, this "some point" is going to be one of the training points.

The second question follows from the fact that we can iterate this construction. One way to ask it is: if we use a kernel to induce a distance, then use this distance to induce a new kernel, what does this new kernel look like? Or, alternatively, if we have a distance, kernelize it to induce a new distance, then what does this distance look like? I don't have a good intuitive sense for the answer to any of these. Obviously it's straightforward to write out what these things look like, but that doesn't directly give me a sense of what's going on.

The final question is a bit "off topic" from the above two, but still relevant. There's been a lot of work on kernels for discrete data, like strings. The most popular string subsequence kernel is based on counting how many subsequences match between two strings, down-weighted by the length of the subsequences. It's well known, and recognized in string kernel papers, that a kernel of the form K(x,z) = 1 - ned(x,z), where ned is a normalized string edit distance, is not a valid kernel. However, from what we know about distance metrics, K(x,z) = exp(-ned(x,z)) should be a valid kernel. Yet I'm not aware of this being used anywhere. Is there any reason why? The reason I ask is that it's much more intuitive than the subsequence kernel, which I only have a vague sense about.
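
Here is a minimal sketch of what such a kernel would look like; normalizing by the longer string's length is just one choice of "ned", and I'm not claiming anything here about whether this particular normalization preserves the properties needed for positive semi-definiteness:

```python
import math

def edit_distance(x, z):
    """Standard Levenshtein distance via dynamic programming (two rows)."""
    prev = list(range(len(z) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cz in enumerate(z, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cx != cz)))   # substitution
        prev = cur
    return prev[-1]

def ned(x, z):
    """Normalized edit distance: divide by the longer string's length."""
    return edit_distance(x, z) / max(len(x), len(z), 1)

def string_kernel(x, z):
    """K(x,z) = exp(-ned(x,z)), the construction discussed above."""
    return math.exp(-ned(x, z))

print(string_kernel("kernel", "colonel"))
```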

03 February 2008

The behemoth, PubMed

The friend I crashed with while attending SODA is someone I've known since we were five years old. (Incidentally, there's actually someone in the NLP world who I've known from even earlier...small world.) Anyway, the friend I stayed with is just finishing med school at UCSF and will soon be staying there for residency. His specialty is neurosurgery, and his interests are in neural pathologies. He spent some time doing research on Alzheimer's disease, effectively by studying mice (there's something I feel sort of bad about finding slightly amusing about mice with Alzheimer's disease). Needless to say, in the process of doing research, he made nearly daily use of PubMed. (For those of you who don't know, PubMed is like the ACL anthology, but with hundreds of thousands of papers, with new ones being added by the truckload daily, and with a bunch of additional things, like ontologies and data sets.)

There are two things I want to talk about regarding PubMed. I think both of these admit very interesting problems that we, as NLPers, are qualified to tackle. I think the most important thing, however, is opening and maintaining a wide channel of communication. There seems to be less interaction between people who do (for instance) bio-medical informatics (we have a fairly large group here) and what I'll term as mainstream NLPers. Sure, there have been BioNLP workshops at ACLs, but I really think that both communities would be well-served to interact more. And for those of you who don't want to work on BioNLP because it's "just a small domain problem", let me assure you: it is not easy... don't think of it in the same vein as a true "sublanguage" -- it is quite broad.

I suppose I should give a caveat that my comments below are based on a sample size of one (my friend), so it may not be totally representative. But I think it generalizes.

Search in PubMed, from what I've heard, is good in the same ways that web search is good and bad in the same ways that web search is bad. It is good when you know what you're looking for (i.e., you know the name for it) and bad otherwise. One of the most common sorts of queries that my friend wants to do is something like "show me all the research on proteins that interact in some way with XXX in the context of YYY," where XXX is (eg) a gene and YYY is (eg) a disease. The key is that we don't know which proteins these are and so it's hard to query for them directly. I know that this is something that the folks at Penn (and probably elsewhere) are working on, and I get the impression that a good solution to this problem would make lots and lots of biologists much happier (and more productive). One thing that was particularly interesting, however, is that he was pretty averse to using structured queries like the one I gave above. He effectively wants to search for "XXX YYY" and have it realize that XXX is a gene, YYY is a disease, and that it's "obvious" that what he wants is proteins that interact with (or even, for instance, pathways that contain) XXX in the context of disease YYY. On the other hand, if YYY were another gene, then probably he'd be looking for diseases or pathways that are regulated by both XXX and YYY. It's a bit complex, but I don't think this is something particularly beyond our means.

The other thing I want to talk about is summarization. PubMed actually archives a fairly substantial collection of human-written summaries. These fall into one of two categories. The first, called "systematic reviews," are more or less what we would think of as summaries. However, they are themselves quite long and complex; they're really not anything like sentence extracts. The second, called "meta analyses," are really not like summaries at all. In a meta analysis, an author will consider a handful of previously published papers on, say, the effects of smoking on lifespan. He will take the data and results published in these individual papers, and actually do novel statistical analyses on them to see how well the conclusions hold.

From a computational perspective, the automatic creation of meta analyses would essentially be impossible, until we have machines that can actually run experiments in the lab. "Systematic reviews", on the other hand, while totally outside the scope of our current technology, are things we could hope to do. And they give us lots of training data. There are somewhere around ten to forty thousand systematic reviews on PubMed, each about 20 pages long, and each with references back to papers, almost all of which are themselves in PubMed. Finding systematic reviews older than a few years ago (when they began being tagged explicitly) has actually sprouted a tiny cottage industry. And PubMed nicely makes all of their data available for download, without having to crawl, something that makes life much easier for us.

My friend warns that it might not be a good idea to use all systematic reviews, but only those from top journals. (They tend to be less biased, and better written.) However, in so far as I don't think we'd even have hope of getting something as good as a systematic review from the worst journal in the world, I'm not sure this matters much. Maybe all it says is that for evaluation, we should be careful to evaluate against the top.

Now, I should point out that people in biomedical informatics have actually been working on the summarization problem too. From what I can tell, the majority of effort there is on rule-based systems that build on top of more rule-based systems that extract genes, pathways and relations. People at the National Library of Medicine, Rindflesch and Fiszman, use SemRep to do summarization, and they have tried applying it to some medical texts. Two other people that I know are doing this kind of work are Kathy McKeown and Greg Whalen, both at Columbia. The Columbia group has access to a medically informed NLP concept extractor called MedLEE, which gives them a leg up on the low-level processing details. If you search for 'summarization medical OR biomedical' in Google Scholar, you'll get a raft of hits (~9000).

Now, don't get me wrong -- I'm not saying that this is easy -- but for summarization folks who are constantly looking for "natural artifacts" of their craft, this is an enormous repository.

22 January 2008

An NLPer in the Algorithmists Court

I just returned from SODA (the Symposium on Discrete Algorithms). Obviously (I suppose), I didn't have a paper there, but I was interested in learning new things. Moreover, it was in SF, which is a short, cheap flight and admits free housing by crashing with one of my very close friends. My fellow professorial newbie, Suresh Venkatasubramanian, ran a seminar this past Fall on approximate high-dimensional geometry that I attended. This is a topic that is excruciatingly close to many areas of machine learning, and I suspect that there will be a lot of cross-pollination in the next few years. I'll try to get around at some point to post about things I learned in the seminar. Some are more ML-related, some more NLP. Anyway, I wanted to (a) see what a theory conference was like and (b) learn about the latest, greatest techniques. (Sadly, I had to miss the last day because I teach early Tuesday mornings.)

Below are a few of the papers I saw that I liked enough to put them on my "to read" list. I'm not quite sure what my qualification for "liked" is, so take this with a huge grain of salt. If you want more authoritative opinions, see either Suresh's blog or one of the other theory blogs. I also warn you guys that most of these things really have nothing to do with NLP.

The first invited talk was by Bonnie Berger, talking about computational biology. Most of the topics were pretty standard: RNA/protein structure prediction, protein-protein interaction graphs, etc. There are some surface connections to NLP (eg., some other people use stochastic CFGs to do structure prediction, though she does not). The twist that she is taking is to try to solve these problems simultaneously for more than two organisms (typically: yeast, worm, fly, mouse, human; these are the organisms for which we have the most data). The rationale is based on the hypothesis that if a structure is biologically important, then it is (roughly) conserved across species. I feel like a lot of linguistics is based on roughly the same hypothesis, and this is probably a topic I'll try to blog about in the near(ish) future.

The second invited talk was by Persi Diaconis, who I've now seen give two talks and I just love him (I wish I had been able to take probability from him as an undergrad). At a very high level, his talk was a bit of evangelizing for functional combinatorics, which is essentially a paradigm for understanding combinatorial problems that unifies a large body of disparate problems in terms of the analysis of symmetric polynomials (which have, themselves, a rich theory). The particular example he gave was a relationship between carrying (like, what you do when you add numbers with pen and paper) and shuffling. I confess I learned a lot more about shuffling (which seems to be a bit of a love for Persi), but it was still pretty interesting. Not enough for me to learn functional combinatorics, but still interesting.

One of my favorite papers was Fast dimension reduction using Rademacher series on dual BCH codes by Nir Ailon and Edo Liberty. Johnson-Lindenstrauss gives us a method of embedding N points in D dimensions into K<<D dimensions that preserves l2 distances with high probability, using random projection matrices (there is lots of follow-on work, too). One problem is that these methods require storing a D*K matrix and each projection takes D*K operations. The result of this paper is that we can get the same results using much smaller space and time for each projection (or, in certain boundary cases, a matched space/time complexity). The key idea is to force oneself to use a random diagonal K-matrix composed with "something else" that gives us the desired property. The paper analyzes what properties the "something else" must have, and then constructs such a (deterministic) matrix based on Fourier codes. I thought it was both theoretically interesting and likely to be quite practical (though time will tell on the latter).
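
For reference, the baseline being sped up here is the plain dense Gaussian projection, which looks roughly like this sketch (the paper's contribution is replacing the dense matrix with a structured diagonal-times-Fourier-like one so that both storage and per-projection time shrink):

```python
import numpy as np

def jl_project(X, k, seed=None):
    """Naive Johnson-Lindenstrauss projection with a dense D x K Gaussian
    matrix; pairwise l2 distances are preserved up to (1 +/- eps) with high
    probability when K is on the order of log(N) / eps^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    R = rng.standard_normal((d, k)) / np.sqrt(k)   # O(D*K) storage, O(D*K) per projection
    return X @ R

# toy check: compare a pairwise distance before and after projection
X = np.random.default_rng(0).standard_normal((5, 1000))
Y = jl_project(X, 64, seed=1)
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(Y[0] - Y[1]))
```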

Another paper I really liked was Declaring independence via the sketching of sketches by Piotr Indyk and Andrew McGregor. The problem they are tackling is trying to determine whether two variables are correlated, when the variables come in pairs in a stream. (For those who don't know streaming, think of it as an algorithm that gets one data point at a time and has to do something quickly using small space.) The two quantities that are relevant are the joint distribution over the pairs, and the product-of-marginals distribution over the pairs. If these are very similar, the two items in the pair are likely to be independent. They measure distance in three ways: Euclidean (l2) distance, variational (l1) distance, and mutual information. For l2, they have a very clever analysis that is very straightforward to understand if you know anything about sketching (it's basically a natural extension of previous results by Indyk). They are able to get pretty good results. They have similar results for the other metrics. Again, I can already think of a handful of possible applications of this stuff. It also leaves open an interesting question: what if you get k-tuples (instead of pairs) and you want to extract the M most-correlated pairs? I suspect you can do this efficiently using a combination of this result with known results on frequent item sets. This would have immediate application to Bayes net structure learning, among other things.

Venkatesan Guruswami, James Lee and Alexander Razborov had a paper on Almost Euclidean subspaces of l_1^N via expander codes (I can't find a link because Google doesn't like latex codes) that I thought was both interesting and very well presented. This paper is a bit harder for me to get across (much less understand!). At a high level, their observation is that for a lot of metric embedding problems, the only known way to get a good embedding (i.e., one that preserves distances or norms) is to use a randomized construction (and then prove that it happens with high probability). Their result is a deterministic, constructive embedding matrix for embedding l1 into l2 (under some qualifications). The other plus of this method is that the embedding matrix is sparse, which means that the images of the vectors under the embedding are sparse. There are also interesting connections to compressed sensing, which are mentioned in the paper.

There were other papers that I liked, but that I'm not going to try to summarize. Timothy Chan had an impressive (if somewhat ugly) result On the bichromatic k-set problem, which is effectively the problem of trying to figure out how many bad halfspace classifiers there are for labeled data (translated into machine learning lingo). Hubert Chan, Anupam Gupta and Kunal Talwar had a nice result on Ultra-low-dimensional embeddings for doubling metrics (I could only find a previous version that appeared in a NIPS workshop) that shows that if your metric doesn't have nearly uniform volume (characterized by the doubling dimension) then you can embed into Euclidean space with low distortion (this is somewhat surprising: classic results of Bourgain show that in general this is not possible). Despite the niceness of their result, the thing I found most interesting in the talk was a reference to a construction due to Satish Rao on Small distortion and volume preserving embeddings for planar and Euclidean metrics that they basically are extending.

Overall, I enjoyed the conference, despite getting pretty lost in a bunch of the talks. I wish I could have stayed for the last day, since there are a handful of papers that I think would probably be interesting. But I suppose that's what this 1300-page proceedings they dumped on me is for. In particular, the ones that look interesting from titles and skimming are: Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm (actually, I've read the TR version of this paper and it's very good); Sampling algorithms and coresets for l_p regression (only considers the N>>D case, which is a bit limiting); and Linked decompositions of networks and power of choice in Polya urns, which has interesting connections to (shockingly) Polya urn schemes. In particular, they show that if you have a Polya urn where in each step you sample two urns proportional to their occupancy, but then insert into the smaller of the two, then you get a balanced distribution. This is in contrast to typical Polya urn schemes that show up in Dirichlet processes and Pitman-Yor processes, where you get power law distributions. (The "power of two" reference is to hashing strategies where you hash by two independent hash functions and insert into the less occupied of the two bins.)

(p.s., I apologize if I misquoted the exact results in any of these papers; most of this post is based on memory and very short notes I took.)

19 January 2008

We're hiring machine learning/robotics...

Just a quick note because this is a bit off-topic for this blog, but I wanted to let you all know that this year, we're trying to hire a machine learning/robotics person. If you do machine learning that is in any way robotics-related, I'd encourage you to apply. (I suppose the same holds the other way around, but I probably wouldn't be quite as excited.) The "deadline for full consideration" is listed as Jan 15, which has passed, but if you apply by the end of the month, things would be fine. In the interest of full disclosure, we're also hiring in graphics, systems and sensor networks.

17 January 2008

Syntax-based and syntax-informed translation

Happy new year, all... Apologies for being remiss about posting recently. (Can I say "apologies about my remission"?) This post is a bit more of a review of what's out there, rather than trying out new ideas. Citations are not meant to get anywhere close to full coverage.

There are a handful of ways of viewing the syntactic MT problem, essentially delineated by which side(s) of the translation the tree appears on, and whether the tree is induced entirely from data or whether it's induced from a treebank. There's additionally the constituent- versus dependency-tree issue.

  1. String-to-tree (constituent). Here, we have a string on the input and a tree on the output. In Chinese-to-English translation, this would mean that at translation time, we're essentially trying to parse Chinese into an English tree by means of local reordering. This was essentially what Yamada and Knight were doing back in 2001 and 2002. In order to train such a model, we need (a) parallel data and (b) a parser on the target language. Translation is typically done by some variant of CKY.
  2. Tree-to-string (constituent). Here, we map an input tree to an output string. In C2E, this means that at translation time, we first parse the Chinese, then reorder and flatten it out into an English string. This is essentially what the JHU workshop on adding syntactic features to phrase-based translation was trying to do (though in somewhat of a weak way), and also what Liang Huang has been doing lately. In order to train, we need (a) parallel data and (b) a parser on the source side. Translation can be done any way you want, though once parsing is done, it's usually a lot easier than running CKY.
  3. Tree-to-tree (constituent). Here, we map an input tree to an output tree. In C2E, this means that at translation time, we first parse the Chinese, then reorder the tree and translate subtrees/leaves into English. Koehn and Collins have worked on this problem. In order to train, we need (a) parallel data, (b) a parser on the source side and (c) a parser on the target side. Translation is similar to Tree-to-string.
  4. String-to-tree-to-string (constituent). I'm not sure if there's agreed-upon terminology for this task, but the idea here is to translate without having a parser on either side. At translation time, we essentially parse and translate simultaneously (rather like string-to-tree). But at training time, we don't have access to source trees: we have to induce them from just (a) parallel data. This is typified by Wu's inverse transduction grammars and more recently by David Chiang's Hiero system (plus others).
  5. Any of the above but on dependency trees. To my knowledge, the only groups that are seriously working on this are Microsoft (see, eg., Chris Quirk) and our friends at Charles University.
Below is my attempt to categorize the pros and cons of these different approaches along a wide variety of metrics. The metrics are: resource usage (how much labeled data do we need), training ease (computationally), testing ease (computationally), ngram LM (is it easy to include a target ngram language model), syntactic LM (is it easy to include a target syntactic language model), minimal assumptions (does the model make strong assumptions about, for instance, the degree of matchy-ness between source and target?), extensibility (is it easy to add additional features to the input side, such as NE info without breaking things).


                S2T    T2S    T2T    S2T2S
resource        med    med    bad    good
train-ease      good   good   good   bad
test-ease       bad    good   good   bad
ngram-lm        med    good   med    good
syntax-lm       good   bad    good   bad
assumptions     good   good   bad    good
extensibility   med    good   good   bad

I didn't include dependencies on this list because it essentially depends on what the underlying model is, and then remains the same as in the constituent case. The only caveat here is that syntactic language modeling with dependency trees is not as well proven as with constituency trees.

As far as syntactic language modeling goes, pretty much this is easy if you are producing trees on the target; otherwise, not. Tree-to-tree gets the unique distinction of being bad with respect to assumptions, essentially because you have to do tree-matching which is hard to do well. You could argue that certain forms of s2t2s should be here too (Wellington, Turian and Melamed had a paper on the assumptions that ITG makes at ACL 2006).

In terms of resource usage, T2T is obviously the worst and S2T2S is obviously the best. Whether S2T or T2S is better depends on the language pair.

There's of course the issue of which work best. For this, I don't really know (I'm sure people involved in MTeval and GALE can chime in here, though of course these competitions are always "unfair" in the sense that they also measure engineering ability).

At this point, I'd like to relate this back to the recent discussion of Translation out of English. One thing to notice is that when translating into English, S2T has the advantage that all we need is an English parser (there are a few of those lying around if you search hard enough) whereas T2S requires a foreign parser. However, when translating from English, T2S starts looking a lot more attractive because we have good English parsers. Actually, we have a lot of good annotations for English (NE coref, for instance; discourse parsing to some degree; etc.). From this perspective, the relative ease of extensibility of T2S models begins to look pretty attractive. That is to say, if I were actually to work myself on translation out of English, I would probably seriously consider T2S models. Of course, if you're translating from X to Y, where neither X nor Y is English, then you're kind of hosed and might prefer S2T2S.

27 December 2007

Those Darn Biologists...

I've recently been talking to a handful of biologists. The lab/bench kind, not the computational kind. As an experiment in species observation, they intrigue me greatly. I'm going to stereotype, and I'm sure it doesn't hold universally, but I think it's pretty true and several (non-biologist) friends have confirmed this (one of whom is actually married to a biologist). The reason they intrigue me is because they are wholly uninterested in techniques that allow you to do X better, once they already have a solution for X.

This is a bit of an exaggeration, but not a huge one (I feel).

It is perhaps best served by example. Take dimensionality reduction. This is a very very well studied problem in the machine learning community. There are a bajillion ways to do it. What do biologists use? PCA. Still. And it's not that people have tried other things and they've failed. Other things have worked. Better. But no one cares. Everyone still uses PCA.

Part of this is probably psychological. After using PCA for a decade, people become comfortable with it and perhaps able to anticipate a bit of what it will do. Changing techniques would erase this.

Part of this is probably cultural. If you're trying to publish a bio paper (i.e., a paper in a bio journal) and you're doing dimensionality reduction and you aren't using PCA, you're probably going to really have to convince the reviewers that there's a good reason not to use PCA and that PCA just plain wouldn't work.

Now, that's not to say that biologists are luddites. They seem perfectly happy to accept new solutions to new problems. For instance, still in the dimensionality reduction setting, non-negative matrix factorization techniques are starting to show their heads in microarray settings. But it's not because they're better at solving the same old problem. What was observed was that if you have not just one, but a handful of microarray data sets, taken under slightly different experimental conditions, then the variance captured by PCA is the variance across experiments, not the variance that you actually care about. It has been (empirically) observed that non-negative matrix factorization techniques don't suffer from this problem. Note that it's only fairly recently that we've had such a surplus of microarray data that this has even become an issue. So here we have an example of a new problem needing a new solution.

Let's contrast this with the *ACL modus operandi.

There were 132 papers in ACL 2007. How many of them do you think introduced a new problem?

Now, I know I'm being a bit unfair. I'm considering dimensionality reduction across different platforms (or experimental conditions) as a different problem from dimensionality reduction on a single platform. You could argue that these really are the same problem. I just kind of disagree.

Just as biologists tend to be wary of new techniques, so we tend to be wary of new problems. I think anyone who has worked on a new problem and tried to get it published in a *ACL has probably gone through an experience that left a few scars. I know I have. It's just plain hard to introduce a new problem.

The issue is that if you work on an age-old problem, then all you have to do is introduce a new technique and then show that it works better than some old technique. And convince the reviewers that the technique is interesting (though this can even be done away with if the "works better" part is replaced with "works a lot better").

On the other hand, if you work on a new problem, you have to do a lot. (1) You have to introduce the new problem and convince us that it is interesting. (2) You have to introduce a new evaluation metric and convince us that it is reasonable. (3) You have to introduce a new technique and show that it's better than whatever existed before.

The problem is that (1) is highly subjective, so a disinterested reviewer can easily kill such a paper with a "this problem isn't interesting" remark. (2) is almost impossible to do right, especially the first time around. Again, this is something that's easy to pick on as a reviewer and instantly kill a paper. One might argue that (3) is either irrelevant (being a new problem, there isn't anything that existed before) or no harder than in the "age old problem" case. I've actually found this not to be true. If you work on an age old problem, everyone knows what the right baseline is. If you work on a new problem, not only do you have to come up with your own technique, but you also have to come up with reasonable baselines. I've gotten (on multiple occasions) reviews for "new problems" papers along the lines of "what about a baseline that does XXX." The irony is that while XXX is often creative and very interesting, it's not really a baseline and would actually probably warrant its own paper!

That said, I didn't write this post to complain about reviewing, but rather to point out a marked difference between how our community works and how another community (that is also successful) works. Personally, I don't think either is completely healthy. I feel like there is merit in figuring out how to do something better. But I also feel that there is merit in figuring out what new problems are and how to solve them. The key problem with the latter is that the common reviewer complaints I cited above (problem not interesting or metric not useful) often are real issues with "new problem" papers. And I'm not claiming that my cases are exceptions. And once you accept a "new problem" paper that is on a problem that really isn't interesting, you've opened Pandora's box and you can expect to see a handful of copycat papers next year (I could, but won't, give a handful of examples of this happening in the past 5 years). And it's really hard to reject the second paper on a topic as being uninteresting, given that the precedent has been set.

18 December 2007

Particle filtering versus beam search

I had a very interesting discussion at NIPS with Vikash Mansingka about beam search and particle filtering. Much of what is below is a result of this conversation and he deserves much credit.

In a very real sense, particle filtering can be seen (for many models) as a sort of stochastic beam search. Just to set the stage, let's talk about generic beam search:

1. Initialize a min-queue "beam" to contain a single initial hypothesis.
2. While the beam contains a non-complete hypothesis:
   A. Take h, the lowest-scoring hypothesis, off the beam.
   B. For each h' in the neighborhood of h:
      I. Add h' to the beam, with some score.
   C. If the beam exceeds a certain capacity, drop the high-scoring elements.
3. Return the lowest-scoring hypothesis from the beam.
The key variant in beam search algorithms is the score function that's used. The most naive is just path cost---how much did we have to pay to get to the current hypothesis? The "A*" way is to use path cost + admissible heuristic cost, where the admissible heuristic is an underestimate of the amount of cost remaining to complete.

There are (of course) other variants: sometimes we expand the whole beam in one step; sometimes we use multiple beams; etc. We'll come back to these shortly.

Particle filtering (using as similar terminology to the beam search as possible), on the other hand, looks something like:
1. Initialize a max-queue "beam" to contain a single initial hypothesis.
2. While the beam contains a non-complete hypothesis:
   A. Take h, the highest-scoring hypothesis, off the beam.
   B. Sample h' from the neighborhood of h:
      I. Add h' to the beam, with some score.
   C. If the beam exceeds a certain capacity, randomly drop elements.
   D. If the "effective sample size" of the beam is too low, resample the elements.
3. Return the highest-scoring hypothesis from the beam.
To map beam search to particle filtering, use negative log probabilities for scores. (This turns max search into min search.)

In a sense, the major thing that differs between PF and BS is that in PF everything happens randomly -- we're always sampling from a proposal distribution, rather than greedily taking the "best" hypotheses at each step.

There are, however, some minor issues: what proposal distribution is used in step 2B of PF, and what resampling does.

In general, the proposal distribution should be the true posterior probability, but this is pretty much always intractable (otherwise we wouldn't be using approximations at all). The most common thing to do is to use a transition probability, which corresponds exactly to using a path cost in beam search. However, just as we know that using a heuristic can help in beam search, this suggests that using a proposal distribution in PF that contains something like a future-cost heuristic is going to be helpful. I'm sure this has been done, but I don't think it's nearly as common as it perhaps should be, given how helpful it is in BS.

What step 2D in PF does is to measure whether the samples that we have in the beam have fallen out of the true posterior---in other words, whether we are "off the mark." There's a straightforward way to compute this (see the Wikipedia page on effective sample size). If we are off the mark, we resample all the elements in our beam, essentially trying to get back on the right path. My experience with PF is that this is pretty essential to getting things to work, but as far as I know there's no analogue in BS land.
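
For concreteness, here is a minimal sketch of the effective sample size computation and the resampling step; the 0.5 threshold is an arbitrary but common choice:

```python
import numpy as np

def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2) for normalized weights: ranges from 1 (one
    particle carries all the mass) up to len(weights) (uniform weights)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

def maybe_resample(particles, weights, threshold=0.5, seed=None):
    """If the ESS drops below threshold * N, resample particles in proportion
    to their weights and reset the weights to uniform."""
    rng = np.random.default_rng(seed)
    n = len(particles)
    if effective_sample_size(weights) < threshold * n:
        w = np.asarray(weights, dtype=float)
        idx = rng.choice(n, size=n, p=w / w.sum())
        particles = [particles[i] for i in idx]
        weights = [1.0 / n] * n
    return particles, weights

# toy usage: one particle dominates, so resampling kicks in
parts, ws = maybe_resample(["h1", "h2", "h3", "h4"], [0.94, 0.02, 0.02, 0.02])
print(parts, ws)
```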

It may be that these two things (resampling and heuristic) are two different ways of solving the same problem and that doing both is not so helpful. But I tend to doubt it.

One other thing that I've never seen done in PF is to maintain multiple beams. Admittedly this makes the resampling step harder (maybe...I haven't worked through it), but it may be helpful. One thing, for instance, that we found in the PF for the coalescent was that the PF tends to like to merge clusters early with no regard for how hard life is going to be later. We use resampling to try to fix this, but another way would be to use multiple beams. This is commonly done in, eg., machine translation, where a single beam doesn't do well, so we maintain one beam for each number of foreign (or English) words covered.

So what's the point? I think that by recognizing the similarity between BS and PF, we can see ways to potentially improve both. For instance, the following seem promising:
  1. Use resampling techniques in beam search
  2. Use multiple beams in particle filtering
  3. Use A*-style heuristics in the proposal distribution for particle filtering
An alternative suggestion would be for NLPers to pay more attention to particle filtering. It's a nice, clean probabilistic analogue to a technique that we love (or at least are forced to use).

12 December 2007

NIPS retrospective

I got back from NIPS on Sunday; sadly I got sick on the last day. Despite this, it was by far my most productive conference ever. Admittedly, I made a point to try to make it productive (being somewhat geographically isolated means that it's in my best interest), but even so, I had a fantastic time. There are several ideas that I plan on blogging on in the near future, but for now, I'll focus on the actual conference program. The standard disclaimer applies: I didn't see everything and just because something's not listed here doesn't mean I didn't enjoy it. I'll try to limit my commentary to papers that are relevant to the NLP audience, but I may stray. I'll conclude with a bit of my sense of what was in and out this year.

  1. Kristina Toutanova and Mark Johnson had a topic-model style paper for doing semisupervised POS tagging. There are two new things here. First, they actually model context (yay!) which we know is important in language. Second, they introduce a notion of a confusion class (what possible POS tags can a word get) and actually model this. This latter is something that makes sense in retrospect, but is not obvious to think of (IMO). They actually get good results (which is non-trivial for topic models in this context).
  2. Alex Bouchard-Cote and three more of the Berkeley gang had a paper on language change. If you saw their paper at ACL, they're attacking a pretty similar problem, by looking at phonological divergences between a handful of languages. I'd love to be able to reconcile their approach with what I've been working on---I use a huge database of typological knowledge; they use word forms...I'd love to use both, but I really like working with thousands of languages and it's hard to find corpora for all of these :).
  3. John Blitzer had a good paper (with other Penn folks) on learning bounds for domain adaptation. If you care at all about domain adaptation, you should read this paper.
  4. Alex Kulesza and Fernando Pereira had a great paper on what happens if you try to do structured learning when the underlying prediction (inference) algorithm is approximate. They show that things can go very wrong in widely different ways. This is a bit of a negative results paper, but it gives us (a) a strong sense that proceeding with a generic learning algorithm on top of an approximate inference algorithm is not okay and (b) a few ideas of what can actually go wrong.
  5. Yee Whye Teh, Kenichi Kurihara and Max Welling continue their quest toward collapsed variational models for all problems in the world by solving the HDP. (For those who don't know, you can think of this as LDA with a potentially infinite number of topics, but of course the real case is more interesting.)
  6. Ali Rahimi and Benjamin Recht presented a very cool analysis of kernel methods when your kernel is of the form K(x,z) = f(x-z). This is the case for, eg., the RBF kernel. This is something that I've wanted to try for a while, but to be honest my poor recollection of my analysis training left me a bit ill-equipped. Essentially they apply a Fourier analysis to such kernels and then represent the kernel product K(x,z) as a dot product in a new input space, where kernelization is no longer required. The upshot is that you can now effectively do kernel learning in a linear model in primal space, which is very nice when the number of examples is large. (A small sketch of the construction appears after this list.)
  7. David Heckerman gave a really nice invited talk on graphical models for HIV vaccine design. What he was particularly interested in was analyzing whether standard methods of determining correlation between events (eg., "do I have HIV given that I have some genomic signal?") work well when the data points are not independent, but rather belong to, eg., some genetic population. This is effectively what Lyle Campbell and I were doing in our ACL paper last year on typological features. Curing HIV might be a bit more of an interesting application, though :). Essentially what David and his colleagues do is to build a hierarchy on top of the data points and then let this explain some of the interactions and then figure out what isn't explained by this hierarchy. In the linguistic sense, this is effectively separating historical from areal divergences. I'll have to read up more on what they do, precisely.
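
Here is a minimal sketch of the random-features construction from the Rahimi and Recht paper mentioned in item 6, specialized to the RBF kernel; the number of features and the bandwidth are arbitrary choices for illustration:

```python
import numpy as np

def random_fourier_features(X, num_features=500, gamma=1.0, seed=None):
    """Map X (n x d) into an explicit feature space where the plain dot
    product approximates the RBF kernel K(x,z) = exp(-gamma * ||x - z||^2),
    exploiting the shift-invariance K(x,z) = f(x - z)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

# sanity check: exact kernel value versus the randomized approximation
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 10))
Z = random_fourier_features(X, num_features=2000, gamma=0.5, seed=1)
exact = np.exp(-0.5 * np.linalg.norm(X[0] - X[1]) ** 2)
print(exact, Z[0] @ Z[1])
```

The point is the last line: once the features are explicit, a plain linear learner in primal space stands in for the kernelized one.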
There were several themes that stuck out at NIPS this year, though they aren't represented in the above list. Probably the biggest one is recommender systems. Just search the titles for "matrix" and you'll come up with a handful of such systems. I have to believe that this is spurred, at least in part, by the NetFlix challenge. Another theme was deep belief networks, which even had a rogue workshop dedicated to them. I have a feeling we'll be seeing more and more of these things. A final theme was randomized algorithms, particularly as applied to large scale learning. I think this is a great direction, especially at NIPS, where, historically, things have been less focused on the large scale.

My only serious quibble with the program this year is that I saw a handful of papers (and at least one that got an oral and one a spotlight) that really fell down in terms of evaluation. That is, these papers all proposed a new method for solving task X and then proceeded to solve it. The experiments, however, contained only comparisons to algorithms that did not have access to all the information that X had. I'll give an example of something that would fall in this category but I have not seen appear anywhere: I build a structured prediction algorithm and compare only against algorithms that make independent classification decisions. I'm not a huge quibbler over comparing against the absolute state of the art, but it really really irks me when there are no comparisons to algorithms that use the same information, especially when standard (read: they are in any reasonable machine learning textbook) algorithms exist.

30 November 2007

Domain adaptation vs. transfer learning

The standard classification setting involves an input distribution p(X) and a label distribution p(Y|X). Roughly speaking, domain adaptation (DA) is the problem that occurs when p(X) changes between training and test. Transfer learning (TL) is the problem that occurs when p(Y|X) changes between training and test. In other words, in DA the input distribution changes but the labels remain the same; in TL, the input distribution stays the same, but the labels change. The two problems are clearly quite similar, and indeed you see similar (if not identical) techniques being applied to both problems. (Incidentally, you also see papers that are really solving one of the two problems but claim to be solving the other.)

As a brief aside, we actually need to be a bit more specific about the domain adaptation case. In particular, if p(X) changes then we can always encode any alternative labeling function by "hiding" some extra information in p(X). In other words, under the model that p(X) changes, the assumption that p(Y|X) doesn't change is actually vacuous. (Many people have pointed this out; I think I first heard it from Shai Ben-David a few years ago.) It is because of this vacuousness that theoretical work in domain adaptation has been required to make stronger assumptions. A reasonable one is the assumption that (within a particular concept class---i.e., space of possible classifiers) there exists one classifier that doesn't do too badly on either the source or the target domain. This is a stronger assumption than "p(Y|X) doesn't change," but it actually enables us to do stuff. (Though, see (*) below for a bit more discussion on this assumption.)
Now, beyond the DA versus TL breakdown, there is a further breakdown: for which sides of the problem do we have labeled or unlabeled data. In DA, the two "sides" are the source domain and the target domain. In TL, the two sides are task 1 (the "source" task) and task 2 (the "target" task). In all cases, we want something that does well on the target. Let's enumerate the four possibilities:
  1. Source labeled, target labeled (S+T+)
  2. Source labeled, target only unlabeled (S+T-)
  3. Source only unlabeled, target labeled (S-T+)
  4. Source only unlabeled, target only unlabeled (S-T-)
We can immediately throw away S-T- because this is basically an unsupervised learning problem.

The typical assumption in TL is S+T+. That is, we have labeled data for both tasks. (Typically, it is actually assumed that we have one data set that is labeled for both problems, but I don't feel that this is a necessary assumption.)

In DA, there are two standard settings: S+T+ (this is essentially the side of DA that I have worked on) and S+T- (this is essentially the side of DA that John Blitzer has worked on).

Now, I think it's fair to say that any of the T- settings are impossible for TL. Since we're assuming that the label function changes and can change roughly arbitrarily, it seems like we just have to have some labeled target data. (I.e., unlike the case in DA where we assume a single good classifier exists, this assumption doesn't make sense in TL.)

This begs the question: in TL and DA, does the S-T+ setting make any sense?

For DA, the S-T+ setting is a bit hard to argue for from a practical perspective. Usually we want to do DA so that we don't have to label (much) target data. However, one could make a semi-supervised sort of argument here. Maybe it's just hard to come by target data, labeled or otherwise. In this case, we'd like to use a bunch of unlabeled source data to help out. (Though I feel that in this case, we're probably reasonably likely to already have some labeled source.) From a more theoretical perspective, I don't really see anything wrong with it. In fact, any DA algorithm that works in the S+T- setting would stand a reasonable chance here.

For TL, I actually think that this setting makes a lot of sense, despite the fact that I can't come up with a single TL paper that does this (of course, I don't follow TL as closely as I follow DA). Why do I think this makes sense? Essentially, the TL assumption basically says that the labeling function can change arbitrarily, but the underlying distribution can't. If this is true, and we have labeled target data, I see no reason why we would need labeled source data. That is, we're assuming that knowing the source label distribution tells us nothing about the target label distribution. Hence, the only information we should really be able to get out of the source side is information about the underlying distribution p(X), since this is the only thing that stays the same.

What this suggests is that if having labeled source data in TL is helpful, then maybe the problem is really more domain adaptation-ish. I've actually heard (eg., at AI-Stats this year) a little muttering about how the two tasks used in TL work are often really quite similar. There's certainly nothing wrong with this, but it seems like if this is indeed true, then we should be willing to make this an explicit assumption in our model. Perhaps not something so severe as in DA (there exists a good classifier on both sides), but something not so strong as independence of labeling distributions. Maybe some assumption on the bound of the KL divergence or some such thing.

How I feel at this point is basically that for DA the interesting cases are S+T+ and S+T- (which are the well studied cases) and for TL the only interesting one is S-T+. This is actually quite surprising, given that similar techniques have been used for both.

(*) I think one exception to this assumption occurs in microarray analysis in computational biology. One of the big problems faced in this arena is that it is very hard to combine data from microarrays taken using different platforms (the platform is essentially the manufacturer of the actual device) or in different experimental conditions. What is typically done in compbio is to do a fairly heavy-handed normalization of the data, usually by some complex rank-ordering and binning process. A lot of information is lost in this transformation, but it at least puts the different data sets on the same scale and renders them (hopefully) roughly comparable. One can think of not doing the normalization step and instead treating this as a DA problem. However, due to the different scales and variances of the gene expression levels, it's not clear that a "single good classifier" exists. (You also have a compounded problem that not all platforms measure exactly the same set of genes, so you get lots of missing data.)

23 November 2007

NIPS 2007 Pre-prints Up (and WhatToSee-d)

NIPS 2007 preprints are online, with a big warning that they're not final versions. Be that as it may, I've indexed them on WhatToSee for your perusal. (More info on WhatToSee)

19 November 2007

Translation out of English

If you look at MT papers published in the *ACL conferences and siblings, I imagine you'll find a cornucopia of results for translating into English (usually, from Chinese or Arabic, though sometimes from German or Spanish or French, if the corpus used is from EU or UN or de-news). The parenthetical options are usually just due to the availability of corpora for those languages. The former two are due to our friend DARPA's interest in translating from Chinese and Arabic into English. There are certainly groups out there who work on translation with other target languages; the ones that come most readily to mind are Microsoft (which wants to translate its product docs from English into a handful of "commercially viable" languages), and a group that works on translation into Hungarian, which seems to be quite a difficult proposition!

Maybe I'm stretching here, but I feel like we have a pretty good handle on translation into English at this point. Our beloved n-gram language models work beautifully on a language with such a fixed word order and (to a first approximation) no morphology. Between the phrase-based models that have dominated for a few years, and their hierarchical cousins (both with and without WSJ as input), I think we're doing a pretty good job on this task.

I think the state of affairs in translation out of English is much worse off. In particular, I think the state of affairs for translation from a morphologically-poor language into a morphologically-rich language is worse still. (Yes, I will concede that, in comparison to English, Chinese is morphologically-poorer, but I think the difference is not particularly substantial.)

Why do I think this is an interesting problem? For one, I think it challenges a handful of preconceived notions about translation. For instance, my impression is that while language modeling is pretty darn good in English, it's pretty darn bad in languages with complex morphology. There was a JHU workshop in 2002 on speech recognition for Arabic, a large component of which was on language modeling. From their final report, "All these morphology-based language models yielded slight but consistent reductions in word error rate when combined with standard word-based language models." I don't want to belittle that work---it was fantastic. But one would hope for more than slight reduction, given how badly word-based n-gram models work in Arabic.

Second, one of my biggest pet peeves about MT (well, at least, why I think MT is easier than most people usually think of it) is that interpretation doesn't seem to be a key component. That is, the goal of MT is just to map from one language to another. It is still up to the human reading the (translated) document to do interpretation. One place where this (partially) falls down is when there is less information in the source language than you need to produce grammatical sentences in the target language. This is not a morphological issue per se (for instance, the degree to which context plays a role for interpretation in Japanese is significantly higher than in English---directly translated sentences would often not be interpretable in English), but really an issue of information-poor to information-rich translation. It just so happens that a lot of this information is often marked in morphology, which languages like English lack.

That said, there is at least one good reason why one might not work on translation out of English. For me, at least, I don't speak another language well enough to really figure out what's going on in translations (i.e., I would be bad at error analysis). The language other than English that I speak best is Japanese. But in Japanese I could probably only catch gross translation errors, nothing particularly subtle. Moreover, Japanese is not what I would call a morphologically rich language. I would imagine Japanese to English might actually be harder than the other way, due to the huge amount of dropping (pro-drop and then some) that goes on in Japanese.

If I spoke Arabic, I think English to Arabic translation would be quite interesting. Not only do we have huge amounts of data (just flip all our "Arabic to English" data :P), but Arabic has complex, but well-studied morphology (even in the NLP literature). As cited above, there's been some progress in language modeling for Arabic, but I think it's far from solved. Finally, one big advantage of going out of English is that, if we wanted, we have a ton of very good tools we could throw at the source language: parsers, POS taggers, NE recognition, coreference systems, etc. Such things might be important in generating, eg., gender and number morphemes. But alas, my Arabic is not quite up to par.

(p.s., I recognize that there's no reason English even has to be one of the languages; it's just that most of our parallel data includes English and it's a very widely spoken language and so it seems at least not unnatural to include it. Moreover, from the perspective of "information poor", it's pretty close to the top!)

15 November 2007

NetFlix "solved" (the small version)

See the post on Hunch. Maybe that last 1.5% might benefit not from fancy ML, but from fancy (or even stupid) NLP. "But what data do we have to run the NLP on," my friends may ask. How about stuff like this, or if you're adventurous like this? (Had I enough time, I might give it a whirl, but alas...)

p.s., If any of the above links encourage copyright violations, then I'm not actually advocating their use.

12 November 2007

Understanding Model Predictions

One significant (IMO) issue that we face when attempting to do some sort of feature engineering is trying to understand not only what sorts of errors our model is making, but why. This is an area in which pattern-based methods seem to have a leg up on classification-based methods. If an error occurs, we know exactly what pattern fired to yield that error, and we can look back at where the pattern came from and (perhaps) see what went wrong.

I've been trying to think for a while what the best way to do this in a classification style system is, especially when the output is structured. I think I have a way, but in retrospect it seems so obvious that I feel someone must have tried it before. Note that unlike some past blog posts, I haven't actually tried doing this yet, but I think maybe the idea is promising.

Suppose we learn a system to solve some task, and we have some held out dev data on which we want to do, say, feature engineering. What we'd like to do is (a) find common varieties of errors and (b) figure out why these errors are common. I'm going to assume that (a) is solved... for instance in tagging problems you could look at confusion matrices (though note that in something like MT or summarization, where the structure of the output space changes based on past decisions, this may be nontrivial).

Let's say we're doing POS tagging and we've observed some error class that we'd like to understand better, so we can add new features that will fix it. One way of thinking about this is that our classifier predicted X in some context where it should have produced Y. The context, of course, is described by a feature set. So what we want to do, essentially, is look back over the training data for similar contexts, but where the correct answer was X (what our current model predicted). In the case of linear classifiers, something as simple as cosine similarity between feature vectors may be a sufficiently good metric to use.
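
As a sketch of that retrieval step (Python with dense numpy vectors for simplicity; a real tagger would have sparse feature vectors, but the cosine idea is the same, and all of the names below are made up):

    # Sketch of the retrieval idea: given a dev example where the model predicted
    # X but the gold tag was Y, find training examples whose gold tag was X and
    # whose feature vectors are closest (by cosine similarity) to the dev context.
    import numpy as np

    def nearest_training_contexts(dev_feat, train_feats, train_labels,
                                  predicted_label, k=10):
        mask = train_labels == predicted_label
        cands = train_feats[mask]
        # cosine similarity between the dev context and each candidate context
        sims = cands @ dev_feat / (
            np.linalg.norm(cands, axis=1) * np.linalg.norm(dev_feat) + 1e-12)
        top = np.argsort(-sims)[:k]
        return np.flatnonzero(mask)[top], sims[top]

    # toy usage with random placeholder data
    rng = np.random.default_rng(1)
    train_feats = rng.random((500, 40))
    train_labels = rng.integers(0, 5, size=500)   # e.g., 5 POS tags
    dev_feat = rng.random(40)
    idx, sims = nearest_training_contexts(dev_feat, train_feats, train_labels,
                                          predicted_label=3)
    print(idx, sims)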

25 October 2007

Non-parametric versus model selection/averaging

Non-parametric approaches (probably the most familiar of which is the Dirichlet process, but there are a whole host of them) are nice because they don't require us to pre-specify a bunch of things that in standard parametric inference would essentially be a model selection issue. For instance, in the DP, we needn't specify how many "clusters" gave rise to our data (in the context of a mixture model).

This brings up the immediate question, though: instead of doing inference in a non-parametric model, why don't you just do model selection (eg., by comparing marginals) or model averaging? You can just vary whatever it is that is the "non-parametric" part of the model. For instance, in a DP, you run a bunch of inferences with different numbers of clusters and either choose the best (model selection) or average with respect to the marginals (model averaging). In something like an IBP, you can run with a different number of latent factors and select or average.
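
Here is a small sketch of that alternative, using scikit-learn's Gaussian mixtures purely as a stand-in (the data is synthetic and the DP side is a truncated variational approximation, so treat it as illustration, not as evidence either way):

    # Sketch: fit finite mixtures for a range of K and pick the best
    # held-out log-likelihood, versus a truncated variational DP mixture
    # that sidesteps the choice of K. X is synthetic placeholder data.
    import numpy as np
    from sklearn.mixture import GaussianMixture, BayesianGaussianMixture
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=m, size=(200, 2)) for m in (-4, 0, 4)])
    X_train, X_dev = train_test_split(X, test_size=0.3, random_state=0)

    # Model selection: sweep K, score on held-out data.
    scores = {k: GaussianMixture(n_components=k, random_state=0)
                 .fit(X_train).score(X_dev)
              for k in range(1, 11)}
    best_k = max(scores, key=scores.get)

    # Non-parametric-ish alternative: truncated variational DP mixture.
    dp = BayesianGaussianMixture(
        n_components=20,
        weight_concentration_prior_type='dirichlet_process',
        random_state=0).fit(X_train)
    print(best_k, np.sum(dp.weights_ > 1e-2))  # chosen K vs. effectively-used components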

I've been asked this general question a few times by non-ML people and I rarely feel like I can give a compelling answer. In particular, I'm not aware of any non-toy experimental comparisons between non-parametric inference and model selection/averaging in any of these models. And even toy ones are hard to come by. But even beyond empirical evidence, I often have a hard time even formulating a coherent qualitative argument.

Here are some points I've come up with, but maybe commenters can either debunk them or add their own...

  1. In some cases, there are lots of parts of the model for which we don't know the structure, so to do model selection/averaging would require trying a ridiculously large number of models. For instance, I might have two components in my model that are DP-ish, so now I have to try quadratically many models.
  2. I may not know a good upper/lower bound on the number of components (eg., in a DP). So I'm going to have to try a really large range. In fact, although it's well known that the expected number of clusters in a DP grows as O(log N), where N is the number of data points (see the small simulation after this list), the actual number is unbounded (and there's a conjecture that it's ω(log log N), which isn't terribly slow).
  3. Comparing marginal likelihoods across models with a different number of parameters is just plain hard. In fact, for most cases, I don't know how to do it, especially if you want to live in MCMC world. (In variational world you could compare the lower bound on the marginals, but it's always a bit nerve-wracking to compare two lower bounds -- you'd rather compare lowers and uppers.) I'm aware of things like reversible jump MCMC and so on, but in most cases these aren't actually applicable to the models you want. Alternatively, if all you want to do is select (not average), you could always do something with held-out data.
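
Here is the small simulation referred to in point 2 (Python, using the CRP view of the DP: customer n starts a new table with probability a/(a+n-1), so the expected number of tables is the sum of a/(a+n-1), roughly a*log N):

    # Expected and sampled number of CRP/DP clusters as a function of N.
    import numpy as np

    def expected_tables(n, alpha=1.0):
        # E[#tables] = sum_{i=1..n} alpha / (alpha + i - 1)  ~  alpha * log(n)
        return np.sum(alpha / (alpha + np.arange(n)))

    def sample_tables(n, alpha=1.0, rng=np.random.default_rng(0)):
        tables = []                      # customer counts per table
        for _ in range(n):
            probs = np.array(tables + [alpha], dtype=float)
            probs /= probs.sum()
            k = rng.choice(len(probs), p=probs)
            if k == len(tables):
                tables.append(1)         # new table
            else:
                tables[k] += 1
        return len(tables)

    for n in (100, 1000, 10000):
        print(n, expected_tables(n), sample_tables(n))
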
The problem is that I can think of counter-arguments to most of these points. In the case of (1), you could argue that if the space is too big, then your sampler isn't going to hit everywhere anyway. In the case of (2), my guess is that for most of these models the marginal will be semi-convex, so you can just start small and keep increasing until things seem to get worse. For (3), this seems to be an argument for developing better MCMC techniques for comparing marginals, not necessarily an argument in favor of non-parametric methods.

But I can go back yet again. To counter the counter to (1), you can argue that the sampler is at least guaranteed after a long time to hit what you care about, whereas if you construct some arbitrary search policy, you may not be. For (2), well...I don't know...I'm pretty convinced by the counter-argument to (2) :P... For (3), you could just disagree and say: why should we develop better MCMC techniques for comparing marginals when we can get away from this whole business by doing non-parametric inference.

Overall, I think non-parametric inference is interesting, useful and fun. But I'd like better arguments against the nay-sayers (who, in my experience, are actually typically non-ML people).

(Note that I'm ignoring the case where the non-parametric model is actually known--or roughly known--to be the right model for your problem. Of course if it's the right model, then you should use it. I'm more referring to the question of using non-parametric methods to get around model selection issues.)

19 October 2007

Gender and text, gender and speech

For some crazy reason I decided a while ago that I wanted to learn Japanese. Essentially, I wanted to learn a language as unlike English as I could find. So I did some summer intensive thing before college (that amounted to a year of class) and then continued taking class for all three years of undergrad. At the end, I could get by passably for most conversation topics (business, politics, current events, etc.) other than research stuff (at some point I learned how to say NLP, but I don't remember anymore...I wonder if en-eru-pi would be understood...). During the whole time we were required to meet weekly with conversation partners so as to practice our speaking skills.

For the first "semester" during the summer, I had a male professor. For all remaining seven semesters, my profs were female. With the exception of one conversation partner (who was from Hokkaido and spoke quickly with a strong accent, and who was quickly replaced by someone I could understand a bit better), all of my conversation partners were female.

At the end of my four years, I was speaking to a friend (who was neither a conversation partner nor a prof) in Japanese and after about three turns of conversation, he says to me (roughly): "you talk like a girl."

Based on the setup of this post, you may have seen that coming. But the thing that is most interesting is that Japanese is not one of those languages where the speaker's gender is encoded in (eg.) verb morphology. In fact, as best I could tell at that point, the only thing I did that was effeminate was to use too many sentence-ending particles of the sort more commonly used by women (-ka-na, I think, was one, but it's been too long now to really remember). The guy who said this to me was a close enough friend that I tried to figure out what it was about my speech that made him assess that I talk like a girl. The sentence-particle thing was part of it, but he said that there was also something else that he couldn't really figure out; he was hypothesizing it was something to do with emphasis patterns.

It's not at all surprising that, given that the majority of native speakers I talked to were female, I would have picked up any underlying bias that was subtle enough that the profs weren't able to intentionally avoid it.

Now, getting back to en-eru-pi. There's been a reasonable amount of work in the past few years on identifying the gender of the authors of texts. I know both Moshe Koppel and Shlomo Argamon, to name two, have worked on this problem. I also remember seeing a web site a year or so ago where you could enter a few sentences that you wrote and it would guess your gender. I don't remember what it cued off of---I think distribution of types of verbs and adjectives, mostly---but I do remember that given a short paragraph, it's shockingly accurate.

What I don't know is (a) if anyone has done this for something other than English and (b) if someone has done it for speech. Of course, if you have speech, you have extra information (eg., pitch) which might be useful. But given my Japanese friend's reaction to my speech pattern (my voice is rather low), there has to be more going on. And I'm not convinced that what is going on will be the same between written text and (say) transcribed speech. If someone wanted to try such an experiment for non-English text, you could probably just mine non-English text from some social networking site (like myspace or facebook), where people tend to list their genders. I'm not sure how to do it for speech. Maybe there's some speech transcription corpus out there that's annotated with gender, but I don't know what it is. Although I don't see a huge financial market out there for an answer, I'm personally curious what it is about my English writing patterns that made the web site I referred to earlier strongly convinced that I'm male, and what it is about my Japanese speech patterns that make it clear that I'm not.
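
For what it's worth, the text side of such an experiment is not much code; a sketch (Python/scikit-learn, with placeholder documents and self-reported labels standing in for whatever you actually scraped):

    # Sketch of a bag-of-words author-gender classifier; the last few lines
    # look at which features carry the weight, since that is the question
    # raised above. texts/genders are placeholders for scraped data.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["example profile text one", "another example profile here"]  # placeholder documents
    genders = ["f", "m"]                                                   # self-reported labels

    clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(texts, genders)

    # Which unigrams/bigrams push the decision each way?
    vec = clf.named_steps["countvectorizer"]
    lr = clf.named_steps["logisticregression"]
    weighted = sorted(zip(lr.coef_[0], vec.get_feature_names_out()))
    print(weighted[:10])   # features pushing toward "f" (meaningless on this toy data)
    print(weighted[-10:])  # features pushing toward "m"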