29 May 2008

Measures of correlation

As NLPers, we're often tasked with measuring similarity of things, where things are usually words or phrases or something like that. The most standard example is measuring collocations. Namely, for every pair of words, do they co-locate (appear next to each other) more than they "should." There are lots of statistics that are used to measure collocation, but I'm not aware of any in-depth comparison of these. If anyone actually still reads this blog over the summer and would care to comment, it would be appreciated!

The most straightforward measure, in my mind, is mutual information. If we have two words "Y" and "Y", then the mutual information is: MI(X;Y) = \sum_x \sum_y p(X=x,Y=y) \log [p(X=x,Y=y) / (p(X=x)p(Y=y))] (sorry, my LaTeX plugin seems to have died). In the case of collocation, the values x and y can take are usually just "does occur" and "does not occur." Here, we're basically asking ourselves: do X and Y take on the same values more often than chance. I.e., do they seem roughly statistically independent. The mutual information statistic gives, for every X/Y pair (every pair of words) a score. We can then sort all word pairs by this score and find the strongest collocations.

One issue with MI seems to be that it doesn't really care if pairs are common or not. Very infrequent (typically noisy) pairs seem to pop to the front. One way to fix this would be to add an additional "count" term to the front of the mutual information to weight high-frequency pairs more. This is quite similar to the RlogF measure that's locally quite popular.

The next set of methods seem to be based mostly on the idea of assuming a generative model for the data and doing hypothesis testing. I.e., you can have one model that assumes the two words are independent (the null hypothesis), and one that assumes they are not. Typically this is implemented as multinomials over words, and then a classical statistical test is applied to the estimated maximum likelihood parameters of the data. You can use Dice coefficient, t-score, chi-squared, or even something simpler like the log-likelihood ratio.

Although I'm not completely familiar with these techniques, they seem like they would also be sensitive to noise in the MLE, especially for low-frequency terms (of which we know there are a lot). It seems plausible to just directly do a Bayesian hypothesis test, rather than a classical one, but I don't know if anyone has done this.

In writing this, I just came across collocations.de, which among other things, has a nice survey of these results from an ESSLLI tutorial in 2003. They seem to conclude that for extracting PP/verb collocations, t-score and frequency seem to work best, and for extracting adjective/noun collocations, log-likelihood, t-score, Fisher, and p-values seem to work best. The intersection just contains t-score, so maybe that's the way to go. Of course, these things are hard to validate because so much depends on what you're trying to do.

22 May 2008

I occasionally--more than I would like--run in to the following problem. I am working on some task for which there is prior work (shocking!). I'm getting ready to decide what experiments to run in order to convince readers of a potential paper that what I'm doing is reasonable. Typically this involves comparing fairly directly against said prior work. Which, in turn, means that I should replicate what this prior work has done as closely as possible, letting the only difference in systems be what I am trying to propose. Easy example: I'm doing machine learning for some problem and there is an alternative machine learning solution; I need to use an identical feature set to them. Another easy example: I am working on some task; I want to use the same training/dev/test set as the prior work.

The problem is: what if they did it wrong (in my opinion). There are many ways for doing things wrong and it would be hard for me to talk about specifics without pointing fingers. But just for concreteness, let's take the following case relating to the final example in the above paragraph. Say that we're doing POS tagging and the only prior work POS tagging paper used as test data every tenth sentence in WSJ, rather than the last 10%. This is (fairly clearly) totally unfair. However, it leaves me with the following options:

1. I can repeat their bad idea and test on every tenth sentence.
2. I can point out (in the paper) why this is a bad idea, and evaluate on the last 10%.
3. I can not point this out in the paper and just ignore their work and still evaluate on the last 10%.
4. I can point out why this is a bad idea, evaluate on both the last 10% (for "real" numbers) and every tenth sentence (for "comparison" numbers).
5. I can point out why this is a bad idea, reimplement their system, and run both mine and theirs on the last 10%.
I think that if all else is equal, (5) is the best option. I think it's scientifically sound, truthful, and gives results that are actually comparable. Unfortunately, it's also the most work because it involves reimplementing a complete other system which in many cases may be verging on prohibitive. (I suppose another option would be to contact the authors and ask them to rerun in the correct way -- I imagine some people might respond positively to this, though probably not all.)

(4) is tempting, because it allows me to "break the bad habit" but also compare directly. The problem is that if I really disbelieve the prior methodology, then these comparison numbers are themselves dubious: what am I supposed to learn from results that compare in an unreasonable setting? If they're great (for me), does that actually tell me anything? And if they're terrible (for me), does that tell me anything? It seems to depend on the severity of the "bad idea", but it's certainly not cut and dry.

(3) just seems intellectually dishonest.

(2) at first glance seems inferior to (4), but I'm not entirely sure that I believe this. After all, I'm not sure that (4) really tells me anything that (2) doesn't. I suppose one advantage to (4) over (2) is that it gives me some sense of whether this "bad idea" really is all that bad. If things don't look markedly different in (2) and (4), then maybe this bad idea really isn't quite so bad. (Of course, measuring "markedly different" may be difficult for some tasks.)

(1) also seems intellectually dishonest.

At this point, to me, it seems like (4) and (5) are the winners, with (2) not hugely worse than (4). Thinking from the perspective of how I would feel reviewing such a paper, I would clearly prefer (5), but depending on how the rest of the paper went, I could probably tolerate (2) or (4) depending on how egregious the offense is. One minor issue is that as a writer, you have to figure that the author of this prior work is likely to be a reviewer, which means that you probably shouldn't come out too hard against this error. Which is difficult because in order to get other reviewers to buy (4), and especially (2), they have to buy that this is a problem.

I'm curious how other people feel about this. I think (5) is obviously best, but if (5) is impossible to do (or nearly so), what should be done instead.

18 May 2008

Domain adaptation is, roughly, the following problem: given labeled data drawn from one or more source domains, and either (a) a little labeled data drawn from a target domain or (b) a lot of unlabeled data drawn from a target domain; do the following. Produce a classifier (say) that has low expected loss on new data drawn from the target domain. (For clarity: we typically assume that it is the data distribution that changes between domains, not the task; that would be standard multi-task learning.)

Obviously I think this is an fun problem (I publish on it, and blog about it reasonably frequently). It's fun to me because it both seems important and is also relatively unsolved (though certainly lots of partial solutions exist).

One thing I've been wondering recently, and I realize as I write this that the concept is as yet a bit unformed in my head, is whether the precise formulation I wrote in the first paragraph is the best one to solve. In particular, I can imagine many ways to tweak it:

1. Perhaps we don't want just a classifier that will do well on the "target" domain, but will really do well on any of the source domains as well.
2. Continuing (1), perhaps when we get a test point, we don't know which of the domains we've trained on it comes from.
3. Continuing (2), perhaps it might not even come from one of the domains we've seen before!
4. Perhaps at training time, we just have a bunch of data that's not neatly packed into "domains." Maybe one data set really comprises five different domains (think: Brown, if it weren't labeled by the source) or maybe two data sets that claim to be different domains really aren't.
5. Continuing (4), perhaps the notion of "domain" is too ill-defined to be thought of as a discrete entity and should really be more continuous.
I am referring to union of these issues as adaptability, rather than adaptation (the latter implies that it's a do-it-once sort of thing; the former that it's more online). All of these points beg the question that I often try to ask myself when defining problems: who is the client.

Here's one possible client: Google (or Yahoo or MSN or whatever). Anyone who has a copy of the web lying around and wants to do some non-trivial NLP on it. In this case, the test data really comes from a wide variety of different domains (which may or may not be discrete) and they want something that "just works." It seems like for this client, we must be at (2) or (3) and not (1). There may be some hints as to the domain (eg., if the URL starts with "cnn.com" then maybe--though not necessarily--it will be news; it could also be a classified ad). The reason why I'm not sure of (2) vs (3) is that if you're in the "some labeled target data" setting of domain adaptation, you're almost certainly in (3); if you're in the "lots of unlabeled target data" setting, then you may still be in (2) because you could presumably use the web as your "lots of unlabeled target data." (This is a bit more like transduction that induction.) However, in this case, you are now also definitely in (4) because no one is going to sit around and label all of the web as to what "domain" it is in. However, if you're in the (3) setting, I've no clue what your classifier should do when it gets a test example from a domain that (it thinks) it hasn't seen before!

(4) and (5) are a bit more subtle, primarily because they tend to deal with what you believe about your data, rather than who your client really is. For instance, if you look at the WSJ section of the Treebank, it is tempting to think of this as a single domain. However, there are markedly different types of documents in there. Some are "news." Some are lists of stock prices and their recent changes. One thing I tried in the Frustratingly Easy paper but that didn't work is the following. Take the WSJ section of the Treebank and run document clustering on it (using bag of words). Treat each of these clusters as a separate domain and then do adaptation. This introduces the "when I get a test example I have to classify it into a domain" issue. At the end of the day, I couldn't get this to work. (Perhaps BOW is the wrong clustering representation? Perhaps a flat clustering is a bad idea?)

Which brings us to the final question (5): what really is a domain? One alternative way to ask this is: give data partitioned into two bags, are these two bags drawn from the same underlying distribution. Lots of people (especially in theory and databases, though also in stats/ML) have proposed solutions for answering this question. However, it's never going to be possible to give a real "yes/no" answer, so now you're left with a "degree of relatedness." Furthermore, if you're just handed a gigabyte of data, you're not going to want to try ever possible split into subdomains. One could try some tree-structured representation, which seems kind of reasonable to me (perhaps I'm just brainwashed because I've been doing too much tree stuff recently).

12 May 2008

Teaching machine translation

Last Fall (2007), I taught an Applications of NLP course to a 50/50 mix of grads and senior undergrads. It was modeled partially after a course that I took from Kevin Knight while a grad student. It was essentially 1/3 on finite state methods for things like NER and tagging, then 1/3 on machine translation, then 1/3 on question answering and summarization. Overall, the course went over fairly well.

I had a significant problem, however, teaching machine translation. Here's the problem.

Students knew all about FSTs because we used them to do all the named-entity stuff in the first third of class. This enabled us to talk about things like IBM model 1 and the HMM model. (There's a technical difficult here, namely dealing with incomplete data, so we talk about EM a little bit.) We discuss, but they don't actually make use of, higher order MT models.

Now, we all know that there's a lot more to MT than model 4 (even limiting oneself to statistical translation techniques). Namely, there are phrase-based models and syntactic models. We had a very brief (one lecture) overview of syntactic models at the end. My beef is with phrase-based models.

The problem is that we've gone though all this prettiness to develop these word-based models, and then I have to teach them grow-diag-final, phrase extraction and phrase scoring. I almost felt embarrassed doing so. The problem is that these things are obviously so heuristic that throwing them on top of this really pretty word-for-word translation model just kills me. And it's not just me: the students were visibly upset by the lack of real modeling behind these techniques.

One option would be just not to teach this stuff. I don't really think that it sheds much light on the translation process. The reason I don't like this solution is because it's nice to be able to say that they will have a handle on a not-too-difficult to understand/implement method for doing real-world MT. Instead, I could just spend that time on syntactic models. The situation there is better (you can talk about the hierarchy of tree transducers, etc.), but not perfect (eg., all the work that goes in to rule extraction is not too dissimilar from all the work that goes into phrase extraction).

I suppose that this is just the defacto problem with a relatively immature field: there hasn't been enough time for us to really tease apart what's actually going on in these models and try to come up with some coherent story. I'd love a story that doesn't involve first doing word alignment and, is, in some sense, integrated.