## 27 December 2007

### Those Darn Biologists...

I've recently been talking to a handful of biologists. The lab/bench kind, not the computational kind. As an experiment in species observation, they intrigue me greatly. I'm going to stereotype, and I'm sure it doesn't hold universally, but I think it's pretty true and several (non-biologist) friends have confirmed this (one of whom is actually married to a biologist). The reason they intrigue me is because they are wholly uninterested in techniques that allow you to do X better, once they already have a solution for X.

This is a bit of an exaggeration, but not a huge one (I feel).

It is perhaps best served by example. Take dimensionality reduction. This is a very very well studied problem in the machine learning community. There are a bajillion ways to do it. What do biologists use? PCA. Still. And it's not that people have tried other things and they've failed. Other things have worked. Better. But no one cares. Everyone still uses PCA.

Part of this is probably psychological. After using PCA for a decade, people become comfortable with it and perhaps able to anticipate a bit of what it will do. Changing techniques would erase this.

Part of this is probably cultural. If you're trying to publish a bio paper (i.e., a paper in a bio journal) and you're doing dimensionality reduction and you aren't using PCA, you're probably going to really have to convince the reviewers that there's a good reason not to use PCA and that PCA just plain wouldn't work.

Now, that's not to say that biologists are luddites. They seem perfectly happy to accept new solutions to new problems. For instance, still in the dimensionality reduction setting, non-negative matrix factorization techniques are starting to show their heads in microarray settings. But it's not because they're better at solving the same old problem. What was observed was that if you have not just one, but a handful of microarray data sets, taken under slightly different experimental conditions, then the variance captured by PCA is the variance across experiments, not the variance that you actually care about. It has been (empirically) observed that non-negative matrix factorization techniques don't suffer from this problem. Note that it's only fairly recently that we've had such a surplus of microarray data that this has even become an issue. So here we have an example of a new problem needing a new solution.

Let's contrast this with the *ACL modus operandi.

There were 132 papers in ACL 2007. How many of them do you think introduced a new problem?

Now, I know I'm being a bit unfair. I'm considering dimensionality reduction across different platforms (or experimental conditions) as a different problem from dimensionality reduction on a single platform. You could argue that these really are the same problem. I just kind of disagree.

Just as biologists tend to be wary of new techniques, so we tend to be wary of new problems. I think anyone who has worked on a new problem and tried to get it published in a *ACL has probably gone through an experience that left a few scars. I know I have. It's just plain hard to introduce a new problem.

The issue is that if you work on an age-old problem, then all you have to do is introduce a new technique and then show that it works better than some old technique. And convince the reviewers that the technique is interesting (though this can even be done away with if the "works better" part is replaced with "works a lot better").

On the other hand, if you work on a new problem, you have to do a lot. (1) You have to introduce the new problem and convince us that it is interesting. (2) You have to introduce a new evaluation metric and convince us that it is reasonable. (3) You have to introduce a new technique and show that it's better than whatever existed before.

The problem is that (1) is highly subjective, so a disinterested reviewer can easily kill such a paper with a "this problem isn't interesting" remark. (2) is almost impossible to do right, especially the first time around. Again, this is something that's easy to pick on as a reviewer and instantly kill a paper. One might argue that (3) is either irrelevant (being a new problem, there isn't anything that existed before) or no harder than in the "age old problem" case. I've actually found this not to be true. If you work on an age old problem, everyone knows what the right baseline is. If you work on a new problem, not only do you have to come up with your own technique, but you also have to come up with reasonable baselines. I've gotten (on multiple occasions) reviews for "new problems" papers along the lines of "what about a baseline that does XXX." The irony is that while XXX is often creative and very interesting, it's not really a baseline and would actually probably warrant it's own paper!

That said, I didn't write this post to complain about reviewing, but rather to point out a marked difference between how our community works and how another community (that is also successful) works. Personally, I don't think either is completely healthy. I feel like there is merit in figuring out how to do something better. But I also feel that there is merit in figuring out what new problems are and how to solve them. The key problem with the latter is that the common reviewer complaints I cited above (problem not interesting or metric not useful) often are real issues with "new problem" papers. And I'm not claiming that my cases are exceptions. And once you accept a "new problem" paper that is on the problem that really isn't interesting, you've opened Pandora's box and you can expect to see a handful of copy cat papers next year ( could, but won't, give a handful of examples of this happening in the past 5 years). And it's really hard to reject the second paper on a topic as being uninteresting, given that the precedent has been set.

## 18 December 2007

### Particle filtering versus beam search

I had a very interesting discussion at NIPS with Vikash Mansingka about beam search and particle filtering. Much of what is below is a result of this conversation and he deserves much credit.

In a very real sense, particle filtering can be seen (for many models) as a sort of stochastic beam search. Just to set the stage, let's talk about generic beam search:

1. Initialize a min-queue "beam" to contain a single initial hypothesis.2. While beam contains non-complete hypothesis:  A. Take h, the lowest scoring hypothesis from the beam  B. For each h' in the neighborhood of h     I. Add h' to the beam, with some score  C. If the beam exceeds a certain capacity, drop high-scoring elements3. Return the lowest scoring hypothesis from beam
The key variant in beam search algorithms is the score function that's used. The most naive is just path cost---how much did we have to pay to get to the current hypothesis? The "A*" way is to use path cost + admissible heuristic cost, where the admissible heuristic is an underestimate of the amount of cost remaining to complete.

There are (of course) other variants: sometimes we expand the whole beam in one step; sometimes we use multiple beams; etc. We'll come back to these shortly.

Particle filtering (using as similar terminology to the beam search as possible), on the other hand, looks something like:
1. Initialize a max-queue "beam" to contain a single initial hypothesis.2. While beam contains non-complete hypothesis:  A. Take h, the highest scoring hypothesis from the beam  B. Sample h' from the neighborhood of h     I. Add h' to the beam, with some score  C. If the beam exceeds a certain capacity, randomly drop elements  D. If the "effective sample size" of the beam is too low, resample elements3. Return the highest scoring hypothesis from beam
To map beam search to particle filtering, use negative log probabilities for scores. (This turns max search into min search.)

In a sense, the major thing that differs between PF and BS is that in PF everything happens randomly -- we're always sampling from a proposal distribution, rather than greedily taking the "best" hypotheses at each step.

There are, however, some minor issues: what proposal distribution is used in step 2B in PF and what does resampling do.

In general, the proposal distribution should be the true posterior probability, but this is pretty much always intractable (otherwise we wouldn't be using approximations at all). The most common thing to do is to use a transition probability, which corresponds exactly to using a path cost in beam search. However, just as we know that using a heuristic can help in beam search, this suggest that using a proposal distribution in PF that contains something like a future cost heuristic is going to be helpful. I'm sure this has been done, but I don't think it's nearly as common as it perhaps should be, given how helpful it is in BS.

What 2D in PF does is to measure whether the samples that we have in the beam have fallen out of the true posterior---in other words, are we "off the mark." There's a straightforward way to compute this (see the wikipedia page linked above). If we are off the mark, we resample all the elements in our beam, essentially trying to get back on the right path. My experience with PF is that this is pretty essential to getting things to work, but as far as I know there's no analogue in BS land.

It may be that these two things (resampling and heuristic) are two different ways of solving the same problem and that doing both is not so helpful. But I tend to doubt it.

One other thing that I've never seen done in PF is to maintain multiple beams. Admittedly this makes the resampling step harder (maybe...I haven't worked through it), but it may be helpful. One thing, for instance, that we found in the PF for the coalescent was that the PF tends to like to merge clusters early with no regard for how hard life is going to be later. We use resampling to try to fix this, but another way would be to use multiple beams. This is commonly done in, eg., machine translation, where a single beam doesn't do well, so we maintain one beam for each number of foreign (or English) words covered.

So what's the point? I think that by recognizing the similarity between BS and PF, we can see ways to potentially improve both. For instance, the following seem promising:
1. Use resampling techniques in beam search
2. Use multiple beams in particle filtering
3. Use A*-style heuristics in the proposal distribution for particle filtering
An alternative suggestion would be for NLPers to pay more attention to particle filtering. It's a nice, clean probabilistic analogue to a technique that we love (or at least are forced to use).

## 12 December 2007

### NIPS retrospective

I got back from NIPS on Sunday; sadly I got sick on the last day. Despite this, it was by far my most productive conference ever. Admittedly, I made a point to try to make it productive (being somewhat geographically isolated means that it's in my best interest), but even so, I had a fantastic time. There are several ideas that I plan on blogging on in the near future, but for now, I'll focus on the actual conference program. The standard disclaimer applies: I didn't see everything and just because something's not listed here doesn't mean I didn't enjoy it. I'll try to limit my commentary to papers that are relevant to the NLP audience, but I may stray. I'll conclude with a bit of my sense of what was in and out this year.

1. Kristina Toutanova and Mark Johnson had a topic-model style paper for doing semisupervised POS tagging. There are two new things here. First, they actually model context (yay!) which we know is important in language. Second, they introduce a notion of a confusion class (what possible POS tags can a word get) and actually model this. This latter is something that makes sense in retrospect, but is not obvious to think of (IMO). They actually get good results (which is non-trivial for topic models in this context).
2. Alex Bouchard-Cote and three more of the Berkeley gang had a paper on language change. If you saw their paper at ACL, they're attacking a pretty similar problem, by looking at phonological divergences between a handful of languages. I'd love to be able to reconcile their approach with what I've been working on---I use a huge database of typological knowledge; they use word forms...I'd love to use both, but I really like working with thousands of languages and it's hard to find corpora for all of these :).
3. John Blitzer had a good paper (with other Penn folks) on learning bounds for domain adaptation. If you care at all about domain adaptation, you should read this paper.
4. Alex Kulesza and Fernando Pereira had a great paper on what happens if you try to do structured learning when the underlying prediction (inference) algorithm is approximate. They show that things can go very wrong in widely different ways. This is a bit of a negative results paper, but it gives us (a) a strong sense that proceeding with a generic learning algorithm on top of an approximate inference algorithm is not okay and (b) a few ideas of what can actually go wrong.
5. Yee Whye Teh, Kenichi Kurihara and Max Welling continue their quest toward collapsed variational models for all problems in the world by solving the HDP. (For those who don't know, you can think of this as LDA with a potentially infinite number of topics, but of course the real case is more interesting.)
6. Ali Rahimi and Benjamin Recht presented a very cool analysis of kernel methods when your kernel is of the form K(x,z) = f(x-z). This is the case for, eg., the RBF kernel. This is something that I've wanted to try for a while, but to be honest my poor recollection of my analysis training left me a bit ill equipped. Essentially they apply a Fourier analysis to such kernels and then represent the kernel product K(x,z) as a dot product in a new input space, where kernelization is no longer required. The upshot is that you can now effectively do kernel learning in a linear model in primal space, which is very nice when the number of examples is large.
7. David Heckerman gave a really nice invited talk on graphical models for HIV vaccine design. What he was particularly interested in was analyzing whether standard methods of determining correlation between events (eg., "do I have HIV given that I have some genomic signal?") work well when the data points are not independent, but rather belong to, eg., some genetic population. This is effectively what Lyle Campbell and I were doing in our ACL paper last year on typological features. Curing HIV might be a bit more of an interesting application, though :). Essentially what David and his colleagues do is to build a hierarchy on top of the data points and then let this explain some of the interactions and then figure out what isn't explained by this hierarchy. In the linguistic sense, this is effectively separating historical from areal divergences. I'll have to read up more on what they do, precisely.
There were several themes that stuck out at NIPS this year, though they aren't represented in the above list. Probably the biggest one is recommender systems. Just search the titles for "matrix" and you'll come up with a handful of such systems. I have to believe that this is spurred, at least in part, by the NetFlix challenge. Another theme was deep belief networks, which even had a rogue workshop dedicated to them. I have a feeling we'll be seeing more and more of these things. A final theme was randomized algorithms, particularly as applied to large scale learning. I think this is a great direction, especially at NIPS, where, historically, things have been less focused on the large scale.

My only serious quibble with the program this year is that I saw a handful of papers (and at least one that got an oral and one a spotlight) that really fell down in terms of evaluation. That is, these papers all proposed a new method for solving task X and then proceeded to solve it. The experiments, however, contained only comparisons to algorithms that did not have access to all the information that X had. I'll give an example of something that would fall in this category but I have not seen appear anywhere: I build a structured prediction algorithm and compare only against algorithms that make independent classification decisions. I'm not a huge quibbler over comparing against the absolute state of the art, but it really really irks me when there are no comparisons to algorithms that use the same information, especially when standard (read: they are in any reasonable machine learning textbook) algorithms exist.

## 30 November 2007

### Domain adaptation vs. transfer learning

The standard classification setting is a input distribution p(X) and a label distribution p(Y|X). Roughly speaking, domain adaptation (DA) is the problem that occurs when p(X) changes between training and test. Transfer learning (TL) is the problem that occurs when p(Y|X) changes between training and test. In other words, in DA the input distribution changes but the labels remain the same; in TL, the input distributions stays the same, but the labels change. The two problems are clearly quite similar, and indeed you see similar (if not identical) techniques being applied to both problems. (Incidentally, you also see papers that are really solving one of the two problems but claim to be solving the other.)

As a brief aside, we actually need to be a bit more specific about the domain adaptation case. In particular, if p(X) changes then we can always encode any alternative labeling function by "hiding" some extra information in p(X). In other words, under the model that p(X) changes, the assumption that p(Y|X) doesn't change is actually vacuous. (Many people have pointed this out, I think I first heard it from Shai Ben-David a few years ago.) It is because of the assumption that theoretical work in domain adaptation has been required to make stronger assumptions. A reasonable one is the assumption that (within a particular concept class---i.e., space of possible classifiers), there exists one that doesn't do too bad on either the source or the target domain. This is a stronger assumption that the "p(Y|X) doesn't change", but actually enables us to do stuff. (Though, see (*) below for a bit more discussion on this assumption.)
Now, beyond the DA versus TL breakdown, there is a further breakdown: for which sides of the problem do we have labeled or unlabeled data. In DA, the two "sides" are the source domain and the target domain. In TL, the two sides are task 1 (the "source" task) and task 2 (the "target" task). In all cases, we want something that does well on the target. Let's enumerate the four possibilities:
1. Source labeled, target labeled (S+T+)
2. Source labeled, target only unlabeled (S+T-)
3. Source only unlabeled, target labeled (S-T+)
4. Source only unlabeled, target only unlabeled (S-T-)
We can immediately throw away S-T- because this is basically an unsupervised learning problem.

The typical assumption in TL is S+T+. That is, we have labeled data for both tasks. (Typically, it is actually assumed that we have one data set that is labeled for both problems, but I don't feel that this is a necessary assumption.)

In DA, there are two standard settings: S+T+ (this is essentially the side of DA that I have worked on) and S+T- (this is essentially the side of DA that John Blitzer has worked on).

Now, I think it's fair to say that any of the T- settings are impossible for TL. Since we're assuming that the label function changes and can change roughly arbitrarily, it seems like we just have to have some labeled target data. (I.e., unlike the case in DA where we assume a single good classifier exists, this assumption doesn't make sense in TL.)

This begs the question: in TL and DA, does the S-T+ setting make any sense?

For DA, the S-T+ setting is a bit hard to argue for from a practical perspective. Usually we want to do DA so that we don't have to label (much) target data. However, one could make a semi-supervised sort of argument here. Maybe it's just hard to come by target data, labeled or otherwise. In this case, we'd like to use a bunch of unlabeled source data to help out. (Though I feel that in this case, we're probably reasonably likely to already have some labeled source.) From a more theoretical perspective, I don't really see anything wrong with it. In fact, any DA algorithm that works in the S+T- setting would stand a reasonable chance here.

For TL, I actually think that this setting makes a lot of sense, despite the fact that I can't come up with a single TL paper that does this (of course, I don't follow TL as closely as I follow DA). Why do I think this makes sense? Essentially, the TL assumption basically says that the labeling function can change arbitrarily, but the underlying distribution can't. If this is true, and we have labeled target data, I see no reason why we would need labeled source data. That is, we're assuming that knowing the source label distribution tells us nothing about the target label distribution. Hence, the only information we should really be able to get out of the source side is information about the underlying distribution p(X), since this is the only thing that stays the same.

What this suggests is that if having labeled source data in TL is helpful, then maybe the problem is really more domain adaptation-ish. I've actually heard (eg., at AI-Stats this year) a little muttering about how the two tasks used in TL work are often really quite similar. There's certainly nothing wrong with this, but it seems like if this is indeed true, then we should be willing to make this an explicit assumption in our model. Perhaps not something so severe as in DA (there exists a good classifier on both sides), but something not so strong as independence of labeling distributions. Maybe some assumption on the bound of the KL divergence or some such thing.

How I feel at this point is basically that for DA the interesting cases are S+T+ and S+T- (which are the well studied cases) and for TL the only interesting one is S-T+. This is actually quite surprising, given that similar techniques have been used for both.
(*) I think one exception to this assumption occurs in microarray analysis in computational biology. One of the big problems faced in this arena is that it is very hard to combine data from microarrays taken using different platforms (the platform is essentially the manufacturer of the actual device) or in different experimental conditions. What is typically done in compbio is to do a fairly heavy handed normalization of the data, usually by some complex rank-ordering and binning process. A lot of information is lost in this transformation, but at least puts the different data sets on the same scale and renders them (hopefully) roughly comparable. One can think of not doing the normalization step and instead thinking of this as a DA problem. However, due to the different scales and variances of the gene expression levels, it's not clear that a "single good classifier" exists. (You also have a compounded problem that not all platforms measure exactly the same set of genes, so you get lots of missing data.)

## 23 November 2007

### NIPS 2007 Pre-prints Up (and WhatToSee-d)

NIPS 2007 preprints are online, with a big warning that they're not final versions. Be that as it may, I've indexed them on WhatToSee for your perusal. (More info on WhatToSee)

## 19 November 2007

### Translation out of English

If you look at MT papers published in the *ACL conferences and siblings, I imagine you'll find a cornucopia of results for translating into English (usually, from Chinese or Arabic, though sometimes from German or Spanish or French, if the corpus used is from EU or UN or de-news). The parenthetical options are usually just due to the availability of corpora for those languages. The former two are due to our friend DARPA's interest in translating from Chinese and Arabic into English. There are certainly groups out there who work on translation with other target languages; the ones that come most readily to mind are Microsoft (which wants to translate its product docs from English into a handful of "commercially viable" languages), and a group that works on translation into Hungarian, which seems to be quite a difficult proposition!

Maybe I'm stretching here, but I feel like we have a pretty good handle on translation into English at this point. Our beloved n-gram language models work beautifully on a languages with such a fixed word order and (to a first approximation) no morphology. Between the phrase-based models that have dominated for a few years, and their hierarchical cousins (both with and without WSJ as input), I think we're doing a pretty good job on this task.

I think the state of affairs in translation out of English is much worse off. In particular, I think the state of affairs for translation from a morphologically-poor language to a morphologically-rich language. (Yes, I will concede that, in comparison to English, Chinese is morphologically-poorer, but I think the difference is not particularly substantial.)

Why do I think this is an interesting problem? For one, I think it challenges a handful of preconceived notions about translation. For instance, my impression is that while language modeling is pretty darn good in English, it's pretty darn bad in languages with complex morphology. There was a JHU workshop in 2002 on speech recognition for Arabic, a large component of which was on language modeling. From their final report, "All these morphology-based language models yielded slight but consistent reductions in word error rate when combined with standard word-based language models." I don't want to belittle that work---it was fantastic. But one would hope for more than slight reduction, given how badly word based ngram models work in Arabic.

Second, one of my biggest pet-peeves about MT (well, at least, why I think MT is easier than most people usually think of it) is that interpretation doesn't seem to be a key component. That is, the goal of MT is just to map from one language to another. It is still up to the human reading the (translated) document to do interpretation. One place where this (partially) falls down is when there is less information in the source language than you need to produce grammatical sentences in the target language. This is not a morphological issue per se (for instance, in the degree to which context plays a role for interpretation in Japanese is significantly higher than in English---directly translated sentences would often not be interpretable in English), but really an issue of information-poor to information-rich translation. It just so happens that a lot of this information is often marked in morphology, which languages like English lack.

That said, there is at least one good reason why one might not work on translation out of English. For me, at least, I don't speak another language well enough to really figure out what's going on in translations (i.e., I would be bad at error analysis). The language other than English that I speak best is Japanese. But in Japanese I could probably only catch gross translation errors, nothing particularly subtle. Moreover, Japanese is not what I would call a morphologically rich language. I would imagine Japanese to English might actually be harder than the other way, due to the huge amount of dropping (pro-drop and then some) that goes on in Japanese.

If I spoke Arabic, I think English to Arabic translation would be quite interesting. Not only do we have huge amounts of data (just flip all our "Arabic to English" data :P), but Arabic has complex, but well-studied morphology (even in the NLP literature). As cited above, there's been some progress in language modeling for Arabic, but I think it's far from solved. Finally, one big advantage of going out of English is that, if we wanted, we have a ton of very good tools we could throw at the source language: parsers, POS taggers, NE recognition, coreference systems, etc. Such things might be important in generating, eg., gender and number morphemes. But alas, my Arabic is not quite up to par.

(p.s., I recognize that there's no reason English even has to be one of the languages; it's just that most of our parallel data includes English and it's a very widely spoken language and so it seems at least not unnatural to include it. Moreover, from the perspective of "information poor", it's pretty close to the top!)

## 15 November 2007

### NetFlix "solved" (the small version)

See the post on Hunch. Maybe that last 1.5% might benefit not from fancy ML, but from fancy (or even stupid) NLP. "But what data do we have to run the NLP on," my friends may ask. How about stuff like this, or if you're adventurous like this? (Had I enough time, I might give it a whirl, but alas...)

p.s., If any of the above links encourage copyright violations, then I'm not actually advocating their use.

## 12 November 2007

### Understanding Model Predictions

One significant (IMO) issue that we face when attempting to do some sort of feature engineering is trying to understand not only what sorts of errors our model is making, but why. This is an area in which pattern-based methods seem to have a leg up on classification-based methods. If an error occurs, we know exactly what pattern fired to yield that error, and we can look back at where the pattern came from and (perhaps) see what went wrong.

I've been trying to think for a while what the best way to do this in a classification style system is, especially when the output is structured. I think I have a way, but in retrospect it seems so obvious that I feel someone must have tried it before. Note that unlike some past blog posts, I haven't actually tried doing this yet, but I think maybe the idea is promising.

Suppose we learn a system to solve some task, and we have some held out dev data on which we want to do, say, feature engineering. What we'd like to do is (a) find common varieties of errors and (b) figure out why these errors are common. I'm going to assume that (a) is solved... for instance in tagging problems you could look at confusion matrices (though note that in something like MT or summarization, where the structure of the output space changes based on past decisions, this may be nontrivial).

Let's say we're doing POS tagging and we've observed some error class that we'd like to understand better, so we can add new features that will fix it. One way of thinking about this is that our classifier predicted X in some context where it should have produced Y. The context, of course, is described by a feature set. So what we want to do, essentially, is look back over the training data for similar contexts, but where the correct answer was X (what our current model predicted). In the case of linear classifiers, it seems something as simple as cosine difference over feature vectors may be a sufficiently good metric to use.

## 25 October 2007

### Non-parametric versus model selection/averaging

Non-parametric approaches (probably the most familiar of which is the Dirichlet process, but there are a whole host of them) are nice because they don't require us to pre-specify a bunch of things that in standard parametric inference would essentially be a model selection issue. For instance, in the DP, we needn't specify how many "clusters" gave rise to our data (in the context of a mixture model).

This brings up the immediate question, though: instead of doing inference in a non-parametric model, why don't you just do model selection (eg by comparing marginals) or model averaging. You can just vary whatever it is that is the "non-parametric" part of the model. For instance, in a DP, you run a bunch of inferences with different numbers of clusters and either choose the best (model selection) or average with respect to the marginals (model averaging). In something like an IBP, you can run with a different number of latent factors and select or average.

I've been asked this general question a few times by non-ML people and I rarely feel like I can give a compelling answer. In particular, I'm not aware of any non-toy experimental comparisons between doing model selection/averaging in any of these models. And even toy ones are hard to come by. But even beyond empirical evidence, I often have a hard time even formulating a coherent qualitative argument.

Here are some points I've come up with, but maybe commentors can either debunk them or add...

1. In some cases, there are lots of parts of the model for which we don't know the structure, so to do model selection/averaging would require trying a ridiculously large number of models. For instance, I might have two components in my model that are DP-ish, so now I have to try quadratically many models.
2. I may not know a good upper/lower bound on the number of components (eg., in a DP). So I'm going to have to try a really large range. In fact, although it's well known that the expected number of clusters in a DP grows as o(log N), where N is the number of data points, it is actually unbounded (and there's a conjecture that it's w(log log N), which isn't terribly slow).
3. Comparing marginal likelihoods across models with a different number of parameters is just plain hard. In fact, for most cases, I don't know how to do it, especially if you want to live in MCMC world. (In variational world you could compare the lower bound on the marginals, but it's always a bit nerve wracking to compare two lower bounds -- you'd rather compare lowers and uppers.) I'm aware of things like reversible jump MCMC and so on, but in most cases these aren't actually applicable to the models you want. Alternatively, if all you want to do is select (not average), you could always do something with held-out data.
The problem is that I can think of counter-arguments to most of these points. In the case of (1), you could argue that if the space is too big, then your sampler isn't going to hit everywhere anyway. In the case of (2), my guess is that for most of these models the marginal will be semi-convex, so you can just start small and keep increasing until things seem to get worse. For (3), this seems to be an argument for developing better MCMC techniques for comparing marginals, not necessarily an argument in favor of non-parametric methods.

But I can go back yet again. To counter the counter to (1), you can argue that the sampler is at least guaranteed after a long time to hit what you care about, whereas if you construct some arbitrary search policy, you may not be. For (2), well...I don't know...I'm pretty convinced by the counter-argument to (2) :P... For (3), you could just disagree and say: why should we develop better MCMC techniques for comparing marginals when we can get away from this whole business by doing non-parametric inference.

Overall, I think non-parametric inference is interesting, useful and fun. But I'd like better arguments against the nay-sayers (who, in my experience, are actually typically non-ML people).

(Note that I'm ignoring the case where the non-parametric model is actually known--or roughly known--to be the right model for your problem. Of course if it's the right model, then you should use it. I'm more referring to the question of using non-parametric methods to get around model selection issues.)

## 19 October 2007

### Gender and text, gender and speech

For some crazy reason I decided a while ago that I wanted to learn Japanese. Essentially, I wanted to learn a language as unlike English as I could find. So I did some summer intensive thing before college (that amounted to a year of class) and then continued taking class for all three years of undergrad. At the end, I could get by passably for most conversation topics (business, politics, current events, etc.) other than research stuff (at some point I learned how to say NLP, but I don't remember anymore...I wonder if en-eru-pi would be understood...). During the whole time we were required to meet weekly with conversation partners so as to practice our speaking skills.

For the first "semester" during the summer, I had a male professor. For all remaining seven semesters, my profs were female. With the exception of one conversation partner (who was from Hokkaido and spoke quicky with a strong accent and who was quickly replaced by someone who I could understand a bit more), all of my conversation partners were female.

At the end of my four years, I was speaking to a frien (who was neither a conversation partner nor a prof) in Japanese and after about three turns of conversation, he says to me (roughly): "you talk like a girl."

Based on the set up of this post, you may have seen that coming. But the thing that is most interesting is that Japanese is not one of those languages where the speaker's gender is encoded in (eg.) verb morphology. In fact, as best I could tell at that point, the only thing that I did that was effeminate was to use too many sentence ending particles that were more commonly used by women (-ka-na, I think, was one, but it's been too long now to really remember). The guy who said this to me was a close enough friend that I tried to figure out what it was about my speech that made him assess that I talk like a girl. The sentence particle thing was part of it, but he said that there was also something else that he couldn't really figure out; he was hypothesizing it was something to do with emphasis patterns.

It's not at all surprising that given that the majority of native speakers that I talked to were female, that if there were some underlying bias that was sufficiently subtle that the profs weren't able to intentially avoid it, that I would have picked it up.

Now, getting back to en-eru-pi. There's been a reasonable amount of work in the past few years on identifying the gender of the authors of texts. I know both Moshe Koppel and Shlomo Argamon, to name two, have worked on this problem. I also remember seeing a web site a year or so ago where you could enter a few sentences that you wrote and it would guess your gender. I don't remember what it cued off of---i think distribution of types of verbs and adjectives, mostly, but I do remember that given a short paragraph, it's shockingly accurate.

What I don't know is (a) if anyone has done this for something other than English and (b) if someone has done it for speech. Of course, if you have speech, you have extra information (eg., pitch) which might be useful. But given my Japanese friend's reaction to my speech pattern (my voice is rather low), there has to be more going on. And I'm not convinced that what is going on will be the same between written text and (say) transcribed speech. If someone wanted to try such an experiment for non-English text, you could probably just mine non-English from some social networking site (like myspace or facebook), where people tend to list their genders. I'm not sure how to do it for speech. Maybe there's some speech transcription corpus out there that's annotated with gender, but I don't know what it is. Although I don't see a huge financial marked out there for an answer, I'm personally curious what it is about my English writing patterns that made the web site I refered to earlier strongly convinced that I'm male, and what it is about my Japanese speech patterns that make it clear that I'm not.

## 04 October 2007

### F-measure versus Accuracy

I had a bit of a revelation a few years ago. In retrospect, it's obvious. And I'm hoping someone else out there hasn't realized this because otherwise I'll feel like an idiot. The realization was that F-measure (for a binary classification problem) is not invariant under label switching. That is, if you just change which class it is that you call "positive" and which it is that you call "negative", then your overall F-measure will change.

What this means is that you have to be careful, when using F-measure, about how you choose which class is the "positive" class.

On the other hand, the simple "accuracy" metric is (of course) invariant under label switching. So when using accuracy, you needn't worry about which class you consider "positive."

In the olden days, when people pretty much just used F-measure to analyze things like retrieval quality, this wasn't a problem. It was "clear" which class was positive (good documents) and which was negative. (Or maybe it wasn't...) But now, when people use F-measure to compare results on a large variety of tasks, it makes sense to ask: when is accuracy the appropriate measure and when is F the appropriate measure?

I think that, if you were to press anyone on an immediate answer to this question, they would say that they favor F when one of the classes is rare. That is, if one class occurs only in 1% of the instances, then a classifier that always reports "the other class" will get 99% accuracy, but terrible F.

I'm going to try to convince you that while rarity is a reasonable heuristic, there seems to be something deeper going on.

Suppose I had a bunch of images of people drinking soda (from a can) and your job was to classify if they were drinking Coke or Pepsi. I think it would be hard to argue that F is a better measure here than accuracy: how would I choose which one is "positive." Now, suppose the task were to distinguish between Coke and Diet Dr. Pepper. Coke is clearly going to be the majority class here (by a long shot), but I still feel that F is just the wrong measure. Accuracy still seems to make more sense. Now, suppose the task were to distinguish between Coke and "anything else." All of a sudden, F is much more appealing, even though Coke probably isn't much of a minority (maybe 30%).

What seems to be important here is the notion of X versus not-X, rather than X versus Y. In other words, the question seems to be: does the "not-X" space make sense?

Let's consider named entity recognition (NER). Despite the general suggestion that F is a bad metric for NER, I would argue that it makes more sense than accuracy. Why? Because it just doesn't make sense to try to specify what "not a name" is. For instance, consider the string "Bill Clinton used to be the president; now it's Bush." Clearly "Bill Clinton" is a person. But is "Clinton used"? Not really. What about "
Bill the"? Or "Bill Bush"? I think a substantial part of the problem here is not that names are rare, but that it's just not reasonable to develop an algorithm that finds all not-names. They're just not well defined.

This suggests that F is a more appropriate measure for NER than, at least, accuracy.

One might argue--and I did initially myself--that this is an artifact of the fact that names are often made up of multiple words, and so there's a segmentation issue. (The same goes for the computer vision problem of trying to draw bounding boxes around humans in images, for which, again, F seems to make more sense.)

But I think I've been convinced that this isn't actually the key issue. It seems, again, that what it boils down to is that it makes sense to ask one to find Entities, but it doesn't make sense to ask one to find non-Entities, in the same way it it doesn't make sense to ask one to find non-Cokes.

(Of course, in my heart of hearts I believe that you need to use a real--i.e., type 4--evaluation metric, but if you're stuck without this, then perhaps this yields a reasonable heuristic.)

## 27 September 2007

### Bootstrapping

There are many bootstrapping algorithms, but they all have (roughly) the same general form. Suppose we want to solve a binary classification problem. We do the following:

1. Build (by hand) a classifier that predicts positive with high precision (and low recall)
2. Build (by hand) a classifier that predicts negative with high precision (and low recall)
3. Apply (1) and (2) to some really huge data set leading to a smaller labeled set (that has high precision)
4. Train a new classifier on the output of (3)
5. Apply the new classifier to the original data set and go to (4)
Note that the hand-built classifier can alternatively just be a hand-labeling of a small number of "representative" examples.

I've never actually used such an algorithm before, but I hear that they work pretty well.

There's always been one thing that's surprised me, though. And that's the fact that they work pretty well.

There are two reasons why the fact that these algorithms work surprises me. First, there is often a danger that the classifier learned in (4) simply memorizes what the rule-based classifier does. Second, there is a domain adaptation problem.

To be more explicit, let me give an example. Suppose I want to classify sentences as subjective or objective. This is a well studied problem to which bootstrapping algorithms have been applied. The first two steps involve inventing rule-based classifiers that get high precision on the "predict-subjective" and "predict-objective" tasks. One way to do this is to create word lists. Get a list of highly subjective words and a list of highly objective words (okay this latter is hard and you have to be clever to do it, but let's suppose that we could actually do it). Now, to make our rule based classifier, we might say: if a sentence contains at least two positive words and no negative words it's positive; if a sentence contains at least two negative words and no positive words it's negative; else, punt.

Now, we apply this word-lookup-based classifier on a large data set and extract some subset of the sentences as labeled. In the very simplest case, we'll extract bag-of-words features from these sentences and train up a simple classifier.

This is where things go haywire for my understanding.

First, by using a bag-of-words model, there is always the possibility that the classifier we learn will simply mimic the rule-based classifiers on the training data and do random stuff on everything else. In other words, since the information used to create the rule-based classifier exists in the training data, there's no guarantee that the classifier will actually learn to generalize. Essentially what we hope is that there are other, stronger, and simpler signals in the data that the classifier can learn instead.

One might think that to get around this problem we would remove from the bag of words any of the words from the initial word list, essentially forcing the classifier to learn to generalize. But people don't seem to do this. In fact, in several cases that I've seen, people actually include extra features of the form "how many positive (resp. negative) words appear in this sentence." But now we have a single feature that will allow us to replicate the rule-based classifier. This scares me greatly.

Second, our goal from this process is to learn a classifier that has high accuracy on the distribution of "all English sentences (perhaps from some particular domain)." But the classifier we train in the first iteration is trained on the distribution of "all English sentences (perhaps...) such that the sentence contains >=2 positive and 0 negative, or >=2 negative and 0 positive words." The fact that this will generalize is totally not obvious.

And yet these things seem to work pretty well. I just don't quite understand why.

## 20 September 2007

### Mark-up Always the Wrong Tree?

Almost a year ago I responded to a very interesting article in CL. The substance of the article is that we have to be careful when we annotate data lest we draw incorrect conclusions. In this post I'm going to take a more extreme position. It's not necessarily one I agree with 100%, but I think it's worth more than just a brief consideration.

Proposition: mark-up is always a bad idea.

That is: we should never be marking up data in ways that it's not "naturally" marked up. For instance, part-of-speech tagged data does not exist naturally. Parallel French-English data does. The crux of the argument is that if something is not a task that anyone performs naturally, then it's not a task worth computationalizing.

Here's why I think this is a reasonable position to take. In some sense, we're striving for machines that can do things that humans do. We have little to no external evidence that when humans (for instance) perform translation, that they also perform part-of-speech tagging along the way. Moreover, as the CL article mentioned above nicely points out, it's very easy to confuse ourselves by using incorrect representations, or being lazy about annotating. We may be happy to speculate the humans build up some sort of syntactic representation of sentences inside their heads (and, yes, there is some psychological evidence for something that might correlate with this). But the fact is, simply, that all we can observe are the inputs and outputs of some processes (eg., translation) and that we should base all of our models on these observables.

Despite the fact that agreeing with this proposition makes much of my own work uninteresting (at least from the perspective of doing things with language), I find very few holes in the argument.

I think the first hole is just a "we're not there yet" issue. That is: in the ideal world, sure, I agree, but I don't think we yet have the technology to accomplish this.

The second hole, which is somewhat related, is that even if we had the technology, working on small problems based on perhaps-ill-conceived data will give us insight into important issues. For instance, many summarization people believe that coreference issues are a big problem. Sure, I can imagine an end-to-end summarization system that essentially treats coreference as a "latent variable" and never actually looks and hand-annotated coref data. On the other hand, I have no idea what this latent variable should look like, how it should be influenced, etc. The very process of working on these small problems (like "solving" coref on small annotated data sets) give us an opportunity to better understand what goes in to these problems.

The hole with the second hole :) is the following. If this is the only compelling reason to look at these sub-problems, then we should essentially stop working on them once we have a reasonable grasp. Not to be too hard on POS tagging, but I think we've pretty much established that we can do this task and we know more or less the major ins and outs. So we should stop doing it. (Similar arguments can be made for other tasks; eg., NE tagging in English.)

The final hole is that I believe that there exist tasks that humans don't do simply because they're too big. And these are tasks that computers can do. If we can force some humans to do these tasks, maybe it would be worthwhile. But, to be honest, I can't think of any such thing off the top of my head. Maybe I'm just too closed-minded.

## 11 September 2007

### Journals are on the Mind

Anyone who has ever read this blog before knows that I'm a huge supporter of moving our Computational Linguistics journal to open access. Those who talk to me, know I'm in favor of more drastic measures, but would be content with just this one small change. I'm not here to beat a horse (living or dead), but to say that this appears to be something on many people's minds. I just got an email for voting for new members of the ACL board. I think only members get to vote, but I don't see anything that says that everyone can't view the statements. The list of candidates with their statements is here.

What I find promising is how often open access CL is mentioned! Indeed, both candidates for VP-elect say something. Ido writes:

Among other possibilities, the proposal to make CL open access, with a shorter reviewing cycle, seems worth pursuing. Additionally, we can increase the annual capacity of journal papers and include shorter ones.
And Jan writes:
Third, ACL's role is to help the widest possible dissemination of the field's results, including novel ways such as electronic and open-access publications, training workshops and summer schools (to attract excellent "new blood").
The issue comes up again for one of the Exec members. Hwee Tou says "I believe some issues of current concern to the ACL membership include the role of open access journals..."

Obviously I'm not here to tell anyone who to vote for. Indeed, since both candidates for VP mention the issue, it's not really a dividing point! But I'm very very very pleased to see that this has become something of an important front.

## 05 September 2007

### Word order and what-not

I've recently been looking at some linguistic issues related to word order phenomena. This is somewhat inline with the implicational universals stuff that I am working on with Lyle Campbell, but also somewhat tangential.

Here's a little background.

Most linguists tend to believe in things like nouns, verbs, adjectives, subjects and objects, genitives, adpositions (prepositions and postpositions), etc. Indeed, these are some of the basic building blocks of typological studies of word order. A common example is that languages that are OV (i.e., the object precedes the verb) are also postpositional (think Hindi or Japanese). On the other hand, VO languages are also prepositional (think English).

The general set of orders that are considered important are: V/O, Adj/N, Gen/N, PrepP/PostP and a few others. However, one of these is underspecified. In particular, PrepP/PostP tells us nothing about the placement of the embedded clause with respect to its head.

For instance, in English, which is PrepP, the head precedes the PP ("The man *with* the axe", axe comes after man). Or, in Japanese, which is PostP, the head comes after the PP (glossed: "the axe *with* the man" -- meaning that the man has the axe). However, it is unclear if other orders are possible. For instance, are there PrepP languages for which "with the axe" ("with" has to come before "the axe" in a PrepP language) precedes "the man", something like "with the axe the man". Or, are there PostP languages for which "the axe-with" comes after the head ("the man the axe-with"). I certainly don't know any, but I know enough about maybe 4 or 5 languages out of around 7000 to tell. It seems like interpretation in such a language would be difficult, but of course that doesn't stop German from separating verbs and auxiliaries and indeed Germans don't seem to have a hard time understanding each other.

A second thing that is left unanswered is how these relate to each other. Consider Gen/N and Adj/N. If you are a GenN + AdjN language, which comes first? In English, the Gen has to ("The man's happy brother" not "The happy man's brother" -- the latter means that it's the man, not the brother, who is happy). Is this reversible for a given language, or are the any languages that allow both? Again, it seems like it would make interpretation difficult.

I've asked the few typologists that I know these two questions and they actually haven't known. I'm hoping that a query to the blogosphere (a word I hate) and a query to linguist-list will turn up something. I'll post anything I hear here.

## 28 August 2007

### The Hierarchical Bayes Compiler

I've been working for a while on a compiler for statistical models. (Those of you who knew me back in the day will know that I used to be a programming languages/compiler nut.) The basic idea is to have a language in which you can naturally express hierarchical Bayesian models and either simulate them directly or compile them into your language of choice (in this case, right now, the only language you can choose in C :P). The key differentiation between HBC (the Hierarchical Bayes Compiler) and tools like WinBugs is that you're actually supposed to be able to use HBC to do large-scale simulations. Also, although right now it only supports Gibbs sampling, message passing and stochastic EM variants are in the works. It also almost knows how to automatically collapse out variables (i.e., automatic Rao-Blackwellization), but this is still a bit buggy.

Anyway, you can see more information on the official website. I'd really appreciate people who are interested in sending me bug reports if you encounter any problems. It's a pretty complex bit of machinery and some of it is still hacked together rather than done properly (especially type checking), so I expect to find some bugs there.

So just to whet your appetite if you haven't yet clicked on the above link, here's a complete implementation of LDA including hyperparameter estimation in HBC:

  alpha     ~ Gam(0.1,1)  eta       ~ Gam(0.1,1)  beta_{k}  ~ DirSym(eta, V)           , k \in [1,K]  theta_{d} ~ DirSym(alpha, K)         , d \in [1,D]  z_{d,n}   ~ Mult(theta_{d})          , d \in [1,D] , n \in [1,N_{d}]  w_{d,n}   ~ Mult(beta_{z_{d,n}})     , d \in [1,D] , n \in [1,N_{d}]

## 21 August 2007

### Topic modeling: syntactic versus semantic

Topic modeling has turned into a bit of a cottage industry in the NLP/machine learning world. Most seems to stem from latent Dirichlet allocation, though this of course built on previous techniques; the most well-known of which is latent semantic analysis. At the end of the day, such "topic models" really look more like dimensionality reduction techniques (eg., the similarity to multinomial PCA); however, in practice, they're often used as (perhaps soft) clustering methods. Words are mapped to topics; topics are used as features; this is fed into some learning algorithm.

One thing that's interested me for a while is that when viewed as clustering algorithms, how these topic models compare with more standard word clustering algorithms from the NLP community. For instance, the Brown clustering technique (built into SRILM) that clusters words based on context. (Lots of other word clustering techniques exist, but they pretty much all cluster based on local context; where local is either positionally local or local in a syntactic tree.)

I think the general high level story is that "topic models" go for semantics while "clustering models" go for syntax. That is, clustering models will tend to cluster words together that appear in similar local context, while topic models will cluster words together that appear in a similar global context. I've even heard stories that when given a choice of using POS tags as features in a model versus Brown clusters, it really don't make a difference.

I think this sentiment is a bit unfair to clustering models. Saying that context-based clustering models only find syntactically similar words is just not true. Consider the example clusters from the original LDA paper (the top portion of Figure 8). If we look up "film" ("new" seems odd) in CBC, we get: movie, film, comedy, drama, musical, thriller, documentary, flick, etc. (I left out multiword entries). The LDA list contains: new, film, show, music, movie, play, musical, best, actor, etc. We never get things like "actor" or "york" (presumably this is why "new" appeared), "love" or "theater", but it's unclear if this is good or not. Perhaps with more topics, these things would have gone into separate topics.

If we look up "school", we get: hospital, school, clinic, center, laboratory, lab, library, institute, university, etc. Again, this is a different sort of list than the LDA list, which contains: school, students, schools, education, teachers, high, public, teacher, bennett, manigat, state, president, etc.

It seems like the syntactic/semantic distinction is not quite right. In some sense, with the first list, LDA is being more liberal in what it considers film-like, with CBC being more conservative. OTOH, with the "school" list, CBC seems to be more liberal.

I realize, of course, that this is comparing apples and oranges... the data sets are different, the models are different, the preprocessing is different, etc. But it's still pretty clear that both sort of models are getting at the same basic information. It would be cool to see some work that tried to get leverage from both local context and global context, but perhaps this wouldn't be especially beneficial since these approaches---at least looking at these two lists---don't seem to produce results that are strongly complementary. I've also seen questions abound regarding getting topics out of topic models that are "disjoint" in some sense...this is something CBC does automatically. Perhaps a disjoint-LDA could leverage these ideas.

## 08 August 2007

### Conferences: Costs and Benefits

I typically attend two or three conferences per year; usually NIPS (which has been in Vancouver since I started attending), and an ACL-related one; the third is typically a second ACL-related conference or ICML, depending on the year. Typically two of these are domestic, one is international. Domestic conferences cost me about $1500 and international ones vary, but Prague weighed in at around$4000. This means that my travel costs (just for myself!) are about $5500-$7000 per year. Moreover, this takes 2-3 weeks of my year (more than 5% of my non-vacation time). When I was a student, this question never entered my mind (I seemed to have a nearly endless supply of money); now, I find myself wondering: are conferences worth the time and money investment?

I'll focus on international conferences because these are the biggest sink in terms of both money and time. In particular, I'll consider Prague, which hosted both ACL and EMNLP. Here's what I feel like I gained from this trip:

1. I saw some interesting papers presented.
2. I saw some interesting invited talks (Tom Mitchell's stands out for me).
3. I had semi-deep hallway conversations with 3 or 4 people.
4. I had non-deep hallway conversations with probably ~20 people.
5. I gave two presentations. (The implication is that this may make me "more famous" and that this is a good thing for some reason :P.)
6. I saw an area of the world that I hadn't yet been to.
7. I spent a not insignificant amount of time socializing with ~20 friends who I pretty much only see at conferences.
So the question is, was this worth just over $4 grand and 10 days of my life that could have been spent doing research (or taking a vacation)? I have mixed feelings. 1. does not seem compelling -- for conferences close to me that I do not attend, I still read proceedings. Sure, sometimes presentations are helpful and there's a bit of a serendipity aspect, but overall, I'd say this is something I could do in a day in the park with a copy of the proceedings. 2. is important. Especially when the invited talks are good and aren't just a long version of some paper presentation---i.e., when you can get a good sense of the overall research direction and the important long term results---I feel like these things are worth something. 3. is important. Some people say that hallway conversations are the most important; maybe it's just me, but it's pretty rare for me to have hallway conversations that are sufficiently deep to be really meaningful in the long run, but I'd say around 3 per conferences is something that you can hope for. At least with these, they seem to have either led to collaboration or at least new ideas to try out in my own work. 4. provides good social networking... I don't feel like these really change how I think about problems (and I think the same is true for the people I had such conversations with). The only important thing here is if you find out about what new problems other people are working on, you can learn about new areas that may interest you. 5. is nebulous to me; I feel like the key purpose in conference talks is advertisement. It's a bit unclear what I'm advertising for---citations, perhaps?---but hopefully something I've done will save someone else some time, or will give them ideas of something to try or something along these lines. But this is highly correlated with (1), which suggests that it's actually not particularly useful. 6. shouldn't be underestimated, but if I compare taking a vacation with going to a conference, they're very different. In particular, even at a conference where I a priori intend to spend a bunch of time touristing, I never seem able to accomplish this as much as I would like. Of course,$4k out of grant money versus $4k out of personal money is very different. 7. also shouldn't be underestimated, but like (6) is maybe best accomplished in other ways. Based on this, I feel like overall the main benefits to going to a conference are: seeing invited talks, having deep hallway conversations, and a minor bit of socializing and serendipity. The thing that occurred to me recently is that it's actually possible to achieve these things without going to conferences. In particular, consider the following model. I invite one or two "famous types" to my university to give invited talks. Each of these would cost maybe$2000, but a lot (if not all) of this would be subsidized by the department. So I get invited talks for (nearly) free; for safety, even say it costs me $1k. I now have$3k left. With this $3k I tour around the country and spend a few days at different labs/universities and meet with people. If I know someone well enough at a lab, I can probably stay with them, which means my only real cost is airfare (assuming their university doesn't want to invite me and pay for it) and incidentals. For domestic flights, it's hard to imagine that I wouldn't be able to pull off each of this for around$750. And that eats up the \$4k.

What do I get out of this model? Well, I'd give talks at four universities. This is not quite as broad an audience as at a conference, but they're more focused and my talk can be longer. Instead of having semi-deep hallway conversations, at the very least I get 4 very deep office conversations, which is potentially much more useful. I get one or two invited talks per year, by people that I choose (modulo availability).

What do I lose? I lose seeing other papers presented, which I don't think is too serious. I lose socializing and touristing (in foreign countries). This is too bad, but is perhaps better served by a legitimate vacation. The only other big thing I lose is conversations with multiple people simultaneously (eg., in Prague, one of my "good" conversations was with Ryan McDonald and Joakim Nivre... this would not be possible under my proposed model). I also lose seeing questions and answers asked at talks, which are occasionally quite interesting, but also something that I'm willing to live with out.

Overall, I think the biggest thing I would lose is a sense of community, which I think is hard to quantify and yet still important. Though, I'm also not proposing that I would never go to a conference, but that maybe 2-3 per year is overkill for the benefits obtained (especially for expensive destinations). If I went to one domestic (I count Canada as domestic) conference per year and visited 2-3 other sites, I'm not sure that I'd be any worse off. (Of course, the fact that I'm in the States helps here... you probably couldn't get away with this model outside of US/Canada.)

## 01 August 2007

### Explanatory Models

I am frequently asked the question: why does your application for solving XXX make such and such an error? (You can easily replace "your" with any possessive noun and the statement remains valid.)

My standard answer is to shrug and say "who knows."

This is quite different from, for instance, work in pattern matching for information extraction (many other citations are possible). In this setting, when the system makes an error, one can ask the system "what pattern caused this error." You can then trace the pattern back to the source documents from which it came and obtain some understanding for what is going on.

This is frequently not the case for your generic sequence labeling algorithm. If, say, a CRF misses a person name, what can you do about it? Can you understand why it made the error. More generally, if a model of any variety errs, can it say anything about why this error came to be.

One way to approach this problem would be to try to inspect the weights of the learned algorithm, but there are so many trade-offs going on internally that I've never been able to do this successfully (by hand, at least --- perhaps a clever tool could help, but I'm not sure). An alternative that I've been thinking about recently (but probably won't work on because I haven't enough time right now) is instead to pose the question as: what is the minimal change to the input required so that I would have made the decision correctly.

I guess one way to think about this is consider the case that a POS system misses tagging "Fred" in the sentence "Fred is not happy" as an NNP and instead calls it a "VBD." Presumably we have a bunch of window features about Fred that give us its identify, prefixes and suffixes, etc. Perhaps if "Fred" had been "Harry" this wouldn't have happened because "Fred" has the spelling feature "-ed." (Ok, clearly this is a very crafted example, but you get the idea.)

The question is: how do you define minimal change in features. If we're in an HMM (where we're willing to assume feature independence), then I don't think this is a big problem. But in the case where "Fred" ends in "-ed", it also ends in "-d" and both of these make it look more like a VBD. Such an explanatory system would ideally like to know that if "-d" weren't there, then neither would be "-d" and use this for computing the minimal change. It would also have to know that certain features are easier to change than others. For instance, if it has only ever seen the word "Xavier" in the training data as an NNP, then it could also suggest that if the word were "Xavier" instead of "Fred" than this error would not have happened. But this is sort of silly, because it gives us no information. (I'm working under the assumption that we want to do this so that we can add/remove features to our model to help correct for errors [on development data :P].)

It seems like neither of these problems is insurmountable. Indeed, just looking at something like feature frequency across the entire training set would give you some sense of which features are easy to change, as well as which ones are highly correlated. (I guess you could even do this on unlabeled data.)

I feel like it's possible to create a methodology for doing this for a specific problem (eg., NE identification), but I'd really like to see some work on a more generic framework that can be applied to a whole host of problems (why did my MT system make that translation?). Perhaps something already exists and I just haven't seen it.

## 19 July 2007

### What's the Use of a Crummy Translation?

I'm currently visiting Microsoft Research Asia (in Beijing) for two weeks (thanks for having me, guys!). I speak basically no Chinese. I took one half of a semester about 6 years ago. I know much more Japanese; enough so that I can read signs that indicate direction, dates and times, but that's about it... the remainder is too divergent for me to make out at all (perhaps a native Japanese speaker would feel differently, but certainly not a gaijin like me).

My experience here has reminded me of a paper that Ken Church and Ed Hovy wrote almost 15 years ago now, Good Applications for Crummy Machine Translation. I'm not sure how many people have read it recently, but it essentially makes the following point: MT should enter the users world in small steps, only insofar as it is actually going to work. To say that MT quality has improved significantly in 15 years is probably already an understatement, but it is still certainly far from something that can even compare to translation quality of a human, even in the original training domain.

That said, I think that maybe we are a bit too modest as a community. MT output is actually relatively readable these days, especially for relatively short input sentences. The fact that "real world companies" such as Google and LanguageWeaver seem to anticipate making a profit off of MT shows that at least a few crazies out there believe that it is likely to work well enough to be useful.

At this point, rather than gleefully shouting the glories of MT, I would like to point out the difference between the title of this post and the title of the Church/Hovy paper. I want to know what to do with a crummy translation. They want to know what to do with crummy machine translation. This brings me back to the beginning of this post: my brief experience in Beijing. (Discourse parsers: I challenge you to get that dependency link!)

• This voucher can not be encashed and can be used for one sitting only.
• The management reserves the right of explanation.
• Office snack is forbidden to take away.
• Fizzwater bottles please recycle.
The first two are at my hotel, which is quite upscale; the second two are here on the fridge at Microsoft. There are so many more examples, in subway stations, on the bus, on tourism brochures, in trains, at the airport, I could go on collecting these forever. The interesting this is that although two of these use words that aren't even in my vocabulary (encashed and fizzwater), one is grammatical but semantically nonsensical (what are they explaining?) and one is missing an indirect object (but if it had one, it would be semantically meaningless), I still know what they all mean. Yes, they're sometimes amusing and worth a short chuckle, but overall the important points are gotten across: no cash value; you can be kicked out; don't steal snacks; recycle bottles.

The question I have to ask myself is: are these human translations really better than something a machine could produce? My guess is that machine translation outputs would be less entertaining, but I have a hard time imagine that they would be less comprehensible. I guess I want to know: if we're holding ourselves to the standard of a level of human translation, what level is this? Clearly it's not the average translation level that large tourism companies in China hold themselves to. Can we already beat these translations? If so, why don't we relish in this fact?

## 12 July 2007

### Multiclass learning as multitask learning

It's bugged me for a little while that when learning multiclass classifiers, the prior on weights is uniform (even when they're learned directly and not through some one-versus-rest or all-versus-all reduction).

Why does this bug me? Consider our favorite task: shallow parsing (aka syntactic chunking). We want to be able to identify base phrases such as NPs in running text. The standard way to do this is to do an encoding of phrase labels into word labels and apply some sequence labeling algorithm. The standard encoding is BIO. A sentence like "The man ate a sandwich ." would appear as "B-NP I-NP B-VP B-NP I-NP O" with "B-X" indicating the beginning of a phrase of type X, and "I-X" indicating being "inside" such a phrase ("O", assigned to "." is "outside" of a chunk).

If we train, eg., a CRF to recognize this, then (typically) it considers B-NP to be a completely independent of I-NP; just as independent as it is of "O". Clearly this is a priori a wrong assumption.

One way I have gotten around this problem is to actually explicitly parameterize my models with per-class features. That is, rather than having a feature like "current word is 'the'" and making K copies of this feature (one per output label); I would have explicitly conjoined features such as "current word is 'the' and label is 'B-NP'". This enables me to have features like "word=the and label is B-?" or "word=the and label is ?-NP", which would get shared across different labels. (Shameless plug: megam can do this effortlessly.)

But I would rather not have to do this. One thing I could do is to make 2^K versions of each feature (K is still the number of labels), where each encodes some subset of active features. But for large-K problems, this could get a bit unwieldy. Pairwise features would be tolerable, but then you couldn't get the "B-?" sort of features I want. There's also no obvious kernel solution here, because these are functions of the output label, not the input.

It seems like the right place for this to happen is in the prior (or the regularizer, if you're anti-probabilistic models). Let's say we have F features and K classes. In a linear model, we'll learn F*K weights (okay, really F*(K-1) for identifiability, but it's easier to think in terms of F*K). Let's say that a prior we know that classes j and k are related. Then we want the prior to favor w(:,j) to be similar to w(:,k). There are a variety of ways to accomplish this: I think that something along the lines of a recent NIPS paper on multitask feature learning is a reasonable way to approach this.

What this approach lacks in general is the notion that if classes j and k "share" some features (i.e., they have similar weights), then they're more likely to "share" other features. You could do something like task clustering to achieve this, but that seems unideal since I'd really like to see a single unified algorithm.

Unfortunately, all attempts (on my part) to come up with a convex regularizer that shares these properties has failed. I actually now think that it is probably impossible. The problem is essentially that there is bound to be some threshold controlling whether the model things classes j and k are similar and below this threshold the regularizer will prefer w(:,j) and w(:,k) independently close to zero; above this threshold, it will prefer w(:,j) and w(:,k) to be close to each other (and also close to zero). This is essentially the root of the non-convexity.

## 08 July 2007

### Collapsed Gibbs

(The contents of this post are largely due to a conversation with Percy Liang at ACL.)

I'm a big fan of Gibbs sampling for Bayesian problems, just because it's so darn easy. The standard setup for Gibbs sampling over a space of variables a,b,c (I'll assume there are no exploitable independences) is:

1. Draw a conditioned on b,c
2. Draw b conditioned on a,c
3. Draw c conditioned on a,b
This is quite a simple story that, in some cases, be "improved." For instance, it is often possible to jointly draw a and b, yielding:
1. Draw a,b conditioned on c
2. Draw c conditioned on a,b
This is the "blocked Gibbs sampler." Another variant, that is commonly used in our community, is when one of the variables (say, b) can be analytically integrated out, yielding:
1. Draw a conditioned on c
2. Draw b conditioned on a,c
3. Draw c conditioned on a,b
This is the "collapsed Gibbs sampler." In fact, we can often collapse b out entirely and, in cases where we don't actually care about it's value, we get:
1. Draw a conditioned on c
2. Draw c conditioned on a
To make this concrete, consider Mark Johnson's EMNLP paper on unsupervised part of speech tagging. Here, there are essentially two types of variables. One set are the tags assigned to each word. The second set are the parameters (eg., probability of word given tag). The standard Gibbs sampler would perform the following actions:
1. For each word token, draw a tag for that word conditioned on the word itself, the tag to the left, and the "probability of word given tag" parameters.
2. For each tag type (not token), draw a multinomial parameter vector for "probability of word given tag" conditioned on the current assignment of tags to words.
It turns out that if our prior on "p(word|tag)" is Dirichlet, we can collapse out the second step by integrating over these parameters. This yields a one-step collapsed Gibbs sampler:
1. For each word token, draw a tag for that word conditioned on the word itself, the tag to the left, and all other current assignments of tags to words.
My general M.O. has been: if you can collapse out a variable, you should. This seems intuitively reasonable because you're now sampling over a smaller space and so it should be easier.

The point of this post is that acknowledge that this may not always be the case. In fact, it's sort of obvious in retrospect. There are many models for which auxiliary variables are added just to make the sampling easier. This is, in effect, un-collapsing the sampler. If "always collapse" is a good rule to follow, then people would never add auxiliary variables.

While this is a convincing argument (for me, at least), it's not particularly intuitive. I think that the intuition comes from considering the mixing rate of the Markov chain specified by the standard Gibbs sampler and the collapsed Gibbs sampler. It seems that essentially what's happening by using a collapsed sampler is that the variance of the Markov chain is decreasing. In the tagging example, consider a frequent word. In the collapsed setting, the chance that the tag for a single token of this word will change in a Gibbs step is roughly inversely proportional to its term frequency. This means that the collapsed sampler is going to have a tendency to get stuck (and this is exactly what Mark's results seem to suggest). On the other hand, in the uncollapsed case, it is reasonably plausible that a large number of tags could change for a single word type "simultaneously" due to a slightly different draw of the "p(word|tag)" parameter vector.

(Interestingly, in the case of LDA, the collapsed sampler is the standard approach and my sense is that it is actually somehow not causing serious problems here. But I actually haven't seen experiments that bear on this.)

## 02 July 2007

### ACL and EMNLP 2007 Report

ACL/EMNLP just concluded. Overall, I thought both conferences were a success, though by now I am quite ready to return home. Prague was very nice. I especially enjoyed Tom Mitchell's invited talk on linking fMRI experiments to language. They actually use lexical semantic information to be able to identify what words people are thinking about when they scan their brains. Scary mind-reading stuff going on here. I think this is a very interesting avenue of research---probably not one I'll follow myself, but one that I'm really happy someone is persuing.

This is a really long post, but I hope people (both who attended and who didn't) find it useful. Here are some highlights, more or less by theme (as usual, there are lots of papers I'm not mentioning, the majority of which because I didn't see them):

Machine Translation:

The overall theme at the Stat-MT workshop was that it's hard to translate out of domain. I didn't see any conclusive evidence that we've figure out how to do this well. Since domain adaptation is a topic I'm interested in, I'd like to have seen something. Probably the most interesting paper I saw here was about using dependency order templates to improve translation, but I'm actually not convinced that this is actually helping much with the domain issues: the plots (eg., Fig 7.1) seem to indicate that the improvement is independent of domain when compared to the treelet system. They do do better than phrases though. There was a cute talk about trying character-based models for MT. Doesn't seem like this does much of interest, and it only really works for related languages, but it was nice to see someone try. This idea was echoed later in EMNLP but none of the talks there really stood out for me. The only other theme that I picked up at Stat-MT (I didn't stay all day) was that a lot of people are doing some form of syntactic MT now. Phrase-based seems to be on its way out (modulo the next paragraph).

There were also a lot of talks using Philipp Koehn's new Moses translation system, both at Stat-MT as well as at ACL and EMNLP. I won't link you to all of them because they all tried very similar things, but Philipp's own paper is probably a good reference. The idea is to do factored translation (ala factored language modeling) by splitting both the input words and the output words into factors (eg., lemma+morphology) and translating each independently. The plus is that most of your algorithms for phrase-based translation remain the same, and you can still use max-Bleu traning. These are also the cons. It seems to me (being more on the MT side) that what we need to do is rid ourselves of max-Bleu, and then just switch to a purely discriminative approach with tons of features, rather than a linear combination of simple generative models.

There were also a lot of word-alignment talks. The most conclusive, in my mind, was Alex Fraser's (though I should be upfront about bias: he was my officemate for 5 years). He actually introduced a new generative alignment model (i.e., one that does have "IBM" in the name) that accounts for phrases directly in the model (no more symmetrization, either). And it helps quite a bit. There was also a paper on alignments tuned for syntax by John DeNero and Dan Klein that I liked (I tried something similar previously in summarization, but I think their model makes more sense). (A second bias: I have known John since I was 4 years old.)

The google folks had a paper on training language models on tera-word corpora. The clever trick here is that if your goal is a LM in a linear model (see below), it doesn't matter if its normalized or not. This makes the estimation much easier. They also (not too surprisingly) find that when you have lots of words, backoff doesn't matter as much. Now, if only the google ngram corpus included low counts. (This paper, as well as many other google papers both within and without MT, makes a big deal of how to do all this computation in a map-reduce framework. Maybe it's just me, but I'd really appreciate not reading this anymore. As a functional programming languages guy, map-reduce is just map and fold. When I've written my code as a map-fold operation, I don't put it in my papers... should I?)

The talk that probably got the most attention at EMNLP was on WSD improving machine translation by Marine Carpuat and Dekai Wu. I think Marine and Dekai must have predicted a bit of difficulty from the audience, because the talk was put together a bit tongue in cheek, but overall came across very well. The key to getting WSD to help is: (a) integrate in the decoder, (b) do WSD on phrases not just words, and (c) redefine the WSD task :). Okay, (c) is not quite fair. What they do is essentially train a classifier to do phrase prediction, rather than just using a t-table or a phrase-table. (Actually, Berger and the Della Pietras did this back in the 90s, but for word translation.) Daniel Marcu nicely complimented that he would go back to LA and tell the group that he saw a very nice talk about training a classifier to predict phrases based on more global information, but that he may not mention that it was called WSD. They actually had a backup slide prepared for this exact question. Personally, I wouldn't have called it WSD if I had written the paper. But I don't think it's necessarily wrong to. I liken it to David Chiang's Hiero system: is it syntactic? If you say yes, I think you have to admit that Marine and Dekai's system uses WSD. Regardless of what you call it, I think this paper may have quite a bit of impact.

(p.s., MT-people, listen up. When you have a model of the form "choose translation by an argmax over a sum over features of a weight times a feature value", please stop refering to it as a log-linear model. It's just a linear model.)

Machine Learning:

Daisuke Okanohara and Jun'ichi Tsujii presented a paper on learning discriminative language models with pseudo-negative samples. Where do they get their negative samples? They're just "sentences" produced by a trigram language model! I find it hard to believe no one has done this before because it's so obvious in retrospect, but I liked it. However, I think they underplay the similarity to the whole-sentence maxent language models from 2001. Essentially, when one trains a WSMELM, one has to do sampling to compute the partition function. The samples are actually generated by a base language model, typically a trigram. If you're willing to interpret the partition funciton as a collection of negative examples, you end up with something quite similar.

There were two papers on an application of the matrix-tree theorem to dependency parsing, one by the MIT crowd, the other by the Smiths. A clever application (by both sets of authors) of the m-t theorem essentially allows you to efficiently (cubic time) compute marginals over dependency links in non-projective trees. I think both papers are good and if this is related to your area, it's worth reading both. My only nit pick is the too-general title of the MIT paper :).

John Blitzer, Mark Drezde and Fernando Pereira had a very nice paper on an application of SCL (a domain adaptation technique) to sentiment classification. Sentiment is definitely a hot topic (fad topic?) right now, but it's cool to see some fancy learning stuff going on there. If you know SCL, you know roughly what they did, but the paper is a good read.

One of my favorite papers at ACL was on dirichlet process models for coreference resolution by Aria Haghighi and Dan Klein. I'd say you should probably read this paper, even if it's not your area.

One of my favorite papers at EMNLP was on bootstrapping for dependency parsing by David Smith and Jason Eisner. They use a clever application of Renyi entropy to obtain a reasonable bootstrapping algorithm. I was not aware of this, but during the question period, it was raised that apparently these entropy-based measures can sometimes do funky things (make you too confident in wrong predictions). But I think this is at least somewhat true for pretty much all semi-supervised or bootstrapping models.

Random other stuff:

I learned about system combination from the BBN talk. The idea here is to get lots of outputs from lots of models and try to combine the outputs in a meaningful way. The high-level approach for translation is to align all the outputs using some alignment technique. Now, choose one as a pivot. For each aligned phrase in the pivot, try replacing it with the corresponding phrase from one of the other outputs. It's kinda crazy that this works, but it helps at least a few bleu points (which I'm told is a lot). On principle I don't like the idea. It seems like just a whole lot of engineering. But if you're in the "get good results" game, it seems like just as good a strategy as anything else. (I'm also curious: although currently quite ad-hoc, this seems a lot like doing an error-correcting output code. Does anyone know if it has been formalized as such? Do you get any gains out of this?)

My final blurb is to plug a paper by MSR on single-document summarization. Yes, that's right, single-document. And they beat the baseline. The cool thing about this paper is that they use the "highlights" put up on many CNN news articles as training. Not only are these not extracts, but they're also "out of order." My sense from talking to the authors is that most of the time a single highlight corresponds to one sentence, but is simplified. I actually downloaded a bunch of this data a month ago or so (it's annoying -- CNN says that you can only download 50 per day and you have to do it "manually" -- it's unclear that this is actually enforceable or if it would fall under fair use, but I can understand from Microsoft's perspective it's better safe than sorry). I was waiting to collect a few more months of this data and then release it for people to use, so check back later. (I couldn't quite tell if MSR was going to release their version or not... if so, we should probably talk and make sure that we don't waste effort.)

Wrapping Up:

Since we're ending on a summarization note, here's a challenge: create a document summarization system that will generate the above post from the data in the anthology. (Okay, to be fair, you can exclude the information that's obtained from conversations and questions. But if we start videotaping ACL, then that should be allowable too.)

I put pictures from Prague up on my web site; feel free to email me if you want a high-res version of any of them. Also, if I was talking to you at the conference about sequential Monte Carlo, email me -- I have new info for you, but I can't remember who you are :).

## 26 June 2007

### ACL Business Meeting Results

This afternoon here in Prague was the ACL business meeting. A few interesting points were brought up. As well all know, ACL will be in Columbus, OH next year. It will actually be joint with HLT, which means that (as I previously expected), there won't be a separate HLT next year. Combining with the fact that when ACL is in north america, there is no NAACL, it looks like there will only be one north american conference next year (unless EMNLP--which is now officially a conference--chooses not to co-locate with ACL/HLT). The paper submission deadline looks to be around 11 Jan -- calls will be out in September. EACL 2008 will be in Greece.

The new information: ACL 2009 will be in Singapore, which was one of my two guesses (the other being Beijing). This should be a really nice location, though I'm saddened since I've already been there.

A few changes have been proposed for ACL 2008 in terms of reviewing. None will necessarily happen, but for what it's worth I've added my opinion here. If you have strong feelings, you should contact the board, or perhaps Kathy McKoewn, who is the conference chair.

• Conditional accepts and/or author feedback. I'd be in favor of doing one of these, but not both (redundancy). I'd prefer author feedback.

• Increased poster presence with equal footing in the proceedings, ala NIPS. I would also be in favor of this because already we are at four tracks and too much is going on. Alternatively, we could reduce the number of accepted papers, which I actually don't think would be terrible, but going to posters seems like a safer solution. The strongest argument against this is a personality one: ACLers tend to ignore poster sessions. Something would have to be doing about this. Spotlights may help.

• Wildcards from senior members. The idea would be that "senior" (however defined?) members would be able to play a single wildcard to accept an otherwise controversial paper. I'm probably not in favor of this, partially because it seems to introduce annoying political issues "What? I'm not senior enough for you?" (I wouldn't say that, since I'm not, but...); partially because it seems that this is essentially already the job of area chairs. There may be a problem here, but it seems that there are better, more direct solutions.

• Something having to do with extra reviewing of borderline papers. I didn't quite get what was meant here; it didn't seem that the proposal was to have fewer than 3 reviews, but to ask for more in case of confusion. I would actually argue for being more extreme: have a single (maybe 2) reviewer to an initial round of rejects and then get three reviews only for those papers that have any chance at all of being accepted. I doubt this idea will fly, though, but it would be interesting to check in previous papers how many got in that had one reviewer give a really bad score.... how many got in that two reviewers gave a really bad score. If these numbers are really really low, then it should be safe. Anyone have access to this data???
Finally, we talked about the "grassroots" efforts. The proposals were: archive videos, augment the anthology to include link structure, augmenting the anthology with tech reports and journals (given permission from publishers), and ours to make CL open access. Speaking with respect to ours, the only big complains were with respect to typesetting information, but several people did voice support, both in the meeting and in person. I remain hopeful!

I'll post more about technical content after the conference.