natural language processing blog

08 August 2008

Parallel Sampling

I've been thinking a lot recently about how to do MCMC on massively parallel architectures, for instance in a (massively) multi-core setup (either with or without shared memory).

There are several ways to approach this problem.

The first is the "brain dead" approach. If you have N-many cores, just run N-many parallel (independent) samplers. Done. The problem here is that if N is really like (say 1000 or greater), then this is probably a complete waste of space/time.

The next approach works if you're doing something like (uncollapsed) Gibbs sampling. Here, the Markov blankets usually separate in a fairly reasonable way. So you can literally distribute the work precisely as specified by the Markov blankets. With a bit of engineering, you can probably do this is a pretty effective manner. The problem, of course, is if you have strongly overlapping Markov blankets. (I.e., if you don't have good separation in the network.) This can happen either due to model structure, or due to collapsing certain variables. In this case, this approach just doesn't work at all.

The third approach---and the only one that really seems plausible---would be to construct sampling schemes exclusively for a massively parallel architecture. For instance, if you can divide your variables in some reasonable way, you can probably run semi-independent samplers that communicate on an as-needed basis. The form of this communication might, for instance, look something like an MH-step, or perhaps something more complex.

At any rate, I've done a bit of a literature survey to find examples of systems that work like this, but have turned up surprisingly little. I can't imagine that there's that little work on this problem, though.

29 July 2008

What Makes a Fair Comparison

This issue seems to come up implicitly or explicitly in a large variety of circumstances. The general set up is this. Suppose that for some task, we have the "de facto" approach "A." I come up with a new approach "B" that I want to argue is better than "A." The catch is that for whatever reasons, I have to limit B in some way (perhaps in terms of the amount of data I train in on). So, is the fair comparison to a similarly limited version of A or to the full-blown A?

Let's instantiate this with two examples. (Yes, these are examples of papers that have been published; you can see if you can figure out which ones they are. I'm just not sufficiently creative to make something up right now.)

Let's say I'm doing syntactic language modeling. In this case, the baseline would be (say) a trigram language model. My model requires parse trees to train. So I train my language model on the Penn Treebank. Now, when I compare to an ngram model, is the fair comparison to an ngram model trained on the Treebank, or one trained on (say) all WSJ from ten years?
Now let's say I'm doing MT. My baseline is Moses and I build some fancy MT system that can only train on sentences of length 10 or less. Should I compare to Moses trained on sentences of length 10 or less, or all sentences?

The obvious response is "report both." But this is just deferring the problem. Instead of you deciding which is the appropriate comparison, you are leaving the decision up to someone else. Ultimately, someone has to decide.

The advantage to comparing to a gimped version of model A is that it tells you: if this is the only data available to me, this is how much better I do than A. Of course, in no real world situation (for many problems, including the two I listed above) will this really be the only data you have. Plus, if you compare to a non-gimped A, you'll almost always lose by a ridiculously large margin.

On the other hand, comparing against a non-gimped A is a bit unfair. Chances are that quite some time has gone in to optimizing (say, for speed) algorithms for solving A. Should I, as a developer of new models, also have to be a hard-core optimizer in order to make things work? I'm thinking back to the introduction of SVMs twenty years ago. Back then, SVM training would take thousands of times longer than naive Bayes. Today, (linear) SVM training really isn't that much slower (maybe a small constant times longer).

Yet from the perspective of a consumer, there's something fundamentally unpleasant about having to gimp an existing system.... you have to ask yourself: why should I care?

I suppose this is where human judgement comes in. If I can reasonably imagine that the new system B might possibly be scaled up and, if so, I think it would continue to do well, then I'm not unhappy with a gimped comparison. For example, I can probably buy the syntactic language modeling example above (and to a lesser degree the MT example). I have a harder time with grammar induction on 10 word sentences because my prior beliefs state that 10 word sentences are syntactically really different from real sentences.

(Incidentally, although this issue didn't come up in my recent reviewing, I suppose that if I were reviewing a paper that had this issue, it probably wouldn't hurt if the authors were to ease me through this imagination process. For instance, in grammar induction, maybe you can show statistics that say that the distribution of context free rules is not so dissimilar between all sentences and short sentences. This almost never happens, but I think it would be useful both during reviewing as well as just for posterity.)

21 July 2008

ICML/UAI/COLT 2008 Retrospective

I know it's a bit delayed, but better late than never, eh? I have to say: I thought ICML/UAI/COLT this year was fantastic. The organizers all did a fantastic job. I can't possibly list everything I liked, but here are some highlights:

The tutorials were incredible. I'm reaching that point where I often skip out on tutorials because there's nothing new enough. This time, there was an embarrassment of riches: I couldn't go to everything I wanted to. In the end, I went to the Smola+Gretton+Fukumizu tutorial on distributional embeddings and the Krause+Guestrin tutorial on submodularity. Both were fantastic. My only complaint is that while all the plenary talks at ICML/UAI/COLT were video taped, the tutorials were not! Anyway, I will blog separately about both of these tutorials in the near future; also check out their web pages: the slides are probably reasonably self-explanatory.

The first day began with an invited talk by John Winn, who does (at least used to do) probabilistic modeling for computer vision. I'm not sure if this was the intended take-away message, but the thing I liked best about the talk is that he listed about a dozen things that you would have to model, were you to model "vision." He then talked about problems and techniques in terms of which of these you model, which of them you fix, and which of them you treat as "noise." I think this is a great way to think about modeling any kind of complex real world phenomenon (hint hint!).

My favorite morning talk was Beam Sampling for the Infinite Hidden Markov Model by Jurgen Van Gael, Yunus Saatci, Yee Whye Teh, and Zoubin Ghahramani. This is a really clever use of slice sampling, which everyone should know about! The idea is to resample entire sequences of hidden states in one go. The trouble with this in an infinite model is that I don't know how to do forward-backward over an infinite state space. The trick is to slice the probability mass on the lattice and only sample over those state/observation pairs that stick out above the slice. This is effectively a fully Bayesian way to do beam-like methods, and is certainly no limited to infinite HMMs.

That afternoon, there was the joint award session, with a "ten year" award (great idea!) going to Blum and Mitchell's original co-training paper from COLT 1998. This is clearly a very influential paper, and one I will blog about separately. I particularly liked two of the best paper award recipients:

Percy Liang and Michael Jordan for An Asymptotic Analysis of Generative, Discriminative, and Pseudolikelihood Estimators. The idea here is to look at what various estimators are like in the limit of infinite data. The conclusion (as I read it) is that if your model is correct (i.e., the true distribution is realized by a certain set of parameters in your model family), then you're asymptotically better using generative models. If the model is even a tiny tiny bit incorrect, then you're better using discriminative. The key weakness of this analysis seems to be that, since we're looking only at asymptotics, there's no notion of a "sortof-wrong" model -- the wrongness gets blown up in the limit.
Shai Shalev-Shwartz and Nati Srebo for SVM Optimization: Inverse Dependence on Training Set Size. The idea here is that having lots of data should be a good thing, not something that makes us say "ugh, now my SVM will take forever to run." The key to getting an inverse dependence on the size of the training set is to sort of change the problem. The idea is that if you have 1,000 training points, you may get 88% accuracy and it may take an hour. If you have 1,000,000 training points, we're not going to ask that we find something with 90% accuracy, but rather still aim for 88% accuracy but try to get there faster. The analysis is based on the idea that if we have lots more data, we can be a bit looser with finding the exact best model parameters, which is what usually takes so long. We can afford to have not exactly the best parameters, since we're still going for 90% accuracy, not 88%.

Although I didn't see the talk, I enjoyed the poster on Training Structural SVMs when Exact Inference is Intractable by Thomas Finley and Thorsten Joachims. They investigate the difference between approximate inference methods that are overgenerating versus undergenerating. The former is exemplified by a conversion from an IP to an LP. The latter by, say, search. I don't think there are huge cut and dry situations where one is better than the other, but I think this is a really good step to understanding these models better.

Tuesday morning, there was a great talk on Efficient Projections onto the L1-Ball for Learning in High Dimensions by John Duchi, Shai Shalev-Shwartz, Yoram Singer and Tushar Chandra. Here, there's trying to extend the Pegasos algorithm to do L1 regularized SVMs. The difficult step is, after taking a sub-gradient step, projecting yourself back on to the L1 ball. They have a very clever algorithm for doing this that makes use of red-black trees to maintain a notion of when a dimension was last updated (necessary in order to get complexity that depends on the sparseness of feature vectors). The only worry I have about this approach is that I'm very happy dealing with arrays for everything, but once someone tells me that my algorithm needs to use a tree structure, I start worrying. Not because of big-O issues, but just because trees mean pointers, pointers mean pointer chasing, and pointer chasing (especially out of cache) is a nightmare for actual efficiency.

All the papers in the Tuesday 10:40am online learning section I found interesting. Confidence-weighted linear classification (Mark Dredze, Koby Crammer and Fernando Pereira) presents a technique for keeping track of how confident you are about your weights for various features (if you squint it looks very PAC-Bayes) and they get very good online performance with only one or two passes over the data. Francesco Orabona, Joseph
Keshet and Barbara Caputo had a paper on memory-bounded online perceptrons called the Projectron. The idea is simple and effective; however, the big take-away I had from this paper has to do with the baseline they compare against. If you only allow your perceptron to maintain, say, 100 support vectors, just use the first 100 points that the perceptron updates on. This does pretty much as well as most of the past work in this problem! How did this not get noticed before? Finally, Sham Kakade, Shai Shalev-Shwartz and Ambuj Tewari talked about bandit problems; this is interesting work, but still seems far from practical. The idea of naively picking an unknown label with uniform probability over a gigantic set of labels is just not going to work in the real world. But I think it's a really good start.

The entire NLP section on the last afternoon was good. David Chen and Ray Mooney talked about how to generate sportscasts from RoboCup trials; Ronan Collobert and Jason Weston talked about using neural nets to solve every NLP problem in the world (more on this in another post); Jason Wolfe, Aria Haghighi and Dan Klein talked about how to parallelize the M step in addition to the E step; and then Percy Liang, me, and Dan Klein talked about whether its reasonable or not to get rid of features in structured prediction (more on this in another post).

The next day was workshop day. The two workshops I went to were the Prior Knowledge in Text workshop (that I co-organized with Marc Dymetman, Guillame Bouchard and Yee Whye Teh), and the Nonparametric Bayes workshop (where I talked a tiny bit about HBC). I'll relate the news of our own workshop at a later date; you can see some about the NPBayes workshop here.

Then came UAI and COLT. I must admit by this time I was a bit conferenced out, so I missed a bunch of sessions. However, there were still some papers that stood out for me.

In UAI, David Mimno and Andrew McCallum presented Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression, essentially a conditional variant of LDA wherein the distribution over hyperparameters is given by a generalized linear model and can use arbitrary features. One audience question that came up (David, if you're reading, maybe you can answer this!) had to do with putting the GLM on the "alpha" hyperparameter. The fact that there was such a big improvement over baseline suggests that LDA-like models are highly sensitive to the setting of their hyperparameters: this is a bit surprising. I (like the questioner) was a bit surprised that tweaking a hyperparameter could have such a big influence! Nevertheless, very cool.

The best paper went to David Sontag, Talya Meltzer, Amir Globerson, Tommi Jaakkola and Yair Weiss for Tightening LP Relaxations for MAP using Message Passing. The idea is to take your marginal polytope initially defined by simple single-node marginal constraints and iteratively add pairwise, 3-wise, etc., constraints. There are some implementation details that allow you to do warm restarts. They managed to scale really amazingly well. This furthers my confidence that LPs are a really great way to do MAP inference in graphical models.

Kuzman Ganchev, Joao Graca, John Blitzer and Ben Taskar talked about Multi-View Learning over Structured and Non-Identical Outputs. I see this as a bit of an extension to the "alignment by agreement" and the "agreement-based learning" work, but now the difference is that the output spaces don't have to "match up."

Marina Meila and Le Bao had a poster on Estimation and clustering with infinite rankings, where they give a nonparametric model over rankings, focusing on properties of the distribution. Kurt Miller, Tom Griffiths and Mike Jordan had an extension to the Indian Buffet Process that intentionally removes exchangeability in order to model cases where we know the data is not exchangeable. Chong Wang, Dave Blei and David Heckerman had a nice result on how to do topic modeling over continuous time (modeled as Brownian motion). The cool thing is that by going continuous, they're able to get a more efficient algorithm that for previous work that functioned in discrete time steps.

At COLT, there were also a good number of good papers. Shai Ben-David, Tyler Lu and David Pal ask: Does Unlabeled Data Provably Help? Worst-case Analysis of the Sample Complexity of Semi-Supervised Learning. I'll ruin the suspense: No, it does not. (At least not without strong assumptions about the label distribution.) Nina Balcan, Avrim Blum and Nati Srebo
show that learning with similarity functions is even better than we thought.

One paper at COLT that I especially liked because I wasn't aware of much of the previous work they cited was by Liwei Wang, Masashi Sugiyama, Cheng Yang, Zhi-Hua Zhou and Jufu Feng On the Margin Explanation of Boosting Algorithms. The history here is roughly like this. Boosting came along and worked pretty well. People started trying to explain it from the perspective of margin-based analysis. Then, Breiman came along and described an algorithm arc-gv that provably generates a better margin than AdaBoost (and does so in practice as well), yet works worse! This paper attempts to re-analyze boosting from the perspective of an "Equilibrium Margin," which provides sharper bounds and results that actually agree with what is observed empirically.

08 July 2008

ICML Business Meeting

Here are some points, coming to you roughly live.

ICML 2009 will be in Montreal, June 15-19. Colocated with COLT (June 20-23) and Symposium on RL (June 19-21).
ICML 2010 will be in Haifa, Isreal (led by IBM folks). No official dates, yet.
The ICML board will be holding elections for members soon: probably around 10 new. Requests for nominations will be sent out in the next month or two from IMLS.
Sounds like tracks will be implemented in SPC vs. review -- just like *ACL. To me, clearly a good idea. Bidding on 600 papers sucks.
There's discussion about having a combination of a "AAAI-style nectar track" and an "applications track." The idea would be that if you've published a paper at an applications conference (eg, ACL, CVPR, ISMB, etc.), you could rewrite >50% of it, sell it to an ICML audience, and resubmit. The goal would be to get some applications blood into ICML, and not simultaneously prevent people from submitting their apps papers to the apps conferences.

02 July 2008

ICML, UAI and COLT 2008 all WhatToSee'd

See here

24 June 2008

ICML 2008 papers up

See here. Has also been WhatToSee-d.

Here are the top words (stems) from ICML this year:

learn (53) -- no kidding!
model (20)
kernel (17) -- can't seem to shake these guys
estim (11)
reinforc (10)
linear (10) -- as in both "linear time" and "linear model"
classif (10) -- no kidding
process (9) -- yay nonparametric Bayes!
analysi (9)
supervis (8)
structur (8) -- our paper is one of these
rank (8) -- always popular in the web days
effici (8) -- this is great!

23 June 2008

Help! Contribute the LaTeX for your ACL papers!

This is a request to the community. If you have published a paper in an ACL-related venue in the past ten years or so, please consider contributing the LaTeX source. Please also consider contributing talk slides! It's an relatively painless process: just point your browser here and upload! (Note that we're specifically requesting that associated style files be included, though figures are not necessary.)

21 June 2008

ACS: ACL 2008 Summarization Track

This is the first in what I hope will be a long line of "Area Chair Summaries" from NLP-related conferences. If you would like to contribute one, please contact me!

For ACL this year, I was lucky to be the Area Chair for the summarization track. I know I've said it before, but we really got a ton of great papers this year. In the end, seven were accepted for presentation (note there are also some summarization-related papers that officially fell under "Generation" for which I was not the area chair). I would like to say that there was some sort of unifying theme this year, but there's not one that I can actually come up with. The papers were:

P08-1035 [bib]: Tadashi Nomoto
A Generic Sentence Trimmer with CRFs

P08-1036 [bib]: Ivan Titov; Ryan McDonald
A Joint Model of Text and Aspect Ratings for Sentiment Summarization

P08-1092 [bib]: Fadi Biadsy; Julia Hirschberg; Elena Filatova
An Unsupervised Approach to Biography Production Using Wikipedia

P08-1093 [bib]: Qiaozhu Mei; ChengXiang Zhai
Generating Impact-Based Summaries for Scientific Literature

P08-1094 [bib]: Ani Nenkova; Annie Louis
Can You Summarize This? Identifying Correlates of Input Difficulty for Multi-Document Summarization

P08-1054 [bib]: Gerald Penn; Xiaodan Zhu
A Critical Reassessment of Evaluation Baselines for Speech Summarization

P08-1041 [bib]: Giuseppe Carenini; Raymond T. Ng; Xiaodong Zhou
Summarizing Emails with Conversational Cohesion and Subjectivity

Most of these you can guess the contents more-or-less by their titles, but here's a quick run-down. Nomoto's is probably the hardest to guess. I dare-say it actually sounds a bit boring from just the title; it leads me to think it's yet another sentence compression method. But if you start reading the paper, you find out all about compression under dependency structures, summarization of Japanese text, and an fairly thorough evaluation.

The Titov and McDonald paper attempts to model associations between fine-grained user reviews of restaurants (eg: how do you rate the food versus the ambiance?) and the actual text in the reviews. This enables them to produce summaries that are specific to particular aspects of the review.

Biadsi, Hircshberg and Filatova present a model for producing biographies, by trying to identify biography-like sentences from Wikipedia (a source that's gaining more an more attention these days). One aspect that I found most interesting here was that they attempt to do full-on reference resolution and referring expression generation. This has always been something I've been too scared to touch, but they actually present some results that show that it's worthwhile!

Mei and Zhai talk about a sentence-retrieval method for summarizing scientific documents, which they gathered from DBLP. They take advantage of citation sentences (called "citances" by Marti Hearst and also explored deeply by Simone Teufel) and make a "citation" language model. This language model is interpolated with the standard document language model to perform extraction. This allows them to extract sentences that readers care about, not what the author thinks you should care about. The key limitation, of course, is that it only works once the paper has been cited for a while. (It would be nice to see how many citations you need before it's worth doing this.)

The paper by Nenkova and Louis describes a model for predicting if a batch of documents is going to be difficult (for a machine) to summarize. This is akin to the notion of query-difficulty that you see in IR. The results are about what you'd expect, but it's nice to see them played out. In particular, you see that more cohesive sets of documents are easier to summarize. Read the paper for more.

Penn and Zhu look at how well Rouge works when trying to summarize speech. They look at both Switchboard data (telephone conversations) and lectures. They have many interesting findings that cast doubt not only on the role that Rouge plays in speech summarization, but also on what sort of baselines are reasonable for speech, and what role meta-data plays in speech summarization. If you care at all about the intersection of speech and summarization, this truly is a must-read.

Last but not least, Carenini, Ng and Zhou discuss the task of summarizing emails (following up on previously work by the same authors on roughly the same task). Since the most relevant past work appeared in WWW07, I'll actually comment on that too (maybe unknown to many readers). There is quite a bit of work here, that primarily aims at finding useful features for summarizing chains of emails. They look at things like cue words, semantic similarity, lexical similarity, "PageRank" measures and a handful of others. This is all done in a graph-based framework that is pieced together based on processing the email chain (from what I can tell, this is highly non-trivial).

After writing this, I think that maybe the connecting thread is that there's a lot of work that aims at doing summarization for things other than straight news stories. I think this is a fantastic step for the community. Keep up the good work, guys!

18 June 2008

CL is open access

Just officially announced. Minor details: as of Mar 2009 (first issue next year), there will be no print version (electronic only) and will be open access. Also switched to an online electronic management system.

Obviously I think this is fantastic. Many many thanks go to Robert Dale and the other ACL and CL board members for making this happen.

15 June 2008

ACL 2008 What-To-See'd

Sorry for the delay, but I had to wait until I got here and actually got the proceedings CD, since the proceedings don't yet seem to be online. Find some good talks to see!

Yes, I know that I promised another post in the interim, but this doesn't really count in my mind.

11 June 2008

Old school conference blogging

These days, the "conference blog" has come to be a fairly accepted part of an academic blogger's repertoire. I've actually received an email on the third day of a conference that people knew I was attending asking "why haven't you blogged yet!" Colleagues who blog have shared similar stories. How did the world manage without us?!

I occasionally like to read through old papers, like from the late 70s, early 80s, mostly in stats, but occasionally in NLP or ML. Mostly it's fun; often I learn things. The old NLP paper typically amuse me more than anything else, but it's a bit of a relief to see the set of things that were considered interesting or important 20-30 years ago. I was browsing through the ACL 1980 proceedings (I won't share the answer to "how old were you then?") on the anthology and came across a paper titled "Parsing." I thought that was quite impressive that someone would be audacious enough to title their paper "Parsing." I mean, can you imagine going to a talk at ACL 2008 and seeing the title "Machine Translation?" It's absurd.

So I read the paper.

Well, it turns out it's not really a research paper. It's the 1980 ACL's parsing area chair's impression of all the parsing papers that year, and how they related to the previous year.

"Fantastic!" I thought.

These are like pre-web-era conference blogs. But it's not just some random guy blogging about his favorite ACL papers across a variety of areas, but rather the one person who probably knew all the parsing papers in ACL that year better than anyone. Having area chaired myself, I know what all the summarization submissions were about, and I know quite well what all the accepted submissions were about. (Or, at least, I did right after we issues accept/reject notifications.) If I had sat down right then and written a page summary, relating the the past year, it probably would have taken about two hours. (An average blog post takes 30 mins, but I figure I would want to be more diligent, and also be sure to look up what happened the year before.)

I would actually love to see something like that reinstated. I think it captures an important area that's missing from you standard conference blogs -- namely, that you get coverage. As I age, I attend fewer and fewer talks, so my conference blog posts are necessarily really spotty. But if a chair from every area were to write a one page summary, I think that would be great. And I really don't think it's that much added effort -- I easily spent 10-20 times that going over reviews, looking for inconsistencies, reading papers, etc.

However, while I think it's useful, I also don't think that it's really something that needs to be in our proceedings. So I'm going to try to start a trend. Every time I am area chair for a conference, I will post to the blog a summary of the papers in my area. If you ever are an area chair, and would like to participate, please contact me. Perhaps I'll be the only one who ever does this, but I hope not. I'll start with ACL 2008 summarization as my next post.

10 June 2008

Evaluating topic models

I think it's fair to say that I am a fan of the Bayesian lifestyle. I have at least a handful of papers with "Bayesian" in the title, and no in the misleading "I used Bayes' rule on a noisy-channel model" sense.

It's probably also fair to say that I'm a fan of NLP.

So... let's think about it. A fan of Bayes, a fan of NLP. He must do topic modeling (ie LDA-style models), right? Well, no. Not really. Admittedly I have done something that looks like topic modeling (for query-focused summarization), but never really topic modeling for topic modeling's sake.

The main reason I have stayed away from the ruthless, yet endlessly malleable, world of topic modeling is because it's notoriously difficult to evaluate. (I got away with this in the summarization example because I could evaluate the summaries directly.) The purpose of this post is to discuss how one can try to evaluate topic models.

At the end of the day, most topic models are just probabilistic models over documents (although sometimes they are models over collections of documents). For a very simple example, let's take LDA. Here, we have . In the particular case of LDA, the first two "p"s are Dirichlet, and the last two "p"s are Multinomial, where z is an indicator selecting a mixture. For simplicity, though, let's just collapse all the hyperparameters into a single variable a and the true parameters into a single parameter z and just treat this as and let's assume p and q are not conjugate (if they were, life would be too easy).

Now, we want to "evaluate" these models. That is, we want to check to see if they really are good models of documents. (Or more precisely, are the better models of documents than whatever we are comparing against... perhaps I've proposed a new version of p and q that I claim is more "life-like.") Well, the natural thing to do would be to take some held-out data and evaluate according to the model. Whichever model assigns higher probability to the heldout data is probably better.

At this point, we need to take a moment to talk about inference. There's the whole Monte Carlo camp and there's the whole deterministic (variational, Laplace, EP, etc.) camp. Each gives you something totally different. In the Monte Carlo camp, we'll typically get a set of R-many (possibly weighted) samples from the joint distribution p(a,z,w). We can easily "throw out" some of the components to arrive at a conditional distribution on whatever parameters we want. In the deterministic camp, one of the standard things we might get is a type-II maximum likelihood estimate of a given the training data: i.e., a value of a that maximizes p(a|w). (This is the empirical Bayes route -- some deterministic approximations will allow you to be fully Bayes and integrate out a as well.)

Now, back to evaluation. The issue that comes up is that in order to evaluate -- that is, in order to compute , we have to do more inference. In particular, we have to marginalize over the zs for the heldout data. In the MC camp, this would mean taking our samples to describe a posterior distribution on a given w (marginalizing out z) and then using this posterior to evaluate the heldout likelihood. This would involve another run of a sampler to marginalize out the zs for the new data. In the deterministic camp, we may have an ML-II point estimate of the hyperparameters a, but we still need to marginalize out z, which usually means basically running inference again (eg., running EM on the test data).

All of this is quite unfortunate. In both cases, re-running a sampler or re-running EM, is going to be computationally expensive. Life is probably slightly better in the deterministic camp where you usually get a fairly reasonable approximation to the evidence. In the MC camp, life is pretty bad. We can run this sampler, but (a) it is usually going to have pretty high variance and, (b) (even worse!) it's just plain hard to evaluate evidence in a sampler. At least I don't know of any really good ways and I've looked reasonably extensively (though "corrections" to my naivete are certainly welcome!).

So, what recourse do we have?

One reasonable standard thing to do is to "hold out" data in a different way. For instance, instead of holding out 10% of my documents, I'll hold out 10% of my words in each document. The advantage here is that since the parameters z are typically document-specific, I will obtain them for every document in the process of normal inference. This means that (at least part of) the integration in computing p(w|a) disappears and is usually tractable. The problem with this approach is that in many cases, it's not really in line with what we want to evaluate. Typically we want to evaluate how well this model models totally new documents, not "parts" of previously seen documents. (There are other issues, too, though these are less irksome to me in a topic-modeling setting.)

Another standard thing to do is to throw the latent variables into some sort of classification problem. That is, take (eg) the 20newsgroups data set, training and test combined. Run your topic model and get document-level parameters. Use these as parameters to, say, logistic regression and see how well you do. This definitely gets around the "test on new data" problem, is not really cheating (in my mind), and does give you an estimate. The problem is that this estimate is cloaked behind classification. Maybe there's no natural classification task associated with your data, or maybe classification washes out precisely the distinctions your model is trying to capture.

The final method I want to talk about I learned from Wei Li and Andrew McCallum and is (briefly!) described in their Pachinko Allocation paper. (Though I recall Andrew telling me that the technique stems---like so many things---from stats; namely, it is the empirical likelihood estimate of Diggle and Gratton.

The key idea in empirical likelihood is to replace our statistically simple but computationally complex model p with a statistically complex but computationally simple model q. We then evaluate likelihood according to q instead of p. Here's how I think of it. Let's say that whatever inference I did on training data allowed me to obtain a method for sampling from the distribution p(w|a). In most cases, we'll have this. If we have an ML-II estimate of a, we just follow the topic models' generative story; if we have samples over a, we just use those samples. Easy enough.

So, what we're going to do is generate a ton of faux documents from the posterior. On each of these documents, we estimate some simpler model. For instance, we might simply estimate a Multinomial (bag of words) on each faux document. We can consider this to now be a giant mixture of multinomials (evenly weighted), where the number of mixture components is the number of faux documents we generated. The nice thing here is that evaluating likelihood of test data under a (mixture of) multinomials is really easy. We just take our test documents, compute their average likelihood under each of these faux multinomials, and voila -- we're done!

This method is, of course, not without it's own issues. For one, a multinomial might not be a good model to use. For instance, if my topic model says anything about word order, then I might want to estimate simple n-gram language models instead. The estimates might also have high variance -- how many faux documents do I need? Some sort of kernel smoothing can help here, but then that introduces additional bias. I haven't seen anyone do any evaluation of this for topic-model things, but it would be nice to see.

But overall, I find this method the least offensive (ringing praise!) and, in fact, it's what is implemented as part of HBC.

29 May 2008

Measures of correlation

As NLPers, we're often tasked with measuring similarity of things, where things are usually words or phrases or something like that. The most standard example is measuring collocations. Namely, for every pair of words, do they co-locate (appear next to each other) more than they "should." There are lots of statistics that are used to measure collocation, but I'm not aware of any in-depth comparison of these. If anyone actually still reads this blog over the summer and would care to comment, it would be appreciated!

The most straightforward measure, in my mind, is mutual information. If we have two words "Y" and "Y", then the mutual information is: MI(X;Y) = \sum_x \sum_y p(X=x,Y=y) \log [p(X=x,Y=y) / (p(X=x)p(Y=y))] (sorry, my LaTeX plugin seems to have died). In the case of collocation, the values x and y can take are usually just "does occur" and "does not occur." Here, we're basically asking ourselves: do X and Y take on the same values more often than chance. I.e., do they seem roughly statistically independent. The mutual information statistic gives, for every X/Y pair (every pair of words) a score. We can then sort all word pairs by this score and find the strongest collocations.

One issue with MI seems to be that it doesn't really care if pairs are common or not. Very infrequent (typically noisy) pairs seem to pop to the front. One way to fix this would be to add an additional "count" term to the front of the mutual information to weight high-frequency pairs more. This is quite similar to the RlogF measure that's locally quite popular.

The next set of methods seem to be based mostly on the idea of assuming a generative model for the data and doing hypothesis testing. I.e., you can have one model that assumes the two words are independent (the null hypothesis), and one that assumes they are not. Typically this is implemented as multinomials over words, and then a classical statistical test is applied to the estimated maximum likelihood parameters of the data. You can use Dice coefficient, t-score, chi-squared, or even something simpler like the log-likelihood ratio.

Although I'm not completely familiar with these techniques, they seem like they would also be sensitive to noise in the MLE, especially for low-frequency terms (of which we know there are a lot). It seems plausible to just directly do a Bayesian hypothesis test, rather than a classical one, but I don't know if anyone has done this.

In writing this, I just came across collocations.de, which among other things, has a nice survey of these results from an ESSLLI tutorial in 2003. They seem to conclude that for extracting PP/verb collocations, t-score and frequency seem to work best, and for extracting adjective/noun collocations, log-likelihood, t-score, Fisher, and p-values seem to work best. The intersection just contains t-score, so maybe that's the way to go. Of course, these things are hard to validate because so much depends on what you're trying to do.

22 May 2008

Continuing bad ideas

I occasionally--more than I would like--run in to the following problem. I am working on some task for which there is prior work (shocking!). I'm getting ready to decide what experiments to run in order to convince readers of a potential paper that what I'm doing is reasonable. Typically this involves comparing fairly directly against said prior work. Which, in turn, means that I should replicate what this prior work has done as closely as possible, letting the only difference in systems be what I am trying to propose. Easy example: I'm doing machine learning for some problem and there is an alternative machine learning solution; I need to use an identical feature set to them. Another easy example: I am working on some task; I want to use the same training/dev/test set as the prior work.

The problem is: what if they did it wrong (in my opinion). There are many ways for doing things wrong and it would be hard for me to talk about specifics without pointing fingers. But just for concreteness, let's take the following case relating to the final example in the above paragraph. Say that we're doing POS tagging and the only prior work POS tagging paper used as test data every tenth sentence in WSJ, rather than the last 10%. This is (fairly clearly) totally unfair. However, it leaves me with the following options:

I can repeat their bad idea and test on every tenth sentence.
I can point out (in the paper) why this is a bad idea, and evaluate on the last 10%.
I can not point this out in the paper and just ignore their work and still evaluate on the last 10%.
I can point out why this is a bad idea, evaluate on both the last 10% (for "real" numbers) and every tenth sentence (for "comparison" numbers).
I can point out why this is a bad idea, reimplement their system, and run both mine and theirs on the last 10%.

I think that if all else is equal, (5) is the best option. I think it's scientifically sound, truthful, and gives results that are actually comparable. Unfortunately, it's also the most work because it involves reimplementing a complete other system which in many cases may be verging on prohibitive. (I suppose another option would be to contact the authors and ask them to rerun in the correct way -- I imagine some people might respond positively to this, though probably not all.)

(4) is tempting, because it allows me to "break the bad habit" but also compare directly. The problem is that if I really disbelieve the prior methodology, then these comparison numbers are themselves dubious: what am I supposed to learn from results that compare in an unreasonable setting? If they're great (for me), does that actually tell me anything? And if they're terrible (for me), does that tell me anything? It seems to depend on the severity of the "bad idea", but it's certainly not cut and dry.

(3) just seems intellectually dishonest.

(2) at first glance seems inferior to (4), but I'm not entirely sure that I believe this. After all, I'm not sure that (4) really tells me anything that (2) doesn't. I suppose one advantage to (4) over (2) is that it gives me some sense of whether this "bad idea" really is all that bad. If things don't look markedly different in (2) and (4), then maybe this bad idea really isn't quite so bad. (Of course, measuring "markedly different" may be difficult for some tasks.)

(1) also seems intellectually dishonest.

At this point, to me, it seems like (4) and (5) are the winners, with (2) not hugely worse than (4). Thinking from the perspective of how I would feel reviewing such a paper, I would clearly prefer (5), but depending on how the rest of the paper went, I could probably tolerate (2) or (4) depending on how egregious the offense is. One minor issue is that as a writer, you have to figure that the author of this prior work is likely to be a reviewer, which means that you probably shouldn't come out too hard against this error. Which is difficult because in order to get other reviewers to buy (4), and especially (2), they have to buy that this is a problem.

I'm curious how other people feel about this. I think (5) is obviously best, but if (5) is impossible to do (or nearly so), what should be done instead.

18 May 2008

Adaptation versus adaptability

Domain adaptation is, roughly, the following problem: given labeled data drawn from one or more source domains, and either (a) a little labeled data drawn from a target domain or (b) a lot of unlabeled data drawn from a target domain; do the following. Produce a classifier (say) that has low expected loss on new data drawn from the target domain. (For clarity: we typically assume that it is the data distribution that changes between domains, not the task; that would be standard multi-task learning.)

Obviously I think this is an fun problem (I publish on it, and blog about it reasonably frequently). It's fun to me because it both seems important and is also relatively unsolved (though certainly lots of partial solutions exist).

One thing I've been wondering recently, and I realize as I write this that the concept is as yet a bit unformed in my head, is whether the precise formulation I wrote in the first paragraph is the best one to solve. In particular, I can imagine many ways to tweak it:

Perhaps we don't want just a classifier that will do well on the "target" domain, but will really do well on any of the source domains as well.
Continuing (1), perhaps when we get a test point, we don't know which of the domains we've trained on it comes from.
Continuing (2), perhaps it might not even come from one of the domains we've seen before!
Perhaps at training time, we just have a bunch of data that's not neatly packed into "domains." Maybe one data set really comprises five different domains (think: Brown, if it weren't labeled by the source) or maybe two data sets that claim to be different domains really aren't.
Continuing (4), perhaps the notion of "domain" is too ill-defined to be thought of as a discrete entity and should really be more continuous.

I am referring to union of these issues as adaptability, rather than adaptation (the latter implies that it's a do-it-once sort of thing; the former that it's more online). All of these points beg the question that I often try to ask myself when defining problems: who is the client.

Here's one possible client: Google (or Yahoo or MSN or whatever). Anyone who has a copy of the web lying around and wants to do some non-trivial NLP on it. In this case, the test data really comes from a wide variety of different domains (which may or may not be discrete) and they want something that "just works." It seems like for this client, we must be at (2) or (3) and not (1). There may be some hints as to the domain (eg., if the URL starts with "cnn.com" then maybe--though not necessarily--it will be news; it could also be a classified ad). The reason why I'm not sure of (2) vs (3) is that if you're in the "some labeled target data" setting of domain adaptation, you're almost certainly in (3); if you're in the "lots of unlabeled target data" setting, then you may still be in (2) because you could presumably use the web as your "lots of unlabeled target data." (This is a bit more like transduction that induction.) However, in this case, you are now also definitely in (4) because no one is going to sit around and label all of the web as to what "domain" it is in. However, if you're in the (3) setting, I've no clue what your classifier should do when it gets a test example from a domain that (it thinks) it hasn't seen before!

(4) and (5) are a bit more subtle, primarily because they tend to deal with what you believe about your data, rather than who your client really is. For instance, if you look at the WSJ section of the Treebank, it is tempting to think of this as a single domain. However, there are markedly different types of documents in there. Some are "news." Some are lists of stock prices and their recent changes. One thing I tried in the Frustratingly Easy paper but that didn't work is the following. Take the WSJ section of the Treebank and run document clustering on it (using bag of words). Treat each of these clusters as a separate domain and then do adaptation. This introduces the "when I get a test example I have to classify it into a domain" issue. At the end of the day, I couldn't get this to work. (Perhaps BOW is the wrong clustering representation? Perhaps a flat clustering is a bad idea?)

Which brings us to the final question (5): what really is a domain? One alternative way to ask this is: give data partitioned into two bags, are these two bags drawn from the same underlying distribution. Lots of people (especially in theory and databases, though also in stats/ML) have proposed solutions for answering this question. However, it's never going to be possible to give a real "yes/no" answer, so now you're left with a "degree of relatedness." Furthermore, if you're just handed a gigabyte of data, you're not going to want to try ever possible split into subdomains. One could try some tree-structured representation, which seems kind of reasonable to me (perhaps I'm just brainwashed because I've been doing too much tree stuff recently).

12 May 2008

Teaching machine translation

Last Fall (2007), I taught an Applications of NLP course to a 50/50 mix of grads and senior undergrads. It was modeled partially after a course that I took from Kevin Knight while a grad student. It was essentially 1/3 on finite state methods for things like NER and tagging, then 1/3 on machine translation, then 1/3 on question answering and summarization. Overall, the course went over fairly well.

I had a significant problem, however, teaching machine translation. Here's the problem.

Students knew all about FSTs because we used them to do all the named-entity stuff in the first third of class. This enabled us to talk about things like IBM model 1 and the HMM model. (There's a technical difficult here, namely dealing with incomplete data, so we talk about EM a little bit.) We discuss, but they don't actually make use of, higher order MT models.

Now, we all know that there's a lot more to MT than model 4 (even limiting oneself to statistical translation techniques). Namely, there are phrase-based models and syntactic models. We had a very brief (one lecture) overview of syntactic models at the end. My beef is with phrase-based models.

The problem is that we've gone though all this prettiness to develop these word-based models, and then I have to teach them grow-diag-final, phrase extraction and phrase scoring. I almost felt embarrassed doing so. The problem is that these things are obviously so heuristic that throwing them on top of this really pretty word-for-word translation model just kills me. And it's not just me: the students were visibly upset by the lack of real modeling behind these techniques.

One option would be just not to teach this stuff. I don't really think that it sheds much light on the translation process. The reason I don't like this solution is because it's nice to be able to say that they will have a handle on a not-too-difficult to understand/implement method for doing real-world MT. Instead, I could just spend that time on syntactic models. The situation there is better (you can talk about the hierarchy of tree transducers, etc.), but not perfect (eg., all the work that goes in to rule extraction is not too dissimilar from all the work that goes into phrase extraction).

I suppose that this is just the defacto problem with a relatively immature field: there hasn't been enough time for us to really tease apart what's actually going on in these models and try to come up with some coherent story. I'd love a story that doesn't involve first doing word alignment and, is, in some sense, integrated.

23 April 2008

A Bit of Levity

Some out there might find this amusing (if you know me, you'll know why).

Semester's over; back to regular blogging soon!

06 April 2008

ICWSM Report

(Guest Post by Kevin Duh -- Thanks, Kevin!!!)

I recently attended ICWSM (International Conference on Weblogs and Social Media), which consisted of an interesting mix of researchers from NLP, Data Mining, Pyschology, Sociology, and Information Sciences. Social media (which defined generally can include blogs, newsgroups, and online communities like facebook, flikr, youtube, del.icio.us) now accounts for the majority of content produced and consumed on the Web. As the area grows in importance, people are getting really interested in finding ways to better understand the phenomenon and to better build applications on top of it. This conference, the second in the series, has nearly 200 participants this year. I think this is a rewarding area for NLPers and MLers to test their wits on: there are many interesting applications and open problems.

In the following, I'll pick out some papers, just to give a flavor of the range of work in this area. For a full list of papers, see the conference program. Most papers are available online (do a search); some are linked from the conference blog.

Interesting new applications:

1) International sentiment analysis for News and Blogs -- M. Bautin, L. Vijayarenu, S. Skiena (StonyBrook) Suppose you want to monitor the sentiment of particular named entities (e.g. Bush, Putin) on news and blogs across different countries for comparison. This may be useful for, e.g., political scientists analyzing global reactions to the same event. There are two approaches: One is to apply sentiment analyzers trained in different languages; Another is to apply machine translation on foreign text, then apply an English sentiment analyzer. Their approach is the latter (using off-the-shelf MT engine). Their system generates very-fun-to-watch "heat maps" of named entities that are popular/unpopular across the globe. I think this paper opens up a host of interesting questions for NLPers: Is sentiment polarity something that can be translated across languages? How would one modify an MT system for this particular task? Is it more effective to apply MT, or to build multilingual sentiment analyzers?

2) Recovering Implicit Thread Structure in Newsgroup Style Conversations, by Y-C. Wang, M. Joshi, C. Rose, W. Cohen (CMU) Internet newsgroups can quite messy in terms of conversation structure. One long thread can actually represent different conversations among multiple parties. This work aims to use natural language cues to tease apart the conversations of a newsgroup thread. Their output is a conversation graph that shows the series of post-replies in a more coherent manner.

3) BLEWS: Using blogs to provide context for news articles -- M. Gamon, S. Basu, D. Belenko, D. Fisher, M. Hurst, C. Konig (Microsoft) Every news article has its bias (e.g. liberal vs. conservative). A reader who wishes to be well-educated on an issue should ideally peruse articles on all sides of the spectrum. This paper presents a system that aids the reader in quickly undertanding the political leaning (and emotional charge) of an article. It does so by basically looking at how many conservative vs. liberal blogs link to a news article. I think this paper is a good example of how one can creatively combine a few existing technologies (NLP, visualization, link analysis) to produce an application that has a lot of value-added.

Methods and algorithms adapted for social media data:

4) Document representation and query expansion models for blog recommendation -- J. Arguello, J. Elsas, J. Callan, J. Carbonel (CMU) This is an information retrieval paper, where the goal is to retrieve blogs relevant to an user query. This is arguably a harder problem than traditional webpage retrieval, since blogs are composed of many posts, and they can be on slightly different topics. The paper adopts a language modeling approach and asks the question: should we model blogs at the blog-level, or at the post-level? They also explored what kind of query expansion would work for blog retrieval. This paper is a nice example of how one can apply traditional methods to a new problem, and then discover a whole range of interesting and new research problems due to domain differences.

Understanding and analyzing social communities:

5) Wikipedian Self-governance in action: Motivating the policy-lens -- I. Beschastnikh, T. Kriplean, D. McDonald (UW) [Best paper award] Wikipedia is an example of self-governance, where participant editors discuss/argue about what should and can be edited. Over the years, a number of community-generated policies and guidelines have formed. These include policies such as "all sources need to be verified" and "no original research should be included in Wikipedia". Policies are themselves subject to modification, and they are often used as justification by different editors under different perspectives. How are these policies used in practice? Are they being used by knowledgeable Wikipedian "lawyers" or adminstrators at the expense of commonday editors? This paper analyzes the Talk pages of Wikipedia to see how policies are used and draws some very interesting observations about the evolution of Wikipedia.

6) Understanding the efficiency of social tagging systems using information theory -- E. Chi, T. Mytkowicz (PARC) Social communities such as del.icio.us allows users to tag webpages with arbitrary terms; how efficient is this evolving vocabulary of tags for categorizing the webpage of interest? Is there a way to measure whether a social community is "doing well"? This paper looks at this problem with the tools of information theory. For example, they compute the conditional entropy of documents given tags H(doc|tag) over time and observe that the efficiency is actually decreasing as popular tags are becoming overused.

Overall, I see three general directions of research for an NLPer in this field: The first approach focuses on building novel web applications that require NLP as a sub-component for the value-added. NLPers in industry or large research groups are well-suited to build these applications; this is where start-ups may spring up. The second approach is more technical: it focuses on how to adapt existing NLP techniques to new data such as blogs and social media.
This is a great area for individual researchers and grad student projects, since the task is challenging but clearly-defined: beat the baseline (old NLP technique) by introducing novel modifications, new features and models. Success in this space may be picked up by the groups that build the large applications.The third avenue of research, which is less examined (as far as I know), is to apply NLP to help analyze social phenomenon. The Web provides an incredible record of human artifacts. If we can study all that is said and written on the web, we can really understand a lot about social systems and human behavior.

I don't know when NLP technology will be ready, but I think it would be really cool to use NLP to study language for language's sake, and more importantly, to study language in its social context--perhaps we could call that "Social Computational Linguistics". I imagine this area of research will require collaboration with the social scientists; it is not yet clear what NLP technology is needed in this space, but papers (5) and (6) above may be a good place to start.

04 April 2008

More complaining about automatic evaluation

I remember a few years ago complaining about automatic evaluation at conference was the thing to do. (Ironically, so was writing papers about automatic evaluation!) Things are saner now on both sides. While what I'm writing here is interpretable as a gripe, it's really intended as a "did anyone else notice this" because it's somewhat subtle.

The evaluation metric I care about is Rouge, designed for summarization. The primary difference between Rouge and Bleu is that Rouge is recall-oriented while Bleu is precision-oriented. The way Rouge works is as follows. Pick an ngram size. Get a single system summary H and a single reference summary R (we'll get to multiple references shortly). Let |H| denote the size of bag the defined by H and let |H^R| denote the bag intersection. Namely, the number of times some n-gram is allowed to appear in H^R is the min of the number of times it appears in H and R. Take this number and divide by |R|. This is the ngram recall for our system on this one example.

To extend this to more than one summary, we simple average the Rouges at each individual summary.

Now, suppose we have multiple references, R_1, R_2, ..., R_K. In the original Rouge papers and implementation, we compute the score for a single sentence as the max over the references of the Rouge on that individual reference. In other words, our score is the score against a single reference, where that reference is chosen optimistically.

In later Rouge paper and implementation, this changed. In the single-reference case, our score was |H^R|/|R|. In the multiple reference setting, it is |H^(R_1 + R_2 + ... + R_K)|/|R_1 + R_2 + ... + R_K|, where + denotes bag union. Apparently this makes the evaluation more stable.

(As an aside, there is no notion of a "too long" penalty because all system output is capped at some fixed length, eg., 100 words.)

Enough about how Rouge works. Let's talk about how my DUC summarization system worked back in 2006. First, we run BayeSum to get a score for each sentence. Then, based on the score and about 10 other features, we perform sentence extraction, optimized against Rouge. Many of these features are simple patterns; the most interesting (for this post) is my "MMR-like" feature.

MMR (Maximal Marginal Relevance) is a now standard technique in summarization that aims to allow your sentence extractor to extract sentences that aren't wholly redundant. The way it works is as follows. We score each sentence. We pick as our first sentence the sentence with the highest score. We the rescore each sentence to a weighted linear combination of the original score and minus the similarity between the proposed second sentence and its similarity to the first. Essentially, we want to punish redundancy, weighted by some parameter a.

This parameter is something that I tune in max-Rouge training. What I found was that at the end of the day, the value of a that is found by the system is always negative, which means that instead of disfavoring redundancy, we're actually favoring it. I always took this as a notion that human summaries really aren't that diverse.

The take-home message is that if you can opportunistically pick one good sentence to go in your summary, the remaining sentences you choose should be as similar to that one was possible. It's sort of an exploitation (not exploration) issue.

The problem is that I don't think this is true. I think it's an artifact, and probably a pretty bad one, of the "new" version of Rouge with multiple references. In particular, suppose I opportunistically choose one good sentence. It will match a bunch of ngrams in, say, reference 1. Now, suppose as my second sentence I choose something that is actually diverse. Sure, maybe it matches something diverse in one of the references. But maybe not. Suppose instead that I pick (roughly) the same sentence that I chose for sentence 1. It won't re-match against ngrams from reference 1, but if it's really an important sentence, it will match the equivalent sentence in reference 2. And so on.

So this is all nice, but does it happen? It seems so. Below, I've taken all of the systems from DUC 2006 and plotted (on X) their human-graded Non-Redundancy scores (higher means less redundant) against (on Y) their Rouge-2 scores.

Here, we clearly see (though there aren't even many data points) that high non-redundacy means low Rouge-2. Below is Rouge-SU4, which is another version of the metric:

Again, we see the same trend. If you want high Rouge scores, you had better be redundant.

The point here is not to gripe about the metric, but to point out something that people may not be aware of. I certainly wasn't until I actually started looking at what my system was learning. Perhaps this is something that deserves some attention.

01 April 2008

Should NAACL and ICML co-locate?

It's been a while since we've had a survey, so here's a new one. The question is: should NAACL and ICML attempt to co-locate in the near future (probably no sooner than 2010, since planning for 2009 is already fairly well established). The reason the question is NAACL-specific and not, eg., ACL or EMNLP, is because I don't have any control over what ACL or EMNLP does. (For that matter, I don't have any control over what ICML does either.)

The obvious question is: why should this happen? Well, for one, it will save me travel time and money! What more convincing do you want?

More seriously, I think that the NLP community and the ML community have been bumping up against each other for a while now. More importantly, the model is no longer: NLP world supplies data to ML world supplies algorithm to NLP world. Our problems are often different from those that are commonly studied in ML, or have different prior biases (not necessarily in the Bayesian sense). But not only do we have interesting problems, I think we also have a different way of viewing the world (though not so different as to make communication impossible). I think the goal of such an endeavor would be marked most by having shared tutorials (one side can see what the other does) and shared workshops.

Should NLPers and MLers co-locate?
	I attend both and would like a co-location
	I attend both and would not like a co-location
	I usually only attend (NA)ACL, but would like a co-location
	I usually only attend (NA)ACL, and would not like a co-location
	I usually only attend ICML, but would like a co-location
	I usually only attend ICML, and would not like a co-location
	I attend neither and so my vote doesn't count