25 December 2005

Under the NLP Christmas Tree

Lance Fortnow eloquenty states that all he wants for Christmas is a proof that P != NP. This got me to thinking about what I, as an NLPer, would most like for Christmas. Unfortunately, things in AI aren't as clear-cut as things in theory because it's hard to tell if we've succeeded. That said, what one might call a Halbert's Program for NLP is a (set of) techniques for super-human aggregation of information from multiple sources, media and domains specific to a given user. There are several "enabling" technologies that (I believe) are necessary to achieve this goal (those that I actively work on are starred):

  • Human-level machine translation models (data in, system out)
  • Turn-key structured prediction techniques with lots of features, complex loss functions and hidden variables*
  • (Weak?) semantic entailment on raw text
  • Personalization that doesn't obviate privacy
  • Domain adaptation of arbitrary statistical models with little new annotated data*
  • Automatic speech recognition
  • Graph, equation and table "understanding"
  • Weakly supervised summarization, more complex than extraction*
  • Database querying from natural language queries
  • Social network analysis
  • Plan recognition, opinion and point-of-view analysis

There are doubtless other interesting things and, of course, each of these might have its own set of enabling technologies (and there are also some cross dependencies). One perhaps controversial aspect is that I consider MT to be an enabling technology; this is because I don't necessarily care about reading Chinese documents -- I care about the information in the Chinese documents. To me, everything is about getting information, regardless of the source.

20 December 2005


Summarization is one of the canonical NLP problems, but of all the big ones (MT, speech, IR, IE, etc.) it is in my opinion the most difficult (let the flames begin!). The reason I think it's so hard is because it's unclear what a summary is. When one cannot define a problem well, it is impossible to solve, and difficult ⊆ impossible. There has been an enormous amount of effort to define specific summarization problems that we can solve, a comparable amount of effort on figuring out how to measure how good a solution is, and a lot of effort on building models/systems that can solve these problems. We've tried many things; see the history of DUC for a small, biased, incomplete set.

That said, I think the field has much potential, but not necessarily by trying to mimick what a human would do when asked to produce a summary. I'm interested in doing summarization-like tasks that a human would never be able to do reliably. Here are some examples to give a flavor of what I'm talking about:

  • Let me go to a scientific engine like Rexa or CiteSeer and ask for a summary of "reinforcement learning." The system should know what papers I've read, what papers I've written, the relationship between all the papers in its database, an analysis of the goodness of authors and so on. What it produces for me must be markedly different from what it would produce for Satinder Singh.

  • Let me go to Amazon and ask about new digital cameras. Maybe I'm interested in spending $200-$300. It should know that I've never bought a digital camera before, or that I've bought 4. I want a summary of important specifications, user comments and so on.

  • Let me go to my own inbox and ask for a summary of what I've been discussing with John recently.

One can imagine many more similar tasks, but these are three obvious ones. The nice thing about these is that even partial solutions would be enormously useful (to me, at least...my mom might not care about the first). These are also things that people really can't do well. If someone asks me for something like the first one but, say, on structured prediction instead of reinforcement learning, I can give a summary, but it will be heavily biased. It is worse for the second, where I can basically only produce anecdotal evidence, and virtually impossible for the third.

The most important outstanding issue is how to measure sucess at such problems. I cannot imagine how to do this without doing user studies, but people probably felt the same way about MT a few years ago. How about now? But given the amount of personalization in these tasks, I feel that it would be harder to do automatic evaluation of them. Probably the most important things to measure in user studies are subjective satisfaction, how many times multiple searches had to be performed and so on. One could also take a TREC style approach for comparative pair-wise evaluations by marking system 2 down if it missed something system 1 found that a human thought was important.

There are also tons of subproblems that can be pulled from this tangle of tasks. Most notably, personalization methods, social network analysis methods, redundancy identification, coherence, information presentation (UI) techniques, generation of multimodal outputs (tables, graphs, etc.), dealing with imperfect input (googling for "reinforcement learning" also produces irrelevant documents), opinion identification, processing of ungrammatical input, anti-spam, and so on. I'm not a huge proponent of just solving subtasks that aren't proven to be necessary, but it is sometimes helpful to take this approach. I think we just have to keep the big picture in our minds.

19 December 2005


Several people have already posted what they thought were the coolest NIPS papers this year elsewhere. Instead of saying what I thought were necessarily the best, I'll say what I thought were most interesting from an NLP perspective.

Max Margin Semi-Supervised Learning for Structured Variables. (Yasemin Altun, David McAllester, Mikhail Belkin) This paper talks about the problem of using unlabeled data to learn structured prediction models, in the style of M3Ns. The idea is the same as manifold-based semi-supervised learning in standard classification, extended to structure: penalize disagreement of "nearby" labels (where nearby is defined in feature space, not structure space). They work in the very low data setting and find that (a) unlabeled data helps and (b) structure helps. Neither is surpising, but both are good. I wonder how to scale this to large data sets: all these manifold-based SS techniques require matrix operations on the gram matrix from the unlabeled data, which must be at least O(n^2) but is often worse (n = # of unlabeled points). In NLP, we have bajillions of unlabeled points. There's got to be a good way to do this that will scale, but I haven't see anything yet.

Group and Topic Discovery from Relations and Their Attributes. (Xuerui Wang, Natasha Mohanty, Andrew McCallum) This is a multi-clustering (cluster multiple things at once) paper, where they simultaneously identify clusters of groups and clusters of topics. E.g., they want to cluster senators by how they vote and bills by what they are about. If you do these two things simultaneously, you can get better statistics and thus do better overall.

Structured Prediction via the Extragradient Method. (Ben Taskar, Simon Lacoste-Julien, Michael Jordan) This is a clever solution to extending structured prediction outside the world of Markov fields to the world of quadratic cost min-flow problems (think matching). They get big speedups over previous optimization techniques and can handle a larger set of problems. I plan on writing a separate post on structured prediction, where I'll discuss this and related papers in more detail. But I thought it was cool enough to mention here anyway.

Interpolating between types and tokens by estimating power-law generators. (Tom Griffiths, Sharon Goldwater, Mark Johnson) Despite the scariness of the name, this paper is very interesting to NLP folk. Everything in language looks like a Zipf/power-law distribution, and this paper shows that the Pitman-Yor process models this very well. This is the same model that Yee Whye suggested for modeling Kneser-Ney smoothing, so it's clearly something that has some real weight in NLP.

Context as Filtering. (Daichi Mochihashi, Yuji Matsumoto) Discusses the issue of long-range dependencies in language modeling, breaking out of the bag of words-style histories paradigm. They use both an LDA-like and Dicihlet-mixture topic model that "shifts" through time and can perform inference by treating it as a particle filter problem. The results are on 11m words of data, which is reasonable, and seem promising, though I'd like to see more comparative results.

And that's it! Feel free to post other cool NIPS papers.

15 December 2005

Bayesian Methods for NLP (summary)

I recently co-organized a BayesNLP workshop with Yee Whye. Here's a brief summary of a subset of the talks and the discussions that took place.

Topic models and language modeling:
A lot of the discussion and papers were about either TMs, LMs or both. Much of the discussion of topic models was how to introduce Markov-style dependencies. There are essentially three ways to do this: (1) make word i dependent on topic i and word (i-1); (2) make topic i dependent on topic (i-1); (3) both. Basically this comes out in how you structure the "beta" language models (in LDA terminology). There is a trade-off between number of params (|vocab| * (# top)^2 versus |vocab|^2 * (# top)) and the ability to fit data. I think the general consensus is that if you have a lot of data (which you should!) then you should use the more expressive models.

The major problem with these models is that they often are evaluated by their perplexity on test data. These perplexities are significantly higher than those obtained by people in the speech community, which raises the "why should I care question" (see this other entry). There are several potential answers: (1) topics can be embeded in a task (say MT) and this leads to better performance; (2) topics are used to enable new tasks (browsing Science repositories); (3) topics can be compared with what humans do in a CogSci manner.

This topic lead into some incomplete discussion on what sorts of problems we might want to work on in the future. I don't think there was a solid decision made. In terms of what applications might be interesting, I think the agreement was that Bayesian techniques are most useful in problems for which there is insufficient data to fit all parameters well. Since "there's no data like more data" has become a mantra in NLP, this seems like it would include every problem! My opinion is that Bayesian methods will turn out to be most useful for largely unsupervised tasks, where my prior knowledge can be encoded as structure. I think there's lots of room to grow into new application domains (similar to some stuff Andrew McCallum has been working on in social network analysis). Introducing new tasks makes evaluation difficult which can make publication difficult (your eight pages have to go both to technique an evaluation), but I think it's the right way for the community to head.

I also really like Yee Whye's talk (which happened to propose basically the same model as a paper by Goldwater, Griffiths and Johnson at this same NIPS), where he basically gave an interpretation of KN smoothing as a nonparametric Bayesian model with a Poisson-Dirichlet prior. Unlike previous methods to explain why KN works, this actually give superior results to interpolated KN (though it loses to modified interpolated KN). Shaojun talked about integrating a whole bunch of stuff (Markov models, grammars and topics) into a language model using directed Markov fields as an "interface" language. This was really cute and they seem to be doing really well (going against the above comment that it's hard to get comparable perplexities). I believe there's an upcoming CL paper on this topic.

If anyone else took part in the BNLP workshop and would like to comment, you're more than welcome.

13 December 2005

Interesting topics in NLP

Papers for HLT/NAACL are due Friday, but since I'm probably not submitting anything, I'll take this opportunity to write down some thoughts about what NLP seems to care about and what I think it might care about more. Topic-wise, we have several big areas that typically overwhelm conference proceedings:

  • Machine Translation
  • Summarization
  • Language Modeling
  • Information Extraction
  • Parsing
There's obviously a lot more to NLP than these topics, but these by far recieve the highest publication rate at conferences like ACL, NAACL, HLT and COLING. I think it's fairly obvious why. Each has some strong combination of the "required" two aspects for publication at these conferences:
  • Importance (or strong community beliefs thereof)
  • Evaluatability (especially on common data sets with common metrics)
There is a third aspect under which contributions can be measured that doesn't seem to play a strong role in our field right now: Interestingness. The problem is that evaluating interestingness is difficult. As pointed out by Sam Roweis at a discussion at the BayesNLP workshop, this is something that SIGGRAPH is very successful at. Perhaps SIGGRAPH papers aren't on the most important or evaluatable topics, but they are certainly interesting. Putting too much weight on interestingness is probably a bad idea, especially if it completely forgoes the other two aspects. But maybe we'd be better off as a community if we had one or two tracks (or perhaps a workshop) that focused exclusively on interesting, new problems, even if they fumble a bit on evaluation and obvious short-term importance. Some topics I care about (i.e., think are interesting) that are not commonly seen at ACL-like conferences are:
  • User modeling or personalization in general
  • CogSci-oriented human-learning or -usage of language
  • Situated language experiments, in real or simulated worlds
  • Super-sentence models of coherence, cohesion, relation between documents
  • Tasks involving multiple media (text + {video, images, charts/tables, ...})
I don't expect these thing to actually start appearing soon, and I know there are conferences that focus more on each aspect individually, but I think it's time that our community as a whole starts paying more serious attention to what goes on in these areas.

12 December 2005


I enjoy reading and (occasionally) posting on other research blogs, most specifically John Langford's Machine Learning Theory. As much as machine learning is a part of my life, I've not seen a place to discuss issues in NLP/CL/etc. in an open forum. I've found in my life as an NLPer (a term I started using with Yee Whye Teh and hope others will adopt) that there are often lots of discussions across lunch tables and over coffee that should be made public for the benefit of others.

I welcome comments by people both within and without the NLP/CL community, and would love to have occasional "guest" posts (email me if you'd like to post something). I'll begin with the first real post shortly and see where things go from there.