## 28 July 2006

### Loss versus Conditional Probability

There was a talk in the session I chaired at ACL about directly optimizing CRFs to produce low F-scores for problems like NE tagging and chunking. The technique is fairly clever and is based on the observation that you can use very similar dynamic programming techniques to do max F-score as to do max log-probability in CRFs.

The details are not particularly important, but during the question phase, Chris Manning asked the following question: Given that F-score is not really motivated (i.e., is a type 4 loss), should we really be trying to optimize it? Conditional probability seems like a completely reasonable thing to want to maximize, given that we don't know how the tags will be used down the pipeline. (It seems Chris is also somewhat biased by a paper he subsequently had at EMNLP talking about sampling in pipelines.)

I think Chris' point is well taken. Absent any other information, conditional probably seems like a quite plausible thing to want to optimize, since given the true conditional probabilities, we can plug in any loss function at test time and do minimum Bayes risk (in theory, at least).

On the other hand, there is an interesting subtlety here. Conditional probability of what? The standard CRF optimization tries to maximize conditional probability of the entire sequence. The "alternative objective" of Kakade, Teh and Roweis optimizes the sub of the conditional probabilities of each label. These are two quite different criterea, and which one should we choose? In fact, neither really seems appropriate. Conditional probability of the sequence doesn't make sense because it would rather improve a bad label from probability 0.01 to 0.1 than improve a bad label from 0.4 to 0.6 and thus get it right. But summed conditional probability of labels doesn't make sense in NE tagging tasks because always assigning probability 0.9 to "not an entity" will do quite well. This is essentially the "accuracy versus f-score" problem, where, when few elements are actually "on," accuracy is a pretty terrible metric.

If we take Chris' advice and desire a conditional probability, it seems what we really want is direct conditional probability over the chunks! But how do we formulate this and how do we optimize it? My impression is that a direct modification of the paper Chris was asking about would actually enable us to do exactly that. So, while the authors of this paper were focusing on optimizing F-score, I think they've also given us a way to optimize conditional chunk probabilities (actually this should be easier than F-score because there are fewer forward/backward dependencies), similar to what Kakade et al. did for conditional label probabilities.

## 25 July 2006

### Preprocessing vs. Model

Back from Sydney now, and despite being a bit jetlagged, I thought I'd begin with an COLING/ACL-related post. There was a paper presented by Ben Wellington that discusses whether current syntax-driven models for translation can capture the sorts of reorderings/alignments observed in hand-aligned data. I'm not looking at the paper, but what I remember from the talk is that something like 5-10% of sentences are missed assuming an ITG-like model, without gapping.

At the end of the talk, Dekai Wu commented that they are aware of this for ITG and that an analysis of where it happens is essentially is "weird" extrapositions that occur in English, such as wh-movement, topical preposing, adverbial movements, etc. Apparently such things occur in roughly 5-10% of sentences (from whatever corpora were being used). This talk was near the end of the conference, so my buffer was near full, but my recollection is that the suggestion (implied or explicit) was that English could (should?) be "repaired" by unextraposing such examples, at which time a model like ITG could be directly applied to nearly all the data.

To me this seems like a strange suggestion. ITG is a very simple, very easy to understand model of translation; more natural, in many regards, than, say, the IBM models. (Okay, maybe in all regards.) One nice thing about the model is that it is largely un-hackish. Given this, the last thing I'd really want to do is to fix English by applying some preprocessing step to it, so that the output could then be fed into this nice pretty model.

There's an obvious generalization to this question, hinted at in the title of this post. It seems that in many problems (translation with ITG being but one example), we can either choose a theoretically clean but maybe not completely appropriate model plus some preprocessing, or a somewhat messier but maybe 99.9% recall model with no preprocessing. I suppose one could argue that this is the long sought after integration of rule-based and statistically-based NLP, but this seems uncompelling, because it's not really an integration. I suppose that I'd almost always prefer the latter sort of solution, but that's not to say the first isn't also useful. For instance, in the ITG case, after identifying cases where the ITG assumption fails, we could try to minimally enhance the ITG model to be able to capture these corner cases.

## 15 July 2006

### Where are Interesting Learning Problems in NLP?

I just spent a few days visiting Yee Whye and NUS (photos to prove it!). YW and I talked about many things, but one that stood out as a ripper is attempting to answer the question: as a machine learning person (i.e., YW), what problems are there in NLP that are likely to be amenable to interesting machine learning techniques (i.e., Bayesian methods, or non-parametric methods, etc.). This was a question we tried to answer at the workshop last year, but I don't think we reached a conclusion.

At first, we talked about some potential areas, mostly focusing around problems for which one really needs to perform some sort of inference at a global scale, rather than just locally.  I think that this answer is something of a pearler, but not the one I want to dwell on.

Another potential partial answer arose, which I think bears consideration: it will not be on any problem that is popular right now.  Why?  Well, what we as NLPers usually do these days is use simple (often hueristic) techniques to solve problems.  And we're doing a sick job at it, for the well studied tasks (translation, factoid QA, ad hoc search, parsing, tagging, etc.).  The hunch is that one of the reasons such problems are so popular these days is because such techniques work so bloody well.  Given this, you'd have to be a flamin' galah to try to apply something really fancy to solve one of these tasks.

This answer isn't incompatible with the original answer (globalness).  After all, most current techniques only use local information.  There is a push toward "joint inference" problems and reducing our use of pipelining, but this tends to be at a fairly weak level.

This is not to say that Bayesian techniques (or, fancy machine learning more generally) is not applicable to problems with only local information, but there seems to be little need to move toward integrating in large amounts of global uncertainty.  Of course, you may disagree and if you do, no wuckers.

p.s., I'm in Australia for ACL, so I'm trying to practice my Aussie English.

## 13 July 2006

### Conference Formats

There are two conference formats I'm familiar with. One is typified by ACL and ICML; the other by NIPS. The ACL format is a roughly 3 day schedule, with three parallel sessions, which gives time for somewhere around 80 presentations, each 20 mins + 5 for questions. ACL has smallish poster sessions that are typically associated with "short papers." My impression is that many people do not attend the poster sessions.

In contrast, NIPS typically accepts many more papers (120ish per year), but the vast majority are presented as posters. There is only one track at the conference, with several invited talks, a few paper presentations (maybe 15 total) that last 20 minutes. They also have "spotlight presentations", which are 2 slide, 2 minute talks designed to draw attention to particularly interesting posters. The poster sessions run from (roughly) 6 until 10 every night and are attended by everyone.

My feeling is that both formats have advantages and disadvantages.  I personally like the NIPS format, essentially because I feel that at the end of the conference, I have seen more of what is there (because I spend so much time at posters).  This also means that I'm completely exhausted at the end of NIPS.  The major disadvantage to the NIPS format seems to be that I'm used to using post-conference time as dinner socialization time, and this seems to happen somewhat less at NIPS (this is confirmed by a friend who is more in the "in crowd" at NIPS than I am).  I think that it might be a little intimidating for junior researchers to go up to a "famous" poster, but on the other hand, this forces you to immediately get to know everyone.  The "lots of posters" format also means that the decision about the number of papers to accept at NIPS is essentially limited by physical space, rather than length of the conference.  Ken Church seems to think we should be accepting more papers, anyway.

Are there other major disadvantages to having a single-track conference, with the majority of papers presented as posters?  I don't expect to see a shift in how our conferences are held, but I'd like to understand all the pros and cons.

### Visa Problems entering USA?

Some people are trying to do something about it (contacting congressmen, etc.). If you've had problems entering the US for conferences, you should contact them.

## 07 July 2006

### What is Natural Language Understanding?

I've read lots of papers and seen lots of talks that justify a task as being useful to doing "natural language understanding." The problem I have here is that I really have no idea what NLU actually is. I think the standard line is that NLU is the process of transforming natural language text into "some representation" that is "easy" for a computer to "manipulate." Of course, the quoted words are so vacuous in meaning that I can easily imagine reasonable definitions for them that make this task either trivial or essentially impossible.

What I find unfortunate here is that, without trying hard to pin down definitions, this seems like a completely reasonable goal for NLP technology; in fact, my impression is that up until 10 or 20 years ago, this actually was one of the major goals of the field. According to the anthology, roughly 25% of the papers in 1979 contained the terms "NLU" or "language understanding", compared to 20% in the 1980s, 10% in the 90s and 7% in the 2000s. One possible explanation of the dwindling of this topic is that publication has become increasingly driven by experimental results, and if one cannot pin down a definition of a task, one cannot reliable compare results across multiple papers.

The recent push on textual entailment bears some semblance to NLU, but is simultaneously better defined and more restricted (though I admit to having heard some grumbling that some more work needs to be done to get textual entailment to be a very well defined task). There also continues to be a modicum of work on database filling and natural langauge database queries, though certainly not at the same rate as before.

## 03 July 2006

### My Thesis Path

(Minor note: the latest version of the EDT section didn't get folded in properly to the version of the thesis that went up yesterday; this is fixed now.)

Many people have asked me how I settled on a thesis topic, I think largely because they are trying to find their own paths. My story is (at least to me) a bit odd and circuitous, but I'll tell it anyway.

When I came to USC, I knew I wanted to do NLP and, more specifically, I wanted to do summarization. Working with Daniel was a natural choice. I was particularly interested in coming up with good models for getting around the sentence extraction paradigm that dominates the field. My first work was on extending Kevin and Daniel's sentence compression work to the document level by using discourse information. This worked reasonably well. My next milestone was to try to leverage alignment information of document/abstract pairs in order to learn complex transformations. This led to the segment HMM model for doc/abs alignment, that met with reasonable success (considering how darned hard this problem is).

At that point, it became clear that trying to do a full abstractive system just wasn't going to work. So I started looking at interesting subproblems, for instance sentence fusion. Unfortunately, humans cannot reliably do this task, so I threw that out, along with a few other ideas. Around the same time, I began noticing that to have any sort of reasonably interesting model that did more than sentence extraction, I really was going to need to run a coreference resolution system. So I went to the web (this was back in 2003) and found one to download.

Except I didn't because there weren't any publicly available.

So I went to build my own. I wanted to do is "right" in the sense that I wanted a principled, joint system that could be trained and run efficiently. I read a lot of literature and didn't find anything. So then I read a lot of machine learning literature (I was beginning to get into ML fairly hardcore at this point) to find a framework that I could apply. I couldn't find anything.

So I decided to build my own thing and came up with LaSO, which is essentially a formalization and tweak on Mike Collins and Brian Roark's incremental perceptron. My thesis proposal used LaSO to solve the EDT problem, and I presented LaSO at ICML 2005. Also at ICML 2005 I met John Langford, having previously noticed that what I was doing with LaSO looked something like some recent work he had done on reinforcement learning. We had a long dinner and conversation and, after a few visits to Chicago, many emails, and lots of phone calls, came up with Searn. With both LaSO and Searn, I always had in the back of my mind that I didn't want to make any assumptions that would render it inapplicable to MT, since everyone else at ISI only does MT :).

At about the same time, I started thinking about how to do a better job of sentence compression and came up with the vine-growth model that eventually made it into my thesis. This was really the first point that I started thinking about summarization again, and, in the evolution of LaSO to Searn, it now became possible to do learning in this model.

So, in a sense, I had come full circle. I started with summarization, lost hope and interest, began to get more interested in EDT and machine learning, and then finally returned to summarization, this time with a new hammer.

## 02 July 2006

### Thesis is Done!

I'm happy to announce that my thesis, Practical Structured Learning Techniques for Natural Language Processing, is now done (which is good, because it needs to be handed in by 11:59pm on July 5th). Many thanks to everyone who supported me through this (long) process; I really appreciate it.