25 July 2006

Preprocessing vs. Model

Back from Sydney now, and despite being a bit jetlagged, I thought I'd begin with a COLING/ACL-related post. There was a paper presented by Ben Wellington that discusses whether current syntax-driven models for translation can capture the sorts of reorderings/alignments observed in hand-aligned data. I don't have the paper in front of me, but what I remember from the talk is that something like 5-10% of sentences cannot be covered by an ITG-like model without gapping.

At the end of the talk, Dekai Wu commented that they are aware of this issue for ITG, and that an analysis of where it happens shows it is essentially "weird" extrapositions that occur in English, such as wh-movement, topical preposing, adverbial movements, etc. Apparently such things occur in roughly 5-10% of sentences (in whatever corpora were being used). This talk was near the end of the conference, so my buffer was nearly full, but my recollection is that the suggestion (implied or explicit) was that English could (should?) be "repaired" by unextraposing such examples, at which point a model like ITG could be directly applied to nearly all the data.

To me this seems like a strange suggestion. ITG is a very simple, very easy-to-understand model of translation; more natural, in many regards, than, say, the IBM models. (Okay, maybe in all regards.) One nice thing about the model is that it is largely un-hackish. Given this, the last thing I'd want to do is "fix" English with some preprocessing step so that the output can then be fed into this nice, pretty model.

There's an obvious generalization of this question, hinted at in the title of this post. In many problems (translation with ITG being but one example), we can either choose a theoretically clean but not completely adequate model plus some preprocessing, or a somewhat messier model that covers, say, 99.9% of the data with no preprocessing. One could argue that this is the long-sought-after integration of rule-based and statistical NLP, but that seems uncompelling, because it's not really an integration. I'd almost always prefer the latter sort of solution, though that's not to say the first isn't also useful. For instance, in the ITG case, after identifying the cases where the ITG assumption fails, we could try to minimally extend the ITG model to capture these corner cases.
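
To make the ITG constraint concrete, here is a minimal sketch (mine, not anything from the paper) of the standard shift-reduce test for whether a one-to-one alignment, viewed as a permutation of target positions, can be generated by an ITG without gaps:

    def itg_compatible(perm):
        """Return True iff the permutation can be built by a gap-free ITG,
        i.e. it reduces to a single span by repeatedly merging adjacent
        blocks in straight or inverted order."""
        stack = []  # (lo, hi) spans of target positions
        for p in perm:
            stack.append((p, p))
            while len(stack) >= 2:
                lo1, hi1 = stack[-2]
                lo2, hi2 = stack[-1]
                if lo2 == hi1 + 1:      # straight combination
                    stack[-2:] = [(lo1, hi2)]
                elif hi2 == lo1 - 1:    # inverted combination
                    stack[-2:] = [(lo2, hi1)]
                else:
                    break
        return len(stack) == 1

    itg_compatible([1, 0, 3, 2])  # True: the swaps nest, so ITG handles it
    itg_compatible([2, 0, 3, 1])  # False: the "inside-out" alignment

The second call returns False; inside-out alignments like that, which Wu's comment attributes largely to English extrapositions, are exactly what a gapless ITG cannot cover.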

3 comments:

Kevin Duh said...

I would argue that the preprocessing framework is totally fine -- if one can identify the deficiencies of a model and then perform the correct transformations of the data to suit that model, that's theoretically sound, because you know exactly what you're doing. (What's not theoretically sound, I believe, is hacking the data without a principled approach, just to improve performance in the end.)

A lot of things we do can actually be seen in the preprocessing framework, e.g.:
- Map the feature space to a higher dimension (preprocessing), then train a linear classifier (model) -- see the sketch below
- Do dimensionality reduction or LSA (preprocessing), then apply your model
- Do an FFT or wavelet transform on your signal (preprocessing), then apply some filtering (model)

They all seem theoretically sound, right? :)
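
For concreteness, here is a toy version of the first bullet (my own sketch, not from any particular library): an explicit quadratic feature map as the preprocessing step, then a vanilla perceptron as the model. XOR-style data is not linearly separable in the raw space, but it is after the map:

    import numpy as np

    def quadratic_features(X):
        """Preprocessing: append all pairwise products of the inputs."""
        n, d = X.shape
        cross = np.einsum('ni,nj->nij', X, X).reshape(n, d * d)
        return np.hstack([X, cross])

    def train_perceptron(X, y, epochs=10):
        """Model: a plain linear classifier on whatever features it gets."""
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                if yi * (xi @ w) <= 0:  # misclassified or on the boundary
                    w += yi * xi
        return w

    # XOR labels: impossible for a linear model on the raw two features
    X = np.array([[1., 1.], [1., -1.], [-1., 1.], [-1., -1.]])
    y = np.array([1, -1, -1, 1])
    w = train_perceptron(quadratic_features(X), y)
    print(np.sign(quadratic_features(X) @ w))  # matches y

The model itself never changes; only the representation it sees does, which is the sense in which the preprocessing is principled.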

Anonymous said...

While I agree with Kevin, I took the other approach in this paper to see what kind of modification to the MODEL could be done to increase coverage, and one possibility was to allow gaps in an ITG. So this is exactly the kind of "model" change you are talking about. On the other hand, changing the model can increase complexity a lot, as it does with this solution, so one needs to be careful. There is a lot of good theoretical work to be done here.

hal said...

I guess there's sort of a science vs. engineering question here....if we want to believe/claim that (for sake of argument) ITG is actually a model of translation, it would be nicer to have no preprocessing. On the other hand, if we just want it to work, who cares? :)

I agree (to some degree) with Kevin -- there's a lot of preprocessing that pretty much has to happen. But I feel like a lot of it removes particular idiosyncrasies of typography rather than of language. For a simple example, tokenization is a very important preprocessing step, but it is really just an artifact of how we write. On the other hand, word segmentation from speech (which would be closer to the "language" side) is somewhat similar....but somehow I wouldn't consider that as much of a "preprocessing" step as tokenization.
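
(To make the tokenization point concrete, a minimal sketch -- the regex is purely illustrative, not any particular toolkit's rules:)

    import re

    TOKEN_RE = re.compile(r"\w+|[^\w\s]")  # words, or single punctuation marks

    def tokenize(text):
        """Split punctuation off words -- an artifact of how we write."""
        return TOKEN_RE.findall(text)

    tokenize("Don't preprocess, they said.")
    # -> ['Don', "'", 't', 'preprocess', ',', 'they', 'said', '.']

Even this tiny rule mangles the contraction, which is a small preview of the brittleness worry below.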

What scares me most about this variety of preprocessing is that it is usually quite brittle. For instance, Mike, Philipp and Ivona have a paper that applies a series of syntactic transformations to German before translating it into English. These are based on the output of a German parser and some hand-written rules. One can easily imagine this combination doing really bizarre things when the parser makes errors (which of course it will). I suppose this conflates the preprocessing issue with the pipelining issue (we'd really just like to maintain uncertainty across the boundary), but preprocessing is almost by definition a pipelined task...and I feel like it can break fairly easily.

Somehow I feel like feature extraction (under which I'd include the FFT), dimensionality reduction, etc., are not quite the same sort of beast, but I'm not quite sure where I'd draw the line :).
