23 August 2010

Finite State NLP with Unlabeled Data on Both Sides

(Can you tell, by the recent frequency of posts, that I'm try not to work on getting ready for classes next week?)

[This post is based partially on some conversations with Kevin Duh, though not in the finite state models formalism.]

The finite state machine approach to NLP is very appealing (I mean both string and tree automata) because you get to build little things in isolation and then chain them together in cool ways. Kevin Knight has a great slide about how to put these things together that I can't seem to find right now, but trust me that it's awesome, especially when he explains it to you :).

The other thing that's cool about them is that because you get to build them in isolation, you can use different data sets, which means data sets with different assumptions about the existence of "labels", to build each part. For instance, to do speech to speech transliteration from English to Japanese, you might build a component system like:

English speech --A--> English phonemes --B--> Japanese phonemes --C--> Japanese speech --D--> Japanese speech LM

You'll need a language model (D) for Japanese speech, that can be trained just on acoustic Japanese signals, then parallel Japanese speech/phonemes (for C), parallel English speech/phonemes (for A) and parallel English phonemes/Japanese phonemes (for B). [Plus, of course, if you're missing any of these, EM comes to your rescue!]

Let's take a simpler example, though the point I want to make applies to long chains, too.

Suppose I want to just do translation from French to English. I build an English language model (off of monolingual English text) and then an English-to-French transducer (remember that in the noisy channel, things flip direction). For the E2F transducer, I'll need parallel English/French text, of course. The English LM gives me p(e) and the transducer gives me p(f|e), which I can put together via Bayes' rule to get something proportional to p(e|f), which will let me translate new sentences.

But, presumably, I also have lots of monolingual French text. Forgetting math for a moment, which seems to suggest that this can't help me, we can ask: why should this help?

Well, it probably won't help with my English language model, but it should be able to help with my transducer. Why? Because my transducer is supposed to give me p(f|e). If I have some French sentence in my GigaFrench corpus to which my transducer assigns zero probability (for instance, max_e p(f|e) = 0), then this is probably a sign that something bad is happening.

More generally, I feel like the following two operations should probably give roughly the same probabilities:

  1. Drawing an English sentence from the language model p(e).
  2. Picking a French sentence at random from GigaFrench, and drawing an English sentence from p(e|f), where p(e|f) is the composition of the English LM and the transducer.
If you buy this, then perhaps one thing you could do is to try to learn a transducer q(f|e) that has low KL divergence between 1 and 2, above. If you work through the (short) make, and throw away terms that are independent of the transducer, then you end up wanting to minimize [ sum_e p(e) log sum_f q(f|e) ]. Here, the sum over f is a finite sum over GigaFrench, and the sum over e is an infinite sum over positive probability English sentences given my the English LM p(e).

One could then apply something like posterior regularization (Kuzman Ganchev, Gra├ža and Taskar) to do the learning. There's the nasty bit about how to compute these things, but that's why you get to be friends with Jason Eisner so he can tell you how to do anything you could ever want to do with finite state models.

Anyway, it seems like an interesting idea. I'm definitely not aware if anyone has tried it.


Alexandre Passos said...

This sounds a lot like semi-supervised learning via posterior regularization (as in http://www.seas.upenn.edu/~kuzman/publications/ganchev_penn_2009.pdf ). I've never seen it applied to translation, but I guess it should help (after all the approximations it will take to get it working).

hal said...

@Alexandre: um, I think I said that :)

hal said...

@Alexandre: more the the point, the implementation idea I sketched could be done with posterior regularization. But I don't remember Kuzman et al. attempting any problem with unlabeled source data....

Alexandre Passos said...

hal: I can't believe I missed your second-to-last paragraph and ended up repeating it in the comments. Oh well.

I haven't seen usage of unlabeled source data on posterior regularization either, but in your setting it's not exactly source data, since the transducer is P(f|e), and you're using french data to regularize it, right?

hal said...

well it's source in the translation direction (which means target in the noisy channel direction). but that's the side for which unlabeled data isn't used (on the target translation / source noisy channel side, you use monolingual data to get the LM).

Bob Carpenter said...

I love the "speech transliterator" example. For years I've thought it'd make a great science museum exhibit to let kids hear how they'd sound with an accent. I never figured out how you'd get the segments to sound right in the synthesizer so it sounded like them, but with Japanese segments (Alan Black probably knows the answer).