[This post is based partially on some conversations with Kevin Duh, though not in the finite state models formalism.]
The finite state machine approach to NLP is very appealing (I mean both string and tree automata) because you get to build little things in isolation and then chain them together in cool ways. Kevin Knight has a great slide about how to put these things together that I can't seem to find right now, but trust me that it's awesome, especially when he explains it to you :).
The other thing that's cool about them is that because you get to build them in isolation, you can use different data sets, which means data sets with different assumptions about the existence of "labels", to build each part. For instance, to do speech to speech transliteration from English to Japanese, you might build a component system like:
English speech --A--> English phonemes --B--> Japanese phonemes --C--> Japanese speech --D--> Japanese speech LM
You'll need a language model (D) for Japanese speech, that can be trained just on acoustic Japanese signals, then parallel Japanese speech/phonemes (for C), parallel English speech/phonemes (for A) and parallel English phonemes/Japanese phonemes (for B). [Plus, of course, if you're missing any of these, EM comes to your rescue!]
Let's take a simpler example, though the point I want to make applies to long chains, too.
Suppose I want to just do translation from French to English. I build an English language model (off of monolingual English text) and then an English-to-French transducer (remember that in the noisy channel, things flip direction). For the E2F transducer, I'll need parallel English/French text, of course. The English LM gives me p(e) and the transducer gives me p(f|e), which I can put together via Bayes' rule to get something proportional to p(e|f), which will let me translate new sentences.
But, presumably, I also have lots of monolingual French text. Forgetting math for a moment, which seems to suggest that this can't help me, we can ask: why should this help?
Well, it probably won't help with my English language model, but it should be able to help with my transducer. Why? Because my transducer is supposed to give me p(f|e). If I have some French sentence in my GigaFrench corpus to which my transducer assigns zero probability (for instance, max_e p(f|e) = 0), then this is probably a sign that something bad is happening.
More generally, I feel like the following two operations should probably give roughly the same probabilities:
- Drawing an English sentence from the language model p(e).
- Picking a French sentence at random from GigaFrench, and drawing an English sentence from p(e|f), where p(e|f) is the composition of the English LM and the transducer.
One could then apply something like posterior regularization (Kuzman Ganchev, Graça and Taskar) to do the learning. There's the nasty bit about how to compute these things, but that's why you get to be friends with Jason Eisner so he can tell you how to do anything you could ever want to do with finite state models.
Anyway, it seems like an interesting idea. I'm definitely not aware if anyone has tried it.