29 April 2007

Problems with Sometimes-hidden Variables

Following in the footsteps of Andrew, I may start posting NLP-related questions I receive by email to the blog (if the answer is sufficiently interesting). At least its a consistent source of fodder. Here's one from today:

I am looking for a task, preferably in nlp and info retriv, in which we have labeled data where some are completely observed (x,h) and some are partially observed x. (h is a latent variable and is observed for some data points and not observed for others) Briefly, the data should be like {(x_i,h_i,y_i),(x_j,y_j)} where y is the class label.
I think there are a lot of such examples. The first one to spring to mind is alignments for MT, but please don't use the Aachen alignment data; see Alex's ACL06 paper for one way to do semi-supervised learning when the majority of the work is coming from the unsupervised part, not from the supervised part. You could try to do something like the so-called end-to-end system, but with semi-supervised training on the alignments. Of course, building an MT system is a can of worms we might not all want to open.

Depending on exactly what you're looking for, there may be (many) other options. One might be interested in the case where the ys are named entity tags or entities with coref annotated and the hs are parse trees (ala the BBN corpus). By this reasoning, pretty much any pipelined system can be considered of this sort, provided that there is data annotated for both tasks simultaneously (actually, I think a very interesting research question is what to do when this is not the case). Unfortunately, typically if the amount of annotated data differs between the two tasks, it is almost certainly going to be the "easier" task for which more data is available, and typically you would want this easier task to be the hidden task.

Other than that, I can't think of any obvious examples. You could of course fake it (do parsing, with POS tags as the hidden, and pretend that you don't have fully annotated POS data, but I might reject such a paper on the basis that this is a totally unrealistic setting -- then again, I might not if the results were sufficiently convincing). You might also look at multitask learning literature; I know that in such settings, it is often useful to treat the "hidden" variable as just another output variable that's sometimes present and sometimes not. This is easiest to do in the case of neural networks, but can probably be generalized.


Benoit Essiambre said...

"I am looking for a task,(h is a latent variable and is observed for some data points and not observed for others) Briefly, the data should be like {(x_i,h_i,y_i),(x_j,y_j)} where y is the class label."

Something that uses the EM-algorithm maybe?

Anonymous said...

Hiii ,, i found your blog by accident when i was searching on NLP(natural language processing ),, i've this project about WHAT's NEW IN NLP ? ,, and i didn't found what i want in google search engine ,, so i hope u could help me in my project ,, this is my final year and i'll graduate ,, i wish u can help me by giving me some resources on the internet that can help me in my project ,, and i'll be so thankfull ,, thanks alot ,, Regards ,, Asma

Anonymous said...

BTW ,, this is my e-mail to contact me ..

Anonymous said...

酒店經紀PRETTY GIRL 台北酒店經紀人 ,禮服店 酒店兼差PRETTY GIRL酒店公關 酒店小姐 彩色爆米花酒店兼職,酒店工作 彩色爆米花酒店經紀, 酒店上班,酒店工作 PRETTY GIRL酒店喝酒酒店上班 彩色爆米花台北酒店酒店小姐 PRETTY GIRL酒店上班酒店打工PRETTY GIRL酒店打工酒店經紀 彩色爆米花