24 April 2006

Unsupervised Learning: Why?

Unsupervised learning is very popular for NLP problems, including MT, summarization, IE, etc. I've done work in this area, as have many other people. Beginning about six months ago, I started asking myself: should we ever really do unsupervised learning?

This question may seem silly, but there is a fairly compelling argument. The argument is that, regardless of whether we are in academia or industry, we will have to convince someone else that our system is doing a good job. In order to do this, we need evaluation data. Which means we need to annotate. But once we've annotated, we should at the very least do semi-supervised learning, not unsupervised learning.
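As a concrete (and entirely illustrative) sketch of the semi-supervised option, here is a toy self-training loop: train on the small annotated set, label the unannotated pool wherever the model is confident, and retrain. The nearest-centroid classifier, the confidence threshold, and the data are all assumptions for illustration, not anything from the post.

```python
# Toy self-training sketch (semi-supervised learning on a 1-D feature).
# Classifier, threshold, and data are illustrative assumptions.

def centroids(labeled):
    """Mean feature value per class, from (x, label) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(cents, x):
    """Return (label, confidence); confidence is the margin between
    the nearest and second-nearest centroid (larger = more confident)."""
    dists = sorted((abs(x - c), y) for y, c in cents.items())
    label = dists[0][1]
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return label, margin

def self_train(labeled, unlabeled, threshold=1.0, rounds=5):
    """Repeatedly fold confidently-labeled points into the training set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        cents = centroids(labeled)
        keep = []
        for x in pool:
            y, conf = predict(cents, x)
            if conf >= threshold:
                labeled.append((x, y))   # trust the confident prediction
            else:
                keep.append(x)           # leave it unlabeled for now
        if len(keep) == len(pool):       # nothing new was labeled: stop
            break
        pool = keep
    return centroids(labeled)
```

For example, starting from two labeled points `(0.0, "a")` and `(10.0, "b")` and an unlabeled pool `[1.0, 2.0, 8.5, 9.0]`, the loop absorbs the whole pool and shifts both centroids toward the unlabeled data.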

The problem with this argument is that it presupposes the existence of an automatic evaluation metric.

My newly formed perspective is the following. We should only do unsupervised learning if we do not have a trustworthy automatic evaluation metric (i.e., a Type III metric). I cannot currently think of a compelling argument against this.


Kevin Duh said...

Suppose you run out of money after annotating the test data. Then you can only do unsupervised learning on the training data. :)

More seriously, I think if unsupervised learning is to be used, it should demonstrate that it outperforms supervised learning with a practical amount of labeled data. "Practical amount" is a fuzzy term and depends on your resources. If you're a grad student, annotating 50 sentences with POS tags is practical; annotating the entire Wall Street Journal with syntax trees is not practical.

hal said...


I agree about "practical amount" though.

Liang_Zhou said...

I don't think unsupervised learning should automatically be paired with an automatic evaluation procedure. The evaluation part could still be done with supervision, i.e., by humans.

The problem with the lack of automatic evaluations is that we can't prove that various unsupervised methods work, or that some of them work better than others. This shouldn't prevent us from trying out new ideas, but ... :-)

hal said...

I guess this runs into the difficulty that you need data to develop features on. Doing this requires held-out data, which means cross validation really wouldn't work (at least if you don't want to cheat).

So I guess it also depends if you're developing a new system or just trying to port an existing one.

Anonymous said...

For NLP, at least we can test the performance of an unsupervised method in one language. If it works well, we can apply it to other languages to reduce the annotation labor.

Anonymous said...

This is based on the assumption that supervised learning is "better" anyway, with unsupervised learning being an alternative for those who cannot afford an annotated corpus.
Well, I'm not sure about that. Supervised and unsupervised learning are very different. IMO, unsupervised learning really IS learning, while supervised learning is more like a mapping process between two structures that must be quite similar.
To me, supervised and unsupervised learning are not opposed; they complement each other. Unsupervised learning builds concepts from raw data (this step is often replaced by the use of 'features'); THEN supervised learning builds an 'interface' between this set of concepts and another one (human, machine...).

hal said...

I think I can agree with anonymous: I was focusing on cases where the problem really looks supervised, but we just don't have data so we build an unsupervised model for it.

However, playing devil's advocate: can someone give me an NLP problem that really is unsupervised by nature?

Liang_Zhou said...

all nlp problems are unsupervised, in my opinion. we are just trying to come up with supervised ways of modeling the evidence shown from each task.

Deepak said...

Hal's statement:

We should only do unsupervised learning if we do not have a trustworthy automatic evaluation metric (i.e., a Type III metric). I cannot currently think of a compelling argument against this.

I respectfully disagree with this statement. This is only true for writing papers.

A problem in general (supervised or unsupervised) is a "good problem" to work on if and only if two (or n) humans can tell two systems apart in terms of quality with a very high kappa.

Let us see what the above statement really means: suppose you have a summarization system A and a summarization system B. If judges find that A > B in terms of performance, with high inter-human agreement, then it is a good problem to work on.
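For concreteness, the "very high kappa" criterion can be checked with Cohen's kappa over two judges' system preferences. A minimal sketch; the per-item judgments in the usage example are made up for illustration.

```python
# Cohen's kappa: inter-judge agreement corrected for chance.
# The judgments in the test/usage example are invented for illustration.

def cohens_kappa(j1, j2):
    """j1, j2: equal-length lists of labels (e.g., "A" or "B"),
    one preference per evaluated item."""
    n = len(j1)
    observed = sum(a == b for a, b in zip(j1, j2)) / n
    labels = set(j1) | set(j2)
    # Chance agreement: product of each judge's marginal label rates.
    expected = sum((j1.count(l) / n) * (j2.count(l) / n) for l in labels)
    if expected == 1.0:          # both judges use a single label
        return 1.0 if observed == 1.0 else 0.0
    return (observed - expected) / (1 - expected)
```

If the two judges prefer the same system on five of six items but their marginals lean heavily toward "A", the chance-corrected kappa comes out well below the raw 5/6 agreement, which is exactly the correction the criterion relies on.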

I am going to go a step further here. I believe that unsupervised techniques are very hard to evaluate with a white-box technique. However, if they are part of a larger system, then it is easy to do a black-box evaluation.

Now, since Hal made the statement, this is my counter-statement:
"Why would anybody evaluate their co-reference system, named-entity recognition system, parsers or any other such thing in a white-box evaluation when we are ultimately trying to help some end application?"

(Please, don't give me the reason => to get papers accepted)

That is why I liked Alexander Fraser's thesis work on word alignment, where he argues that without BLEU improvements, all the (alignment) techniques are useless for a statistical machine translation system.

hal said...

Deepak --

I don't think you and I disagree. (There's a small caveat that I don't really know how you distinguish a system from a problem. I can imagine having a system B and then I want to decide if I should try to improve on it? I don't know how I can decide that this is an interesting problem to work on without actually trying it.)

I'm actually not sure what you're disagreeing with. I'm essentially saying that so long as we can evaluate automatically, we should do a supervised approach. Only if we cannot evaluate automatically (e.g., for word alignment) should we do something unsupervised. Why? For exactly Alex's reason: if we don't know that the metric we're optimizing is good, then we might as well optimize anything and then show via human evaluations (or in Alex's case, Bleu evaluations) that we've done something useful.

I think the answer to your counter statement is just that someone has gone through the effort of showing that improvements in the white-box often (or are expected to) lead to improvements in the black-box. I completely sympathize with this. Unfortunately, this doesn't happen all that much :).

Deepak said...

Hal --

There's a small caveat that I don't really know how you distinguish a system from a problem.

My definition of a system is the one that we build to solve a problem. Systems A and B could be simple baselines too (but you still have to try them). Or better, they could be two humans. E.g., you can ask two humans to write summaries and ask some human judges to evaluate them.

hal said...

Isn't this too permissive? I can almost always build two baselines with noticeably different performance, or a baseline that's noticeably worse than a human. And it seems odd to want a noticeable difference between two humans: usually we're fighting to get our humans to agree!

Deepak said...

Hal --

Not Really!

Actually, this experiment has to be carried out over a large number of sample systems and a large number of humans.

I believe that if you are fighting to get many expert humans to agree, then it is not a good problem to work on.

hal said...

This seems to put the cart before the horse: you're saying that a problem is worth working on if, after I develop lots of solutions, I can see a difference. Isn't the point of deciding whether a problem is worth working on so that you needn't waste time building all these systems? I also agree (to a certain extent -- this would make a good separate post) that introducing artificial constraints to make humans agree is bad.

I'm still not sure what in my original statement you disagree with, though. As far as I can tell, we're in agreement about when one should use unsupervised learning: essentially, whenever we are forced to do human evaluations.

Deepak said...

Hal --
Yes, it is an iterative process. There is no way around it, unfortunately! :)
I think there is nothing wrong with such an iterative definition.
