19 August 2010

Multi-task learning: should our hypothesis classes be the same?

It is almost an unspoken assumption in multitask learning (and domain adaptation) that you use the same type of classifier (or, more formally, the same hypothesis class) for all tasks. In NLP-land, this usually means that everything is a linear classifier, and the feature sets are the same for all tasks; in ML-land, this usually means that the same kernel is used for every task. In neural-networks land (à la Rich Caruana), this is enforced by the symmetric structure of the networks used.

I probably would have gone on not even considering this unspoken assumption, had I not seen, a few years ago, a couple of papers that challenged it, albeit indirectly. One was Factorizing Complex Models: A Case Study in Mention Detection by Radu (Hans) Florian, Hongyan Jing, Nanda Kambhatla and Imed Zitouni, all from IBM. They're actually considering solving tasks separately rather than jointly, but joint learning and multi-task learning are very closely related. What they see is that different features are useful for spotting entity spans than for labeling entity types.

That year, or the next, I saw another paper (can't remember who or what -- if someone knows what I'm talking about, please comment!) that basically showed a similar thing, where a linear kernel was doing best for spotting entity spans, and a polynomial kernel was doing best for labeling the entity types (with the same feature sets, if I recall correctly).

Now, to some degree this is not surprising. If I put on my feature engineering hat, then I probably would design slightly different features for these two tasks. On the other hand, coming from a multitask learning perspective, this is surprising: if I believe that these tasks are related, shouldn't I also believe that I can do well solving them in the same hypothesis space?

This raises an important (IMO) question: if I want to allow my hypothesis classes to be different, what can I do?

One way is to punt: you can just concatenate your feature vectors and cross your fingers. Or, more nuanced, you can have some set of shared features and some set of features unique to each task. This is similar (the nuanced version, not the punting version) to what Jenny Finkel and Chris Manning did in their ACL paper this year, Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data.
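To make the nuanced version concrete, here is a minimal sketch of the "shared block plus per-task blocks" layout. It is only an illustration of the general idea, not the model from the Finkel and Manning paper; the function name `augment` and the toy feature sizes are assumptions of mine.

```python
def augment(shared, specific, task, specific_dims):
    """Lay out one example as [shared | task-0 block | task-1 block | ...].

    Only the block that belongs to `task` is filled in; every other
    task block stays zero. A single linear model over this vector then
    has one set of shared weights plus a separate set of weights per task.
    """
    if len(specific) != specific_dims[task]:
        raise ValueError("task-specific vector has the wrong length")
    vector = list(shared)
    for t, dim in enumerate(specific_dims):
        if t == task:
            vector.extend(specific)
        else:
            vector.extend([0.0] * dim)
    return vector


# Hypothetical sizes: 2 shared features, 3 features used only for
# span spotting (task 0), 1 feature used only for type labeling (task 1).
dims = [3, 1]
x_span = augment([1.0, 0.5], [1.0, 0.0, 2.0], task=0, specific_dims=dims)
x_type = augment([1.0, 0.5], [4.0], task=1, specific_dims=dims)
```

The task-specific feature sets can have different sizes, which is the point: the tasks share part of the hypothesis space without being forced to share all of it.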

An alternative approach is to let the two classifiers "talk" via unlabeled data. Although motivated differently, this was something of the idea behind my EMNLP 2008 paper on Cross-Task Knowledge-Constrained Self Training, where we run two models on unlabeled data and look for where they "agree."
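Roughly, the "agreement" step could look like the sketch below. This is a toy illustration under my own assumptions, not the algorithm from the paper: the two models are stand-in functions, and the constraint linking the two tasks (`compatible`) is a hypothetical one chosen for the example.

```python
def compatible(span_label, type_label):
    """Hypothetical cross-task constraint: a token is inside a mention
    exactly when the type labeler gives it a non-"O" type."""
    return (span_label == "IN") == (type_label != "O")


def agreed_examples(unlabeled, span_model, type_model):
    """Run both models on unlabeled data and keep only the examples
    whose two predictions satisfy the cross-task constraint."""
    kept = []
    for x in unlabeled:
        span_label = span_model(x)
        type_label = type_model(x)
        if compatible(span_label, type_label):
            kept.append((x, span_label, type_label))
    return kept


# Toy stand-ins for two models that may come from different hypothesis classes.
def span_model(token):
    return "IN" if token[0].isupper() else "OUT"


def type_model(token):
    return "PER" if token in {"Alice", "Bob"} else "O"


tokens = ["Alice", "met", "Bob", "in", "Paris"]
kept = agreed_examples(tokens, span_model, type_model)
```

Here "Paris" is dropped because the span model says it is inside a mention while the type model says it has no type. The kept examples could then be added to each model's training data. Nothing in this loop requires the two models to use the same features or the same kernel.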

A final idea that comes to mind, though I don't know if anyone has tried anything like this, would be to try to do some feature extraction over the two data sets. That is, basically think of it as a combination of multi-view learning (since we have two different hypothesis classes) and multi-task learning. Assuming we have access to examples labeled for both tasks simultaneously (i.e., not the setting of either Jenny's paper or my paper), one could do a 4-way kernel CCA, where data points are represented in terms of their task-1 kernel, task-2 kernel, task-1 label and task-2 label. This would be sort of a blending of CCA-for-multiview-learning and CCA-for-multi-task learning.
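One way to write the 4-way version down, as a sketch rather than a worked-out method: take n examples labeled for both tasks, and let K_1 and K_2 be the (centered) Gram matrices of the task-1 and task-2 kernels, and K_3 and K_4 be (centered) Gram matrices computed on the task-1 and task-2 labels, say a linear kernel on one-hot label vectors. The regularizer kappa and this particular sum-of-pairwise-correlations objective are my assumptions; other multi-set CCA variants would do as well.

```latex
% Regularized multi-set kernel CCA over four views (a sketch).
% \alpha_i \in \mathbb{R}^n are the dual coefficients for view i.
\max_{\alpha_1,\dots,\alpha_4}\;
  \sum_{i \neq j} \alpha_i^\top K_i K_j \,\alpha_j
\quad \text{subject to} \quad
  \sum_{i=1}^{4} \alpha_i^\top (K_i + \kappa I)^2 \alpha_i = 1 .

% Setting the gradient of the Lagrangian to zero gives, for each view i,
\sum_{j \neq i} K_i K_j \,\alpha_j
  \;=\; \lambda \,(K_i + \kappa I)^2 \alpha_i ,
\qquad i = 1,\dots,4 ,
% which is a generalized eigenvalue problem in the stacked vector
% (\alpha_1, \alpha_2, \alpha_3, \alpha_4).
```

The top eigenvectors would give projections of the two input views that are correlated both with each other and with the two sets of labels.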

I'm not sure what the right way to go about this is, but I think it's something important to consider, especially since it's an assumption that usually goes unstated, even though empirical evidence seems to suggest it's not (always) the right assumption.

5 comments:

cissa said...

http://portal.acm.org/citation.cfm?id=1657504.1657510
Can you read this paper? Do you have any ideas about incremental multitask learning?

hal said...

@cissa: i hadn't seen that, but it's very related to a previous post: http://nlpers.blogspot.com/2008/05/adaptation-versus-adaptability.html

btw, you can read the paper here: http://www.lib.kobe-u.ac.jp/repository/90001004.pdf

cissa said...

thanks
My work applies domain adaptation to multitask learning, and it needs to be incremental, so I would really appreciate any ideas you have about that.
thank you again

Kevin Duh said...

Yeah, it sounds very reasonable to want to have different feature spaces for different tasks. Is there any theory in ML that depends on them being in the same hypothesis class?

I think there is some work on adaptation with different feature spaces (via feature extraction). For example, Heterogeneous cross-domain ranking [Wang09] frames the learning problem as (simplified here):

min_{Ws,Wt,U} L(Ws,U*Xs) + L(Wt,U*Xt), where Xs and Xt are source and target samples, Ws and Wt are the source and target weights to be learned, L is the loss function, and U is a transformation mapping Xs and Xt (which could be in different spaces) into the same latent subspace.

For practical reasons, they still need to concatenate the source and target feature vectors, or else the matrix dimensions of U won't match. The optimization is done using a convex trick similar to [Argyriou06].

I think your CCA idea is very cool and possibly more effective, though!

Alexandre Passos said...

Maybe something like an HDP is right here? Not in the usual setting, but something that assumes the (feature, weight) pairs are drawn from a DP for each domain with an HDP prior, and then you can sample/optimize which features are specific to each domain and which share weights.

Inference would be tricky, but I think the idea may be sound.