It is almost an unspoken assumption in multitask learning (and domain adaptation) that you use the same type of classifier (or, more formally, the same hypothesis class) for all tasks. In NLP-land, this usually means that everything is a linear classifier, and the feature sets are the same for all tasks; in ML-land, this usually means that the same kernel is used for every task. In neural-networks land (ala Rich Caruana), this is enforced by the symmetric structure of the networks used.
I probably would have gone on not even considering this unspoken assumption, until a few years ago I saw a couple papers that challenged it, albeit indirectly. One was Factorizing Complex Models: A Case Study in Mention Detection by Radu (Hans) Florian, Hongyan Jing, Nanda Kambhatla and Imed Zitouni, all from IBM. They're actually considering solving tasks separately rather than jointly, but joint learning and multi-task learning are very closely related. What they see is that different features are useful for spotting entity spans, and for labeling entity types.
That year, or the next, I saw another paper (can't remember who or what -- if someone knows what I'm talking about, please comment!) that basically showed a similar thing, where a linear kernel was doing best for spotting entity spans, and a polynomial kernel was doing best for labeling the entity types (with the same feature sets, if I recall correctly).
Now, to some degree this is not surprising. If I put on my feature engineering hat, then I probably would design slightly different features for these two tasks. On the other hand, coming from a multitask learning perspective, this is surprising: if I believe that these tasks are related, shouldn't I also believe that I can do well solving them in the same hypothesis space?
This raises an important (IMO) question: if I want to allow my hypothesis classes to be different, what can I do?
One way is to punt: you can just concatenate your feature vectors and cross your fingers. Or, more nuanced, you can have some set of shared features and some set of features unique to each task. This is similar (the nuanced version, not the punting version) to what Jenny Finkel and Chris Manning did in their ACL paper this year, Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data.
An alternative approach is to let the two classifiers "talk" via unlabeled data. Although motivated differently, this was something of the idea behind my EMNLP 2008 paper on Cross-Task Knowledge-Constrained Self Training, where we run two models on unlabeled data and look for where they "agree."
A final idea that comes to mind, though I don't know if anyone has tried anything like this, would be to try to do some feature extraction over the two data sets. That is, basically think of it as a combination of multi-view learning (since we have two different hypothesis classes) and multi-task learning. Under the assumption that we have access to examples labeled for both tasks simultaneously (i.e., not the settings for either Jenny's paper or my paper), then one could do a 4-way kernel CCA, where data points are represented in terms of their task-1 kernel, task-2 kernel, task-1 label and task-2 label. This would be sort of a blending of CCA-for-multiview-learning and CCA-for-multi-task learning.
I'm not sure what the right way to go about this is, but I think it's something important to consider, especially since it's an assumption that usually goes unstated, even though empirical evidence seems to suggest it's not (always) the right assumption.