01 August 2007

Explanatory Models

I am frequently asked the question: why does your application for solving XXX make such-and-such an error? (You can easily replace "your" with any possessive pronoun and the question remains valid.)

My standard answer is to shrug and say "who knows."

This is quite different from, for instance, work in pattern matching for information extraction (many other citations are possible). In that setting, when the system makes an error, one can ask the system "what pattern caused this error?" You can then trace the pattern back to the source documents from which it came and obtain some understanding of what is going on.

This is frequently not the case for your generic sequence labeling algorithm. If, say, a CRF misses a person name, what can you do about it? Can you understand why it made the error? More generally, if a model of any variety errs, can it say anything about why this error came to be?

One way to approach this problem would be to try to inspect the weights of the learned model, but there are so many trade-offs going on internally that I've never been able to do this successfully (by hand, at least --- perhaps a clever tool could help, but I'm not sure). An alternative that I've been thinking about recently (but probably won't work on because I haven't enough time right now) is instead to pose the question as: what is the minimal change to the input required so that the system would have made the decision correctly?

I guess one way to think about this is to consider the case where a POS system misses tagging "Fred" in the sentence "Fred is not happy" as an NNP and instead calls it a VBD. Presumably we have a bunch of window features about Fred that give us its identity, prefixes and suffixes, etc. Perhaps if "Fred" had been "Harry" this wouldn't have happened, because "Fred" has the spelling feature "-ed." (Okay, clearly this is a very crafted example, but you get the idea.)

The question is: how do you define a minimal change in features? If we're in an HMM (where we're willing to assume feature independence), then I don't think this is a big problem. But in the case where "Fred" ends in "-ed," it also ends in "-d," and both of these make it look more like a VBD. Such an explanatory system would ideally like to know that if "-ed" weren't there, then neither would "-d" be, and use this for computing the minimal change. It would also have to know that certain features are easier to change than others. For instance, if it has only ever seen the word "Xavier" in the training data as an NNP, then it could also suggest that if the word were "Xavier" instead of "Fred" then this error would not have happened. But this is sort of silly, because it gives us no information. (I'm working under the assumption that we want to do this so that we can add/remove features in our model to help correct for errors [on development data :P].)

It seems like neither of these problems is insurmountable. Indeed, just looking at something like feature frequency across the entire training set would give you some sense of which features are easy to change, as well as which ones are highly correlated. (I guess you could even do this on unlabeled data.)
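A rough sketch of that frequency/correlation estimate might look like the following; the feature templates and the 0.9 threshold are made up for illustration, and a real system would use its actual feature function:

```python
# Sketch: from (possibly unlabeled) text, count how often each feature
# fires and how often pairs of features fire together. Frequent features
# have easy substitutes; near-perfectly correlated pairs (like "-ed" and
# "-d") shouldn't be toggled independently.
from collections import Counter
from itertools import combinations

def word_features(word):
    # Toy stand-in for a real feature extractor.
    return {"word=" + word.lower(),
            "suffix2=" + word[-2:],
            "suffix1=" + word[-1:],
            "capitalized" if word[0].isupper() else "lowercase"}

def feature_stats(words):
    freq, pair_freq = Counter(), Counter()
    for w in words:
        feats = sorted(word_features(w))
        freq.update(feats)
        pair_freq.update(combinations(feats, 2))
    # Call a pair "tied" if, whenever the rarer feature fires, the other
    # almost always fires too (with a floor to ignore one-off features).
    tied = {pair for pair, c in pair_freq.items()
            if c >= 2 and c / min(freq[pair[0]], freq[pair[1]]) > 0.9}
    return freq, tied
```

Run over words like "Fred," "tagged," and "walked," this would flag "suffix1=d" and "suffix2=ed" as tied, which is exactly the dependency the explanatory system would need to respect.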

I feel like it's possible to create a methodology for doing this for a specific problem (e.g., NE identification), but I'd really like to see some work on a more generic framework that can be applied to a whole host of problems (why did my MT system make that translation?). Perhaps something already exists and I just haven't seen it.

6 comments:

Mark Dredze said...

I've heard a few conversations on this topic, especially when learning is discussed in the context of UIs. One of the reasons that UI people sometimes prefer pattern-based or rule-based methods is that it is much more intuitive to the user when the system makes a mistake (and how it can be fixed). Users don't think in terms of probabilities and features, so it's hard to imagine a way of exposing that information to a user in a useful way. Even the example you gave, the "-ed" suffix, is not necessarily intuitive to the average user. They'd have to realize that this is a common verb suffix, but I think that realization would only come from working directly with these problems and thinking about features the way we do.

Another possible solution is to use a nearest-neighbor paradigm to convey this information to the user. Instead of showing a feature, the system could show the training example that most influenced this mistaken decision (the closest instance to this instance with the same guessed label). That would be equivalent to saying, "I thought this was true because I had seen this example." Statistical learning is based on the principle of learning from lots of examples, so it's only natural to show the user which one of those examples caused the problem. However, either approach seems quite tricky, and it's likely that sometimes it would still be uninformative to the user.
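That nearest-neighbor idea is simple enough to sketch; the feature sets and Jaccard distance below are my own stand-ins for whatever representation and metric a real pipeline would use:

```python
# Illustrative sketch: among the training examples that carry the
# (mistakenly) guessed label, find the one closest to the misclassified
# instance -- the example most plausibly "to blame" for the decision.

def jaccard_distance(a, b):
    # Distance between two feature sets: 0 if identical, 1 if disjoint.
    return 1.0 - len(a & b) / len(a | b)

def blame_example(train, instance_feats, guessed_label):
    """train: list of (feature_set, label) pairs.
    Returns the same-labeled training example nearest the instance."""
    same_label = [(f, y) for f, y in train if y == guessed_label]
    return min(same_label,
               key=lambda ex: jaccard_distance(ex[0], instance_feats))
```

Showing the returned example's original sentence, rather than its feature vector, would keep the explanation in terms the user actually understands.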

While this isn't something we often consider in the learning community, I have heard it a lot from the HCI crowd. I think it is a major reason why statistical learning methods may appear in backend systems but often don't show up in user applications.

Unknown said...

SVM Model Tampering and Anchored Learning: A Case Study in Hebrew NP Chunking (Yoav Goldberg; Michael Elhadad) covers some of this from a backend point of view.

Yoav said...

Thanks for the mention, Peter.

Personally, I feel that (from the model designer's, not the end user's, perspective) the question of interest is not why the model made a specific mistake, but why it got the (difficult) right things right.

For example, in the "Fred as a verb" case, knowing that the model erred because of the -ed feature is a nice anecdote, but there is not much you can do about it -- you probably wouldn't go and remove the suffix feature from your model (because you believe it is a good one), but rather go and look for other features which might help solve that specific case. But you could mostly do that just as well without understanding that the error was caused by the -ed feature.

On the other hand, knowing why your model got several hard-to-get cases right will give you insight into (a) which of your features are really useful -- it might be that the -ed feature never actually helps anything, in which case you can remove it from your models -- and (b) how well the model can be expected to behave on slightly different data. I am not talking about the majority of correct decisions, which are easy to make and to explain, but about those "top 1%" or so of correct decisions that the model got right only after some serious tweaking and feature engineering -- knowing whether these result from overfitting the data or are a real improvement is very interesting.

As an example, the Charniak and Johnson reranking parser does surprisingly well on coordination -- does it really learn coordination? Or is it overfitting some economic jargon or topical conventions? Knowing which reranker features really take part in resolving the coordination cases is (in my opinion) an extremely interesting question.

Anonymous said...

Yoav,

In designing the Charniak-Johnson parser, I did several ablation studies, where I removed one feature class (a group of related features) at a time to measure its effect on overall performance (summaries of these are presented here).

But I agree with the general comments; these ablation studies don't really let us understand why the model works (or doesn't).
