12 November 2007

Understanding Model Predictions

One significant (IMO) issue that we face when attempting to do some sort of feature engineering is trying to understand not only what sorts of errors our model is making, but why. This is an area in which pattern-based methods seem to have a leg up on classification-based methods. If an error occurs, we know exactly what pattern fired to yield that error, and we can look back at where the pattern came from and (perhaps) see what went wrong.

I've been trying for a while to figure out the best way to do this in a classification-style system, especially when the output is structured. I think I have a way, but in retrospect it seems so obvious that I feel someone must have tried it before. Note that unlike some past blog posts, I haven't actually tried doing this yet, but I think the idea may be promising.

Suppose we learn a system to solve some task, and we have some held out dev data on which we want to do, say, feature engineering. What we'd like to do is (a) find common varieties of errors and (b) figure out why these errors are common. I'm going to assume that (a) is solved... for instance in tagging problems you could look at confusion matrices (though note that in something like MT or summarization, where the structure of the output space changes based on past decisions, this may be nontrivial).
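To make (a) concrete in a tagging setting, the confusion-matrix step might look like the following toy sketch (the tag sequences here are made up; in practice you'd flatten gold and predicted tags over the whole dev set):

from sklearn.metrics import confusion_matrix

# Hypothetical gold and predicted tags, flattened over the dev set.
gold = ["NN", "VB", "NN", "JJ", "VB", "NN"]
pred = ["NN", "NN", "NN", "JJ", "VB", "JJ"]

tags = sorted(set(gold) | set(pred))
cm = confusion_matrix(gold, pred, labels=tags)

# cm[i, j] counts how often gold tag i was predicted as tag j, so large
# off-diagonal cells are the common error classes worth investigating.
for i, g in enumerate(tags):
    for j, p in enumerate(tags):
        if i != j and cm[i, j] > 0:
            print(f"gold {g} -> predicted {p}: {cm[i, j]} times")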

Let's say we're doing POS tagging and we've observed some error class that we'd like to understand better, so we can add new features that will fix it. One way of thinking about this is that our classifier predicted X in some context where it should have produced Y. The context, of course, is described by a feature set. So what we want to do, essentially, is look back over the training data for similar contexts, but where the correct answer was X (what our current model predicted). In the case of linear classifiers, it seems something as simple as cosine similarity over feature vectors may be a sufficiently good metric to use.
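To make the retrieval step concrete, here is a rough sketch of what I have in mind (I haven't run this on a real tagger; the function and variable names are made up). Given a dev example whose context the model tagged X when the gold tag was Y, we pull out the training examples whose gold tag actually is X and rank them by cosine similarity of their feature vectors; the top-ranked ones are the training contexts that plausibly taught the model to predict X here.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_explanations(error_vec, train_feats, train_labels,
                          predicted_label, k=10):
    """Rank training contexts by similarity to a misclassified dev context.

    error_vec       : (1, d) feature vector of the misclassified dev example
    train_feats     : (n, d) feature matrix of the training contexts
    train_labels    : length-n array of gold labels
    predicted_label : the (wrong) label X the model produced on the dev example
    k               : number of similar training contexts to return
    """
    # Restrict to training examples whose gold label is X...
    candidates = np.where(np.asarray(train_labels) == predicted_label)[0]
    # ...and rank them by cosine similarity to the error context.
    sims = cosine_similarity(error_vec, train_feats[candidates]).ravel()
    return candidates[np.argsort(-sims)[:k]]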

18 comments:

Peter Turney said...

If you use a decision tree induction algorithm, you may gain some insight into the errors by looking at the error rate for each branch or leaf of the tree. If one leaf or branch is particularly prone to errors, you can look at the path from the root of the tree to the given branch or leaf, to see what features were used.
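A minimal sketch of what Peter's suggestion might look like with scikit-learn, using synthetic data as a stand-in for a real task (the per-leaf error rate is one loop away once clf.apply tells you which leaf each dev example lands in):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, just to make the sketch runnable.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X_train, y_train)

# Which leaf does each dev example fall into, and is it misclassified?
leaves = clf.apply(X_dev)
errors = clf.predict(X_dev) != y_dev

# Leaves with high error rates point back at the feature tests on the
# path from the root, which is where Peter suggests looking for clues.
for leaf in np.unique(leaves):
    in_leaf = leaves == leaf
    print(f"leaf {leaf}: n={in_leaf.sum()}, error rate={errors[in_leaf].mean():.2f}")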

Fernando Pereira said...

You raise an important and under-studied problem, but we need to distinguish several possible sources of error: 1) overfitting: you didn't see enough appropriate examples in training; 2) bias: you don't have the right features to represent the concept; and 3) drift: the test distribution is not the same as the training distribution. Your proposal seems to focus on 2), but in practice I see 1) and 3) as the bigger problems. In some bio applications, like protein function prediction or disease prediction from microarray data, 1) and 3) are extreme: training data is very scarce, and it is often collected for purposes that make its distribution very different from the test distribution. People like Don Geman have argued that the only solution then is to use very low-capacity hypothesis classes that give simple interpretable results, even much simpler than decision trees. See D. Geman, C. d'Avignon, D. Naiman and R. Winslow, "Classifying gene expression profiles from pairwise mRNA comparisons," Statist. Appl. in Genetics and Molecular Biology, 3, 2004.

Yoav said...

Supporting Fernando's claim, in my experience it seems that in language tasks (NER, chunking, POS tagging) where a heavily lexicalized classification model is used, many of the mistakes are due to overfitting on the specific lexical context. Of course, having a better feature set might overcome this, but I think the first step is to get rid of the offending features.

Perhaps a two-stage process would work: first identify the "offending" features, say by re-classifying the error points with different feature subsets, and then (a) look for commonalities in the pruned features and (b) do your proposed search, but with the reduced feature set.
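A rough sketch of that first stage, assuming a linear model over dense feature vectors where zeroing out a column approximates dropping the feature (the function name and inputs are hypothetical):

import numpy as np

def offending_feature_groups(model, X_err, y_err, groups):
    """For each named feature group, zero out its columns and re-predict on
    the misclassified examples; groups whose removal fixes many errors are
    candidates for the "offending" features.

    model  : fitted classifier with a .predict method (e.g. from sklearn)
    X_err  : dense feature matrix of the misclassified dev examples
    y_err  : their gold labels
    groups : dict mapping a group name to a list of column indices
    """
    fixed = {}
    for name, cols in groups.items():
        X_ablated = X_err.copy()
        X_ablated[:, cols] = 0.0                      # drop this feature group
        fixed[name] = np.mean(model.predict(X_ablated) == y_err)
    # Sort groups by how many of the errors their removal repairs.
    return sorted(fixed.items(), key=lambda kv: -kv[1])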

And, of course, the best solution in my view would be to try to use as little direct lexicalization as possible.

Oren Tsur said...

Obviously, any classifier needs a good feature set, and we spend time and effort to find the features that provide good results. But doesn't putting the major effort into finding the best feature set make the whole thing similar to a rule-based model (a decision tree, as Peter suggested) in which we craft rules and then encode them into vectors?

Now, it's clear that some effort should be dedicated to the refinement of features, but I wonder at what point we should start calling this type of work a rule-based algorithm.
I'll take it to the extreme in a naive way: I would like to use classifiers only in those cases where a good-enough set of features is easy to come up with and the classifier can do the work of filtering out the noisy features.

This is just a random thought, meanwhile I'm still spending my time trying to understand my wrong predictions and change the features...

Bob Carpenter said...

I believe Oren hit the nail on the head. I like this quote from Finkel, Dingare, Nguyen, Nissim, Manning and Sinclair (2004):

"Using the set of features designed for that task in CoNLL 2003 [24], our system achieves an f-score of 0.76 on the BioCreative development data, a dramatic ten points lower than its f-score of 0.86 on the CoNLL newswire data. Despite the massive size of the final feature set (almost twice as many features as used for CoNLL), its final performance of 0.83 is still below its performance on the CoNLL data."

It turns out that, as Fernando pointed out, there's a huge dependency on lexical features here. These wind up getting tuned over many grad-student-weeks into what is effectively a rule-based system. Thus, we still need crews of NLP wonks to achieve state-of-the-art performance.

The real killer, in Fernando's terms, is an overfit model applied to data that's drifted. Language data's simply not stationary in the statistical sense. Cross-validation on homogeneous (especially in time and topic) data is going to grossly overestimate performance on the same source a year later. Even cross-validation within a fairly limited time span of fairly limited article sources (say a month's worth of NY Times articles from their business section) shows a huge degree of variance.

Ryan McDonald said...

I agree with both Oren and Bob that intensive feature construction makes stat-NLP resemble rule-based NLP. In fact, feature-based stat-NLP is really just a generalization of rule-based NLP where rules have weights.** I have often observed people talk about the two as though they were fundamentally different. Binary features of the form:

"Token is Capitalized && label=Person"
"Suffix=ing && label=Verb"

are just rules, since these features always take the form of relating some input property to a prediction, i.e., "If the word ends in ing then label it a verb". The advantage of stat-NLP is that we can define lots of them (even very weak or bad rules) and allow computers to sort out the details optimally (under some definition of optimal).
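To make the "rules with weights" view concrete, here is a tiny illustrative sketch (the weights are invented) of how those two features score a (token, label) pair in a linear model:

def features(token, label):
    # Two indicator "rules" of the kind listed above.
    return {
        ("is_capitalized", label): float(token[:1].isupper()),
        ("suffix=ing", label):     float(token.endswith("ing")),
    }

# Made-up learned weights for the two rules.
weights = {
    ("is_capitalized", "Person"): 1.3,
    ("suffix=ing", "Verb"):       2.1,
}

def score(token, label):
    # A linear model scores a (token, label) pair by summing the weights of
    # the rules that fire -- exactly a rule system with weights attached.
    return sum(weights.get(k, 0.0) * v for k, v in features(token, label).items())

print(score("running", "Verb"))   # 2.1: the suffix rule fires
print(score("Obama", "Person"))   # 1.3: the capitalization rule fires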

I would disagree with Oren's proposition that classifiers are only really needed to filter rules (or features). They improve performance by jointly considering all rules (plus their combinations) to optimize the weights. A human just cannot consider the interactions between all features at any reasonable scale. But I have observed teams of engineers get close by refining the weights of rules over the course of months and years.

** I would say this also holds for kernels, since they can theoretically be re-written in the feature space (though this space may be infinite).

Yoav said...

Bob: I think the big difference Finkel et al. observed between the CoNLL and the Bio data was mostly due to the lack of capitalization information, which is a very good predictor that I guess was missing from the Bio data. [The impact of lexical overfitting plays less of a role, because they trained and tested on a very similar vocabulary and had relatively few bi-lexical features. Of course, I suspect running the same model on a slightly different test set would yield much lower results (which they actually confirmed, in a sense: an HMM-based POS tagger (TnT) had better transfer performance than the heavily lexicalized MaxEnt tagger for biomedical POS tagging).]

Having just read that paper, the quote that struck me was:

"
A number of the features listed in Table 1, as well as the features used to incorporate external resources, are relatively unintuitive conjunctions of other features that were chosen by lengthy trial and error processes.
[...] All told, we spent about 25 person-weeks extending the system of [2] for this evaluation exercise, much of it in designing and testing variant feature sets.
"

In this respect, at least, things start to resemble rule-based systems: an enormous amount of time goes into uninspiring fine-tuning.

Which brings me to Oren:

Every classifier is in essence a rule-based system in disguise with some very unintuitive rules. But I don't think spending a lot of time feature-engineering a classifier amounts to rule-writing in a rule-based system: your final product is a list of "rule templates", which the learning algorithm then "fills in" to produce the actual rules.

The problem I have with the massive feature engineering is that it's NOT CLOSE ENOUGH to rule writing. It is just very unimaginative: you throw in all the words in the context, all the POS tags in the context, all substrings of all words in the context, any additional data you have, and then start to experiment with various conjunctions of these features. Then you come up with templates which are "relatively unintuitive conjunctions of other features". These are probably overfit to the specific dataset and are not expected to generalize well to other datasets.

Spending the same amount of time writing rules (where you actually take the time to consider WHY they should work and how they interact) might turn out to be much more worthwhile.

If, instead, one could come up with features that are actually new and hopefully linguistically motivated (e.g. something of the form "the previous verb expects an animate agent and none has been encountered so far and the current word is X"), then we would be going somewhere.

Yoav said...

Ryan, re kernels -- from my point of view, as long as we keep using binary features**, kernels are just a way of "spacing out" the rules, allowing for better (or worse) interaction between them. Different kernels amount to different weighting schemes.

[**This might also be true for non-binary features; I just don't have any good intuitions about those.]

Oren Tsur said...

Ryan,
"I would disagree with Oren's proposition that classifiers are only really needed to filter rules (or features). They improve performance by jointly considering all rules (plus their combinations) to optimize the weights. A human just cannot consider the interactions between all features at any reasonable scale..."

That's exactly my point. My "proposition" wasn't practical but just a way to sharpen the differences. Practically - you are totally right.

Yoav,
In essence, the difference between rule-based approaches and classifiers is that rule-writing requires domain professionals (a superhuman: a combination of a linguist and a biologist), while classifiers can be built by simple people like us. This is clearly changing, since it's not that we just try all combinations of features till we find the best one; we actually learn the domain deeply in the course of our bad-prediction analysis.

Yoav said...

Oren,
I partly agree.

The rule-writing superhuman you describe should actually be a combination of three: a linguist, a biologist, and a computer scientist/programmer.

The statistical approaches still require some form of linguist+biologist for doing the annotation (arguably this person's job is now much easier, so he need not be much of an expert in either field). [But an expert IS needed to define the annotation guidelines, which is often overlooked...]
The machine-learning person can then just use the data and "do their magic" without being either a domain expert or a linguist. But the magic doesn't work so well -- you need to tweak the ingredients each time.

You argue that this tweaking process forces the machine-learning person to become a domain expert and a linguist. I wish that were indeed the case -- from my observations it doesn't. It just makes them better tinkerers (and, *maybe*, experts in the peculiarities of the specific corpus).

hal said...

bob:

I'm not really convinced by the Finkel et al. quote... That quote presupposes that we should be able to get the same performance on two different tasks (okay, same task, different domain, but from a statistical perspective this is basically the same) using the same system. I really don't see why we would expect that to be the case. It's nice evidence, of course, that work in domain adaptation is useful, and it also supports your other points, with which I agree, but I just don't see why we should have expected the same exact numbers. Performance across different data sets is just not really comparable. It's not even clear that 0.83 on BC is actually worse than 0.86 on CoNLL... (Admittedly, I haven't yet read the paper and maybe I should, but based on that quote alone, it seems odd.)

Oren/Yoav: From having spent time being the "linguist" on a coref system, I will claim that, occasionally, you learn something interesting. For instance, for coref, the feature "anaphor and antecedent both appear on the same gazetteer but one is not a substring of the other" is a very useful negative feature. It's maybe not a priori super obvious, but after hearing it, it makes complete sense.
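Purely to illustrate (this is a made-up rendering of that feature, not the actual system's code), the check is just:

def same_gazetteer_not_substring(anaphor, antecedent, gazetteers):
    # Negative coref evidence: both mentions appear on the same gazetteer
    # (e.g. two different country names), but neither is a substring of the
    # other -- so they are probably two distinct entities of the same type.
    a, b = anaphor.lower(), antecedent.lower()
    if a in b or b in a:
        return False
    return any(a in gaz and b in gaz for gaz in gazetteers)

countries = {"france", "germany", "united states"}
print(same_gazetteer_not_substring("France", "Germany", [countries]))  # True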

I think it's a reasonable goal to try to get rid of the domain experts. (Or, as Yoav very correctly points out, relegate them to just annotating data.) I don't think this should necessarily be everyone's goal, but it is enticing. In a sense, I feel the question is just: where does the domain knowledge come in? It can come in either in the form of an expert crafting features or in the form of an expert annotating data. I think this is an underappreciated trade-off. However, if you're interested in the "expert only annotates data" setting, then you really need to have a system that can learn its own features (or, at the very least, come up with interesting combinations of very very simple features, along the lines of this discussion). On the other hand, if you would like the expert to also help with feature engineering, then we need to support our expert in this task.

Fernando: If there's no domain adaptation problem, then it seems that 1) is at least reasonably solved with tuning hyperparameters on dev data (or cross-validating, modulo Bob's later comment about stationarity, which is definitely true!). For 3), if you're willing to annotate some data in the target domain (it's understandable if you're not!) then I don't see what's stopping you from applying my proposal (or something close) to the labeled target data. If you aren't willing to annotate target data, then maybe what you could do is train two models (or more): one with all your features and one with a super trimmed down feature set... then effectively do query-by-committee and run both on the target data, find where they disagree, and then apply my proposal to those examples.
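A very rough sketch of that two-model idea (hypothetical function; logistic regression is just a stand-in for whatever learner you actually use, and trimmed_cols is whatever reduced feature set you pick):

import numpy as np
from sklearn.linear_model import LogisticRegression

def disagreement_examples(X_train, y_train, X_target, trimmed_cols):
    # Train one model on the full feature set and one on a trimmed-down
    # subset, then return the indices of (unlabeled) target-domain examples
    # where they disagree. Those are the examples to feed into the
    # similarity-based error analysis proposed in the post.
    full = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    slim = LogisticRegression(max_iter=1000).fit(X_train[:, trimmed_cols], y_train)
    disagree = full.predict(X_target) != slim.predict(X_target[:, trimmed_cols])
    return np.where(disagree)[0]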
