20 November 2006

Feature engineering (with a coreference focus)

I gave a talk today at our small group meeting about feature engineering for coreference resolution problems. I think people who spend a lot of time working on fancy machine learning algorithms (myself occasionally included) often underappreciate that the difference between success and failure frequently hinges more on clever features than on fancy learning. For those who want to look at the slides, you can grab an OpenOffice or PDF version for your perusal. Fair warning: it may be a bit hard to follow without the soundtrack, but it's sort of an "untold story" version of the coref paper we had at HLT/EMNLP 2005.

So what did we do that's clever (or, at least, as clever as I knew at the time and know now)? Well, a few things. First, I tried extending the notion of Hobbs' distance to discourse trees. This actually works surprisingly well. The true referent is the first element according to syntactic Hobbs' distance in about 70-80% of cases, and we see (almost) the same thing at the discourse level. The only oddity is that you have to flip the nuclearity of the attribution relation; then you get about 80% holding, assuming perfect discourse trees. Of course, we never have perfect discourse trees, and the noise introduced by poor parsing is enough to render this feature more or less useless.
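
As a rough illustration (my own sketch, not the paper's implementation), the feature boils down to ranking candidate NPs by their position in a tree walk; for the discourse variant, imagine the same walk over discourse trees with the attribution nuclearity flipped. The Node class and function names here are inventions of this sketch:

```python
from collections import deque

class Node:
    """Bare-bones parse-tree node; this representation is an
    assumption of the sketch, not the paper's data structure."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def hobbs_order(trees):
    """Yield NP nodes in approximate Hobbs order: nearest tree first,
    breadth-first and left-to-right within each tree. (Hobbs' real
    algorithm also walks up from the pronoun and blocks certain
    paths; this keeps only the ranking idea.)"""
    for tree in trees:  # trees listed from most recent sentence backwards
        queue = deque([tree])
        while queue:
            node = queue.popleft()
            if node.label == "NP":
                yield node
            queue.extend(node.children)

def hobbs_rank(trees):
    """The coref feature for a candidate antecedent is its rank here;
    rank 0 is the 'first element' that wins 70-80% of the time."""
    return {id(np): rank for rank, np in enumerate(hobbs_order(trees))}
```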

The second thing we tried was to use the name/instance data that Mike Fleischman gathered at ISI before going off to MIT. This data was gathered in the context of Q/A, to answer questions like "Who is the president of the U.S.?" using lookup tables. But it's great data for coref! It includes about two million examples of names paired with their occupations, and these pairs make a huge difference in coref performance. Mike describes the data more in his ACL 2003 paper; you may find it interesting too.
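
For illustration, here's a minimal sketch of how such a lookup could be turned into a pair feature. The tab-separated file layout and the "the X" stripping are guesses of mine, not the data's actual format:

```python
def load_name_occupations(path):
    """Load name<TAB>occupation pairs into a lookup table; the file
    layout is a guess, not the actual distribution format."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            name, occupation = line.rstrip("\n").split("\t")
            table.setdefault(name.lower(), set()).add(occupation.lower())
    return table

def name_instance_feature(anaphor, antecedent, table):
    """Fires when the anaphor is a recorded occupation of the
    antecedent, e.g. antecedent "George Bush", anaphor "the president"."""
    head = anaphor.lower()
    if head.startswith("the "):
        head = head[len("the "):]
    return head in table.get(antecedent.lower(), set())
```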

The next thing that helped a lot was a clever, non-standard use of gazetteers. What we do is take the ID of the gazetteer to which the anaphor belongs and pair it with the lexical item of the antecedent. Why? Well, suppose we have a gazetteer that contains words like "California," "Utah," and "New York." By pairing this list ID (e.g., "list-5") with antecedents, we get features like "list-5 *AND* state" or "list-5 *AND* country." We'll probably learn that the former is a good feature and the latter is not. We also have a feature that checks whether the two referents are on the same gazetteer but are not the same word. Why? This is a negative feature: "California" and "Utah" appear on the same list (states) but are not identical (string-wise), so they're probably not coreferent.
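
In code, the two features look roughly like this; the gazetteer contents and the feature-string format are made up for illustration:

```python
gazetteers = {
    # contents are illustrative; real lists came from actual gazetteers
    "list-5": {"california", "utah", "new york"},   # US states
    "list-9": {"france", "japan", "russia"},        # countries
}

def gazetteer_features(anaphor, antecedent):
    feats = []
    a, b = anaphor.lower(), antecedent.lower()
    for list_id, words in gazetteers.items():
        if a in words:
            # pair the list id with the antecedent's lexical item,
            # giving features like "list-5 AND state" (good) vs.
            # "list-5 AND country" (bad)
            feats.append(f"{list_id}_AND_{b}")
            if b in words and a != b:
                # the negative feature: same list, different strings
                # ("California"/"Utah" are both states, so probably
                # not coreferent)
                feats.append(f"{list_id}_same-list-diff-word")
    return feats
```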

We tried pretty hard to get WordNet to work, but it just kept not helping. WordNet distance is a terrible measure (e.g., "Japan" and "Russia" have distance 2 -- they are both under "country"). On the other hand, as in the gazetteer case, "nearby in WordNet but not the same word" is a reasonably good feature, though it turned out not to help much. Hypernym and hyponym features were the most promising, but we never actually got any benefit from them.
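
For the curious, here's roughly what these features look like, sketched with NLTK's WordNet interface (my choice for the sketch, not necessarily what we used):

```python
# Requires NLTK with the WordNet corpus: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def min_wn_distance(w1, w2):
    """Shortest hypernym-path distance between any noun senses of the
    two words; tiny values for pairs like japan/russia are exactly why
    raw distance is a terrible coref signal."""
    dists = [s1.shortest_path_distance(s2)
             for s1 in wn.synsets(w1, pos=wn.NOUN)
             for s2 in wn.synsets(w2, pos=wn.NOUN)]
    dists = [d for d in dists if d is not None]
    return min(dists) if dists else None

def nearby_but_not_same(w1, w2, k=3):
    """The weak-but-sane negative feature: close in WordNet,
    different word; k is an arbitrary threshold."""
    d = min_wn_distance(w1, w2)
    return w1.lower() != w2.lower() and d is not None and d <= k
```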

The rest of the stuff in the talk made it fairly close to the front of the paper. The most surprising result was that count-based features help a lot! These features look at things like the entity-to-mention ratio, the number of entities detected so far, etc. They can't be included in a simple coref model that makes independent pairwise decisions, but in our setting they're easy to add.
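
The reason pairwise models can't use these is that they depend on the whole partial clustering, not on any single pair. A sketch of what they might look like, assuming a search-state object that exposes that clustering (the attribute names are inventions of mine):

```python
def count_features(state):
    """`state` is whatever a search-based decoder carries along; the
    attributes (entities, mentions_seen) are names I made up."""
    n_entities = len(state.entities)
    n_mentions = max(state.mentions_seen, 1)
    return {
        "num-entities": n_entities,
        "entity-to-mention-ratio": n_entities / n_mentions,
        "avg-entity-size": n_mentions / max(n_entities, 1),
    }
```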

The end result is that using just string-based features (standard "substring" and edit distance stuff), lexical features (word pairs), count-based features and Mike's data, we can get an ACE score of 87.6 using LaSO for doing the learning (this goes up about 0.8 if you use Searn in most cases, but I don't have all the numbers). Adding in some inference to predict things like number and gender gets you up to 88.3. Gazetteers get you up to 88.7. After that, the rest of the features (regular expression patterns, word clusters, WordNet, discourse, and syntax) don't help...all together, they get you up to 89.1, but that's not much bang for the buck.

My conclusion is basically that while learning makes a difference (Searn is a bit better than LaSO), the feature engineering makes a bigger one[*]. Many of the features we found useful were a priori not really that obvious. I spent a month or two of my life trying to come up with ways of being clever to help improve performance (playing on dev data so as to be fair). It's definitely a non-trivial enterprise, but it often pays off.

[*] This is assuming that your learning is robust enough to handle the features you want: for instance, the standard pairwise models could not handle the count-based features that helped us a lot.

5 comments:

Bob Carpenter said...

Coref's a variety of things.

The biggest contrast is whether you're doing resolution against a database. It seems to be a whole different field of research, known as record linkage or database deduplication, depending on whether the merging is inter- or intra-DB.

For instance, WestLaw resolves mentions of expert witnesses and lawyers in case law against their professional registration/licensing.

We do this using context to map mentions of genes in biomedical research articles to their citations in EntrezGene. A typical request from customers is to link product mentions to their respective pages (e.g. ShopWiki or Froogle).

What Hal's talking about is the more knowledge-free clustering problem (I love the plug-and-play architecture in his linked ppt). The issue of feature selection is rather different within documents and across documents. Within documents, syntactic features are key, and the one-sense-per-document approximation is usually a reasonable heuristic.

Across documents, context is hugely helpful. In fact, it's the only way to handle token-by-token identical mentions, as in the umpteen John Smiths on the web.

For name matching, you should be able to get a big improvement using more appropriate string matching metrics. Check out William Cohen et al.'s paper "A Comparison of String Distance Metrics for Name-Matching Tasks."
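
As a baseline for comparison, plain normalized edit distance is easy to write down (the metrics in that paper, like Jaro-Winkler and soft-TFIDF, tend to do better on names):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance, row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def name_similarity(a, b):
    """Normalized to [0, 1] so it can be thresholded or binned."""
    a, b = a.lower(), b.lower()
    return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)
```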

Joseph Turian said...

"My conclusion is basically that while learning makes a difference ... the feature engineering makes a bigger one[*]. Many of the features we found useful were a priori not really that obvious. I spent a month or two of my life trying to come up with ways of being clever to help improve performance (playing on dev data so as to be fair). It's definitely a non-trivial enterprise, but it often pays off."

Agreed.

The interesting thing for me, as an ML researcher, is that we have a pretty good understanding of learning methods, inference algorithms, etc. However, automatic methods for feature engineering are little studied.

That's a shame because the potential payoff of good automatic feature engineering is huge. It means you can tackle new problems with relatively little time investment and without a human expert in the loop. I expect this research direction would have a big impact on the field.

For the past two years, I've been thinking about methods for automatically inducing features for structured prediction problems, and it's become one of my favorite "if I only had the time" questions. I've spent enough idle time wrestling with this problem that I've refined my ideas down to a very simple approach: one that would be easy to implement for any sequence labelling or tree prediction task, but nonetheless powerful enough to capture all the commonly used features I know about. If only I had the time...

hal said...

I think the automatic feature engineering thing is really, really interesting, but I have to admit I'm a bit skeptical. I'm fully on board with Andrew McCallum's CRF-style feature induction, where you start out with a bunch of base features and then start adding conjunctions (or disjunctions, or implications, or ...). This is a hard search problem (but we could combine it with learning to search...), but doable.
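
Roughly, the induction loop looks like this. This is only a sketch: train_and_score is a placeholder, and McCallum's actual method estimates feature gains from gradients rather than retraining per candidate:

```python
from itertools import combinations

def induce_conjunctions(base_feats, train_and_score, rounds=3):
    """Greedy conjunction induction in the spirit of CRF feature
    induction: repeatedly add the conjunction of two existing features
    that most improves the dev score."""
    feats = set(base_feats)
    for _ in range(rounds):
        best, best_score = None, train_and_score(feats)
        for f, g in combinations(list(feats), 2):
            score = train_and_score(feats | {(f, g)})  # (f, g) = f AND g
            if score > best_score:
                best, best_score = (f, g), score
        if best is None:  # no conjunction helped; stop early
            break
        feats.add(best)
    return feats
```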

But I see a big difference between this and a truly automatic feature engineering technique. I feel that the set of "base" features is pretty much the most important thing, and that conjunctions and such on top of that is a bit of icing. But coming up with good features is really hard, and I don't see how it could be done automatically. It seems that, to be fair, one would have to do everything at the character level, but then you just lose so much information...

Example: in the coref stuff, there are a lot of words that are syntactically singular but semantically plural (e.g., "group," "company," etc.). If it weren't for these, a "number" feature would be very useful; with them, the number feature often hurts rather than helps. So I grepped a ton of data for the string "members of the ____", pulled out all the syntactically singular nouns that appeared in the blank, and adjusted the number feature to consider these plural. Voila! Performance went up.
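
Concretely, the mining step is a short regex pass over the corpus (the min_count threshold and the trailing-"s" number heuristic here are crude stand-ins of mine):

```python
import re
from collections import Counter

def mine_collectives(corpus_text, min_count=5):
    """Find syntactically singular nouns used as collectives by
    looking for "members of the ____"; min_count is arbitrary."""
    hits = Counter(m.group(1).lower()
                   for m in re.finditer(r"\bmembers of the (\w+)\b",
                                        corpus_text))
    return {w for w, c in hits.items()
            if c >= min_count and not w.endswith("s")}

def number_agrees(w1, w2, collectives):
    """Number-agreement feature patched with the mined collectives;
    the trailing-'s' test is a deliberately crude stand-in for real
    morphology."""
    def number(w):
        w = w.lower()
        if w in collectives:   # "group", "company", ... count as plural
            return "pl"
        return "pl" if w.endswith("s") else "sg"
    return number(w1) == number(w2)
```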

It's hard to imagine an automatic system coming up with things like this, unless you give it as one of its abilities "come up with regular expressions to grep through a corpus with" -- but then how do you come up with those, and how do you use the resulting information?

Joseph Turian said...

I'm not sure McCallum's feature induction technique is powerful enough. He can find features of the form "there exists an item I1 over which predicate P1 holds and there exists an item I2 over which predicate P2 holds", but not features of the form "there exists an item I over which predicates P1 and P2 hold". If I'm not mistaken, his technique cannot induce complex item predicates, and all item predicates must be specified in advance.

I feel that the set of "base" features is pretty much the most important thing, and that conjunctions and such on top of that is a bit of icing.

I'm not sure I agree. The "base" features (what I call atomic item predicates) are important, but the relations between the items may be important too.

For example, "there is item I1 with label L1 and item I2 with label L2 and I1 is the head child of I2". The head-childness relation may be more important than whether we examine the label or the headtag of the individual items.

Sorry if my terminology is confusing; I haven't really refined it yet. When I say "predicate," I mean a boolean function over items with arity one, i.e., P: I -> {+1, -1}. When I say "relation," I mean a boolean function over items with arity two, i.e., R: I x I -> {+1, -1}. I assume relations are not decomposable into predicates. I also believe that, for problems like sequence labelling and tree prediction, arity-one and arity-two boolean item functions are sufficiently powerful to represent most common features.

I've been thinking primarily about induction of binary item relations. In sequence labelling, an item can be given as a discrete label and a location in the sequence. The label contains all the predicate information (e.g., headtag, headlabel, etc.), i.e., every bit of information about the item that is independent of any other item in the structure. The location is the information about the item that is only useful with respect to other items.

We can then begin with two atomic binary relations: "i1 is left of i2" and "i1 is at the same location as i2." From these atomic relations we can, in principle, induce more complex relations like "i1 is one to the left of i2" ("i1 is left of i2, and there does not exist an item i3 such that i1 is left of i3 and i3 is left of i2").

Similarly, for tree prediction problems, an item can be given as a discrete label plus the left and right locations of its span. Using the atomic relations above (for either the left or the right location), we have sufficient power to induce more complex relations like "i1 is the leftmost child of i2."
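
Here is one way to write this vocabulary down; the Item encoding and names are mine, not a committed design:

```python
from dataclasses import dataclass

@dataclass
class Item:
    """A sequence item: a discrete label plus a location."""
    label: str
    pos: int

# Atomic relations (arity-2 boolean functions R: I x I -> {+1, -1}):
def left_of(i1, i2):
    return i1.pos < i2.pos

def same_loc(i1, i2):
    return i1.pos == i2.pos

def one_left_of(items):
    """Induced composite relation, written exactly as described above:
    left-of with no witness i3 strictly in between."""
    def rel(i1, i2):
        return left_of(i1, i2) and not any(
            left_of(i1, i3) and left_of(i3, i2) for i3 in items)
    return rel
```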

What I've been focusing on is automatic methods for inducing relations between items, because this is a closed domain. I simplify the item predicate problem by reducing all items to a small label set (e.g. the constituent label) and have designed some techniques for inducing useful relations. These relations could then be used with more complex predicates.

You are right that the item predicates may be very important too. I haven't given much thought to this problem, but it seems like we could make some progress just by examining the item predicate problem in isolation (i.e. "what predicate is needed to discriminate successfully on this task?"). Predicate creation may still require significant human feature-engineering effort.

Bill_Lang said...

Maybe the link http://pub.hal3.name/06-11-coref.odp could be updated to http://hal3.name/docs/06-11-coref.odp.
