05 April 2007

Discourse is darn hard

Discourse is an area I've been peripherally interested in for quite a while; probably unduly influence by my advisor's (ex-?)interest in the problem. The general sense I get is that discourse relations are relatively easy to identify when cue-phrases exist. This is essentially the approach taken by the Penn Discourse Treebank and Daniel's original parser. Unfortunately, cue phrases exist only for a small subset of sentences, so it's tempting to look at lexical techniques, since they work so well in other problems. The intuition is that if one sentence says "love" and the next says "hate," then maybe they are in some sort of contrastive relation. Unfortunately, the world is rarely this nice. The amount of inference necessary to reliably identify seemingly simple, coarse-grained relationships, seems astonishing.

Here are some examples of contrast relations:

  1. The interest-only securities will be sold separately by BT Securities.
    The principal-only securities will be repackaged by BT Securities into a Freddie Mac Remic, Series 103, that will have six classes.
  2. James Earl Jones is cast to play the Rev. Mr. Johns.
    The network deals a lot with unknowns, including Scott Wentworth, who portrayed Mr. Anderson, and Bill Alton as Father Jenco.
  3. At the same time, second-tier firms will continue to lose ground.
    In personal computers, Apple, Compaq and IBM are expected to tighten their hold on their business.
In (1), we need to know that interest-only and principal-only securities are essentially themselves contrastive. We then need to combine this information with the coreference of BT Securities in both cases (should be easy). This seems like maybe we could possibly get it with enough data. For (2), we essentially need to know that James Earl Jones is famous, and that "deals with unknowns" is the same as "casts." For (3), we need to know that Apple, Compaq and IBM are not second-tier firms. These latter two seem quite difficult.

Next, let's consider background. Here are some short examples of background relations:
  1. The March 24 oil spill soiled hundreds of miles of shoreline along Alaska's southern coast and wreaked havoc with wildlife and the fishing industry.
    Exxon Corp. is resigning from the National Wildlife Federation 's corporate advisory panel, saying the conservation group has been unfairly critical of the Exxon Valdez oil spill along the Alaskan coast.
  2. Insurers typically retain a small percentage of the risks they underwrite and pass on the rest of the losses.
    After Hugo hit, many insurers exhausted their reinsurance coverage and had to tap reinsurers to replace that coverage in case there were any other major disasters before the end of the year.
The first example here might be identifiable by noting that the tense in the two sentences is different: one is present, one is past. Beyond this, however, we'd essentially have to just recognize that the oil spill referred to in both sentences is, in fact, the same spill. We'd have to infer this from the fact that otherwise the first sentence is irrelevant. In (2), we'd have to recognize that the "passed on" losses go to reinsurers (I didn't even know such things existed).

As a final example, let's look at some evidence relations:
  1. A month ago, the firm started serving dinner at about 7:30 each night; about 50 to 60 of the 350 people in the investment banking operation have consistently been around that late.
    Still, without many actual deals to show off, Kidder is left to stress that it finally has "a team" in place, and that everyone works harder.
  2. Yesterday, Mobil said domestic exploration and production operations had a $16 million loss in the third quarter, while comparable foreign operations earned $234 million.
    Individuals familiar with Mobil's strategy say that Mobil is reducing its U.S. work force because of declining U.S. output.
These examples also seem quite difficult. In (1), the fact that the "team" is supposed to refer to the "50 to 60 ... people" and that having dinner, within the context of a company, is a "team-like" activity. (2) is somewhat easier -- knowing that a "loss" is related to a "decline" helps a lot, though it is entirely possible that the type of relation would differ. You also pretty much have to know that "domestic" means "U.S."

I feel like, in many ways, this discourse relation problem subsumes the now-popular "textual entailment" problem. In fact, some of the RST relations look a lot like textual entailment: elaboration-general-specific, eleboration-set-member, Contrast (as a negative entailment), consequence, etc.

I don't have a strong conclusion here other than: this is a hard problem. There've been some attempts at learning lexical things for this task, but I feel like it needs more than that.


Anonymous said...

... we'd have to recognize that the "passed on" losses go to reinsurers (I didn't even know such things existed).

My father works in insurance. He has told me that this can even loop around, i.e., the insurers get reinsured, the reinsurers get reinsured, etc., until the original insurer becomes a reinsurer of a small part of whatever it was that they were insuring in the first place!! This may sound ridiculous, but it happens because policies are often reinsured in packages in which parts may link back to the companies original policy.

Not really relevant for NLPERs, but I found this amusing when I first heard it.

Kevin Duh said...

Hal--do you have some good references for learning about discourse for someone who knows nothing about it? (Thanks in advance!)

Ryan--Is this loopy reinsurance business pretty common? It sounds like it won't converge... :)

hal said...

kevin -- daniel ran a discourse structure reading group my first semester at ISI; it has a good list of papers. i'd say mann+thompson, grosz+sidner and hobbs90 are the three "must reads." for more "practical" matters, i'd suggest strube+hahn and hearst97 (all refs are on the link). for discourse parsing stuff, one of daniel's early papers (96-97ish) would probably be a good place to start.

Anonymous said...

酒店經紀PRETTY GIRL 台北酒店經紀人 ,禮服店 酒店兼差PRETTY GIRL酒店公關 酒店小姐 彩色爆米花酒店兼職,酒店工作 彩色爆米花酒店經紀, 酒店上班,酒店工作 PRETTY GIRL酒店喝酒酒店上班 彩色爆米花台北酒店酒店小姐 PRETTY GIRL酒店上班酒店打工PRETTY GIRL酒店打工酒店經紀 彩色爆米花