30 April 2006

Google Beta SMT System

Is online, with a description here.

I can only judge fluency/reasonableness, not faithfulness to the input, but of the three AJ pages I tried, one came out reasonably (I could get the gist, sort of) and another came out quite poorly. On average, I'd say it's better than Babelfish's Asian-anything-to-English, but probably worse than Babelfish's French-to-English (which has been cooking for ages).

It's unfortunate that there aren't BLEU scores posted for the standard test sets, though it wouldn't be too hard to run these if someone has the test data. (The results Franz linked to are probably not from the same system, since the system they submitted to the eval last year took forty machine-hours to translate a sentence. Google engineers are good, but...) If someone would run the test set, I'd be eternally grateful (and very curious). Given that this is free, NIST may run it as a baseline for the NIST MT eval and the GALE eval, which could have a substantial impact on the community, if it is any good.
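
For reference, BLEU itself is simple enough that anyone holding the test data could compute it. Below is a minimal sketch of single-reference, corpus-level BLEU in Python, following the standard Papineni et al. (2002) definition; a real evaluation would of course use the official NIST script, and tokenization here is left to the caller.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """hypotheses, references: parallel lists of token lists."""
    matches, totals = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            matches[n - 1] += sum(min(c, r[g]) for g, c in h.items())  # clipped counts
            totals[n - 1] += max(0, len(hyp) - n + 1)
    precisions = [m / t if t else 0.0 for m, t in zip(matches, totals)]
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```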

25 April 2006

Please Pass the Pragmatics

I've spent the last few weeks (lots of plane trips) reading Pragmatics by Levinson. Pragmatics is a difficult-to-define field (like many :P), but from a linguistic perspective, it is roughly "the study of the ability of language users to pair sentences with contexts in which they would be appropriate." (pg 24)

The book talks about five aspects of pragmatics:

  1. Deixis: methods of directly encoding context into language; "Meet me here a week from now with a stick about this big." (pg 55) Without knowing the context, we cannot know the meanings of me, now (and hence a week from now), or this big. Deictic pronouns are pronouns that reference speaker- and hearer-known context.
  2. Implicature: Grice's theory says that a speaker should: be cooperative, be truthful (quality), be informative (quantity), be relevant, and be perspicuous (manner: avoid obscurity and ambiguity, be brief and orderly). (pgs 101-102) This theory is insufficient, but useful. Implicature also includes notions of quantification (perhaps, some, many, etc.) and metaphors.
  3. Presupposition: roughly describes that which is immediately inferable but not the new information in an utterance. E.g., "Sue cried before she finished her thesis" presupposes that Sue finished her thesis, but this is not the new information (which is that she cried). (pg 187) There are both semantic and pragmatic presuppositions; the latter involve shared knowledge between speaker and hearer. (For fun, compare "Sue died before she finished her thesis.")
  4. Speech Acts: a speech act is a statement that does something rather than one that merely says something (e.g., "I declare war on Zanzibar." (pg 228)). The basic question of speech acts seems to be that of identifying them and understanding how they differ from ordinary statements.
  5. Conversational Structure: analysis of the sequential (and anti-sequential) nature of conversations, interruptions, etc.
I find implicature and presupposition fascinating, because this is essentially where hidden meaning comes out in text. (See, here I've presupposed that hidden meaning is fascinating!) I've really seen very few papers that look at computational aspects of these problems (speech acts and conversational structure may be exceptions). The textual entailment work, as far as I can tell, focuses on semantic implication rather than pragmatic implication, though of course this is related (but much easier, I would imagine!).

One immediate question is whether there is an application that demands pragmatic understanding (or, say, understanding of pragmatic implicature and presupposition). Of this, I'm not sure. I'm curious how divergent pragmatic issues are cross-lingually. My hunch is "not much," which implies that this is not necessary for translation purposes (though Czech does morphologically encode the new/old distinction). IR and IE also seem impervious to the pragmatic hammer. I can come up with artificial situations that suggest QA might benefit, but I feel these are too artificial (e.g., "Did Sue finish her thesis?"). Even summarization seems fairly robust to a lack of pragmatic understanding, although the new/old issue is important here: perhaps what is old in the real discourse is not old for the reader of the summary. But if it just comes through as presupposition, it's unclear that anything is lost. I'm at something of a loss here, because pragmatic influences are so deeply embedded in human conversation that it is surprising we seem able to do without them for NLP problems.

24 April 2006

Unsupervised Learning: Why?

Unsupervised learning is very popular for NLP problems, including MT, summarization, IE, etc. I've done work in this area, as have many others. For about the past six months, I have been asking myself: should we ever really do unsupervised learning?

This question may seem silly, but there is a fairly compelling argument behind it. Regardless of whether we are in academia or industry, we will have to convince someone else that our system is doing a good job. To do this, we need evaluation data, which means we need to annotate. But once we've annotated, we should at the very least be doing semi-supervised learning, not unsupervised learning.

The problem with this argument is that it presupposes the existence of an automatic evaluation metric.

My newly formed perspective is the following. We should only do unsupervised learning if we do not have a trustworthy automatic evaluation metric (i.e., a Type III metric). I cannot currently think of a compelling argument against this.

17 April 2006

Rexa is Live

http://rexa.info

(More info here, including a post by Andrew McCallum, the man behind the machine!)

16 April 2006

SIGIR Program Up

The SIGIR program has been posted. I'm looking forward to:

  • Learning User Interaction Models (Agichtein, Brill, Dumais, Rango)
  • Improving the Estimation of Relevance Models (Diaz, Metzler)
  • LDA-based Document Models (Wei, Croft)
  • Tackling Concept Drift by Temporal Inductive Transfer (Forman)
  • Large Scale Semi-supervised Linear SVMs (Sindhwani, Keerthi)
One thing that surprises me is the distribution of industry labs' papers. There are twelve papers with at least one Microsoft author, four with an IBM author, but only one each for Yahoo and Google (and the Google one was, I believe, work done while the relevant author was at IBM). Does MSR just encourage publication more? Is it more closely tied to SIGIR (which is in Seattle this year)? Or did Yahoo and Google just get unlucky with reviews?

08 April 2006

Unlabeled Structured Data

I'll skip discussion of multitask learning for now and go directly for the unlabeled data question.

It's instructive to compare what NLPers do with unlabeled data to what MLers do with unlabeled data. In machine learning, there are a few "standard" approaches to boosting supervised learning with unlabeled data:

  1. Use the unlabeled data to construct a low dimensional manifold; use this manifold to "preprocess" the training data.
  2. Use the unlabeled data to construct a kernel.
  3. Use the unlabeled data during training time to implement the "nearby points get similar labels" intuition (a minimal sketch of this idea follows the list).
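
As a concrete instance of idea (3), here is a minimal sketch of graph-based label propagation in the style of Zhu and Ghahramani (the function name and parameters are mine, purely for illustration). The all-pairs affinity matrix makes the scaling problem discussed below immediately visible.

```python
import numpy as np

def propagate_labels(X, y, n_iter=100, sigma=1.0):
    """X: (n, d) points; y: (n,) integer labels, with -1 marking unlabeled."""
    n = X.shape[0]
    # RBF affinities between all pairs of points -- already O(n^2) space
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(axis=1, keepdims=True)   # row-stochastic transition matrix

    classes = np.unique(y[y >= 0])
    labeled = y >= 0
    F = np.zeros((n, classes.size))
    F[labeled, np.searchsorted(classes, y[labeled])] = 1.0

    for _ in range(n_iter):
        F = P @ F                          # each point absorbs its neighbors' labels
        F[labeled] = 0.0                   # ...while labeled points stay clamped
        F[labeled, np.searchsorted(classes, y[labeled])] = 1.0
    return classes[F.argmax(axis=1)]
```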
There are basically two common uses of unlabeled data in NLP (that I can think of):
  1. Use the unlabeled data to cluster words; use these word clusters as features for training (a toy version is sketched after this list).
  2. Use the unlabeled data to bootstrap new training instances based on highly precise patterns (regular expressions, typically).
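
And a toy version of NLP use (1), as promised: cluster words by counts of their immediate neighbors on unlabeled text, then hand each word's cluster id to the supervised learner as an extra feature. The use of k-means here is my choice for brevity; real systems typically use Brown clustering or similar.

```python
from collections import Counter, defaultdict
from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer

def word_clusters(sentences, n_clusters=50):
    """Cluster words by the counts of their left/right neighbors."""
    contexts = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            if i > 0:
                contexts[word]["L=" + sent[i - 1]] += 1
            if i + 1 < len(sent):
                contexts[word]["R=" + sent[i + 1]] += 1
    words = sorted(contexts)
    X = DictVectorizer().fit_transform([contexts[w] for w in words])
    ids = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    return dict(zip(words, ids))

# Downstream, each token contributes a feature like "cluster=17" to
# whatever supervised model was already being trained.
```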
Why the discrepancy? I think it is partially that the ML techniques are relatively new and were developed without language problems in mind. For instance, most manifold-learning techniques work in mid-dimensional, continuous spaces, not uber-high-dimensional, sparse, discrete ones. Moreover, most of the ML-style techniques scale as O(N^3), where N is the number of unlabeled points, which is far too expensive for any reasonably sized data set. Conversely, NLP semi-supervised techniques are very tied to their domains and don't generalize well.

The paper that's been getting a lot of attention recently -- on both sides -- is the work by Ando and Zhang. As I see it, this is one way of formalizing common NLP practice in a form digestible by ML people. This is great, because maybe it means the two camps will be brought closer together. The basic idea is to take NLP-style use (1), but instead of clustering as we normally think of it (words based on contexts), they learn classifiers that predict known "free" aspects of the unlabeled data (is this word capitalized?). Probably the biggest (acknowledged) shortcoming of this technique is that a human has to come up with these secondary classification problems. Can we try to do that automatically?
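
To make their trick slightly more concrete, here is a very rough sketch under my own simplifications (a ridge-regression stand-in for their alternating optimization; all names are illustrative, not theirs): train one linear predictor per auxiliary problem on unlabeled text, then take an SVD of the stacked weight vectors to extract structure shared across problems.

```python
import numpy as np

# Rough sketch of the Ando & Zhang idea (my simplification, not their
# algorithm): auxiliary problems have "free" labels on unlabeled text,
# e.g. "is the middle word capitalized?" predicted from context features.

def shared_structure(aux_problems, n_features, k=50):
    """aux_problems: list of (X, y); X is (n, n_features), y is +/-1."""
    W = []
    for X, y in aux_problems:
        # cheap ridge-regression stand-in for each auxiliary predictor
        W.append(np.linalg.solve(X.T @ X + np.eye(n_features), X.T @ y))
    W = np.stack(W)                      # (n_problems, n_features)
    # top right-singular vectors = feature directions that are useful
    # across many auxiliary problems
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    return Vt[:k]                        # (k, n_features) projection

# Supervised training then augments each feature vector x with the
# k extra features theta @ x learned from unlabeled data.
```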

But beyond standard notions of supervised -> semi-supervised -> unsupervised, I think that working in the structured domain offers us a much more interesting continuum. Maybe instead of having some unlabeled data, we have some partially labeled data. Or data that isn't labeled quite how we want (William Cohen recently told me of a data set where they have raw bio texts and lists of proteins that are important in these texts, but they want to actually find the names in unlabeled [i.e., no lists] texts.) Or maybe we have some labeled data but then want to deploy a system and get user feedback (good/bad translation/summary). This is a form of weak feedback that we'd ideally like to use to improve our system. I think that investigating these avenues is also a very promising direction.

03 April 2006

What is Structured Prediction?

It is surprisingly difficult to define structured prediction concretely. The root of the question is: what does it mean for a problem to be structured? A very reasonable condition seems to be the following:

Condition 1: Output elements decompose into variable length vectors over a finite set.

This notion of decomposition is fairly ubiquitous (it subsumes things like sequence labeling, parsing, MT, protein folding, etc.). Unfortunately, it does not seem sufficient. Requiring only C1 means that problems like binary classification, multitask learning, and other clearly non-structured problems fall under the heading of SP. It also doesn't capture the fact that there is some relation between the elements of these vectors.

There are, it seems, essentially two places to introduce dependence between elements in the output vectors: the features or the loss. This leads to two new conditions:

Condition 2-Feat: The feature function does not decompose over any vector representation of the output.

Condition 2-Loss: The loss function does not decompose over any vector representation of the output.

(For both of these, by "does not decompose" I mean that it is not invariant under permutations of the output vector. For example, in sequence labeling world, Hamming loss does decompose.)

In some sense, C2-Loss subsumes C2-Feat. This is because it is reasonable to believe that if the loss function doesn't decompose, then we will need (or at least want) features that span at least as much of the nondecomposing part of the output vector as the loss does. For instance, in sequence labeling under a second-order Hamming loss (accuracy on adjacent label pairs, which does not decompose), one would presumably want to use bigram-style features.
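
To see the permutation-invariance point concretely, here is a toy example (entirely my own, just for illustration): Hamming loss is unchanged when the positions of truth and prediction are jointly permuted, while a second-order loss over adjacent label pairs is not.

```python
def hamming_loss(y, z):
    return sum(yi != zi for yi, zi in zip(y, z))

def second_order_loss(y, z):
    # counts adjacent positions (i, i+1) that are not both correct
    return sum(not (y[i] == z[i] and y[i + 1] == z[i + 1])
               for i in range(len(y) - 1))

y = ["A", "B", "B", "A"]       # truth
z = ["A", "B", "A", "A"]       # prediction, wrong at position 2
perm = [2, 0, 3, 1]            # jointly shuffle the positions of both
yp, zp = [y[i] for i in perm], [z[i] for i in perm]

assert hamming_loss(y, z) == hamming_loss(yp, zp)            # 1 == 1: decomposes
assert second_order_loss(y, z) != second_order_loss(yp, zp)  # 2 != 1: doesn't
```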

By mixing and matching these conditions, we obtain several different notions of structured prediction.

Condition 1 alone leads to fully decomposable problems, and for these we can use independent classifiers à la Dan Roth and others.

Condition 1 and Condition 2-Feat is the standard interpretation of structured prediction and leads to, among other things, max-margin Markov networks.

Condition 1 and Condition 2-Loss is my preferred interpretation and leads to, as far as I know, no published work (though we have a paper at Snowbird on it, and something conditionally accepted at ICML).

Given that 1 and 2-Feat is the standard, I feel some need to defend my interpretation. To me, the 1+2-Feat interpretation specifies SP as a solution, not a problem. That is, the world doesn't force us to use structure: we simply do so because we believe structured features will be useful. On the other hand, when the loss function doesn't decompose, the world is essentially telling us that we had better pay attention to the structure; we can't reasonably expect to do well otherwise. I like to define problems in terms of what the world gives us, not what we create for ourselves.

My interpretation successfully excludes problems like multiclass classification and multitask learning from the realm of structured prediction. Interestingly, it also excludes sequence labeling under Hamming loss and collective classification (à la statistical relational learning). I don't see this as necessarily a bad thing, especially since this definition now includes problems I care about, like MT, summarization, and ASR, under any reasonable loss function (and any reasonable loss function for these problems will not decompose).