my biased thoughts on the fields of natural language processing (NLP), computational linguistics (CL) and related topics (machine learning, math, funding, etc.)

## 15 December 2008

### Interesting NIPS papers, take 1

"Large Margin Taxonomy Embedding for Document Categorization" - Kilian Weinberger, Olivier Chapelle (Yahoo)

- Suppose you have a multi-class classification problem where you need to assign documents to different nodes in the topic hierarchy. While there are hierarchical classifiers for solving this problem, the authors instead proposes to embed the taxonomy in a continuous space and use regression. The idea is as follows: from a taxonomy, we can compute distances between nodes that characterize the loss of classifying one node when the true class is the other. This pairwise distance matrix is used by multidimensional scaling to create a set of prototype vectors in Euclidean space, one for each class. Then, we train multiple regression that maps training samples to these prototype vectors. However, the problem with this two-stage approach is that the prototypes are computed without regard to the training data, so the solution may be suboptimal for classification. The paper then introduces an objective function that combines both steps: essentially, we constrain the mapping of the training samples and the prototypes to have a large margin.

"Learning taxonomies by Dependence Maximization" - Matthew Blaschko, Arthur Gretton (MPI)

- Our goal is to cluster a dataset and provide a taxonomy that shows the relationship between clusters. The standard solutions are agglomerative/divisive hierarchical clustering. This paper proposes an alternative solution which allows us to use kernels (and is thus related to spectral clustering). The idea is based on a kernel measure of dependence: roughly speaking, if K is the kernel matrix of the original data and L is the kernel matrix of the resulting clustering, the objective max_{L} trace(K*L) is an measures the dependence between samples and clusters and is thus a viable clustering objective. The method gets a taxonomy by formulating L=PYP' where P is a partition matrix (maps cluster to samples) and Y is a positive semi-definite matrix that encodes relationships between clusters.

Fast Prediction on a Tree" - Mark Herbster, Massimiliano Pontil, Sergio Rojas (UCL)

- Graph-based semi-supervised learning needs to scale well with the number of unlabeled samples in order to be truly useful in large data scenarios. This paper presents a method to improve the computational scalability of Laplacian-based methods: First, convert the data graph to a tree (using, e.g. a maximum spanning tree algorithm). Second, they show a fast way to compute the pseudo-inverse of the graph/tree Laplacian in O(m2 + mS), where m is the number of labeled samples and S is the tree diameter. This Laplacian pseudo-inverse corresponds to a kernel, and so one can use, say, a kernel perceptron. to predict on test points. Experiments show that tree approximations to graph did not deteriorate accuracy, while drastically increasing speed.

"Unlabeled data: Now it helps, now it doesn't" - Aarti Singh, Rob Nowak, Jerry Zhu (Wisconsin)

- This is an interesting theoretical paper that analyzes when unlabeled data helps under the cluster assumption. First, the authors argue that asymptotic analysis is unsuitable for analyzing the difference between supervised learning and SSL, and instead uses finite-sample analysis and minimax bounds. Let n be the number of labeled samples, m the number of unlabeled samples, d the feature dimension, and g the margin between two classes (can be positive or negative). The proof is of the form: suppose a clairvoyant supervised learner will full knowledge of the underlying density p(x) has error less than e2(n), and a supervised learner has error greater than e1(n). Then, the error of SSL is no more than e2(n) + O(some function of m). Thus, if O(some function of m) is negligible (and this depends on the exact values of m,d,g,n), then SSL will improve over supervised learning; otherwise, no. In words, the cases where SSL helps is as follows: if the margin g is relatively large compared to the average spacing between labeled points (n^{-1/d}), then supervised learning can discover p(x) accurately and works just as well as SSL. However, if g is small relative to the spacing between labeled points, but large relative to the spacing between unlabeled points (m^{-1/d}), then SSL will beat any supervised learner. In the case that the margin is negative, if -g is larger than (m^{-1/d}), then SSL also wins.

"DiscLDA: Discriminative learning for dimensionality reduction and classification" - Simon Lacoste-Julien, Fei Sha, Michael Jordan (Berkeley/USC)

- Unsupervised topic models have become popular methods for finding latent structures in text documents. These are usually trained by max likelihood, but this may be suboptimal if our final goal is classification. This paper considers the problem of introducing labeled data (e.g. topic labels) into topic models. Recall in Latent Dirichlet Allocation (LDA), foreach each document, we first draw a (k-dimensional) topic mixture from a Dirichlet prior. Then we draw a words according to p(word|topic)p(topic|topic-mixture). We can view each document as a topic simplex. The idea here is to introduce a transformation T on the topic simplex, so that documents with the same label will be mapped close together.

"Modeling the effects of memory on human online sentence processing with particle filters" - Roger Levy (UCSD), Florencia Realia, Tom Griffiths (Berkeley)

- Humans comprehend sentences in an online manner: it is believed that we do incremental parsing as we hear words one at a time. Thus, garden-path sentences are able to catch us off-guard. Moreover, the longer the sentence is before a disambiguation point is reached, the harder it is for humans to recover (digging-in effect). This is a psycholinguistics paper that seeks to explain garden-path and digging-in by a novel particle-filter based PCFG parser: essentially, whenever a word is received, a partial parse is sampled. The number of "incorrect" particles increase with sentence length (modeling digging-in), and the number of particles used correlates with the memory constraints of the brain.

"Tighter bounds for structured estimation" - Olivier Chapelle, et. al. (Yahoo/Stanford/NICTA)

- A common approach in optimizing difficult loss functions is to minimize a convex upper bound instead (e.g. hinge loss in SVM's). However, these losses are often loose. In particular, outliers often suffer large loss, so the general classifier accuracy may be sacrificed since the optimizer focuses on these extremely difficult points. The idea here is to use a non-convex, but tighter upper bound. They adopt a ramp-loss for the structured prediction problem and use the convex-concave procedure to solve it.

## 07 December 2008

### Two Reviewer Comments that Stuck with Me

- "Other approaches don't have to be bad in order for your approach to be good."
- "If I were working in this area, I would want to know about these results."

I feel like we too often see research in a competitive light. There are two things that can cause this. The first is funding. Short-term funding (in the US) is essentially a zero-sum game, which means that I can win only if you lose. (There are models where this is less true, but usually that has other -- not necessarily desirable -- outcomes.)

The second is the scooping effect: many times when we have (what we think is) a cool idea, we want to "beat" everyone else to the punch. I recall a comment John Langford made on his blog (which I cannot seem to find now because I can't figure out what search terms to use) quite some time ago along these lines... saying that for many problems he doesn't care who finds a solution so long as the solution is found. I usually have mixed feelings. For some topics, I definitely feel this way. Indeed, for some topics I actually don't want to be the person who figures it out, either because I don't feel I have the necessary skills or because I don't have the necessary time. For some topics, I do get a feeling of ownership and really want to be the person who does it. My thesis work was like that, as was my document/abstract alignment work. This is definitely highly personal: I know plenty of people who care a lot more about ownership than I do, and many who care a lot less.

What I took away from this comment is essentially the realization that we are all working toward some vague future goal, which has to do with computationalizing language processing (or some other topic, for the non-NLP audience). Progress is good. If I've done work that has something interesting and novel to say about this goal, then it's not bad -- and is often good -- that this builds on and improves on your work.

## 01 December 2008

### Where did my (linear) model go wrong?

Let's say we're learning a predictor f and it doesn't get 100% test accuracy. We can ask ourselves: why not. Typically, the reason I ask myself this question is because I want to get closer to 100% test accuracy and want to know how I should go about it. Should I add more labeled data (what should it look like) or should I add more features (what should they look like) or am I just hosed?

In order to try to get a handle on this, we can ask ourselves: where can error come from:

- Noise in the training data (which we have fit and is now screwing us up at test time)
- Noise in the test data (which is causing our otherwise correct predictions to look wrong)
- Insufficient representation (we just don't have the right/enough features)
- Insufficient examples (our training data wasn't sampled densely enough in some region)

I'm going to use (binary) text categorization as a working example because it's simple yet exhibits NLP-like properties (ever growing feature sets, sparse feature representations, etc.). So let's say we train a model (perhaps linear) on a bag of words feature set and apply it on test data. Say we get 90% accuracy on dev data. Now what can we do?

Let's take a single example that we misclassified. Perhaps take the "worst" (i.e., most misclassified in a margin sense), though I don't know that it matters. Which of the four reasons above tells us why this example was misclassified?

By looking at the example (yes, by hand!) we can probably tell if it's a problem due to issue (2: Noise in the test data). We might like to try to automate this ascertainment, but honestly I'm at a loss for how to do this. Let's say we decide that the problem isn't due to test set noise. Now what?

Let's consider the following approach. We are going to retrain our classifier several times (okay, this is expensive, but we can worry about this later). What we're going to do is add this misclassified dev point into the training data with a varying cost. The currently trained model we will call f(0), since it is based on giving this dev point a cost of zero. We can then train f(c) for a large range of costs c and let C be some reasonable upper bound (i.e., we guarantee that C is big enough that f(C) correctly classifies this point -- for any reasonable classifier, there should be such a finite C). Go ahead and replace "correctly classifies" with "classifies with a sufficiently large margin" if you'd prefer; I doubt it matters.

Now, we're going to look at two of these fs. We'll look at f(0) and f(c'), where c' is the minimal value of c such that this dev example becomes correctly classified. Let's say we now run these two fs on the training data. We know that f(0) will misclassify our "selected" test point and that f(c') will not. The big question is what do the fs do on the other points.

- Suppose that f(c') doesn't make any (many?) more mistakes than f(0). That is, they're basically the same, just now classifying our selected point correctly. This suggests that the problem is (3) or (4) above (features or examples).
- Suppose that f(c') makes many more mistakes than f(0). Now, we see that in order to get this selected test point correct, we have to pay by getting other things wrong (that we didn't before). This suggests that the problem is (1) or (3) above (noise or features). Importantly, it's not (4: examples).

Now, let's go back and do the same process of all of the dev points that were misclassified. What can happen?

- Almost all of the f(c')s make no more errors on other training points. Unless all of these erroneous dev points are markedly different from the training data (in which case we really have a domain adaptation problem), then this is almost certainly a feature issue (3).
- Almost all of the f(c')s make many more errors on the other training points, and the set of training points on which they make these errors is usually the same. Then this is almost certainly noisy training data (or noisy test data).
- Almost all of the f(c')s make many more errors on the other training points, but the set of training points on which they err is substantially different each time. Then this is almost certainly a feature issue (3).
- Mixed results: some f(c')s make more errors on training, some don't. This is harder to say, but my gut tells me that this is probably a (4: example) issue.

(If you're using a margin based classifier, you might not have to exactly retrain each time. Koby Crammer's passive aggressive algorithm will essentially give you a closed form solution for "closest (in L2) weight vector to the current weight vector such that a given point is correctly classified," which could save some computational effort.)

Note that I haven't actually tried this. I'd definitely be interested to, but wanted to think about it a bit before I shot off to implement something.

## 24 November 2008

### Supplanting vs Augmenting Human Language Capabilities

What about NLP?

I would say that most NLP research aims to supplant humans. Machine translation puts translators out of work. Summarization puts summarizers out of work (though there aren't as many of these). Information extraction puts (one form of) information analysts out of work. Parsing puts, well... hrm...

There seems actually to be quite little in the way of trying to augment human capabilities. Web search might be one such area, though this is only tenuously an "NLP" endeavor. Certainly there is a reasonable amount of work in translation assistance: essentially fancy auto-completion for translators. Some forms of IE might look like this: find all the names, coreferenceify them and then present "interesting" results to a real human analyst who just doesn't have time to look through all the documents... though this looks to me like a fancy version of some IR task that happens to

*use*IE at the bottom.

Where else might NLP technology be used to augment, rather than supplant?

- A student here recently suggested the following. When learning a new language, you are often reading an encounter unknown words. These words could be looked up in a dictionary and described in a little pop-up window. Of course, if the definition (hopefully sense-disambiguated) itself contains unknown words, you'd need to recurse. He then suggested a similar model for reading Wikipedia pages: tell Wikipedia everything you know and then have it regenerate the "variational EM" page explaining things that you don't know about (again, recursively). This could either be interactive or not. Wikipedia is nice here because you can probably look up most things that a reader might not know via internal Wikipedia links.
- Although really a field of IR, there's the whole interactive track at TREC that essentially aims for an interactive search experience, complete with suggestions, refinements, etc.
- I can imagine electronic tutorials that automatically follow your progress in some task (eg., learning to use photoshop) and auto-generate text explaining parts where you seem to be stuck, rather than just providing you with random, consistent advice. (Okay, this starts to sound a bit like our mutual enemy clippy... but I suspect it could actually be done well, especially if it were really in the context of learning.)
- Speaking of learning, someone (I don't even remember anymore! Sorry!) suggested the following talk to me a while ago. When trying to learn a foreign language, there could be some proxy server you go through that monitors when you are reading pages in this language that you want to learn. It can keep track of what you know and offer mouseover suggestions for words you don't know. This is a bit like the first suggestion above.

One big "problem" with working on such problems is that you then cannot avoid actually doing user studies, and we all know how much we love doing this in NLP these days.

## 04 November 2008

### Using machine learning to answer scientific questions

A less broad type of question that one can ask might be categorized as: is some type of feature relevant for some task. For instance, are lexical semantics features derived via a hand-crafted resource (such as WordNet) useful for the task of coreference resolution? What about lexical semantics features derived automatically from the web? This type of question is the sort of thing that I looked at in an EMNLP 2005 paper. The answers seemed to be "no" and "yes" respectively.

There are several issues with trying to answer such questions. One issue is that typically what you're actually looking at is the question: can I figure out a way to turn WordNet into a feature vector that is useful for the task of coreference resolution? This is probably a partial explanation for why our community seems not to like negative results to such questions: maybe you just weren't sufficiently clever in encoding WordNet as features. Or maybe WordNet features are useful but there is some other set of features that's more useful that's just swamping the benefits of WordNet.

My first question is whether we're even going about this the right way. The usual approach is to take some baseline system, add WordNet features, and see if predictive performance goes up (i.e., performance on test data). This seems like a bit of a round-about way to attack the problem. After all, this problem of "does some feature have an influence on the target concept" is a classical and very well studied area of statistics: most people have probably seen ANOVA (analysis of variance), but there are many many more ways to try to address this question. And, importantly, they don't hinge on the notions of predictive performance. (Which almost immediately ties us in to "my system is better than yours.")

Now, I'm not claiming that ANOVA (or some variant thereof) is the right thing to do. Our problems are often quite different from those encountered in classical statistics. We have millions of covariates (features) and often complex structured outputs. I just wonder if there's some hint of a good idea in those decades of statistical research.

The other way to go about looking at this question, which at the time I wrote the EMNLP paper I thought was pretty good, is the following. The question is: why do I

*care*if WordNet features are useful for prediction. Very rarely will they actually hurt performance, so who cares if they don't help? (Besides those people who actually want to know something about how language works.... though in this case perhaps it tells us more about how WordNet works.) The perspective we took was that someone out there is going to implement a coreference system from scratch (perhaps for a new language, perhaps for a new domain) and they want to know what features they should spend time writing and which ones they can ignore. This now

*is*a question about predictive performance, and we did a form of backwards feature selection to try to answer it. If you look at Figure 2 in the paper, basically what you see is that you'd better implement string-match features, lexical features, count-based features, and knowledge-based features. On top of this, you can get another half point with inference features and another half again using gazetteers. But beyond that, you're probably wasting your time. (At the time of the EMNLP paper, one of this big "sells" was the count-based features, which are impossible in the standard pairwise-decision models that most people apply to this problem. I still think this is an interesting result.)

At any rate, I still think this ("what should I spend time implementing") is a reasonable way to look at the problem. But if you're moving to a new domain or new language, most likely all of your time/money is going to be spent annotating data, not implementing features. So maybe you should just go and implement everything you can think of anyway. And besides, if you want to make an argument about moving to different languages or different domains or even different amounts of training data, you should make those arguments specifically and do the experiments to back them up. For instance, how does Figure 2 in the EMNLP paper change when you have half the data, or a tenth the data? Maybe the lexical features don't matter as much and perhaps here WordNet might kick in?

## 29 September 2008

### Statistical Machine Translation Papers at COLING

(guest post) Not always is a major conference a short train ride away, so I went down to Manchester even though I had no real business being at COLING this year. Liang Huang gave an interesting tutorial where he tied together a number of natural language problems and types of dynamic programming solutions for it. A nice treatment of this area, with examples from tree-based statistical machine translation, of course. There were also a lot of very strong papers, so let's take a look at them.

**Syntax-Based SMT is alive:** Currently one of the most exciting areas of statistical machine translation is the development of tree-based models, with all the open question: Real syntax in the source, in the target? Which grammar formalism? How do we parameterize the models? And all this efficient, please! When using real syntax on both sides, many of the rules in standard phrase-based models do not match anymore the syntactic annotations, so they have to be dropped. So, to accommodate more rules, **Zhang et al.** propose to extend the synchronous grammar formalism to allow multi-headed rules. **He et al.** build maximum entropy models for rule selection in hierarchical phrase models, similar to recent work by Carpuat and Wu in phrase-based models. Additional source-side context may of course be a syntactic parse tree. **Xiong et al.** use features about syntactic spans in their translation model and maximum entropy reordering model in a BTG framework. **Zhang et al.** present an algorithm to extract rules from a word-aligned corpus in linear time. The trick is to build up a tree structure that compactly encodes all possible rules. Storing all these rules is a real problem, too, it may take up tera-bytes, but **Lopez** presents a handy solution: suffix arrays. He also shows that rules that cover longer spans do help in hierarchical models, even if long phrases do not matter much in phrase-based models. Finally, **Zollmann et al.** present detailed results from their time at Google where they compared their syntax-augmented model against a hierarchical and standard phrase model -- so, syntax has arrived at Google at last.

**Back to the Basics:** Statistical machine translation systems have evolved into a fairly long pipeline of steps, and many of them deserve more attention. For instance the phrase segmentation of the input. Does it matter? The standard approach uses uniform segmentation, and maybe a phrase count feature. **Blackwood et al.** suggest to build an explicit phrasal segmentation model, which takes the form of a smoothed bigram phrase model. **Moore and Quirk** examine everybody's favorite: minimum error rate training (MERT), the step used to fine-tune weights given to various model components. This is typically done with a random starting point, and then hillclimbing to a better weight setting to optimize BLEU. Since this method easily gets stuck in local minmima, this is repeated several times. Moore and Quirk promise better merting (yes, that is a verb!) with fewer random restarts and better selection of the starting points through a random walk. But also the n-best list of candidate translations used during merting or re-ranking may be improved. **Chen et al.** suggest to add additional entries using a confusion network or by concatenating n-grams found in candidate translations. Speaking of building of confusion networks, they are also used in combining different system outputs (which is the current key method to win competitions, even though bad-mouthed as *intellectually bankrupt* at the recent NIST evaluation meeting). **Ayan et al.** suggest that better confusion networks and better system combination results may be achieved, when considering synonyms as found in Wordnet for matching up words when aligning the different outputs. **Zwarts and Dras** show that system combination could also be done by building a classifier that chooses one of the outputs for each sentence.

**Reordering:** Two papers on reordering, one of the big unsolved problems in statistical machine translation. **Elming** shows some advances in the first-reorder-into-lattice-then-decode approach, specifically how this lattice should be scored: he argues for scoring the final output to check which rules were implicitly followed (either by applying the rule or using a phrase translation that has it internalized). **Zhang et al.** propose that different reordering models should be built for different types of sentences, such as statements vs. questions.

**Neat and surprising:** We typically train our system on corpora that we find on web sites of multilingual institutions, such as the UN or the EU. When using such data, does it matter what the original source language was? I doubt it, but then **van Halteren** shows that he can detect the original source language in English Europarl documents with over 96 percent accuracy.

**Connecting with our rule-based friends:** An effective but simple method to combine a rule-based system like Systran with a statistical model is but first translating with Systran and then learn a statistical model to translate Systran English into real English. Based on your biases, you can call this rule-based preprocessing, serial combination, or statistical post-editing. **Ueffing et al.** show that some information from the rule-based system may help the statistical component, such as annotation of names or other reliable markup.

**Aligning and building up dictionaries:** What can you do with a large monolingual source language corpus, or a target language corpus, or a conventional bilingual dictionary? **Wu et al.** present various methods and compare them. **Tsunakawa et al.** use a method using a pivot language (English) to build a Japanese-Chinese dictionary. Given a syntactically parsed parallel corpus, **Zhechev and Way** compare different methods how to extract subtrees. **Macken et al.** extract technical term translations from a domain-specific parallel corpus. **Lardilleux and Lepage** propose an iterative phrase alignment method that first matches short sentences and phrases, and then subtracts the known alignments from longer phrases to extract the remainder, basically pigeon-holing.

**Also:** A paper on a Bayesian method for Chinese word segmentation by **Xu et al.**; a paper of transliteration, by **Malik et al.**; a paper on evaluation, especially if quality of reference translations matters (it does not), by **Hamon and Mostefa**; and a new grammar formalism for translation by **Søgaard**.

**What's missing?** No paper on word alignment or large-scale discriminative training, but there is always **Hawaii**.

## 21 September 2008

### Co-training, 10 years later

For those unfamiliar with the idea of co-training, you should really read the paper. There's also a wikipedia entry that describes it as:

Co-training is a semi-supervised learning technique that requires two views of the data. It was introduced by Avrim Blum and Tom Mitchell. It assumes that each example is described using two different feature sets that provide different, complementary information about the instance. Ideally, the two views are conditionally independent (i.e., the two feature sets of each instance are conditionally independent given the class) and each view is sufficient (i.e., the class of an instance can be accurately predicted from each view alone). Co-training first learns a separate classifier for each view using any labeled examples. The most confident predictions of each classifier on the unlabeled data are then used to iteratively construct additional labeled training data.This is a good summary of the algorithm, but what is left off is that---as far as I know---co-training was one of the first (if not the first) method for which theoretical analysis showed that semi-supervised learning might help. My history is a bit rough, so anyone should feel free to correct me if I'm wrong.

Another aspect of co-training that is cool for readers of this blog is that to a reasonable degree, it has its roots in a 1995 ACL paper by David Yarowsky: Unsupervised Word Sense Disambiguation Rivaling Supervised Methods, which, as far as I know, was really the first paper to introduce the notion of having two views of data (although I don't think David described it as such).

All in all, the co-training paper is great. In fact, if you don't believe me that I think it's great, check out my EMNLP 2008 paper. My analysis (and, to some degree, algorithm) are based heavily on the co-training analysis.

Which brings me to what I really want to discuss. That is, I have a strong feeling that if the co-training paper were reviewed today, it would be blasted for the theoretical analysis. (Indeed, I had the same fear for my EMNLP paper; though since it was EMNLP and not, say, COLT, I don't think the problem was as severe.) The problem with the co-training paper is that the theoretical result is for an algorithm that is only superficially related to the actual algorithm they implement. In particular, the actual algorithm they implement uses notions of confidence, and steadily increasing training set size, and incremental additions and so on. It's

*vastly*more complicated that the algorithm they analyze. My recent experience as both an author and reviewer at places like NIPS and ICML is that this is pretty much a non-starter these day.

In fact, the algorithm is so different that it took three years for an analysis of something even remotely close to come out. In NIPS 2001, Sanjoy Dasgupta, Michael Littman and David McAllester published a paper that actually tries to analyze something closer to the "real" co-training algorithm. They get pretty close. And this analysis paper is a

*full*NIPS paper that basically just proves one (main) theorem.

(A similar set of events happened with David Yarowsky's paper. He didn't include any theoretical analysis, but there has been subsequent work, for instance by Steve Abney to try to understand the Yarowsky algorithm theoretically. And again we see that an analysis of the exact original algorithm is a bit out of grasp.)

I'm sure other people will disagree--which is fine--but my feeling about this matter is that there's nothing wrong with proving something interesting about an algorithm that is not quite exactly what you implement. The danger, of course, is if you get an incorrect intuition. For instance, in the case of co-training, maybe it really was all these "additions" that made the algorithm work, and the whole notion of having two views was useless. This seems to have turned out not to be the case, but it would be hard to tell. For instance, the co-training paper doesn't report results on the

*actual*algorithm analyzed: presumably it doesn't work very well or there would be no need for the more complex variant (I've never tried to implement it). On the other hand, if it had taken Avrim and Tom three extra years to prove something stronger before publishing, then the world would have had to wait three extra years for this great paper.

The approach I took in my EMNLP paper, which, at least as of now, I think is reasonable, is to just flat out acknowledge that the theory doesn't really apply to the algorithm that was implemented. (Actually, in the case of the EMNLP paper, I

*did*implement both the simple and the complex and the simple wasn't

*too much*worse, but the difference was enough to make it worth--IMO--using the more complex one.)

## 08 August 2008

### Parallel Sampling

There are several ways to approach this problem.

The first is the "brain dead" approach. If you have N-many cores, just run N-many parallel (independent) samplers. Done. The problem here is that if N is really like (say 1000 or greater), then this is probably a complete waste of space/time.

The next approach works if you're doing something like (uncollapsed) Gibbs sampling. Here, the Markov blankets usually separate in a fairly reasonable way. So you can literally distribute the work precisely as specified by the Markov blankets. With a bit of engineering, you can probably do this is a pretty effective manner. The problem, of course, is if you have strongly overlapping Markov blankets. (I.e., if you don't have good separation in the network.) This can happen either due to model structure, or due to collapsing certain variables. In this case, this approach just doesn't work at all.

The third approach---and the only one that really seems plausible---would be to construct sampling schemes exclusively for a massively parallel architecture. For instance, if you can divide your variables in some reasonable way, you can probably run semi-independent samplers that communicate on an as-needed basis. The form of this communication might, for instance, look something like an MH-step, or perhaps something more complex.

At any rate, I've done a bit of a literature survey to find examples of systems that work like this, but have turned up surprisingly little. I can't imagine that there's that little work on this problem, though.

## 29 July 2008

### What Makes a Fair Comparison

Let's instantiate this with two examples. (Yes, these are examples of papers that have been published; you can see if you can figure out which ones they are. I'm just not sufficiently creative to make something up right now.)

- Let's say I'm doing syntactic language modeling. In this case, the baseline would be (say) a trigram language model. My model requires parse trees to train. So I train my language model on the Penn Treebank. Now, when I compare to an ngram model, is the fair comparison to an ngram model trained on the Treebank, or one trained on (say) all WSJ from ten years?
- Now let's say I'm doing MT. My baseline is Moses and I build some fancy MT system that can only train on sentences of length 10 or less. Should I compare to Moses trained on sentences of length 10 or less, or all sentences?

The advantage to comparing to a gimped version of model A is that it tells you: if this is the only data available to me, this is how much better I do than A. Of course, in no real world situation (for many problems, including the two I listed above) will this really be the only data you have. Plus, if you compare to a non-gimped A, you'll almost always lose by a ridiculously large margin.

On the other hand, comparing against a non-gimped A is a bit unfair. Chances are that quite some time has gone in to optimizing (say, for speed) algorithms for solving A. Should I, as a developer of new models, also have to be a hard-core optimizer in order to make things work? I'm thinking back to the introduction of SVMs twenty years ago. Back then, SVM training would take thousands of times longer than naive Bayes. Today, (linear) SVM training really isn't that much slower (maybe a small constant times longer).

Yet from the perspective of a consumer, there's something fundamentally unpleasant about having to gimp an existing system.... you have to ask yourself: why should I care?

I suppose this is where human judgement comes in. If I can reasonably imagine that the new system B might possibly be scaled up and, if so, I think it would continue to do well, then I'm not unhappy with a gimped comparison. For example, I can probably buy the syntactic language modeling example above (and to a lesser degree the MT example). I have a harder time with grammar induction on 10 word sentences because my prior beliefs state that 10 word sentences are syntactically really different from real sentences.

(Incidentally, although this issue didn't come up in my recent reviewing, I suppose that if I were reviewing a paper that had this issue, it probably wouldn't hurt if the authors were to ease me through this imagination process. For instance, in grammar induction, maybe you can show statistics that say that the distribution of context free rules is not so dissimilar between all sentences and short sentences. This almost never happens, but I think it would be useful both during reviewing as well as just for posterity.)

## 21 July 2008

### ICML/UAI/COLT 2008 Retrospective

The tutorials were incredible. I'm reaching that point where I often skip out on tutorials because there's nothing new enough. This time, there was an embarrassment of riches: I couldn't go to everything I wanted to. In the end, I went to the Smola+Gretton+Fukumizu tutorial on distributional embeddings and the Krause+Guestrin tutorial on submodularity. Both were fantastic. My only complaint is that while all the plenary talks at ICML/UAI/COLT were video taped, the tutorials were not! Anyway, I will blog separately about both of these tutorials in the near future; also check out their web pages: the slides are probably reasonably self-explanatory.

The first day began with an invited talk by John Winn, who does (at least used to do) probabilistic modeling for computer vision. I'm not sure if this was the intended take-away message, but the thing I liked best about the talk is that he listed about a dozen things that you would have to model, were you to model "vision." He then talked about problems and techniques in terms of which of these you model, which of them you fix, and which of them you treat as "noise." I think this is a great way to think about modeling any kind of complex real world phenomenon (hint hint!).

My favorite morning talk was Beam Sampling for the Infinite Hidden Markov Model by Jurgen Van Gael, Yunus Saatci, Yee Whye Teh, and Zoubin Ghahramani. This is a really clever use of slice sampling, which everyone should know about! The idea is to resample entire sequences of hidden states in one go. The trouble with this in an infinite model is that I don't know how to do forward-backward over an infinite state space. The trick is to slice the probability mass on the lattice and only sample over those state/observation pairs that stick out above the slice. This is effectively a fully Bayesian way to do beam-like methods, and is certainly no limited to infinite HMMs.

That afternoon, there was the joint award session, with a "ten year" award (great idea!) going to Blum and Mitchell's original co-training paper from COLT 1998. This is clearly a very influential paper, and one I will blog about separately. I particularly liked two of the best paper award recipients:

- Percy Liang and Michael Jordan for An Asymptotic Analysis of Generative, Discriminative, and Pseudolikelihood Estimators. The idea here is to look at what various estimators are like in the limit of infinite data. The conclusion (as I read it) is that if your model is correct (i.e., the true distribution is realized by a certain set of parameters in your model family), then you're asymptotically better using generative models. If the model is even a tiny tiny bit incorrect, then you're better using discriminative. The key weakness of this analysis seems to be that, since we're looking only at asymptotics, there's no notion of a "sortof-wrong" model -- the wrongness gets blown up in the limit.
- Shai Shalev-Shwartz and Nati Srebo for SVM Optimization: Inverse Dependence on Training Set Size. The idea here is that having lots of data should be a good thing, not something that makes us say "ugh, now my SVM will take forever to run." The key to getting an inverse dependence on the size of the training set is to sort of change the problem. The idea is that if you have 1,000 training points, you may get 88% accuracy and it may take an hour. If you have 1,000,000 training points, we're not going to ask that we find something with 90% accuracy, but rather still aim for 88% accuracy but try to get there faster. The analysis is based on the idea that if we have lots more data, we can be a bit looser with finding the exact best model parameters, which is what usually takes so long. We can afford to have not exactly the best parameters, since we're still going for 90% accuracy, not 88%.

Tuesday morning, there was a great talk on Efficient Projections onto the L1-Ball for Learning in High Dimensions by John Duchi, Shai Shalev-Shwartz, Yoram Singer and Tushar Chandra. Here, there's trying to extend the Pegasos algorithm to do L1 regularized SVMs. The difficult step is, after taking a sub-gradient step, projecting yourself back on to the L1 ball. They have a very clever algorithm for doing this that makes use of red-black trees to maintain a notion of when a dimension was last updated (necessary in order to get complexity that depends on the sparseness of feature vectors). The only worry I have about this approach is that I'm very happy dealing with arrays for everything, but once someone tells me that my algorithm needs to use a tree structure, I start worrying. Not because of big-O issues, but just because trees mean pointers, pointers mean pointer chasing, and pointer chasing (especially out of cache) is a nightmare for actual efficiency.

All the papers in the Tuesday 10:40am online learning section I found interesting. Confidence-weighted linear classification (Mark Dredze, Koby Crammer and Fernando Pereira) presents a technique for keeping track of how confident you are about your weights for various features (if you squint it looks very PAC-Bayes) and they get very good online performance with only one or two passes over the data. Francesco Orabona, Joseph

Keshet and Barbara Caputo had a paper on memory-bounded online perceptrons called the Projectron. The idea is simple and effective; however, the big take-away I had from this paper has to do with the baseline they compare against. If you only allow your perceptron to maintain, say, 100 support vectors, just use the first 100 points that the perceptron updates on. This does pretty much as well as most of the past work in this problem! How did this not get noticed before? Finally, Sham Kakade, Shai Shalev-Shwartz and Ambuj Tewari talked about bandit problems; this is interesting work, but still seems far from practical. The idea of naively picking an unknown label with uniform probability over a gigantic set of labels is just not going to work in the real world. But I think it's a really good start.

The entire NLP section on the last afternoon was good. David Chen and Ray Mooney talked about how to generate sportscasts from RoboCup trials; Ronan Collobert and Jason Weston talked about using neural nets to solve every NLP problem in the world (more on this in another post); Jason Wolfe, Aria Haghighi and Dan Klein talked about how to parallelize the M step in addition to the E step; and then Percy Liang, me, and Dan Klein talked about whether its reasonable or not to get rid of features in structured prediction (more on this in another post).

The next day was workshop day. The two workshops I went to were the Prior Knowledge in Text workshop (that I co-organized with Marc Dymetman, Guillame Bouchard and Yee Whye Teh), and the Nonparametric Bayes workshop (where I talked a tiny bit about HBC). I'll relate the news of our own workshop at a later date; you can see some about the NPBayes workshop here.

Then came UAI and COLT. I must admit by this time I was a bit conferenced out, so I missed a bunch of sessions. However, there were still some papers that stood out for me.

In UAI, David Mimno and Andrew McCallum presented Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression, essentially a conditional variant of LDA wherein the distribution over hyperparameters is given by a generalized linear model and can use arbitrary features. One audience question that came up (David, if you're reading, maybe you can answer this!) had to do with putting the GLM on the "alpha" hyperparameter. The fact that there was such a big improvement over baseline suggests that LDA-like models are highly sensitive to the setting of their hyperparameters: this is a bit surprising. I (like the questioner) was a bit surprised that tweaking a hyperparameter could have such a big influence! Nevertheless, very cool.

The best paper went to David Sontag, Talya Meltzer, Amir Globerson, Tommi Jaakkola and Yair Weiss for Tightening LP Relaxations for MAP using Message Passing. The idea is to take your marginal polytope initially defined by simple single-node marginal constraints and iteratively add pairwise, 3-wise, etc., constraints. There are some implementation details that allow you to do warm restarts. They managed to scale really amazingly well. This furthers my confidence that LPs are a really great way to do MAP inference in graphical models.

Kuzman Ganchev, Joao Graca, John Blitzer and Ben Taskar talked about Multi-View Learning over Structured and Non-Identical Outputs. I see this as a bit of an extension to the "alignment by agreement" and the "agreement-based learning" work, but now the difference is that the output spaces don't have to "match up."

Marina Meila and Le Bao had a poster on Estimation and clustering with infinite rankings, where they give a nonparametric model over rankings, focusing on properties of the distribution. Kurt Miller, Tom Griffiths and Mike Jordan had an extension to the Indian Buffet Process that intentionally removes exchangeability in order to model cases where we know the data is not exchangeable. Chong Wang, Dave Blei and David Heckerman had a nice result on how to do topic modeling over continuous time (modeled as Brownian motion). The cool thing is that by going continuous, they're able to get a more efficient algorithm that for previous work that functioned in discrete time steps.

At COLT, there were also a good number of good papers. Shai Ben-David, Tyler Lu and David Pal ask: Does Unlabeled Data Provably Help? Worst-case Analysis of the Sample Complexity of Semi-Supervised Learning. I'll ruin the suspense: No, it does not. (At least not without strong assumptions about the label distribution.) Nina Balcan, Avrim Blum and Nati Srebo

show that learning with similarity functions is even better than we thought.

One paper at COLT that I especially liked because I wasn't aware of much of the previous work they cited was by Liwei Wang, Masashi Sugiyama, Cheng Yang, Zhi-Hua Zhou and Jufu Feng On the Margin Explanation of Boosting Algorithms. The history here is roughly like this. Boosting came along and worked pretty well. People started trying to explain it from the perspective of margin-based analysis. Then, Breiman came along and described an algorithm arc-gv that provably generates a better margin than AdaBoost (and does so in practice as well), yet works worse! This paper attempts to re-analyze boosting from the perspective of an "Equilibrium Margin," which provides sharper bounds and results that actually agree with what is observed empirically.

## 08 July 2008

### ICML Business Meeting

- ICML 2009 will be in Montreal, June 15-19. Colocated with COLT (June 20-23) and Symposium on RL (June 19-21).
- ICML 2010 will be in Haifa, Isreal (led by IBM folks). No official dates, yet.
- The ICML board will be holding elections for members soon: probably around 10 new. Requests for nominations will be sent out in the next month or two from IMLS.
- Sounds like tracks will be implemented in SPC vs. review -- just like *ACL. To me, clearly a good idea. Bidding on 600 papers sucks.
- There's discussion about having a combination of a "AAAI-style nectar track" and an "applications track." The idea would be that if you've published a paper at an applications conference (eg, ACL, CVPR, ISMB, etc.), you could rewrite >50% of it, sell it to an ICML audience, and resubmit. The goal would be to get some applications blood into ICML, and not simultaneously prevent people from submitting their apps papers to the apps conferences.

## 02 July 2008

## 24 June 2008

### ICML 2008 papers up

Here are the top words (stems) from ICML this year:

- learn (53) -- no kidding!
- model (20)
- kernel (17) -- can't seem to shake these guys
- estim (11)
- reinforc (10)
- linear (10) -- as in both "linear time" and "linear model"
- classif (10) -- no kidding
- process (9) -- yay nonparametric Bayes!
- analysi (9)
- supervis (8)
- structur (8) -- our paper is one of these
- rank (8) -- always popular in the web days
- effici (8) -- this is great!

## 23 June 2008

### Help! Contribute the LaTeX for your ACL papers!

## 21 June 2008

### ACS: ACL 2008 Summarization Track

For ACL this year, I was lucky to be the Area Chair for the summarization track. I know I've said it before, but we really got a ton of great papers this year. In the end, seven were accepted for presentation (note there are also some summarization-related papers that officially fell under "Generation" for which I was not the area chair). I would like to say that there was some sort of unifying theme this year, but there's not one that I can actually come up with. The papers were:

P08-1035 [bib]: **Tadashi Nomoto***A Generic Sentence Trimmer with CRFs*

P08-1036 [bib]: **Ivan Titov; Ryan McDonald***A Joint Model of Text and Aspect Ratings for Sentiment Summarization*

P08-1092 [bib]: **Fadi Biadsy; Julia Hirschberg; Elena Filatova***An Unsupervised Approach to Biography Production Using Wikipedia*

P08-1093 [bib]: **Qiaozhu Mei; ChengXiang Zhai***Generating Impact-Based Summaries for Scientific Literature*

P08-1094 [bib]: **Ani Nenkova; Annie Louis***Can You Summarize This? Identifying Correlates of Input Difficulty for Multi-Document Summarization*

P08-1054 [bib]: **Gerald Penn; Xiaodan Zhu***A Critical Reassessment of Evaluation Baselines for Speech Summarization*

P08-1041 [bib]: **Giuseppe Carenini; Raymond T. Ng; Xiaodong Zhou***Summarizing Emails with Conversational Cohesion and Subjectivity*

Most of these you can guess the contents more-or-less by their titles, but here's a quick run-down. Nomoto's is probably the hardest to guess. I dare-say it actually sounds a bit boring from just the title; it leads me to think it's yet another sentence compression method. But if you start reading the paper, you find out all about compression under dependency structures, summarization of Japanese text, and an fairly thorough evaluation.

The Titov and McDonald paper attempts to model associations between fine-grained user reviews of restaurants (eg: how do you rate the food versus the ambiance?) and the actual text in the reviews. This enables them to produce summaries that are specific to particular aspects of the review.

Biadsi, Hircshberg and Filatova present a model for producing biographies, by trying to identify biography-like sentences from Wikipedia (a source that's gaining more an more attention these days). One aspect that I found most interesting here was that they attempt to do full-on reference resolution and referring expression generation. This has always been something I've been too scared to touch, but they actually present some results that show that it's worthwhile!

Mei and Zhai talk about a sentence-retrieval method for summarizing scientific documents, which they gathered from DBLP. They take advantage of citation sentences (called "citances" by Marti Hearst and also explored deeply by Simone Teufel) and make a "citation" language model. This language model is interpolated with the standard document language model to perform extraction. This allows them to extract sentences that readers care about, not what the author thinks you should care about. The key limitation, of course, is that it only works once the paper has been cited for a while. (It would be nice to see how many citations you need before it's worth doing this.)

The paper by Nenkova and Louis describes a model for predicting if a batch of documents is going to be difficult (for a machine) to summarize. This is akin to the notion of query-difficulty that you see in IR. The results are about what you'd expect, but it's nice to see them played out. In particular, you see that more cohesive sets of documents are easier to summarize. Read the paper for more.Penn and Zhu look at how well Rouge works when trying to summarize speech. They look at both Switchboard data (telephone conversations) and lectures. They have many interesting findings that cast doubt not only on the role that Rouge plays in speech summarization, but also on what sort of baselines are reasonable for speech, and what role meta-data plays in speech summarization. If you care at all about the intersection of speech and summarization, this truly is a must-read.

Last but not least, Carenini, Ng and Zhou discuss the task of summarizing emails (following up on previously work by the same authors on roughly the same task). Since the most relevant past work appeared in WWW07, I'll actually comment on that too (maybe unknown to many readers). There is quite a bit of work here, that primarily aims at finding useful features for summarizing chains of emails. They look at things like cue words, semantic similarity, lexical similarity, "PageRank" measures and a handful of others. This is all done in a graph-based framework that is pieced together based on processing the email chain (from what I can tell, this is highly non-trivial).

After writing this, I think that maybe the connecting thread is that there's a lot of work that aims at doing summarization for things other than straight news stories. I think this is a fantastic step for the community. Keep up the good work, guys!

## 18 June 2008

### CL is open access

Obviously I think this is fantastic. Many many thanks go to Robert Dale and the other ACL and CL board members for making this happen.

## 15 June 2008

### ACL 2008 What-To-See'd

Yes, I know that I promised another post in the interim, but this doesn't really count in my mind.

## 11 June 2008

### Old school conference blogging

I occasionally like to read through old papers, like from the late 70s, early 80s, mostly in stats, but occasionally in NLP or ML. Mostly it's fun; often I learn things. The old NLP paper typically amuse me more than anything else, but it's a bit of a relief to see the set of things that were considered interesting or important 20-30 years ago. I was browsing through the ACL 1980 proceedings (I won't share the answer to "how old were you then?") on the anthology and came across a paper titled "Parsing." I thought that was quite impressive that someone would be audacious enough to title their paper "Parsing." I mean, can you imagine going to a talk at ACL 2008 and seeing the title "Machine Translation?" It's absurd.

So I read the paper.

Well, it turns out it's not really a research paper. It's the 1980 ACL's parsing area chair's

*impression*of all the parsing papers that year, and how they related to the previous year.

"Fantastic!" I thought.

These are like pre-web-era conference blogs. But it's not just some random guy blogging about his favorite ACL papers across a variety of areas, but rather the one person who probably knew all the parsing papers in ACL that year better than anyone. Having area chaired myself, I know what all the summarization submissions were about, and I know

*quite well*what all the

*accepted*submissions were about. (Or, at least, I did right after we issues accept/reject notifications.) If I had sat down right then and written a page summary, relating the the past year, it probably would have taken about two hours. (An average blog post takes 30 mins, but I figure I would want to be more diligent, and also be sure to look up what happened the year before.)

I would actually love to see something like that reinstated. I think it captures an important area that's missing from you standard conference blogs -- namely, that you get coverage. As I age, I attend fewer and fewer talks, so my conference blog posts are necessarily really spotty. But if a chair from every area were to write a one page summary, I think that would be great. And I really don't think it's that much added effort -- I easily spent 10-20 times that going over reviews, looking for inconsistencies, reading papers, etc.

However, while I think it's useful, I also don't think that it's really something that needs to be in our proceedings. So I'm going to try to start a trend. Every time I am area chair for a conference, I will post to the blog a summary of the papers in my area. If you ever are an area chair, and would like to participate, please contact me. Perhaps I'll be the only one who ever does this, but I hope not. I'll start with ACL 2008 summarization as my next post.

## 10 June 2008

### Evaluating topic models

It's probably also fair to say that I'm a fan of NLP.

So... let's think about it. A fan of Bayes, a fan of NLP. He must do topic modeling (ie LDA-style models), right? Well, no. Not really. Admittedly I have done something that looks like topic modeling (for query-focused summarization), but never really topic modeling for topic modeling's sake.

The main reason I have stayed away from the ruthless, yet endlessly malleable, world of topic modeling is because it's notoriously difficult to evaluate. (I got away with this in the summarization example because I could evaluate the summaries directly.) The purpose of this post is to discuss how one can try to evaluate topic models.

At the end of the day, most topic models are just probabilistic models over documents (although sometimes they are models over collections of documents). For a very simple example, let's take LDA. Here, we have . In the particular case of LDA, the first two "p"s are Dirichlet, and the last two "p"s are Multinomial, where z is an indicator selecting a mixture. For simplicity, though, let's just collapse all the hyperparameters into a single variable a and the true parameters into a single parameter z and just treat this as and let's assume p and q are not conjugate (if they were, life would be too easy).

Now, we want to "evaluate" these models. That is, we want to check to see if they really are good models of documents. (Or more precisely, are the better models of documents than whatever we are comparing against... perhaps I've proposed a new version of p and q that I claim is more "life-like.") Well, the natural thing to do would be to take some held-out data and evaluate according to the model. Whichever model assigns higher probability to the heldout data is probably better.

At this point, we need to take a moment to talk about inference. There's the whole Monte Carlo camp and there's the whole deterministic (variational, Laplace, EP, etc.) camp. Each gives you something totally different. In the Monte Carlo camp, we'll typically get a set of R-many (possibly weighted) samples from the joint distribution p(a,z,w). We can easily "throw out" some of the components to arrive at a conditional distribution on whatever parameters we want. In the deterministic camp, one of the standard things we might get is a type-II maximum likelihood estimate of a given the training data: i.e., a value of a that maximizes p(a|w). (This is the empirical Bayes route -- some deterministic approximations will allow you to be fully Bayes and integrate out a as well.)

Now, back to evaluation. The issue that comes up is that in order to evaluate -- that is, in order to compute , we have to do more inference. In particular, we have to marginalize over the zs for the heldout data. In the MC camp, this would mean taking our samples to describe a posterior distribution on a given w (marginalizing out z) and then using this posterior to evaluate the heldout likelihood. This would involve another run of a sampler to marginalize out the zs for the new data. In the deterministic camp, we may have an ML-II point estimate of the hyperparameters a, but we still need to marginalize out z, which usually means basically running inference again (eg., running EM on the test data).

All of this is quite unfortunate. In both cases, re-running a sampler or re-running EM, is going to be computationally expensive. Life is probably slightly better in the deterministic camp where you usually get a fairly reasonable approximation to the evidence. In the MC camp, life is pretty bad. We can run this sampler, but (a) it is usually going to have pretty high variance and, (b) (even worse!) it's just plain hard to evaluate evidence in a sampler. At least I don't know of any really good ways and I've looked reasonably extensively (though "corrections" to my naivete are certainly welcome!).

So, what recourse do we have?

One reasonable standard thing to do is to "hold out" data in a different way. For instance, instead of holding out 10% of my documents, I'll hold out 10% of my words in each document. The advantage here is that since the parameters z are typically document-specific, I will obtain them for every document in the process of normal inference. This means that (at least part of) the integration in computing p(w|a) disappears and is usually tractable. The problem with this approach is that in many cases, it's not really in line with what we want to evaluate. Typically we want to evaluate how well this model models totally new documents, not "parts" of previously seen documents. (There are other issues, too, though these are less irksome to me in a topic-modeling setting.)

Another standard thing to do is to throw the latent variables into some sort of classification problem. That is, take (eg) the 20newsgroups data set, training and test combined. Run your topic model and get document-level parameters. Use these as parameters to, say, logistic regression and see how well you do. This definitely gets around the "test on new data" problem, is not really cheating (in my mind), and does give you an estimate. The problem is that this estimate is cloaked behind classification. Maybe there's no natural classification task associated with your data, or maybe classification washes out precisely the distinctions your model is trying to capture.

The final method I want to talk about I learned from Wei Li and Andrew McCallum and is (briefly!) described in their Pachinko Allocation paper. (Though I recall Andrew telling me that the technique stems---like so many things---from stats; namely, it is the empirical likelihood estimate of Diggle and Gratton.

The key idea in empirical likelihood is to replace our statistically simple but computationally complex model p with a statistically complex but computationally simple model q. We then evaluate likelihood according to q instead of p. Here's how I think of it. Let's say that whatever inference I did on training data allowed me to obtain a method for sampling from the distribution p(w|a). In most cases, we'll have this. If we have an ML-II estimate of a, we just follow the topic models' generative story; if we have samples over a, we just use those samples. Easy enough.

So, what we're going to do is generate a ton of faux documents from the posterior. On each of these documents, we estimate some simpler model. For instance, we might simply estimate a Multinomial (bag of words) on each faux document. We can consider this to now be a giant mixture of multinomials (evenly weighted), where the number of mixture components is the number of faux documents we generated. The nice thing here is that evaluating likelihood of test data under a (mixture of) multinomials is really easy. We just take our test documents, compute their average likelihood under each of these faux multinomials, and voila -- we're done!

This method is, of course, not without it's own issues. For one, a multinomial might not be a good model to use. For instance, if my topic model says anything about word order, then I might want to estimate simple n-gram language models instead. The estimates might also have high variance -- how many faux documents do I need? Some sort of kernel smoothing can help here, but then that introduces additional bias. I haven't seen anyone do any evaluation of this for topic-model things, but it would be nice to see.

But overall, I find this method the least offensive (ringing praise!) and, in fact, it's what is implemented as part of HBC.

## 29 May 2008

### Measures of correlation

The most straightforward measure, in my mind, is mutual information. If we have two words "Y" and "Y", then the mutual information is: MI(X;Y) = \sum_x \sum_y p(X=x,Y=y) \log [p(X=x,Y=y) / (p(X=x)p(Y=y))] (sorry, my LaTeX plugin seems to have died). In the case of collocation, the values x and y can take are usually just "does occur" and "does not occur." Here, we're basically asking ourselves: do X and Y take on the same values more often than chance. I.e., do they seem roughly statistically independent. The mutual information statistic gives, for every X/Y pair (every pair of words) a score. We can then sort all word pairs by this score and find the strongest collocations.

One issue with MI seems to be that it doesn't really care if pairs are common or not. Very infrequent (typically noisy) pairs seem to pop to the front. One way to fix this would be to add an additional "count" term to the front of the mutual information to weight high-frequency pairs more. This is quite similar to the RlogF measure that's locally quite popular.

The next set of methods seem to be based mostly on the idea of assuming a generative model for the data and doing hypothesis testing. I.e., you can have one model that assumes the two words are independent (the null hypothesis), and one that assumes they are not. Typically this is implemented as multinomials over words, and then a classical statistical test is applied to the estimated maximum likelihood parameters of the data. You can use Dice coefficient, t-score, chi-squared, or even something simpler like the log-likelihood ratio.

Although I'm not completely familiar with these techniques, they seem like they would also be sensitive to noise in the MLE, especially for low-frequency terms (of which we know there are a lot). It seems plausible to just directly do a Bayesian hypothesis test, rather than a classical one, but I don't know if anyone has done this.

In writing this, I just came across collocations.de, which among other things, has a nice survey of these results from an ESSLLI tutorial in 2003. They seem to conclude that for extracting PP/verb collocations, t-score and frequency seem to work best, and for extracting adjective/noun collocations, log-likelihood, t-score, Fisher, and p-values seem to work best. The intersection just contains t-score, so maybe that's the way to go. Of course, these things are hard to validate because so much depends on what you're trying to do.

## 22 May 2008

### Continuing bad ideas

The problem is: what if they did it wrong (in my opinion). There are many ways for doing things wrong and it would be hard for me to talk about specifics without pointing fingers. But just for concreteness, let's take the following case relating to the final example in the above paragraph. Say that we're doing POS tagging and the only prior work POS tagging paper used as test data every tenth sentence in WSJ, rather than the last 10%. This is (fairly clearly) totally unfair. However, it leaves me with the following options:

- I can repeat their bad idea and test on every tenth sentence.
- I can point out (in the paper) why this is a bad idea, and evaluate on the last 10%.
- I can not point this out in the paper and just ignore their work and still evaluate on the last 10%.
- I can point out why this is a bad idea, evaluate on both the last 10% (for "real" numbers) and every tenth sentence (for "comparison" numbers).
- I can point out why this is a bad idea, reimplement their system, and run both mine and theirs on the last 10%.

(4) is tempting, because it allows me to "break the bad habit" but also compare directly. The problem is that if I really disbelieve the prior methodology, then these comparison numbers are themselves dubious: what am I supposed to learn from results that compare in an unreasonable setting? If they're great (for me), does that actually tell me anything? And if they're terrible (for me), does that tell me anything? It seems to depend on the severity of the "bad idea", but it's certainly not cut and dry.

(3) just seems intellectually dishonest.

(2) at first glance seems inferior to (4), but I'm not entirely sure that I believe this. After all, I'm not sure that (4) really tells me anything that (2) doesn't. I suppose one advantage to (4) over (2) is that it gives me some sense of whether this "bad idea" really is all that bad. If things don't look markedly different in (2) and (4), then maybe this bad idea really isn't quite so bad. (Of course, measuring "markedly different" may be difficult for some tasks.)

(1) also seems intellectually dishonest.

At this point, to me, it seems like (4) and (5) are the winners, with (2) not hugely worse than (4). Thinking from the perspective of how I would feel reviewing such a paper, I would clearly prefer (5), but depending on how the rest of the paper went, I could probably tolerate (2) or (4) depending on how egregious the offense is. One minor issue is that as a writer, you have to figure that the author of this prior work is likely to be a reviewer, which means that you probably shouldn't come out too hard against this error. Which is difficult because in order to get other reviewers to buy (4), and especially (2), they have to buy that this is a problem.

I'm curious how other people feel about this. I think (5) is obviously best, but if (5) is impossible to do (or nearly so), what should be done instead.

## 18 May 2008

### Adaptation versus adaptability

**not**the task; that would be standard multi-task learning.)

Obviously I think this is an fun problem (I publish on it, and blog about it reasonably frequently). It's fun to me because it both seems important and is also relatively unsolved (though certainly lots of partial solutions exist).

One thing I've been wondering recently, and I realize as I write this that the concept is as yet a bit unformed in my head, is whether the precise formulation I wrote in the first paragraph is the best one to solve. In particular, I can imagine many ways to tweak it:

- Perhaps we don't want just a classifier that will do well on the "target" domain, but will really do well on any of the source domains as well.
- Continuing (1), perhaps when we get a test point, we don't know which of the domains we've trained on it comes from.
- Continuing (2), perhaps it might not even come from one of the domains we've seen before!
- Perhaps at training time, we just have a bunch of data that's not neatly packed into "domains." Maybe one data set really comprises five different domains (think: Brown, if it weren't labeled by the source) or maybe two data sets that claim to be different domains really aren't.
- Continuing (4), perhaps the notion of "domain" is too ill-defined to be thought of as a discrete entity and should really be more continuous.

Here's one possible client: Google (or Yahoo or MSN or whatever). Anyone who has a copy of the web lying around and wants to do some non-trivial NLP on it. In this case, the test data really comes from a wide variety of different domains (which may or may not be discrete) and they want something that "just works." It seems like for this client, we must be at (2) or (3) and not (1). There may be some hints as to the domain (eg., if the URL starts with "cnn.com" then maybe--though not necessarily--it will be news; it could also be a classified ad). The reason why I'm not sure of (2) vs (3) is that if you're in the "some labeled target data" setting of domain adaptation, you're almost certainly in (3); if you're in the "lots of unlabeled target data" setting, then you may still be in (2) because you could presumably use the web as your "lots of unlabeled target data." (This is a bit more like transduction that induction.) However, in this case, you are now also definitely in (4) because no one is going to sit around and label all of the web as to what "domain" it is in. However, if you're in the (3) setting, I've no clue what your classifier should do when it gets a test example from a domain that (it thinks) it hasn't seen before!

(4) and (5) are a bit more subtle, primarily because they tend to deal with what you believe about your data, rather than who your client really is. For instance, if you look at the WSJ section of the Treebank, it is tempting to think of this as a single domain. However, there are markedly different types of documents in there. Some are "news." Some are lists of stock prices and their recent changes. One thing I tried in the Frustratingly Easy paper but that didn't work is the following. Take the WSJ section of the Treebank and run document clustering on it (using bag of words). Treat each of these clusters as a separate domain and then do adaptation. This introduces the "when I get a test example I have to classify it into a domain" issue. At the end of the day, I couldn't get this to work. (Perhaps BOW is the wrong clustering representation? Perhaps a flat clustering is a bad idea?)

Which brings us to the final question (5): what really is a domain? One alternative way to ask this is: give data partitioned into two bags, are these two bags drawn from the same underlying distribution. Lots of people (especially in theory and databases, though also in stats/ML) have proposed solutions for answering this question. However, it's never going to be possible to give a real "yes/no" answer, so now you're left with a "degree of relatedness." Furthermore, if you're just handed a gigabyte of data, you're not going to want to try ever possible split into subdomains. One could try some tree-structured representation, which seems kind of reasonable to me (perhaps I'm just brainwashed because I've been doing too much tree stuff recently).

## 12 May 2008

### Teaching machine translation

I had a significant problem, however, teaching machine translation. Here's the problem.

Students knew all about FSTs because we used them to do all the named-entity stuff in the first third of class. This enabled us to talk about things like IBM model 1 and the HMM model. (There's a technical difficult here, namely dealing with incomplete data, so we talk about EM a little bit.) We discuss, but they don't actually make use of, higher order MT models.

Now, we all know that there's a lot more to MT than model 4 (even limiting oneself to statistical translation techniques). Namely, there are phrase-based models and syntactic models. We had a very brief (one lecture) overview of syntactic models at the end. My beef is with phrase-based models.

The problem is that we've gone though all this prettiness to develop these word-based models, and then I have to teach them grow-diag-final, phrase extraction and phrase scoring. I almost felt embarrassed doing so. The problem is that these things are obviously so heuristic that throwing them on top of this really pretty word-for-word translation model just kills me. And it's not just me: the students were visibly upset by the lack of real modeling behind these techniques.

One option would be just not to teach this stuff. I don't really think that it sheds much light on the translation process. The reason I don't like this solution is because it's nice to be able to say that they will have a handle on a not-too-difficult to understand/implement method for doing real-world MT. Instead, I could just spend that time on syntactic models. The situation there is better (you can talk about the hierarchy of tree transducers, etc.), but not perfect (eg., all the work that goes in to rule extraction is not too dissimilar from all the work that goes into phrase extraction).

I suppose that this is just the defacto problem with a relatively immature field: there hasn't been enough time for us to really tease apart what's actually going on in these models and try to come up with some coherent story. I'd love a story that doesn't involve first doing word alignment and, is, in some sense, integrated.

## 23 April 2008

### A Bit of Levity

Semester's over; back to regular blogging soon!