15 December 2008

Interesting NIPS papers, take 1

I just got back from NIPS. Kevin Duh was nice enough to forward his "top N" list of NIPS papers; I'll post my own shortly. Thanks Kevin!

"Large Margin Taxonomy Embedding for Document Categorization" - Kilian Weinberger, Olivier Chapelle (Yahoo)
- Suppose you have a multi-class classification problem where you need to assign documents to different nodes in a topic hierarchy. While there are hierarchical classifiers for solving this problem, the authors instead propose to embed the taxonomy in a continuous space and use regression. The idea is as follows: from a taxonomy, we can compute distances between nodes that characterize the loss of predicting one node when the true class is another. This pairwise distance matrix is fed to multidimensional scaling to create a set of prototype vectors in Euclidean space, one for each class. Then we train a regression that maps training samples to these prototype vectors. The problem with this two-stage approach, however, is that the prototypes are computed without regard to the training data, so the solution may be suboptimal for classification. The paper therefore introduces an objective function that combines both steps: essentially, we constrain the mapping of the training samples and the prototypes to have a large margin.
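Here is a rough sketch of the two-stage baseline (MDS embedding of the taxonomy, regression onto prototypes, nearest-prototype prediction) on toy data; the sklearn calls and the tiny distance matrix are stand-ins of mine, not anything from the paper:

```python
# Two-stage baseline sketch: embed classes by MDS on taxonomy distances,
# regress documents onto their class prototypes, predict by nearest prototype.
# (The paper's contribution is the joint large-margin objective, not this.)
import numpy as np
from sklearn.manifold import MDS
from sklearn.linear_model import Ridge

# Toy pairwise taxonomy distances between 4 classes (e.g., tree distances).
D = np.array([[0, 1, 2, 2],
              [1, 0, 2, 2],
              [2, 2, 0, 1],
              [2, 2, 1, 0]], dtype=float)

# Step 1: MDS gives one prototype vector per class in Euclidean space.
prototypes = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=0).fit_transform(D)              # shape (4, 2)

# Toy "documents": 20 samples with 10 features and labels in {0,...,3}.
rng = np.random.RandomState(0)
X = rng.randn(20, 10)
y = rng.randint(0, 4, size=20)

# Step 2: regress each document onto its class prototype.
reg = Ridge(alpha=1.0).fit(X, prototypes[y])

# Prediction: map into the embedding space, pick the nearest prototype.
z = reg.predict(X)                                             # shape (20, 2)
pred = ((z[:, None, :] - prototypes[None]) ** 2).sum(-1).argmin(axis=1)
```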

"Learning taxonomies by Dependence Maximization" - Matthew Blaschko, Arthur Gretton (MPI)
- Our goal is to cluster a dataset and provide a taxonomy that shows the relationship between clusters. The standard solutions are agglomerative/divisive hierarchical clustering. This paper proposes an alternative solution that allows us to use kernels (and is thus related to spectral clustering). The idea is based on a kernel measure of dependence: roughly speaking, if K is the kernel matrix of the original data and L is the kernel matrix of the resulting clustering, the objective max_{L} trace(K*L) measures the dependence between samples and clusters and is thus a viable clustering objective. The method gets a taxonomy by formulating L = P*Y*P', where P is a partition matrix (mapping clusters to samples) and Y is a positive semi-definite matrix that encodes relationships between clusters.
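To make the objective concrete, here's a toy evaluation of trace(K*L) for one fixed clustering, with L = P*Y*P'; the RBF kernel, the centering, and the identity Y are my choices for illustration, not the paper's actual optimization:

```python
# Evaluate the dependence objective trace(K * L) with L = P Y P' for a fixed
# candidate clustering of two toy blobs (the paper optimizes over P and Y).
import numpy as np

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(10, 2), rng.randn(10, 2) + 4.0])   # two separated blobs
n = len(X)

# RBF kernel on the data, centered (as in HSIC-style dependence measures).
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)
H = np.eye(n) - np.ones((n, n)) / n
Kc = H @ K @ H

# P assigns each sample to a cluster; Y (PSD) encodes cluster relationships.
labels = np.array([0] * 10 + [1] * 10)
P = np.eye(2)[labels]                 # shape (n, 2)
Y = np.eye(2)                         # identity = flat (taxonomy-free) clustering
L = P @ Y @ P.T

print("dependence objective:", np.trace(Kc @ L))
```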

"Fast Prediction on a Tree" - Mark Herbster, Massimiliano Pontil, Sergio Rojas (UCL)
- Graph-based semi-supervised learning needs to scale well with the number of unlabeled samples in order to be truly useful in large-data scenarios. This paper presents a method to improve the computational scalability of Laplacian-based methods: first, convert the data graph to a tree (using, e.g., a maximum spanning tree algorithm); second, compute the pseudo-inverse of the tree Laplacian with a fast procedure that runs in O(m^2 + mS), where m is the number of labeled samples and S is the tree diameter. This Laplacian pseudo-inverse corresponds to a kernel, so one can use, say, a kernel perceptron to predict on test points. Experiments show that the tree approximation to the graph did not hurt accuracy, while drastically increasing speed.
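Here's a quick sketch of the pieces (spanning tree, tree Laplacian, pseudo-inverse as a kernel) on toy data. Note that I just call a dense pinv, which completely misses the point of their fast O(m^2 + mS) computation; it only shows what object ends up being used as the kernel:

```python
# Approximate a data graph by a spanning tree, then use the pseudo-inverse of
# the tree Laplacian as a kernel over the nodes. (Dense pinv here is only for
# illustration; the paper's contribution is computing this quickly.)
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.RandomState(0)
X = rng.randn(30, 2)

# Pairwise squared distances. For RBF edge weights exp(-d^2), the maximum
# spanning tree of the similarity graph is the minimum spanning tree of d^2.
d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
T = minimum_spanning_tree(d2).toarray()
A = np.maximum((T != 0), (T != 0).T).astype(float)   # symmetric tree adjacency

# Tree Laplacian and its pseudo-inverse; the pseudo-inverse acts as a kernel
# that could be plugged into, say, a kernel perceptron.
Lap = np.diag(A.sum(axis=1)) - A
K = np.linalg.pinv(Lap)
```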

"Unlabeled data: Now it helps, now it doesn't" - Aarti Singh, Rob Nowak, Jerry Zhu (Wisconsin)
- This is an interesting theoretical paper that analyzes when unlabeled data helps under the cluster assumption. First, the authors argue that asymptotic analysis is unsuitable for analyzing the difference between supervised learning and SSL, and instead use finite-sample analysis and minimax bounds. Let n be the number of labeled samples, m the number of unlabeled samples, d the feature dimension, and g the margin between two classes (which can be positive or negative). The proof is of the form: suppose a clairvoyant supervised learner with full knowledge of the underlying density p(x) has error less than e2(n), and any supervised learner has error greater than e1(n). Then the error of SSL is no more than e2(n) + O(some function of m). Thus, if O(some function of m) is negligible (and this depends on the exact values of m, d, g, n), then SSL will improve over supervised learning; otherwise, no. In words, the cases where SSL helps are as follows: if the margin g is relatively large compared to the average spacing between labeled points (n^{-1/d}), then supervised learning can discover p(x) accurately and works just as well as SSL. However, if g is small relative to the spacing between labeled points, but large relative to the spacing between unlabeled points (m^{-1/d}), then SSL will beat any supervised learner. In the case that the margin is negative, if -g is larger than m^{-1/d}, then SSL also wins.
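Paraphrasing the three regimes in the notation above (this is my compressed summary; constants and technical conditions are omitted):

```latex
% labeled-point spacing ~ n^{-1/d}, unlabeled-point spacing ~ m^{-1/d}
\begin{align*}
g \gtrsim n^{-1/d} &\;\Rightarrow\; \text{supervised learning alone suffices; SSL gives no real gain}\\
m^{-1/d} \lesssim g \lesssim n^{-1/d} &\;\Rightarrow\; \text{SSL beats any supervised learner}\\
g < 0,\ |g| \gtrsim m^{-1/d} &\;\Rightarrow\; \text{SSL also wins}
\end{align*}
```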

"DiscLDA: Discriminative learning for dimensionality reduction and classification" - Simon Lacoste-Julien, Fei Sha, Michael Jordan (Berkeley/USC)
- Unsupervised topic models have become popular methods for finding latent structure in text documents. These are usually trained by maximum likelihood, but this may be suboptimal if our final goal is classification. This paper considers the problem of introducing labeled data (e.g. topic labels) into topic models. Recall that in Latent Dirichlet Allocation (LDA), for each document, we first draw a (k-dimensional) topic mixture from a Dirichlet prior. Then we draw words according to p(word|topic)p(topic|topic-mixture). We can view each document as a point on the topic simplex. The idea here is to introduce a transformation T on the topic simplex, so that documents with the same label will be mapped close together.
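To see the geometric idea, here's a tiny sketch: if the class-dependent map T_y is column-stochastic, it sends points on the topic simplex to points on the topic simplex, so documents can be moved around based on their labels. This is just the intuition; it is not DiscLDA's actual parameterization or training procedure:

```python
# If T_y has non-negative entries and columns summing to one, then T_y @ theta
# is again a valid topic mixture (it stays on the simplex). Toy numbers only.
import numpy as np

rng = np.random.RandomState(0)
K_out, K_in = 4, 3                       # transformed / original topic dimensions

# One column-stochastic transformation per class label (2 classes here).
T = rng.rand(2, K_out, K_in)
T /= T.sum(axis=1, keepdims=True)        # make each column sum to 1

theta = np.array([0.7, 0.2, 0.1])        # a document's topic mixture
for y in (0, 1):
    mapped = T[y] @ theta
    print("class", y, "->", mapped.round(3), "sum =", mapped.sum())
```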

"Modeling the effects of memory on human online sentence processing with particle filters" - Roger Levy (UCSD), Florencia Realia, Tom Griffiths (Berkeley)
- Humans comprehend sentences in an online manner: it is believed that we do incremental parsing as we hear words one at a time. Thus, garden-path sentences are able to catch us off-guard. Moreover, the longer the sentence goes on before a disambiguation point is reached, the harder it is for humans to recover (the digging-in effect). This is a psycholinguistics paper that seeks to explain garden-path and digging-in effects with a novel particle-filter-based PCFG parser: essentially, whenever a word is received, a partial parse is sampled. The number of "incorrect" particles increases with sentence length (modeling digging-in), and the number of particles used corresponds to the memory constraints of the brain.
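Here's a toy particle-filter analogue (emphatically not their PCFG parser) that shows the qualitative effect: with a small number of particles and analyses that never change once sampled, all particles can commit to the wrong reading during an ambiguous prefix and then fail to recover when the disambiguating word arrives. The probabilities and "sentence" below are made up:

```python
# Sequential importance resampling over two competing "readings" of a sentence.
# Ambiguous words early, a disambiguating word at the end. With few particles,
# the correct reading can die out before disambiguation -- a crude picture of
# garden-path / digging-in effects.
import numpy as np

rng = np.random.RandomState(1)
N_PARTICLES = 20

# P(word | reading): words 0 and 1 are ambiguous, word 2 favors reading 1.
emit = np.array([[0.45, 0.45, 0.10],    # reading 0 (slightly favored early)
                 [0.40, 0.40, 0.20]])   # reading 1 (the "correct" one)

sentence = [0, 1, 0, 1, 2]              # ambiguous prefix, then disambiguation

particles = rng.randint(0, 2, size=N_PARTICLES)   # initial guesses

for word in sentence:
    w = emit[particles, word]           # weight particles by the new word
    w /= w.sum()
    # Resample: readings that lose all their particles can never come back.
    particles = particles[rng.choice(N_PARTICLES, size=N_PARTICLES, p=w)]
    print("word", word, "-> fraction on reading 1:", (particles == 1).mean())
```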

"Tighter bounds for structured estimation" - Olivier Chapelle, et. al. (Yahoo/Stanford/NICTA)
- A common approach in optimizing difficult loss functions is to minimize a convex upper bound instead (e.g. hinge loss in SVM's). However, these losses are often loose. In particular, outliers often suffer large loss, so the general classifier accuracy may be sacrificed since the optimizer focuses on these extremely difficult points. The idea here is to use a non-convex, but tighter upper bound. They adopt a ramp-loss for the structured prediction problem and use the convex-concave procedure to solve it.
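For intuition, here's the hinge loss next to a ramp loss on a few margins; the cap means a badly misclassified outlier contributes only a bounded penalty instead of dominating the objective:

```python
# Hinge vs. ramp loss as functions of the (signed) margin. The ramp loss clips
# the hinge at 1 - s, so it remains a (non-convex) upper bound on 0/1 loss
# while bounding the contribution of extreme outliers.
import numpy as np

def hinge(margin):
    return np.maximum(0.0, 1.0 - margin)

def ramp(margin, s=-1.0):
    return np.minimum(hinge(margin), 1.0 - s)

margins = np.array([-5.0, -1.0, 0.0, 0.5, 2.0])
print("hinge:", hinge(margins))   # 6, 2, 1, 0.5, 0
print("ramp :", ramp(margins))    # 2, 2, 1, 0.5, 0
```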

07 December 2008

Two Reviewer Comments that Stuck with Me

Over the years I've gotten a number of positive reviews and negative reviews. I usually remember few of the details of any of them more than a few months after the good or bad news. Two reviewer quotes, however, have really stuck with me. They were both in overall positive reviews, though one of them sounds more negative. Obviously I don't know who wrote them, but they've actually had a strong impact on most of my research and my own reviewing since:

  1. "Other approaches don't have to be bad in order for your approach to be good."
  2. "If I were working in this area, I would want to know about these results."
You can guess what I might have written that had caused this reviewer to make the first comment. And actually I've come to think of these two comments as basically saying the same thing.

I feel like we too often see research in a competitive light. There are two things that can cause this. The first is funding. Short-term funding (in the US) is essentially a zero-sum game, which means that I can win only if you lose. (There are models where this is less true, but usually that has other -- not necessarily desirable -- outcomes.)
The second is the scooping effect: many times when we have (what we think is) a cool idea, we want to "beat" everyone else to the punch. I recall a comment John Langford made on his blog (which I cannot seem to find now because I can't figure out what search terms to use) quite some time ago along these lines... saying that for many problems he doesn't care who finds a solution so long as the solution is found. I usually have mixed feelings. For some topics, I definitely feel this way. Indeed, for some topics I actually don't want to be the person who figures it out, either because I don't feel I have the necessary skills or because I don't have the necessary time. For some topics, I do get a feeling of ownership and really want to be the person who does it. My thesis work was like that, as was my document/abstract alignment work. This is definitely highly personal: I know plenty of people who care a lot more about ownership than I do, and many who care a lot less.

What I took away from this comment is essentially the realization that we are all working toward some vague future goal, which has to do with computationalizing language processing (or some other topic, for the non-NLP audience). Progress is good. If I've done work that has something interesting and novel to say about this goal, then it's not bad -- and is often good -- that this builds on and improves on your work.

01 December 2008

Where did my (linear) model go wrong?

Back to research-related posts for a while :).

Let's say we're learning a predictor f and it doesn't get 100% test accuracy. We can ask ourselves: why not? Typically, the reason I ask myself this question is because I want to get closer to 100% test accuracy and want to know how I should go about it. Should I add more labeled data (and what should it look like), should I add more features (and what should they look like), or am I just hosed?

In order to try to get a handle on this, we can ask ourselves: where can error come from:

  1. Noise in the training data (which we have fit and is now screwing us up at test time)
  2. Noise in the test data (which is causing our otherwise correct predictions to look wrong)
  3. Insufficient representation (we just don't have the right/enough features)
  4. Insufficient examples (our training data wasn't sampled densely enough in some region)
These are not mutually exclusive. For instance, maybe by reducing/changing the feature set (3), it would turn out that we are sampled densely enough in a region of interest (4).

I'm going to use (binary) text categorization as a working example because it's simple yet exhibits NLP-like properties (ever growing feature sets, sparse feature representations, etc.). So let's say we train a model (perhaps linear) on a bag of words feature set and apply it on test data. Say we get 90% accuracy on dev data. Now what can we do?

Let's take a single example that we misclassified. Perhaps take the "worst" (i.e., most misclassified in a margin sense), though I don't know that it matters. Which of the four reasons above tells us why this example was misclassified?

By looking at the example (yes, by hand!) we can probably tell if it's a problem due to issue (2: Noise in the test data). We might like to try to automate this ascertainment, but honestly I'm at a loss for how to do this. Let's say we decide that the problem isn't due to test set noise. Now what?

Let's consider the following approach. We are going to retrain our classifier several times (okay, this is expensive, but we can worry about this later). What we're going to do is add this misclassified dev point into the training data with a varying cost. The currently trained model we will call f(0), since it is based on giving this dev point a cost of zero. We can then train f(c) for a large range of costs c and let C be some reasonable upper bound (i.e., we guarantee that C is big enough that f(C) correctly classifies this point -- for any reasonable classifier, there should be such a finite C). Go ahead and replace "correctly classifies" with "classifies with a sufficiently large margin" if you'd prefer; I doubt it matters.

Now, we're going to look at two of these fs. We'll look at f(0) and f(c'), where c' is the minimal value of c such that this dev example becomes correctly classified. Let's say we now run these two fs on the training data. We know that f(0) will misclassify our "selected" test point and that f(c') will not. The big question is what do the fs do on the other points.
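Here's roughly what I have in mind, sketched with an off-the-shelf logistic regression and sklearn's sample_weight standing in for the cost c, on synthetic data rather than bags of words:

```python
# Retrain with the misclassified dev point added at increasing cost c, find the
# smallest c (call it c') that flips the point, and count how many *new*
# training errors f(c') makes relative to f(0).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_train = rng.randn(200, 20)
y_train = (X_train[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)

# A dev point that f(0) is likely to get wrong (by construction here).
x_dev = rng.randn(20)
x_dev[0] = -1.0
y_dev = 1

def train_with_cost(c):
    """Train on the original data plus the dev point with weight c (c=0 gives f(0))."""
    X = np.vstack([X_train, x_dev])
    y = np.append(y_train, y_dev)
    w = np.append(np.ones(len(y_train)), c)
    return LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)

f0 = train_with_cost(0.0)
base_errors = set(np.flatnonzero(f0.predict(X_train) != y_train))

# Sweep c upward until the dev point is classified correctly; that's f(c').
for c in [2.0 ** k for k in range(-1, 12)]:
    fc = train_with_cost(c)
    if fc.predict(x_dev.reshape(1, -1))[0] == y_dev:
        new_errors = set(np.flatnonzero(fc.predict(X_train) != y_train))
        print("c' =", c, "| extra training errors:", len(new_errors - base_errors))
        break
else:
    print("dev point never flipped in this range of costs")
```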
  1. Suppose that f(c') doesn't make any (many?) more mistakes than f(0). That is, they're basically the same, just now classifying our selected point correctly. This suggests that the problem is (3) or (4) above (features or examples).
  2. Suppose that f(c') makes many more mistakes than f(0). Now, we see that in order to get this selected test point correct, we have to pay by getting other things wrong (that we didn't before). This suggests that the problem is (1) or (3) above (noise or features). Importantly, it's not (4: examples).
At this point, we have separated causes (3 or 4) from (1 or 3), but haven't separated them completely.

Now, let's go back and do the same process for all of the dev points that were misclassified. What can happen?
  1. Almost all of the f(c')s make no more errors on other training points. Unless all of these erroneous dev points are markedly different from the training data (in which case we really have a domain adaptation problem), this is almost certainly a feature issue (3).
  2. Almost all of the f(c')s make many more errors on the other training points, and the set of training points on which they make these errors is usually the same. Then this is almost certainly noisy training data (or noisy test data).
  3. Almost all of the f(c')s make many more errors on the other training points, but the set of training points on which they err is substantially different each time. Then this is almost certainly a feature issue (3).
  4. Mixed results: some f(c')s make more errors on training, some don't. This is harder to say, but my gut tells me that this is probably a (4: example) issue.
There's a lot of hedging going on here: "almost", "many", "different", etc. Maybe you could formalize these things, but I'd rather get the intuition right first.
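If you wanted to mechanize the case analysis anyway, it might look something like the sketch below; the inputs would come from running the f(0)/f(c') comparison for each misclassified dev point, and the thresholds standing in for "many", "almost all", and "the same set" are completely made up:

```python
# Turn per-dev-point results into one of the four (heuristic) verdicts above.
# `results` holds, for each misclassified dev point, the set of training points
# on which f(c') errs but f(0) did not. All thresholds are placeholders.
def diagnose(results, many=5, overlap=0.5):
    hurt = [r for r in results if len(r) >= many]
    if len(hurt) <= 0.2 * len(results):
        return "retraining is mostly harmless -> likely a feature issue (3)"
    if len(hurt) >= 0.8 * len(results):
        common = set.intersection(*hurt)
        if min(len(common) / len(r) for r in hurt) >= overlap:
            return "the same training points keep breaking -> likely noise (1)"
        return "different training points break each time -> likely features (3)"
    return "mixed results -> possibly too few examples (4)"

# Hypothetical outcome: every retrain breaks (roughly) the same training points.
print(diagnose([{3, 7, 9, 12, 20}, {3, 7, 9, 12, 21}, {3, 7, 9, 12, 22}]))
```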

(If you're using a margin based classifier, you might not have to exactly retrain each time. Koby Crammer's passive aggressive algorithm will essentially give you a closed form solution for "closest (in L2) weight vector to the current weight vector such that a given point is correctly classified," which could save some computational effort.)
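For reference, the basic passive-aggressive update is exactly that projection: the L2-closest weight vector to the current one that classifies the given point with margin at least one (this is the unbounded PA variant; PA-I just caps the step size):

```python
# Closed-form projection: argmin_w' ||w' - w||^2 subject to y * (w' . x) >= 1.
import numpy as np

def pa_update(w, x, y):
    """y in {-1, +1}; returns the L2-closest w' to w with y * (w' . x) >= 1."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))
    tau = loss / np.dot(x, x)
    return w + tau * y * x

w = np.zeros(3)
x = np.array([1.0, -2.0, 0.5])
w_new = pa_update(w, x, +1)
print(w_new, np.dot(w_new, x))    # the new margin is exactly 1
```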

Note that I haven't actually tried this. I'd definitely be interested to, but wanted to think about it a bit before I shot off to implement something.