15 December 2008

Interesting NIPS papers, take 1

I just got back from NIPS. Kevin Duh was nice enough to forward his "top N" list of NIPS papers; I'll post my own shortly. Thanks Kevin!

"Large Margin Taxonomy Embedding for Document Categorization" - Kilian Weinberger, Olivier Chapelle (Yahoo)
- Suppose you have a multi-class classification problem where you need to assign documents to different nodes in a topic hierarchy. While there are hierarchical classifiers for solving this problem, the authors instead propose to embed the taxonomy in a continuous space and use regression. The idea is as follows: from a taxonomy, we can compute distances between nodes that characterize the loss of predicting one node when the true class is the other. This pairwise distance matrix is used by multidimensional scaling to create a set of prototype vectors in Euclidean space, one for each class. Then we train a multivariate regression that maps training samples to these prototype vectors. The problem with this two-stage approach is that the prototypes are computed without regard to the training data, so the solution may be suboptimal for classification. The paper therefore introduces an objective function that combines both steps: essentially, the mapping of the training samples and the prototypes are constrained to have a large margin. A rough sketch of the two-stage baseline is below.
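
To make the two-stage baseline concrete, here is a minimal sketch (not the authors' code; it uses scikit-learn's MDS and ridge regression, and the function and parameter names are my own illustration):

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.linear_model import Ridge

def taxonomy_embedding_classifier(X_train, y_train, X_test, class_dist, n_dims=10):
    """class_dist[i, j] = taxonomy distance (misclassification cost) between classes i and j."""
    # Step 1: multidimensional scaling turns the pairwise class distances into
    # one prototype vector per class in Euclidean space.
    mds = MDS(n_components=n_dims, dissimilarity="precomputed", random_state=0)
    prototypes = mds.fit_transform(class_dist)       # shape (n_classes, n_dims)

    # Step 2: multivariate (ridge) regression maps documents to prototype space;
    # each training document's regression target is its class prototype.
    reg = Ridge(alpha=1.0).fit(X_train, prototypes[y_train])

    # Step 3: classify a test document by the nearest class prototype.
    Z = reg.predict(X_test)                           # shape (n_test, n_dims)
    dists = ((Z[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)
```

The paper's contribution is to learn the prototypes and the regression jointly under a large-margin objective, rather than in these two separate stages.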

"Learning taxonomies by Dependence Maximization" - Matthew Blaschko, Arthur Gretton (MPI)
- Our goal is to cluster a dataset and provide a taxonomy that shows the relationship between clusters. The standard solutions are agglomerative/divisive hierarchical clustering. This paper proposes an alternative that allows us to use kernels (and is thus related to spectral clustering). The idea is based on a kernel measure of dependence: roughly speaking, if K is the kernel matrix of the original data and L is the kernel matrix of the resulting clustering, the objective max_{L} trace(K*L) measures the dependence between samples and clusters and is thus a viable clustering objective. The method obtains a taxonomy by formulating L = P Y P', where P is a partition matrix (maps clusters to samples) and Y is a positive semi-definite matrix that encodes relationships between clusters.
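
A toy illustration of the dependence objective (my own, not the paper's code): given a data kernel K and a candidate clustering encoded as L = P Y P', the score trace(K*L) should be higher for clusterings that respect the kernel similarities.

```python
import numpy as np

def partition_matrix(labels, n_clusters):
    """P[i, c] = 1 if sample i is assigned to cluster c."""
    P = np.zeros((len(labels), n_clusters))
    P[np.arange(len(labels)), labels] = 1.0
    return P

def dependence_score(K, labels, Y=None):
    n_clusters = int(labels.max()) + 1
    P = partition_matrix(labels, n_clusters)
    if Y is None:
        Y = np.eye(n_clusters)     # flat clustering; a learned PSD Y would encode the taxonomy
    L = P @ Y @ P.T                # cluster-level kernel lifted back to the samples
    return np.trace(K @ L)

# Two well-separated groups under a linear kernel: the coherent clustering scores higher.
X = np.array([[0.0], [0.1], [5.0], [5.1]])
K = X @ X.T
print(dependence_score(K, np.array([0, 0, 1, 1])))   # coherent clustering: larger trace
print(dependence_score(K, np.array([0, 1, 0, 1])))   # scrambled clustering: smaller trace
```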

"Fast Prediction on a Tree" - Mark Herbster, Massimiliano Pontil, Sergio Rojas (UCL)
- Graph-based semi-supervised learning needs to scale well with the number of unlabeled samples in order to be truly useful in large-data scenarios. This paper presents a method to improve the computational scalability of Laplacian-based methods: first, convert the data graph to a tree (using, e.g., a maximum spanning tree algorithm). Second, the authors show a fast way to compute the pseudo-inverse of the graph/tree Laplacian in O(m^2 + mS), where m is the number of labeled samples and S is the tree diameter. This Laplacian pseudo-inverse corresponds to a kernel, so one can use, say, a kernel perceptron to predict on test points. Experiments show that tree approximations to the graph did not hurt accuracy, while drastically increasing speed.
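
Here is a naive end-to-end sketch of that pipeline (mine, not the paper's fast algorithm): approximate the data graph by a maximum-weight spanning tree, take the pseudo-inverse of the tree Laplacian as a kernel, and run a kernel perceptron on the labeled nodes. The whole point of the paper is that the pseudo-inverse step can be done far more cheaply than the dense np.linalg.pinv call used here.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def tree_kernel(W):
    """W: dense symmetric (n, n) similarity matrix; zeros mean 'no edge'."""
    # Maximum-weight spanning tree via a minimum spanning tree on order-reversed weights.
    flipped = np.where(W > 0, W.max() + 1.0 - W, 0.0)
    T = minimum_spanning_tree(flipped).toarray()
    A = np.where((T > 0) | (T.T > 0), W, 0.0)    # original weights, kept only on tree edges
    L = np.diag(A.sum(axis=1)) - A               # tree Laplacian
    return np.linalg.pinv(L)                     # kernel = Laplacian pseudo-inverse

def kernel_perceptron(K, labeled, y, epochs=10):
    """Train on labeled node indices (y is a numpy array of +/-1 labels); return a predictor."""
    alpha = np.zeros(len(labeled))
    for _ in range(epochs):
        for i, node in enumerate(labeled):
            if np.sign(np.dot(alpha * y, K[labeled, node])) != y[i]:
                alpha[i] += 1.0
    return lambda node: np.sign(np.dot(alpha * y, K[labeled, node]))
```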

"Unlabeled data: Now it helps, now it doesn't" - Aarti Singh, Rob Nowak, Jerry Zhu (Wisconsin)
- This is an interesting theoretical paper that analyzes when unlabeled data helps under the cluster assumption. First, the authors argue that asymptotic analysis is unsuitable for analyzing the difference between supervised learning and SSL, and instead use finite-sample analysis and minimax bounds. Let n be the number of labeled samples, m the number of unlabeled samples, d the feature dimension, and g the margin between two classes (which can be positive or negative). The proof is of the form: suppose a clairvoyant supervised learner with full knowledge of the underlying density p(x) has error less than e2(n), and a supervised learner has error greater than e1(n). Then the error of SSL is no more than e2(n) + O(some function of m). Thus, if the O(some function of m) term is negligible (and this depends on the exact values of m, d, g, n), SSL will improve over supervised learning; otherwise, it will not. In words, the cases where SSL helps are as follows: if the margin g is large relative to the average spacing between labeled points (n^{-1/d}), then supervised learning can discover p(x) accurately and works just as well as SSL. However, if g is small relative to the spacing between labeled points but large relative to the spacing between unlabeled points (m^{-1/d}), then SSL will beat any supervised learner. In the case that the margin is negative, if -g is larger than m^{-1/d}, then SSL also wins. A toy illustration of these regimes is sketched below.
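
The regime conditions can be illustrated with a few lines of arithmetic; this is my own paraphrase of the conditions above, not code or constants from the paper:

```python
def ssl_regime(n, m, d, g):
    """Compare the margin g to the typical labeled spacing n**(-1/d)
    and unlabeled spacing m**(-1/d) to see which regime a problem falls into."""
    labeled_spacing = n ** (-1.0 / d)
    unlabeled_spacing = m ** (-1.0 / d)
    if g >= labeled_spacing:
        return "margin resolvable from labeled data alone: SSL ~ supervised"
    if abs(g) >= unlabeled_spacing:
        return "margin resolvable only with unlabeled data: SSL helps"
    return "margin too small even for unlabeled data: neither helps"

# e.g. 100 labeled and 100,000 unlabeled points in d=2 dimensions, margin 0.05
print(ssl_regime(n=100, m=100_000, d=2, g=0.05))
```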

"DiscLDA: Discriminative learning for dimensionality reduction and classification" - Simon Lacoste-Julien, Fei Sha, Michael Jordan (Berkeley/USC)
- Unsupervised topic models have become popular methods for finding latent structure in text documents. They are usually trained by maximum likelihood, but this may be suboptimal if our final goal is classification. This paper considers the problem of introducing labeled data (e.g., topic labels) into topic models. Recall that in Latent Dirichlet Allocation (LDA), for each document we first draw a (k-dimensional) topic mixture from a Dirichlet prior. Then we draw words according to p(word|topic)p(topic|topic-mixture). We can thus view each document as a point on the topic simplex. The idea here is to introduce a class-dependent transformation T on the topic simplex, so that documents with the same label will be mapped close together.
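
A toy generative sketch of the idea (my own, heavily simplified, with made-up transformations): each class label owns a linear map on the topic simplex, so documents with the same label end up in nearby regions of topic space; the paper learns these maps discriminatively.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size = 4, 10
topics = rng.dirichlet(np.ones(vocab_size), size=n_topics)   # p(word | topic)

# Hypothetical class-specific transformations (columns are non-negative and sum
# to 1, so the transformed mixture stays on the simplex).
T = {0: np.eye(n_topics),
     1: np.roll(np.eye(n_topics), 1, axis=0)}

def generate_document(label, doc_len=20, alpha=0.5):
    theta = rng.dirichlet(alpha * np.ones(n_topics))   # topic mixture ~ Dirichlet prior
    theta_y = T[label] @ theta                          # class-dependent transform
    words = []
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=theta_y)             # draw a topic
        words.append(rng.choice(vocab_size, p=topics[z]))  # draw a word given the topic
    return words

print(generate_document(label=1))
```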

"Modeling the effects of memory on human online sentence processing with particle filters" - Roger Levy (UCSD), Florencia Realia, Tom Griffiths (Berkeley)
- Humans comprehend sentences in an online manner: it is believed that we do incremental parsing as we hear words one at a time. Thus, garden-path sentences are able to catch us off-guard. Moreover, the longer the sentence runs before a disambiguation point is reached, the harder it is for humans to recover (the digging-in effect). This is a psycholinguistics paper that seeks to explain garden-path and digging-in effects with a novel particle-filter-based PCFG parser: essentially, whenever a word is received, a partial parse is sampled. The proportion of "incorrect" particles increases with sentence length (modeling digging-in), and the number of particles used correlates with the memory constraints of the brain.
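
A toy simulation of why a particle filter reproduces digging-in (my own, far simpler than the paper's PCFG parser, with made-up probabilities): at an ambiguity each particle commits to one reading, and the longer the ambiguous region, the more resampling steps favour the locally more probable wrong reading, so by the time the disambiguating word arrives few particles still carry the correct parse.

```python
import numpy as np

rng = np.random.default_rng(0)

def surviving_correct_particles(n_particles, ambiguous_len, p_wrong=0.7, trials=1000):
    """Fraction of runs where at least one particle still holds the correct reading
    when the disambiguating word finally arrives."""
    survived = 0
    for _ in range(trials):
        # Each particle commits to a reading at the ambiguity (True = correct).
        particles = rng.random(n_particles) > p_wrong
        for _ in range(ambiguous_len):
            # Each word in the ambiguous region fits the wrong reading slightly better,
            # so correct particles carry lower weight and tend to be resampled away.
            weights = np.where(particles, 0.4, 0.6)
            weights /= weights.sum()
            idx = rng.choice(n_particles, size=n_particles, p=weights)
            particles = particles[idx]
        survived += particles.any()
    return survived / trials

for length in (1, 5, 10):   # longer ambiguous regions -> harder to recover
    print(length, surviving_correct_particles(n_particles=20, ambiguous_len=length))
```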

"Tighter bounds for structured estimation" - Olivier Chapelle, et. al. (Yahoo/Stanford/NICTA)
- A common approach to optimizing difficult loss functions is to minimize a convex upper bound instead (e.g., the hinge loss in SVMs). However, these surrogate losses are often loose. In particular, outliers can suffer a very large loss, so overall classifier accuracy may be sacrificed as the optimizer focuses on these extremely difficult points. The idea here is to use a non-convex but tighter upper bound. The authors adopt a ramp loss for the structured prediction problem and use the convex-concave procedure to solve it; a small sketch of the ramp loss is below.
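
A small sketch of the (binary, non-structured) ramp loss for intuition (mine, not the paper's structured formulation): the hinge loss minus a shifted hinge, which caps the penalty so outliers cannot dominate the objective. Writing it as a convex hinge plus a concave correction is what lets the concave-convex procedure optimize it by linearizing the concave part at each iteration.

```python
import numpy as np

def hinge(margin):
    return np.maximum(0.0, 1.0 - margin)

def ramp(margin, s=-1.0):
    # ramp = hinge minus a hinge shifted to s; the loss is capped at 1 - s.
    return hinge(margin) - np.maximum(0.0, s - margin)

margins = np.array([-5.0, -1.0, 0.0, 0.5, 2.0])   # the -5.0 point is an outlier
print(hinge(margins))   # [6.  2.  1.  0.5 0. ]  -> the outlier gets a huge penalty
print(ramp(margins))    # [2.  2.  1.  0.5 0. ]  -> penalty capped at 1 - s = 2
```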
