30 December 2009

Some random NIPS thoughts...

I missed the first two days of NIPS due to teaching. Which is sad -- I heard there were great things on the first day. I did end up seeing a lot that was nice. But since I missed stuff, I'll instead post some paper suggestions from one of my students, Piyush Rai, who was there. You can tell his biases from his selections, but that's life :). More of my thoughts after his notes...

Says Piyush:
There was an interesting tutorial by Gunnar Martinsson on using randomization to speed up matrix factorization (SVD, PCA, etc.) of really really large matrices (by "large", I mean something like 10^6 x 10^6). People typically use Krylov subspace methods (e.g., the Lanczos algo) but these require multiple passes over the data. It turns out that with the randomized approach, you can do it in a single pass or a small number of passes (so it can be useful in a streaming setting). The idea is quite simple. Let's assume you want the top K evals/evecs of a large matrix A. The randomized method draws K *random* vectors from a Gaussian and uses them in some way (details here) to get a "smaller version" B of A on which doing SVD can be very cheap. Having got the evals/evecs of B, a simple transformation will give you the same for the original matrix A.
The success of many matrix factorization methods (e.g., Lanczos) also depends on how quickly the spectrum (the eigenvalues) decays, and they also suggest ways of dealing with cases where the spectrum doesn't decay that rapidly.
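Since the recipe is so short, here is a rough sketch of it in code (my own toy version, with oversampling and a couple of power iterations for slowly decaying spectra; not Martinsson's exact algorithm):

```python
import numpy as np

def randomized_svd(A, k, n_oversample=10, n_iter=2, seed=0):
    """Sketch of a randomized truncated SVD: probe A with random
    Gaussian vectors, compress it to a small matrix B, do an exact
    SVD on B, and lift the result back to the original space."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    Omega = rng.standard_normal((n, k + n_oversample))  # random probe vectors
    Y = A @ Omega                        # one pass over A
    for _ in range(n_iter):              # optional extra passes if the spectrum decays slowly
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)               # orthonormal basis for the (approximate) range of A
    B = Q.T @ A                          # the "smaller version" of A: (k + p) x n
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ Ub[:, :k], s[:k], Vt[:k]  # simple transformation back to A's evecs

# quick sanity check on an exactly rank-15 matrix
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 15)) @ rng.standard_normal((15, 300))
U, s, Vt = randomized_svd(A, k=10)
exact = np.linalg.svd(A, compute_uv=False)[:10]
print(np.max(np.abs(s - exact)))  # should be tiny: the sketch captures the whole range
```

Because the sketch size (k + oversampling) exceeds the true rank here, the top singular values come out essentially exact; on full-rank matrices the oversampling and power iterations control the approximation error.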

Some papers from the main conference that I found interesting:

Distribution Matching for Transduction (Alex Smola and 2 other guys): They use maximum mean discrepancy (MMD) to do predictions in a transduction setting (i.e., when you also have the test data at training time). The idea is to use the fact that we expect the output functions f(X) and f(X') to be the same or close to each other (X are training and X' are test inputs). So instead of using the standard regularized objective used in the inductive setting, they use the distribution discrepancy (measured by say D) of f(X) and f(X') as a regularizer. D actually decomposes over pairs of training and test examples so one can use a stochastic approximation of D (D_i for the i-th pair of training and test inputs) and do something like an SGD.
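To make the shape of that regularizer concrete, here is a toy sketch (entirely my own gloss: a linear kernel, a hinge loss, and made-up data; the paper's actual objective and kernels differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy transductive setup: labeled training inputs, unlabeled test inputs
Xtr = rng.standard_normal((100, 5))
ytr = np.sign(Xtr[:, 0] + 0.1 * rng.standard_normal(100))
Xte = rng.standard_normal((80, 5)) + 0.3          # a slightly shifted test set

def objective(w, lam=1.0):
    hinge = np.maximum(0.0, 1.0 - ytr * (Xtr @ w)).mean()
    # distribution-matching regularizer: with a linear kernel, the MMD
    # between f(X) and f(X') reduces to the difference of output means
    D = (np.mean(Xtr @ w) - np.mean(Xte @ w)) ** 2
    return hinge + lam * D

# SGD-ish loop: sample one labeled point and one (train, test) pair per step
w = np.zeros(5)
for t in range(500):
    i, j = rng.integers(100), rng.integers(80)
    g = np.zeros(5)
    if ytr[i] * (Xtr[i] @ w) < 1.0:
        g -= ytr[i] * Xtr[i]                       # hinge subgradient
    diff = (Xtr[i] - Xte[j]) @ w                   # crude stochastic stand-in
    g += 2.0 * diff * (Xtr[i] - Xte[j])            # for the gradient of D
    w -= 0.05 / np.sqrt(t + 1.0) * g
print(objective(w))
```

The per-pair stochastic term is only a rough proxy for the full discrepancy (it picks up extra variance terms), but it shows how D decomposing over (train, test) pairs makes SGD natural.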

Semi-supervised Learning using Sparse Eigenfunction Bases (Sinha and Belkin from Ohio): This paper uses the cluster assumption of semi-supervised learning. They use unlabeled data to construct a set of basis functions and then use labeled data in the LASSO framework to select a sparse combination of basis functions to learn the final classifier.

Streaming k-means approximation (Nir Ailon et al.): This paper does an online optimization of the k-means objective function. The algo is based on the previously proposed kmeans++ algorithm.
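For reference, the kmeans++ seeding step they build on looks roughly like this (a from-scratch sketch of the seeding rule, not their streaming algorithm):

```python
import numpy as np

def kmeanspp_seeds(X, k, rng):
    """kmeans++ seeding: pick the first center uniformly, then pick each
    new center with probability proportional to its squared distance
    from the nearest center chosen so far."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    for _ in range(k - 1):
        # squared distance of every point to its nearest existing center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.stack(centers)

# three well-separated blobs; the D^2 weighting tends to spread the seeds out
rng = np.random.default_rng(0)
X = np.concatenate([rng.standard_normal((50, 2)) + mu
                    for mu in ([0, 0], [10, 0], [0, 10])])
C = kmeanspp_seeds(X, 3, rng)
```

Points already chosen as centers have zero distance, hence zero probability, so the seeds are always distinct.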

The Wisdom of Crowds in the Recollection of Order Information. It's about aggregating rank information from various individuals to reconstruct the global ordering.

Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora (by some folks at gatech): The problem setting is interesting here. Here the "multi-instance" is a bit of a misnomer. It means that each example can in turn consist of several sub-examples (which they call instances). E.g., a document consists of several paragraphs, or a webpage consists of text, images, and videos.

Construction of Nonparametric Bayesian Models from Parametric Bayes Equations (Peter Orbanz): If you care about Bayesian nonparametrics. :) It basically builds on the Kolmogorov consistency theorem to formalize, and sort of give a recipe for, the construction of nonparametric Bayesian models from their parametric counterparts. Seemed to be a good step in the right direction.

Indian Buffet Processes with Power-law Behavior (YWT and Dilan Gorur): This paper actually does the exact opposite of what I had thought of doing for IBP. The IBP (akin to the Dirichlet process) encourages the "rich-get-richer" phenomenon, in the sense that a dish that has already been selected by a lot of customers is highly likely to be selected by future customers as well. This leads to the expected number of dishes (and thus latent features) being something like O(alpha * log n). This paper tries to be even more aggressive and makes the relationship have a power-law behavior. What I wanted to do was the reverse behavior -- maybe more like a "socialist IBP" :) where the customers in IBP are sort of evenly distributed across the dishes.
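The rich-gets-richer dynamics are easy to see by simulating the standard IBP's customer process (a quick sketch of the usual culinary metaphor; the power-law version in the paper modifies these probabilities):

```python
import numpy as np

def ibp_sample(n_customers, alpha, rng):
    """Draw a binary feature matrix Z from the standard Indian Buffet
    Process. Customer i takes each previously tried dish k with
    probability m_k / i (rich-get-richer), then samples
    Poisson(alpha / i) brand-new dishes."""
    counts = []   # m_k: how many customers have taken dish k so far
    rows = []
    for i in range(1, n_customers + 1):
        row = [int(rng.random() < m / i) for m in counts]
        for k, took in enumerate(row):
            counts[k] += took
        new = rng.poisson(alpha / i)         # new dishes for this customer
        counts.extend([1] * new)
        row.extend([1] * new)
        rows.append(row)
    Z = np.zeros((n_customers, len(counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

rng = np.random.default_rng(0)
Z = ibp_sample(1000, alpha=2.0, rng=rng)
# number of dishes is typically close to alpha * H_n ~ alpha * log n
print(Z.shape[1], 2.0 * np.log(1000))
```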
The rest of this post is random thoughts that occurred to me at NIPS. Maybe some of them will get other people's wheels turning? This was originally an email I sent to my students, but I figured I might as well post it for the world. But forgive the lack of capitalization :):

persi diaconis' invited talk about reinforcing random walks... that is, you take a random walk, but every time you cross an edge, you increase the probability that you re-cross that edge (see coppersmith + diaconis, rolles + diaconis).... this relates to a post i had a while ago: nlpers.blogspot.com/2007/04/multinomial-on-graph.html ... i'm thinking that you could set up a reinforcing random walk on a graph to achieve this. the key problem is how to compute things -- basically what you want is to know, for two nodes i,j in a graph and some n >= 0, whether there exists a walk from i to j that takes exactly n steps. seems like you could craft a clever data structure to answer this question, then set up a graph multinomial based on this, with reinforcement (the reinforcement basically looks like the additive counts you get from normal multinomials)... if you force n=1 and have a fully connected graph, you should recover a multinomial/dirichlet pair.
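a tiny sketch of the reinforcement mechanics (all names and constants here are made up), just to show how the edge weights act like the additive counts of a multinomial/dirichlet pair:

```python
import random

def reinforced_walk(neighbors, start, steps, init=1.0, boost=1.0, rng=None):
    """Linearly edge-reinforced random walk: each undirected edge starts
    with weight `init`; every traversal adds `boost`, so well-travelled
    edges get re-crossed more often. The weights behave like the counts
    you accumulate in a multinomial/Dirichlet pair."""
    rng = rng or random.Random(0)
    w = {}  # weights on undirected edges, keyed by sorted node pair

    def weight(u, v):
        return w.get((min(u, v), max(u, v)), init)

    path, cur = [start], start
    for _ in range(steps):
        nbrs = neighbors[cur]
        nxt = rng.choices(nbrs, weights=[weight(cur, v) for v in nbrs])[0]
        key = (min(cur, nxt), max(cur, nxt))
        w[key] = weight(cur, nxt) + boost   # reinforce the crossed edge
        path.append(nxt)
        cur = nxt
    return path, w

# fully connected graph on 4 nodes: each step is then a draw from a
# multinomial whose pseudo-counts grow with use
graph = {u: [v for v in range(4) if v != u] for u in range(4)}
path, w = reinforced_walk(graph, start=0, steps=200)
```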

also from persi's talk, persi and some guy sergei (sergey?) have a paper on variable length markov chains that might be interesting to look at, perhaps related to frank wood's sequence memoizer paper from icml last year.

finally, also from persi's talk, steve mc_something from ohio has a paper on using common gamma distributions in different rows to set dependencies among markov chains... this is related to something i was thinking about a while ago where you want to set up transition matrices with stick-breaking processes, and to have a common, global, set of sticks that you draw from... looks like this steve mc_something guy has already done this (or something like it).

not sure what made me think of this, but related to a talk we had here a few weeks ago about unit tests in scheme, where they basically randomly sample programs to "hope" to find bugs... what about setting this up as an RL problem where your reward is high if you're able to find a bug with a "simple" program... something like 0 if you don't find a bug, or 1/|P| if you find a bug with program P. (i think this came up when i was talking to percy -- liang, the other one -- about some semantics stuff he's been looking at.) afaik, no one in PL land has tried ANYTHING remotely like this... it's a little tricky because of the infinite but discrete state space (of programs), but something like an NN-backed Q-learning might do something reasonable :P.

i also saw a very cool "survey of vision" talk by bill freeman... one of the big problems they talked about was that no one has a good p(image) prior model. the example given was that you usually have de-noising models like p(image)*p(noisy image|image) and you can weight p(image) by raising it to a power alpha... as alpha goes to zero, you should just get a copy of your noisy image... as alpha goes to infinity, you should end up getting a good image, maybe not the one you *want*, but an image nonetheless. this doesn't happen.

one way you can see that this doesn't happen is in the following task. take two images and overlay them. now try to separate the two. you *clearly* need a good prior p(image) to do this, since you've lost half your information.

i was thinking about what this would look like in language land. one option would be to take two sentences and randomly interleave their words, and try to separate them out. i actually think that we could solve this task pretty well. you could probably formulate it as a FST problem, backed by a big n-gram language model. alternatively, you could take two DOCUMENTS and randomly interleave their sentences, and try to separate them out. i think we would fail MISERABLY on this task, since it requires actually knowing what discourse structure looks like. a sentence n-gram model wouldn't work, i don't think. (although maybe it would? who knows.) anyway, i thought it was an interesting thought experiment. i'm trying to think if this is actually a real world problem... it reminds me a bit of a paper a year or so ago where they try to do something similar on IRC logs, where you try to track who is speaking when... you could also do something similar on movie transcripts.
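here's a toy version of the word-interleaving game, with a hand-built add-one bigram model and a viterbi DP in place of a real FST (everything below is made up for illustration, not an actual system):

```python
import math
from collections import defaultdict

# tiny add-one-smoothed bigram "language model" from a toy corpus
corpus = ["the cat sat on the mat", "a dog chased the red ball",
          "the cat chased a ball", "a dog sat on the red mat"]
bigrams, unigrams, vocab = defaultdict(int), defaultdict(int), set()
for sent in corpus:
    toks = ["<s>"] + sent.split()
    vocab.update(toks)
    for u, v in zip(toks, toks[1:]):
        bigrams[u, v] += 1
        unigrams[u] += 1

def logp(v, u):
    return math.log((bigrams[u, v] + 1) / (unigrams[u] + len(vocab)))

def separate(words):
    """Viterbi DP: after reading token t, the state is the index of the
    last token assigned to the stream that t did NOT join (-1 means
    that stream is still at its <s> start). O(n^2) states overall."""
    tok = lambda i: "<s>" if i < 0 else words[i]
    # state j -> (score, stream holding the latest token, other stream)
    dp = {-1: (logp(words[0], "<s>"), [words[0]], [])}
    for t in range(1, len(words)):
        new = {}
        for j, (sc, cur, oth) in dp.items():
            for key, s, c, o in [
                (j, sc + logp(words[t], words[t - 1]), cur + [words[t]], oth),
                (t - 1, sc + logp(words[t], tok(j)), oth + [words[t]], cur),
            ]:
                if key not in new or s > new[key][0]:
                    new[key] = (s, c, o)
        dp = new
    return max(dp.values(), key=lambda x: x[0])[1:]

mixed = ["the", "a", "cat", "dog", "sat", "chased", "on", "the",
         "the", "red", "mat", "ball"]   # two sentences, alternately interleaved
a, b = separate(mixed)
print(a)
print(b)
```

with a bigram model this tiny the recovered split isn't guaranteed to match the true sentences, but the DP itself is the FST-ish machinery: every token lands in exactly one stream, and each stream is scored by its own bigram chain.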

hierarchical topic models with latent hierarchies drawn from the coalescent, kind of like hdp, but not quite. (yeah yeah i know i'm like a parrot with the coalescent, but it's pretty freaking awesome :P.)


That's it! Hope you all had a great holiday season, and enjoy your New Years (I know I'm going skiing. A lot. So there, Fernando! :)).

19 comments:

  1. I've always looked at the image problem as an argument for posterior predictive checks rather than straight draws from the prior. It's possible that your original prior may be pretty diffuse but still put enough probability mass on real images. Given data, the posterior should then be able to generate new images similar to the data, which is the standard textbook argument for posterior predictive checks (e.g., Gelman et al., 2003). Clearly all the current models fail. But posterior predictive checks should give hints about how to improve image models.

    The same applies to language models. The prior doesn't need to generate sensible documents, but posterior predictive simulations should, given enough training data. Otherwise your model isn't rich enough and your prior doesn't put enough weight on true documents.

    Coming from a stats background, I've actually been surprised at how little iteration there is between posterior predictive checks and model building in computer science literature. This is a huge theme by statisticians doing applied Bayesian work in other fields. The payoff seems particularly big in CS applications because the models are so bad/hard.

    ReplyDelete
  2. @Anonymous Computer scientists only tend to care about the predictive accuracy of their models.

    My main beef is that they only tend to consider first-best predictions (e.g. 0/1 loss for classification) and not care about the probability assigned (e.g. log loss). This makes it hard to trade off recall for precision (or sensitivity for specificity) in an application, and most applications require either high precision or high recall.

    Computer scientists don't usually evaluate individual parameters for significance or give them causal interpretations. That's because they're not interested in assessing the effect of education on income, but are rather interested in a single prediction such as "should I give this person a credit card?".

    For language models, lack of predictive checks isn't so surprising when you consider that no one has ever built a language model (and I'm not talking just n-gram models here) that generates anything like sensible documents from the posterior predictive distribution.

    You do see just this kind of posterior predictive checking in section 3 of Shannon's 1948 paper A Mathematical Theory of Communication (yes, that's 61+ years ago), the paper that introduced n-gram language models! What you see right away is that the Markovian nature of n-gram models fails to represent long-term topical or syntactic consistency (as in Stephin Merritt's song title "Doris Day the Earth Stood Still").

    On the other hand, you can do posterior predictive checks on smaller units than full docs. For instance, you could scatterplot expected versus empirical counts of the next word given the previous word(s). You also see this in comparing prior coefficient distributions to posteriors (e.g., Goodman's paper on the Laplace [double exponential] prior).

    ReplyDelete
  3. Anonymous: By "prior" I meant it in the Bayes' rule sense, not in the Bayesian sense... i.e., it is something like p(true image), which then gets corrupted via p(observation | true image). the "prior" then is, actually, a posterior given data, and it's that posterior that doesn't generate anything remotely like images.

    Analogously in NLP, as Bob says, a language model doesn't generate anything like sentences (see my previous post on small changes begetting negative examples).

    I actually think people do do a fair amount of something roughly analogous to posterior predictive simulations, but in the one-best sense that Bob doesn't like. That is, people run their models, see what they do, and make adjustments as appropriate. This is probably one of the major ways in which progress is made.

    But Bob is totally right: I don't care at all if feature 18329 has x% effect on predicting whether a word is a determiner or not!

    Back when I was a student, I took a class from Roni Rosenfeld where we had to build a system to disambiguate between true English sentences and sentences generated by a trigram language model. It's actually quite hard, until you start looking at using parsers and things like that. Nowadays I'd replace that with a fivegram and I bet it would be even more difficult. Of course, people do it with no effort at all (the bad ones "hurt" to read).

    ReplyDelete
  4. Well, I've been skiing so hard the last two days, exploiting the bounty of yesterday's Tahoe storm, that I'm too tired to produce much in the way of technical comment. I'll just note that "I don't care at all if feature 18329 has x% effect on predicting whether a word is a determiner or not!" sounds a bit like sour grapes ;) If you had that information, it could help you debug your model when it goes badly wrong because of a change in the data distribution. Most academic ML work is not forced to deal with that critical issue because it is based on fixed datasets.

    ReplyDelete
  5. @Bob

    That Shannon link is very nice. In statistics, as far as I can tell, Box (1980) and Rubin (1984) are viewed by many as the first clear statements of posterior predictive checks from a calibrated Bayes perspective, but the Shannon example is great; I'll definitely cite it from now on.

    ReplyDelete
  8. spelling quibble: Diaconis's first name is "Persi".

    ReplyDelete