30 December 2009

Some random NIPS thoughts...

I missed the first two days of NIPS due to teaching. Which is sad -- I heard there were great things on the first day. I did end up seeing a lot that was nice. But since I missed stuff, I'll instead post some paper suggestions from one of my students, Piyush Rai, who was there. You can tell his biases from his selections, but that's life :). More of my thoughts after his notes...

Says Piyush:

There was an interesting tutorial by Gunnar Martinsson on using randomization to speed up matrix factorization (SVD, PCA etc) of really really large matrices (by "large", I mean something like 10^6 x 10^6). People typically use Krylov subspace methods (e.g., the Lanczos algo) but these require multiple passes over the data. It turns out that with the randomized approach, you can do it in a single pass or a small number of passes (so it can be useful in a streaming setting). The idea is quite simple. Let's assume you want the top K evals/evecs of a large matrix A. The randomized method draws K *random* vectors from a Gaussian and uses them in some way (details here) to get a "smaller version" of A, call it B, on which doing SVD can be very cheap. Having got the evals/evecs of B, a simple transformation will give you the same for the original matrix A.
The success of many matrix factorization methods (e.g., the Lanczos) also depends on how quickly the spectrum decays (eigenvalues) and they also suggest ways of dealing with cases where the spectrum doesn't quite decay that rapidly.
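To make the recipe concrete, here is a minimal NumPy sketch of a randomized SVD in the spirit of what is described above. It is not necessarily the exact variant from the tutorial: the function name, the `oversample` padding (a few extra random vectors beyond K) and the optional `n_power_iter` passes (one common way to cope with a slowly decaying spectrum) are assumptions made for illustration.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_power_iter=0, seed=0):
    """Approximate the top-k singular triplets of A using a Gaussian sketch."""
    rng = np.random.default_rng(seed)
    n_cols = A.shape[1]
    # Draw k (plus a few extra) random Gaussian test vectors.
    omega = rng.standard_normal((n_cols, k + oversample))
    # Sample the range of A; this costs one pass over A.
    Y = A @ omega
    # Optional extra passes sharpen the sketch when the spectrum decays slowly.
    for _ in range(n_power_iter):
        Y, _ = np.linalg.qr(Y)
        Y = A @ (A.T @ Y)
    # Orthonormal basis Q for the sampled range.
    Q, _ = np.linalg.qr(Y)
    # B is the "smaller version" of A: only (k + oversample) rows.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    # Map the left singular vectors of B back to those of A.
    U = Q @ U_small
    return U[:, :k], s[:k], Vt[:k]
```

On an exactly rank-K matrix this should recover the top K singular values up to floating-point error; on real data the quality depends on how fast the spectrum decays.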

Some papers from the main conference that I found interesting:

Distribution Matching for Transduction (Alex Smola and 2 other guys): They use maximum mean discrepancy (MMD) to do predictions in a transduction setting (i.e., when you also have the test data at training time). The idea is to use the fact that we expect the output functions f(X) and f(X') to be the same or close to each other (X are training and X' are test inputs). So instead of using the standard regularized objective used in the inductive setting, they use the distribution discrepancy (measured by say D) of f(X) and f(X') as a regularizer. D actually decomposes over pairs of training and test examples so one can use a stochastic approximation of D (D_i for the i-th pair of training and test inputs) and do something like an SGD.
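For a feel of what the discrepancy D could look like, here is a small sketch of a (biased) squared-MMD estimate between two sets of scalar outputs, f(X) and f(X'). The RBF kernel, the bandwidth, and the function names are assumptions for illustration; the paper's exact estimator and its pairwise stochastic approximation may differ.

```python
import math

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))

def mmd_squared(train_outputs, test_outputs, bandwidth=1.0):
    """Biased estimate of squared MMD between two samples of scalar outputs."""
    n, m = len(train_outputs), len(test_outputs)
    k_tt = sum(rbf_kernel(a, b, bandwidth)
               for a in train_outputs for b in train_outputs) / (n * n)
    k_ss = sum(rbf_kernel(a, b, bandwidth)
               for a in test_outputs for b in test_outputs) / (m * m)
    k_ts = sum(rbf_kernel(a, b, bandwidth)
               for a in train_outputs for b in test_outputs) / (n * m)
    return k_tt + k_ss - 2.0 * k_ts
```

A transductive objective would then look something like loss(f(X), Y) + lambda * mmd_squared(f(X), f(X')), where lambda is a hypothetical trade-off weight.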

Semi-supervised Learning using Sparse Eigenfunction Bases (Sinha and Belkin from Ohio): This paper uses the cluster assumption of semi-supervised learning. They use unlabeled data to construct a set of basis functions and then use labeled data in the LASSO framework to select a sparse combination of basis functions to learn the final classifier.

Streaming k-means approximation (Nir Ailon et al.): This paper does an online optimization of the k-means objective function. The algo is based on the previously proposed kmeans++ algorithm.
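For reference, the kmeans++ seeding step that the streaming algorithm builds on can be sketched as follows. This shows only the batch seeding (each new center is sampled with probability proportional to its squared distance from the nearest center chosen so far), not the streaming algorithm from the paper; the function names are made up for illustration.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_seeds(points, k, seed=0):
    """Pick k initial centers using kmeans++ style D^2 sampling."""
    rng = random.Random(seed)
    # First center: uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight of each point: squared distance to its nearest chosen center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        threshold = rng.random() * sum(weights)
        running = 0.0
        for point, weight in zip(points, weights):
            running += weight
            if running >= threshold:
                centers.append(point)
                break
    return centers
```

Points far from every current center get picked more often, which is what gives kmeans++ its approximation guarantee for the k-means objective.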

The Wisdom of Crowds in the Recollection of Order Information. It's about aggregating rank information from various individuals to reconstruct the global ordering.

Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora (by some folks at gatech): The problem setting is interesting here. Here the "multi-instance" is a bit of a misnomer. It means that each example in turn can consist of several sub-examples (which they call instances). E.g., a document consists of several paragraphs, or a webpage consists of text, images, videos.

Construction of Nonparametric Bayesian Models from Parametric Bayes Equations (Peter Orbanz): If you care about Bayesian nonparametrics. :) It basically builds on the Kolmogorov consistency theorem to formalize and sort of gives a recipe for the construction of nonparametric Bayesian models from their parametric counterparts. Seemed to be a good step in the right direction.

Indian Buffet Processes with Power-law Behavior (YWT and Dilan Gorur): This paper actually does the exact opposite of what I had thought of doing for IBP. The IBP (akin to the sense of the Dirichlet process) encourages the "rich-gets-richer" phenomenon in the sense that a dish that has been already selected by a lot of customers is highly likely to be selected by future customers as well. This leads to the expected number of dishes (and thus the latent features) being something like O(alpha * log n). This paper tries to be even more aggressive and makes the relationship have a power-law behavior. What I wanted to do was a reverse behavior -- maybe more like a "socialist IBP" :) where the customers in IBP are sort of evenly distributed across the dishes.
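The O(alpha * log n) growth is easy to check empirically for the standard (one-parameter) IBP, where the expected number of dishes after n customers is alpha times the n-th harmonic number. The sketch below simulates the usual buffet metaphor; it is the plain IBP, not the power-law variant from the paper, and the function names are made up for illustration.

```python
import math
import random

def sample_poisson(rate, rng):
    """Draw from a Poisson distribution (Knuth's method, fine for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1

def simulate_ibp(n_customers, alpha, seed=0):
    """Return, for each dish, how many customers took it under a standard IBP."""
    rng = random.Random(seed)
    dish_counts = []
    for i in range(1, n_customers + 1):
        # Rich-gets-richer: customer i takes dish k with probability m_k / i.
        for k in range(len(dish_counts)):
            if rng.random() < dish_counts[k] / i:
                dish_counts[k] += 1
        # Customer i also tries Poisson(alpha / i) brand-new dishes.
        dish_counts.extend([1] * sample_poisson(alpha / i, rng))
    return dish_counts

def expected_num_dishes(n_customers, alpha):
    """alpha * H_n, which grows like alpha * log n."""
    return alpha * sum(1.0 / i for i in range(1, n_customers + 1))
```

Averaging `len(simulate_ibp(n, alpha, seed=s))` over many seeds should land close to `expected_num_dishes(n, alpha)`.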

The rest of this post is random thoughts that occurred to me at NIPS. Maybe some of them will get other people's wheels turning? This was originally an email I sent to my students, but I figured I might as well post it for the world. But forgive the lack of capitalization :):

persi diaconis' invited talk about reinforcing random walks... that is, you take a random walk, but every time you cross an edge, you increase the probability that you re-cross that edge (see coppersmith + diaconis, rolles + diaconis).... this relates to a post i had a while ago: nlpers.blogspot.com/2007/04/multinomial-on-graph.html ... i'm thinking that you could set up a reinforcing random walk on a graph to achieve this. the key problem is how to compute things -- basically what you want is to know for two nodes i,j in a graph and some n >= 0, whether there exists a walk from i to j that takes exactly n steps. seems like you could craft a clever data structure to answer this question, then set up a graph multinomial based on this, with reinforcement (the reinforcement basically looks like the additive counts you get from normal multinomials)... if you force n=1 and have a fully connected graph, you should recover a multinomial/dirichlet pair.
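as a baseline for the "walk of exactly n steps" question, here is the naive answer: push a frontier of reachable nodes forward n times. this is not the clever data structure hoped for above, just a reference point that costs roughly n times the number of edges per query. the adjacency-dict representation and the function name are assumptions for illustration.

```python
def walk_exists(adjacency, start, goal, n_steps):
    """True if some walk from start to goal uses exactly n_steps edges.

    adjacency maps each node to an iterable of its neighbors.
    """
    frontier = {start}
    for _ in range(n_steps):
        # Every node reachable in one more step from the current frontier.
        frontier = {v for u in frontier for v in adjacency[u]}
        if not frontier:
            return False
    return goal in frontier
```

on a fully connected graph with self-loops and n=1, every node is reachable from every other, which matches the multinomial special case.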

also from persi's talk, persi and some guy sergei (sergey?) have a paper on variable length markov chains that might be interesting to look at, perhaps related to frank wood's sequence memoizer paper from icml last year.

finally, also from persi's talk, steve mc_something from ohio has a paper on using common gamma distributions in different rows to set dependencies among markov chains... this is related to something i was thinking about a while ago where you want to set up transition matrices with stick-breaking processes, and to have a common, global, set of sticks that you draw from... looks like this steve mc_something guy has already done this (or something like it).

not sure what made me think of this, but related to a talk we had here a few weeks ago about unit tests in scheme, where they basically randomly sample programs to "hope" to find bugs... what about setting this up as an RL problem where your reward is high if you're able to find a bug with a "simple" program... something like 0 if you don't find a bug, or 1/|P| if you find a bug with program P. (i think this came up when i was talking to percy -- liang, the other one -- about some semantics stuff he's been looking at.) afaik, no one in PL land has tried ANYTHING remotely like this... it's a little tricky because of the infinite but discrete state space (of programs), but something like an NN-backed Q-learning might do something reasonable :P.

i also saw a very cool "survey of vision" talk by bill freeman... one of the big problems they talked about was that no one has a good p(image) prior model. the example given was that you usually have de-noising models like p(image)*p(noisy image|image) and you can weight p(image) by ^alpha... as alpha goes to zero, you should just get a copy of your noisy image... as alpha goes to infinity, you should end up getting a good image, maybe not the one you *want*, but an image nonetheless. this doesn't happen.
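the two limits are easy to see in a toy setting where the prior actually is well behaved. the sketch below assumes a single "pixel" with a gaussian prior and gaussian noise (both assumptions, and exactly the kind of prior that is *not* a good image model); it only illustrates what the alpha weighting is supposed to do.

```python
def weighted_map_estimate(noisy, alpha, prior_mean=0.0, prior_var=1.0, noise_var=1.0):
    """Maximize alpha * log N(x; prior_mean, prior_var) + log N(noisy; x, noise_var).

    Both terms are quadratic in x, so the maximizer is a precision-weighted
    average of the prior mean and the noisy observation.
    """
    prior_precision = alpha / prior_var
    noise_precision = 1.0 / noise_var
    return ((prior_precision * prior_mean + noise_precision * noisy)
            / (prior_precision + noise_precision))
```

with alpha = 0 you get the noisy value back; as alpha grows the estimate slides to the prior mode, whether or not that is the value you wanted.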

one way you can see that this doesn't happen is in the following task. take two images and overlay them. now try to separate the two. you *clearly* need a good prior p(image) to do this, since you've lost half your information.

i was thinking about what this would look like in language land. one option would be to take two sentences and randomly interleave their words, and try to separate them out. i actually think that we could solve this task pretty well. you could probably formulate it as an FST problem, backed by a big n-gram language model. alternatively, you could take two DOCUMENTS and randomly interleave their sentences, and try to separate them out. i think we would fail MISERABLY on this task, since it requires actually knowing what discourse structure looks like. a sentence n-gram model wouldn't work, i don't think. (although maybe it would? who knows.) anyway, i thought it was an interesting thought experiment. i'm trying to think if this is actually a real world problem... it reminds me a bit of a paper a year or so ago where they try to do something similar on IRC logs, where you try to track who is speaking when... you could also do something similar on movie transcripts.
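here is a rough sketch of the word-level version: a dynamic program (standing in for the FST) that assigns each word to one of two streams so that the total bigram log-probability of the two streams is maximized. the add-one smoothed bigram model trained on a tiny corpus is a toy stand-in for the "big n-gram language model", and all names are made up for illustration.

```python
import math
from collections import Counter

def make_bigram_scorer(corpus):
    """Add-one smoothed bigram log-probabilities from a list of sentences."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        vocab.update(tokens[1:])
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[(prev, word)] += 1
            contexts[prev] += 1

    def logprob(prev, word):
        return math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab)))

    return logprob

def separate(words, logprob):
    """Split an interleaved word list into the two best-scoring streams."""
    # State: (index of last word in stream 0, same for stream 1); -1 means empty.
    best = {(-1, -1): (0.0, [])}
    for t, word in enumerate(words):
        nxt = {}
        for (a, b), (score, labels) in best.items():
            for stream, last in ((0, a), (1, b)):
                prev = words[last] if last >= 0 else "<s>"
                cand = score + logprob(prev, word)
                state = (t, b) if stream == 0 else (a, t)
                if state not in nxt or cand > nxt[state][0]:
                    nxt[state] = (cand, labels + [stream])
        best = nxt
    _, labels = max(best.values(), key=lambda item: item[0])
    streams = ([], [])
    for word, label in zip(words, labels):
        streams[label].append(word)
    return [" ".join(stream) for stream in streams]
```

for example, with a scorer trained on "the cat sat on the mat" and "a dog ran in a park", calling `separate` on the interleaving "the a cat dog sat ran on in the a mat park" should pull the two sentences apart (in either order). the document-level version would need something much richer than this as the scorer.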

hierarchical topic models with latent hierarchies drawn from the coalescent, kind of like hdp, but not quite. (yeah yeah i know i'm like a parrot with the coalescent, but it's pretty freaking awesome :P.)


That's it! Hope you all had a great holiday season, and enjoy your New Years (I know I'm going skiing. A lot. So there, Fernando! :)).

18 comments:

Anonymous said...

I've always looked at the image problem as an argument for posterior predictive checks rather than straight draws from the prior. It's possible that your original prior may be pretty diffuse but still puts enough probability mass on real images. Given data, the posterior should then be able to generate new images similar to the data, which is the standard textbook argument for posterior predictive checks (e.g. Gelman et al, 2003). Clearly all the current models fail. But posterior predictive checks should give hints about how to improve image models.

The same applies to language models. The prior doesn't need to generate sensible documents, but posterior predictive simulations should, given enough training data. Otherwise your model isn't rich enough and your prior doesn't put enough weight on true images.

Coming from a stats background, I've actually been surprised at how little iteration there is between posterior predictive checks and model building in computer science literature. This is a huge theme by statisticians doing applied Bayesian work in other fields. The payoff seems particularly big in CS applications because the models are so bad/hard.

Bob Carpenter said...

@Anonymous Computer scientists only tend to care about the predictive accuracy of their models.

My main beef is that they only tend to consider first-best predictions (e.g. 0/1 loss for classification) and not care about the probability assigned (e.g. log loss). This makes it hard to trade off recall for precision (or sensitivity for specificity) in an application, and most applications require either high precision or high recall.

Computer scientists don't usually evaluate individual parameters for significance or give them causal interpretations. That's because they're not interested in assessing the effect of education on income, but are rather interested in a single prediction such as "should I give this person a credit card?".

For language models, lack of predictive checks isn't so surprising when you consider that no one has ever built a language model (and I'm not talking just n-gram models here) that generates anything like sensible documents from the posterior predictive distribution.

You do see just this kind of posterior predictive checking in section 3 of Shannon's 1948 Mathematical Theory of Communication (yes, that's 61+ years ago) paper that introduced n-gram language models! What you see right away is the Markovian nature of n-gram models not representing long-term topical or syntactic consistency (as in Stephen Merrit's song title "Doris Day the Earth Stood Still").

On the other hand, you can do posterior predictive checks on smaller units than full docs. For instance, you could scatterplot expected versus empirical counts of the next word given the previous word(s). You also see this in comparing prior coefficient distributions to posteriors (e.g. Goodman's paper on the Laplace [double exponential] prior).

hal said...

Anonymous: By "prior" I meant it in the Bayes' rule sense, not in the Bayesian sense... i.e., it is something like p(true image) which then gets corrupted into p(observation | true image). the "prior" then is, actually, a posterior given data, and it's that that doesn't generate anything remotely like images.

Analogously in NLP, as Bob says, a language model doesn't generate anything like sentences (see previous post of small changes begetting negative examples).

I actually think people do do a fair amount of something roughly analogous to posterior predictive simulations, but in the one-best sense that Bob doesn't like. That is, people run their models, see what they do, and make adjustments as appropriate. This is probably one of the major ways in which progress is made.

But Bob is totally right: I don't care at all if feature 18329 has x% effect on predicting whether a word is a determiner or not!

Back when I was a student, I took a class from Roni Rosenfeld where we had to build a system to disambiguate between true English sentences and sentences generated by a trigram language model. It's actually quite hard, until you start looking at using parsers and things like that. Nowadays I'd replace that with a fivegram and I bet it would be even more difficult. Of course, people do it with no effort at all (the bad ones "hurt" to read).

Fernando Pereira said...

Well, I've been skiing so hard the last two days, exploiting the bounty of yesterday's Tahoe storm, that I'm too tired to produce much in the way of technical comment. I'll just note that "I don't care at all if feature 18329 has x% effect on predicting whether a word is a determiner or not!" sounds a bit like sour grapes ;) If you had that information, it could help you debug your model when it goes badly wrong because of a change in the data distribution. Most academic ML work is not forced to deal with that critical issue because it is based on fixed datasets.

Anonymous said...

@Bob

That Shannon link is very nice. In statistics, as far as I can tell, Box (1980) and Rubin (1984) are viewed by many as the first clear statements of posterior predictive checks from a calibrated Bayes perspective, but the Shannon example is great; I'll definitely cite it from now on.

Gustavo Lacerda said...

spelling quibble: Diaconis's first name is "Persi".

hal said...

fixed: thanks gustavo.
