When I think of structured input models, I typically think of things like kernels over discrete input spaces. For instance, the famous all-substrings kernel for which K(d1,d2) effectively counts the number of common substrings in two documents, without spending exponential time enumerating them all. Of course there are many more ways of thinking about structured inputs: tree-to-string machine translation has a tree structured input. RNNs (on the input side) are essentially structured input models for sequence structures.
When I think of structured output models, I typically think of things like CRFs, structured SVMs/M3Ns, multilabel predictors (those are borderline), various transition-based methods (eg., shift/reduce parsers), etc. Here, my internal model for the structure is essentially at prediction time: find a high scoring structure from this complicated discrete output space.
Perhaps this has been obvious to everyone-but-me for a decade, but I only recently came to the recognition that these are essentially the same, at least if you restrict the sort of models you're willing to consider. (In particular, if you ignore things like imitation learning/learning to search for a moment.)
In a pure structured input setting, you have some simple label space Y (let's assume it's the real numbers) and some complex input space X. Typically you want to learn a function f : X ➝ Y, which has low loss. In particular you want to minimize the expectation of loss(y, f(x)) over random draws of x,y. And the "interesting" thing is that x isn't just a vector, so you have to be clever.
In the pure structured output setting, in, for instance, the structured SVM/CRF setup, you have some input space X (which may or may not be structured) and some complex output space Y. As before, you want to learn a function f : X ➝ Y, which has low loss. However, in the most common setups, the way you accomplish this is that instead of directly learning f, you instead learn a scoring function s that scores x,y pairs based on how "good" that y is for the corresponding x. For a fixed scoring function s, you derive f according to the argmax rule: f_s(x) := argmax_y s(x,y). In this way, you have effectively separated the learning problem (get a good s) from the structured problem (solve the argmax). [Whether this is good or not is up for debate; I'm personally on the "nay" side.] You then want to minimize something like the expectation of loss(y, argmax_{y'} s(x,y')) over random draws of x,y.
The observation is that these two problems are essentially the same thing. That is, if you know how to do the structured input problem, then the structured output problem is essentially the same thing, as far as the learning problem goes. That is, if you can put structure in f(x) for structured input, you can just as well put structure in s(x,y) for structured output. Or, by example, if you can predict the fluency of an English sentence x as a structured input problem, you can predict the translation quality of a French/English sentence pair x,y in a structured output problem. This doesn't solve the argmax problem -- you have to do that separately -- but the underlying learning problem is essentially identical.
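To make that cartoon concrete, here is a minimal sketch, in Python, of the score-then-argmax decomposition on a toy sequence labeling problem. Everything here (the label set, the feature function, the perceptron-style update) is made up for illustration and is not anyone's actual system; the point is just that the learning code only ever touches s, while all the structure lives in the joint features and in the argmax.

```python
import itertools
from collections import Counter

LABELS = ["DET", "NOUN", "VERB"]  # tiny toy label set

def features(x, y):
    """Joint features phi(x, y): emission and transition counts."""
    f = Counter()
    for i, (word, tag) in enumerate(zip(x, y)):
        f[("emit", word, tag)] += 1
        if i > 0:
            f[("trans", y[i - 1], tag)] += 1
    return f

def score(w, x, y):
    """s(x, y) = w . phi(x, y): the learned part, which never looks at the structure directly."""
    return sum(w.get(k, 0.0) * v for k, v in features(x, y).items())

def argmax(w, x):
    """f_s(x) = argmax_y s(x, y). Brute force here; in practice this is Viterbi/ILP/beam search."""
    return max(itertools.product(LABELS, repeat=len(x)), key=lambda y: score(w, x, y))

def perceptron_update(w, x, y_true, lr=1.0):
    """One structured-perceptron step: push up features of y_true, push down those of the argmax."""
    y_hat = argmax(w, x)
    if y_hat != tuple(y_true):
        for k, v in features(x, y_true).items():
            w[k] = w.get(k, 0.0) + lr * v
        for k, v in features(x, y_hat).items():
            w[k] = w.get(k, 0.0) - lr * v
    return w

w = {}
x, y = ["the", "dog", "barks"], ["DET", "NOUN", "VERB"]
for _ in range(5):
    w = perceptron_update(w, x, y)
print(argmax(w, x))  # ('DET', 'NOUN', 'VERB') after a couple of updates
```

Swap the toy argmax for a real decoder and the score for a model over (x,y) pairs (e.g., translation quality), and you have the structured-input-equals-structured-output picture above.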
You see similar ideas being reborn these days with papers like David Belanger's ICML paper this year on energy networks. With this framework of think-of-structured-input-and-structured-output-as-the-same, basically what they're doing is building a structured score function that uses both the input and output simultaneously, and throwing these through a deep network. (Ok it's a bit more than that, but that's the cartoon.)
At any rate, maybe obvious to everyone but me, but I thought I'd write it down anyway :).
26 July 2016
Decoding (neural?) representations
I remember back in grad school days some subset of the field was thinking about the following question. I train an unsupervised HMM on some language data to get something-like-part-of-speech tags out. And naturally the question arises: these tags that come out... what are they actually encoding?
At the time, there were essentially three ways of approaching this question that I knew about:
- Do a head-to-head comparison, in which you build an offline matching between induced tags and "true" tags, and then evaluate the accuracy of that matching. This was the standard evaluation strategy for unsupervised POS tagging, but is really just trying to get at the question of: how correlated are the induced tags with what we hope comes out. (A minimal sketch of this matching appears right after this list.)
- Take a system that expects true POS tags and give it induced POS tags instead (at both training and test time). See how much it suffers (if at all). Joshua Goodman told me a few times (though I can't find his paper on this) that word clusters were just as good as POS tags if your task was NER.
- Do something like #2, but also give the system both POS tags and induced tags, and see if the POS tags give you anything above and beyond the induced tags.
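Here is a minimal sketch of the usual many-to-one version of that matching for discrete induced tags (the tag sequences below are made up for illustration): map each induced tag to the gold tag it most often co-occurs with, then compute accuracy under that mapping.

```python
from collections import Counter, defaultdict

def many_to_one_accuracy(induced, gold):
    """Map each induced tag to its most frequent gold tag, then score the mapped tags."""
    cooc = defaultdict(Counter)
    for i_tag, g_tag in zip(induced, gold):
        cooc[i_tag][g_tag] += 1
    mapping = {i_tag: counts.most_common(1)[0][0] for i_tag, counts in cooc.items()}
    correct = sum(mapping[i_tag] == g_tag for i_tag, g_tag in zip(induced, gold))
    return correct / len(gold), mapping

# Made-up example: cluster ids from an unsupervised tagger vs. gold POS tags.
induced = [7, 3, 7, 1, 3, 7, 1]
gold = ["DET", "NOUN", "DET", "VERB", "NOUN", "DET", "VERB"]
print(many_to_one_accuracy(induced, gold))  # (1.0, {7: 'DET', 3: 'NOUN', 1: 'VERB'})
```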
In fact, of the above approaches, the only one that requires any modification is #1 because there's not an obvious way to do the matching. The alternative is to let a classifier do the matching, rather than an offline process. In particular, you take your embeddings, and then try to train a classifier that predicts POS tags from the embeddings directly. (Note: I claim this generalizes #1 because if you did this with discrete tags, the classifier would simply learn to do the matching that we used to compute "by hand" offline.) If your classifier can do a good job, then you're happy.
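A sketch of what such a probe might look like, assuming numpy and scikit-learn are available. The synthetic arrays below are only stand-ins for per-token representations from whatever model you care about, paired with gold POS tags.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins: in practice X holds per-token states from your model (e.g., a biLSTM)
# and y the corresponding gold POS tags; here we fake a noisy linear relationship.
n_tokens, dim, n_tags = 5000, 128, 12
true_W = rng.normal(size=(dim, n_tags))
X = rng.normal(size=(n_tokens, dim))
y = (X @ true_W + rng.normal(scale=5.0, size=(n_tokens, n_tags))).argmax(axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# A linear probe: if this does well, POS is *linearly* decodable from the representation.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:   ", probe.score(X_te, y_te))

# Always compare against a trivial baseline, e.g. the majority class.
print("majority baseline:", np.mean(y_te == np.bincount(y_tr).argmax()))
```

Whether "does a good job" should be judged with a linear probe like this or with something richer is exactly the question taken up below.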
This approach naturally has flaws (all do), but I think it's worth thinking about this seriously. To do so, we have to take a step back and ask ourselves: what are we trying to do? Typically, it seems we want to make an argument that a system that was not (obviously) designed to encode some phenomenon (like POS tags) and was not trained (specifically) to predict that phenomenon has nonetheless managed to infer that structure. (We then typically go on to say something like "who needs no POS tags" even though we just demonstrated our belief that they're meaningful by evaluating them... but okay.)
As a first observation, there is an entire field of study dedicated to answering questions like this: (psycho)linguists. Admittedly they only answer questions like this in humans and not in machines, but if you've ever posed to yourself the question "do humans encode/represent phrase structures in their brains" and don't know the answer (or if you've never thought about this question!) then you should go talk to some linguists. More classical linguists would answer these questions with tests like, for instance, constituency tests or scoping tests. I like Colin Phillips' encyclopedia article on syntax for a gentle introduction (and it's what I start with for syntax in intro NLP).
So, as a starting point for "has my system learned X" we might ask our linguist friends how they determine if a human has learned X. Some techniques are difficult to replicate in machines (e.g., eye movement experiments, though of course models that have something akin to alignment---or "attention" if you must---could be thought of as having something like eye movements, though I would be hesitant to take this analogy too far). But many are not: for instance, behavioral experiments, analyzing errors, and, I hesitate somewhat to say it, grammaticality judgements.
My second comment has to do with the notion of "can these encodings be used to predict POS tags." Suppose the answer is "yes." What does that mean? Suppose the answer is "no."
In order to interpret the answer to these questions, we have to get a bit more formal. We're going to train a classifier to do something like "predict POS given embedding." Okay, so what hypothesis space does that classifier have access to? Perhaps you say it gets a linear hypothesis space, in which case I ask: if it fails, why is that useful? It just means that POS cannot be decoded linearly from this encoding. Perhaps you make the hypothesis space outrageously complicated, in which case I ask: if it succeeds, what does that tell us?
The reason I ask these questions is because I think it's useful to think about two extreme cases.
- We know that we can embed 200k words in about 300 dimensions with nearly orthogonal vectors. This means that for all intents and purposes, if we wanted, we could consider ourselves to be working with a one-hot word representation. We know that, to some degree, POS tags are predictable from words, especially if we allow for complex hypothesis spaces. But this is uninteresting because by any reasonable account, this representation has not encoded anything interesting: it's just the output classifier that's doing something interesting. That is to say: if your test can do well on the raw words as input, then it's dubious as a test. (A quick numerical check of the near-orthogonality claim appears after this list.)
- We also know that some things are just unpredictable. Suppose I had a representation that perfectly encoded everything I could possibly want. But then in the "last layer" it got run through some encryption protocol. All of the information is still there, so the representation in some sense "contains" the POS tags, but no classifier is going to be able to extract it. That is to say, just because the encoding isn't on the "surface" doesn't mean it's not there. Now, one could reasonably argue something like "well if the information is there in an impossible-to-decode format then it might as well not be there" but this slope gets slippery very quickly.
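On the first point above, here is a quick numerical check, assuming numpy, that a couple hundred thousand random vectors in roughly 300 dimensions are very nearly pairwise orthogonal (random Gaussian vectors here, not real embeddings).

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, dim = 200_000, 300

# Random unit vectors: in high dimensions these are nearly orthogonal with high probability.
V = rng.standard_normal((n_words, dim), dtype=np.float32)
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Checking all ~2e10 pairs is too expensive, so sample 100k random pairs.
i = rng.integers(0, n_words, size=100_000)
j = rng.integers(0, n_words, size=100_000)
cos = np.einsum("ij,ij->i", V[i], V[j])[i != j]

print("typical |cosine|:", np.abs(cos).mean())      # roughly 1/sqrt(300), i.e. about 0.05
print("max |cosine| in sample:", np.abs(cos).max())  # still small, so effectively one-hot-ish
```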
EDIT 26 Jul 2016, 8:24p Eastern: This was unclear to a few people, so a clarification. I'm mostly not talking about type-level word embeddings above, but embeddings in context. At a type level, you could imagine evaluating (1) on out-of-vocabulary terms, which would be totally reasonable. I'm thinking more of something like: the state of your biLSTM in a neural MT system. The issue is that if, for instance, this biLSTM can repredict the input (as in an autoencoder), then it could be that the POS tagger is doing all the work. See this conversation thread with Yoav Goldberg for some discussion.
12 July 2016
Some picks from NAACL 2016
Usual caveats: didn't see all talks, didn't read all papers, there's lots of good stuff at NAACL that isn't listed here! That said, here are some papers I particularly liked at NAACL, with some comments. Please add comments with papers you liked!
Automatic Summarization of Student Course Feedback by Luo, Liu, Liu and Litman.
Anyone who has taught has suffered the following dilemma. You ask students for feedback throughout the course, and you have to provide free text because if you could anticipate their problems, you'd have addressed them already. But now you have hundreds of responses that you can't quickly read through. This paper provides a dataset of such responses, together with summaries that you actually have time to read through. The approach is a reasonably standard ILP formulation, with some additional machinery to deal with the fact that the data is very sparse. The idea is to essentially induce a "bigram similarity" as part of the summarization problem. I like this paper because the problem is great (I think NLP should really push in the direction of helping learners learn and teachers teach!), the dataset is nice (if a bit small) and the approach makes sense. And they actually do a human evaluation, even for a short paper! Hooray!
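For readers who haven't seen one, this is roughly what a "standard" bigram-coverage summarization ILP looks like; the sketch below (using the pulp solver, which is assumed installed) is a generic, Gillick-and-Favre-style formulation and deliberately leaves out the sparsity/bigram-similarity machinery that is the actual contribution of this paper.

```python
import pulp
from collections import Counter

def bigrams(sentence):
    toks = sentence.lower().split()
    return set(zip(toks, toks[1:]))

def summarize(sentences, max_words=20):
    """Pick sentences maximizing the total weight of covered bigram 'concepts' under a length budget."""
    sent_bigrams = [bigrams(s) for s in sentences]
    weights = Counter(b for bg in sent_bigrams for b in bg)  # concept weight = document frequency

    prob = pulp.LpProblem("summarization", pulp.LpMaximize)
    pick = [pulp.LpVariable(f"s{i}", cat="Binary") for i in range(len(sentences))]
    cover = {b: pulp.LpVariable(f"c{k}", cat="Binary") for k, b in enumerate(weights)}

    prob += pulp.lpSum(weights[b] * cover[b] for b in weights)  # objective: covered concept weight
    prob += pulp.lpSum(len(s.split()) * pick[i] for i, s in enumerate(sentences)) <= max_words
    for b in weights:  # a concept only counts as covered if some selected sentence contains it
        prob += cover[b] <= pulp.lpSum(pick[i] for i, bg in enumerate(sent_bigrams) if b in bg)

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [s for i, s in enumerate(sentences) if pick[i].value() > 0.5]

feedback = [  # made-up student responses
    "the pace of the lectures is too fast",
    "lectures are too fast and the slides are dense",
    "more examples in the homework would help",
    "homework needs more worked examples",
]
print(summarize(feedback, max_words=12))
```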
A Latent Variable Recurrent Neural Network for Discourse Relation Language Models by Ji, Haffari and Eisenstein.
This paper combines the nice rich hypothesis classes you get with the "anonymous" latent variables in neural networks with the modeling power that one gets from the ability to marginalize "structured" latent variables from classic graphical models land. This is applied to document language modeling, in which the latent variables are discourse relations (in the PDTB sense). The model works well both for language modeling and for predicting implicit discourse relations and dialogue acts. (And, as one should expect from a paper with Eisenstein as an author, there are statistical significance tests!)
Structured Prediction with Output Embeddings for Semantic Image Annotation by Quattoni, Ramisa, Madhyastha, Simo-Serra and Moreno-Noguer.
If you want to label images with complex structured outputs, like play(dog,grass), you quickly run into sparsity problems on the output. The proposal here is to decompose the outputs into embeddings (kind of like Vivek's work and Jags' work) and to learn a bilinear model of inputs/outputs in that space. In general, I think there's a lot to be done in interesting modeling of structured output spaces, and this paper gives a new set of techniques in this space.
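As a cartoon of what scoring structured outputs in an embedded output space can look like (a generic bilinear compatibility model with a margin-based update, assuming numpy; this is not the paper's exact model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_candidates = 512, 50, 6

# Stand-ins: x would be image features; rows of E would be embeddings of structured
# outputs like play(dog, grass), e.g. built from vectors for the predicate and arguments.
x = rng.normal(size=d_in)
E = rng.normal(size=(n_candidates, d_out))
W = rng.normal(size=(d_in, d_out)) * 0.01  # bilinear parameters to be learned

def scores(W, x, E):
    """Bilinear compatibility s(x, y) = x^T W e(y), for every candidate output embedding."""
    return E @ (W.T @ x)

def hinge_update(W, x, E, gold, lr=0.1, margin=1.0):
    """One margin step: make the gold output outscore the most violating wrong one by a margin."""
    s = scores(W, x, E)
    wrong = int(np.argmax(s + margin * (np.arange(len(s)) != gold)))  # loss-augmented argmax
    if wrong != gold and s[wrong] + margin > s[gold]:
        W = W + lr * np.outer(x, E[gold] - E[wrong])  # subgradient of the hinge loss
    return W

gold = 2
for _ in range(20):
    W = hinge_update(W, x, E, gold)
print(int(np.argmax(scores(W, x, E))) == gold)  # should print True after a few updates
```

Because outputs only enter through their embeddings, rare or unseen outputs can still be scored, which is the point of moving to an embedded output space.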
Deconstructing Complex Search Tasks: A Bayesian Nonparametric Approach for Extracting Sub-tasks by Mehrotra, Bhattacharya and Yilmaz.
The problem considered here is the following: if I want to go to ACL, I need to register, book a flight, book a hotel, find some attractions to go to while skipping sessions, look up all the good vegan restaurants in Berlin (hah!), etc. My overall task is going to ACL, but there are a number of highly related but different subtasks. The challenge is to infer these subtasks from search logs, so that you can provide better search support. The model is a Chinese Restaurant Process with word embeddings used to measure similarities. And look, another short paper with a human evaluation!
A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories by Mostafazadeh, Chambers, He, Parikh, Batra, Vanderwende, Kohli and Allen.
This paper introduces the task of story cloze: given a story prefix, predict which of two endings is the "correct" (natural) one. Eg: "Jim got his first credit card in college. He didn't have a job so he bought everything on his card. After he graduated he amounted a $10,000 debt. Jim realized that he was foolish to spend so much money." Then you have to decide on the right ending: "Jim decided to devise a plan for repayment." (correct) versus "Jim decided to open another credit card." (incorrect). The data set was quite well constructed (lots of checks), and a large collection of baseline models are run for comparison. The data is available. I like this paper for the task and the very very careful data set collection. Since this was establishing baselines, I would really have liked to see error bars on the results so we can have more reasonable future comparisons. I'm really tempted to try to annotate some of these with plot units, but that was really really painful the first time around; but I feel like that's a good explanatory theory for a lot of the examples shown in the paper.
Learning Global Features for Coreference Resolution by Wiseman, Rush and Shieber.
The basic idea here is to take a "local" coreference model, that makes decisions about assigning mentions to entities, and augment it with a "global" model that models the entire cluster of mentions. In past work, cluster-level features have been hard to define (e.g., what are the right features to extract over sets of arbitrary size?). What's the solution? Essentially, embed the clusters. I like this paper because I wanted to try to do something like this and Wiseman did it better than I would have. I think there's also an interesting interpretation of this work in terms of things like neural Turing machines/memory networks, in which we now have technology for learning to update memory (where "memory" here is the "clusters"). My only gripe is that, like most papers in the new wave of coreference, the older ML approaches have been somewhat forgotten (except for Vincent Ng who is unforgettable); I'm thinking in particular about the work on ACE, like that of Luo, Ittycheriah, Jing, Kambhatla and Roukos on Bell-tree coreference, which I don't think gets enough notice these days for introducing a lot of the ideas that make up modern coref.
Visual Storytelling by Huang, Ferraro, Mostafazadeh, Misra, Agrawal, Devlin, Girshick, He, Kohli, Batra, Zitnick, Parikh, Vanderwende, Galley and Mitchell.
The main observation in this paper is that how you caption images in a sequence is different from how you caption them in isolation. For instance, if you have a temporal sequence of images from, say, a wedding, the captions for them should tell a story. This paper primarily introduces a dataset and baseline approaches for addressing the problem in this dataset. I like this paper, even if you remove the image stuff, because it emphasizes the fact that stories have a different structure than just sequences of sentences. The image stuff gives a grounding that's interesting beyond that. One problem pointed out in the paper is that automatic metrics aren't great here, which is problematic, but not particularly surprising.
05 July 2016
Rating the quality of reviews, after the fact
Groan groan groan reviewers are horrible people. Not you and me. Those other reviewers over there!
tldr: In general we actually don't think our reviews are that bad, though of course it's easy to remember the bad ones. Author perception of review quality is colored by, but not determined by, the overall accept/reject decision and/or the overall score that review gave to the paper.
NIPS did an experiment a bunch of years ago (can't dig it up any more, now it's an urban legend) where they asked authors, at (I think) the time of author feedback, to rate the reviews. The anecdotal story was that there was almost perfect correlation between "this is a good review" and "this review gave my paper a high score." Of course this is not super surprising, even if you get rid of "emotions," because presumably I like my paper and so any review that doesn't like it is flawed.
For NAACL 2013, we did a similar experiment, but we asked authors for their responses several months after the fact (actually, even after the conference had taken place), at which point hopefully emotions had cooled a bit and they could look back at their reviews with a sort of fond recollection. We presented the contact author of each paper with the original review text for each of their reviews, but did not show them the original scores. We asked them, on a standard Likert scale, how helpful and how informative each review was.
Because this was long after the fact, response rate was of course not 100%, and it was also biased toward authors of papers that were accepted. We got responses from 128 authors on a total of 138 papers (some authors were the contact author for more than one paper), covering a total of 397 reviews (roughly three per paper, but some short papers only had two, and some papers had four).
All the plots below are restricted to this set of 138 papers, not to the full set of about 500.
First, let's get a sense of the data. Here are the overall results for this entire set of 138 papers:
(Note that the numbers add up to 397, not 138, because this is counting per-review not per-paper.) The first row shows the accept/reject ratio. Since NAACL 2013 had an acceptance rate between 25% and 30%, obviously survey results are biased toward accepted papers, but we still have a healthy response rate from rejected papers.
Overall, the vast majority (~80%) of reviews were considered both informative and helpful (score 4 or 5) according to the authors. So yes, we need to do something about the 20% of reviews that got a 1, 2 or 3 on the Likert scale, but we're actually not doing that horribly. (Modulo sample selection bias.) The papers themselves were considered overwhelmingly appropriate and clear. The overall score distribution matches (roughly) the overall score distribution for the entire conference.
Let's look at what happens if we look only at accepted or rejected papers:
Comparing these, we definitely see a bit of the NIPS effect. For accepted papers, the reviews were considered overwhelmingly informative and helpful (scores of 4 or 5 in 85% or more cases). However, for rejected papers, the reviews were still considered largely informative and helpful (~73% of cases were 4s and 5s). Not surprisingly, accepted papers fare quite well on the individual score metrics, in particular overall score (duh!).
We can alternatively condition the analysis on the overall paper score rather than the final accept/reject decision. Here's how that looks:
That's not substantially different.
So what makes the difference between good (informativeness 4 or 5) and bad (informativeness 1 or 2) reviews?
On average, the "good" reviews were about 15% longer than the bad reviews (on average 320 characters versus 280 characters).
Somewhat surprisingly, a linear classifier on bag-of-words features can distinguish "good" from "bad" reviews with 90% accuracy, but the features it gives high weight to basically look like the features of positive versus negative reviews: it's essentially exploiting the correlation between informativeness and acceptance, rather than picking up on informativeness on its own.
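For the curious, the classifier described above is just a bag-of-words linear model. A sketch, assuming scikit-learn, with placeholder data standing in for the review texts and author ratings, and reading the heaviest-weighted features off the fitted model:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder data: in the real analysis, texts are review bodies and labels mark reviews
# the authors rated informative (1) versus uninformative (0).
texts = [
    "the paper is well written and the experiments are convincing",
    "interesting idea but the evaluation needs more baselines and analysis",
    "reject, not novel",
    "i did not understand the method",
]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# If the top features look like sentiment words ("convincing", "reject") rather than markers
# of review quality, the classifier is riding the informativeness/acceptance correlation.
vocab = np.array(vec.get_feature_names_out())
order = np.argsort(clf.coef_[0])
print("most 'bad-review' features: ", vocab[order[:5]])
print("most 'good-review' features:", vocab[order[-5:]])
```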