I remember back in grad school days some subset of the field was thinking about the following question. I train an unsupervised HMM on some language data to get something-like-part-of-speech tags out. And naturally the question arises: these tags that come out... what are they actually encoding?
At the time, there were essentially three ways of approaching this question that I knew about:
- Do a head-to-head comparison, in which you build an offline matching between induced tags and "true" tags, and then evaluate the accuracy of that matching. This was the standard evaluation strategy for unsupervised POS tagging, but is really just trying to get at the question of: how correlated are the induced tags with what we hope comes out.
- Take a system that expects true POS tags and give it induced POS tags instead (at both training and test time). See how much it suffers (if at all). Joshua Goodman told me a few times (though I can't find his paper on this) that word clusters were just as good as POS tags if your task was NER.
- Do something like #2, but also give the system both POS tags and induced tags, and see if the POS tags give you anything above and beyond the induced tags.
In fact, of the above approaches, the only one that requires any modification is #1 because there's not an obvious way to do the matching. The alternative is to let a classifier do the matching, rather than an offline process. In particular, you take your embeddings, and then try to train a classifier that predicts POS tags from the embeddings directly. (Note: I claim this generalizes #1 because if you did this with discrete tags, the classifier would simply learn to do the matching that we used to compute "by hand" offline.) If your classifier can do a good job, then you're happy.
This approach naturally has flaws (all do), but I think it's worth thinking about this seriously. To do so, we have to take a step back and ask ourselves: what are we trying to do? Typically, it seems we want to make an argument that a system that was not (obviously) designed to encode some phenomenon (like POS tags) and was not trained (specifically) to predict that phenomenon has nonetheless managed to infer that structure. (We then typically go on to say something like "who needs no POS tags" even though we just demonstrated our belief that they're meaningful by evaluating them... but okay.)
As a first observation, there is an entire field of study dedicated to answering questions like this: (psycho)linguists. Admittedly they only answer questions like this in humans and not in machines, but if you've ever posed to yourself the question "do humans encode/represented phrase structures in their brains" and don't know the answer (or if you've never thought about this question!) then you should go talk to some linguists. More classical linguists would answer these questions with tests like, for instance, constituency tests or scoping tests. I like Colin Phillips' encyclopedia article on syntax for a gentle introduction (and is what I start with for syntax in intro NLP).
So, as a starting point for "has my system learned X" we might ask our linguist friends how they determine if a human has learned X. Some techniques are difficult to replicate in machine (e.g., eye movement experiments, though of course models that have something akin to alignment---or "attention" if you must---could be thought of as having something like eye movements, though I would be hesitant to take this analogy too far). But many are not, for instance behavioral experiments, analyzing errors, and, I hesitate somewhat to say it, grammaticality judgements.
My second comment has to do with the notion of "can these encodings be used to predict POS tags." Suppose the answer is "yes." What does that mean? Suppose the answer is "no."
In order to interpret the answer to these questions, we have to get a bit more formal. We're going to train a classifier to do something like "predict POS given embedding." Okay, so what hypothesis space does that classifier have access to? Perhaps you say it gets a linear hypothesis space, in which case I ask: if it fails, why is that useful? It just means that POS cannot be decoded linearly from this encoding. Perhaps you make the hypothesis space outrageously complicated, in which case I ask: if it succeeds, what does that tell us?
The reason I ask these questions is because I think it's useful to think about two extreme cases.
- We know that we can embed 200k words in about 300 dimensions with nearly orthogonal vectors. This means that for all intents and purposes, if we wanted, we could consider ourselves to be working with a one-hot word representation. We know that, to some degree, POS tags are predictable from words, especially if we allow for complex hypothesis spaces. But this is uninteresting because by any reasonable account, this representation has not encoded anything interesting: it's just the output classifier that's doing something interesting. That is to say: if your test can do well on the raw words as input, then it's dubious as a test.
- We also know that some things are just unpredictable. Suppose I had a representation that perfectly encoded everything I could possibly want. But then in the "last layer" it got run through some encryption protocol. All of the information is still there, so the representation in some sense "contains" the POS tags, but no classifier is going to be able to extract it. That is to say, just because the encoded isn't on the "surface" doesn't mean it's not there. Now, one could reasonably argue something like "well if the information is there in an impossible-to-decode format then it might as well not be there" but this slope gets slippery very quickly.
EDIT 26 Jul 2016, 8:24p Eastern: It's unclear to a few people so clarification. I'm mostly not talking about type-level word embeddings above, but embeddings in context. At a type-level, you could imagine evaluating (1) on out of vocabular terms, which would be totally reasonable. I'm think more something like: the state of your biLSTM in a neural MT system. The issue is that if, for instance, this biLSTM can repredict the input (as in an autoencoder), then it could be that the POS tagger is doing all the work. See this conversation thread with Yoav Goldberg for some discussion.