27 December 2012

Teaching (intro, grad) NLP

I had a post a while back on teaching ML that I still basically stand by. I've taught intro grad NLP here at UMD twice now, and a sort-of-similar course back at Utah once. I find these courses really hard to teach. And not for the usually bemoaned reason of the CS/linguistics mix -- I think it's possible to deal with that, and certainly it's an issue that's been talked about a lot.

What I find difficult is that NLP (and CL) is a collection of problems, techniques, ideas, frameworks, etc. that really are not tied together in any reasonable way other than the fact that they have to do with NLP. Even if you manage to answer questions about "what sort of topics are most interesting?" you're still faced with the problem that every time you switch topics, the entire context in which you're discussing them changes. This is exacerbated by the fact that things like tagging and parsing are hopelessly boring (in comparison to all the cool interesting stuff in NLP these days), and yet so many modern ideas are based on understanding basic dynamic programming for tree structures and things like that.

To make things a bit more concrete, a standard intro NLP class might start with morphology. Okay, so you have to explain what morphemes are and why they're important. Now, you'll probably take a finite state approach, so you have to explain transducers. If you want these things to work, you have to explain weighted transducers. Do you do probabilities, in which case there's the whole local vs. global normalization stuff that takes more time? So now you want to do POS tagging or something. Fine, you can do that with finite state models too. But no one actually does this any more (except lingpipe :P). So you have to explain POS stuff, perhaps how this works in non-English, and then you can leave them with HMMs (maybe talking about the Viterbi algorithm) or do lots of ML so you can get to CRFs or structured perceptron or something. And we're still at POS tagging. Now you switch to parsing. Back to square one. And then you want to do compositional semantics; now there's lots more structure, lots more features, and so on. But even now things are at least somewhat connected. But then you talk about lexical semantics, be it distributed representations or WSD or whatever, and the problem is new, the techniques are new (do you teach Yarowsky?), the evaluation is new, and so on.

I think it's worth contrasting this with ML.  I find ML remarkably easy to teach (so I'm flipping the classroom this coming Spring for the UG version to make it more exciting) despite the fact that the material is (in many ways) much harder for CS types.  The thing that is nice about ML is that the problem is basically always the same (or at least changes only once, when you switch from supervised to unsupervised).  In that sense, ML tends to be a course about techniques for a relatively fixed problem (or at least fixed problem type).  This makes for significantly less context switching, which makes learning easier (and thereby makes teaching easier).

So the question I wanted to ask is: can we do something similar in NLP? The crazy idea that I'm sure everyone will say is insane is the following: teach NLP as a course about what you can do with log-linear models. Here's how I envision it. You spend the first day talking about NLP and why data is important, ambiguity, etc., just like normal. You spend the next two days explaining enough about log-linear models that you can treat them as given for the rest of the semester. Maybe you show how to optimize them by gradient descent or something -- basically enough that anyone who is simultaneously taking ML will get more out of it, while those who are not are fine with LL models as a black box.
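
To make "treat them as given" concrete, here is roughly the level of detail I have in mind -- a minimal sketch (not any particular toolkit) of a multiclass log-linear model over sparse feature dictionaries, trained by stochastic gradient ascent on the conditional log-likelihood. The feature function and labels are placeholders the students would supply; everything interesting for the rest of the semester happens in feature_fn, not in here.

    import math
    from collections import defaultdict

    def train_loglinear(examples, feature_fn, epochs=20, lr=0.1):
        """examples: list of (x, y) pairs; feature_fn(x, y) -> {feature: value}.
        Returns (weights, label list)."""
        labels = sorted({y for _, y in examples})
        w = defaultdict(float)
        for _ in range(epochs):
            for x, y in examples:
                # scores and the (local) partition function over the label set
                scores = {yp: sum(w[f] * v for f, v in feature_fn(x, yp).items())
                          for yp in labels}
                m = max(scores.values())
                Z = sum(math.exp(s - m) for s in scores.values())
                # gradient step: observed features minus expected features
                for f, v in feature_fn(x, y).items():
                    w[f] += lr * v
                for yp in labels:
                    p = math.exp(scores[yp] - m) / Z
                    for f, v in feature_fn(x, yp).items():
                        w[f] -= lr * p * v
        return w, labels

    def predict(w, labels, feature_fn, x):
        return max(labels, key=lambda y: sum(w[f] * v
                                             for f, v in feature_fn(x, y).items()))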

Now, when you teach different topics, the framework in which you discuss them is the same.  You have a structured problem (which forces you to talk about algorithms like Viterbi or CKY) with interesting ambiguities (which forces you to talk about features).  Then, the class essentially becomes a sequence of problems, associated algorithms and relevant features.  The rest is left as a black box, which can be provided off the shelf for programming projects, and they can focus on the interesting and more NLP-ish problems of algorithms and features.  You could even start with something like sentiment classification (at a document level) to make the beginning gentle.
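
And the "algorithms" half is the part they do write themselves. For a first-order tagging model, for instance, the assignment is essentially a local scoring function plus Viterbi -- something like the sketch below, where score_fn stands in for a dot product between the black-box weights above and whatever local features the students dream up:

    def viterbi(words, tags, score_fn):
        """Best tag sequence under a first-order model.
        score_fn(words, i, prev_tag, tag) -> local score (prev_tag is None at i=0)."""
        n = len(words)
        # best[i][t] = (score of best sequence for words[:i+1] ending in t, backpointer)
        best = [{t: (score_fn(words, 0, None, t), None) for t in tags}]
        for i in range(1, n):
            best.append({})
            for t in tags:
                score, back = max(
                    ((best[i - 1][p][0] + score_fn(words, i, p, t), p) for p in tags),
                    key=lambda pair: pair[0])
                best[i][t] = (score, back)
        # follow backpointers from the best final tag
        tag = max(tags, key=lambda t: best[n - 1][t][0])
        seq = [tag]
        for i in range(n - 1, 0, -1):
            seq.append(best[i][seq[-1]][1])
        return list(reversed(seq))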

I realize there are some things you couldn't do this way, or would be very awkward to do this way.  Anything generative or unsupervised, which often go together.  For instance, word alignment via the IBM models won't fit.  Topic models won't fit (though I don't usually do them anyway -- maybe I should).  Probably there are some other things too.

Anyway, I'd be curious to hear what people think of this idea.  I know it's biased by my own view of the world, but hey -- that's why I'm a professor (or at least why I assist professors...).  Or if anyone has tried it.

19 comments:

Chris said...

Ed Hovy has been asking the question "what is the theory of NLP?" a lot lately. Maybe we need a symposium on this topic? :P

As for building a course around log-linear models, I'd say: interesting idea, but what about all the problems where we actually want to model language and not just low-dimensional analyses of language conditioned on observed language? Using log-linear models for the former is really hard since you end up having to compute partition functions of massive spaces. Even if you normalize everything locally, summing over real-world vocabularies is really expensive.
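
(Concretely, with theta the weights and f the feature function, in the usual notation: for conditional analyses the normalizer ranges over tag sequences or trees, which dynamic programming handles, but for a locally normalized model of language itself you are computing, at every position, something like

    Z(x) = \sum_{w \in V} \exp\bigl(\theta \cdot f(x, w)\bigr)

where V is the entire vocabulary -- and that sum, inside every gradient step, is what gets expensive.)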

So I guess the question is: do we really want to model language (and not just conditional analyses) in the first semester? I don't know. It maybe sounds a bit old-fashioned, but I think I'd hate to see it go. First, if we do want to have a "theory of statistical NLP", I think it should be something about using statistical models to, you know, model language. Second, the only way to make headway on unsupervised learning is by modeling observed language, so you can't hide from the problem without getting rid of such classics as unsupervised POS induction or word alignment.

So, if you buy this argument, what's the solution? I see a few options: 1) you can argue that computers are fast and I shouldn't worry so much about computing partition functions, 2) confine yourself to small datasets, 3) teach first-semester NLP students about approximate inference, or 4) consider going back to the good old days of directly parameterized multinomials. This last one may sound horribly gauche, but maybe it's not terrible for a first-semester course? Start with directly parameterized multinomials for everything and do a bunch of different models (including language modeling, supervised POS tagging and parsing, unsupervised POS tagging, IBM Model 1, maybe topic modeling). The downside is that most of these won't be state-of-the-art, but you will have an excuse to cover a lot of ground, and basically the only "black box" they'll need is "count and normalize".
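
(And "count and normalize" really is the entire black box -- something like the sketch below, shown for a bigram language model with add-alpha smoothing standing in for whatever smoothing you prefer; the same count-and-normalize step becomes the M-step once you move on to the unsupervised models.)

    from collections import defaultdict

    def bigram_lm(corpus, alpha=1.0):
        """'Count and normalize': a bigram language model with add-alpha smoothing.
        corpus is a list of token lists; returns a function p(word | prev)."""
        counts = defaultdict(lambda: defaultdict(float))
        vocab = set()
        for sent in corpus:
            for prev, word in zip(["<s>"] + sent, sent + ["</s>"]):
                counts[prev][word] += 1
                vocab.add(word)
        vocab.add("</s>")
        V = len(vocab)
        def prob(word, prev):
            total = sum(counts[prev].values())
            return (counts[prev][word] + alpha) / (total + alpha * V)
        return prob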

Finally, if you do decide to start with log-linear models, who needs to teach the IBM models??

hal said...

@Chris: not sure I follow... what's the low-dimensional analysis bit? Can you give an example? Are you just saying p(x) versus p(y|x)? I'm not sure why the latter is such a bad thing, aside from the loss of unsupervised models (which you seem to ascribe more weight to than I do... perhaps partially because again I find this stuff really hard to teach at the right level).

Really the log-linear thing was an attempt to get away from stupid multinomials, precisely because of the problem you state: that you end up spending a lot of time learning about stuff that no one uses. And it doesn't get away from the complexity issue: it's *much* easier to talk about writing down interesting *features* in an LLM than to start talking about different generative stories and different parameterizations and all that.

Yoav said...

As a compromise between you and Chris, I'd say start with basic probability stuff based on MLEs. So you can do simple language modeling, generate some sentences, etc. Then move to HMMs for tagging. Once you get Viterbi going, it is natural to introduce the idea of parameterization, and that the MLE is just one way to set parameter values. From here it's a relatively easy shift to the structured perceptron as a way of setting the weights. Now you can talk about features, etc.
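
(The perceptron step itself is tiny, which is part of the appeal -- roughly the following, where decode is the Viterbi routine the students already wrote and features counts the features that fire on a whole tagged sentence; both are placeholders here:)

    from collections import defaultdict

    def structured_perceptron(train, tags, features, decode, epochs=5):
        """train: list of (words, gold_tags) pairs.
        features(words, tag_seq) -> {feature: count} over the whole sequence.
        decode(words, tags, weights) -> highest-scoring tag sequence (e.g. Viterbi).
        Update rule: add the gold features, subtract the predicted ones."""
        w = defaultdict(float)
        for _ in range(epochs):
            for words, gold in train:
                pred = decode(words, tags, w)
                if pred != gold:
                    for f, v in features(words, gold).items():
                        w[f] += v
                    for f, v in features(words, pred).items():
                        w[f] -= v
        return w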

At this point you can probably just go on with your original plan, but using the perceptron instead of log-linear models (what's the appeal of log-linear models?)

Suresh Venkatasubramanian said...

How do you plan to flip the ML class?

rrenaud said...

The NLP class I took started with n-gram language modelling. I found it interesting, and I certainly liked the lots-of-data-and-simple-models slant (Norvig-style AI). It also gives a very concrete introduction to Bayesian reasoning, smoothing, and the importance of priors. And I personally love the kind of "sideways thinking" that the Good-Turing estimate gives. Of course, language models don't really fit in well with other parts of the course, with the exception of machine translation.
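
(For anyone who hasn't seen the Good-Turing trick: re-estimate a count r from the count-of-counts as r* = (r + 1) * N_{r+1} / N_r. A bare-bones sketch, ignoring the smoothing of the N_r curve that real implementations need:)

    from collections import Counter

    def good_turing_counts(counts):
        """Adjusted counts r* = (r + 1) * N_{r+1} / N_r, where N_r is the number
        of types observed exactly r times. Falls back to the raw count when
        N_{r+1} is zero."""
        n = Counter(counts.values())   # count-of-counts
        adjusted = {}
        for item, r in counts.items():
            if n.get(r + 1, 0) > 0:
                adjusted[item] = (r + 1) * n[r + 1] / n[r]
            else:
                adjusted[item] = float(r)
        return adjusted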

Alexandre Passos said...

What if you start with the log-linear model perspective and then say that a simple way of training those is just counting and normalizing (but you don't get calibrated probabilities if you do that, nor can you handle overlapping features)? In the end these generative things are just a log-linear model for P(x|y) instead of P(y|x), if you squint really hard.

This perspective will let you go over HMMs for supervised things, though you might need to do one class on EM (again, as a way of unsupervisedly training log-linear models), which will let you do the IBM models and other generative things.

Kevin Duh said...

Yes, NLP is a collection of problems/techniques that aren't really tied together besides the fact that they all relate to language. Which makes me wonder if NLP will actually fragment into separate fields in the future. The love of language is the only thing that keeps us together.

I think your class idea is very nice. The step up from log-linear models to structured prediction, focusing on decoding algorithms and feature engineering, will elucidate how NLPers think about many problems.

For the other topics that don't fit in this framework, you can just give a brief survey at the end of the course. I think there's no need for a grand unifying theory, but some partially-unifying framework is helpful pedagogically.

Jacob said...

Hal, I spend the first 1/2 of my class doing pretty much what you describe. The first two lectures introduce ways of training log-linear models (including setting the weights as the MLE of "stupid multinomials"). Then we use linear models for structured prediction in sequences and trees. I guess I see what you mean about this stuff being boring compared to cutting-edge NLP, but the dynamic programs for tagging and parsing were the favorite topic for a lot of my students.

The second half of the class is more of a grab bag: MT, information extraction, discourse, semantics, etc. Even there, I think it's possible to emphasize a few running themes like EM and distributional similarity. I find discourse and semantics to be the most difficult to tie in with everything else.

My major concern after one semester of teaching this is that a lot of students got through the class without learning much of anything about language. This semester I'm going to try to fix that by giving 10 short annotation assignments, and making them self-grade their annotations in class. I'll report back on how this goes.

Chris said...

Yes, p(x,y) vs p(y|x) is what I was getting at.

I guess my concern is that giving students only LLMs in the first semester will make it hard for them to accept the limitations of stupid multinomials when they eventually get to problems like language/translation modeling. Stupid multinomials may not be good for all problems, but my argument is that they are indispensable for some problems that I think deserve coverage in a first semester course.

hal said...

@Yoav: that seems reasonable. I guess Chris really thinks multinomials are important (poor misguided MT souls... :P). I only say LLMs because if I say perceptron people will jump on me. You can always use the perceptron as an (I think inconsistent) training procedure for LLMs to get around partition function issues.

@Jacob: yeah, that sounds pretty much in line with my experience. Linguists tend to have a hard time with dynamic programming, but we seem to agree as a community that this is a really important topic.

@Chris: the p(x,y) vs p(y|x) doesn't seem at all important to me, but probably that's just my bias. I think of it as modeling language and interpretation versus modeling interpretation. I don't have a problem with the latter.

I guess the conclusion is that you can't get rid of count-and-normalize as an estimation procedure. And in thinking more, I wouldn't want to get rid of LMs at the beginning. So maybe you start with that up to some point (maybe HMMs, maybe earlier) and then swap it out so that students "get" why multinomials are stupid.

hal said...

Oh and @Chris: he's been saying that since at least 2001 :).

Philip said...

I guess my immediate reaction leans toward Chris's. Trying to wedge everything into log-linear models, and chucking whatever doesn't fit there, may be convenient for looking at problems in a more or less uniform way, but I think it dilutes the connection with the *language* part of NLP. And I would add that it probably would make it difficult for students to have any kind of accurate historical perspective on the field.

In the past we've generally started with non-statistical methods (e.g. the non-probabilistic version of CKY) and then moved from there to statistical methods. That's due for change, but I do think there's something essentially correct about communicating that, both historically and conceptually, NLP comes out of a fundamentally algebraic/structural view of the problem space. To use an analogy, if you were expressing the world in terms of graphical models, HMMs would be just one way of drawing circles and arrows among many others -- but if you don't emphasize the finite-state underpinnings, then you lose the opportunity to connect to, to pick just three, regular expressions, Gold's paradigm, and the center-embedding argument for the competence/performance distinction.

It seems to me that an alternative framework to consider would be one built around the different ways to "build in" linguistic knowledge or inductive bias in modern NLP -- namely (a) the structure of the model, (b) the choice of features, and (c) priors. Thinking out loud here, so to speak, it seems to me that with those three, the remaining missing ingredient is search (optimization, decoding). Combine (a-c) with search in the broad sense, and you've got a conceptual framework that makes it possible to get from the field's knowledge-based roots to the most current machine learning methods, be they discriminative or generative. Plus, by thinking about the world in terms of inductive bias and search, we're in line with the most fundamental issues in machine learning, and we're also in line with the way that linguists characterize the language learning problem (even if they don't describe what they're doing in this fashion).

hal said...

@Philip: on the historical note, yes, you'd lose some of that (though all you'd lose is the crazy ways people set parameters in models 20 years ago, which i don't think is a huge loss). starting with non-probabilistic versions of algorithms like CKY makes total sense pedagogically and i would definitely continue to do that. (it doesn't really make sense for HMMs.) and yeah, you'd lose finite state. i always have mixed feelings about this. that stuff is indeed super important. but it's also such a completely different paradigm than anything else. unless you want to teach that parsing is really finite state in disguise, but this is (IMO) getting way too complex for a first semester class where, among other things, i want to *excite* them about language, not about formal language.

i really like a lot of things about your alternative paradigm.... in fact, a large part of my goal of unifying parameter estimation is precisely so you can spend *more* time talking about a-c and search (i seem to give more weight to algorithms than you do). i want to get away from spending so much time talking about stupid parameter estimation and more time talking about important language stuff!

Rajhans said...

Very interesting and relevant discussion here. I'd like to suggest something half-baked and different from the suggestions already made -- not so much because I completely believe in this idea, but because I'm hoping it might add to the diversity of suggestions.

When I am explaining NLP to a layman, I always have trouble explaining POS tagging, parsing, etc. To me, it always seems easy to explain an application-centric view. In other words, it helps to first provide an easy-to-understand application (e.g. textual entailment, relation extraction, coreference, machine translation).

I think this top-down view can be useful in providing a more coherent structure to NLP from an application point-of-view. So one idea is to follow roughly the following steps:

1) First introduce a problem that you'd like to solve, say, textual entailment or relation extraction. Then convince the students that this is a hard problem.

2) Then talk about how you plan to "solve" (in as much as they can be solved :P) this problem by breaking it down into smaller steps. These steps can be syntactic (e.g. POS tagging) and semantic (e.g. semantic word clusters, semantic role labeling). One would then need to convince the students that these steps indeed yield useful information/features one can use towards solving the final task.

2.5) It may be useful to introduce the notion that almost all of these problems are "clustering" problems (e.g. POS tagging can be thought of as clustering based on local context with k=46 or something). This can also help introduce other kinds of clustering, like distributional clustering.

3) Then, assuming some ML background, delve into the details of solving each individual task. Introduce a machine learning tool that you'd like to use. I think using log-linear models is a great idea because IMO it is easier to think of probabilities than "discriminative scores".

4) Maybe consider both supervised and unsupervised approaches to estimating things.

Not sure what else...

Bottom line: while unifying from a linguistic theory perspective might be closer to ideal, unification from an application point of view is easier and arguably more accessible.

Robert Munro said...

Pretty much every commercial NLP project that I've worked on consisted of a selection of features and algorithms wrapped around black-box log-linear models, so I think it's a great idea as prep for industry.

For the academic side of things, I think there's more than enough richness in exploring exactly how the features and algorithms allow us to model different aspects of natural language (for a grad intro course, at least). Although it would be a shame to leave out machine translation.

Unknown said...

While I sympathize with the issue that an NLP course can tend to be a disparate collection of techniques, I think Hal's proposed solution just has to be wrong, since it denies the fundamental nature of the field. There are technique-oriented fields, like ML, which can be taught in a cumulative, technique-centered way, kind of like a math class. For such fields, this does give a very nice and satisfying organization. But most of human inquiry consists of domain-centered fields, whether it's biology, political science, astronomy or computational linguistics. It makes no sense to try to turn them into a technique-centered field and course.

So what have I been doing recently? Kind of what @Rajhans is suggesting. But it's not yet a common organization, so I'll advocate and explain it a little here. I start off teaching machine translation, and do that for the first 6 classes. Here's the organization I used in 2011 (this past year I actually did things a bit differently for various reasons, but this is a clear view of the proposal): http://www.stanford.edu/class/cs224n/syllabus.html

MT gets you straight into good and interesting language examples and a domain problem whose importance everyone can understand. It's also the most historically accurate way to convey the history of the field -- this is where it all began: http://www.aclweb.org/archive/misc/History.html. But I think it has two other good attributes. One is that it's a distinctive NLP thing. This depends on what other courses you have at your school, but certainly at Stanford I felt the problem that neighboring courses like machine learning, probabilistic graphical models, and even bioinformatics had "stolen" a lot of other beginning points for NLP. That is, people do a Naive Bayes spam classifier in the ML course. People do HMMs in the probabilistic graphical models course. It's not exactly that they've seen all of the NLP issues, but if you start at points like these, half the class starts off from a perspective of "this is stuff we've already done before". No one else has done MT. Secondly, I also find it a very good place to introduce several important topics. There are others (like spelling correction), but it is a particularly good place to introduce language models and their importance to applications. The classic IBM alignment algorithms are also a great place to introduce EM and for the students to experiment with it. Previously I'd done EM in contexts like HMMs and PCFGs, but that makes it much harder for students because they're doing EM and dynamic programming simultaneously. For IBM Model 1, they can focus on just the EM part. It's also a good first place to discuss evaluation and its difficulties, and, moving into modern phrase-based systems, a good place to first introduce feature-based log-linear models :).
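
(To give a sense of why Model 1 is such a clean EM exercise: the whole learner is just expected counts in the E-step and count-and-normalize in the M-step, roughly the sketch below, with a NULL source word and uniform initialization.)

    from collections import defaultdict

    def ibm_model1(bitext, iterations=10):
        """EM for IBM Model 1. bitext: list of (foreign_tokens, english_tokens)
        pairs. Returns translation probabilities t[(f, e)] ~ p(f | e)."""
        f_vocab = {f for fs, _ in bitext for f in fs}
        t = defaultdict(lambda: 1.0 / len(f_vocab))    # uniform initialization
        for _ in range(iterations):
            count = defaultdict(float)                 # E-step: expected counts
            total = defaultdict(float)
            for fs, es in bitext:
                es = ["<NULL>"] + es                   # allow alignment to NULL
                for f in fs:
                    norm = sum(t[(f, e)] for e in es)
                    for e in es:
                        frac = t[(f, e)] / norm
                        count[(f, e)] += frac
                        total[e] += frac
            for f, e in count:                         # M-step: count and normalize
                t[(f, e)] = count[(f, e)] / total[e]
        return t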

Now, one could just keep going with MT for the whole course, since after all, there's a lot of work now on parsing in MT models and using word senses and semantics in various ways. But I don't do that. One reason is that parsing in MT is clearly more difficult and complex (synchronous grammars, etc.) than plain (P)CFG parsing. Another is that it would lead to systems that are too large for convenient student assignments. And, thirdly, even though MT is a cool application, it's not what the majority of students are likely to actually do. It's more useful to teach them about things like NER, relation extraction, coreference. And so I move to such topics - and parsing :). However, this thread did make me think that maybe I should try harder to turn the second part of the course also into a coherent unit with an overarching application (probably relation extraction or event extraction) and then to introduce NER, parsing, coreference in the context of solving it.

Gabriel said...

Along the lines of Rajhans and Christopher, an application-centric top-down approach could be based on dialogue systems, or spoken dialogue systems. You can fit almost everything under that very wide umbrella (syntactic parsing, computational semantics, speech recognition, speech synthesis, NLG, MT for multi-lingual dialogue, etc.).

Anonymous said...

Hal --- LingPipe has binary linear-chain CRFs with pluggable Gaussian, Laplace, Cauchy, or elastic net priors and also integration with our feature extractor framework. There is full marginal decoding, n-best sequence decoding, and an integration with the standard BIO-type encodings of entities to allow n-best entity extraction, too. It also implements our tagging and chunking frameworks, so it plugs into all the evaluation and error analysis code, too. Training is via stochastic gradient descent.

I still stand by first-order HMMs with character language model emissions as a go-to out-of-the-box tagging model.

Anonymous said...

I agree with Chris (Manning) that you want to focus on language.

To me, it seems your real problem is that the students haven't had the necessary pre-reqs, which nowadays include linguistics, algorithm analysis, and stats/ML. So one solution is to make them take the pre-reqs. Of course, linguistics and stats/ML are so all over the place that it might not help much --- they might only know Chomskyan generative grammar and SVMs at that point.

I sat in on Chris's first (stat) NLP class at CMU in 1994 or 1995 (this was before his book with Schuetze, though a natural precursor to it). I liked how he started it with a real understandable application (language ID) along with a no-holds-barred assignment to implement it however you wanted. (This problem's too easy nowadays, but unconstrained full-text MT is too hard.) It got the students involved AND motivated language modeling.

Before we hired Chris, I'd been teaching old-school NLP focusing on finite-state transducers, parsing, and unification-based grammars. Still not a lot of language content per se.

So I very much like the idea of doing data annotation. It helps people understand how language works and just how subtle the problems are, even with simple tasks like word sense or part-of-speech tagging. (I don't like entailment or the whole RTE game because, like Strawson, I believe entailment is a property of sentence usages in contexts, not sentences themselves --- everyone should also do some philosophy of language, which I also used to teach :-))