natural language processing blog: 2011/09

Originally in the context of Braque and now in the context of FUSE, I've thought a bit about understanding the role of techniques and tasks in scientific papers (admittedly, mostly NLP and ML, which I realize are odd and biased). I worked with Sandeep Pokkunuri, a MS student at Utah, looking at the following problem: given a paper (title, abstract, fulltext), determine what task is being solved and what technique is being used to solve it. For instance, a paper like "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data" the task would be "segmenting and labeling sequence data" and the technique would be "conditional random fields."

You can actually go a long way just looking for simple patterns in paper titles, like "TECH for TASK" or "TASK by TECH" and a few things like that (after doing some NP chunking and clean-up). From there you can get a good list of seed tasks and techniques, and could conceivably bootstrap your way from there. We never got a solid result out of these, and sadly I moved and Sandeep graduated and it never went anywhere. What we wanted to do was automatically generate tables of "for this TASK, here are all the TECHs that have been applied (and maybe here are some results) oh and by the way maybe applying these other TECHs would make sense." Or visa-verse: this TECH has been applied to blah blah blah tasks. You might even be able to tell what TECHs are better for what types of tasks, but that's quite a bit more challenging.

At any rate, a sort of "obvious in retrospect" thing that we noticed was that what I might consider a technique, you might consider a task. And you can construct a chain, typically all the way back to math. For instance, I might consider movie recommendations a task. To solve recommendations, I apply the technique of sparse matrix factorization. But then to you, sparse matrix factorization is a task and to solve it, you apply the technique of compressive sensing. But to Scott Tanner, compressive sensing is a task, and he applies the technique of smoothed analysis (okay this is now false, but you get the idea). But to Daniel Spielman, smoothed analysis is the task, and he applies the technique of some other sort of crazy math. And then eventually you get to set theory (or some might claim you get to category theory, but they're weirdos :P).

(Note: I suspect the same thing happens in other fields, like bio, chem, physics, etc., but I cannot offer such an example because I don't know those areas. Although not so obvious, I do think it holds in math: I use the proof technique of Shelah35 to prove blah -- there, both theorems and proof techniques are objects.)

At first, this was an annoying observation. It meant that our ontology of the world into tasks and techniques was broken. But it did imply something of a richer structure than this simple ontology. For instance, one might posit as a theory of science and technologies studies (STS, a subfield of social science concerned with related things) that the most basic thing that matters is that you have objects (things of study) and an appliedTo relationship. So recommender systems, matrix factorization, compressive sensing, smoothed analysis, set theory, etc., are all objects, and they are linked by appliedTos.

You can then start thinking about what sort of properties appliedTo might have. It's certainly not a function (many things can be applied to any X, and any Y can be applied to many things). I'm pretty sure it should be antireflexive (you cannot apply X to solve X). It should probably also be antisymmetric (if X is applied to Y, probably Y cannot be applied to X). Transitivity is not so obvious, but I think you could argue that it might hold: if I apply gradient descent to an optimization problem, and my particular implementation of gradient descent uses line search, then I kind of am applying line search to my problem, though perhaps not directly. (I'd certainly be interested to hear of counter-examples if any come to mind!)

If this is true, then what we're really talking about is something like a directed acyclic graph, which at least at a first cut seems like a reasonable model for this world. Probably you can find exceptions to almost everything I've said, but that's why you need statistical models or other things that can deal with "noise" (aka model misspecification).

Actually something more like a directed acyclic hypergraph might make sense, since often you simultaneously apply several techniques in tandem to solve a problem. For instance, I apply subgradient descent and L1 regularization to my binary classification problem -- the fact that these two are being applied together rather than separately seems important somehow.

Not that we've gone anywhere with modeling the world like this, but I definitely thing there are some interesting questions buried in this problem.

As you've noticed, I haven't posted in a while. I've also not been reading blogs. My unread number of posts is now 462. Clearly I'm not going to go back and read all 462 posts that I missed. I will claim that this was an experiment to see what a (nearly) blog-free world is like.

I actually found that I missed both the reading and the writing, so now (especially that I've switch over to public transportation and so have about an hour to kill in transportation time) I'm going to go back to reading while being transported and blogging when I have time.

I figured I'd return to blogging by saying a bit about a recent experience. Less than a month ago I had the honor of serving on Jurgen Van Gael's Ph.D. examination committee. Jurgen did an excellent job and, as perhaps expected, passed. But what I want to talk about is how the UK model (or at least the Cambridge model) is different from the US model.

In the UK, the examination is done by two faculty members, one internal (this was Stephen Clark) and one external (that was me). It does not involve the advisor/supervisor, though this person can sit in the room without speaking :). There is no public presentation and the process we followed was basically to go through the dissertation chapter-by-chapter, ask clarification questions, perhaps some things to get Jurgen to think on his toes, and so on. This took about two hours.

Contrast this to the (prototypical) US model, where a committee consists of 5 people, perhaps one external (either external to CS or to the university, depending on how your institution sets it up), and includes the advisor. The defense is typically a 45 minute public presentation followed by questions from the committee in a closed-room environment with the student.

Having been involved, now, in both types, I have to say they each have their pros and cons. I think the lack of a public presentation in the UK model is a bit of a shame, though of course students could decide to do these anyway. But it's nice to have something official for parents or spouses to come to if they'd like. However, in the US, the public presentation, plus the larger committee, probably leads to situation that students often joke about that not even their committee reads their dissertation. You can always fall back on the presentation, much like students skip class reading when they know that the lecture will cover it all. When it was just me, Stephen and Jurgen, there's really no hiding in the background :).

I also like how in the UK model, you can skip over the easy stuff and really spend time talking with the student about the deep material. I found myself much more impressed with how well Jurgen knows his stuff after the examination than before, and this is not a feeling I usually get with US students because their defense it typically quite high-level. And after 45 minutes of a presentation, plus 15 minutes of audience questions, the last thing anyone wants to do is sit around for another two hours examining the details of the defense chapter-by-chapter.

Regarding the issue of having the advisor there or not, I don't have a strong preference. The one thing I will say is that by having the advisor missing removes the potential for weird politics. For instance, I have seen one or two defenses in which an advisor tends to answer questions for the student, without the student first attempting an answer. If I were on these committees, with a relatively senior advisor, it might be politically awkward to ask them not to do this. Luckily this issue hasn't come up for me, but I could imagine it happening.

Obviously I don't really expect anyone's policies to change, and I'm not even sure that they should, but I like thinking about things that I've grown used to taking for granted. Plus, after having gone through the UK model, I think I will grill students a bit more during the Q/A time. And if this means that fewer students ask me to be on their committees, then there's more time to blog :).

natural language processing blog

29 September 2011

A technique for me is a task for you

26 September 2011

Four months without blogs