14 April 2014

Waaaah! EMNLP six months late :)

Okay, so I've had this file called emnlp.txt sitting in my home directory since Oct 24 (last modification), and since I want to delete it, I figured I'd post it here first. I know this is super belated, but oh well, if anyone actually reads this blog any more, you're the first to know how I felt 6 months ago. I wonder if I would make the same calls today... :)

If you remember anything about EMNLP anymore and have your own opinions, please feel free to comment. It will also let me know if anyone reads here anymore :).

Happy Spring from DC!

16 September 2013

Active learning for positive examples

I have a colleague who wants to look through large amounts of (text) data for examples of a pretty rare phenomenon (maybe 1% positive class, at most). We have about 20 labeled positive examples and 20 labeled negative examples. The natural thing to do at this point is some sort of active learning.

But here's the thing. We have no need for a classifier. And we don't even care about being good at finding negative examples. All we care about is finding as many positive examples from a fixed corpus as possible.

That is to say: this is really a find-a-needle-in-a-haystack problem. The best discussion I've found of this is Section 4 of Inactive Learning, by Attenberg and Provost. I came across a couple other papers that basically do some sort of balancing to deal with the imbalanced data problem in active learning, but the Attenberg and Provost results suggest that even this is tricky to get right.

But I think this only partially addresses the problem. Here's why.

Let's say that my colleague (call her Alice) is willing to spend one hour of her life looking at examples. Alice estimates that she can look at about 500 examples in that time period (small caveat: some examples are larger than others and therefore slower to read, but let's ignore this for now). Alice's goal is to find as many positive examples in that one hour as possible. Here are some possible strategies Alice could deploy:

  1. Look at 500 randomly selected examples, getting 5 (more) positive examples in expectation (likely somewhere between 2 and 12, with 95% confidence). This gives her a total of 25 positive examples.
  2. Train a classifier on her 40 labeled examples, have it rank the entire set (minus those 40). Then look at the top 500 examples. If the learned model has a recall of r% in the top 500, then she should expect to get 5*r more examples, giving her a total of 20+5r examples. (Caveat: this is a bit dangerous, because with very few training examples, results can often be anti-correlated with the desired output and you get better performance by flipping the predicted label. There was a paper on this maybe 5-10 years ago but I can't find it... Maybe someone knows what I'm referring to.)
  3. Train a classifier on her 40 labeled examples, spend 30 minutes looking at its top-ranked 250 examples, getting 2.5r more positive examples; then train another classifier on everything labeled so far, and spend the remaining 30 minutes looking at that classifier's top-ranked 250 examples. If the second classifier has a recall of r'% in its top 250, then she should expect to get another 2.5*r', and she'll have a total of 20+2.5r+2.5r' = 20 + 2.5(r+r'); so long as r' >= r (the retrained classifier is at least as good), this should beat strategy 2.
  4. Taking this to the extreme, Alice could annotate one example at a time (subject to constraints on either the learning being very fast or Alice not actually being a human). As long as the recall-at-one of the classifiers learned is non-decreasing, this should (I think) be the optimal strategy.
So it seems that in a setting like this, what the classifier should optimize for is simply recall-at-one. In other words, it doesn't care about anything except that its top-ranked output is correct.
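To make strategy 2 concrete, here's a minimal sketch on an entirely synthetic corpus, with a nearest-centroid scorer standing in for the real classifier (all names and numbers are invented for illustration):

```python
import random

random.seed(0)

def make_example(label, dim=5):
    # synthetic feature vectors: positives are shifted in coordinate 0
    x = [random.gauss(0, 1) for _ in range(dim)]
    if label == 1:
        x[0] += 3.0
    return x

# 20 labeled positives, 20 labeled negatives, and a big unlabeled pool
labeled = [(make_example(1), 1) for _ in range(20)] + \
          [(make_example(0), 0) for _ in range(20)]
pool_labels = [1 if random.random() < 0.01 else 0 for _ in range(5000)]
pool = [make_example(y) for y in pool_labels]

def centroid(examples):
    return [sum(x[i] for x in examples) / len(examples)
            for i in range(len(examples[0]))]

pos_c = centroid([x for x, y in labeled if y == 1])
neg_c = centroid([x for x, y in labeled if y == 0])

def score(x):
    # higher = closer to the positive centroid than to the negative one
    return sum((xi - ni) ** 2 - (xi - pi) ** 2
               for xi, pi, ni in zip(x, pos_c, neg_c))

# strategy 2: rank the entire pool, review only the top 500
ranked = sorted(range(len(pool)), key=lambda i: -score(pool[i]))
found = sum(pool_labels[i] for i in ranked[:500])
total = sum(pool_labels)
print(f"found {found} of {total} pool positives in the top 500")
```

Even this crude scorer finds the bulk of the roughly 50 pool positives in its top 500, versus the ~5 you'd expect from random review.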

Okay, so why is this difficult? Basically because we don't have that much data. If the corpus that represents the haystack is huge, then we want to ensure that the top-one example from that haystack is a positive example. So even if all we're trying to do is select between two different hypotheses (say our hypothesis class at each time step has cardinality two), in principle we should use as much data as possible to evaluate them. In particular, looking at the recall-at-one on the subset of 40 labeled examples that we have is probably not representative, essentially because this loss function doesn't decompose nicely.

So what could we do instead? Well, we could pool the labeled and unlabeled examples and predict on those instead. Certainly if one of the positive labeled examples outranked everything else, that would be a good thing. But if truly 1% of the unlabeled dataset is positive, then this is not a necessary condition. In particular, the top-ranked positive labeled example could be as far as 1% down the list and we could still (in principle) be doing a good job.
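One way to operationalize the pooled check is to ask where the best-scoring labeled positive lands, as a percentile, in the pooled ranking; if the positive rate really is about 1%, landing inside the top 1% is consistent with doing a good job. A sketch, with entirely hypothetical scores:

```python
def top_positive_percentile(labeled_pos_scores, unlabeled_scores):
    # percentile rank (0 = very top) of the best-scoring labeled
    # positive when pooled with the unlabeled examples
    best = max(labeled_pos_scores)
    pooled = sorted(labeled_pos_scores + unlabeled_scores, reverse=True)
    return pooled.index(best) / len(pooled)

# hypothetical scores: the best labeled positive is beaten by two
# (presumably positive) unlabeled examples, but is still in the top 1%
labeled_pos = [0.8, 0.6, 0.5]
unlabeled = [0.9, 0.85] + [0.1] * 995
pct = top_positive_percentile(labeled_pos, unlabeled)
print(pct)  # 2/1000 = 0.002
```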

Ok I'll admit: I really don't know how to do this. Maybe it's enough in most cases to optimize for zero-one loss (or a weighted version thereof) and just cross your fingers that the recall-at-one is good. But it does feel like there should be a more direct way to go about this problem.

08 July 2013

The *SEM 2013 Panel on Language Understanding (aka semantics)

One of the highlights for me at NAACL was the *SEM panel on "Toward Deep NLU", which had the following speakers: Kevin Knight (USC/ISI), Chris Manning (Stanford), Martha Palmer (UC Boulder), Owen Rambow (Columbia) and Dan Roth (UIUC). I want to give a bit of an overview of the panel, interspersed with some opinion. I gratefully acknowledge my wonderful colleague Bonnie Dorr for taking great notes (basically a transcript) and sharing them with me to help my failing memory. For what it's worth, this basically seemed like the "here's what I'm doing for DEFT" panel :).

Here's the basic gist that I got from each of the panel members, who gave roughly 10 minute talks:

Dan Roth: doing role labeling restricted to verbs is not enough. As an easy example, "John, a fast-rising politician, slept on the train to Chicago"... by normal SRL we get that John is sleeping, but not the possibly more important fact that John is a politician. Another example is prepositions: "University of Illinois" versus "State of Illinois" -- "of" is ambiguous. They came up with a taxonomy of 32 relations and labeled data and then did some learning -- see the TACL paper by Srikumar & Roth that was presented at NAACL.

Commentary: the ambiguity of prepositions issue is cool and I really liked the TACL paper. It reminds me of learning Latin in high school and being confused that ablative case markers were ambiguous across from/by/with. It astounded me that that was an acceptable ambiguity, but of course English has equally crazy ones that I've just gotten used to. But it does make me think that some cross-linguistic study/model might be cool here. Even more broadly, it made me think about noun-noun compound semantics: "farmers market" (market put on by farmers) versus "fruit market" (market where you buy fruit) versus "fruit pie" (pie made out of fruit). I went back and read Lucy Vanderwende's dissertation, which dealt exactly with these issues. She had far fewer relations than Srikumar and Roth, though perhaps once you allow explicit prepositions the range of things you can express grows (though somehow my gut feeling is that it doesn't, at least in English).

Kevin Knight: Basically talked about their deep semantic approach to MT: see the abstract meaning representation web page for more. The idea is that people who work on syntax don't Balkanize into those who do PPs, those who do VPs, etc., so why should semantics break apart like it does. AMR is a very GOFAI-style representation for language, and they've annotated a Chinese-English bilingual copy of Le Petit Prince with this representation. Now they need analyzers (hard), generators (hard) and transformation formalisms (hard). The nice thing is that this one representation captures almost all relevant semantic issues: scoping, argument structure, coreference, etc. For instance, co-ref is not explicitly annotated: it's just that a single agent can participate in multiple predicates. (Note: not yet across sentences.)

Commentary: It's hard not to get excited about this stuff, especially when Kevin talks about it. His enthusiasm is infectious. I left the talk thinking "wow I want to work on that!" There's of course the worry that we've tried this before and failed and that's why things in semantics Balkanized, but maybe the time is right to revisit it. For instance, Bonnie herself (note: she didn't tell me this; it had come up in recent discussions with Philip Resnik and our postdoc Junhui Li) had a meaning representation very similar to AMR called Lexical Conceptual Structures (LCS), and Nizar Habash had a hybrid rule-based/statistical approach to translating there. The idea was that if you want to handle divergent translations (classic example: "the bottle floated across the river" (English) versus "the bottle crossed the river floatingly" (Spanish, I think)), you need a representation that abstracts meaning away from the surface predicate. But it's still very cool. (Actually in digging up refs, I just found this paper on mapping from LCS to AMR... from AMTA 1998!)

Martha Palmer: focused mostly on event relations that go across sentences, which includes things like event coreference, bridging relations (enablement, result) and so on. They're also looking seriously at type (evidential, aspectual, etc.), modality (actual, hypothetical, generic, hedged, etc.), polarity and aspect. They are currently doing a lot of work in the clinical domain, in which these distinctions are really important if you want to understand, say, patient medical histories.

Commentary: this is a bit outside things I usually think about, so I have less to say. I really like the hyper-sentence view, of course.

Owen Rambow: talked about some of my favorite work that I've seen recently: basically work on propositional attitudes. The view Owen put forth is that most of NLP is focused on a world of facts, and the goal of NLU is to figure out what these facts are. They are taking a much more social model of text meaning, in which you really care about inferring participants' cognitive states (standard triumvirate: belief, desire and intention). This actually shows up in at least one English-German translation example, essentially in which Google Translate misses a very important subjunctive.

Commentary: I really liked the original work Owen did on BDI inference and I'm thrilled it's going further. I think one of the historical reasons why I find this so interesting is that propositional attitudes is basically what I started doing when I started grad school, when looking at discourse analysis through RST. I think many people forget this, but the discourse relationships in RST (and other discourse theories) are really based on attitude. For instance, X is in a background relation to Y if (roughly) the listener already believes X and the listener also believes that X increases the chance of Y. (Or something like that: I just made that up :P.) But it's all about belief of listeners and utterers.

Chris Manning: focused on deep learning, basically asserting (in a manner designed to be a bit controversial) that Stanford dependencies are their meaning representation and that the big problems aren't in representations. Sure, Stanford dependencies miss out on a lot (quantification, tense, semantic roles, modality, etc.) but he felt that there are more important problems to address. What we need instead is "soft" meaning representations, like the ones vector space models and distributed representations give us, giving rise to something akin to Natural Logic.

Commentary: to a large degree I agree with the notion that the "big problems" in language are probably not those that (eg) semanticists like to look at, at least from the typical view of NLE in which we want systems that do well on average across a distribution of examples that we've cultivated. But I also worry that there's a bit of magical thinking here, in the sense that it kind of feels like a cop-out: it's too hard to define categories by hand so let's let the machine figure it out. Now, don't get me wrong, I'm all for machines figuring out stuff (I gave a not-very-well-received talk to that effect at a workshop a couple years ago on linguistics in NLP), but I'm also a bit reluctant to believe that this is really going to bring us any closer to really solving the NLU problem (whatever that is), though of course it will get us another 5-10% in standard benchmarks. (Ok this sounds way too negative: I actually really liked Chris' talk, and one of the things I liked about it was that it challenged my thinking. And I agree that there is a lot that we shouldn't be designing by hand -- some people, like Yoshua Bengio, would probably argue that we shouldn't be designing anything by hand, or at least that we shouldn't have to -- but I guess I still belong to the camp of "linguists give the structure, statistics gives the parameters.")

There was also a lot of really interesting discussion after the presentations, some of which I'll highlight below:

Lucy Vanderwende, I think mostly directed at Kevin, fell into the "we tried this X years ago" camp, basically saying that whenever they tried to abstract more and more from the input representation, you ended up getting very boring sentences generated because you'd thrown out all the "nuance" (my word, not hers). The discussion afterward basically revolved around whether you annotate input sentences with meaning (which is currently the standard) or throw them out with the bathwater. Owen points out that the meaning of a passive sentence is not +passive but something much more nuanced, and if you could capture that correctly, then (in principle) generators could reflect it properly in the target language. (Me: For instance, maybe in some wacky language a passive sentence actually means that you're trying to emphasize the subject.)

There was also a lot of discussion around Chris, I think partially because he went last and partially because he was trying to be controversial. Mausam made an argument (akin to what I wrote above) that logicians have made a billion logics of language and nothing really has worked (in a sense it's been a series of negative results). What about inference rules or consistency?

Okay, that's all I want to write for now. Congrats if you made it this far. And thanks to the *SEM organizers for putting together this great panel!

17 June 2013

My NAACL 2013 list...

I feel a bit odd doing my "what I liked at NAACL 2013" as one of the program chairs, but not odd enough to skip what seems to be the most popular type of post :). First, though, since Katrin Kirchhoff (my co-chair) and I never got a chance to formally thank Lucy Vanderwende (the general chair) and give her flowers (or wine or...) let me take this opportunity to say that Lucy was an amazing general chair and that working with her made even the least pleasant parts of PCing fun. So: thanks Lucy -- I can't imagine having someone better to have worked with! And all of the rest of you: if you see Lucy and if you enjoyed NAACL, please thank her!

I also wanted to really thank Matt Post for doing the NAACL app -- lots of people really liked it and I hope we do it in the future. I'm at ICML now and constantly wishing there were an ICML app :).

Okay, with that preface, let's get down to what you came for. Below is my (complete) list of favorite papers from NAACL 2013 (also indexed on Braque) in no particular order:

  • Relation Extraction with Matrix Factorization and Universal Schemas (N13-1008 by Sebastian Riedel; Limin Yao; Andrew McCallum; Benjamin M. Marlin)
    Very cool paper. The idea is to try to jointly infer relations (think OpenIE-style) across text and databases, by writing everything down in a matrix and doing matrix completion. In particular, make the rows of this matrix equal to pairs of entities (Hal,UMD and UMD,DC-area) and the columns relations like "is-professor-at" and "is-located-in." These entity pairs and relations come both from free text and databases like FreeBase. Fill in the known entries and then think of it as a recommender system. They get great results with a remarkably straightforward approach. Reminds me a lot of my colleague Lise Getoor's work on multi-relational learning using tensor decompositions.
  • Combining multiple information types in Bayesian word segmentation (N13-1012 by Gabriel Doyle; Roger Levy)
    I guess this qualifies as an "obvious in retrospect" idea -- and please recognize that I see that as a very positive quality! The basic idea is that stress patterns (eg trochees versus iambs) are very useful for kids (who apparently can recognize such things at 4 days old!) and are also very useful for word segmentation algorithms.
  • Learning a Part-of-Speech Tagger from Two Hours of Annotation (N13-1014 by Dan Garrette; Jason Baldridge)
    Probably my overall favorite paper of the conference, and the title says everything. Also probably one of the best presentations I saw at the conference -- I can't even begin to guess how long Dan spent on his slides! I loved the question from Salim in the Q/A session, too: "Why did you stop at two hours?" (They have an ACL paper coming up, apparently, that answers this.) You should just read this paper.

  • Automatic Generation of English Respellings (N13-1072 by Bradley Hauer; Grzegorz Kondrak)
    This paper was the recipient of the best student paper award and, I thought, really great. It's basically about how English (in particular) has funny orthography and sometimes it's useful to map spellings to their pro-nun-see-ey-shuns, which most people find more useful. It's a bit more of a bunch of stuff glued together than I usually go for in papers, but the ideas are solid and it seems to work pretty well -- and I'd never even thought this would be something interesting to look at, but it makes complete sense. Best part of the presentation was when Greg tripped up pronouncing some example words :).

  • Linguistic Regularities in Continuous Space Word Representations (N13-1090 by Tomas Mikolov; Wen-tau Yih; Geoffrey Zweig)
    This is a paper that makes my list because it made me think. The basic idea is that if you do some representation learning thingamajig and then do vector space algebra like repr("King") - repr("man") + repr("woman") you end up with something that's similar to repr("Queen"). It's a really interesting observation, but I'm at a loss for why we would actually expect something like this to happen!
  • PPDB: The Paraphrase Database (N13-1092 by Juri Ganitkevitch; Benjamin Van Durme; Chris Callison-Burch)
    This is a paper about a dataset release that I think I'll find useful and I bet other people will too. Go download it and play with it. I'd encourage the authors (are you listening, Juri!) to make a web demo (or web service) so that I don't need to go through the pain of getting it all set up to see if it might be useful for me.
  • Supervised Learning of Complete Morphological Paradigms (N13-1138 by Greg Durrett; John DeNero)
    Basic idea: collect morphological paradigms from Wiktionary and then train a supervised system to generalize from those to novel words. Works remarkably well and the model is well thought out. Plus I like papers that take morphology seriously: I wish we saw more stuff like this in NAACL.
And though I don't often do this, I have to mention the following paper because although I didn't see the talk or read the paper, enough independent people pointed it out to me as great that I figured I'd mention it:
  • Improved Reordering for Phrase-Based Translation using Sparse Features (N13-1003 by Colin Cherry)
Anyone else with favorites should comment!

30 April 2013

What is a sparse difference in probability distributions?

Sparsity has been all the rage for a couple of years now. The standard notion of "sparse" vector u is that the number of non-zeros in u is small. This is simply the l_0 norm of u, ||u||_0. This norm is well studied, known to be non-convex, and often relaxed to the l_1 norm of u, ||u||_1: the sum of absolute values. (Which has the nice property of being the "tightest" convex approximation to l_0.)

In some circumstances, it might not be that most of u is zero, but simply that most of u is some fixed scalar constant a. The "non-constant" norm of u would be something like "the number of components that are not equal to some a" or, in more mathy terms, "min_a ||u-a||_0". I first came across this in the Differentiable sparse coding paper by Bradley and Bagnell (NIPS 2008), where they claim physicists call it "Maximum entropy on the mean" if you're using a KL distance, but I don't know if there's a name for it in the l_0 sense, so I'll call it "l_0 from the mode" which seems to capture the right notion.
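Here's a tiny sketch of the three notions on a single vector; the key observation for "l_0 from the mode" is that the minimizing constant a is just the most common value, so exact mode-counting suffices (exact equality of components is assumed, which is fine for illustration):

```python
from collections import Counter

def l0(u):
    # number of non-zero components
    return sum(1 for x in u if x != 0)

def l1(u):
    # sum of absolute values
    return sum(abs(x) for x in u)

def l0_from_mode(u):
    # min_a ||u - a||_0: the minimizing a is the most common value (the
    # mode), so this is the number of components not equal to the mode
    mode_count = Counter(u).most_common(1)[0][1]
    return len(u) - mode_count

u = [3, 3, 3, 0, 7, 3]
print(l0(u))            # 5 (one zero)
print(l1(u))            # 19
print(l0_from_mode(u))  # 2 (only the 0 and the 7 differ from the mode, 3)
```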

In many cases, we want to use a sparse norm to define a notion of a sparse distance function between two vectors u and v. For instance d_0(u,v) = ||u-v||_0 would be a reasonable notion of a sparse difference between two vectors.

Now, let's suppose that u and v are probability distributions over a finite event space, so that u_i is the probability of event i. In this case, both u and v belong to the simplex: u_i >= 0 for all i, and sum_i u_i = 1 (which is equivalent to saying ||u||_1 = 1, given the first condition).

It seems natural to ask ourselves whether there is a reasonable notion of sparse difference between two probability distributions. This is something I've thought about, but not come up with a satisfactory answer. The "natural" thing would be to define d_0(p,q) = ||p-q||_0, where I've switched from u/v to p/q to emphasize that these are distributions. This is certainly reasonable in some sense, but does not correspond to how I usually think of probability distributions, essentially because it ignores the "normalization" effect.

As a simple example, let's suppose that I generate a data set by rolling a die with sides A-F. Suppose it comes up A five times, B twice, C once and F twice. Then my maximum likelihood estimate for p would be [.5, .2, .1, 0, 0, .2]. If I had a different dataset, in which exactly the same thing happened except I got one roll of D in addition to all the other rolls, I would estimate the distribution here as q = [.45, .18, .09, .09, 0, .18], which differs everywhere except the "E" position. The d_0 distance between these two distributions would be 5, which intuitively seems "wrong."

One possible way around this would be to allow us to rescale the distributions before computing the l_0 distance, something like d(p,q) = min_(a,b > 0) || ap - bq ||_0. When talking about l_0 as the underlying norm, this can be simplified to min_(0 < a < 1) || ap - (1-a)q ||_0. Note that the strict inequalities on "a" are necessary: with equalities, you could "erase" all of p (or q) by setting a to zero (or one), which is definitely not desirable. Fortunately, the optimal "a" can be computed in time linear in the dimensionality: it's just the value that maximizes the number of i for which ap_i = (1-a)q_i. This definition captures the example above and gives a distance of "1" (corresponding to a = mode(q / (p+q)) = 10/21 ≈ 0.476).
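Here's a sketch of both distances on the dice example, using exact fractions to dodge rounding issues; the candidate-enumeration trick is just the observation that the optimal a must make some component match exactly, so a = q_i/(p_i+q_i) for some i:

```python
from fractions import Fraction as F

p = [F(1, 2), F(1, 5), F(1, 10), F(0), F(0), F(1, 5)]        # sides A-F, 10 rolls
q = [F(5, 11), F(2, 11), F(1, 11), F(1, 11), F(0), F(2, 11)] # same plus one D

# naive sparse distance: count components where p and q differ
naive = sum(1 for pi, qi in zip(p, q) if pi != qi)

def rescaled_distance(p, q):
    # min over 0 < a < 1 of ||a*p - (1-a)*q||_0.  The optimal a must make
    # some component match exactly, i.e. a*p_i = (1-a)*q_i, which gives
    # a = q_i / (p_i + q_i); so it suffices to try those candidates.
    candidates = {qi / (pi + qi) for pi, qi in zip(p, q) if pi + qi != 0}
    best = len(p)
    for a in candidates:
        if not (0 < a < 1):
            continue
        mismatches = sum(1 for pi, qi in zip(p, q) if a * pi != (1 - a) * qi)
        best = min(best, mismatches)
    return best

print(naive)                    # 5: everything but E differs
print(rescaled_distance(p, q))  # 1: with a = 10/21, only D mismatches
```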

We can attempt to make this definition more formal by invoking the notion of exponential families. Recall that an expfam distribution has the form log p(x;t) = t*f(x) - A(t) + B(x), where t are the natural parameters, f is the vector of sufficient statistics, * denotes dot product, and A() is the log partition function. A quick observation is that depending on how f looks, there might be lots of t that give rise to the same probability distribution. If we guarantee that the fs are always linearly independent, then this representation is called "minimal"; otherwise it is "overcomplete." (Formally, it is overcomplete if there exists a non-zero vector a for which a*f(x) is constant for all x.)

For a categorical distribution, an overcomplete representation would be a six dimensional indicator of which side of the die came up, and the natural parameters would be the (log of the) ps and qs shown above (it's overcomplete because one component is completely determined by the other 5). For instance, the p above would have exponential form with t=log[.5/.8, .2/.8, .1/.8, 0, 0] and log partition function A(t) = log 1.8 (see Wikipedia's variant 3 for categorical distributions), and q would have t=log[.45/.82, .18/.82, .09/.82, .09/.82, 0] with A(t) = log 1.82.

Supposing that we're working with a minimal representation, what I think this amounts to is that we want the difference in natural parameters to be sparse in the "l_0 from the mode" sense -- the difference is equal to a constant in all but a few components. In the above example, the difference in natural parameters is equal to 0.1301 in the first three components, -Infinity in the fourth, and something like zero in the last (who knows -- NaN?), which gives a distance of one (if you discard the last one).

The problem here is that there is not a unique minimal representation; I could also have chosen t=log[.2/.5, .1/.5, 0, 0, .2/.5] and A=log 1.5 for p and t=log[.18/.55, .09/.55, .09/.55, 0/.55, .18/.55] and A=log 1.55 for q. In this case, the mode of the difference is 0.2007 (in the first, second and last components); there's one -Infinity and one NaN. So the answer works out the same, at least if you know how to deal with the NaNs.
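As a numerical sanity check that both minimal representations give the same answer on this example, here's a sketch using the rounded probabilities above. Caveats: the parameterization t_i = log(p_i/(1-p_ref)) is my reverse-engineering of the numbers in the post, differences are rounded to 4 digits before mode-finding to absorb floating-point noise, and NaNs are discarded as in the text:

```python
import math
from collections import Counter

def natural_params(p, ref):
    # drop the reference component; t_i = log(p_i / (1 - p_ref)),
    # with log 0 taken to be -inf (this parameterization is my
    # reverse-engineering of the numbers in the post, not canon)
    denom = 1 - p[ref]
    return [math.log(p[i] / denom) if p[i] > 0 else float("-inf")
            for i in range(len(p)) if i != ref]

def mode_distance(tp, tq):
    # number of (non-NaN) components of tp - tq not equal to the mode
    diffs = []
    for a, b in zip(tp, tq):
        d = a - b
        if math.isnan(d):  # -inf minus -inf: discard, as in the text
            continue
        diffs.append(d if math.isinf(d) else round(d, 4))
    mode_count = Counter(diffs).most_common(1)[0][1]
    return len(diffs) - mode_count

p = [.5, .2, .1, 0, 0, .2]
q = [.45, .18, .09, .09, 0, .18]
d1 = mode_distance(natural_params(p, 5), natural_params(q, 5))  # reference F
d2 = mode_distance(natural_params(p, 0), natural_params(q, 0))  # reference A
print(d1, d2)  # both come out to 1
```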

There's an obviously needed lemma here: that any minimal representation will give you the same distance. I haven't thought nearly long enough about why this should be true (i.e., I've thought about it for about 30 seconds :P). There's also a lot of other questions: what if you relax this notion of sparsity to something more l_1-like? What other properties does it have? And of course is there a better notion of sparse difference of probability distributions for this or other circumstances?

27 December 2012

Teaching (intro, grad) NLP

I had a post a while back on teaching ML that I still basically stand by.  I've taught intro grad NLP here at UMD twice now, and a sort-of-similar-course back at Utah once.  I find these courses really hard to teach.  And not for the usually bemoaned reason of the CS/linguistics mix -- I think it's possible to deal with that, and certainly it's an issue that's been talked about a lot.

What I find difficult is that NLP (and CL) is a collection of problems, techniques, ideas, frameworks, etc. that really are not tied together in any reasonable way other than the fact that they have to do with NLP.  Even if you manage to answer questions about "what sort of topics are most interesting?" you're still faced with this problem that every time you switch topics, the entire context in which you're discussing them changes.  This is exacerbated by the problem that things like tagging and parsing are hopelessly boring (in comparison to all the cool interesting stuff in NLP these days), but yet so many modern ideas are based on understanding basic dynamic programming for tree structures and things like that.

To make things a bit more concrete, a standard intro NLP class might start with morphology.  Okay, so you have to explain what morphemes are and why they're important.  Now, you probably will take a finite state approach, so you have to explain transducers.  If you want these things to work, you have to explain weighted transducers.  Do you do probabilities, in which case there's the whole local vs global normalization stuff that takes more time?  So now you want to do POS tagging or something.  Fine, you can do that with finite state models too.  But no one actually does this any more (except lingpipe :P).  So you have to explain POS stuff, perhaps how this works in non-English, and then you can leave them with HMMs (maybe talking about Viterbi algorithm) or do lots of ML so you can get to CRFs or structured perceptron or something.  And we're still at POS tagging.  Now you switch to parsing.  Back to square one.  And then you want to do compositional semantics, now there's lots more structure, lots more features and so on.  But even now things are at least somewhat connected.  But then you talk about lexical semantics: be it distributed representations or WSD or whatever, but the problem is new, the techniques are new (do you teach Yarowsky?), the evaluation is new and so on.

I think it's worth contrasting this with ML.  I find ML remarkably easy to teach (so I'm flipping the classroom this coming Spring for the UG version to make it more exciting) despite the fact that the material is (in many ways) much harder for CS types.  The thing that is nice about ML is that the problem is basically always the same (or at least changes only once, when you switch from supervised to unsupervised).  In that sense, ML tends to be a course about techniques for a relatively fixed problem (or at least fixed problem type).  This makes for significantly less context switching, which makes learning easier (and thereby makes teaching easier).

So the question I wanted to ask is: can we do something similar in NLP?  The crazy idea that I'm sure everyone will say is insane is the following: teach NLP as a course about what you can do with log-linear models.  Here's how I envision it.  You spend the first day talking about NLP and why data is important, ambiguity, etc., just like normal.  You spend the next two days explaining enough about log-linear models that you can treat them as given for the rest of the semester.  Maybe you show how to optimize them by gradient descent or something, but basically enough that anyone who is simultaneously taking ML will get more out of it, but those that are not are fine with LL models as a black box.

Now, when you teach different topics, the framework in which you discuss them is the same.  You have a structured problem (which forces you to talk about algorithms like Viterbi or CKY) with interesting ambiguities (which forces you to talk about features).  Then, the class essentially becomes a sequence of problems, associated algorithms and relevant features.  The rest is left as a black box, which can be provided off the shelf for programming projects, and they can focus on the interesting and more NLP-ish problems of algorithms and features.  You could even start with something like sentiment classification (at a document level) to make the beginning gentle.
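For concreteness, here's roughly what the log-linear "black box" might look like for the gentle sentiment-classification opener: binary logistic regression with bag-of-words features, trained by gradient ascent on the conditional log-likelihood (the toy data is obviously made up):

```python
import math

# toy sentiment data: (document, label) pairs, labels 1 = positive
train = [
    ("a great fun movie", 1), ("great acting and fun", 1),
    ("truly wonderful film", 1), ("boring terrible movie", 0),
    ("a terrible waste", 0), ("boring and awful acting", 0),
]

vocab = sorted({w for text, _ in train for w in text.split()})

def features(text):
    # bag-of-words feature vector
    words = text.split()
    return [float(words.count(w)) for w in vocab]

def predict_prob(w, x):
    # log-linear model for binary labels: p(y=1|x) = 1/(1 + exp(-w.x))
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

# the "black box": gradient ascent on the conditional log-likelihood
w = [0.0] * len(vocab)
for epoch in range(200):
    for text, y in train:
        x = features(text)
        err = y - predict_prob(w, x)  # d(log-lik)/dw = err * x
        w = [wi + 0.1 * err * xi for wi, xi in zip(w, x)]

print(predict_prob(w, features("fun wonderful film")))    # near 1
print(predict_prob(w, features("boring terrible waste"))) # near 0
```

The pedagogical point is that swapping in POS tagging or parsing changes only the features and the decoding algorithm; the model and training loop stay exactly this.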

I realize there are some things you couldn't do this way, or would be very awkward to do this way.  Anything generative or unsupervised, which often go together.  For instance, word alignment via the IBM models won't fit.  Topic models won't fit (though I don't usually do them anyway -- maybe I should).  Probably there are some other things too.

Anyway, I'd be curious to hear what people think of this idea.  I know it's biased by my own view of the world, but hey -- that's why I'm a professor (or at least why I assist professors...).  Or if anyone has tried it.

09 December 2012

NIPS stuff...

NIPS is over as of last night.  Overall I thought the program was strong (though I think someone, somewhere, is trying to convince me I need to do deep learning -- or at least that was the topic du jour... or I guess de l'an? this time).  I wasn't as thrilled with the venue (details at the end) but that's life.  Here were some of the highlights for me, of course excluding our own papers :P (see the full paper list here)... note that there will eventually be videos for everything!

  • User-Friendly Tools for Studying Random Matrices
    Joel A Tropp

    This tutorial was awesome.  Joel has apparently given it several times and so it's really well fine-tuned.  The basic result is that if you love your Chernoff bounds and Bernstein inequalities for (sums of) scalars, you can get almost exactly the same results for (sums of) matrices.  Really great talk.  If I ever end up summing random matrices, I'm sure I'll use this stuff!
  • Emergence of Object-Selective Features in Unsupervised Feature Learning
    Adam Coates, Andrej Karpathy, Andrew Y. Ng
They show that using only unlabeled data that is very heterogeneous, some simple approaches can pull out faces.  I imagine that some of what is going on is that faces are fairly consistent in appearance whereas "other stuff" often is not.  (Though I'm sure my face-recognition colleagues would argue with my "fairly consistent" claim.)
  • Scalable nonconvex inexact proximal splitting
    Suvrit Sra
    I just have to give props to anyone who studies nonconvex optimization.  I need to read this -- I only had a glance at the poster -- but I definitely think it's worth a look.
  • A Bayesian Approach for Policy Learning from Trajectory Preference Queries
    Aaron Wilson, Alan Fern, Prasad Tadepalli
    The problem solved here is imitation learning where your interaction with an expert is showing them two trajectories (that begin at the same state) and asking them which is better.  Something I've been thinking about recently -- very happy to see it work!
  • FastEx: Hash Clustering with Exponential Families
    Amr Ahmed, Sujith Ravi, Shravan M. Narayanamurthy, Alexander J. Smola
The idea here is to replace the dot product between the parameters and sufficient statistics of an exp fam model with an approximate dot product achieved using locality sensitive hashing.  It takes a bit to figure out exactly how to do this.  Cool idea and nice speedups.
  • Identifiability and Unmixing of Latent Parse Trees
    Daniel Hsu, Sham M. Kakade, Percy Liang
    Short version: spectral learning for unsupervised parsing; the challenge is to get around the fact that different sentences have different structures, and "unmixing" is the method they propose to do this.  Also some identifiability results.
  • Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs
    Shay B. Cohen and Michael Collins
    Another spectral learning paper, this time for doing exact latent variable learning for latent-variable PCFGs.  Fast, and just slightly less good than EM.
  • Multiple Choice Learning: Learning to Produce Multiple Structured Outputs
    Abner Guzman-Rivera, Dhruv Batra, Pushmeet Kohli
    Often we want our models to produce k-best outputs, but for some reason we only train them to produce one-best outputs and then just cross our fingers.  This paper shows that you can train directly to produce a good set of outputs (not necessarily diverse: just that it should contain the truth) and do better.  It's not convex, but the standard training is a good initializer.
  • [EDIT Dec 9, 11:12p PST -- FORGOT ONE!]
    Query Complexity of Derivative-Free Optimization
    Kevin G. Jamieson, Robert D. Nowak, Benjamin Recht
    This paper considers derivative free optimization with two types of oracles.  In one you can compute f(x) for any x with some noise (you're optimizing over x).  In the other, you can only ask whether f(x)>f(y) for two points x and y (again with noise).  It seems that the first is more powerful, but the result of this paper is that you get the same rates with the second!
  • I didn't see it, but Satoru Fujishige's talk Submodularity and Discrete Convexity in the Discrete Machine Learning workshop was supposedly wonderful.  I can't wait for the video.
  • Similarly, I heard that Bill Dolan's talk on Modeling Multilingual Grounded Language in the xLiTe workshop was very good.
  • Ryan Adams's talk on Building Probabilistic Models Around Deterministic Optimization Procedures in the "Perturbations, Optimization and Statistics" workshop (yeah, I couldn't figure out what the heck that meant either) was also very good.  The Perturb-and-MAP stuff and the Randomized Optimum models are high on my reading list, but I haven't gotten to them quite yet.
  • As always, Ryan McDonald and Ivan Titov gave very good talks in xLiTe, on Advances in Cross-Lingual Syntactic Transfer and Inducing Cross-Lingual Semantic Representations of Words, Phrases, Sentences and Events, respectively.
I'm sure there was lots of other stuff that was great and that I missed because I was skiing working hard on NAACL.

Really my only gripe about NIPS this year was the venue.  I normally wouldn't take the time to say this, but since we'll be enjoying this place for the next few years, I figured I'd state what I saw as the major problems, some of which are fixable.  For those who didn't come, we're in Stateline, NV (on the border between CA and NV) in two casinos.  Since we're in NV, there is a subtle note of old cigarette on the nose fairly constantly.  There is also basically nowhere good to eat (especially if you have dietary restrictions) -- I think there are a half dozen places on yelp with 3.5 stars or greater.  My favorite tweet during NIPS was Jacob Eisenstein who said: "stateline, nevada / there is nothing but starbucks / the saddest haiku".  Those are the "unfixables" that make me think that I'll think twice about going to NIPS next year, but of course I'll go.

The things that I think are fixable... there was nowhere to sit.  Presumably this is because the casino wants you to sit only where they can take your money, but I had most of my long discussions either standing or sitting on the ground.  More chairs in hallways would be much appreciated.  There was almost no power in rooms, which could be solved by some power strips.  The way the rooms divided for tutorials was really awkward, as the speaker was clear on one side of the room and the screen was on the other (and too high to point to), so it was basically like watching a video of slides online without ever seeing the presenter.  Not sure if that's fixable, but seems plausible.  And the walls between the workshop rooms were so thin that often I could hear another workshop's speaker better than I could hear the speaker in the workshop I was attending.  And the internet in my hotel room was virtually unusably slow (though the NIPS-specific internet was great).

26 September 2012

Sure, you can do that....

I'll warn in advance that this is probably one of the more controversial posts I've written, but realize that my goal is really to raise questions, not necessarily give answers.  It's just more fun to write strong rhetoric :).

Let me write down a simple Markov Chain:

  1. Download some data from the web
  2. Call part of that data the input and part of it the label
  3. Train a classifier on bag of words and get 84% accuracy
  4. Submit a paper to *ACL
  5. Go to 1
Such papers exist in the vision community, too, where you replace "bag of words" with "SIFT features" and "*ACL" with "CVPR/ICCV."  In that community (according to my one informant :P), such papers are called "data porn."  Turns out this is actually a term from journalism, in which one definition is "where journalists look for big, attention grabbing numbers or produce visualisations of data that add no value to the story."
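To be fair about how low the barrier really is, step 3 of the cartoon pipeline fits in a few dozen lines.  Here's a sketch with a hand-rolled bag-of-words perceptron on made-up data (no claim that this is what any particular paper did):

```python
# The "train a classifier on bag of words" step of the cartoon pipeline,
# as a tiny mistake-driven perceptron on made-up sentiment data.
from collections import defaultdict

def bow(text):
    feats = defaultdict(int)
    for w in text.lower().split():
        feats[w] += 1
    return feats

def train(examples, epochs=5):
    w = defaultdict(float)
    for _ in range(epochs):
        for text, y in examples:        # y in {+1, -1}
            x = bow(text)
            score = sum(w[f] * v for f, v in x.items())
            if y * score <= 0:          # update only on mistakes
                for f, v in x.items():
                    w[f] += y * v
    return w

def predict(w, text):
    return 1 if sum(w[f] * v for f, v in bow(text).items()) > 0 else -1

data = [("great movie loved it", 1), ("terrible movie hated it", -1),
        ("loved the acting", 1), ("hated the plot", -1)]
w = train(data)
print(predict(w, "loved this movie"))   # 1
```

Which is exactly the point: when the method is this generic, the paper had better teach us something beyond "it gets 84%."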

There's a related paper that looks at this issue in one specific setting: predicting political outcomes.  On Arxiv back at the end of April, we got this wonderful, and wonderfully titled paper:
"I Wanted to Predict Elections with Twitter and all I got was this Lousy Paper" -- A Balanced Survey on Election Prediction using Twitter Data by Daniel Gayo-Avello
The thing I especially like about this paper is that it's not a complaint (like this blog post!) but rather a thoughtful critique of how one could do this sort of research right.  This includes actually looking at what has been done before (political scientists have been studying this issue for a long time, and perhaps we should see what they have to say), asking what we can do to make our results more reproducible, and so on.

For me, personally, this goes back to my main reviewing criteria: "what did I learn from this paper?"  The problem is that in the extreme, cartoon version of a data porn paper (my 1-4 list above), the answer is that I learned that machine learning works pretty well, even when asked to do arbitrary tasks.  Well, actually I already knew that.  So I didn't really learn anything.

Now, of course, many data porn-esque papers aren't actually that bad.  There are many things one can do (and people often do do) that make these results interesting:
  • Picking a real problem -- i.e., one that someone else might actually care about.  There's a temptation (that I suffer from, too) of saying "well, people are interested in X, and X' is kind of like X, so let's try to predict X' and call it a day."  For example, in the context of looking at scientific articles, it's popular in many communities to predict future citation counts because we think that might be indicative of something else.  I've certainly been guilty of this.  But where this work can get interesting is if you're able to say "yes, I can collect data for X' and train a model there, but I'll actually evaluate it in terms of X, which is the thing that is actually interesting."
  • Once you pick a real problem, there's an additional challenge: other people (perhaps social scientists, political scientists, humanities researchers, etc.) have probably looked at this in lots of different lights before.  That's great!  Teach me what they've learned!  How, qualitatively, do your results compare to their hypotheses?  If they agree, then great.  If they disagree, then explain to me why this would happen: is there something your model can see that they cannot?  What's going on?
  • On the other hand, once you pick a real problem, there's a huge advantage: other people have looked at this and can help you design your model!  Whether you're doing something straightforward like linear classification/regression (with feature engineering) or something more in vogue, like building some complex Bayesian model, you need information sources (preferably beyond bag of words!) and all this past work can give you insights here.  Teach me how to think about the relationship between the input and the output, not just the fact that one exists.
In some sense, these things are obvious.  And of course I'm not saying that it's not okay to define new problems: that's part of what makes the world fun.  But I think it's prudent to be careful.

One attitude is "eh, such papers will die a natural death after people realize what's going on, they won't garner citations, no harm done."  I don't think this is altogether wrong.  Yes, maybe they crowd out better papers, but there's always going to be that effect, and it's very hard to evaluate "better."

The thing I'm more worried about is the impression that such work gives of our community to others.  For instance, I'm sure we've all seen papers published in other venues that do NLP-ish things poorly (Joshua Goodman has his famous example in physics, but there's tons more).  The thing I worry about is that we're doing ourselves a disservice as a community by trying to claim that we're doing something interesting in other people's spaces, without trying to understand and acknowledge what they're doing.

NLP obviously has a lot of potential impact on the world, especially in the social and humanities space, but really anywhere that we want to deal with text.  I'd like to see ourselves set up to succeed there, by working on real problems and making actual scientific contributions, in terms of new knowledge gathered and related to what was previously known.

15 September 2012

Somehow I totally missed NIPS workshops!

I don't know how it happened or when it happened, but at some point NIPS workshops were posted and papers are due about a week from now and I completely missed it!  The list of workshops is here:


Since my job as a blogger is to express my opinion about things you don't want to hear my opinion about, I wish they'd select fewer workshops.  I've always felt that NIPS workshops are significantly better than *ACL workshops because they tend to be workshops and not mini-conferences (where "mini" is a euphemism for non-selective :P).  At NIPS workshops people go, really talk about problems, and it's really the best people and the best work in the area.  And while, yes, it's nice to be supportive of lots of areas, what ends up happening is that people jump between workshops because there are too many that interest them, and then you lose this community feeling.  This is especially troubling when workshops are already competing with skiing :).

Anyway, with that behind me, there are a number that NLP folks might find interesting:

With the deadlines so close I don't imagine anyone's going to be submitting stuff that they just started, but if you have things that already exist, NIPS is fun and it would be fun to see more NLPers there!

13 June 2012

NAACL 2012 Retrospective

Like many people, I spent last week in lovely Montreal (at least lovely for the second half) at NAACL.  Despite the somewhat unfortunate submission date, I thought the program this year was quite good.  Of course I didn't see every talk and haven't read many of the papers yet, but I figured I'd point out what I saw that I liked and other people can comment likewise.

Identifying High-Level Organizational Elements in Argumentative Discourse (Madnani, Heilman, Tetreault, Chodorow).  This is maybe one of the first discourse papers I've seen where I actually believe that they have a chance of solving the problem that they've set out.  Here, the problem is separating the meat (content of an essay) from the shell (the bits of discourse that hold the meat together).  It's a cool problem and their solution seems to work well.  Very nice paper.  (And Nitin's talk was great.)

Risk Training of Approximate CRF-Based NLP Systems (Stoyanov, Eisner).  This paper is basically about training approximate models based on some given loss function.  Reminds me a lot of the Ross et al. CVPR 2011 paper on Message-Passing.  It's a cool idea, and there's software available.  Being me, the thing I wonder the most about is whether you can achieve something similar being completely greedy, and then whether you need to do all this work to get a good decision function.  But that's me -- maybe other people like CRFs :).

MSR SPLAT, a language analysis toolkit (Quirk, Choudhury, Gao, Suzuki, Toutanova, Gamon, Yih, Cherry, Vanderwende).  This is a demo of a system where you send them a sentence and they tell you everything you want to know about it.  Never run your own parser/NER/etc. again.  And, having seen it in action at MSR, it's fast and high quality.

Parsing Time: Learning to Interpret Time Expressions (Angeli, Manning, Jurafsky).  This was a great paper about semantic interpretation via compositional semantics (something sort of like lambda calculus) for time expressions.  I cannot find myself getting super jazzed up about time, but it's a nice constrained problem and their solution is clean.  I'm actually thinking of using something like this (or a subset thereof) as a course project for the intro CL course in the Fall, since I'm always starved for something to do with compositional semantics.

Getting More from Morphology in Multilingual Dependency Parsing (Hohensee, Bender).  If you have morphology, you can do better parsing by modeling things like agreement (really?  people hadn't done this before??).  Caveat: they use gold standard morphology.  But cool idea still.

Unsupervised Translation Sense Clustering (Bansal, DeNero, Lin).  If you want to build a bilingual dictionary from parallel text, you need to cluster translations into senses.  Here's a way to do it, with nice improvements from using bilingual contexts.

I also feel like I should raise my glass to the organizers of NLP Idol and congrats to Ray for winning with Robert Wilensky's paper "PAM." (If anyone can find an online version, please comment!) Though I would actually encourage everyone to read all three papers if you haven't already.  They all changed how I was thinking about problems.  Here are the others: Burstiness (Church), Context (Akman), Supertagging (Bangalore, Joshi).

18 February 2012

Making sense of Wikipedia categories

Wikipedia's category hierarchy forms a graph. It's definitely cyclic (Category:Ethology belongs to Category:Behavior, which in turn belongs to Category:Ethology).

At any rate, did you know that "Chicago Stags coaches" are a subcategory of "Natural sciences"?  If you don't believe me, go to the Wikipedia entry for the Natural sciences category, and expand the following list of subcategories:

  • Biology
  • Zoology
  • Subfields of zoology
  • Ethology
  • Behavior
  • Human behavior
  • Recreation
  • Games
  • Ball games
  • Basketball
  • Basketball teams
  • Defunct basketball teams
  • Defunct National Basketball Association teams
  • Chicago Stags
  • Chicago Stags coaches
I guess it kind of makes sense.  There are some other fun ones, like "Rhaeto-Romance languages", "American World War I flying aces" and "1911 films". Of course, these are all quite deep in the "hierarchy" (all of those are at depth 15 or higher).

So if you're trying to actually find pages about Natural sciences, maybe it's enough to limit the depth of your breadth-first search down the graph.
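The depth-limited crawl itself is straightforward.  A sketch over a toy stand-in for the category graph (in reality you'd build the graph from a Wikipedia dump or the API); the visited set also takes care of cycles like the Ethology/Behavior one:

```python
from collections import deque

def categories_within_depth(graph, root, max_depth):
    """BFS down the (possibly cyclic) subcategory graph, cut at max_depth."""
    seen = {root: 0}
    q = deque([root])
    while q:
        cat = q.popleft()
        if seen[cat] == max_depth:
            continue                       # don't expand past the cutoff
        for sub in graph.get(cat, []):
            if sub not in seen:            # handles cycles too
                seen[sub] = seen[cat] + 1
                q.append(sub)
    return seen

# Toy stand-in for the real category graph:
graph = {
    "Natural sciences": ["Biology", "Physical sciences"],
    "Biology": ["Zoology"],
    "Zoology": ["Ethology"],
    "Ethology": ["Behavior"],              # ...and eventually the Stags coaches
}
print(sorted(categories_within_depth(graph, "Natural sciences", 2)))
```

Cutting at depth 2 here keeps Biology, Physical sciences and Zoology but never reaches Ethology, which is the whole trick; the hard part, as below, is that no single cutoff is right everywhere.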

This works sort of okay: things up to and including depth four are quite reasonable, including topics like "Neurochemistry", "Planktology" and "Chemical elements".  There are a few outliers, like "Earth observation satellites of Israel", which you could certainly make a case is not natural science.

At depth five, things become much more mixed.  On the one hand, you get categories you might like to include, like "Statins", "Hematology", "Lagoons" and "Satellites" (interesting that Satellites is actually deeper than the Israel thing).  But you also get a roughly equal number of weird things, like "Animals in popular culture" and "Human body positions".  It's still not 50/50, but it's getting murky.

At depth six, based on my quick perusal, it's about 50/50.

And although I haven't tried it, I suspect that if you use a starting point other than Natural sciences, the depth at which things get weird is going to be very different.

So I guess the question is how to deal with this.

One thought is to "hope" that editors of Wikipedia pages will list the categories of pages roughly in order of importance, so that you can assume that the first category listed for a page is "the" category for that page.  This would render the structure a tree.  For the above example, this would cut the list at "Subfields of zoology", because the first listed category for the Ethology category is "Behavioral sciences", not "Subfields of zoology."

Doing this seems to make life somewhat better; you cut out the stags coaches, but you still get the "Chicago Stags draft picks" (at depth 17).  The path, if you care, is (Natural sciences -> Physical sciences -> Physics -> Fundamental physics concepts -> Matter -> Structure -> Difference -> Competition -> Competitions -> Sports competitions -> Sports leagues -> Sports leagues by country -> Sports leagues in the United States -> Basketball leagues in the United States -> National Basketball Association -> National Basketball Association draft picks).  Still doesn't feel like Natural sciences to me.  In fairness, at depth 6, life is much better.  You still get "Heating, ventilating, and air conditioning" but many of the weird entries have gone away.

Another idea is the following.  Despite not being a tree or DAG, there is a root to the Wikipedia hierarchy (called Category:Contents).  For each page/category you can compute its minimum depth from that Contents page.  Now, when you consider subpages of Natural sciences, you can limit yourself to pages whose shortest path goes through Natural sciences.  Basically you're trying to encode the idea that if the shallowest way to reach Biology is through Natural sciences, it's probably a natural science.
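A sketch of that heuristic, again over a made-up toy graph: keep a category only if its minimum depth from the root equals the root-to-Natural-sciences depth plus the Natural-sciences-to-category depth, i.e., some shortest path from Contents actually passes through Natural sciences:

```python
from collections import deque

def min_depths(graph, root):
    """Single-source BFS shortest-path depths in an unweighted digraph."""
    d = {root: 0}
    q = deque([root])
    while q:
        u = q.popleft()
        for v in graph.get(u, []):
            if v not in d:
                d[v] = d[u] + 1
                q.append(v)
    return d

def through(graph, root, via):
    """Categories whose shortest path from root passes through `via`."""
    from_root = min_depths(graph, root)
    from_via = min_depths(graph, via)
    return {c for c in from_via
            if from_root.get(c) == from_root[via] + from_via[c]}

# Toy graph: "Biology lists" is reachable more shallowly via "Lists",
# so the heuristic throws it out even though it belongs.
graph = {
    "Contents": ["Lists", "Sciences"],
    "Lists": ["Biology lists"],
    "Sciences": ["Natural sciences"],
    "Natural sciences": ["Biology"],
    "Biology": ["Biology lists", "Zoology"],
}
print(sorted(through(graph, "Contents", "Natural sciences")))
```

Even on this tiny example the "Biology lists" failure mode shows up immediately: it gets excluded because a shorter path to it skips Natural sciences entirely.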

This also fails.  For instance, the depth of "Natural sciences" (=5) is the same as the depth of "Natural sciences good articles", so if you start from Natural sciences, you'll actually exclude all the good articles!  Moreover, even if you insist that a shortest path go through Natural sciences, you'll notice that many editors have depth 5, so any page they've edited will be allowed.  Maybe this is a fluke, but "Biology lists" has depth of only 4, which means that anything that can be reached through "Biology lists" would be excluded, something we certainly wouldn't want to do.  There's also the issue that the hierarchy might be much bushier for some high-level topics than others, which makes comparing depths very difficult.

So, that leaves me not really knowing what to do.  Yes, I could compute unigram distributions over the pages in topics and cut when those distributions get too dissimilar, but (a) that's annoying and very computationally expensive, (b) requires you to look at the text of the pages which seems silly, (c) you now just have more hyperparameters to tune.  You could annotate it by hand ("is this a natural science") but that doesn't scale.  You could compute the graph Laplacian and look at flow and use "average path length" rather than shortest paths, but this is a pretty big graph that we're talking about.

Has anyone else tried and succeeded at using the Wikipedia category structure?

11 February 2012

De-Authorship attribution

I received the following (slightly edited) question from my colleague Jon Katz a few days ago:

I was thinking about the problem of authorship attribution... Have people thought about the flip side of this problem? Namely, "anonymizing" text so that it would be hard to attribute it to any author?
This is something I've actually wondered about in the context of blogging for a while.  I noticed at some point that my "blogger voice" is very similar to my "reviewer voice" and started worrying that I might be too identifiable as a reviewer.  This might either be due to lexical choice ("bajillion" or "awesome") or due to some more subtle stylistic choices.

There is quite a bit of work on authorship attribution.  I think the first time I heard a talk on this topic was on March 24, 2004, when Shlomo Argamon gave a talk at ISI (no, I don't have an amazing memory, I cheated) on "On Writing, Our Selves: Explorations in Stylistic Text Categorization."  The basic hypothesis of the talk, at least as I remember it, was that if you're trying to do authorship attribution, you should throw out content words and focus on things like POS tag sequences, parse tree structures, and things like that.

There's been a lot of subsequent work in this, and related areas.  One very related area is on things like trying to predict demographic information (age, gender, socio-economic status, education level, and, yes, astrological sign) from tweets, blog posts or emails (or other forms).  One of the key distinctions that I think is important in all of this work is whether the original author is intentionally trying to hide information about him or herself.  For instance, someone trying to impersonate Shakespeare, or a child predator pretending to be a different age or gender, or a job applicant trying to sound more educated than is true.  This latter is a much harder problem, because the stupid topically stereotypical features that pop out as being indicative (like men talking about "wives" and "football" and women talking about "husbands" and "yoga") and the silly features that don't really tell us anything interesting (on twitter, apparently men tend to put "http://" before URLs more than women -- who knew?) are exactly the things these "pretenders" are going to intentionally hide (now that everyone knows to drop "http://" to trick gender recognizers!).  It also means that falling back on topic as a surrogate for demography should not work as well.  This seems to be a very different problem from trying to identify whether a blog post is written by me or by Jon, which should be 99.9% do-able by just looking at content words.

The reason I bring this all up is because we don't want to anonymize by changing the topic.  The topic needs to stay the same: we just need to cut out additional identifying information.  So, getting back to Jon's question, the most relevant work that I know of is on text steganography (by Ching-Yun Chang and Stephen Clark), where they use the ability to do paraphrasing to encode messages in text.  Aside from the challenge of making the output actually somewhat grammatical, the basic idea is that when you have two ways of saying the same thing (via paraphrases), you can choose the first one to encode a "0" and the second to encode a "1" and then use this to encode a message in seemingly-natural text.
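Stripped of everything hard (actually finding grammatical paraphrases), the encoding trick looks like this sketch.  The paraphrase table is made up for illustration and is nothing like Chang and Clark's actual system:

```python
# Encode a bit string by choosing among paraphrases: option 0 encodes "0",
# option 1 encodes "1".  The paraphrase table is made up for illustration.
paraphrases = {
    "big": ["big", "large"],
    "start": ["start", "begin"],
    "think": ["think", "believe"],
}

def encode(words, bits):
    out, i = [], 0
    for w in words:
        if w in paraphrases and i < len(bits):
            out.append(paraphrases[w][int(bits[i])])   # pick variant = bit
            i += 1
        else:
            out.append(w)
    return " ".join(out)

def decode(words):
    bits = []
    for w in words:
        for opts in paraphrases.values():
            if w in opts:
                bits.append(str(opts.index(w)))
    return "".join(bits)

msg = encode("i think the big dogs start barking".split(), "101")
print(msg)                    # i believe the big dogs begin barking
print(decode(msg.split()))    # 101
```

A real system also needs the substitutions to stay grammatical in context and the paraphrase set to be recoverable by the decoder, which is where all the difficulty lives.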

I also remember having a conversation a while ago with a (different) colleague about trying to build a chat system where you could pretend that you're chatting with someone famous (like Obama or Harry Potter or Scooby Doo).  A similar problem is trying to paraphrase my own writing to sound like someone else, but zoinks, that seems hard!  A basic approach would be to build a Scooby Doo language model (SDLM) and then run my blog posts through a paraphrase engine that uses the SDLM for producing the output.  My vague sense is that this would work pretty poorly, primarily because the subtleness in phrase structure selection would be lost on a highly-lexicalized language model.  I imagine you'd get some funny stuff out and it might be amusing to do, but I don't have time to try.

As far as pure anonymization goes, it seems like doing something similar to the steganography approach would work.  Here, what you could do is generate a random sequence of bits, and then "encode" that random sequence using the steganography system.  This would at least remove some identifying information.  But the goal of steganography isn't to change every phrase, just to change enough phrases that you can encode your message.  It also wouldn't solve the problem that perhaps you can identify a bit about an author by the lengths of their sentences, or their oscillation between long and short sentences; that wouldn't be hidden either.

An alternative, human-in-the-loop approach might be simply to have an authorship recognition system running in your word processor, and then any time you type something that enables it to identify you, it could highlight it and you could be tasked with changing it.  I suspect this would be a frustrating, but fairly interesting experience (at least the first time).

p.s., I'm now officially tweeting on @haldaume3.

12 December 2011

It's that magical time of year...

By which I mean NIPS, and the attendant exhaustion of 14-hour days.  (P.s., if you're at NIPS, see the meta-comment I have at the end of this post, even if you skip the main body :P.)

Today I went to two tutorials: one by Yee Whye Teh and Peter Orbanz (who is starting shortly at Columbia) on non-parametric Bayesian stuff, and one by Naftali Tishby on information theory in learning and control.  They were streamed online; I'm sure the videos will show up at some point on some web page, but I can't find them right now (incidentally, and shamelessly, I think NAACL should have video tutorials, too -- actually my dear friend Chris wants that too, and since Kevin Knight has already promised that ACL approves all my blog posts, I suppose I can only additionally promise that I will do everything I can to keep at least a handful of MT papers appearing in each NAACL despite the fact that no one really works on it anymore :P).  Then there were spotlights followed by posters, passed (as well as sat down) hors d'oeuvres, free wine/sangria/beer/etc, and friends and colleagues.

The first half of the Teh/Orbanz tutorial is roughly what I would categorize as "NP Bayes 101" -- stuff that everyone should know, with the addition of some pointers to results about consistency, rates of convergence of the posterior, etc.  The second half included a lot of stuff that's recently become interesting, in particular topics like completely random measures, coagulation/fragmentation processes, and the connection between gamma processes (an example of a completely random measure) and Dirichlet processes (which we all know and love/hate).

One of the more interesting things toward the end was what I would roughly characterize as variants of the de Finetti theorem on exchangeable objects.  What follows is from memory, so please forgive errors: you can look it up in the tutorial.  De Finetti's theorem states that if p(X1, X2, ..., Xn, ...) is exchangeable, then p has a representation as a mixture model, with (perhaps) infinite dimensional mixing coefficients.  This is a fairly well-known result, and was apparently part of the initial reason Bayesians started looking into non-parametrics.
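In symbols (my reconstruction of the usual statement, not copied from the tutorial): exchangeability of the infinite sequence buys you a mixing measure mu such that, conditioned on a draw theta from mu, the observations are i.i.d.:

```latex
% de Finetti representation of an infinite exchangeable sequence:
p(x_1, \dots, x_n) \;=\; \int \prod_{i=1}^{n} q(x_i \mid \theta) \; d\mu(\theta)
```

with theta possibly infinite-dimensional, which is exactly the non-parametric hook.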

The generalizations (due to people like Kingman, Pitman, Aldous, etc...) are basically what happens for other types of data (i.e., other than just exchangeable).  For instance, if a sequence of data is block-exchangeable (think of a time-series, which is obviously not exchangeable, but for which you could conceivably cut it into a bunch of contiguous pieces and these pieces would be exchangeable) then it has a representation as a mixture of Markov chains.  For graph-structured data, if the nodes are exchangeable (i.e., all that matters is the pattern of edges, not precisely which nodes they happen to connect), then this also has a mixture parameterization, though I've forgotten the details.

The Tishby tutorial started off with some very interesting connections between information theory, statistics, and machine learning, essentially from the point of view of hypothesis testing.  The first half of the tutorial centered around information bottleneck, which is a very beautiful idea. You should all go read about it if you don't know it already.

What actually really struck me was a comment that Tishby made somewhat off-hand, and I'm wondering if anyone can help me out with a reference.  The statement has to do with the question "why KL?"  His answer had two parts.  For the first part, consider mutual information (which is closely related to KL).  MI has the property that if "X -> Y -> Z" is a Markov chain, then the amount of information that Z gives you about X is at most the amount of information that Y gives you about X.  In other words, if you think of Z as a "processed" version of Y, then this processing cannot give you more information.  This property is more general than just MI, and I believe anything that obeys it is a Csiszar divergence.  The second part is the part that I'm not so sure of.  It originated with the observation that if you have a product and take a log, you now get an additive term.  This is really nice because you can apply results like the central limit theorem to this additive term.  (Many of the results in the first half of his tutorial hinged on this additivity.)  The claim was something like: the only divergences that have this additivity are Bregman divergences.  (This is not obvious to me, and actually not entirely obvious what the right definition of additivity is, so if someone wants to help out, please do so!)  But the connection is that MI and KL are the "intersection" of Bregman divergences and Csiszar divergences.  In other words, if you want the decreasing information property and you want the additivity property, then you MUST use information theoretic measures.
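For reference, the standard data-processing inequality (my reconstruction from memory, so the usual caveats apply): for a Markov chain X -> Y -> Z,

```latex
% Data processing inequality for a Markov chain X -> Y -> Z:
I(X; Z) \;\le\; \min\bigl\{\, I(X; Y),\; I(Y; Z) \,\bigr\}
```

and I believe the analogous data-processing property under channels holds for any Csiszar f-divergence, which is the generality the talk alluded to.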

I confess that roughly the middle third of the talk went above my head, but I did learn about an interesting connection between Gaussian information bottleneck and CCA: basically they're the same, up to a trimming of the eigenvalues.  This is in a 2005 JMLR paper by Amir Globerson and others.  In the context of this, actually, Tishby made a very offhand comment that I couldn't quite parse as either a theorem or a hope.  Basically the context was that when working with Gaussian distributed random variables, you can do information bottleneck "easily," but that it's hard for other distributions.  So what do we do?  We do a kernel mapping into a high dimension space (they use an RBF kernel) where the data will look "more Gaussian."  As I said, I couldn't quite parse whether this is "where the data will provably look more Gaussian" or "where we hope that maybe by dumb luck the data will look more Gaussian" or something in between.  If anyone knows the answer, again, I'd love to know.  And if you're here at NIPS and can answer either of these two questions to my satisfaction, I'll buy you a glass of wine (or beer, but why would you want beer? :P).

Anyway, that's my report for day one of NIPS!

p.s. I made the silly decision of taking a flight from Granada to Madrid at 7a on Monday 19 Dec.  This is way too early to take a bus, and I really don't want to take a bus Sunday night.  Therefore, I will probably take a cab.  I think it will be about 90 euros.  If you also were silly and booked early morning travel on Monday and would like to share said cab, please email me (me AT hal3 DOT name).

14 October 2011

You need a job and I have $$$

If you're an NLP or ML person and graduating in the next six months or so and are looking for a postdoc position with very very open goals, read on.  The position would be at UMD College Park, in the greater DC area, with lots of awesome people around (as well as JHU and other universities a short drive/train away).

This position could start as early as January 2012, probably more likely around June 2012 and could be as late as September 2012 for the right person.  Even if you're not graduating until the one-year-from-now time frame, please contact me now!  I'm looking more for a brilliant, hard-working, creative person than anyone with any particular skills.  That said, you probably know what sort of problems I tend to work on, so it would be good if you're at least interested in things roughly in that space (regardless of whether you've worked on them before or not).

The position would be for one year, with an extension to two if things are working out well for both of us (not subject to funding).

If you're interested, please email me at postdoc@hal3.name with the following information:

  1. Inline in the email:
    1. Your PhD institution and advisor, thesis title, and expected graduation date.
    2. Links to the two or three most awesome papers you have, together with titles and venue.
    3. Link to your homepage.
    4. A list of three references (names, positions and email addresses).
  2. Attached to the email:
    1. A copy of your CV, in PDF format.
    2. A brief (one page) research statement that focuses mostly on what problem(s) you'd most like to work on in a postdoc position with me.  Also in PDF format.
I need this information by November 1st, so please reply quickly!!!

11 October 2011

Active learning: far from solved

As Daniel Hsu and John Langford pointed out recently, there has been a lot of recent progress in active learning. This is to the point where I might actually be tempted to suggest some of these algorithms to people to use in practice, for instance the one John has that learns faster than supervised learning because it's very careful about what work it performs. That is, in particular, I might suggest that people try it out instead of the usual query-by-uncertainty (QBU) or query-by-committee (QBC). This post is a brief overview of what I understand of the state of the art in active learning (paragraphs 2 and 3) and then a discussion of why I think (a) researchers don't tend to make much use of active learning and (b) why the problem is far from solved. (a will lead to b.)

For those who know what QBU and QBC are, skip this paragraph. The idea with QBU is exactly what you think: when choosing the next point to ask for the label of, choose the one on which your current model is maximally uncertain. If you're using a probabilistic model, this means something like "probability is closest to 0.5," or, in the non-binary case, something like "highest entropy of p(y|x)." If you're using something like an SVM, perhaps margin (aka distance to the hyperplane) is a reasonable measure of uncertainty. In QBC, the idea is still to query on uncertain points, but the uncertainty is computed from the amount of disagreement among a committee of classifiers, for instance, classifiers trained in a bootstrap manner on whatever data you have previously labeled.
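A minimal sketch of QBU in the probabilistic case (toy code, not from any particular toolkit): pick the unlabeled examples whose predicted label distribution has the highest entropy, which in the binary case is just the one whose probability is closest to 0.5.

```python
import numpy as np

def query_by_uncertainty(probs, k=1):
    """Return indices of the k most uncertain examples.
    probs: (n_examples, n_classes) array of model probabilities."""
    # entropy of each row; the epsilon guards against log(0)
    ent = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(-ent)[:k]

# In the binary case, highest entropy = probability closest to 0.5.
probs = np.array([[0.90, 0.10],
                  [0.55, 0.45],
                  [0.99, 0.01]])
print(query_by_uncertainty(probs, k=1))  # -> [1]
```

QBC replaces the entropy of one model with a disagreement measure (e.g., vote entropy) over a committee, but the selection loop is otherwise identical.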

One of the issues with QBU and QBC and really a lot of the classic methods for active learning is that you end up with a biased set of training data. This makes it really hard to prove anything about how well your algorithm is going to do on future test examples, because you've intentionally selected examples that are not random (and probably not representative). One of the "obvious in retrospect" ideas that's broken this barrier is to always train your classifier on all examples: the label for those that you've queried on is given by the human, and the label for those that you haven't queried on is given by your model from the previous iteration. Thus, you are always training on an iid sample from the distribution you care about (at least from a p(x) perspective). This observation, plus a lot of other work, leads to some of the breakthroughs that John mentions.
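A minimal sketch of that "train on everything" loop, using a toy nearest-centroid classifier so it stays self-contained (the classifier, the data, and the query rule are all illustrative stand-ins, not any published algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_y = (X[:, 0] + X[:, 1] > 0).astype(int)   # the "human" oracle

def fit_centroids(X, y):
    # toy stand-in classifier: one centroid per class
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def distances(centroids, X):
    return np.linalg.norm(X[:, None, :] - centroids[None], axis=2)

# seed with a few human labels from each class
queried = set(np.where(true_y == 0)[0][:5]) | set(np.where(true_y == 1)[0][:5])
idx = np.fromiter(queried, dtype=int)
model = fit_centroids(X[idx], true_y[idx])

for _ in range(5):                             # a few rounds of active learning
    idx = np.fromiter(queried, dtype=int)
    y = distances(model, X).argmin(axis=1)     # self-label every point...
    y[idx] = true_y[idx]                       # ...but trust the human where queried
    model = fit_centroids(X, y)                # retrain on ALL of p(x)
    # query the point the new model is least sure about (closest to the boundary)
    d = distances(model, X)
    margin = np.abs(d[:, 0] - d[:, 1])
    margin[idx] = np.inf                       # never re-query
    queried.add(int(margin.argmin()))

print(len(queried))  # -> 15
```

The point of the self-labeling step is exactly the one in the paragraph above: the training set is always all of X, so its marginal distribution is never biased by the querying.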

An easy empirical observation is that not many people (in my sphere) actually use active learning. In fact, the only case that I know of was back in 2004, when IBM annotated extra coreference data for the Automatic Content Extraction (ACE) challenge using active learning. Of course people use it to write papers about active learning, but that hardly counts. (Note that the way that previously learned taggers, for instance the Brill Tagger, were used in the construction of the Penn Treebank does not fall under the auspices of active learning, at least as I'm thinking about it here.)

It is worth thinking about why this is. I think that the main issue is that you end up with a biased training set. If you use QBC or QBU, this is very obvious. If you use one of the new approaches that self-label the rest of the data to ensure that you don't have a biased training set, then of course p(x) is unbiased, but p(y|x) is very biased by whatever classifier you are using.

I think the disconnect is the following. The predominant view of active learning is that the goal is a classifier. That data that is labeled is a byproduct that will be thrown away, once the classifier exists.

The problem is that this view flies in the face of the whole point of active learning: that labeling is expensive. If labeling is so expensive, we should be able to reuse this data so that the cost is amortized. That is, yes, of course we care about a classifier. But just as much, we care about having a data set (or "corpus" in the case of NLP).

Consider, for instance, the Penn Treebank. The sorts of techniques that are good at parsing now were just flat-out not available (and perhaps not even conceivable) back in the mid-1990s when the Treebank was being annotated. If we had done active learning for the Treebank under a non-lexicalized, non-parent-annotated PCFG that gets 83% accuracy, maybe worse because we didn't know how to smooth well, then how well would this data set work for modern day state splitting grammars with all sorts of crazy Markovization and whatnot going on?

The answer is: I have no idea. I have never seen an experiment that looks at this issue. And it would be so easy! Run your standard active learning algorithm with one type of classifier. Plot your usual active-versus-passive learning curves. Now, using the same sequence of data, train another classifier. Plot that learning curve. Does it still beat passive selection? By how much? And then, of course, can we say anything formal about how well this will work?
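The experiment is easy to sketch. Below, a hypothetical "classifier A" (nearest centroid) drives QBU selection and records the query order; then a different "classifier B" (1-nearest-neighbor) is trained on the exact same actively selected sequence. Comparing both of their learning curves against passive selection is the missing experiment; the data and classifiers here are purely illustrative toys:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] > 0.5).astype(int)   # not linearly separable
Xtr, ytr, Xte, yte = X[:300], y[:300], X[300:], y[300:]

def centroid_predict(Xl, yl, Xq):
    # "classifier A": nearest class centroid (toy stand-in)
    c = np.stack([Xl[yl == k].mean(axis=0) for k in (0, 1)])
    return np.linalg.norm(Xq[:, None] - c[None], axis=2).argmin(axis=1)

def knn_predict(Xl, yl, Xq):
    # "classifier B": 1-nearest-neighbor (a different hypothesis class)
    d = np.linalg.norm(Xq[:, None] - Xl[None], axis=2)
    return yl[d.argmin(axis=1)]

# run QBU with classifier A, recording the query order
order = list(np.where(ytr == 0)[0][:2]) + list(np.where(ytr == 1)[0][:2])
while len(order) < 40:
    c = np.stack([Xtr[order][ytr[order] == k].mean(axis=0) for k in (0, 1)])
    d = np.linalg.norm(Xtr[:, None] - c[None], axis=2)
    margin = np.abs(d[:, 0] - d[:, 1])   # small margin = uncertain
    margin[order] = np.inf               # never re-query
    order.append(int(margin.argmin()))

# now train BOTH classifiers on A's actively selected sequence
accA = (centroid_predict(Xtr[order], ytr[order], Xte) == yte).mean()
accB = (knn_predict(Xtr[order], ytr[order], Xte) == yte).mean()
print(round(accA, 2), round(accB, 2))
```

The interesting plot is accB as a function of prefix length of `order`, against accB on a random sample of the same size: does A's selection help or hurt a classifier with different inductive biases?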

There are tons of ways that this problem can arise. For instance, when I don't have much data I might use a generative model and then when I have lots of data I might use a discriminative model. Or, as I get more data, I add more features. Or someone finds awesome features 5 years later for my problem. Or new machine learning techniques are developed. Or anything. I don't want my data to become obsolete when this happens.

I am happy to acknowledge that this is a very hard problem. In fact, I suspect that there's some sort of no-free-lunch theorem lurking in here. Essentially, if the inductive biases of the classifier that you use to do the active learning and the classifier you train at the end are too different, then you could do (arbitrarily?) badly. But in the real world, our hypothesis classes aren't all that different, or perhaps you can assume you're using a universal function approximator or a universal kernel or something. Assume what you want to start, but I think it's an interesting question.

And then, while we're on the topic of active learning, I'd also like to see whether an active learning algorithm's performance asymptotes before all your training data is exhausted. That is, the usual model in active learning experiments is that you have 1000 training examples because that's what someone labeled. You then do active learning up to 1000 examples, and of course at that point, everything has been labeled, so active learning performance coincides precisely with passive learning performance. But this is a poor reflection of many problems in the world, where new inputs are almost always free. I want the Banko and Brill paper for active learning... perhaps it's out there, and if you've seen it, I'd love a pointer. I ran a couple experiments along these lines (nothing concrete), but it actually seemed that active learning from a smaller pool was better, perhaps because you have fewer outliers (I was using QBU). But these results are by no means concrete, so don't cite me as saying this :).

At any rate, I agree that active learning has come a long way. I would humbly suggest that the goal of simply building a classifier is not in the real spirit of trying to save money. If you wanted to save money, you would save your data and share it (modulo lawyers). In the long run, passive learning currently seems much less expensive than active learning to me.

29 September 2011

A technique for me is a task for you

Originally in the context of Braque and now in the context of FUSE, I've thought a bit about understanding the role of techniques and tasks in scientific papers (admittedly, mostly NLP and ML, which I realize are odd and biased).  I worked with Sandeep Pokkunuri, a MS student at Utah, looking at the following problem: given a paper (title, abstract, fulltext), determine what task is being solved and what technique is being used to solve it.  For instance, for a paper like "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," the task would be "segmenting and labeling sequence data" and the technique would be "conditional random fields."

You can actually go a long way just looking for simple patterns in paper titles, like "TECH for TASK" or "TASK by TECH" and a few things like that (after doing some NP chunking and clean-up).  From there you can get a good list of seed tasks and techniques, and could conceivably bootstrap your way from there.  We never got a solid result out of these, and sadly I moved and Sandeep graduated and it never went anywhere.  What we wanted to do was automatically generate tables of "for this TASK, here are all the TECHs that have been applied (and maybe here are some results) oh and by the way maybe applying these other TECHs would make sense."  Or vice versa: this TECH has been applied to blah blah blah tasks.  You might even be able to tell what TECHs are better for what types of tasks, but that's quite a bit more challenging.
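A toy version of the title-pattern idea (the patterns and titles below are just illustrative, and the real pipeline also did NP chunking and clean-up first):

```python
import re

# hypothetical seed patterns of the "TECH for TASK" / "TASK by TECH" flavor
PATTERNS = [
    re.compile(r'^(?P<tech>.+?) for (?P<task>.+)$', re.I),
    re.compile(r'^(?P<task>.+?) (?:by|via|using|with) (?P<tech>.+)$', re.I),
]

def extract(title):
    """Return a (technique, task) pair guessed from a paper title, or None."""
    for pat in PATTERNS:
        m = pat.match(title)
        if m:
            return m.group('tech').strip(), m.group('task').strip()
    return None

print(extract("Conditional Random Fields for Sequence Labeling"))
# -> ('Conditional Random Fields', 'Sequence Labeling')
print(extract("Word Alignment via Quadratic Assignment"))
# -> ('Quadratic Assignment', 'Word Alignment')
```

From matches like these you can accumulate seed TECH and TASK lists, then bootstrap: titles that contain a known TECH suggest a new TASK on the other side of the connective, and so on.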

At any rate, a sort of "obvious in retrospect" thing that we noticed was that what I might consider a technique, you might consider a task.  And you can construct a chain, typically all the way back to math.  For instance, I might consider movie recommendations a task.  To solve recommendations, I apply the technique of sparse matrix factorization.  But then to you, sparse matrix factorization is a task and to solve it, you apply the technique of compressive sensing.  But to Scott Tanner, compressive sensing is a task, and he applies the technique of smoothed analysis (okay this is now false, but you get the idea).  But to Daniel Spielman, smoothed analysis is the task, and he applies the technique of some other sort of crazy math.  And then eventually you get to set theory (or some might claim you get to category theory, but they're weirdos :P).

(Note: I suspect the same thing happens in other fields, like bio, chem, physics, etc., but I cannot offer such an example because I don't know those areas.  Although not so obvious, I do think it holds in math: I use the proof technique of Shelah35 to prove blah -- there, both theorems and proof techniques are objects.)

At first, this was an annoying observation.  It meant that our ontology of the world into tasks and techniques was broken.  But it did imply something of a richer structure than this simple ontology.  For instance, one might posit as a theory of science and technologies studies (STS, a subfield of social science concerned with related things) that the most basic thing that matters is that you have objects (things of study) and an appliedTo relationship.  So recommender systems, matrix factorization, compressive sensing, smoothed analysis, set theory, etc., are all objects, and they are linked by appliedTos.

You can then start thinking about what sort of properties appliedTo might have.  It's certainly not a function (many things can be applied to any X, and any Y can be applied to many things).  I'm pretty sure it should be antireflexive (you cannot apply X to solve X).  It should probably also be antisymmetric (if X is applied to Y, probably Y cannot be applied to X).  Transitivity is not so obvious, but I think you could argue that it might hold: if I apply gradient descent to an optimization problem, and my particular implementation of gradient descent uses line search, then I kind of am applying line search to my problem, though perhaps not directly.  (I'd certainly be interested to hear of counter-examples if any come to mind!)

If this is true, then what we're really talking about is something like a directed acyclic graph, which at least at a first cut seems like a reasonable model for this world.  Probably you can find exceptions to almost everything I've said, but that's why you need statistical models or other things that can deal with "noise" (aka model misspecification).

Actually something more like a directed acyclic hypergraph might make sense, since often you simultaneously apply several techniques in tandem to solve a problem.  For instance, I apply subgradient descent and L1 regularization to my binary classification problem -- the fact that these two are being applied together rather than separately seems important somehow.
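A tiny sketch of the object/appliedTo view: store appliedTo as a directed graph and check that it stays acyclic (node names are illustrative, and for simplicity this is the plain-graph version rather than the hypergraph one):

```python
# appliedTo as adjacency lists: technique -> list of things it is applied to
applied_to = {
    "smoothed analysis": ["compressive sensing"],
    "compressive sensing": ["sparse matrix factorization"],
    "sparse matrix factorization": ["movie recommendation"],
}

def is_acyclic(edges):
    """Depth-first search looking for a back edge (i.e., a cycle)."""
    state = {}                      # node -> "visiting" | "done"
    def dfs(u):
        state[u] = "visiting"
        for v in edges.get(u, []):
            if state.get(v) == "visiting":
                return False        # back edge: cycle found
            if state.get(v) is None and not dfs(v):
                return False
        state[u] = "done"
        return True
    return all(state.get(u) == "done" or dfs(u) for u in edges)

print(is_acyclic(applied_to))  # -> True
applied_to["movie recommendation"] = ["smoothed analysis"]  # absurd edge
print(is_acyclic(applied_to))  # -> False
```

The antisymmetry and irreflexivity constraints from above fall out for free once you demand acyclicity; transitivity would be modeled by taking the transitive closure rather than by adding explicit edges.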

Not that we've gone anywhere with modeling the world like this, but I definitely think there are some interesting questions buried in this problem.

26 September 2011

Four months without blogs

As you've noticed, I haven't posted in a while.  I've also not been reading blogs.  My unread number of posts is now 462.  Clearly I'm not going to go back and read all 462 posts that I missed.  I will claim that this was an experiment to see what a (nearly) blog-free world is like.

I actually found that I missed both the reading and the writing, so now (especially that I've switched over to public transportation and so have about an hour to kill in transportation time) I'm going to go back to reading while being transported and blogging when I have time.

I figured I'd return to blogging by saying a bit about a recent experience.  Less than a month ago I had the honor of serving on Jurgen Van Gael's Ph.D. examination committee.  Jurgen did an excellent job and, as perhaps expected, passed.  But what I want to talk about is how the UK model (or at least the Cambridge model) is different from the US model.

In the UK, the examination is done by two faculty members, one internal (this was Stephen Clark) and one external (that was me).  It does not involve the advisor/supervisor, though this person can sit in the room without speaking :).  There is no public presentation and the process we followed was basically to go through the dissertation chapter-by-chapter, ask clarification questions, perhaps some things to get Jurgen to think on his toes, and so on.  This took about two hours.

Contrast this to the (prototypical) US model, where a committee consists of 5 people, perhaps one external (either external to CS or to the university, depending on how your institution sets it up), and includes the advisor.  The defense is typically a 45 minute public presentation followed by questions from the committee in a closed-room environment with the student.

Having been involved, now, in both types, I have to say they each have their pros and cons.  I think the lack of a public presentation in the UK model is a bit of a shame, though of course students could decide to do these anyway.  But it's nice to have something official for parents or spouses to come to if they'd like.  However, in the US, the public presentation, plus the larger committee, probably leads to the situation students often joke about: that not even their committee reads their dissertation.  You can always fall back on the presentation, much like students skip class reading when they know that the lecture will cover it all.  When it was just me, Stephen and Jurgen, there's really no hiding in the background :).

I also like how in the UK model, you can skip over the easy stuff and really spend time talking with the student about the deep material.  I found myself much more impressed with how well Jurgen knows his stuff after the examination than before, and this is not a feeling I usually get with US students because their defense is typically quite high-level.  And after 45 minutes of a presentation, plus 15 minutes of audience questions, the last thing anyone wants to do is sit around for another two hours examining the details of the dissertation chapter-by-chapter.

Regarding the issue of having the advisor there or not, I don't have a strong preference.  The one thing I will say is that having the advisor absent removes the potential for weird politics.  For instance, I have seen one or two defenses in which an advisor tends to answer questions for the student, without the student first attempting an answer.  If I were on these committees, with a relatively senior advisor, it might be politically awkward to ask them not to do this.  Luckily this issue hasn't come up for me, but I could imagine it happening.

Obviously I don't really expect anyone's policies to change, and I'm not even sure that they should, but I like thinking about things that I've grown used to taking for granted.  Plus, after having gone through the UK model, I think I will grill students a bit more during the Q/A time.  And if this means that fewer students ask me to be on their committees, then there's more time to blog :).

07 July 2011

Introducing Braque, your paper discovery friend

(Shameless plug/advertisement follows.)

Want to be informed of new interesting papers that show up online?

Tired of trawling conference proceedings to find that one gem?

Want to make sure interested parties hear about your newest results?

Want to know when a new paper comes out that cites you?

Braque (http://braque.cc) can help.

Braque is a news service for research papers (currently focusing primarily on NLP and ML, though it needn't be that way).  You can create channels that provide email or RSS feeds for topics you care about. You can add your own publications page as a resource to Braque so it knows to crawl your papers and send them out to interested parties.

Braque is something I built ages ago with Percy Liang, but it's finally more or less set up after my move. Feel free to email me questions and comments or (preferably) use the online comment system.

As a bit of warning: Braque is neither a paper search engine nor a paper archive.  And please be a bit forgiving if you go there immediately after this post shows up and it's a bit slow.... we only have one server :).

ps., yes, Braque is sort of like WhatToSee on crack.

06 July 2011

The conference(s) post: ACL and ICML

I'm using ACL/ICML as an excuse to jumpstart my resumed, hopefully regular, posting.  The usual "I didn't see/read everything" applies to all of this.  My general feeling about ACL (which was echoed by several other participants) was that the program was quite strong, but there weren't many papers that really stood out as especially great.  Here are some papers I liked and some attached thoughts, from ACL:

P11-1002 [bib]: Sujith Ravi; Kevin Knight
Deciphering Foreign Language
This paper is about building MT systems without parallel data.  There's been a bunch of work in this area.  The idea here is that if I have English text, I can build an English LM.  If you give me some French text and I hallucinate an F2E MT system, then its output had better score high on the English LM.

P11-1020 [bib] [dataset]: David Chen; William Dolan
Collecting Highly Parallel Data for Paraphrase Evaluation
Although this paper is about paraphrasing, the fun part is the YouTube stuff they did.  Read it and see :).

P11-1060 [bib]: Percy Liang; Michael Jordan; Dan Klein
Learning Dependency-Based Compositional Semantics
This paper is along the lines of semantic parsing stuff that various people (Ray Mooney, Luke Zettlemoyer/Mike Collins, etc.) have been doing.  It's a nice compositional model that is learned online.

P11-1099 [bib]: Vanessa Wei Feng; Graeme Hirst
Classifying arguments by scheme
This paper is about argumentation (in the "debate" sense) and identifying different argumentation types.  There are some nice correlations with discourse theory, but in a different context.

P11-2037 [bib]: Shu Cai; David Chiang; Yoav Goldberg
Language-Independent Parsing with Empty Elements
I'm really glad to see that people are starting to take this problem seriously again.  This falls under the category of "if you've ever actually tried to use a parser to do something then you need this."

Okay so that's not that many papers, but I did "accidentally" skip some sections.  So you're on your own for the rest.

For ICML, I actually felt it was more of a mixed bag.  Here are some things that stood out as cool:

Minimum Probability Flow Learning 
Jascha Sohl-Dickstein; Peter Battaglino; Michael DeWeese
This is one that I need to actually go read, because it seems too good to be true.  If computing a partition function ever made you squirm, read this paper.

Tree-Structured Infinite Sparse Factor Model 
XianXing Zhang; David Dunson; Lawrence Carin
This is trying to do factor analysis with tree factors; they use a "multiplicative gamma process" to accomplish it. This is something we tried to do a while ago, but could never really figure out how to do it.

Sparse Additive Generative Models of Text 
Jacob Eisenstein; Amr Ahmed; Eric Xing
The idea here is that if you're learning a model of text, don't re-learn the same "general background" distribution over and over again.  Then learn class- or topic-specific stuff as a sparse amendment to that background.

OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning 
Arvind Sujeeth; HyoukJoong Lee; Kevin Brown; Tiark Rompf; Hassan Chafi; Michael Wu; Anand Atreya; Martin Odersky; Kunle Olukotun
Six more words: Most authors ever on ICML paper.

Generalized Boosting Algorithms for Convex Optimization 
Alexander Grubb; Drew Bagnell
Suppose you want to boost something that's non-smooth?  Now you can do it.  Has nice applications in imitation learning, which is I suppose why I like it.

Learning from Multiple Outlooks 
Maayan Harel; Shie Mannor
This is a nice approach based on distribution mapping to the problem of multiview learning when you don't have data with parallel views.  (I'm not sure that we need a new name for this task, but I still like the paper.)

Parsing Natural Scenes and Natural Language with Recursive Neural Networks
Richard Socher; Cliff Chiung-Yu Lin; Andrew Ng; Chris Manning
This is basically about learning compositional semantics for vector space models of text, something that I think is really interesting and understudied (Mirella Lapata has done some stuff).  The basic idea is that if "red" is embedded at position x, and "sparrow" is embedded at y, then the embedding of the phrase "red sparrow" should be at f([x y]) where f is some neural network.  Trained to get good representations for parsing.
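The composition step is easy to sketch: if x and y are the word vectors, the phrase vector is f([x; y]) for a small neural net f, and the output has the same dimension as its inputs so the operation can recurse up a parse tree. The weights below are random and untrained, purely to show the shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                          # toy embedding dimension
W = rng.normal(scale=0.5, size=(d, 2 * d))     # untrained, illustrative weights
b = np.zeros(d)

def compose(x, y):
    """Embed a two-word phrase: f([x; y]) with f = tanh(W . + b)."""
    return np.tanh(W @ np.concatenate([x, y]) + b)

red = rng.normal(size=d)        # stand-in word vectors
sparrow = rng.normal(size=d)
phrase = compose(red, sparrow)  # same dimension as its parts,
print(phrase.shape)             # so composition can recurse
```

In the actual paper the same network is applied recursively over the tree and trained so that good parses score higher than bad ones; this snippet only shows the single composition step.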

Please reply in comments if you had other papers you liked!!!