27 July 2014

Hello, World!

Okay, usually Hello World is the first program you learn to write in a new programming language. For fun, I've been collecting how to say hello world in different human languages, something remarkably difficult to search for (because of the overloading of the word "language"). I have 28. I'd like to make it to 280 :). If you have one (or more) to contribute, email me, post a comment, or tweet to me @haldaume3. And of course if you think any of these is wrong, please let me know that too.

     1 bar Servus Woid!
     2 ca  Hola Món!
     3 de  Hallo Welt!
     4 en  Hello World!
     5 eo  Saluton, Mondo!
     6 es  ¡Hola Mundo!
     7 eu  Kaixo, mundua!
     8 fi  Hei maailma!
     9 hu  Helló, világ!
    10 ia  Hallo, mundo!
    11 id  Halo dunia!
    12 ja  こんにちは世界
    13 lv  Sveika, pasaule!
    14 min Helo dunia!
    15 mk  Здраво свету!
    16 ms  Helo dunia!
    17 nn  Hallo verda!
    18 no  Hallo, verden!
    19 pt  Olá Mundo!
    20 sh  Zdravo svete!
    21 sl  Pozdravljen svet!
    22 sq  Njatjeta Botë!
    23 sr  Здраво свете!
    24 sv  Hej Världen!
    25 th  เฮลโลเวิลด์
    26 tr  Merhaba dünya!
    27 vi  Xin chào thế giới!
    28 zh  世界,你好!

05 July 2014

My ACL 2014 picks...

Usual caveats: didn't see all papers, blah blah blah. Also look for #acl14nlp on twitter -- lots of papers were mentioned there too!

  • A Tabular Method for Dynamic Oracles in Transition-Based Parsing; Yoav Goldberg, Francesco Sartorio, Giorgio Satta.
    Joakim Nivre, Ryan McDonald and I tried searnifying MaltParser back in 2007 and never got it to work. Perhaps this is because we didn't have dynamic oracles and we thought that a silly approximate oracle would be good enough. Guess not. Yoav, Francesco and Giorgio have a nice technique for efficiently computing the best possible-to-achieve dependency parse given some (possibly incorrect) prefix parse.
  • Joint Incremental Disfluency Detection and Dependency Parsing; Matthew Honnibal, Mark Johnson
    The basic idea is to do shift-reduce dependency parsing, but allow for "rewinds" in the case of (predicted) disfluencies. I like that they didn't just go with the most obvious model and actually thought about what might be a good way to solve this problem. The idea is that if you get "Please book a flight to Boston uh to Denver...", you parse "to Boston" as usual, but when you get to the "uh", you remove the old arcs. You do it this way because detecting the disfluent segment ("to Boston") is much easier when you hit "uh" than when you hit "to Boston."
  • Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors; Marco Baroni; Georgiana Dinu; Germán Kruszewski
    This paper is summarized best by its own statement, which should win it the award for most honest paper ever: "...we set out to conduct this study because we were annoyed by the triumphalist overtones often surrounding [neural network embeddings], despite the almost complete lack of a proper comparison.... Our secret wish was to discover that it is all hype... Instead, we found that the [embeddings] are so good that, while the triumphalist overtones still sound excessive, there are very good reasons to switch to the new architecture."
  • Learning to Automatically Solve Algebra Word Problems; Nate Kushman; Luke Zettlemoyer; Regina Barzilay; Yoav Artzi
    An algebra word problem is something like "I have twice as many dimes as nickels and have $2.53. How many nickels do I have?" Of course usually they actually have an answer. They have a nice, fairly linguistically unstructured approach (i.e., no CCG) for mapping word problems to algebraic formulae and then solving those formulae. Code/data available.
  • Grounded Compositional Semantics for Finding and Describing Images with Sentences; Richard Socher, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng
    This is the follow-on work from Richard's NIPS workshop paper on text <-> images from this past NIPS. They fixed the main bug in that paper (the use of l2 error, which gives a trivial and uninteresting global optimal solution) and get nice results. If you're in the langvis space, worth a read, even if you don't like neural networks :).
  • From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions; Peter Young, Alice Lai, Micah Hodosh, Julia Hockenmaier
    I really like the "visual denotations" idea here. Basically you say something like "the set of worlds in which this sentence is true is the set of images in which this sentence is true (i.e., roughly the sentence is entailed by the image)." You can then measure similarity between sentences based on denotations.
  • Kneser-Ney Smoothing on Expected Counts; Hui Zhang; David Chiang
    I didn't actually see this talk or read the paper, but lots of people told me in hallways that this is a very nice result. Basically we like KN smoothing, but it only works for integral counts, which means it's hard to incorporate into something like EM, which produces fractional counts. This paper solves this problem.
  • Linguistic Structured Sparsity in Text Categorization; Dani Yogatama; Noah A. Smith
    Also didn't see this one, but skimmed the paper. The reason I really like this paper is because they took a well known technique in ML land (structured sparsity) and applied it to NLP, but in an interesting way. I.e., it wasn't just "apply X to Y" but rather find a very linguistically clever/interesting way of mapping X to a problem that we care about. Very cool work.
Overall I really liked the conference, thanks to everyone who helped put it together. I can't help but notice that about half of my picks above were actually TACL papers. I suspect this will be more and more true over time.

Please add comments with your favorite papers that I missed!

30 June 2014

Divergences passed through Bayes' rule

In a previous post's comments, we talked about Bayes rule and things like that. This got me wondering about the following question:

If we know p(A) and p(B|A), we can reconstruct p(A|B) perfectly by Bayes' rule. What if we only have estimates of p(A) and p(B|A)? How does the quality of the reconstruction of p(A|B) vary as a function of the quality of the estimates of the marginal and conditional?
I feel like there have to be results along these lines, but I was unable to find them. My next attempt was to prove something, which failed miserably after a few hours.  So, as a good empiricist and lazy(/bad) theorist, I designed a simple experiment.

Let A and B be binary variables. Let's generate a random joint distribution p(A,B), which has four cells for the four possible combinations of values of A and B. From this, we can directly compute the true marginal p(A) and the true conditionals p(B|A) and p(A|B).

Now, let's pick some "estimate" q(A) and q(B|A). You can think of these as a "noisy" version of p(A) and p(B|A). Given q(A) and q(B|A), we can compute a reconstructed joint distribution q(A,B) = q(A)q(B|A), as well as a reconstructed conditional distribution q(A|B) = q(A)q(B|A) / Z(q), where Z(q) is computed according to q. We can then compare q(A,B) to the true p(A,B) and q(A|B) to the true p(A|B) and measure how far apart they are.
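
For concreteness, here's what that reconstruction looks like for the binary case. This is just a quick Python/numpy sketch of the bookkeeping (the function name is mine; it is not the matlab code, mykl.m, linked at the end of the post):

    import numpy as np

    def reconstruct(qA, qB_given_A):
        """qA[a] = q(A=a); qB_given_A[a, b] = q(B=b | A=a).
        Returns the reconstructed joint q(A,B) and conditional q(A|B)."""
        qA = np.asarray(qA, dtype=float)
        qB_given_A = np.asarray(qB_given_A, dtype=float)
        joint = qA[:, None] * qB_given_A   # q(a,b) = q(a) q(b|a)
        qB = joint.sum(axis=0)             # Z(q): the implied marginal q(b)
        cond = joint / qB[None, :]         # q(a|b) = q(a) q(b|a) / Z(q)
        return joint, cond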

At this point we have to decide what our measurement (divergence) function is. I tried three: variational distance (max absolute difference), l1 distance (sum absolute difference) and KL divergence. To be absolutely pedantic, I will define the versions of these that I used. First, the KL variants:
KL( p(A) || q(A) ) = sum_a p(a) log [ p(a) / q(a) ]
KL( p(A,B) || q(A,B) ) = sum_{a,b} p(a,b) log [ p(a,b) / q(a,b) ]
KL( p(A|B) || q(A|B) ) = sum_b p(b) KL( p(A|B=b) || q(A|B=b) )
Note that the direction is q from p (chosen because p is the "true" distribution) and that this also has the advantage that the conditional KL is based on p(B), which (in this case) is known exactly and is "correct."

By analogy, for l1 distance we have:
l1(p(A), q(A)) = sum_a |p(a) - q(a)|
l1(p(A,B), q(A,B)) = sum_{a,b} |p(a,b) - q(a,b)|
l1(p(A|B),q(A|B)) = sum_b p(b) l1(p(A|B=b), q(A|B=b))
Note that this last one might be slightly non-standard, but is parallel to the KL definition.

Similarly, for variational distance:
var(p(A), q(A)) = max_a |p(a) - q(a)|
var(p(A,B), q(A,B)) = max_{a,b} |p(a,b) - q(a,b)|
var(p(A|B),q(A|B)) = sum_b p(b) var(p(A|B=b), q(A|B=b))
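
In code, the three conditional versions are all just p(b)-weighted sums of per-b divergences. A small sketch (function names are mine; p_joint and q_cond are numpy arrays indexed [a, b]):

    import numpy as np

    def kl(p, q):                              # KL(p || q) for discrete distributions
        nz = p > 0
        return np.sum(p[nz] * np.log(p[nz] / q[nz]))

    l1  = lambda p, q: np.sum(np.abs(p - q))   # sum absolute difference
    var = lambda p, q: np.max(np.abs(p - q))   # max absolute difference

    def conditional_divergence(p_joint, q_cond, div=kl):
        """sum_b p(b) * div( p(A|B=b) || q(A|B=b) ), as defined above.
        p_joint[a, b] = p(a, b); q_cond[a, b] = q(a | b)."""
        pB = p_joint.sum(axis=0)               # true marginal p(b)
        p_cond = p_joint / pB[None, :]         # true conditional p(a|b)
        return sum(pB[b] * div(p_cond[:, b], q_cond[:, b]) for b in range(len(pB)))
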
Okay, so now for the experiment. First I generate a random (uniform) true joint distribution p(A,B). I then run through 1,000,000 possible q(A,B), where each of the three sufficient statistics is chosen from [0.01, 0.02, ... 0.99]. I then conditionalize and marginalize these in all the relevant ways and compute KL. Finally, I generate plots like the following very representative example for KL:
In the left column, we're inspecting the recovered joint distribution and in the right column the recovered conditional distribution. The top row shows: for different divergences of q(A) from p(A), and for different divergences of q(B|A) from p(B|A), how far is (left) the recovered joint q(A,B) from the true joint p(A,B), or how far is (right) the recovered conditional q(A|B) from the true conditional p(A|B). The middle row is the projection of this into two dimensions, focusing on the divergence in the marginal, and the bottom row is the projection onto the divergence in the conditional. The title shows what the true distribution is in the form [p(a,b) p(a,~b) ; p(~a,b) p(~a,~b)]. I chose this example because the joint has a correlation between a and b.

This example is fairly benign: as the approximations become worse, so do both of the recovered distributions, in a fairly linear way until a plateau. From the bottom row, you can see that it's more important to get the conditional right than the marginal (you can have a marginal that's quite far--e.g., a KL of 1.5--and still get an almost perfect recovery of the conditional or joint, but this is not true for large differences in the conditional B|A).

One strange thing is that you often (for different true joints) see results that look like:
There's a very strange effect here, in which a larger KL on B|A can actually give a better recovery of the conditional, while a worse recovery of the joint.

 One can ask if this is an artifact of KL. So let's switch to L1 and variational for the first set of plots:

and variational:
So, in both L1 land and variational land, you can do better on the conditional by being worse on the (other) conditional.

For the example that gave rise to the weird KL results, we have the following for L1:
which shows almost an identical effect. For variational:
the effect is still the same.

Okay, so it's totally entirely possible (perhaps probable?) that there's a bug in my code. If you'd like to look, check out mykl.m and myklrun.m (yes, it's matlab). Let me know in the comments if there are bugs. If you'd like to look at more examples, check out all ten examples.

02 June 2014

Role models

During grad school, my advisor suggested I identify a recent grad who has been, to me, successful. I could then use him or her as a guide. I picked someone (he now knows who he is), and the exercise was useful: there are lots of ways to be successful in research land, and this helped me focus.

RST-relation=Topic-Shift.

I'm fairly serious about yoga. I've had a lot of instructors over the years and noticed a high correlation between InstructorILike and InstructorWhoIsMale. Initially I believed this was because male instructors pushed more, and that worked for me. Over time I realized that was not the full story.

I spent two weeks going to classes by instructors I hadn't had before to try to understand what variable(s) made the difference. I believe now that a large part of the reason I like male instructors is precisely because they're male. A female instructor would do some crazy pose and my brain would immediately say "I could never do that." A male instructor would do the same pose and my brain would say "If he can do it, so can I." (I'd then try and fail several times, but never with a defeatist attitude.)

Topic-UnShift.

I've heard for a long time that having role models you can identify with is important. As someone who has in almost all of my life fit into the overwhelming majority (white male in tech/academia), it's been rare that I've had the opportunity to really feel this effect for myself. I try to believe things even if they haven't happened to me, but it's always better when you can empathize rather than sympathize and it's easier to empathize when you've actually been there.

The first time I remember feeling the effect of a role model "who looks like me" was at the 1996 Olympics, when Poul-Erik Høyer Larsen (Denmark) was the first European to ever win the badminton semi-finals; he then won the gold medal against Dong Jiong (China). (This sport is dominated by Indonesia, China and Malaysia.) Growing up in a particular part of Los Angeles and playing badminton as a kid, I was very much an outlier. Even though I'd never heard of Poul-Erik before (everyone knew who Jiong was), his win gave me something I could aspire to.

A few years ago I began broadcasting my support of the LGBT community, e.g., an HRC link on my web page and painting my laptop. Since then I've gotten emails from several people (mostly students) effectively asking why there aren't more/any LGBT role models in our community. You can interpret "community" as meaning anything in the NLP/ML to CS to Science/Tech range. My answer: I don't know. It's hard to even know how large this community is because, unlike things like race and (binary) gender, it's not always outwardly inferrable (with noise). These issues affect tech in nuanced ways; see for instance an interview with the founder of Lesbians Who Tech or Queer in STEM for more.

This is all to say that having role models is important, and yes, it does matter who they are, where they came from, and what they look like. It mattered to the high school aged version of me, the grad school version of me, and the associate prof version of me. I'm not saying anything new here, but for our field to be healthy, we need a large number of successful people who can be role models for all sorts of students (and beyond). Token visibility is not enough because a single example of some particular label won't match with everyone who self-identifies with that label. The person I chose was, yes, a white male. There were plenty to choose from. But I chose him, and others would not have sufficed.

30 May 2014

Past tense is not past tense

I took part in a wonderful Dagstuhl workshop this past February on translating morphologically rich languages. (Yeah, I also don't really know why I was invited :P.) But many thanks to Alex, Kevin, Philipp, Helmut and Hans for inviting me. I had a realization during this workshop that I thought I'd share. It's obvious in retrospect, and perhaps in front-spect for many of you. Much of this came up in the discussion with Bonnie Webber, Marion Weller, Martin Volk, Marine Carpuat, Jörg Tiedemann and Maja Popovic, and Maja deserves much credit for her awesome error analysis tool that helped shed some light on German.

One thing you commonly think of when translating into a morphologically rich language is that there's stuff you're going to have to hallucinate. Really this isn't an issue of morphology per se, but just that this is one place where it's obvious. For instance, even going from English to French you'll have to hallucinate gender on your determiners (un versus une and le versus la) that's unmarked in English. Or when going from Japanese (which roughly combines present and future tenses into a single tense) to English, you'll have to hallucinate "will" at appropriate places.

An abstraction that I think was pretty widespread among the initial discussions in the workshop was that if you're going from language X to Y, there are basically two options:

  1. Phenomenon foo is explicit in Y but implicit in X, and therefore you'll have to hallucinate it (i.e., tense is explicit in English but not in Mandarin)
  2. Phenomenon bar is explicit in Y and also explicit in X, and so you can just copy it.
The problem is that (2) is just false, even for things that you think it might not be false for. That is to say: Just because two languages explicitly code for something that we give a consistent (linguistic) name to, doesn't mean that they code for it consistently.

Okay, so you want examples.

An easy example is gender. I've been well assured that, for instance, French and Russian both have explicit gender. But just because some noun (eg moon/lune) is feminine in French doesn't mean it's also feminine in Russian. (In fact I think it's neuter.)

You might argue gender is a stupid thing to pick because it's essentially an artificial encoding of who-knows-what.

How about tense. That clearly has a semantic interpretation (did something happen in the past, the present or the future) and so if languages X and Y both express some particular tense, they must be consistent in how they do it.

Wrong. Now my memory is getting a bit shaky, but my recollection is that, for instance, in newswire text, it's very common in German to refer to things that have happened in the past in present tense. To English speakers this is a strange convention (we tend to refer to such things in past tense), but it doesn't have to be so. And of course English has its own idiosyncrasies: see the plight of the native German speaker who cannot understand English tense usage in (New Zealand) news articles.

Part of this is probably because tense, even in English, is a pretty slippery concept. We (native English speakers) have no problem using present (or progressive) tense to refer to things that happened in the present (John runs), the past (so yesterday I'm running to the store and a hamburger falls on my head!), or the future (my flight leaves at 8:00 tonight).

Another easy example is definiteness (thanks to Kevin Knight for this inspiration). Again, our high school English teachers tell us that "the" (+Definite) has to refer to something that's already been introduced into context. I just went to cnn.com, clicked on the very first link, and the first sentence is "The boss resigned under pressure and other Veterans Affairs managers are likely on the way out." Ok you could argue that "The boss" is already in the context of the US news media (this is an article about Shinseki) but it's nonetheless very common to see (English) entities introduced using "the" and the precise rules that govern this may or may not be consistent across other languages.

The long and short of this is: I like the fact that translation into morphologically rich languages makes us pay attention to linguistic divergence. But that doesn't mean that divergences aren't there even when languages express the same set of linguistically-named phenomena. Usage can vary dramatically, be it for conventions, socio-linguistic reasons, or other things that are hard to pin down. It's just that by focusing all our energy on a very particular convention (newswire, parliament), we can pretty easily learn these mappings because there's no variability. Add some variability and we're hosed, even for languages with the same set of (overt) markings.

16 May 2014

Perplexity versus error rate for language modeling

It's fair to say that perplexity is the de facto standard for evaluating language models. Perplexity comes under the usual attacks (what does it mean? does it correlate with something we care about? etc.), but here I want to attack it for a more pernicious reason: it locks us in to probabilistic models.

Background

Language modeling---or more specifically, history-based language modeling (as opposed to full sentence models)---is the task of predicting the next word in a text given the previous words. For instance, given the history "Mary likes her coffee with milk and", a good language model might predict "sugar" and a bad language model might predict "socks." (This is related to the notion of cloze probability.)

It's quite clear that there is no "right answer" to any of these prediction problems. As an extreme example, given the history " The", there are any number of possible words that could go next. There's just no way to know what the "right" answer is, whether you're a machine or a person.

This is probably the strongest justification for a perplexity-like measure. Since there's no "right" answer, we'll let our learned model propose a probability distribution over all possible next words. We say that this model is good if it assigns high probability to "sugar" and low probability to "socks."

Perplexity just measures the cross entropy between the empirical distribution (the distribution of things that actually appear) and the predicted distribution (what your model likes): sum up the model's negative log probabilities of the observed words, divide by the number of words, and exponentiate (after throwing out unseen words).
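
Concretely, that's just the exponentiated average negative log probability. A minimal version (assuming you already have the model's log probability for each non-OOV test token):

    import math

    def perplexity(token_logprobs):
        """token_logprobs: natural-log model probabilities log p(w_i | history_i),
        one per test token, with unseen (OOV) tokens already thrown out."""
        cross_entropy = -sum(token_logprobs) / len(token_logprobs)   # in nats
        return math.exp(cross_entropy)

    # e.g. a model that assigned every token probability 1/250 would score
    # a perplexity of exactly 250.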

The Issue

The issue here is that in order to compute perplexity, your model must produce a probability distribution. Historically we've liked probability distributions because they can be combined with other probability distributions according to the rules of probability (eg., Bayes rule or chain rule). Of course we threw that out a long time ago when we realized that combining things, for instance, in log-linear models, worked a lot better in practice, if you had a bit of data to tune the weights of the log-linear models.

So the issue in my mind is that there's plenty of good technology out there for making predictions that does not produce probability distributions. I think it's really unfortunate that non-probabilistic approaches don't get to play the language modeling game because they produce the "wrong" sort of output (according to the evaluation, but not according to the real world). I'm not saying there aren't good reasons to like probabilistic models, but just that alternatives are good. And right now those alternatives cannot compete. (For instance, Roark, Saraclar and Collins [2007] don't use perplexity at all and just go for word error rate of a speech recognizer around their perceptron-based language model.)

When I Ran Into This

I was curious about building a language model using vw, in the context of another project, and also to stress-test multiclass classification algorithms that scale well with respect to the number of classes.

As soon as I ran it, I discovered the issue: it produced results in the form of error rates. As I recall (it was a while ago), the error rate was somewhere in the 60s or 70s. I had absolutely no idea whether this was good or not. It seemed reasonable.

To get a sense of how standard language models fare, I decided to train a language model using srilm, and evaluate it according to error rate. To make my life easier, I just ran it on the WSJ portion of the Penn treebank. I used the first 48k sentences as train and the last 1208 sentences as test. I trained a 5gram Kneser-Ney smoothed language model and evaluated both perplexity and error rate (the latter required a bit of effort -- if anyone wants the scripts, let me know and I'll post them---but basically, I just take the LM's prediction to be the highest probability word given the context).
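
The error-rate computation itself is trivial once you can get a next-word distribution out of the LM. A sketch of the idea; here next_word_dist(history) is a made-up stand-in for however you query your language model (it is not an srilm call), and sentences is a list of token lists:

    def error_rate(sentences, next_word_dist, restrict_to_sentence=False):
        """next_word_dist(history) should return a dict mapping candidate next
        words to probabilities (hypothetical interface)."""
        errors = total = 0
        for sent in sentences:
            for i, gold in enumerate(sent):
                dist = next_word_dist(sent[:i])
                if restrict_to_sentence:
                    # the "just put the words in the right order" setting below
                    dist = {w: p for w, p in dist.items() if w in set(sent)}
                pred = max(dist, key=dist.get)   # the LM's single highest-probability guess
                errors += (pred != gold)
                total += 1
        return errors / total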

The language model I built had a perplexity (ppl1 in srilm) of 236.4, which seemed semi-reasonable, though of course pretty crappy. There was an OOV rate of 2.5% (ignored in the perplexity calculation).

The overall error rate for this model was 75.2%. This means that it was only guessing a quarter of words correct. (Note that this includes the 2.5% errors mandated by OOVs.)

I also tried another version, where all the model had to do was put the words in the right order. In other words, it knows ahead of time the set of words in the sentence and just has to pick between those (roughly 20 per sentence), rather than between the full vocabulary (43k types). [This is maybe semi-reasonable for MT.] The error rate under this setting was 66.8%. Honestly I expected it would be a lot better.

Note that if you always guess ","---the most frequent type in this data---your error rate is 95.3%.

So why was it only moderately helpful (< 10% improvement) to tell the language model what the set of possible words was? Basically because the model was always guessing really high probability unigrams. Below are the top twenty predicted words (with their frequencies) when the model made an error. They're basically all stop words. (This is in the unrestricted setting.)

     1    14722 ,
     2     1393 .
     3     1298 the
     4      512 and
     5      485 in
     6      439 of
     7      270 to
     8      163 ''
     9      157 a
    10      108 is
    11       54 's
    12       52 have
    13       49 said
    14       41 ``
    15       38
    16       38 for
    17       38 are
    18       34 %
    19       33 $
    20       31 be

The same list for the restricted setting is virtually identical, basically because most of these words are available in the average sentence:

     1    10194 ,
     2     5357 .
     3     1274 the
     4      274 of
     5      251 in
     6      232 to
     7      230 and
     8      193 a
     9       51
    10       36 for
    11       28 's
    12       25 ''
    13       24 said
    14       24 is
    15       21 that
    16       21 ``
    17       16 it
    18       15 from
    19       14 be
    20       13 by

Oh well.

Ok, so I don't have a "good" counter proposal to perplexity. Error rate certainly has many issues of its own. You could use IR-like metrics, like recall @ 10, mean average precision, etc., which are all questionable in their own ways.

I would just, in general, like if we could evaluate language models without having to be handcuffed by probabilities.

26 April 2014

An easy way to write less hurtful reviews: don't say "you"

I'll be honest: I've had my feelings hurt by scathing reviews more than a few times. In grad school I remember even crying over a review that I thought was particularly pernicious. My skin has thickened a bit over time, though often in the not-so-helpful manner of dismissing reviews that I don't like as "they didn't get it," which defeats one of the two primary purposes of reviews in the first place (providing feedback; the other: making accept/reject decisions).

The thing that's hard to reconcile is that I really like most of the people in our community, and everyone I meet at least seems really friendly.

When doing mock reviews with grad students, I'll often tell them to keep in mind that there's a good chance that the author is, or later will be, a friend of theirs. It's possible to provide feedback to a friend in such a way that you don't hurt their feelings.

I've recently started doing something else (in addition to the above suggestion). I don't use the words "you" or "the authors" or even "I." The review of a scientific contribution is not about me and it's not about the authors. It's about the method, the experiments and the contribution. I see little reason why you need to mention anything related to the people involved. (One exception: "I" is often useful in hedging, like the previous sentence, which would be more forceful if I just said "There is little reason...") Perhaps we could even integrate this into START...

This is, of course, similar to the pop-psych advice of talking to loved ones about "actions" rather than "the person." For instance: "I hate you for spilling coffee and not cleaning it up" versus "I hate having coffee spilt on the floor." Or something. I'm sure others can come up with better examples.

My current approach is to write my review with this in mind, and then go back and search for all occurrences of my outlawed nouns, and rewrite these sentences. Often in the process of doing this, I become aware that in many of the cases what I've said really does sound like an attack, and with the very small edit this effect is removed or at least greatly reduced.

I realize I've now just given a pretty good signal for people reading reviews to see if they were written by me or not. Here's a solution: everyone should adopt this policy and then my reviews will no longer be so obvious.

But overall, I really think we should be nice to each other. Perhaps fewer people will depart from the field if they're not constantly battered down by harsh reviews, and then we'll all be better off.

14 April 2014

Waaaah! EMNLP six months late :)

Okay, so I've had this file called emnlp.txt sitting in my home directory since Oct 24 (last modification), and since I want to delete it, I figured I'd post it here first. I know this is super belated, but oh well, if anyone actually reads this blog any more, you're the first to know how I felt 6 months ago. I wonder if I would make the same calls today... :)

If you remember anything about EMNLP anymore and have your own opinions, please feel free to comment. It will also let me know if anyone reads here anymore :).

Happy Spring from DC!

16 September 2013

Active learning for positive examples

I have a colleague who wants to look through large amounts of (text) data for examples of a pretty rare phenomenon (maybe 1% positive class, at most). We have about 20 labeled positive examples and 20 labeled negative examples. The natural thing to do at this point is some sort of active learning.

But here's the thing. We have no need for a classifier. And we don't even care about being good at finding negative examples. All we care about is finding as many positive examples from a fixed corpus as possible.

That is to say: this is really a find-a-needle-in-a-haystack problem. The best discussion I've found of this is Section 4 of Inactive Learning, by Attenberg and Provost. I came across a couple other papers that basically do some sort of balancing to deal with the imbalanced data problem in active learning, but the Attenberg and Provost results suggest that even this is tricky to get right.

But I think this only partially addresses the problem. Here's why.

Let's say that my colleague (call her Alice) is willing to spend one hour of her life looking at examples. Alice estimates that she can look at about 500 examples in that time period (small caveat: some examples are larger than others and therefore slower, but let's ignore this for now). Alice's goal is to find as many positive examples in that one hour as possible. Here are some possible strategies Alice could deploy:

  1. Look at 500 randomly selected examples, getting 5 (more) positive examples in expectation (likely somewhere between 2 and 12, with 95% confidence). This gives her a total of 25 positive examples
  2. Train a classifier on her 40 labeled examples, have it rank the entire set (minus those 40). Then look at the top 500 examples. If the learned model has a recall of r% in the top 500, then she should expect to get 5*r more examples, giving her a total of 20+5r examples. (Caveat: this is a bit dangerous, because with very few training examples, results can often be anti-correlated with the desired output and you get better performance by flipping the predicted label. There was a paper on this maybe 5-10 years ago but I can't find it... Maybe someone knows what I'm referring to.)
  3. Train a classifier on the 40 labeled examples, spend 30 minutes looking at its top-ranked 250 examples (getting 2.5r more positive examples), then train another classifier on everything labeled so far, and spend the other 30 minutes looking at the new classifier's top-ranked 250 examples. If the second classifier has a recall of r'% in its top 250, she should expect to get another 2.5*r', for a total of 20+2.5r+2.5r' = 20 + 2.5(r+r'); as long as r' >= r (which it should be, since the second classifier has more training data), this is at least as good as strategy 2.
  4. Taking this to the extreme, Alice could annotate one example at a time (subject to constraints on either the learning being very fast or Alice not actually being a human). As long as the recall-at-one of the classifiers learned is non-decreasing, this should (I think) be the optimal strategy.
So it seems that in a setting like this, what the classifier should optimize for is simply recall-at-one. In other words, it doesn't care about anything except that its top-ranked output is correct.
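
To make the one-at-a-time version (strategy 4 above) concrete, here's a toy simulation on synthetic data; the data, the features and the scikit-learn logistic regression are all stand-ins I'm making up here, not the real corpus or classifier:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Synthetic stand-in for the corpus: ~1% positive class.
    X = rng.normal(size=(20000, 20))
    y = (X[:, 0] + 0.5 * rng.normal(size=20000) > 2.6).astype(int)

    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    labeled = list(rng.choice(pos, 20, replace=False)) + \
              list(rng.choice(neg, 20, replace=False))     # the 20+20 seed labels
    seed = set(labeled)
    pool = [i for i in range(len(y)) if i not in seed]

    budget, found = 500, 0
    for _ in range(budget):
        clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
        scores = clf.predict_proba(X[pool])[:, 1]
        pick = pool.pop(int(np.argmax(scores)))    # query only the top-ranked example
        found += int(y[pick])                      # Alice labels it; count the positives
        labeled.append(pick)

    print(f"positives found in {budget} queries: {found} (random would give about 5)")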

Okay, so why is this difficult? Basically because we don't have that much data. If the corpus that represents the haystack is huge, then we want to ensure that the top-one example from that haystack is a positive example. So even if all we're trying to do is select between two different hypotheses (say our hypothesis class at each time step has cardinality two), in principle we should use as much data as possible to evaluate these two hypotheses. In particular, looking at the recall-at-one on the subset of 40 labeled examples that we have is probably not representative, essentially because this loss function doesn't decompose nicely.

So what could we do instead? Well, we could pool the labeled and unlabeled examples and predict on those instead. Certainly if one of the positive labeled examples outranked everything else, that would be a good thing. But if truly 1% of the unlabeled dataset is positive, then this is not a necessary condition. In particular, the top-ranked positive labeled example could be as far as 1% down the list and we could still (in principle) be doing a good job.

Ok I'll admit: I really don't know how to do this. Maybe it's enough in most cases to optimize for zero-one loss (or a weighted version thereof) and just cross your fingers that the recall-at-one is good. But it does feel like there should be a more direct way to go about this problem.

08 July 2013

The *SEM 2013 Panel on Language Understanding (aka semantics)

One of the highlights for me at NAACL was the *SEM panel on "Toward Deep NLU", which had the following speakers: Kevin Knight (USC/ISI), Chris Manning (Stanford), Martha Palmer (CU Boulder), Owen Rambow (Columbia) and Dan Roth (UIUC). I want to give a bit of an overview of the panel, interspersed with some opinion. I gratefully acknowledge my wonderful colleague Bonnie Dorr for taking great notes (basically a transcript) and sharing them with me to help my failing memory. For what it's worth, this basically seemed like the "here's what I'm doing for DEFT" panel :).

Here's the basic gist that I got from each of the panel members, who gave roughly 10 minute talks:

Dan Roth: doing role labeling restricted to verbs is not enough. As an easy example, "John, a fast-rising politician, slept on the train to Chicago"... by normal SRL we get that John is sleeping, but not the possibly more important fact that John is a politician. Another example is prepositions: "University of Illinois" versus "State of Illinois" -- "of" is ambiguous. They came up with a taxonomy of 32 relations and labeled data and then did some learning -- see the TACL paper that was presented at NAACL, Srikumar & Roth.

Commentary: the ambiguity of prepositions issue is cool and I really liked the TACL paper. It reminds me of learning Latin in high school and being confused that ablative case markers were ambiguous across from/by/with. It astounded me that that was an acceptable ambiguity, but of course English has equally crazy ones that I've just gotten used to. But it does make me think that some cross-linguistic study/model might be cool here. Even more broadly, it made me think about noun-noun compound semantics: "farmers market" (market put on by farmers) versus "fruit market" (market where you buy fruit) versus "fruit pie" (pie made out of fruit). I went back and read Lucy Vanderwende's dissertation, which dealt exactly with these issues. She had far fewer relations than Srikumar and Roth, though perhaps once you allow explicit prepositions the range of things you can express grows (though somehow my gut feeling is that it doesn't, at least in English).

Kevin Knight: Basically talked about their deep semantic approach to MT: see the abstract meaning representation web page for more. The idea is that people who work on syntax don't Balkanize into those who do PPs, those who do VPs, etc., so why should semantics break apart like it does. AMR is a very GOFAI-style representation for language, and they've annotated a Chinese-English bilingual copy of Le Petit Prince with this representation. Now they need analyzers (hard), generators (hard) and transformation formalisms (hard). The nice thing is that this one representation captures almost all relevant semantic issues: scoping, argument structure, coreference, etc. For instance, co-ref is not explicitly annotated: it's just that a single agent can participate in multiple predicates. (Note: not yet across sentences.)

Commentary: It's hard not to get excited about this stuff, especially when Kevin talks about it. His enthusiasm is infectious. I left the talk thinking "wow I want to work on that!" There's of course the worry that we've tried this before and failed and that's why things in semantics Balkanized, but maybe the time is right to revisit it. For instance, Bonnie herself (note: she didn't tell me this; it had come up in recent discussions with Philip Resnik and our postdoc Junhui Li) had a meaning representation very similar to AMR called Lexical Conceptual Structures (LCS), and Nizar Habash had a hybrid rule-based/statistical approach to translating there. The idea was that if you want to handle divergent translations (classic example: "the bottle floated across the river" (English) versus "the bottle crossed the river floatingly" (Spanish, I think)), you need a representation that abstracts means from predicate. But it's still very cool. (Actually in digging up refs, I just found this paper on mapping from LCS to AMR... from AMTA 1998!)

Martha Palmer: focused mostly on event relations that go across sentences, which include things like event coreference, bridging relations (enablement, result) and so on. They're also looking seriously at type (evidential, aspectual, etc.), modality (actual, hypothetical, generic, hedged, etc.), polarity and aspect. They are currently doing a lot of work in the clinical domain, in which these distinctions are really important if you want to understand, say, patient medical histories.

Commentary: this is a bit outside things I usually think about, so I have less to say. I really like the hyper-sentence view, of course.

Owen Rambow: talked about some of my favorite work that I've seen recently: basically work on propositional attitudes. The view Owen put forth is that most of NLP is focused on a world of facts, and the goal of NLU is to figure out what these facts are. They are taking a much more social model of text meaning, in which you really care about inferring participants' cognitive states (standard triumvirate: belief, desire and intention). This actually shows up in at least one English-German translation example, essentially in which Google translate misses a very important subjunctive.

Commentary: I really liked the original work Owen did on BDI inference and I'm thrilled it's going further. I think one of the historical reasons why I find this so interesting is that propositional attitudes is basically what I started doing when I started grad school, when looking at discourse analysis through RST. I think many people forget this, but the discourse relationships in RST (and other discourse theories) are really based on attitude. For instance, X is in a background relation to Y if (roughly) the listener already believes X and the listener also believes that X increases the chance of Y. (Or something like that: I just made that up :P.) But it's all about belief of listeners and utterers.

Chris Manning: focused on deep learning, basically asserting (in a manner designed to be a bit controversial) that Stanford dependencies are their meaning representation and that the big problems aren't in representations. Sure, Stanford dependencies miss out on a lot (quantification, tense, semantic roles, modality, etc.) but he felt that there are more important problems to address. And then what we need instead is "soft" meaning representations, like vector space models and distributed representations give us. Giving rise to something akin to Natural Logic.

Commentary: to a large degree I agree with the notion that the "big problems" in language are probably not those that (eg) semanticists like to look at, at least from the typical view of NLE in which we want systems that do well on average across a distribution of examples that we've cultivated. But I also worry that there's a bit of magical thinking here, in the sense that it kind of feels like a cop-out: it's too hard to define categories by hand so let's let the machine figure it out. Now, don't get me wrong, I'm all for machines figuring out stuff (I gave a not-very-well-received talk to that degree at a workshop a couple years ago on linguistics in NLP), but I'm also a bit reticent to believe that this is really going to bring us any closer to really solving the NLU problem (whatever that is), though of course it will get us another 5-10% in standard benchmarks. (Ok this sounds way too negative: I actually really liked Chris' talk, and one of the things I liked about it was that it challenged my thinking. And I agree that there is a lot that we shouldn't be designing by hand -- some people, like Yoshua Bengio, would probably argue that we shouldn't be designing anything by hand, or at least that we shouldn't have to -- but I guess I still belong to the camp of "linguists give the structure, statistics gives the parameters.")

There was also a lot of really interesting discussion after the presentations, some of which I'll highlight below:

Lucy Vanderwende, I think mostly directed at Kevin, fell into the "we tried this X years ago" camp, basically said that whenever they tried to abstract more and more from the input representation, you ended up getting very boring sentences generated because you'd thrown out all the "nuance" (my word, not hers). The discussion afterward basically revolved about whether you annotate input sentences with meaning (which is currently the standard) or throw them out with the bathwater. Owen points out that the meaning of a passive sentence is not +passive but something much more nuanced, and if you could capture that correctly, then (in principle) generators could reflect it properly in the target language. (Me: For instance, maybe in some wacky language a passive sentence actually means that you're trying to emphasize the subject.)

There was also a lot of discussion around Chris, I think partially because he went last and partially because he was trying to be controversial. Mausam made an argument (akin to what I wrote above) that logicians have made a billion logics of language and nothing really has worked (in a sense it's been a series of negative results). What about inference rules or consistency?

Okay, that's all I want to write for now. Congrats if you made it this far. And thanks to the *SEM organizers for putting together this great panel!

17 June 2013

My NAACL 2013 list...

I feel a bit odd doing my "what I liked at NAACL 2013" as one of the program chairs, but not odd enough to skip what seems to be the most popular type of post :). First, though, since Katrin Kirchhoff (my co-chair) and I never got a chance to formally thank Lucy Vanderwende (the general chair) and give her flowers (or wine or...) let me take this opportunity to say that Lucy was an amazing general chair and that working with her made even the least pleasant parts of PCing fun. So: thanks Lucy -- I can't imagine having someone better to have worked with! And all of the rest of you: if you see Lucy and if you enjoyed NAACL, please thank her!

I also wanted to really thank Matt Post for doing the NAACL app -- lots of people really liked it and I hope we do it in the future. I'm at ICML now and constantly wishing there were an ICML app :).

Okay, with that preface, let's get down to what you came for. Below is my (complete) list of favorite papers from NAACL 2013 (also indexed on Braque) in no particular order:

  • Relation Extraction with Matrix Factorization and Universal Schemas (N13-1008 by Sebastian Riedel; Limin Yao; Andrew McCallum; Benjamin M. Marlin)
    Very cool paper. The idea is to try to jointly infer relations (think OpenIE-style) across text and databases, by writing everything down in a matrix and doing matrix completion. In particular, make the rows of this matrix equal to pairs of entities (Hal,UMD and UMD,DC-area) and the columns relations like "is-professor-at" and "is-located-in." These entity pairs and relations come both from free text and databases like FreeBase. Fill in the known entities and then think of it as a recommender system. They get great results with a remarkably straightforward approach. Reminds me a lot of my colleague Lise Getoor's work on multi-relational learning using tensor decompositions.
  • Combining multiple information types in Bayesian word segmentation (N13-1012 by Gabriel Doyle; Roger Levy)
    I guess this qualifies as an "obvious in retrospect" idea -- and please recognize that I see that as a very positive quality! The basic idea is that stress patterns (eg trochees versus iambs) are very useful for kids (who apparently can recognize such things at 4 days old!) and are also very useful for word segmentation algorithms.
  • Learning a Part-of-Speech Tagger from Two Hours of Annotation (N13-1014 by Dan Garrette; Jason Baldridge)
    Probably my overall favorite paper of the conference, and the title says everything. Also probably one of the best presentations I saw at the conference -- I can't even begin to guess how long Dan spent on his slides! I loved the question from Salim in the Q/A session, too: "Why did you stop at two hours?" (They have an ACL paper coming up, apparently, that answers this.) You should just read this paper.

  • Automatic Generation of English Respellings (N13-1072 by Bradley Hauer; Grzegorz Kondrak)
    This paper was the recipient of the best student paper award and, I thought, really great. It's basically about how English (in particular) has funny orthography and sometimes it's useful to map spellings to their pro-nun-see-ey-shuns, which most people find more useful than IPA. It's a bit more of a bunch of stuff glued together than I usually go for in papers, but the ideas are solid and it seems to work pretty well -- and I'd never even thought this would be something interesting to look at, but it makes complete sense. Best part of presentation was when Greg tripped up pronouncing some example words :).

  • Linguistic Regularities in Continuous Space Word Representations (N13-1090 by Tomas Mikolov; Wen-tau Yih; Geoffrey Zweig)
    This is a paper that makes my list because it made me think. The basic idea is that if you do some representation learning thingamajig and then do vector space algebra like repr("King") - repr("man") + repr("woman") you end up with something that's similar to repr("Queen"). It's a really interesting observation, but I'm at a loss for why we would actually expect something like this to happen!
  • PPDB: The Paraphrase Database (N13-1092 by Juri Ganitkevitch; Benjamin Van Durme; Chris Callison-Burch)
    This is a paper about a dataset release that I think I'll find useful and I bet other people will too. Go download it and play with it. I'd encourage the authors (are you listening, Juri!) to make a web demo (or web service) so that I don't need to go through the pain of getting it all set up to see if it might be useful for me.
  • Supervised Learning of Complete Morphological Paradigms (N13-1138 by Greg Durrett; John DeNero)
    Basic idea: collect morphological paradigms from Wiktionary and then train a supervised system to generalize from those to novel words. Works remarkably well and the model is well thought out. Plus I like papers that take morphology seriously: I wish we saw more stuff like this in NAACL.
And though I don't often do this, I have to mention the following paper because although I didn't see the talk or read the paper, enough independent people pointed it out to me as being great that I figured I'd mention it:
  • Improved Reordering for Phrase-Based Translation using Sparse Features (N13-1003 by Colin Cherry)
Anyone else with favorites should comment!

30 April 2013

What is a sparse difference in probability distributions?

Sparsity has been all the rage for a couple of years now. The standard notion of a "sparse" vector u is that the number of non-zeros in u is small. This is simply the l_0 norm of u, ||u||_0. This norm is well studied, known to be non-convex, and often relaxed to the l_1 norm of u, ||u||_1: the sum of absolute values. (Which has the nice property of being the "tightest" convex approximation to l_0.)

In some circumstances, it might not be that most of u is zero, but simply that most of u is some fixed scalar constant a. The "non-constant" norm of u would be something like "the number of components that are not equal to some a" or, in more mathy terms, "min_a ||u-a||_0". I first came across this in the Differentiable sparse coding paper by Bradley and Bagnell (NIPS 2008), where they claim physicists call it "Maximum entropy on the mean" if you're using a KL distance, but I don't know if there's a name for it in the l_0 sense, so I'll call it "l_0 from the mode", which seems to capture the right notion.
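
In code this is a one-liner, since the best a is just the most common value in u (a quick sketch; the name l0_from_mode is mine):

    from collections import Counter

    def l0_from_mode(u):
        """min over scalars a of ||u - a||_0: count how many entries of u
        deviate from its most common value."""
        return len(u) - Counter(u).most_common(1)[0][1]

    print(l0_from_mode([3, 3, 3, 7, 3, 0]))   # -> 2 (only the 7 and the 0 deviate from 3)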

In many cases, we want to use a sparse norm to define a notion of a sparse distance function between two vectors u and v. For instance d_0(u,v) = ||u-v||_0 would be a reasonable notion of a sparse difference between two vectors.

Now, let's suppose that u and v are probability distributions over a finite event space, so that u_i is the probability of event i. In this case, both u and v belong to the simplex: u_i >= 0 for all i, and sum_i u_i = 1 (which is equivalent to saying ||u||_1 = 1, given the first condition).

It seems natural to ask ourselves whether there is a reasonable notion of sparse difference between two probability distributions. This is something I've thought about, but not come up with a satisfactory answer. The "natural" thing would be to define d_0(p,q) = ||p-q||_0, where I've switched from u/v to p/q to emphasize that these are distributions. This is certainly reasonable in some sense, but does not correspond to how I usually think of probability distributions, essentially because it ignores the "normalization" effect.

As a simple example, let's suppose that I generate a data set by rolling a dice with sides A-F. Suppose it comes up A five times, B twice, C once and F twice. Then my maximum likelihood estimate for p would be [.5, .2, .1, 0, 0, .2]. If I had a different dataset, in which exactly the same thing happened except I got one roll of D in addition to all the other rolls, I would estimate the distribution here as q = [.45, .18, .09, .09, 0, .18], which differs everywhere except the "E" position. The d_0 distance between these two distributions would be 5, which intuitively seems "wrong."

One possible way around this would be to allow us to rescale the distributions before computing the l_0 distance, something like d(p,q) = min_(a,b > 0) || ap - bq ||_0. When talking about l_0 as the underlying norm, this can be simplified to min_(0 < a < 1) || ap - (1-a)q ||_0. Note that the inequalities on "a" are necessary. With equalities, you could "erase" all of p (or q) by setting a to zero (or 1), which is definitely not desirable. Fortunately, the optimal "a" can be computed in linear time in the dimensionality: it's just the value that maximizes the number of i for which ap_i = (1-a)q_i. This definition captures the example above and gives a distance of "1" (corresponding to a = mode(q / (p+q)) = 10/21 = 0.4762).
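
Here's a quick sketch of that computation (Python; the function name and the rounding tolerance are mine). For each component with p_i + q_i > 0, the unique a that zeroes it is q_i / (p_i + q_i), so the optimal a is just the most common of those values; on the dice example this reproduces the distance of 1 with a = 10/21:

    import numpy as np
    from collections import Counter

    def sparse_rescaled_distance(p, q, tol=1e-9):
        """min over 0 < a < 1 of || a*p - (1-a)*q ||_0."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        both_zero = (p < tol) & (q < tol)              # these components match for any a
        cand = q[~both_zero] / (p[~both_zero] + q[~both_zero])
        n = len(cand)
        valid = cand[(cand > tol) & (cand < 1 - tol)]  # a must stay strictly inside (0, 1)
        best_a, matches = Counter(np.round(valid, 9)).most_common(1)[0]
        return n - matches, float(best_a)

    p = [.5, .2, .1, 0, 0, .2]
    q = [5/11, 2/11, 1/11, 1/11, 0, 2/11]
    print(sparse_rescaled_distance(p, q))   # -> (1, 0.476190476), i.e. a = 10/21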

We can attempt to make this definition more formal by invoking the notion of exponential families. Recall that an expfam distribution has the form log p(x;t) = t*f(x) - A(t) + B(x), where t are the natural parameters, f is the vector of sufficient statistics, * denotes dot product, and A() is the log partition function. A quick observation is that depending on how f looks, there might be lots of t that give rise to the same probability distribution. If we guarantee that the fs are always linearly independent, then this representation is called "minimal"; otherwise it is "overcomplete." (Formally, it is overcomplete if there exists a non-zero vector a for which a*f(x) is constant for all x.)

For a categorical distribution, an overcomplete representation would be a six dimensional indicator of which side of the die came up, and the natural parameters would be the (log of the) ps and qs shown above (it's overcomplete because one component is completely determined by the other 5). For example, the example p above would have exponential form with t=log[.5/.8, .2/.8, .1/.8, 0, 0] and log partition function A(t) = log 1.8. (See wikipedia's variant 3 for categorical distributions.) and q would have t=log[.45/.82, .18/.82, .09/.82, .09/.82, 0] with A(t) = log 1.82.

Supposing that we're working with a minimal representation, what I think this amounts to is that we want the difference in natural parameters to be sparse in the "l_0 from the mode" sense -- the difference is equal to a constant. In the above example, the difference in natural parameters is equal to 0.1301 in the first three components, -Infinity in the fourth, and something like zero in the last (who knows -- NaN?), which gives a distance of one (if you discard the last one).

The problem here is that there is not a unique minimal representation; I could also have chosen t=log[.2/.5, .1/.5, 0, 0, .2/.5] and A=log 1.5 for p and t=log[.18/.55 .09/.55 .09/.55 0/.55 .18/.55] and A=log 1.55 for q. In this case, the mode of the difference is 0.2007 (in the first, second and last components); there's one -Infinity and one NaN. So the answer works out the same, at least if you know how to deal with the NaNs.

There's an obviously needed lemma here: that any minimal representation will give you the same distance. I haven't thought nearly long enough about why this should be true (i.e., I've thought about it for about 30 seconds :P). There's also a lot of other questions: what if you relax this notion of sparsity to something more l_1-like? What other properties does it have? And of course is there a better notion of sparse difference of probability distributions for this or other circumstances?

27 December 2012

Teaching (intro, grad) NLP

I had a post a while back on teaching ML that I still basically stand by.  I've taught intro grad NLP here at UMD twice now, and a sort-of-similar-course back at Utah once.  I find these courses really hard to teach.  And not for the usually bemoaned reason of the CS/linguistics mix -- I think it's possible to deal with that, and certainly it's an issue that's been talked about a lot.

What I find difficult is that NLP (and CL) is a collection of problems, techniques, ideas, frameworks, etc. that really are not tied together in any reasonable way other than the fact that they have to do with NLP.  Even if you manage to answer questions about "what sort of topics are most interesting?" you're still faced with this problem that every time you switch topics, the entire context in which you're discussing them changes.  This is exacerbated by the problem that things like tagging and parsing are hopelessly boring (in comparison to all the cool interesting stuff in NLP these days), but yet so many modern ideas are based on understanding basic dynamic programming for tree structures and things like that.

To make things a bit more concrete, a standard intro NLP class might start with morphology.  Okay, so you have to explain what morphemes are and why they're important.  Now, you probably will take a finite state approach, so you have to explain transducers.  If you want these things to work, you have to explain weighted transducers.  Do you do probabilities, in which case there's the whole local vs global normalization stuff that takes more time?  So now you want to do POS tagging or something.  Fine, you can do that with finite state models too.  But no one actually does this any more (except lingpipe :P).  So you have to explain POS stuff, perhaps how this works in non-English, and then you can leave them with HMMs (maybe talking about the Viterbi algorithm) or do lots of ML so you can get to CRFs or structured perceptron or something.  And we're still at POS tagging.  Now you switch to parsing.  Back to square one.  And then you want to do compositional semantics, now there's lots more structure, lots more features and so on.  But even now things are at least somewhat connected.  But then you talk about lexical semantics: be it distributed representations or WSD or whatever, but the problem is new, the techniques are new (do you teach Yarowsky?), the evaluation is new and so on.

I think it's worth contrasting this with ML.  I find ML remarkably easy to teach (so I'm flipping the classroom this coming Spring for the UG version to make it more exciting) despite the fact that the material is (in many ways) much harder for CS types.  The thing that is nice about ML is that the problem is basically always the same (or at least changes only once, when you switch from supervised to unsupervised).  In that sense, ML tends to be a course about techniques for a relatively fixed problem (or at least fixed problem type).  This makes for significantly less context switching, which makes learning easier (and thereby makes teaching easier).

So the question I wanted to ask is: can we do something similar in NLP?  The crazy idea that I'm sure everyone will say is insane is the following: teach NLP as a course about what you can do with log-linear models.  Here's how I envision it.  You spend the first day talking about NLP and why data is important, ambiguity, etc, just like normal.  You spend the next two days explaining enough about log linear models that you can treat them as given for the rest of the semester.  Maybe you tell them how to optimize these models by gradient descent or something: basically enough that anyone who is simultaneously taking ML will get more out of it, while those who are not are fine with LL models as a black box.

Now, when you teach different topics, the framework in which you discuss them is the same.  You have a structured problem (which forces you to talk about algorithms like Viterbi or CKY) with interesting ambiguities (which forces you to talk about features).  Then, the class essentially becomes a sequence of problems, associated algorithms and relevant features.  The rest is left as a black box, which can be provided off the shelf for programming projects, and they can focus on the interesting and more NLP-ish problems of algorithms and features.  You could even start with something like sentiment classification (at a document level) to make the beginning gentle.
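
To make the "black box" idea concrete, here is a tiny sketch of my own (purely illustrative; the names and data are made up) of what such a log-linear module might look like for document-level sentiment: students would only ever touch the feature function, while the scoring and gradient-based training stay fixed from problem to problem.

    import numpy as np
    from collections import defaultdict

    LABELS = ["pos", "neg"]

    def features(doc, label):
        """Bag-of-words features, conjoined with the candidate label."""
        f = defaultdict(float)
        for word in doc.lower().split():
            f[(label, word)] += 1.0
        return f

    def label_probs(w, doc):
        """p(y | doc) proportional to exp(w . f(doc, y))."""
        s = np.array([sum(w.get(k, 0.0) * v for k, v in features(doc, y).items())
                      for y in LABELS])
        s -= s.max()                        # numerical stability
        p = np.exp(s)
        return p / p.sum()

    def train(data, iters=50, lr=0.5):
        """Plain stochastic gradient ascent on the conditional log-likelihood."""
        w = {}
        for _ in range(iters):
            for doc, gold in data:
                p = label_probs(w, doc)
                for yi, y in enumerate(LABELS):
                    # gradient: observed features minus expected features
                    scale = (1.0 if y == gold else 0.0) - p[yi]
                    for k, v in features(doc, y).items():
                        w[k] = w.get(k, 0.0) + lr * scale * v
        return w

    # two made-up training documents, just to show the moving parts
    data = [("great fun great movie", "pos"), ("terrible boring movie", "neg")]
    w = train(data)
    print(LABELS[int(np.argmax(label_probs(w, "boring and terrible")))])  # -> neg

Swapping in a structured problem then amounts to swapping the feature function and replacing the argmax over LABELS with Viterbi or CKY.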

I realize there are some things you couldn't do this way, or would be very awkward to do this way.  Anything generative or unsupervised, which often go together.  For instance, word alignment via the IBM models won't fit.  Topic models won't fit (though I don't usually do them anyway -- maybe I should).  Probably there are some other things too.

Anyway, I'd be curious to hear what people think of this idea.  I know it's biased by my own view of the world, but hey -- that's why I'm a professor (or at least why I assist professors...).  Or if anyone has tried it.

09 December 2012

NIPS stuff...

NIPS is over as of last night.  Overall I thought the program was strong (though I think someone, somewhere, is trying to convince me I need to do deep learning -- or at least that was the topic du jour... or I guess de l'an? this time).  I wasn't as thrilled with the venue (details at the end) but that's life.  Here were some of the highlights for me, of course excluding our own papers :P (see the full paper list here)... note that there will eventually be videos for everything!

  • User-Friendly Tools for Studying Random Matrices
    Joel A Tropp

    This tutorial was awesome.  Joel has apparently given it several times and so it's really well fine-tuned.  The basic result is that if you love your Chernoff bounds and Bernstein inequalities for (sums of) scalars, you can get almost exactly the same results for (sums of) matrices.  Really great talk.  If I ever end up summing random matrices, I'm sure I'll use this stuff!
  • Emergence of Object-Selective Features in Unsupervised Feature Learning
    Adam Coates, Andrej Karpathy, Andrew Y. Ng
    They show that using only unlabeled data that is very heterogeneous, some simple approaches can pull out faces.  I imagine that some of what is going on is that faces are fairly consistent in appearance whereas "other stuff" often is not.  (Though I'm sure my face-recognition colleagues would argue with my "fairly consistent" claim.)
  • Scalable nonconvex inexact proximal splitting
    Suvrit Sra
    I just have to give props to anyone who studies nonconvex optimization.  I need to read this -- I only had a glance at the poster -- but I definitely think it's worth a look.
  • A Bayesian Approach for Policy Learning from Trajectory Preference Queries
    Aaron Wilson, Alan Fern, Prasad Tadepalli
    The problem solved here is imitation learning where your interaction with an expert is showing them two trajectories (that begin at the same state) and asking them which is better.  Something I've been thinking about recently -- very happy to see it work!
  • FastEx: Hash Clustering with Exponential Families
    Amr Ahmed, Sujith Ravi, Shravan M. Narayanamurthy, Alexander J. Smola
    The idea here is to replace the dot product between the parameters and sufficient statistics of an exp fam model with an approximate dot product achieved using locality sensitive hashing.  It takes a bit to figure out exactly how to do this (a rough sketch of the general flavor appears after this list).  Cool idea and nice speedups.
  • Identifiability and Unmixing of Latent Parse Trees
    Daniel Hsu, Sham M. Kakade, Percy Liang
    Short version: spectral learning for unsupervised parsing; the challenge is to get around the fact that different sentences have different structures, and "unmixing" is the method they propose to do this.  Also some identifiability results.
  • Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs
    Shay B. Cohen and Michael Collins
    Another spectral learning paper, this time for doing exact latent variable learning for latent-variable PCFGs.  Fast, and just slightly less good than EM.
  • Multiple Choice Learning: Learning to Produce Multiple Structured Outputs
    Abner Guzman-Rivera, Dhruv Batra, Pushmeet Kohli
    Often we want our models to produce k-best outputs, but for some reason we only train them to produce one-best outputs and then just cross our fingers.  This paper shows that you can train directly to produce a good set of outputs (not necessarily diverse: just that it should contain the truth) and do better.  It's not convex, but the standard training is a good initializer.
  • [EDIT Dec 9, 11:12p PST -- FORGOT ONE!]
    Query Complexity of Derivative-Free Optimization
    Kevin G. Jamieson, Robert D. Nowak, Benjamin Recht
    This paper considers derivative free optimization with two types of oracles.  In one you can compute f(x) for any x with some noise (you're optimizing over x).  In the other, you can only ask whether f(x)>f(y) for two points x and y (again with noise).  It seems that the first is more powerful, but the result of this paper is that you get the same rates with the second!
  • I didn't see it, but Satoru Fujishige's talk Submodularity and Discrete Convexity in the Discrete Machine Learning workshop was supposedly wonderful.  I can't wait for the video.
  • Similarly, I heard that Bill Dolan's talk on Modeling Multilingual Grounded Language in the xLiTe workshop was very good.
  • Ryan Adams's talk on Building Probabilistic Models Around Deterministic Optimization Procedures in the "Perturbations, Optimization and Statistics" workshop (yeah, I couldn't figure out what the heck that meant either) was also very good.  The Perturb-and-MAP stuff and the Randomized Optimum models are high on my reading list, but I haven't gotten to them quite yet.
  • As always, Ryan McDonald and Ivan Titov gave very good talks in xLiTe, on Advances in Cross-Lingual Syntactic Transfer and Inducing Cross-Lingual Semantic Representations of Words, Phrases, Sentences and Events, respectively.
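
Since I mentioned the approximate dot product idea in the FastEx entry above, here is a rough sketch of the random-projection flavor of that trick. This is emphatically not the paper's actual hashing construction, just the basic "trade exactness for a cheaper, fixed-size representation" idea it builds on; all numbers and names below are made up.

    import numpy as np

    # If R has iid zero-mean, unit-variance entries, then
    # E[(Rw).(Rx)] / k = w.x, so the projected dot product is an unbiased
    # (if noisy) stand-in for the exact one.
    rng = np.random.default_rng(0)
    d, k = 10_000, 256                        # original and projected dimensions
    R = rng.choice([-1.0, 1.0], size=(k, d))  # random sign matrix

    w = rng.normal(size=d)                    # stand-in "parameters"
    x = w + rng.normal(size=d)                # a correlated stand-in vector

    exact = w @ x
    approx = (R @ w) @ (R @ x) / k            # cheap once Rw and Rx are cached
    print(exact, approx)                      # usually agree to within 10-20% here
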
I'm sure there was lots of other stuff that was great and that I missed because I was skiing, er, working hard on NAACL.

Really my only gripe about NIPS this year was the venue.  I normally wouldn't take the time to say this, but since we'll be enjoying this place for the next few years, I figured I'd state what I saw as the major problems, some of which are fixable.  For those who didn't come, we're in Stateline, NV (on the border between CA and NV) in two casinos.  Since we're in NV, there is a subtle note of old cigarette on the nose fairly constantly.  There is also basically nowhere good to eat (especially if you have dietary restrictions) -- I think there are a half dozen places on yelp with 3.5 stars or greater.  My favorite tweet during NIPS was Jacob Eisenstein who said: "stateline, nevada / there is nothing but starbucks / the saddest haiku".  Those are the "unfixables" that make me think that I'll think twice about going to NIPS next year, but of course I'll go.

The things that I think are fixable... there was nowhere to sit.  Presumably this is because the casino wants you to sit only where they can take your money, but I had most of my long discussions either standing or sitting on the ground.  More chairs in hallways would be much appreciated.  There was almost no power in the rooms, which could be solved by some power strips.  The way the rooms divided for tutorials was really awkward, as the speaker was clear on one side of the room and the screen was on the other (and too high to point to), so it was basically like watching a video of slides online without ever seeing the presenter.  Not sure if that's fixable, but seems plausible.  And the walls between the workshop rooms were so thin that often I could hear another workshop's speaker better than I could hear the speaker in the workshop I was attending.  And the internet in my hotel room was virtually unusably slow (though the NIPS specific internet was great).

26 September 2012

Sure, you can do that....

I'll warn in advance that this is probably one of the more controversial posts I've written, but realize that my goal is really to raise questions, not necessarily give answers.  It's just more fun to write strong rhetoric :).

Let me write down a simple Markov Chain:

  1. Download some data from the web
  2. Call part of that data the input and part of it the label
  3. Train a classifier on bag of words and get 84% accuracy
  4. Submit a paper to *ACL
  5. Go to 1
Such papers exist in the vision community, too, where you replace "bag of words" with "SIFT features" and "*ACL" with "CVPR/ICCV."  In that community (according to my one informant :P), such papers are called "data porn."  Turns out this is actually a term from journalism, in which one definition is "where journalists look for big, attention grabbing numbers or produce visualisations of data that add no value to the story."

There's a related paper that looks at this issue in one specific setting: predicting political outcomes.  On Arxiv back at the end of April, we got this wonderful, and wonderfully titled paper:
"I Wanted to Predict Elections with Twitter and all I got was this Lousy Paper" -- A Balanced Survey on Election Prediction using Twitter Data by Daniel Gayo-Avello
The thing I especially like about this paper is that it's not a complaint (like this blog post!) but rather a thoughtful critique of how one could do this sort of research right.  This includes actually looking at what has been done before (political scientists have been studying this issue for a long time and perhaps we should see what they have to say; what we can do to make our results more reproducible; etc).

For me, personally, this goes back to my main reviewing criteria: "what did I learn from this paper?"  The problem is that in the extreme, cartoon version of a data porn paper (my 1-4 list above), the answer is that I learned that machine learning works pretty well, even when asked to do arbitrary tasks.  Well, actually I already knew that.  So I didn't really learn anything.

Now, of course, many data porn-esque papers aren't actually that bad.  There are many things one can do (and people often do do) that make these results interesting:
  • Picking a real problem -- i.e., one that someone else might actually care about.  There's a temptation (that I suffer from, too) to say "well, people are interested in X, and X' is kind of like X, so let's try to predict X' and call it a day."  For example, in the context of looking at scientific articles, it's a joy in many communities to predict future citation counts because we think that might be indicative of something else.  I've certainly been guilty of this.  But where this work can get interesting is if you're able to say "yes, I can collect data for X' and train a model there, but I'll actually evaluate it in terms of X, which is the thing that is actually interesting."
     
  • Once you pick a real problem, there's an additional challenge: other people (perhaps social scientists, political scientists, humanities researchers, etc.) have probably looked at this in lots of different lights before.  That's great!  Teach me what they've learned!  How, qualitatively, do your results compare to their hypotheses?  If they agree, then great.  If they disagree, then explain to me why this would happen: is there something your model can see that they cannot?  What's going on?
  • On the other hand, once you pick a real problem, there's a huge advantage: other people have looked at this and can help you design your model!  Whether you're doing something straightforward like linear classification/regression (with feature engineering) or something more in vogue, like building some complex Bayesian model, you need information sources (preferably beyond bag of words!) and all this past work can give you insights here.  Teach me how to think about the relationship between the input and the output, not just the fact that one exists.
In some sense, these things are obvious.  And of course I'm not saying that it's not okay to define new problems: that's part of what makes the world fun.  But I think it's prudent to be careful.

One attitude is "eh, such papers will die a natural death after people realize what's going on, they won't garner citations, no harm done."  I don't think this is altogether wrong.  Yes, maybe they push out better papers, but there's always going to be that effect, and it's very hard to evaluate "better."

The thing I'm more worried about is the impression that such work gives of our community to others.  For instance, I'm sure we've all seen papers published in other venues that do NLP-ish things poorly (Joshua Goodman has his famous example in physics, but there's tons more).  The thing I worry about is that we're doing ourselves a disservice as a community by trying to claim that we're doing something interesting in other people's spaces, without trying to understand and acknowledge what they're doing.

NLP obviously has a lot of potential impact on the world, especially in the social and humanities space, but really anywhere that we want to deal with text.  I'd like to see ourselves set up to succeed there, by working on real problems and making actual scientific contributions, in terms of new knowledge gathered and related to what was previously known.

15 September 2012

Somehow I totally missed NIPS workshops!

I don't know how it happened or when it happened, but at some point NIPS workshops were posted and papers are due about a week from now and I completely missed it!  The list of workshops is here:

    http://nips.cc/Conferences/2012/Program/schedule.php?Session=Workshops

Since my job as a blogger is to express my opinion about things you don't want to hear my opinion about, I wish they'd select fewer workshops.  I've always felt that NIPS workshops are significantly better than *ACL workshops because they tend to be workshops and not mini-conferences (where "mini" is a euphemism for non-selective :P).  At NIPS workshops people go, really talk about problems, and it's really the best people and the best work in the area.  And, yes, it's nice to be supportive of lots of areas, but what ends up happening is that people jump between workshops because there are too many that interest them, and then you lose this community feeling.  This is especially troubling when workshops are already competing with skiing :).

Anyway, with that behind me, there are a number that NLP folks might find interesting:

With the deadlines so close I don't imagine anyone's going to be submitting stuff that they just started, but if you have things that already exist, NIPS is fun and it would be fun to see more NLPers there!

13 June 2012

NAACL 2012 Retrospective

Like many people, I spent last week in lovely Montreal (at least lovely for the second half) at NAACL.  Despite the somewhat unfortunate submission date, I thought the program this year was quite good.  Of course I didn't see every talk and haven't read many of the papers yet, but I figured I'd point out what I saw that I liked and other people can comment likewise.

Identifying High-Level Organizational Elements in Argumentative Discourse (Madnani, Heilman, Tetreault, Chodorow).  This is maybe one of the first discourse papers I've seen where I actually believe that they have a chance of solving the problem that they've set out.  Here, the problem is separating the meat (content of an essay) from the shell (the bits of discourse that hold the meat together).  It's a cool problem and their solution seems to work well.  Very nice paper.  (And Nitin's talk was great.)

Risk Training of Approximate CRF-Based NLP Systems (Stoyanov, Eisner).  This paper is basically about training approximate models based on some given loss function.  Reminds me a lot of the Ross et al. CVPR 2011 paper on Message-Passing.  It's a cool idea, and there's software available.  Being me, the thing I wonder the most about is whether you can achieve something similar being completely greedy, and then whether you need to do all this work to get a good decision function.  But that's me -- maybe other people like CRFs :).

MSR SPLAT, a language analysis toolkit (Quirk, Choudhury, Gao, Suzuki, Toutanova, Gamon, Yih, Cherry, Vanderwende).  This is a demo of a system where you send them a sentence and they tell you everything you want to know about it.  Never run your own parser/NER/etc. again.  And, having seen it in action at MSR, it's fast and high quality.

Parsing Time: Learning to Interpret Time Expressions (Angeli, Manning, Jurafsky).  This was a great paper about semantic interpretation via compositional semantics (something sort of like lambda calculus) for time expressions.  I cannot find myself getting super jazzed up about time, but it's a nice constrained problem and their solution is clean.  I'm actually thinking of using something like this (or a subset thereof) as a course project for the intro CL course in the Fall, since I'm always starved for something to do with compositional semantics.

Getting More from Morphology in Multilingual Dependency Parsing (Hohensee, Bender).  If you have morphology, you can do better parsing by modeling things like agreement (really?  people hadn't done this before??).  Caveat: they use gold standard morphology.  But cool idea still.

Unsupervised Translation Sense Clustering (Bansal, DeNero, Lin).  If you want to build a bilingual dictionary from parallel text, you need to cluster translations into senses.  Here's a way to do it.  Nice improvements from using bilingual contexts, which was good to see.

I also feel like I should raise my glass to the organizers of NLP Idol and congrats to Ray for winning with Robert Wilensky's paper "PAM." (If anyone can find an online version, please comment!) Though I would actually encourage everyone to read all three papers if you haven't already.  They all changed how I was thinking about problems.  Here are the others: Burstiness (Church), Context (Akman), Supertagging (Bangalore, Joshi).

18 February 2012

Making sense of Wikipedia categories

Wikipedia's category hierarchy forms a graph. It's definitely cyclic (Category:Ethology belongs to Category:Behavior, which in turn belongs to Category:Ethology).

At any rate, did you know that "Chicago Stags coaches" are a subcategory of "Natural sciences"?  If you don't believe me, go to the Wikipedia entry for the Natural sciences category, and expand the following list of subcategories:

  • Biology
  • Zoology
  • Subfields of zoology
  • Ethology
  • Behavior
  • Human behavior
  • Recreation
  • Games
  • Ball games
  • Basketball
  • Basketball teams
  • Defunct basketball teams
  • Defunct National Basketball Association teams
  • Chicago Stags
  • Chicago Stags coaches
I guess it kind of makes sense.  There are some other fun ones, like "Rhaeto-Romance languages", "American World War I flying aces" and "1911 films". Of course, these are all quite deep in the "hierarchy" (all of those are at depth 15 or higher).

So if you're trying to actually find pages about Natural sciences, maybe it's enough to limit the depth of your breadth-first search down the graph.
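
Here is roughly what I mean, as a toy sketch; the graph fragment is hand-built from the chain above, whereas in practice you'd load the real category links from a Wikipedia dump.

    from collections import deque

    def categories_within_depth(root, subcats, max_depth):
        """Breadth-first search from `root`, stopping after `max_depth` hops.
        `subcats` maps a category to its subcategories.  The visited set
        handles cycles like Ethology <-> Behavior."""
        seen = {root}
        frontier = deque([(root, 0)])
        while frontier:
            cat, depth = frontier.popleft()
            if depth == max_depth:
                continue
            for child in subcats.get(cat, []):
                if child not in seen:
                    seen.add(child)
                    frontier.append((child, depth + 1))
        return seen

    subcats = {
        "Natural sciences": ["Biology"],
        "Biology": ["Zoology"],
        "Zoology": ["Subfields of zoology"],
        "Subfields of zoology": ["Ethology"],
        "Ethology": ["Behavior"],
        "Behavior": ["Ethology", "Human behavior"],   # note the cycle
    }
    print(categories_within_depth("Natural sciences", subcats, max_depth=4))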

This is sort of reasonable: things up to and including depth four are quite sensible, including topics like "Neurochemistry", "Planktology" and "Chemical elements".  There are a few outliers, like "Earth observation satellites of Israel", which you could certainly make a case might not be natural science.

At depth five, things become much more mixed.  On the one hand, you get categories you might like to include, like "Statins", "Hematology", "Lagoons" and "Satellites" (interesting that Satellites is actually deeper than the Israel thing).  But you also get a roughly equal amount of weird things, like "Animals in popular culture" and "Human body positions".  It's still not 50/50, but it's getting murky.

At depth six, based on my quick perusal, it's about 50/50.

And although I haven't tried it, I suspect that if you use a starting point other than Natural sciences, the depth at which things get weird is going to be very different.

So I guess the question is how to deal with this.

One thought is to "hope" that editors of Wikipedia pages will list the categories of pages roughly in order of importance, so that you can assume that the first category listed for a page is "the" category for that page.  This would render the structure a tree.  For the above example, this would cut the list at "Subfields of zoology" because the first listed category for the Ethology category is "Behavioral sciences", not "Subfields of zoology."
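
A sketch of that pruning step, under the assumption (hypothetical on my part) that you have each category's parents in the order editors listed them:

    def prune_to_first_parent(listed_parents):
        """Keep only the first listed parent of each category, turning the
        category graph into a tree (really a forest).  `listed_parents` maps
        a category to its parent categories in listed order; the result is a
        subcategory map in the same shape as `subcats` above, so the
        depth-limited search can be reused on it."""
        children = {}
        for cat, parents in listed_parents.items():
            if parents:
                children.setdefault(parents[0], []).append(cat)
        return children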

Doing this seems to make life somewhat better; you cut out the stags coaches, but you still get the "Chicago Stags draft picks" (at depth 17).  The path, if you care, is (Natural sciences -> Physical sciences -> Physics -> Fundamental physics concepts -> Matter -> Structure -> Difference -> Competition -> Competitions -> Sports competitions -> Sports leagues -> Sports leagues by country -> Sports leagues in the United States -> Basketball leagues in the United States -> National Basketball Association -> National Basketball Association draft picks).  Still doesn't feel like Natural sciences to me.  In fairness, at depth 6, life is much better.  You still get "Heating, ventilating, and air conditioning" but many of the weird entries have gone away.

Another idea is the following.  Despite not being a tree or DAG, there is a root to the Wikipedia hierarchy (called Category:Contents).  For each page/category you can compute its minimum depth from that Contents page.  Now, when you consider subpages of Natural sciences, you can limit yourself to pages whose shortest path goes through Natural sciences.  Basically trying to encode the idea that if the shallowest way to reach Biology is through Natural sciences, it's probably a natural science.
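
As a sketch (again mine, assuming the same hand-built `subcats` structure as above), the test "does some shortest path from Contents to this category pass through Natural sciences" amounts to checking that the category's overall minimum depth equals the topic's depth plus the category's depth below the topic:

    from collections import deque

    def min_depths(root, subcats):
        """Minimum depth of every category reachable from `root`
        (e.g. Category:Contents), by plain breadth-first search."""
        depth = {root: 0}
        frontier = deque([root])
        while frontier:
            cat = frontier.popleft()
            for child in subcats.get(cat, []):
                if child not in depth:
                    depth[child] = depth[cat] + 1
                    frontier.append(child)
        return depth

    def shortest_path_through(topic, root, subcats):
        """Categories below `topic` whose shortest path from `root` can be
        realized by going through `topic`."""
        from_root = min_depths(root, subcats)
        from_topic = min_depths(topic, subcats)
        return {c for c in from_topic
                if from_root.get(c, float("inf")) == from_root[topic] + from_topic[c]}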

This also fails.  For instance, the depth of "Natural sciences" (=5) is the same as the depth of "Natural sciences good articles", so if you start from Natural sciences, you'll actually exclude all the good articles!  Moreover, even if you insist that a shortest path go through Natural sciences, you'll notice that many editors have depth 5, so any page they've edited will be allowed.  Maybe this is a fluke, but "Biology lists" has depth of only 4, which means that anything that can be reached through "Biology lists" would be excluded, something we certainly wouldn't want to do.  There's also the issue that the hierarchy might be much bushier for some high-level topics than others, which makes comparing depths very difficult.

So, that leaves me not really knowing what to do.  Yes, I could compute unigram distributions over the pages in topics and cut when those distributions get too dissimilar, but (a) that's annoying and very computationally expensive, (b) requires you to look at the text of the pages which seems silly, (c) you now just have more hyperparameters to tune.  You could annotate it by hand ("is this a natural science") but that doesn't scale.  You could compute the graph Laplacian and look at flow and use "average path length" rather than shortest paths, but this is a pretty big graph that we're talking about.

Has anyone else tried and succeed at using the Wikipedia category structure?