10 October 2014

Hyperparameter search, Bayesian optimization and related topics

In terms of (importance divided by glamour), hyperparameter (HP) search is probably pretty close to the top. We all hate finding hyperparameters. Default settings are usually good, but you're always left wondering: could I have done better? I like averaged perceptron for this reason (I believe Yoav Goldberg has also expressed this sentiment): no pesky hyperparameters.
But I want to take a much broader perspective on hyperparameters. We typically think of HPs as { regularization constant, learning rate, architecture } (where "architecture" can mean something like neural network structure, choice of kernel, etc.). But I think the notion is a lot broader and can include things like feature engineering, or at least representation modifications. For instance, vw now supports a number of helpful NLP feature templates: suffix and prefix features (via --affix), spelling features (via --spelling), ngram features (--ngrams), quadratic and cubic features, etc. Picking the right incarnation of these things is essentially a hyperparameter search process, very akin (IMO) to things like representation learning.

Once you're willing to accept all these things as HPs (and I think you should), something like "grid search", which works for tuning C and eta in your SVM, just doesn't seem to cut it anymore.

Enter the world of automatic HP tuning. Lots of this work, not surprisingly, comes out of the neural networks community, because HP search is a big deal there. Most of my information here comes via Hugo Larochelle, who has done a lot of great work. Some places to start looking:

  • Spearmint toolkit by Snoek, Larochelle, Swersky, Zemel and Adams (and a JMLR paper)
  • SMAC by Hutter, Hoos, Leyton-Brown, Murphy and Ramage (and a paper)
Most of this work falls under the framework of "Bayesian Optimization." The idea comes from the space of derivative-free optimization, where a common strategy is to fit a response surface. Basically you have a bunch of hyperparameters to tune. For any setting of hyperparameters, you can observe some response. In ML land this is usually something like held-out accuracy. Now, fit a regression function that can map from hyperparameters to response. Do something that looks like active learning to explore this space, with a bias toward finding places with high response (high accuracy). In these examples, the function being fit is a Gaussian Process, which is super useful because it can provide realistic estimates of variance, which are useful in the active learning/experimental design.
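
To make the response-surface idea concrete, here is a minimal python sketch of Bayesian optimization with a Gaussian Process and expected improvement. This is a toy, not Spearmint or SMAC: the single hyperparameter (a log regularization constant) and the eval_heldout function are made-up stand-ins for "train with this setting and report held-out accuracy."

    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor

    def eval_heldout(log_C):
        # made-up response surface: held-out accuracy peaks near log_C = 0
        return float(np.exp(-log_C ** 2) + 0.01 * np.random.randn())

    candidates = np.linspace(-4, 4, 200).reshape(-1, 1)   # search over log(C)
    X = [[-3.0], [3.0]]
    y = [eval_heldout(-3.0), eval_heldout(3.0)]

    for _ in range(10):
        # fit the response surface: hyperparameter -> held-out accuracy
        gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
        mu, sigma = gp.predict(candidates, return_std=True)
        # expected improvement: prefer points predicted to be good AND uncertain
        best = max(y)
        z = (mu - best) / (sigma + 1e-9)
        ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
        x_next = float(candidates[int(np.argmax(ei))][0])
        X.append([x_next])
        y.append(eval_heldout(x_next))

    print("best log(C) found:", X[int(np.argmax(y))][0], "accuracy:", max(y))

The "active learning" flavor lives entirely in the acquisition function: the GP's variance estimates are what let you trade off exploring uncertain regions against exploiting regions that already look good.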

I first learned about this stuff, not in the context of HP optimization, from Ilya Ryzhov, a faculty member in our business school who works on these topics -- I learned that in his world, exploration/exploitation is also called "learning versus earning," which I think is awesome :).

I would love if hyperparameter optimization were a black box, and Spearmint and SMAC are both great steps in this direction. There are a couple of things that I haven't seen (though like I said, I'm not hugely well read in this space) that I would love to see.
  • Learning across multiple datasets, akin to meta-learning. I imagine that if you give me a problem to do HP optimization on, and also give the same problem to an undergraduate, I would be much better. Presumably I've learned from past experience how to do this well.
  • More importantly, taking advantage of some of the structure of learning algorithms (rather than oblivious black box optimization) seems like it could be a big win. For instance, most learning algorithms have some notion of early stopping (# of passes over the data, tolerance for convergence, etc.). You can also of course run on subsets of the data (which is equivalent in many cases). If you assume held-out accuracy doesn't get worse over time (e.g., because you do early stopping) then you can think of this as a type of right-censoring. I.e., you run an experiment part of the way through and you know "ok, if I kept running it might get better, but I know it's at least 85% accurate now." The SMAC folks had a nice paper on Bayesian optimization with censored data, but my (incomplete) understanding is that it doesn't quite capture this (very common, IMO) case. I should be willing to start and stop processes at various points and try to figure out where to invest more computation to get better estimates (there's a sketch of this start/stop idea just after this list). I can presumably even estimate: if I ran another pass, how much better do I think it would get?
  • I also think the focus on "finding the best hyperparameters" is somewhat the wrong problem. We want to find the best parameters, period. Hyperparameters are a nuisance on their way to that end. For instance, related to the above bullet, I could imagine running a few passes with one setting of hyperparameters, and then doing some other work, and then going back and restarting that previous run with a different setting of hyperparameters (assuming the model being learned is such that "warm starting" makes sense, which is almost always the case--except maybe in some neural network settings).
     
  • Parallelization is a big deal. One of the reasons something akin to grid search is so attractive is that it's trivial to submit 20*20*20 jobs to my cluster and just wait for them to finish. Anything that's less friendly than doing this is not worth it. Again, the SMAC folks have worked on this. But I don't think the problem is solved.
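
Returning to the early-stopping bullet above, here is a minimal python sketch of the start/stop idea as a successive-halving-style allocation (one concrete instantiation, not what SMAC does). The partial_run function is a hypothetical stand-in for "run a few more passes, warm-started from where this configuration left off, and report held-out accuracy."

    import random

    def partial_run(config, state, extra_passes):
        # made-up learning curve: accuracy rises toward a config-dependent ceiling
        passes = state.get("passes", 0) + extra_passes
        accuracy = config["ceiling"] * (1 - 0.5 ** passes)
        return accuracy, {"passes": passes}

    configs = [{"ceiling": random.uniform(0.6, 0.9)} for _ in range(16)]
    states = [{} for _ in configs]
    survivors = list(range(len(configs)))
    passes_per_round = 1

    while len(survivors) > 1:
        scores = {}
        for i in survivors:
            scores[i], states[i] = partial_run(configs[i], states[i], passes_per_round)
        # keep the better half; the losers are right-censored: all we ever see
        # is a lower bound on how good they might have become
        survivors = sorted(survivors, key=lambda i: scores[i], reverse=True)
        survivors = survivors[: max(1, len(survivors) // 2)]
        passes_per_round *= 2   # invest more computation in the promising configs

    print("chosen config:", configs[survivors[0]])
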
Beyond these technical issues there's always the obnoxious issue of trust. Somehow I need to believe that I'm not better than these algorithms at tuning hyperparameters. I should be happy to just run them, preferably saying "okay, here are 120 cores, you have four hours -- go to town." And I should believe that it's better than I could do with equivalent time/resources by clever grid search. Or perhaps I should be able to encode my strategies in some way so that it can prove to me that it's better than me.

Overall, I'd love to see more work on this problem, especially work that doesn't focus on neural networks, but still takes advantage of the properties of machine learning algorithms that are not shared by all black-box derivative-free optimization tasks. In the meantime, from what I hear, SMAC and Spearmint are actually quite good. Would love to hear if any NLP people have played around with them!

03 October 2014

Machine learning is the new algorithms

When I was an undergrad, probably my favorite CS class I took was algorithms. I liked it (a) because my background was math so it was the closest match to what I knew and (b) because even though it was "theory," a lot of the stuff we learned was really relevant. Over time, it seemed like the area had distilled worthwhile algorithms from interesting-in-theory-but-you'll-never-actually-use algorithms.

In fact, I think this is a large part of why most undergraduate CS degrees today require a course in algorithms. You have these very nice, clearly defined statements, and very elegant solutions to those statements that in most cases (at the UG level) are known to be optimal.

Fast forward N years.

My claim today---and I'm speaking really as an NLP person, which is how I self-identify---is that machine learning is the new core. Everything that algorithms was to computer science 15 years ago, machine learning is today. That's not to say it won't move in another 10 years, but that's how I see it.

Why?

For the most part, algorithms (especially as taught at the UG level) is the study of one thing: given a perfect input, how do I most efficiently compute the optimal output?

The problem is the "perfect input" part.

All of my experience in the past N years has told me that you never have a perfect input, and that it's far far far more important to be able to synthesize information from a large number of sources and reason about it than it is to find the exact-right-solution to some problem that exists only to Plato.

Even within machine learning you see this effect. Lots of numerical analysis people have worked on good algorithms for getting that last little bit of precision out of optimization algorithms. Does it matter? Nope! Model specification, parameter tuning, features, and data matter infinitely more than that last little bit of precision. (In some fields, for instance, scientific computing, that last little bit of precision may matter. I don't know enough to know one way or the other.)

Let's play a thought game. Say you're an UG CS major. You graduate and get a job in CS (not grad school). Which are you more likely to use: (1) a weighted cost flow algorithm or (2) a perceptron/decision tree?

Clearly I think the answer is (2). And I loved flow algorithms when I was an undergrad and have actually been trying since 2006 to figure out how I can use them for a problem I want to solve. No dice.

I would actually go further. Suppose you have a problem whose inputs are ill-specified (as they always are when dealing with data), and whose structure actually does look like a flow problem. There are two CS students trying to solve this problem. Akiko knows about machine learning but not flows; Bob knows about flows but not machine learning. Bob tries to massage his data by hand into the input to an optimal flow algorithm, and then solves it exactly. Akiko uses machine learning to get good edge weights and hacks together some greedy algorithm for flows, not even knowing it's called a flow. Whose solution works better? I would put almost any amount of money on Akiko.

Full disclosure: those who know about my research in structured prediction will recognize this as a recurring theme in my own research agenda: fancy algorithms always lose to better models.

There's another big difference between N years ago and today: almost every algorithm you could possibly care about (or learn about as an UG) is implemented in a library for any reasonable programming language. That's not to say that it's unimportant to know how things work in order to use them, but I would argue it's much less important in a field like algorithms whose knowledge is comparatively stable, versus a field like machine learning where things are still changing and there is no "one right answer" to the "machine learning problem." In a field that's still a bit of an art rather than a science, understanding how things work under the hood feels a lot more important. Quicksort, heaps, minimum spanning trees, ... these are all here to stay.
Okay, so now I've convinced myself that we should yank algorithms out as an UG requirement and replace it with machine learning.

But wait, I can hear my colleagues yelling, taking algorithms isn't about learning algorithms: it's about learning how to think! But that's also what I think is great about machine learning: the distance between theory and algorithms is actually usually quite small (I try to get this across at various points in CiML, to varying degrees of success). If the only point of an algorithms class (I've heard exactly this argument made about automata theory, for instance) is to teach students how to think, I think we could do much better.

Okay, so I've thrown down the gauntlet. Someone should come smack me with theirs :P!

Edit after some comments:

I think I probably wrote badly and as a result my main point got lost. I'll try to restate it here briefly and then I'll edit the main post.

Main point: I feel like for 15 years, algorithms has been at the heart of most of what computer science does. I feel like that coveted position has now changed to machine learning or, more generically, statistical reasoning. I feel this way because figuring out how to map a real world problem into something an "algorithm" can consume, especially when that needs statistical modeling of various inputs, is (IMO) a more important and harder problem than details about flow algorithms. 



EDIT #2 AFTER DISCUSSION ON SURESH'S BLOG:

let me give a concrete example that may actually be a real world example, but i don't know (though see this paper). that of path finding for taxis or cars. the world is a graph and given directed edge costs we can run dijkstra or whatever to find LEAST-TIME (shortest) paths. this is basically google maps/etc.

of course, we never know the true time to travel some segment. we might know it now, but by the time the driver gets to some road (5 or 10 minutes from now) the conditions may have changed. and of course we have historical data on traffic from which we can predict what the condition of the road will be like in 10 minutes.

so here, "foo" is a function that takes the time of data, historical traffic data, weather and whathaveyou, and maps it to edge costs.

"bar" is dijkstra's algorithm or whatever shortest path algorithm you like.

my claim is that if you really want to solve this problem, it's much more important to understand how to create foo than how to create bar. in particular, if i gave you a greedy or near greedy approach to bar, combined with a really good foo, i bet this would be significantly better than an optimal bar and a crappy foo.
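
to make the foo/bar split concrete, here's a minimal python sketch. predict_minutes is a hypothetical stand-in for a learned foo (a regressor over edge features); dijkstra is the textbook bar. all the interesting work is in foo's features, not in bar.

    import heapq

    def predict_minutes(edge_features):
        # stand-in for a trained model: base time inflated by predicted congestion
        return edge_features["base_minutes"] * (1.0 + edge_features["congestion"])

    def dijkstra(graph, source, target):
        # graph: node -> list of (neighbor, edge_features)
        dist, heap = {source: 0.0}, [(0.0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == target:
                return d
            if d > dist.get(u, float("inf")):
                continue
            for v, feats in graph.get(u, []):
                nd = d + predict_minutes(feats)   # foo supplies the edge cost
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
        return float("inf")

    graph = {
        "A": [("B", {"base_minutes": 10, "congestion": 0.8}),
              ("C", {"base_minutes": 15, "congestion": 0.0})],
        "B": [("D", {"base_minutes": 5, "congestion": 0.1})],
        "C": [("D", {"base_minutes": 5, "congestion": 0.0})],
    }
    print(dijkstra(graph, "A", "D"))   # 20.0: the nominally longer route wins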

27 September 2014

AMR: Not semantics, but close (? maybe ???)

Okay, necessary warning. I'm not a semanticist. I'm not even a linguist. Last time I took semantics was twelve years ago (sigh.)

Like a lot of people, I've been excited about AMR (the "Abstract Meaning Representation") recently. It's hard not to get excited. Semantics is all the rage. And there are those crazy people out there who think you can cram the meaning of a sentence into a !#$* vector [1], so the part of me that likes Language likes anything that has interesting structure and calls itself "Meaning." I effluviated about AMR in the context of the (awesome) SemEval panel.

There is an LREC paper this year whose title is where I stole the title of this post from: Not an Interlingua, But Close: A Comparison of English AMRs to Chinese and Czech by Xue, Bojar, Hajič, Palmer, Urešová and Zhang. It's a great introduction to AMR and you should read it (at least skim).

What I guess I'm interested in discussing is not the question of whether AMR is a good interlingua but whether it's a semantic representation. Note that it doesn't claim this: it's not called ASR. But as semantics is the study of the relationship between signifiers and denotation, [Edit: it's a reasonable place to look; see Emily Bender's comment.] it's probably the closest we have.


We've spent some time looking at the data (dummy), to try to understand what is actually there. What surprised me was how un-semantics-y AMR tends to be. The conclusion I reached is that it might be much closer to a sort of D-structure (admittedly with some word sense disambiguation) than a semantics (sorry, I grew up during GB days and haven't caught on to the whole minimalism thing). And actually it's kind of dubious even as a D-structure....

Why do I say that? A handful of reasons; I'll give an example of some of them. All of these examples are from the 1274 training AMRs from The Little Prince.

Gapped agents in matrix clauses

Example: "... insisted the little prince , who wanted to help him ."

(i / insist-01
  :ARG0 (p / prince
          :mod (l / little)
          :ARG0-of (w / want-01
                     :ARG1 (h / help-01
                             :ARG1 (h2 / he))))
  :ARG1 (...))

Why does this surprise me? I would strongly have expected "p" (the Little Prince) to be the ARG0 of help. But in this representation, help doesn't have any ARG0.

You could make some argument that you could write a tool to transform such matrix clauses and re-insert the ARG0 of the matrix verb as the ARG0 of the subordinate verb, but this doesn't work in general. For instance, "I always want to rest" in the dataset also doesn't have "I" as an argument of "rest." Unfortunately, the rest-er in the propbank/verbnet definition of rest is (correctly) its ARG1/theme. So this deterministic mapping doesn't work -- if you did this the interpretation would be that "I always want to rest" means "I always want to cause-something-unknown to rest" which is clearly different.

Perhaps this is an annotation error? I found the same "error" in "And I knew that I could not bear the thought of never hearing that laughter any more" (the ARG0 of "hear" is missing) but there are a few cases where the subordinate clause does get an agent; namely:
  • She did not wish to go out into the world all rumpled...
    ("go" correctly gets "she" as the ARG0)
  • ...that she wished to appear...
    ("appear" gets "she" as the ARG0)

So I'm not sure what's going on here. At least it's inconsistent...

Noun-noun compounds

In a semantic representation, I would expect the interpretation of noun-noun compounds to be disambiguated.

For instance, the string "only one ring of petals" is annotated as:

        (r / ring :quant 1
                  :mod (o / only)
                  :consist-of (p / petal))

But the string "glass globe" is simply:

        (g / globe
                  :mod (g2 / glass))

In other words, the "ring of petals" is a ring that consists of petals. But the "glass globe" is just a "globe" modified by "glass." We have no idea what sort of modification this is, though one could argue that it's the consist-of relation that was used for ring of petals.

As made poignant by Ewan Dunbar, disambiguating noun-noun compounds is important when translating into many other languages:



Possession

There are many types of possession, many of which can be turned into predicates:
  • Erin's student => the student Erin advises
  • Julio's paper => the paper Julio wrote
  • Smita's football team => the team Smita plays on; the team Smita owns; the team Smita roots for
  • Kenji's speech => the manner in which Kenji talks; the speech Kenji gave (these last are actually hard to predicatize given my inventory of English verbs)
(Thanks to Ani Nenkova for most of these examples.)

In AMR, it appears the rule is that apostrophe-s ('s) turns into :poss. You (appear to) get (student :poss Erin) and (paper :poss Julio) and (team :poss Smita) and (speech :poss Kenji).

This is a bit strange because the Norman genitive alternation ("of") of possession in English (as opposed to the Saxon genitive "'s") does not turn into :poss. For instance, "other side of the planet" becomes:

                     (s2 / side
                              :mod (o / other)
                              :part-of (p2 / planet))

Here, the "of" has been disambiguated into :part-of; in contrast, with "air of authority", we get:

         (a / air
            :domain (a2 / authority))

where the "of" has turned into ":domain". In fact, I cannot find any cases where there is a :poss and no "'s" (or his/her/etc...).

Now, you could argue that "planet's side" and "authority's air" sound at best poetic and at worst wrong. (Though I find "planet's other side" pretty acceptable.) But this is basically a property of English: these are totally fine in Japanese with /no as the possessive marker (according first to my prior and then confirmed by my Japanese informant Alvin -- thanks Alvin :P). I'm guessing they're okay in Chinese too (with /de as the possessive marker), but I'm not positive.

Moreover, possession is more complicated than that: lots of languages distinguish between alienable and inalienable possession. A classic example is "my aunt" versus "my car." My aunt is inalienable because, try as I might, I cannot sell, buy, exchange, etc., my aunt. My car is alienable because I can do these things. In lots of languages (about half, according to WALS), there is more than one class of possession, and the two-class (in)alienable distinction is the most common.

As an example (taken from that WALS link, originally due to Langdon (1970)): In Mesa Grande Diegueño (Yuman; California), inalienable nouns like mother (ətalʸ) take a simple prefix ʔ- ('my'), while house (ewa) takes the compound prefix ʔə-nʸ- as in:

    a. ʔ-ətalʸ
       1sg-mother
       ‘my mother’ 

    b. ʔə-nʸ-ewaː
       1sg-alienable-house
       ‘my house’ 

So you could argue that in this case it's purely a property of the possessed noun, and so even in Mesa Grande Diegueño, you could say :poss and then disambiguate by the semantics of the possessee.


Wrap Up

I could continue to go on, but perhaps I've made my point. Which is not that AMR sucks or is uninteresting or anything like that. It's just that even if we can parse English into AMR, there's a long way to go before we can start doing semantic reasoning from it. And maybe along the way you learned something. I know I did.

I think what really drove it home for me that AMR is not so much of a semantic representation is the ease with which I could imagine writing a rule-based generator for AMR to English. Yes, the sentences would come out stodgy and kind of wrong, but given an AMR and doing an in-order traversal of the tree, I'm pretty sure I could generate some fairly reasonable sentences (famous last words, right?). I believe this is true even if you took the AMR, re-ified it into a Hobbs-esque flat form first. The first step would be un-reification, which would basically amount to choosing a root, and then going from there. Like all NLP papers written in the 80s, perhaps the devil is in the details, in which case I'll be happy to be proved wrong.

One interesting final point for me is that, as established in the paper I stole this title from, AMR is actually a pretty reasonable interlingua. But it's a pretty crappy semantic representation (IMO). This sort of breaks the typical MT Vauquois triangle. That alone is kind of interesting :).




[1] Due to Ray Mooney, saw on Twitter, but now can't find the picture of the slide any more.

31 July 2014

Reading group notes: point/counter-point on "predict models"

In our local summer reading group, I led the discussion of two papers that appeared in Baltimore last month: "Don't count, predict!" by Baroni, Dinu and Kruszewski, and "Linguistic Regularities in Sparse and Explicit Word Representations" by Levy and Goldberg.

I love handouts, so I made a handout for this one too; I paste it below. All good ideas are those of the respective authors; all errors and bad ideas are probably due to bad transcription on my part.


Don't count, predict! A systematic comparison of context-counting
vs. context-predicting semantic vectors

Marco Baroni & Georgiana Dinu & Germ\'an Kruszewski

Motivation: these silly deep learning people keep writing papers but
don't compare to traditional distributional semantics models. So we
will.

Conclusion: okay, those people are actually right.

== Background ==

Distributional semantics = you know a word by the company it keeps

"Count models":
 * For each word type, collect context vectors
 * Context vectors look at n words on the left and right
     (with position info? together or separately?)
     varied in 2..5
 * Each type is represented by the bag of contexts in which it appears
 * Contexts are scored by PMI or LLR
     (gets rid of useless but frequent contexts)
 * We might reduce dimensionality to k in { 200, ..., 500 }
    - using either SVD or NNMF
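
[Not part of the original handout: a minimal python sketch of such a count model on a toy corpus -- window co-occurrence counts, then PPMI reweighting, then SVD compression. The corpus and dimensionalities are made up.]

    import numpy as np

    corpus = "the cat sat on the mat the dog sat on the rug".split()
    vocab = sorted(set(corpus))
    idx = {w: i for i, w in enumerate(vocab)}
    window = 2

    # collect window co-occurrence counts
    counts = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(corpus):
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j != i:
                counts[idx[w], idx[corpus[j]]] += 1

    # PPMI: max(0, log [ p(w,c) / (p(w) p(c)) ])
    total = counts.sum()
    pw = counts.sum(axis=1, keepdims=True) / total
    pc = counts.sum(axis=0, keepdims=True) / total
    ppmi = np.maximum(0.0, np.log((counts / total + 1e-12) / (pw * pc)))

    # compress to k dimensions with SVD
    k = 3
    U, S, Vt = np.linalg.svd(ppmi)
    vectors = U[:, :k] * S[:k]

    def cosine(u, v):
        return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

    print(cosine(vectors[idx["mat"]], vectors[idx["rug"]]))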

"Predict models" (aka deep learning):
 * Assume a mapping from word type -> k-dim vector
 * Learn a model to predict any word token given the vectors
     of the n words to its left and right
     varied in {2,5}
 * Words are thrown out proportional to their frequency
    - makes things faster
    - reduces importance of frequent words like IDF
 * Vary k in { 200, ..., 500 }

CW (Collobert & Weston) models:
 * freely available online
 * 100 dimensional vectors trained for 2 months on wikipedia
 * predict a word with 5 words to the left and right
 * used extensively in other literature

== Tasks ==
 * Synonym detection from TOEFL
   - given "levied" choose from { imposed, believed, requested, correlated }
   - compute cosine of word representations
 * Concept categorization
   - cluster words like { helicopters, motorcycles } into one class
     and { dogs, elephants } into another
   - used off-the-shelf clustering algorithms
 * Selectional preferences
    - given a verb/noun pair say whether the verb selects for that noun
    - eg, "eat apples" versus "eat gravity"
    - for each verb, take 20 most strongly associated nouns, average
        their representations, and measure cosine similarity to that
 * Analogy
    - eg, brother:sister :: grandson:(?granddaughter)
    - find nearest neighbor of (brother - sister + grandson)

== Summary of results ==
                                     pred    tie   count    (win: >=5)
                                     wins           wins
 * tune parameters PER TASK:          10      4      0
 * tune parameters OVERALL:           10      3      1
 * worst parameters                   14      0      0
 * best from relatedness:             11      3      0

 - best count   model: window=2, PMI, no compression, 300k dimensions
 - best predict model: window=5, no hier softmax, neg sampling, 400 dim







Linguistic Regularities in Sparse and Explicit Word Representations
Omer Levy & Yoav Goldberg

Motivation: neural representations capture analogical reasoning well;
why does this happen?

Conclusion: we can do just as well using traditional distributional word
representations if we measure similarity in a "multiplicative" way

== Background ==

* neural language models produce representations that can answer analogy questions:
  - gender...   man:woman :: king:queen
  - speakers... france:french :: mexico:spanish
  - number...   apple:apples :: car:cars
* question: how much of this is a property of *embeddings* (ie dense, low-dimensional)
  - alternative is distributional similarity == bag of contexts (ala Baroni paper)

== Experiment 1 ==

* Mikolov (word2vec) computes similarity for solving a:b :: a*:b* by finding:
    arg max_{b*}  similarity(b*, a* - a + b)       (called 3CosAdd)
   aka similarity(queen, king - man + woman)
* they use cosine similarity: cos(u,v) = dot(u,v) / [ ||u||  ||v|| ]
* expanding out, you get:
    arg max_{b*}  cos(b*,b) + cos(b*,a*) - cos(b*,a)
  aka
    cos(queen,woman) + cos(queen,king) - cos(queen,man)

Results: on MSR & Google datasets, embeggings >> explicit (predict >> count)
         on SemEval, basically tied (closed vocabulary?)

* an alternative is:
    arg max_{b*}  similarity(b*-b, a*-a)           (called PairDirection)
  aka similarity(queen-woman, king-man)

Results: Much much worse on MSR, Google (open vocab), and better (though tied) on
SemEval. Perhaps because scale matters for open vocabulary? could have been tested
explicitly... (drat!)


== Experiment 2 ==

Looking at the expansion of 3CosAdd, it looks like a "noisy or" sort of operation:
 - b* should be close to b, should be close to a*, should be far from a...

What about using something more like noisy and:

            cos(b*,b)  cos(b*,a*)
  arg max  -----------------------
      b*       cos(b*,a) + eps

aka

            cos(queen,king)  cos(queen,woman)
  arg max  -----------------------------------
      b*          cos(queen,man) + eps
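
[Not part of the original handout: a minimal python sketch of both scoring rules, assuming a dict vecs mapping each vocabulary word to a numpy vector.]

    import numpy as np

    def cos(u, v):
        return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

    def solve_analogy(vecs, a, a_star, b, method="3CosMul", eps=1e-3):
        # a : b :: a* : b*, return the best b*
        best, best_score = None, -np.inf
        for w, v in vecs.items():
            if w in (a, a_star, b):
                continue
            ca, cas, cb = cos(v, vecs[a]), cos(v, vecs[a_star]), cos(v, vecs[b])
            if method == "3CosAdd":
                score = cb + cas - ca                 # the "noisy or"
            else:
                score = (cb * cas) / (ca + eps)       # the "noisy and"
            if score > best_score:
                best, best_score = w, score
        return best

    # e.g. for man:woman :: king:queen,
    # solve_analogy(vecs, "man", "king", "woman") should return "queen"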

Results:


                      MSR       Google
  3CosAdd  Predict    54%       63%
           Count      29%       45%
  3CosMul  Predict    59%       67%
           Count      57%       68%

One against the other:

  Both correct    -- 54% of cases
  Both wrong      -- 24%
  Predict correct -- 11.1%
  Count correct   -- 11.6%

27 July 2014

Hello, World!

Okay, usually Hello World is the first program you learn to write in a new programming language. For fun, I've been collecting how to say hello world in different human languages, something remarkably difficult to search for (because of the overloading of the word "language"). I have 28. I'd like to make it to 280 :). If you have one (or more) to contribute, email me, post a comment, or tweet to me @haldaume3. And of course if you think any of these is wrong, please let me know that too.

     1 bar Servus Woid!
     2 ca  Hola Món!
     3 de  Hallo Welt!
     4 en  Hello World!
     5 eo  Saluton, Mondo!
     6 es  ¡Hola Mundo!
     7 eu  Kaixo, mundua!
     8 fi  Hei maailma!
     9 hu  Helló, világ!
    10 ia  Hallo, mundo!
    11 id  Halo dunia!
    12 ja  こんにちは世界
    13 lv  Sveika, pasaule!
    14 min Helo dunia!
    15 mk  Здраво свету!
    16 ms  Helo dunia!
    17 nn  Hallo verda!
    18 no  Hallo, verden!
    19 pt  Olá Mundo!
    20 sh  Zdravo svete!
    21 sl  Pozdravljen svet!
    22 sq  Njatjeta Botë!
    23 sr  Здраво свете!
    24 sv  Hej Världen!
    25 th  เฮลโลเวิลด์
    26 tr  Merhaba dünya!
    27 vi  Xin chào thế giới!
    28 zh  世界,你好!

05 July 2014

My ACL 2014 picks...

Usual caveats: didn't see all papers, blah blah blah. Also look for #acl14nlp on twitter -- lots of papers were mentioned there too!

  • A Tabular Method for Dynamic Oracles in Transition-Based Parsing; Yoav Goldberg, Francesco Sartorio, Giorgio Satta.
    Joakim Nivre, Ryan McDonald and I tried searnifying MaltParser back in 2007 and never got it to work. Perhaps this is because we didn't have dynamic oracles and we thought that a silly approximate oracle would be good enough. Guess not. Yoav, Francesco and Giorgio have a nice technique for efficiently computing the best possible-to-achieve dependency parse given some prefix, possibly incorrect, parse.
  • Joint Incremental Disfluency Detection and Dependency Parsing; Matthew Honnibal, Mark Johnson
    The basic idea is to do shift-reduce dependency parsing, but allow for "rewinds" in the case of (predicted) disfluencies. I like that they didn't just go with the most obvious model and actually thought about what might be a good way to solve this problem. The basic idea is that if you get "Please book a flight to Boston uh to Denver...", you parse "to Boston" as usual, but then when you get to the "uh", you remove old arcs. You do it this way because detecting the disfluent segment ("to Boston") is much easier when you hit "uh" than when you hit "to Boston."
  • Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors; Marco Baroni; Georgiana Dinu; Germán Kruszewski
    This paper is summarized best by its own statement, which should win it the award for most honest paper ever: "...we set out to conduct this study because we were annoyed by the triumphalist overtones often surrounding [neural network embeddings], despite the almost complete lack of a proper comparison.... Our secret wish was to discover that it is all hype... Instead, we found that the [embeddings] are so good that, while the triumphalist overtones still sound excessive, there are very good reasons to switch to the new architecture."
  • Learning to Automatically Solve Algebra Word Problems ; Nate Kushman; Luke Zettlemoyer; Regina Barzilay; Yoav Artzi
    An algebra word problem is something like "I have twice as many dimes as nickels and have $2.53. How many nickels do I have?" Of course usually they actually have an answer. They have a nice, fairly linguistically unstructured approach (i.e., no CCG) for mapping word problems to algebraic formulae and then solving those formulae. Code/data available.
  • Grounded Compositional Semantics for Finding and Describing Images with Sentences; Richard Socher, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng
    This is the follow-on work from Richard's NIPS workshop paper on text <-> images from this past NIPS. They fixed the main bug in that paper (the use of l2 error, which gives a trivial and uninteresting global optimal solution) and get nice results. If you're in the langvis space, worth a read, even if you don't like neural networks :).
  • From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions; Peter Young, Alice Lai, Micah Hodosh, Julia Hockenmaier
    I really like the "visual denotations" idea here. Basically you say something like "the set of worlds in which this sentence is true is the set of images in which this sentence is true (i.e., roughly the sentence is entailed by the image)." You can then measure similarity between sentences based on denotations.
  • Kneser-Ney Smoothing on Expected Counts; Hui Zhang; David Chiang
    I didn't actually see this talk or read the paper, but lots of people told me in hallways that this is a very nice result. Basically we like KN smoothing, but it only works for integral counts, which means it's hard to incorporate into something like EM, which produces fractional counts. This paper solves this problem.
  • Linguistic Structured Sparsity in Text Categorization; Dani Yogatama; Noah A. Smith
    Also didn't see this one, but skimmed the paper. The reason I really like this paper is because they took a well known technique in ML land (structured sparsity) and applied it to NLP, but in an interesting way. I.e., it wasn't just "apply X to Y" but rather find a very linguistically clever/interesting way of mapping X to a problem that we care about. Very cool work.
Overall I really liked the conference, thanks to everyone who helped put it together. I can't help but notice that about half of my picks above were actually TACL papers. I suspect this will be more and more true over time.

Please add comments with your favorite papers that I missed!

30 June 2014

Divergences passed through Bayes' rule

In a previous post's comments, we talked about Bayes rule and things like that. This got me wondering about the following question:

If we know p(A) and p(B|A), we can reconstruct p(A|B) perfectly by Bayes' rule. What if we only have estimates of p(A) and p(B|A)? How does the quality of the reconstruction of p(A|B) vary as a function of the quality of the estimates of the marginal and conditional?
I feel like there have to be results along these lines, but I was unable to find them. My next attempt was to prove something, which failed miserably after a few hours.  So, as a good empiricist and lazy(/bad) theorist, I designed a simple experiment.

Let A and B be binary variables. Let's generate a random joint distribution p(A,B), which has four cells for the four possible combinations of values of A and B. From this, we can directly compute the true marginal p(A) and the true conditionals p(B|A) and p(A|B).

Now, let's pick some "estimate" q(A) and q(B|A). You can think of these as a "noisy" version of p(A) and p(B|A). Given q(A) and q(B|A), we can compute a reconstructed joint distribution q(A,B) = q(A)q(B|A), as well as a reconstructed conditional distribution q(A|B) = q(A)q(B|A) / Z(q), where Z(q) is computed according to q. We can then compare q(A,B) to the true p(A,B) and q(A|B) to the true p(A|B) and measure how far they are.

At this point we have to decide what our measurement (divergence) function is. I tried three: variational distance (max absolute difference), l1 distance (sum absolute difference) and KL divergence. To be absolutely pedantic, I will define the versions of these that I used. First, the KL variants:
KL( p(A) || q(A) ) = sum_a p(a) log [ p(a) / q(a) ]
KL( p(A,B) || q(A,B) ) = sum_{a,b} p(a,b) log [ p(a,b) / q(a,b) ]
KL( p(A|B) || q(A|B) ) = sum_b p(b) KL( p(A|B=b) || q(A|B=b) )
Note that the direction is q from p (chosen because p is the "true" distribution) and that this also has the advantage that the conditional KL is based on p(B), which (in this case) is known exactly and is "correct."

By analogy, for l1 distance we have:
l1(p(A), q(A)) = sum_a |p(a) - q(a)|
l1(p(A,B), q(A,B)) = sum_{a,b} |p(a,b) - q(a,b)|
l1(p(A|B),q(A|B)) = sum_b p(b) l1(p(A|B=b), q(A|B=b))
Note that this last one might be slightly non-standard, but is parallel to the KL definition.

Similarly, for variational distance:
var(p(A), q(A)) = max_a |p(a) - q(a)|
var(p(A,B), q(A,B)) = max_{a,b} |p(a,b) - q(a,b)|
var(p(A|B),q(A|B)) = sum_b p(b) var(p(A|B=b), q(A|B=b))
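
For concreteness, here is a python sketch of these quantities for binary A and B (the reconstruction by Bayes' rule plus the three divergences); the particular p and q at the bottom are just made-up examples.

    import numpy as np

    def divergences(p_joint, qA, qB_given_A):
        # p_joint: 2x2 array indexed [a, b]; qA: length-2; qB_given_A: 2x2 [a, b]
        q_joint = qA[:, None] * qB_given_A                    # q(A,B) = q(A) q(B|A)
        pB = p_joint.sum(axis=0)                              # true p(B)
        p_A_given_B = p_joint / p_joint.sum(axis=0, keepdims=True)
        q_A_given_B = q_joint / q_joint.sum(axis=0, keepdims=True)   # Bayes' rule

        kl_joint = np.sum(p_joint * np.log(p_joint / q_joint))
        kl_cond = np.sum(pB * np.sum(p_A_given_B * np.log(p_A_given_B / q_A_given_B), axis=0))
        l1_joint = np.sum(np.abs(p_joint - q_joint))
        l1_cond = np.sum(pB * np.sum(np.abs(p_A_given_B - q_A_given_B), axis=0))
        var_joint = np.max(np.abs(p_joint - q_joint))
        var_cond = np.sum(pB * np.max(np.abs(p_A_given_B - q_A_given_B), axis=0))
        return kl_joint, kl_cond, l1_joint, l1_cond, var_joint, var_cond

    p = np.array([[0.4, 0.1], [0.1, 0.4]])        # a true joint with correlation
    print(divergences(p, qA=np.array([0.5, 0.5]),
                      qB_given_A=np.array([[0.7, 0.3], [0.3, 0.7]])))
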
Okay, so now for the experiment. First I generate a random (uniform) true joint distribution p(A,B). I then run through 1,000,000 possible q(A,B), where each of the three sufficient statistics is chosen from [0.01, 0.02, ..., 0.99]. I then conditionalize and marginalize these in all the relevant ways and compute KL. Finally, I generate plots like the following very representative example for KL:
On the left column, we're inspecting the recovered joint distribution and in the right column the recovered conditional distribution. The top row shows: for different divergences of q(A) from p(A), and for different divergences of q(B|A) from p(B|A), how far is (left) the recovered joint q(A,B) from the true joint q(A,B), or how far is the (right) recovered conditional q(A|B) from the true conditional p(A|B). The middle row is the projection of this into two dimensions, focusing on the divergence in the marginal, and the bottom row is the projection onto the divergence in the conditional. The title shows what the true distribution is in the form [p(a,b) p(a,~b) ; p(~a,b) p(~a,~b)]. I chose this example because the joint has a correlation between a and b.

This example is fairly benign: as the approximations become worse, so do both of the recovered distributions, in a fairly linear way until a plateau. From the bottom row, you can see that it's more important to get the conditional right than the marginal (you can have a marginal that's quite far off -- e.g., a KL of 1.5 -- and still get an almost perfect recovery of the conditional or joint, but this is not true for large differences in the conditional B|A).

One strange thing is that you often (for different true joints) see results that look like:
There's a very strange effect here, in which a larger KL on B|A can actually be better at the recovery of the conditional, while worse at the recovery of the joint.

 One can ask if this is an artifact of KL. So let's switch to L1 and variational for the first set of plots:

and variational:
So, in both L1 land and variational land, you can do better on the conditional by being worse on the (other) conditional.

For the example that gave rise to the weird KL results, we have the following for L1:
which shows almost an identical effect. For variational:
the effect is still the same.

Okay, so it's totally entirely possible (perhaps probable?) that there's a bug in my code. If you'd like to look, check out mykl.m and myklrun.m (yes, it's matlab). Let me know in the comments if there are bugs. If you'd like to look at more examples, check out all ten examples.

02 June 2014

Role models

During grad school, my advisor suggested I identify a recent grad who has been, to me, successful. I could then use him or her as a guide. I picked someone (he now knows who he is), and the exercise was useful: there are lots of ways to be successful in research land, and this helped me focus.

RST-relation=Topic-Shift.

I'm fairly serious about yoga. I've had a lot of instructors over the years and noticed a high correlation between InstructorILike and InstructorWhoIsMale. Initially I believed this was because male instructors pushed more, and that worked for me. Over time I realized that was not the full story.

I spent two weeks going to classes by instructors I hadn't had before to try to understand what variable(s) made the difference. I believe now that a large part of the reason I like male instructors is precisely because they're male. A female instructor would do some crazy pose and my brain would immediately say "I could never do that." A male instructor would do the same pose and my brain would say "If he can do it, so can I." (I'd then try and fail several times, but never with a defeatist attitude.)

Topic-UnShift.

I've heard for a long time that having role models you can identify with is important. As someone who has in almost all of my life fit into the overwhelming majority (white male in tech/academia), it's been rare that I've had the opportunity to really feel this effect for myself. I try to believe things even if they haven't happened to me, but it's always better when you can empathize rather than sympathize and it's easier to empathize when you've actually been there.

The first time I remember feeling the effect of a role model "who looks like me" was at the 1996 Olympics, when Poul-Erik Høyer Larsen (Denmark) was the first European ever to win the badminton semi-finals; he then won the gold medal against Dong Jiong (China). (This sport is dominated by Indonesia, China and Malaysia.) Growing up in a particular part of Los Angeles and playing badminton as a kid, I was very much an outlier. Even though I'd never heard of Poul-Erik before (everyone knew who Jiong was), his win gave me something I could aspire to.

A few years ago I began broadcasting my support of the LGBT community, e.g., an HRC link on my web page and painting my laptop. Since then I've gotten emails from several people (mostly students) effectively asking why there aren't more/any LGBT role models in our community. You can interpret "community" as meaning anything in the NLP/ML to CS to Science/Tech range. My answer: I don't know. It's hard to even know how large this community is because, unlike things like race and (binary) gender, it's not always outwardly inferrable (with noise). These issues affect tech in nuanced ways; see for instance an interview with the founder of Lesbians Who Tech or Queer in STEM for more.

This is all to say that having role models is important, and yes, it does matter who they are, where they came from, and what they look like. It mattered to the high school aged version of me, the grad school version of me, and the associate prof version of me. I'm not saying anything new here, but for our field to be healthy, we need a large number of successful people who can be role models for all sorts of students (and beyond). Token visibility is not enough because a single example of some particular label won't match with everyone who self-identifies with that label. The person I chose was, yes, a white male. There were plenty to choose from. But I chose him, and others would not have sufficed.

30 May 2014

Past tense is not past tense

I took part in a wonderful Dagstuhl workshop this past February on translating morphologically rich languages. (Yeah, I also don't really know why I was invited :P.) But many thanks to Alex, Kevin, Philipp, Helmut and Hans for inviting me. I had a realization during this workshop that I thought I'd share. It's obvious in retrospect, and perhaps in front-spect for many of you. Much of this came up in the discussion with Bonnie Webber, Marion Weller, Martin Volk, Marine Carpuat, Jörg Tiedemann and Maja Popovic, and Maja deserves much credit for her awesome error analysis tool that helped shed some light on German.

One thing you commonly think of when translating into a morphologically rich language is that there's stuff you're going to have to hallucinate. Really this isn't an issue of morphology per se, but just that this is one place where it's obvious. For instance, even going from English to French you'll have to hallucinate gender on your determiners (un versus une and le versus la) that's unmarked in English. Or when going from Japanese (which roughly combines present and future tenses into a single tense) to English, you'll have to hallucinate "will" at appropriate places.

An abstraction that I think was pretty widespread among the initial discussions in the workshop was that if you're going from language X to Y, there are basically two options:

  1. Phenomenon foo is explicit in Y but implicit in X, and therefore you'll have to hallucinate it (i.e., tense is explicit in English but not in Mandarin)
  2. Phenomenon bar is explicit in Y and also explicit in X, and so you can just copy it.
The problem is that (2) is just false, even for things that you think it might not be false for. That is to say: Just because two languages explicitly code for something that we give a consistent (linguistic) name to, doesn't mean that they code for it consistently.

Okay, so you want examples.

An easy example is gender. I've been well assured that, for instance, French and Russian both have explicit gender. But just because some noun (eg moon/lune) is feminine in French doesn't mean it's also feminine in Russian. (In fact I think it's neuter.)

You might argue gender is a stupid thing to pick because it's essentially an artificial encoding of who-knows-what.

How about tense. That clearly has a semantic interpretation (did something happen in the past, the present or the future) and so if languages X and Y both express some particular tense, they must be consistent in how they do it.

Wrong. Now my memory is getting a bit shaky, but my recollection is that, for instance, in newswire text, it's very common in German to refer to things that have happened in the past in present tense. To English speakers this is a strange convention (we tend to refer to such things in past tense), but it doesn't have to be so. And of course English has its own idiosyncrasies: see the plight of the native German speaker who cannot understand English tense usage in (New Zealand) news articles.

Part of this is probably because tense, even in English, is a pretty slippery concept. We (native English speakers) have no problem using present (or progressive) tense to refer to things that happened in the present (John runs) the past (so yesterday I'm running to the store and a hamburger falls on my head!) or the future (my flight leaves at 8:00 tonight).

Another easy example is definiteness (thanks to Kevin Knight for this inspiration). Again, our high school English teachers tell us that "the" (+Definite) has to refer to something that's already been introduced into context. I just went to cnn.com, clicked on the very first link, and the first sentence is "The boss resigned under pressure and other Veterans Affairs managers are likely on the way out." Ok you could argue that "The boss" is already in the context of the US news media (this is an article about Shinseki) but it's nonetheless very common to see (English) entities introduced using "the" and the precise rules that govern this may or may not be consistent across other languages.

The long and short of this is: I like the fact that translation into morphologically rich languages makes us pay attention to linguistic divergence. But that doesn't mean that divergences aren't there even when languages express the same set of linguistically-named phenomena. Usage can vary dramatically, be it for conventions, socio-linguistic reasons, or other things that are hard to pin down. It's just that by focusing all our energy on a very particular convention (newswire, parliament), we can pretty easily learn these mappings because there's no variability. Add some variability and we're hosed, even for languages with the same set of (overt) markings.

16 May 2014

Perplexity versus error rate for language modeling

It's fair to say that perplexity is the de facto standard for evaluating language models. Perplexity comes under the usual attacks (what does it mean? does it correlate with something we care about? etc.), but here I want to attack it for a more pernicious reason: it locks us in to probabilistic models.

Background

Language modeling---or more specifically, history-based language modeling (as opposed to full sentence models)---is the task of predicting the next word in a text given the previous words. For instance, given the history "Mary likes her coffee with milk and", a good language model might predict "sugar" and a bad language model might predict "socks." (This is related to the notion of cloze probability.)

It's quite clear that there is no "right answer" to any of these prediction problems. As an extreme example, given the history "The", there are any number of possible words that could go next. There's just no way to know what the "right" answer is, whether you're a machine or a person.

This is probably the strongest justification for a perplexity-like measure. Since there's no "right" answer, we'll let our learned model propose a probability distribution over all possible next words. We say that this model is good if it assigns high probability to "sugar" and low probability to "socks."

Perplexity just measures the cross entropy between the empirical distribution (the distribution of things that actually appear) and the predicted distribution (what your model likes), normalized per word and exponentiated, after throwing out unseen words.
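
To make the contrast concrete, here's a minimal python sketch of both evaluations. The function lm(history) is a hypothetical stand-in for a model that returns a dict mapping each possible next word to its probability.

    import math

    def perplexity(lm, test):
        # test: list of (history, next_word) pairs
        logprob, n = 0.0, 0
        for history, word in test:
            dist = lm(history)
            if word not in dist:          # crude OOV handling: just skip it
                continue
            logprob += math.log(dist[word])
            n += 1
        return math.exp(-logprob / n)

    def error_rate_at_1(lm, test):
        errors = 0
        for history, word in test:
            dist = lm(history)
            guess = max(dist, key=dist.get)   # the single most probable next word
            errors += (guess != word)
        return errors / len(test)

Perplexity needs the full distribution lm(history); the error rate only needs its argmax, which is exactly why a non-probabilistic predictor can play the second game but not the first.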

The Issue

The issue here is that in order to compute perplexity, your model must produce a probability distribution. Historically we've liked probability distributions because they can be combined with other probability distributions according to the rules of probability (eg., Bayes rule or chain rule). Of course we threw that out a long time ago when we realized that combining things, for instance, in log-linear models, worked a lot better in practice, if you had a bit of data to tune the weights of the log-linear models.

So the issue in my mind is that there's plenty of good technology out there for making predictions that does not produce probability distributions. I think it's really unfortunate that non-probabilistic approaches don't get to play the language modeling game because they produce the "wrong" sort of output (according to the evaluation, but not according to the real world). I'm not saying there aren't good reasons to like probabilistic models, but just that alternatives are good. And right now those alternatives cannot compete. (For instance, Roark, Saraclar and Collins [2007] don't use perplexity at all and just go for word error rate of a speech recognizer around their perceptron-based language model.)

When I Ran Into This

I was curious about building a language model using vw, in the context of another project, and also to stress-test multiclass classification algorithms that scale well with respect to the number of classes.

As soon as I ran it, I discovered the issue: it produced results in the form of error rates. As I recall (it was a while ago), the error rate was somewhere in the 60s or 70s. I had absolutely no idea whether this was good or not. It seemed reasonable.

To get a sense of how standard language models fare, I decided to train a language model using srilm, and evaluate it according to error rate. To make my life easier, I just ran it on the WSJ portion of the Penn treebank. I used the first 48k sentences as train and the last 1208 sentences as test. I trained a 5gram Kneser-Ney smoothed language model and evaluated both perplexity and error rate (the latter required a bit of effort -- if anyone wants the scripts, let me know and I'll post them---but basically, I just take the LM's prediction to be the highest probability word given the context).

The language model I built had a perplexity (ppl1 in srilm) of 236.4, which seemed semi-reasonable, though of course pretty crappy. There was an OOV rate of 2.5% (ignored in the perplexity calculation).

The overall error rate for this model was 75.2%. This means that it was only guessing a quarter of words correct. (Note that this includes the 2.5% errors mandated by OOVs.)

I also tried another version, where all the model had to do was put the words in the right order. In other words, it knows ahead of time the set of words in the sentence and just has to pick among those (roughly 20 per sentence), rather than among the full vocabulary (43k types). [This is maybe semi-reasonable for MT.] The error rate under this setting was 66.8%. Honestly I expected it would be a lot better.

Note that if you always guess ","---the most frequent type in this data---your error rate is 95.3%.

So why was it only moderately helpful (< 10% improvement) to tell the language model what the set of possible words was? Basically because the model was always guessing really high probability unigrams. Below are the top twenty predicted words when the model made an error, with their frequencies. They're basically all stop words. (This is in the unrestricted setting.)

     1    14722 ,
     2     1393 .
     3     1298 the
     4      512 and
     5      485 in
     6      439 of
     7      270 to
     8      163 ''
     9      157 a
    10      108 is
    11       54 's
    12       52 have
    13       49 said
    14       41 ``
    15       38
    16       38 for
    17       38 are
    18       34 %
    19       33 $
    20       31 be

The same list for the restricted setting is virtually identical, basically because most of these words are available in the average sentence:

     1    10194 ,
     2     5357 .
     3     1274 the
     4      274 of
     5      251 in
     6      232 to
     7      230 and
     8      193 a
     9       51
    10       36 for
    11       28 's
    12       25 ''
    13       24 said
    14       24 is
    15       21 that
    16       21 ``
    17       16 it
    18       15 from
    19       14 be
    20       13 by

Oh well.

Ok, so I don't have a "good" counter proposal to perplexity. Error rate certainly has many issues of its own. You could use IR-like metrics, like recall @ 10, mean average precision, etc., which are all questionable in their own ways.

I would just, in general, like if we could evaluate language models without having to be handcuffed by probabilities.

26 April 2014

An easy way to write less hurtful reviews: don't say "you"

I'll be honest: I've had my feelings hurt by scathing reviews more than a few times. In grad school I remember even crying over a review that I thought was particularly pernicious. My skin has thickened a bit over time, though often in the not-so-helpful manner of dismissing reviews that I don't like as "they didn't get it," which defeats one of the two primary purposes of reviews in the first place (providing feedback; the other: making accept/reject decisions).

The thing that's hard to reconcile is that I really like most of the people in our community, and everyone I meet at least seems really friendly.

When doing mock reviews with grad students, I'll often tell them to keep in mind that there's a good chance that the author is, or later will be, a friend of theirs. It's possible to provide feedback to a friend in such a way that you don't hurt their feelings.

I've recently started doing something else (in addition to the above suggestion). I don't use the words "you" or "the authors" or even "I." The review of a scientific contribution is not about me and it's not about the authors. It's about the method, the experiments and the contribution. I see little reason why you need to mention anything related to the people involved. (One exception: "I" is often useful in hedging, like the previous sentence, which would be more forceful if I just said "There is little reason...") Perhaps we could even integrate this into START...

This is, of course, similar to the pop-psych advice of talking to loved ones about "actions" rather than "the person." For instance: "I hate you for spilling coffee and not cleaning it up" versus "I hate having coffee spilt on the floor." Or something. I'm sure others can come up with better examples.

My current approach is to write my review with this in mind, and then go back and search for all occurrences of my outlawed nouns, and rewrite these sentences. Often in the process of doing this, I become aware that in many of the cases what I've said really does sound like an attack, and with the very small edit this effect is removed or at least greatly reduced.

I realize I've now just given a pretty good signal for people reading reviews to see if they were written by me or not. Here's a solution: everyone should adopt this policy and then my reviews will no longer be so obvious.
But overall, I really think we should be nice to each other. Perhaps fewer people will depart from the field if they're not constantly battered down by harsh reviews, and then we'll all be better off.

14 April 2014

Waaaah! EMNLP six months late :)

Okay, so I've had this file called emnlp.txt sitting in my home directory since Oct 24 (last modification), and since I want to delete it, I figured I'd post it here first. I know this is super belated, but oh well, if anyone actually reads this blog any more, you're the first to know how I felt 6 months ago. I wonder if I would make the same calls today... :)

If you remember anything about EMNLP anymore and have your own opinions, please feel free to comment. It will also let me know if anyone reads here anymore :).

Happy Spring from DC!




16 September 2013

Active learning for positive examples

I have a colleague who wants to look through large amounts of (text) data for examples of a pretty rare phenomenon (maybe 1% positive class, at most). We have about 20 labeled positive examples and 20 labeled negative examples. The natural thing to do at this point is some sort of active learning.

But here's the thing. We have no need for a classifier. And we don't even care about being good at finding negative examples. All we care about is finding as many positive examples from a fixed corpus as possible.

That is to say: this is really a find-a-needle-in-a-haystack problem. The best discussion I've found of this is Section 4 of Inactive Learning, by Attenberg and Provost. I came across a couple other papers that basically do some sort of balancing to deal with the imbalanced data problem in active learning, but the Attenberg and Provost results suggest that even this is tricky to get right.

But I think this only partially addresses the problem. Here's why.

Let's say that my colleague (call her Alice) is willing to spend one hour of her life looking at examples. Alice estimates that she can look at about 500 examples in that time period (small caveat: some examples are larger than others and therefore slower, but let's ignore this for now). Alice's goal is to find as many positive examples in that one hour as possible. Here are some possible strategies Alice could deploy:

  1. Look at 500 randomly selected examples, getting 5 (more) positive examples in expectation (likely somewhere between 2 and 12, with 95% confidence). This gives her a total of 25 positive examples.
  2. Train a classifier on her 40 labeled examples, have it rank the entire set (minus those 40), and look at the top 500 examples. If the learned model has a precision of r% in the top 500 (i.e., r% of those 500 are actually positive), then she should expect to get 5r more examples, giving her a total of 20+5r examples. (Caveat: this is a bit dangerous, because with very few training examples, results can often be anti-correlated with the desired output and you get better performance by flipping the predicted label. There was a paper on this maybe 5-10 years ago but I can't find it... Maybe somebody knows what I'm referring to.)
  3. Train a classifier on the 40 labeled examples, spend 30 minutes looking at its top-ranked 250 examples (getting 2.5r more positive examples), then retrain on everything labeled so far and spend 30 minutes looking at the new classifier's top-ranked 250 examples. If the second classifier has a precision of r'% in its top 250, then she should expect to get another 2.5r', for a total of 20+2.5r+2.5r' = 20 + 2.5(r+r'). So long as r' > r -- which we'd expect, since the second classifier is trained on more data -- this beats strategy 2.
  4. Taking this to the extreme, Alice could annotate one example at a time (subject to constraints on either the learning being very fast or Alice not actually being a human). As long as the precision-at-one of the classifiers learned is non-decreasing, this should (I think) be the optimal strategy.
So it seems that in a setting like this, what the classifier should optimize for is simply precision-at-one. In other words, it doesn't care about anything except that its top-ranked output is correct.
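
To make the comparison concrete, here's a toy simulation of the first three strategies (a sketch only: it assumes scikit-learn, a synthetic 50,000-example corpus with roughly 1% positives, and that "labeling" just means reading off the true label -- none of these numbers come from a real problem):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)

    # A synthetic haystack: 50,000 examples, roughly 1% positive.
    X, y = make_classification(n_samples=50_000, n_features=40, n_informative=10,
                               weights=[0.99, 0.01], random_state=0)

    # Start with 20 labeled positives and 20 labeled negatives.
    pos = rng.permutation(np.where(y == 1)[0])
    neg = rng.permutation(np.where(y == 0)[0])
    seed = np.concatenate([pos[:20], neg[:20]])
    unlabeled = np.setdiff1d(np.arange(len(y)), seed)

    def label_batch(candidates, k):
        """Pretend Alice labels the first k candidates; return them and how many were positive."""
        batch = candidates[:k]
        return batch, int(y[batch].sum())

    # Strategy 1: 500 random examples.
    _, found1 = label_batch(rng.permutation(unlabeled), 500)

    # Strategy 2: rank everything once with a classifier trained on the 40 seed examples.
    clf = LogisticRegression(max_iter=1000).fit(X[seed], y[seed])
    ranked = unlabeled[np.argsort(-clf.predict_proba(X[unlabeled])[:, 1])]
    _, found2 = label_batch(ranked, 500)

    # Strategy 3: two rounds of 250, retraining in between.
    batch_a, found3a = label_batch(ranked, 250)
    rest = np.setdiff1d(unlabeled, batch_a)
    labeled = np.concatenate([seed, batch_a])
    clf2 = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    reranked = rest[np.argsort(-clf2.predict_proba(X[rest])[:, 1])]
    _, found3b = label_batch(reranked, 250)

    print("random 500:      ", found1)
    print("one ranked 500:  ", found2)
    print("two ranked 250s: ", found3a + found3b)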

Okay, so why is this difficult? Basically because we don't have that much data. If the corpus that represents the haystack is huge, then we want to ensure that the top-one example from that haystack is a positive example. So even if all we're trying to do is select between two different hypotheses (say our hypothesis class at each time step has cardinality two), in principle we should use as much data as possible to evaluate these two hypotheses. In particular, looking at the precision-at-one on the subset of 40 labeled examples that we have is probably not representative, essentially because this loss function doesn't decompose nicely.

So what could we do instead? Well, we could pool the labeled and unlabeled examples and predict on those instead. Certainly if one of the positive labeled examples outranked everything else, that would be a good thing. But if truly 1% of the unlabeled dataset is positive, then this is not a necessary condition. In particular, the top-ranked positive labeled example could be as far as 1% down the list and we could still (in principle) be doing a good job.
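
Here's roughly what that pooled check might look like (a sketch that reuses clf, X, y, seed and unlabeled from the simulation above; it only measures where the best labeled positive lands, it doesn't tell you what to optimize):

    # Rank labeled + unlabeled together and see how far down the best labeled positive falls.
    pool = np.concatenate([seed, unlabeled])
    order = pool[np.argsort(-clf.predict_proba(X[pool])[:, 1])]
    is_labeled_positive = np.isin(order, seed) & (y[order] == 1)
    best_rank = int(np.argmax(is_labeled_positive))  # 0-based position of the best labeled positive
    print(best_rank / len(pool))                     # anything under ~0.01 is consistent with a good ranker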

Ok I'll admit: I really don't know how to do this. Maybe it's enough in most cases to optimize for zero-one loss (or a weighted version thereof) and just cross your fingers that the recall-at-one is good. But it does feel like there should be a more direct way to go about this problem.

08 July 2013

The *SEM 2013 Panel on Language Understanding (aka semantics)

One of the highlights for me at NAACL was the *SEM panel on "Toward Deep NLU", which had the following speakers: Kevin Knight (USC/ISI), Chris Manning (Stanford), Martha Palmer (UC Boulder), Owen Rambow (Columbia) and Dan Roth (UIUC). I want to give a bit of an overview of the panel, interspersed with some opinion. I gratefully acknowledge my wonderful colleague Bonnie Dorr for taking great notes (basically a transcript) and sharing them with me to help my failing memory. For what it's worth, this basically seemed like the "here's what I'm doing for DEFT" panel :).

Here's the basic gist that I got from each of the panel members, who gave roughly 10 minute talks:

Dan Roth: doing role labeling restricted to verbs is not enough. As an easy example, "John, a fast-rising politician, slept on the train to Chicago"... by normal SRL we get that John is sleeping, but not the possibly more important fact that John is a politician. Another example is prepositions: "University of Illinois" versus "State of Illinois" -- "of" is ambiguous. They came up with a taxonomy of 32 relations and labeled data and then did some learning -- see the TACL paper that was presented at NAACL, Srikumar & Roth.

Commentary: the ambiguity of prepositions issue is cool and I really liked the TACL paper. It reminds me of learning Latin in high school and being confused that ablative case markers were ambiguous across from/by/with. It astounded me that that was an acceptable ambiguity, but of course English has equally crazy ones that I've just gotten used to. But it does make me think that some cross-linguistic study/model might be cool here. Even more broadly, it made me think about noun-noun compound semantics: "farmers market" (market put on by farmers) versus "fruit market" (market where you buy fruit) versus "fruit pie" (pie made out of fruit). I went back and read Lucy Vanderwende's dissertation, which dealt exactly with these issues. She had far fewer relations than Srikumar and Roth, though perhaps once you allow explicit prepositions the range of things you can express grows (though somehow my gut feeling is that it doesn't, at least in English).

Kevin Knight: Basically talked about their deep semantic approach to MT: see the abstract meaning representation web page for more. The idea is that people who work on syntax don't Balkanize into those who do PPs, those who do VPs, etc., so why should semantics break apart like it does. AMR is a very GOFAI-style representation for language, and they've annotated a Chinese-English bilingual copy of Le Petit Prince with this representation. Now they need analyzers (hard), generators (hard) and transformation formalisms (hard). The nice thing is that this one representation captures almost all relevant semantic issues: scoping, argument structure, coreference, etc. For instance, co-ref is not explicitly annotated: it's just that a single agent can participate in multiple predicates. (Note: not yet across sentences.)

Commentary: It's hard not to get excited about this stuff, especially when Kevin talks about it. His enthusiasm is infectious. I left the talk thinking "wow I want to work on that!" There's of course the worry that we've tried this before and failed and that's why things in semantics Balkanized, but maybe the time is right to revisit it. For instance, Bonnie herself (note: she didn't tell me this; it had come up in recent discussions with Philip Resnik and our postdoc Junhui Li) had a meaning representation very similar to AMR called Lexical Conceptual Structures (LCS), and Nizar Habash had a hybrid rule-based/statistical approach to translating there. The idea was that if you want to handle divergent translations (classic example: "the bottle floated across the river" (English) versus "the bottle crossed the river floatingly" (Spanish, I think)), you need a representation that abstracts the meaning away from the surface predicates. But it's still very cool. (Actually in digging up refs, I just found this paper on mapping from LCS to AMR... from AMTA 1998!)

Martha Palmer: focused mostly on event relations that go across sentences, which includes things like event coreference, bridging relations (enablement, result) and so on. They're also looking seriously at type (evidential, aspectual, etc.), modality (actual, hypothetical, generic, hedged, etc.), polarity and aspect. They are currently doing a lot of work in the clinical domain, in which these distinctions are really important if you want to understand, say, patient medical histories.

Commentary: this is a bit outside things I usually think about, so I have less to say. I really like the hyper-sentence view, of course.

Owen Rambow: talked about some of my favorite work that I've seen recently: basically work on propositional attitudes. The view Owen put forth is that most of NLP is focused on a world of facts, and the goal of NLU is to figure out what these facts are. They are taking a much more social model of text meaning, in which you really care about inferring participants' cognitive states (standard triumvirate: belief, desire and intention). This actually shows up in at least one English-German translation example, in which Google Translate misses a very important subjunctive.

Commentary: I really liked the original work Owen did on BDI inference and I'm thrilled it's going further. I think one of the historical reasons why I find this so interesting is that propositional attitudes is basically what I started doing when I started grad school, when looking at discourse analysis through RST. I think many people forget this, but the discourse relationships in RST (and other discourse theories) are really based on attitude. For instance, X is in a background relation to Y if (roughly) the listener already believes X and the listener also believes that X increases the chance of Y. (Or something like that: I just made that up :P.) But it's all about belief of listeners and utterers.

Chris Manning: focused on deep learning, basically asserting (in a manner designed to be a bit controversial) that Stanford dependencies are their meaning representation and that the big problems aren't in representations. Sure, Stanford dependencies miss out on a lot (quantification, tense, semantic roles, modality, etc.) but he felt that there are more important problems to address. And then what we need instead is "soft" meaning representations, like vector space models and distributed representations give us, giving rise to something akin to Natural Logic.

Commentary: to a large degree I agree with the notion that the "big problems" in language are probably not those that (eg) semanticists like to look at, at least from the typical view of NLE in which we want systems that do well on average across a distribution of examples that we've cultivated. But I also worry that there's a bit of magical thinking here, in the sense that it kind of feels like a cop-out: it's too hard to define categories by hand so let's let the machine figure it out. Now, don't get me wrong, I'm all for machines figuring out stuff (I gave a not-very-well-received talk to that effect at a workshop a couple years ago on linguistics in NLP), but I'm also a bit reluctant to believe that this is really going to bring us any closer to really solving the NLU problem (whatever that is), though of course it will get us another 5-10% on standard benchmarks. (Ok this sounds way too negative: I actually really liked Chris' talk, and one of the things I liked about it was that it challenged my thinking. And I agree that there is a lot that we shouldn't be designing by hand -- some people, like Yoshua Bengio, would probably argue that we shouldn't be designing anything by hand, or at least that we shouldn't have to -- but I guess I still belong to the camp of "linguists give the structure, statistics gives the parameters.")

There was also a lot of really interesting discussion after the presentations, some of which I'll highlight below:

Lucy Vanderwende, I think mostly directed at Kevin, fell into the "we tried this X years ago" camp, basically saying that whenever they tried to abstract more and more away from the input representation, they ended up generating very boring sentences because they'd thrown out all the "nuance" (my word, not hers). The discussion afterward basically revolved around whether you annotate input sentences with meaning (which is currently the standard) or throw them out with the bathwater. Owen points out that the meaning of a passive sentence is not +passive but something much more nuanced, and if you could capture that correctly, then (in principle) generators could reflect it properly in the target language. (Me: For instance, maybe in some wacky language a passive sentence actually means that you're trying to emphasize the subject.)

There was also a lot of discussion around Chris, I think partially because he went last and partially because he was trying to be controversial. Mausam made an argument (akin to what I wrote above) that logicians have made a billion logics of language and nothing really has worked (in a sense it's been a series of negative results). What about inference rules or consistency?

Okay, that's all I want to write for now. Congrats if you made it this far. And thanks to the *SEM organizers for putting together this great panel!

17 June 2013

My NAACL 2013 list...

I feel a bit odd doing my "what I liked at NAACL 2013" as one of the program chairs, but not odd enough to skip what seems to be the most popular type of post :). First, though, since Katrin Kirchhoff (my co-chair) and I never got a chance to formally thank Lucy Vanderwende (the general chair) and give her flowers (or wine or...) let me take this opportunity to say that Lucy was an amazing general chair and that working with her made even the least pleasant parts of PCing fun. So: thanks Lucy -- I can't imagine having someone better to have worked with! And all of the rest of you: if you see Lucy and if you enjoyed NAACL, please thank her!

I also wanted to really thank Matt Post for doing the NAACL app -- lots of people really liked it and I hope we do it in the future. I'm at ICML now and constantly wishing there were an ICML app :).

Okay, with that preface, let's get down to what you came for. Below is my (complete) list of favorite papers from NAACL 2013 (also indexed on Braque), in no particular order:

  • Relation Extraction with Matrix Factorization and Universal Schemas (N13-1008 by Sebastian Riedel; Limin Yao; Andrew McCallum; Benjamin M. Marlin)
    Very cool paper. The idea is to try to jointly infer relations (think OpenIE-style) across text and databases, by writing everything down in a matrix and doing matrix completion. In particular, make the rows of this matrix equal to pairs of entities (Hal,UMD and UMD,DC-area) and the columns relations like "is-professor-at" and "is-located-in." These entity pairs and relations come both from free text and databases like FreeBase. Fill in the known entities and then think of it as a recommender system. They get great results with a remarkably straightforward approach. Reminds me a lot of my colleague Lise Getoor's work on multi-relational learning using tensor decompositions.
  • Combining multiple information types in Bayesian word segmentation (N13-1012 by Gabriel Doyle; Roger Levy)
    I guess this qualifies as an "obvious in retrospect" idea -- and please recognize that I see that as a very positive quality! The basic idea is that stress patterns (eg trochees versus iambs) are very useful for kids (who apparently can recognize such things at 4 days old!) and are also very useful for word segmentation algorithms.
  • Learning a Part-of-Speech Tagger from Two Hours of Annotation (N13-1014 by Dan Garrette; Jason Baldridge)
    Probably my overall favorite paper of the conference, and the title says everything. Also probably one of the best presentations I saw at the conference -- I can't even begin to guess how long Dan spent on his slides! I loved the question from Salim in the Q/A session, too: "Why did you stop at two hours?" (They have an ACL paper coming up, apparently, that answers this.) You should just read this paper.

  • Automatic Generation of English Respellings (N13-1072 by Bradley Hauer; Grzegorz Kondrak)
    This paper was the recipient of the best student paper award and, I thought, really great. It's basically about how English (in particular) has funny orthography and sometimes it's useful to map spellings to their pro-nun-see-ey-shuns, which most people find more useful than a formal phonetic transcription. It's a bit more of a bunch of stuff glued together than I usually go for in papers, but the ideas are solid and it seems to work pretty well -- and I'd never even thought this would be something interesting to look at, but it makes complete sense. Best part of the presentation was when Greg tripped up pronouncing some example words :).

  • Linguistic Regularities in Continuous Space Word Representations (N13-1090 by Tomas Mikolov; Wen-tau Yih; Geoffrey Zweig)
    This is a paper that makes my list because it made me think. The basic idea is that if you do some representation learning thingamajig and then do vector space algebra like repr("King") - repr("man") + repr("woman") you end up with something that's similar to repr("Queen"). It's a really interesting observation, but I'm at a loss for why we would actually expect something like this to happen!
  • PPDB: The Paraphrase Database (N13-1092 by Juri Ganitkevitch; Benjamin Van Durme; Chris Callison-Burch)
    This is a paper about a dataset release that I think I'll find useful and I bet other people will too. Go download it and play with it. I'd encourage the authors (are you listening, Juri!) to make a web demo (or web service) so that I don't need to go through the pain of getting it all set up to see if it might be useful for me.
  • Supervised Learning of Complete Morphological Paradigms (N13-1138 by Greg Durrett; John DeNero)
    Basic idea: collect morphological paradigms from Wiktionary and then train a supervised system to generalize from those to novel words. Works remarkably well and the model is well thought out. Plus I like papers that take morphology seriously: I wish we saw more stuff like this in NAACL.
And though I don't often do this, I have to mention the following paper because although I didn't see the talk or read the paper, enough independent people pointed it out to me as great that I figured I'd mention it:
  • Improved Reordering for Phrase-Based Translation using Sparse Features (N13-1003 by Colin Cherry)
Anyone else with favorites should comment!

30 April 2013

What is a sparse difference in probability distributions?

Sparsity has been all the rage for a couple of years now. The standard notion of a "sparse" vector u is that the number of non-zeros in u is small. This is simply the l_0 norm of u, ||u||_0. This norm is well studied, known to be non-convex, and often relaxed to the l_1 norm of u, ||u||_1: the sum of absolute values. (Which has the nice property of being the "tightest" convex approximation to l_0.)

In some circumstances, it might not be that most of u is zero, but simply that most of u is some fixed scalar constant a. The "non-constant" norm of u would be something like "the number of components that are not equal to some a" or, in more mathy terms, "min_a ||u-a||_0". I first came across this in the Differentiable sparse coding paper by Bradley and Bagnell (NIPS 2008), where they claim physicists call it "Maximum entropy on the mean" if you're using a KL distance, but I don't know if there's a name for it in the l_0 sense, so I'll call it "l_0 from the mode," which seems to capture the right notion.
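
Here's a tiny numpy illustration of the three notions so far (the vector and the name l0_from_mode are made up for the example):

    import numpy as np

    u = np.array([3.0, 3.0, 0.0, 3.0, 7.0])

    l0 = np.count_nonzero(u)            # number of non-zeros: 4
    l1 = np.abs(u).sum()                # sum of absolute values: 16
    # "l_0 from the mode": min over a of ||u - a||_0, i.e. the number of entries
    # that differ from the most common value (here a = 3, giving 2).
    _, counts = np.unique(u, return_counts=True)
    l0_from_mode = len(u) - counts.max()

    print(l0, l1, l0_from_mode)         # 4 16.0 2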

In many cases, we want to use a sparse norm to define a notion of a sparse distance function between two vectors u and v. For instance d_0(u,v) = ||u-v||_0 would be a reasonable notion of a sparse difference between two vectors.

Now, let's suppose that u and v are probability distributions over a finite event space, so that u_i is the probability of event i. In this case, both u and v belong to the simplex: u_i >= 0 for all i, and sum_i u_i = 1 (which is equivalent to saying ||u||_1 = 1, given the first condition).

It seems natural to ask ourselves whether there is a reasonable notion of sparse difference between two probability distributions. This is something I've thought about, but have not come up with a satisfactory answer. The "natural" thing would be to define d_0(p,q) = ||p-q||_0, where I've switched from u/v to p/q to emphasize that these are distributions. This is certainly reasonable in some sense, but does not correspond to how I usually think of probability distributions, essentially because it ignores the "normalization" effect.

As a simple example, let's suppose that I generate a data set by rolling a die with sides A-F. Suppose it comes up A five times, B twice, C once and F twice. Then my maximum likelihood estimate for p would be [.5, .2, .1, 0, 0, .2]. If I had a different dataset, in which exactly the same thing happened except I got one roll of D in addition to all the other rolls, I would estimate the distribution here as q = [.45, .18, .09, .09, 0, .18], which differs everywhere except the "E" position. The d_0 distance between these two distributions would be 5, which intuitively seems "wrong."

One possible way around this would be to allow us to rescale the distributions before computing the l_0 distance, something like d(p,q) = min_(a,b > 0) || ap - bq ||_0. Since l_0 is the underlying norm, only the ratio of a to b matters, so this can be simplified to min_(0 < a < 1) || ap - (1-a)q ||_0. Note that the strict inequalities on "a" are necessary: with equality, you could "erase" all of p (or q) by setting a to zero (or one), which is definitely not desirable. Fortunately, the optimal "a" can be computed in linear time in the dimensionality: it's just the value that maximizes the number of i for which ap_i = (1-a)q_i, i.e., the mode of q_i / (p_i + q_i). This definition captures the example above and gives a distance of "1" (corresponding to a = mode(q / (p+q)) = 10/21 ≈ 0.476).
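
In code this is only a few lines. Here's a sketch (the rounding is just there to absorb floating-point noise when grouping the ratios, and the degenerate case where the best "a" would be 0 or 1 isn't handled):

    import numpy as np
    from collections import Counter

    def rescaled_l0(p, q, decimals=9):
        """d(p,q) = min over 0<a<1 of ||a*p - (1-a)*q||_0, via the mode of q_i/(p_i+q_i)."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        with np.errstate(invalid="ignore"):
            ratios = q / (p + q)                  # a*p_i = (1-a)*q_i  <=>  a = q_i / (p_i + q_i)
        # Components with p_i = q_i = 0 are matched by every a, so drop their NaN ratios.
        ratios = np.round(ratios[~np.isnan(ratios)], decimals)
        a, matched = Counter(ratios).most_common(1)[0]
        return len(ratios) - matched, float(a)

    p = [0.5, 0.2, 0.1, 0, 0, 0.2]                # the first dice example above
    q = [5/11, 2/11, 1/11, 1/11, 0, 2/11]         # the second one (one extra roll of D)
    print(rescaled_l0(p, q))                      # (1, 0.476190476) -- i.e. a = 10/21, distance 1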

We can attempt to make this definition more formal by invoking the notion of exponential families. Recall that an expfam distribution has the form log p(x;t) = t*f(x) - A(t) + B(x), where t are the natural parameters, f is the vector of sufficient statistics, * denotes dot product, and A() is the log partition function. A quick observation is that depending on how f looks, there might be lots of t that give rise to the same probability distribution. If we guarantee that the fs are always linearly independent, then this representation is called "minimal"; otherwise it is "overcomplete." (Formally, it is overcomplete if there exists a non-zero vector a for which a*f(x) is constant for all x.)

For a categorical distribution, an overcomplete representation would be a six-dimensional indicator of which side of the die came up, and the natural parameters would be the (log of the) ps and qs shown above (it's overcomplete because one component is completely determined by the other 5). A minimal representation pivots on one category, say F, and uses t_i = log(p_i / p_F). The example p above would then have exponential form with t=log[2.5, 1, 0.5, 0, 0] and log partition function A(t) = -log(0.2) = log 5. (See Wikipedia's variant 3 for categorical distributions.) And q would have t=log[2.5, 1, 0.5, 0.5, 0] with A(t) = log 5.5.

Supposing that we're working with a minimal representation, what I think this amounts to is that we want the difference in natural parameters to be sparse in the "l_0 from the mode" sense -- the difference is equal to a constant in all but a few components. In the above example, the difference in natural parameters is equal to 0 in the first three components, -Infinity in the fourth, and something like zero in the last (who knows -- NaN?, since it's -Infinity minus -Infinity), which gives a distance of one (if you discard the last one).

The problem here is that there is not a unique minimal representation; I could also have pivoted on A, choosing t=log[.2/.5, .1/.5, 0, 0, .2/.5] = log[.4, .2, 0, 0, .4] and A=log 2 for p, and t=log[.4, .2, .2, 0, .4] and A=log 2.2 for q. In this case the mode of the difference is again 0 (in the first, second and last components), and there's one -Infinity and one NaN. So the answer works out the same, at least if you know how to deal with the NaNs.
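
And here's a quick numeric check of the pivot-on-F computation above (again just a sketch: the rounding, and the choice to count the -Infinity but drop the NaN, are simply one way to operationalize the discussion):

    import numpy as np

    def natural_params(p):
        """Variant-3 natural parameters, pivoting on the last category: t_i = log(p_i / p_last)."""
        p = np.asarray(p, dtype=float)
        with np.errstate(divide="ignore"):
            return np.log(p / p[-1])[:-1]

    tp = natural_params([0.5, 0.2, 0.1, 0, 0, 0.2])          # log [2.5, 1, .5, 0, 0]
    tq = natural_params([5/11, 2/11, 1/11, 1/11, 0, 2/11])   # log [2.5, 1, .5, .5, 0]
    with np.errstate(invalid="ignore"):
        diff = np.round(tp - tq, 12)                         # [0, 0, 0, -inf, nan]

    finite = diff[np.isfinite(diff)]
    _, counts = np.unique(finite, return_counts=True)
    dist = (len(finite) - counts.max()) + np.isinf(diff).sum()
    print(dist)                                              # 1, matching the text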

There's an obviously needed lemma here: that any minimal representation will give you the same distance. I haven't thought nearly long enough about why this should be true (i.e., I've thought about it for about 30 seconds :P). There are also a lot of other questions: what if you relax this notion of sparsity to something more l_1-like? What other properties does it have? And of course is there a better notion of sparse difference of probability distributions for this or other circumstances?