10 March 2011

Postdoc Position at CLIP (@UMD)

Okay, now is when I take serious unfair advantage of having this blog.  We have a postdoc opening.  See the official ad below for details:

A postdoc position is available in the Computational Linguistics and
Information Processing (CLIP) Laboratory in the Institute for Advanced
Computer Studies at University of Maryland.  We are seeking a talented
researcher in natural language processing, with strong interests in
the processing of scientific literature.

A successful candidate should have a strong NLP background with a
track record of top-tier research publications.  A Ph.D. in computer
science and strong organizational and coordination skills are a must.
In addition to pursuing original research in scientific literature
processing, the ideal candidate will coordinate the efforts of the
other members of that project.  While not necessary, experience in one
or more of the following areas is highly advantageous: summarization,
NLP or data mining for scientific literature, machine learning, and
the use of linguistic knowledge in computational systems.
Additionally, experience with large-data NLP and system building will
be considered favorably.

The successful candidate will work closely with current CLIP faculty,
especially Bonnie Dorr, Hal Daume III and Ken Fleischmann, while
interacting with a large team involving NLP researchers across several
other prominent institutions.  The duration of the position is one
year, starting Summer or Fall 2011, and is potentially extendible.

CLIP is a dynamic interdisciplinary computational linguistics
program with faculty from across the university, and major research
efforts in machine translation, information retrieval, semantic
analysis, generation, and development of large-scale statistical
language processing tools.

Please send a CV and names and contact information of 3 referees,
preferably by e-mail, to:

    Jessica Touchard
    jessica AT cs DOT umd DOT edu
    Department of Computer Science
    A.V. Williams Building, Room 1103
    University of Maryland
    College Park, MD 20742

Specific questions about the position may be addressed to Hal Daume
III at hal AT umiacs DOT umd DOT edu.

08 March 2011

Some thoughts on supplementary materials

Having the option of authors submitting supplementary materials is becoming popular in NLP/ML land.  NIPS was one of the first conferences I submitted to that allowed this; I think ACL allowed it this past year, at least for specific types of materials (code, data), and EMNLP is thinking of allowing it at some point in the near future.

Here is a snippet of the NIPS call for papers (see section 5) that describes the role of supplementary materials:

In addition to the submitted PDF paper, authors can additionally submit supplementary material for their paper... Such extra material may include long technical proofs that do not fit into the paper, image, audio or video sample outputs from your algorithm, animations that describe your algorithm, details of experimental results, or even source code for running experiments.  Note that the reviewers and the program committee reserve the right to judge the paper solely on the basis of the 8 pages, 9 pages including citations, of the paper; looking at any extra material is up to the discretion of the reviewers and is not required.
(Emphasis mine.)  Now, before everyone goes misinterpreting what I'm about to say, let me make it clear that in general I like the idea of supplementary materials, given our current publishing model.

You can think of the emphasized part of the call as a form of reviewer protection.  It basically says: look, we know that reviewers are overloaded; if your paper isn't very interesting, the reviewers aren't required to read the supplement.  (As an aside, I feel the same thing happens with pages 2-8 given page 1 in a lot of cases :P.)

I think it's good to have such a form of reviewer protection.  What I wonder is whether it also makes sense to add a form of author protection.  In other words, the current policy -- which seems to be explicitly stated only in the case of NIPS, but seems to be generally understood elsewhere, too -- is that reviewers are protected from overzealous authors.  I think we need to have additional clauses that protect authors from overzealous reviewers.

Why?  Already I get annoyed with reviewers who seem to think that extra experiments, discussion, proofs or whatever can somehow magically fit in an already crammed 8 page paper.  A general suggestion to reviewers is that if you're suggesting things to add, you should also suggest things to cut.

This situation is exacerbated infinity-fold with the "option" of supplementary material.  There now is no length-limit reason why an author couldn't include everything under the sun.  And it's too easy for a reviewer just to say that XYZ should have been included because, well, it could just have gone in the supplementary material!

So what I'm proposing is that supplementary material clauses should have two forms of protection.  The first being the existing one, protecting reviewers from overzealous authors.  The second being the reverse, something like:
Authors are not obligated to include supplementary materials.  The paper should stand on its own, excluding any supplement.  Reviewers must take into account the strict 8 page limit when evaluating papers.
Or something like that: the wording isn't quite right.  But without this, I fear that supplementary materials will, in the limit, simply turn into an arms race.

02 March 2011

Grad school survey, revisited

You may recall a while ago I ran a survey on where people applied to grad school. Obviously I've been sitting on these results for a while now, but I figured since it's that time of year when people are choosing grad schools, that I would say how things turned out.  Here's a summary of things that people thought were most important (deciding factor), and moderately important (contributing factor, in parens):

  • Academic Program
    • Specialty degree programs in my research area, 48%
    • (Availability of interesting courses, 16%)
    • (Time to completion, 4%)
  • Application Process
    • Nothing 
  • Faculty Member(s)
    • Read research papers by faculty member, 44%
  • Geographic Area
    • (Outside interests/personal preference, 15%)
  • Recommendations from People 
    • Professors in technical area, 45%
    • (Teachers/academic advisors, 32%)
    • (Technical colleagues, 20%)
  • Reputation
    • ... of research group, 61%
    • ... of department/college, 50%
    • (Ranking of university, 35%)
    • (Reputation of university, 34%)
  • Research Group
    • Research group works on interesting problems, 55%
    • Many faculty in a specialty area (eg., ML), 44%
    • (Many faculty/students in general area (eg., AI), 33%)
    • (Research group publishes a lot, 26%)
  • Web Presence
    • (Learned about group via web search, 37%)
    • (Learned about dept/univ via web search, 24%)
  • General
    • Funding availability, 49%
    • (High likelihood of being accepted, 12%)
    • (Size of dept/university, 5%)
Overall these seem pretty reasonable.  And of course they all point to the fact that everyone should come to Maryland :P.  Except that we don't have specialty degree programs, but that's the one thing on the list that I actually think is a bit silly: it might make sense for an MS, but I don't really think it should be an important consideration for a Ph.D.  You can get the full results if you want to read them, along with the comments: they're pretty interesting, IMO.

28 February 2011

What are your plans between ACL and ICML?

I'll tell you what they should be: attending the Symposium on Machine Learning in Speech and Language Processing, jointly sponsored by IMLS, ICML and ISCA, that I'm co-organizing with Dan Roth, Geoff Zweig and Joseph Keshet (the exact date is June 27, in the same venue as ICML in Bellevue, Washington).  So far we've got a great list of invited speakers from all of these areas, including Mark Steedman, Stan Chen, Yoshua Bengio, Lawrence Saul, Sanjoy Dasgupta and more.  (See the web page for more details.)  We'll also be organizing some sort of day trips (together with the local organizers of ICML) for people who want to join!  You should also consider submitting papers (deadline is April 15).

I know I said a month ago that I would blog more.  I guess that turned out to be a lie.  The problem is that I only have so much patience for writing and I've been spending a lot of time writing non-blog things recently.  I decided to use my time off from teaching to do something far more time-consuming than teaching.  This has been a wondrously useful exercise for me and I hope that, perhaps starting in 2012, other people can take advantage of this work.

17 January 2011

Parsing with Transformations

I remember when I took my first "real" Syntax class, where by "real" I mean "Chomskyan." It was at USC in Fall 2001, taught by Roumyana Pancheva. It was hard as hell but I loved it. However, as a computationally minded guy, I remember snickering to myself the whole time we were talking about movements that get you from deep structure to surface structure. This stuff was all computationally ridiculous.

But why was it computationally ridiculous? It was ridiculous because my mindset, and I think the mindset of most computational folks at the time, was that of n^3 CKY or Earley style parsing: namely, exact parsing in a context-free manner. This whole idea of transformations would kill anything like that in a very bad way.
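To make the contrast concrete, here's a minimal CKY recognizer sketch in Python (the grammar, lexicon and sentence are toy examples of my own, not from any real system); the triple loop over span lengths, span starts and split points is where the n^3 comes from:

    from collections import defaultdict

    # Toy CNF grammar: binary rules (B, C) -> A and a word-to-tag lexicon.
    # Both are made up purely for illustration.
    binary_rules = {("NP", "VP"): "S", ("Det", "N"): "NP", ("V", "NP"): "VP"}
    lexicon = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"}}

    def cky_recognize(words):
        n = len(words)
        chart = defaultdict(set)                  # chart[(i, j)] = labels spanning words[i:j]
        for i, w in enumerate(words):
            chart[(i, i + 1)] |= lexicon.get(w, set())
        for length in range(2, n + 1):            # span length
            for i in range(n - length + 1):       # span start
                j = i + length
                for k in range(i + 1, j):         # split point: the third nested loop -> O(n^3)
                    for b in chart[(i, k)]:
                        for c in chart[(k, j)]:
                            if (b, c) in binary_rules:
                                chart[(i, j)].add(binary_rules[(b, c)])
        return "S" in chart[(0, n)]

    print(cky_recognize("the dog saw the cat".split()))   # True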

However, there's been a recent shift in attitudes. Sure, people still do their n^3 parsing, but of course none of it is exact anyway (due to pruning). But more than that, things like linear time parsing algorithms, as popularized by people like Joakim Nivre, Kenji Sagae, Brian Roark and Joseph Turian, have proved very useful. They work well, are incredibly efficient, and are easy to implement. They're also a bit more psychologically plausible (as Eugene Charniak said recently, "we don't know what people are doing, but they're definitely not doing CKY").
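For contrast, here's the skeleton of a greedy arc-standard (shift-reduce) dependency parser; the classifier is stubbed out with a trivial rule purely to show the mechanics, so treat it as a sketch of the transition loop rather than anyone's actual parser. Each word gets shifted once and reduced at most once, which is where the linear time comes from:

    def toy_classifier(stack, buffer):
        # Stand-in for a learned model that would predict an action from
        # features of the stack and buffer; here: shift while words remain,
        # then attach each stack item to the word beneath it.
        return "SHIFT" if buffer else "RIGHT-ARC"

    def parse(words):
        stack, buf, arcs = [], list(range(len(words))), []
        while buf or len(stack) > 1:
            if len(stack) < 2 or toy_classifier(stack, buf) == "SHIFT":
                stack.append(buf.pop(0))             # SHIFT
            else:
                head, dep = stack[-2], stack[-1]     # RIGHT-ARC
                arcs.append((head, dep))
                stack.pop()
        return arcs

    print(parse("put your head in a noose".split()))  # a toy chain of (head, dependent) arcs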

So I'm led to wonder: could we actually do parsing in a transformational grammar using all the new stuff we know about (for instance) left-to-right parsing?

One thing that stands in our way, of course, is the stupid Penn Treebank, which was annotated only with very simple transformations (mostly noun phrase movements) and not really "deep" transformations as most Chomskyan linguists would recognize them.

But I think you could still do it.  It would end up being partially unsupervised, but at least from a minimum description length perspective, I can either spend weights learning more special cases, or I can learn general transformational rules.  It would take some thought and effort to write it out and figure out how to actually optimize such a thing, but I bet it could be done in a semester.

So then the question is: aside from smaller models (potentially), is there any other reason to do it?

I can think of at least one: parsing non-declarative sentences.  Since almost all sentences in the Treebank are declarative, parsers do pretty crappy when tested on other things.  Slav Petrov had a paper at EMNLP 2010 on parsing questions.  Here is the abstract, which says pretty much everything:

... We show that dependency parsers have more difficulty parsing questions than constituency parsers. In particular, deterministic shift-reduce dependency parsers ... drop to 60% labeled accuracy on a question test set. We propose an uptraining procedure in which a deterministic parser is trained on the output of a more accurate, but slower, latent variable constituency parser (converted to dependencies). Uptraining with 100K unlabeled questions achieves results comparable to having 2K labeled questions for training. With 100K unlabeled and 2K labeled questions, uptraining is able to improve parsing accuracy to 84%, closing the gap between in-domain and out-of-domain performance.
Now, at least in principle, if you can parse declarative sentences, you should be able to parse questions.  At least if you know about some basic syntactic transformations in English.  (As an aside, the "uptraining" idea is almost exactly the same as the structure compilation idea that Percy, Dan and I had at ICML 2008, though Slav and colleagues apply it to a domain adaptation problem, while we just did simple semi-supervised learning.)

We have observed similar effects in the parsing of commands, such as "Put your head in a noose" where parsers -- even constituency ones -- really really want "Put" to be a noun!  Again, if you know simple transformations -- like subject dropping -- you should be able to parse commands if you can already parse declarations.

As with any generalization, the hope is that by realizing the generalization, you don't need to store so many specific cases.  So if you can learn that commands and questions are simple transformation on declarative sentences, and you can learn to parse declaratives, you should be able to handle the other case.
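To make "simple transformation" concrete, here's a toy, string-level sketch of the two transformations mentioned above (subject dropping for commands, subject-auxiliary inversion for questions). The real thing would operate over parses and learned rules, not hand-written string hacks, so this is purely an illustration:

    # Toy illustration only: real transformations would apply to trees, not strings.
    def to_command(decl_tokens):
        # Imperatives drop the subject: "you put your head in a noose" -> "put your head in a noose"
        return decl_tokens[1:]

    def to_question(decl_tokens):
        # Yes/no questions invert subject and auxiliary:
        # "you will put your head in a noose" -> "will you put your head in a noose"
        subj, aux, rest = decl_tokens[0], decl_tokens[1], decl_tokens[2:]
        return [aux, subj] + rest

    print(" ".join(to_command("you put your head in a noose".split())))
    print(" ".join(to_question("you will put your head in a noose".split())))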

(Anticipating comments: yes, I know you could try to pre-transform your data, like they do in MT, but that's quite inelegant.  And yes, I know you could probably take the treebank and turn a lot of the sentences into commands or questions to create a new data set.  But that's kind of missing the point: I don't want to just handle commands or questions... I want to handle anything, even things that I might not have anticipated.)

11 January 2011

NIPS 2010 Retrospective

Happy New Year!  I know I've been silent, but I've been busy.  But no teaching this semester (YAY!) so maybe you'll see more posts.

At any rate, I'm really late to the table, but here are my comments about this past year's NIPS.  Before we get to that, I hope that everyone knows that this coming NIPS will be in Granada, and then for (at least) the next five years will be in Tahoe.  Now that I'm not in ski-land, it's nice to have a yearly ski vacation ... erm I mean scientific conference.

But since this was the last year of NIPS in Vancouver, I thought I'd share a conversation that occurred this year at NIPS, with participants anonymized.  (I hope everyone knows to take this in good humor: I'm perfectly happy to poke fun at people from the States, too...).  The context is that one person in a large group, which was going to find lunch, had a cell phone with a data plan that worked in Canada:

A: Wow, that map is really taking a long time to load.
B: I know.  It's probably some socialized Canadian WiFi service.
C: No, it's probably just slow because every third bit has to be a Canadian bit?
D: No no, it's because every bit has to be sent in both English and French!
Okay it's not that funny, but it was funny at the time.  (And really "B" is as much a joke about the US as it was about Canada :P.)

But I'm sure you are here to hear about papers, not stupid Canada jokes.  So here's my take.

The tutorial on optimization by Stephen Wright was awesome.  I hope this shows up on videolectures soon. (Update: it has!) I will make it required reading / watching for students.  There's just too much great stuff in it to go in to, but how about this: momentum is the same as CG!  Really?!?!  There's tons of stuff that I want to look more deeply into, such as robust mirror descent, some work by Candes about SVD when we don't care about near-zero SVs, regularized stochastic gradient (Xiao) and sparse eigenvector work.  Lots of awesome stuff.  My favorite part of NIPS.

Some papers I saw that I really liked:

A Theory of Multiclass Boosting (Indraneel Mukherjee, Robert Schapire): Formalizes boosting in a multiclass setting.  The crux is a clever generalization of the "weak learning" notion from binary.  The idea is that a weak binary classifier is one that has a small advantage over random guessing (which, in the binary case, gives 50/50).  Generalize this and it works.

Structured sparsity-inducing norms through submodular functions (Francis Bach): I need to read this.  This was one of those talks where I understood the first half and then got lost.  But the idea is that you can go back-and-forth between submodular functions and sparsity-inducing norms.

Construction of Dependent Dirichlet Processes based on Poisson Processes (Dahua Lin, Eric Grimson, John Fisher): The title says it all!  It's an alternative construction to the Polya urn scheme and also to the stick-breaking scheme.

A Reduction from Apprenticeship Learning to Classification (Umar Syed, Robert Schapire): Right up my alley, some surprising results about apprenticeship learning (aka Hal's version of structured prediction) and classification.  Similar to a recent paper by Stephane Ross and Drew Bagnell on Efficient Reductions for Imitation Learning.

Variational Inference over Combinatorial Spaces (Alexandre Bouchard-Cote, Michael Jordan): When you have complex combinatorial spaces (think traveling salesman), how can you construct generic variational inference algorithms?

Implicit Differentiation by Perturbation (Justin Domke): This is a great example of a paper that I never would have read, looked at, seen, visited the poster of, known about etc., were it not for serendipity at conferences (basically Justin was the only person at his poster when I showed up early for the session, so I got to see this poster).  The idea is this: you have a graphical model and some loss function L(.) defined over the marginals mu(theta), where theta are the parameters of the model, and you want to optimize L(mu(theta)) as a function of theta.  Without making any serious assumptions about the form of L, you can actually do gradient descent, where each gradient computation costs two runs of belief propagation.  I think this is amazing.
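If I understand the trick correctly, it rests on the marginals being the gradient of the log-partition function, so their Jacobian is symmetric and the chain rule becomes a directional derivative of the marginals themselves, which you can approximate by perturbing the parameters and re-running inference. A rough sketch, in my own notation (so take the details with a grain of salt):

    % mu(theta) = grad A(theta), so d mu / d theta = Hessian of A, which is symmetric:
    \begin{align*}
    \nabla_\theta L(\mu(\theta))
      &= \Big(\tfrac{\partial \mu}{\partial \theta}\Big)^{\!\top} \nabla_\mu L
       \;=\; \tfrac{\partial \mu}{\partial \theta}\, \nabla_\mu L \\
      &\approx \frac{\mu(\theta + r\,\nabla_\mu L) \;-\; \mu(\theta - r\,\nabla_\mu L)}{2r}
       \qquad \text{for small } r,
    \end{align*}
    % i.e., each gradient costs roughly two extra runs of belief propagation,
    % one per perturbed set of marginals.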

Probabilistic Deterministic Infinite Automata (David Pfau, Nicholas Bartlett, Frank Wood): Another one where the title says it all.  DP-style construction of infinite automata.

Graph-Valued Regression (Han Liu, Xi Chen, John Lafferty, Larry Wasserman): The idea here is to define a regression function over a graph.  It should be regularized in a sensible way.  Very LASSO-esque model, as you might expect given the author list :).

Other papers I saw that I liked but not enough to write mini summaries of:

Word Features for Latent Dirichlet Allocation (James Petterson, Alexander Smola, Tiberio Caetano, Wray Buntine, Shravan Narayanamurthy)
Tree-Structured Stick Breaking for Hierarchical Data (Ryan Adams, Zoubin Ghahramani, Michael Jordan)
Categories and Functional Units: An Infinite Hierarchical Model for Brain Activations (Danial Lashkari, Ramesh Sridharan, Polina Golland)
Trading off Mistakes and Don't-Know Predictions (Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum)
Joint Analysis of Time-Evolving Binary Matrices and Associated Documents (Eric Wang, Dehong Liu, Jorge Silva, David Dunson, Lawrence Carin)
Learning Efficient Markov Networks (Vibhav Gogate, William Webb, Pedro Domingos)
Supervised Clustering (Pranjal Awasthi, Reza Bosagh Zadeh)

Two students who work with me (though one isn't actually mine :P) and who went to NIPS also shared their favorite papers.  The first is a list from Avishek Saha:

A Theory of Multiclass Boosting (Indraneel Mukherjee, Robert Schapire)

Repeated Games against Budgeted Adversaries (Jacob Abernethy, Manfred Warmuth)

Non-Stochastic Bandit Slate Problems (Satyen Kale, Lev Reyzin, Robert Schapire)

Trading off Mistakes and Don't-Know Predictions (Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum)

Learning Bounds for Importance Weighting (Corinna Cortes, Yishay Mansour, Mehryar Mohri)

Supervised Clustering (Pranjal Awasthi, Reza Bosagh Zadeh)

The second list is from Piyush Rai, who apparently aimed for recall (though not with a lack of precision) :P:

Online Learning: Random Averages, Combinatorial Parameters, and Learnability (Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari): defines several complexity measures for online learning akin to what we have for the batch setting (e.g., Rademacher averages, covering numbers, etc.).

Online Learning in The Manifold of Low-Rank Matrices (Uri Shalit, Daphna Weinshall, Gal Chechik): nice general framework applicable in a number of online learning settings. could also be used for online multitask learning.

Fast global convergence rates of gradient methods for high-dimensional statistical recovery (Alekh Agarwal, Sahand Negahban, Martin Wainwright): shows that the properties of sparse estimation problems that lead to statistical efficiency also lead to computational efficiency, which explains why gradient methods converge faster in practice than the theory guarantees.

Copula Processes (Andrew Wilson, Zoubin Ghahramani): how do you determine the relationship between random variables which could have different marginal distributions (say one has gamma and the other has gaussian distribution)? copula process gives an answer to this.

Graph-Valued Regression (Han Liu, Xi Chen, John Lafferty, Larry Wasserman): usually undirected graph structure learning involves a set of random variables y drawn from a distribution p(y). but what if y depends on another variable x? this paper is about learning the graph structure of the distribution p(y|x=x).

Structured sparsity-inducing norms through submodular functions (Francis Bach): standard sparse recovery uses l1 norm as a convex proxy for the l0 norm (which constrains the number of nonzero coefficients to be small). this paper proposes several more general set functions and their corresponding convex proxies, and links them to known norms.

Trading off Mistakes and Don't-Know Predictions (Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum): an interesting paper -- what if in an online learning setting you could abstain from making a prediction on some of the training examples and just say "i don't know"? on others, you may or may not make the correct prediction. lies somewhere in the middle of always predicting right or wrong (i.e., standard mistake driven online learning) versus the recent work on only predicting correctly or otherwise saying "i don't know".

Variational Inference over Combinatorial Spaces (Alexandre Bouchard-Cote, Michael Jordan): cool paper. applicable to lots of settings.

A Theory of Multiclass Boosting (Indraneel Mukherjee, Robert Schapire): we know that boosting in binary case requires "slightly better than random" weak learners. this paper characterizes conditions on the weak learners for the multi-class case, and also gives a boosting algorithm.

Multitask Learning without Label Correspondences (Novi Quadrianto, Alexander Smola, Tiberio Caetano, S.V.N. Vishwanathan, James Petterson): usually mtl assumes that the output space is the same for all the tasks but in many cases this may not be true. for instance, we may have two related prediction problems on two datasets but the output spaces for both may be different, with some complex (e.g., hierarchical, and potentially time varying) structure. the paper uses a mutual information criterion to learn the correspondence between the output spaces.

Learning Multiple Tasks with a Sparse Matrix-Normal Penalty (Yi Zhang, Jeff Schneider): presents a general multitask learning framework and many recently proposed mtl models turn out to be special cases. models both feature covariance and task covariance matrices.

Efficient algorithms for learning kernels from multiple similarity matrices with general convex loss functions (Achintya Kundu, Vikram Tankasali, Chiranjib Bhattacharyya, Aharon Ben-Tal): the title says it all. :) multiple kernel learning is usually applied in classification setting but due to the applicability of the proposed method for a wide variety of loss functions, one can possibly also use it for unsupervised learning problems as well (e.g., spectral clustering, kernel pca, etc).

Getting lost in space: Large sample analysis of the resistance distance (Ulrike von Luxburg, Agnes Radl, Matthias Hein): large sample analysis of the commute distance: shows the rather surprising result that the commute distance between two vertices is meaningless if the graph is "large" and the nodes represent high dimensional variables. the paper proposes a correction and calls it the "amplified commute distance".

A Bayesian Approach to Concept Drift (Stephen Bach, Mark Maloof): gives a bayesian approach for segmenting a sequence of observations such that each "block" of observations has the same underlying concept.

MAP Estimation for Graphical Models by Likelihood Maximization (Akshat Kumar, Shlomo Zilberstein): they show that you can think of an mrf as a mixture of bayes nets and then the map problem on the mrf corresponds to solving a form of the maximum likelihood problem on the bayes net. em can be used to solve this in a pretty fast manner. they say that you can use this method with the max-product lp algorithms to yield even better solutions, with quicker convergence.

Energy Disaggregation via Discriminative Sparse Coding (J. Zico Kolter, Siddharth Batra, Andrew Ng): about how sparse coding could be used to save energy. :)

Semi-Supervised Learning with Adversarially Missing Label Information (Umar Syed, Ben Taskar): standard ssl assumes that labels for the unlabeled data are missing at random but in many practical settings this isn't actually true. this paper gives an algorithm to deal with the case when the labels could be adversarially missing.

Multi-View Active Learning in the Non-Realizable Case (Wei Wang, Zhi-Hua Zhou): shows that (under certain assumptions) exponential improvements in the sample complexity of active learning are still possible if you have a multiview learning setting.

Self-Paced Learning for Latent Variable Models (M. Pawan Kumar, Benjamin Packer, Daphne Koller): an interesting paper, somewhat similar in spirit to curriculum learning. basically, the paper suggests that in learning a latent variable model, it helps if you provide the algorithm easy examples first.

More data means less inference: A pseudo-max approach to structured learning (David Sontag, Ofer Meshi, Tommi Jaakkola, Amir Globerson): a pseudo-max approach to structured learning: this is somewhat along the lines of the paper on svm's inverse dependence on training size from icml a couple of years back. :)

Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning (Prateek Jain, Sudheendra Vijayanarasimhan, Kristen Grauman): selecting the most uncertain example in a pool based active learning can be expensive if the number of candidate examples is very large. this paper suggests some hashing tricks to expedite the search.

Active Instance Sampling via Matrix Partition (Yuhong Guo): frames batch mode active learning as a matrix partitioning problem and proposes a local optimization technique for the matrix partitioning problem.

A Discriminative Latent Model of Image Region and Object Tag Correspondence (Yang Wang, Greg Mori): it's kind of doing correspondence lda on image+captions but they additionally infer the correspondences between tags and objects in the images, and show that this gives improvements over corr-lda.

Factorized Latent Spaces with Structured Sparsity (Yangqing Jia, Mathieu Salzmann, Trevor Darrell): a multiview learning algorithm that uses sparse coding to learn shared as well as private features of different views of the data.

Word Features for Latent Dirichlet Allocation (James Petterson, Alexander Smola, Tiberio Caetano, Wray Buntine, Shravan Narayanamurthy): extends lda for the case when you have access to features for each word in the vocabulary

18 November 2010

Crowdsourcing workshop (/tutorial) decisions

Everyone at conferences (with multiple tracks) always complains that there are time slots with nothing interesting, and other time slots with too many interesting papers.  People have suggested crowdsourcing this, enabling participants to say -- well ahead of the conference -- which papers they'd go to... then let an algorithm schedule.

I think there are various issues with this model, but don't want to talk about it.  What I do want to talk about is applying the same ideas to workshop acceptance decisions.  This comes up because I'm one of the two workshop chairs for ACL this year, and because John Langford just pointed to the ICML call for tutorials.  (I think what I have to say applies equally to tutorials as to workshops.)

I feel like a workshop (or tutorial) is successful if it is well attended.  This applies both from a monetary perspective, as well as a scientific perspective.  (Note, though, that I think that small workshops can also be successful, especially if they are fostering a small community, bringing people in, or serving other purposes.  That is to say, size is not all that matters.  But it is a big part of what matters.)

We have 30-odd workshop proposals for three of us to sort through (John Carroll and I are the two workshop chairs for ACL, and Marie Candito is the workshop chair for EMNLP; workshops are being reviewed jointly -- which actually makes the allocation process more difficult).  The idea would be that I could create a poll, like the following:

  1. Are you going to ACL?  Yes, maybe, no
  2. Are you going to EMNLP?  Yes, maybe, no
  3. If workshop A were offered at a conference you were going to, would you go to workshop A?
  4. If workshop B...
  5. And so on
This gives you two forms of information.  First it can help estimate expected attendance (though we ask proposers to estimate that, too, and I think they do a reasonable job if you skew their estimates down by about 10%).  But more importantly, it gives correlations between workshops.  This lets you be sure that you're not scheduling things on top of each other that people might want to go to.  Some of these are obvious (for instance, if we got 10 MT workshop proposals... which didn't actually happen but is moderately conceivable :P), but some are not.  For instance, maybe people who care about annotation also care about ML, but maybe not?  I actually have no idea.
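Just to sketch what you'd do with the responses (all the numbers and the four hypothetical workshops below are made up), you could estimate attendance from column sums and read potential scheduling conflicts off the overlap or correlation matrix:

    import numpy as np

    # Hypothetical poll responses: rows are respondents, columns are workshops
    # (say MT, annotation, ML, parsing), entries are 1 = "would attend".
    responses = np.array([
        [1, 0, 1, 0],
        [1, 0, 1, 1],
        [0, 1, 0, 0],
        [1, 1, 1, 0],
        [0, 0, 1, 1],
    ])

    expected_attendance = responses.sum(axis=0)      # crude per-workshop estimate
    overlap = responses.T @ responses                # how many people want both i and j
    corr = np.corrcoef(responses, rowvar=False)      # correlation between workshops

    print(expected_attendance)
    print(overlap)        # big off-diagonal entries: don't schedule these opposite each other
    print(np.round(corr, 2))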

Of course we're not going to do this this year.  It's too late already, and it would be unfair to publicize all the proposals, given that we didn't tell proposers in advance that we would do so.  And of course I don't think this should exclusively be a popularity contest.  But I do believe that popularity should be a factor.  And it should probably be a reasonably big factor.  Workshop chairs could then use the output of an optimization algorithm as a starting point, and use this as additional data for making decisions.  Especially since two or three people are being asked to make decisions that cover--essentially--all areas of NLP, this actually seems like a good idea to me.

I actually think something like this is more likely to actually happen at a conference like ICML than ACL, since ICML seems (much?) more willing to try new things than ACL (for better or for worse).

But I do think it would be interesting to try, to see what sort of response you get.  Of course, just polling on this blog wouldn't be sufficient: you'd want to spam, perhaps, all of last year's attendees.  But this isn't particularly difficult.

Is there anything I'm not thinking of that would make this obviously not work?  I could imagine someone saying that maybe people won't propose workshops/tutorials if the proposals will be made public?  I find that a bit hard to swallow.  Perhaps there's a small embarrassment factor if you're public and then don't get accepted.  But I wouldn't advocate making the voting results public -- they would be private to the organizers / workshop chairs.

I guess -- I feel like I'm channeling Fernando here? -- that another possible issue is that you might not be able to decide which workshops you'd go to without seeing what papers are there and who is presenting.  This is probably true.  But this is also the same problem that the workshop chairs face anyway: we have to guess that good enough papers/people will be there to make it worthwhile.  I doubt I'm any better at guessing this than any other random NLP person...

So what am I forgetting?

09 November 2010

Managing group papers

Every time a major conference deadline (ACL, NIPS, EMNLP, ICML, etc...) comes around, we usually have a slew of papers (>=3, typically) that are getting prepared.  I would say on average 1 doesn't make it, but that's par for the course.

For AI-Stats, whose deadline just passed, I circulated student paper drafts to all of my folks to solicit comments at any level they desired, anywhere from not understanding the problem/motivation to typos or errors in equations.  My experience was that it was useful, both for distributing some of my workload and getting an alternative perspective, and for keeping everyone abreast of what everyone else is working on.

In fact, it was so successful that two students suggested to me that I require more-or-less complete drafts of papers at least one week in advance so that this can take place.  How you require something like this is another issue, but the suggestion they came up with was that I'll only cover conference travel if this occurs.  It's actually not a bad idea, but I don't know if I'm enough of a hard-ass (or perceived as enough of a hard-ass) to really pull it off.  Maybe I'll try it though.

The bigger question is how to manage such a thing.  I was thinking of installing some conference management software locally (eg., HotCRP, which I really like) and giving students "reviewer" access.  Then, they could upload their drafts, perhaps with an email circulated when a new draft is available, and other students (and me!) could "review" them.  (Again, perhaps with an email circulated -- I'm a big fan of "push" technology: I don't have time to "pull" anymore!)

The only concern I have is that it would be really nice to be able to track updates, or to have the ability for authors to "check off" things that reviewers suggested.  Or to allow discussion.  Or something like that.

I'm curious if anyone has ever tried anything like this and whether it was successful or not.  It seems like if you can get a culture of this established, it could actually be quite useful.

21 October 2010

Comparing Bounds

This is something that's bothered me for quite a while, and I don't know of a good answer.  I used to think it was something that theory people didn't worry about, but then this exact issue was brought up by a reviewer of a theory-heavy paper that we have at NIPS this year (with Avishek Saha and Abhishek Kumar).  There are (at least?) two issues with comparing bounds.  The first is the obvious "these are both upper bounds, what does it mean to compare them?"  The second is the slightly less obvious "but your empirical losses may be totally different" issue.  It's actually the second one that I want to talk about, but I have much less of a good internal feel about it.

Let's say that I'm considering two learning approaches.  Say it's SVMs versus logistic regression.  Both regularized.  Or something.  Doesn't really matter.  At the end of the day, I'll have a bound that looks roughly like:

        expected test error <= empirical training error + f( complexity / N)

Here, f is often "sqrt", but could really be any function.  And N is the number of data points.

Between two algorithms, both "f" and "complexity" can vary.  For instance, one might have a linear dependence on the dimensionality of the data (i.e., complexity looks like O(D), where D is dimensionality) and the other might have a superlinear dependence (eg., O(D log D)).  Or one might have a square root.  Who knows.  Sometimes there's an inf or sup hiding in there, too, for instance in a lot of the margin bounds.
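To make the comparison concrete, here are two hypothetical instantiations of the template above; the constants and the exact forms are invented, just to have something to squint at:

    \begin{align*}
    \text{Algorithm 1:}\quad
      \mathbb{E}[\text{test error}] &\;\le\; \widehat{\mathrm{err}}_1 \;+\; \sqrt{\frac{c\,D}{N}} \\
    \text{Algorithm 2:}\quad
      \mathbb{E}[\text{test error}] &\;\le\; \widehat{\mathrm{err}}_2 \;+\; \sqrt{\frac{c'\,D\log D}{N}}
    \end{align*}
    % The "squinting" comparison below only looks at the D versus D log D dependence;
    % the worry later in the post is that err_1 and err_2 may be measured in
    % different, incomparable losses.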

At the end of the day, we of course want to say "my algorithm is better than your algorithm."  (What else is there in life?)  The standard way to say this is that "my f(complexity / N) looks better than your f'(complexity' / N)."

Here's where two issues crop up.  The first is that our bound is just an upper bound.  For instance, Alice could come up to me and say "I'm thinking of a number between 1 and 10" and Bob could say "I'm thinking of a number between 1 and 100."  Even though the bound is lower for Alice, it doesn't mean that Alice is actually thinking of a smaller number -- maybe Alice is thinking of 9 and Bob of 5.  In this way, the bounds can be misleading.

My general approach with this issue is to squint, as I do for experimental results.  I don't actually care about constant factors: I just care about things like "what does the dependence on D look like."  Since D is usually huge for problems I care about, a linear or sublinear dependence on D looks really good to me.  Beyond that I don't really care.  I especially don't care if the proof techniques are quite similar.  For instance, if they both use Rademacher complexities, then I'm more willing to compare them than if one uses Rademacher complexities and the other uses covering numbers.  They somehow feel more comparable: I'm less likely to believe that the differences are due to the method of analysis.

(You can also get around this issue with some techniques, like Rademacher complexities, which give you both upper and lower bounds, but I don't think anyone really does that...)

The other issue I don't have as good a feeling for.  The issue is that we're entirely ignoring the "empirical training error" question.  In fact, this is often measured differently between different algorithms!  For instance, for SVMs, the formal statement is more like "expected 0/1 loss on test <= empirical hinge loss on training + ..."  Whereas for logistic regression, you might be comparing expected 0/1 loss with empirical log loss.

Now I really don't know what to do.

We ran into this issue because we were trying to compare some bounds between EasyAdapt and a simple model trained just on source data.  The problem is that the source training error might be totally incomparable to the (source + target) training error.

But the issue is for sure more general.  For instance, what if your training error is measured in squared error?  Now this can be huge when hinge loss is still rather small.  In fact, your squared error could be quadratically large in your hinge loss.  Actually it could be arbitrarily larger, since hinge goes to zero for any sufficiently correct classification, but squared error does not.  (Neither does log loss.)
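Here's a quick numeric check of that claim, writing each loss as a function of the margin m = y*f(x), with squared error for +/-1 labels written as (1-m)^2:

    import math

    def hinge(m):   return max(0.0, 1.0 - m)
    def squared(m): return (1.0 - m) ** 2             # (y - f(x))^2 for y in {-1, +1}
    def logloss(m): return math.log(1.0 + math.exp(-m))

    for m in [1.0, 2.0, 10.0, 100.0]:
        print(m, hinge(m), squared(m), logloss(m))
    # hinge is exactly 0 for any margin >= 1, squared error grows like m^2,
    # and log loss never quite reaches 0.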

This worries me greatly, much more than the issue of comparing upper bounds.

Does this bother everyone, or is it just me?  Is there a good way to think about this that gets you out of this conundrum?

05 October 2010

My Giant Reviewing Error

I try to be a good reviewer, but like everything, reviewing is a learning process.  About five years ago, I was reviewing a journal paper and made an error.  I don't want to give up anonymity in this post, so I'm going to be vague in places that don't matter.

I was reviewing a paper, which I thought was overall pretty strong.  I thought there was an interesting connection to some paper from Alice Smith (not the author's real name) in the past few years and mentioned this in my review.  Not a connection that made the current paper irrelevant, but something the authors should probably talk about.  In the revision response, the authors said that they had looked to try to find Smith's paper, but could not figure out which one I was talking about, and asked for a pointer.  I spent the next five hours looking for the reference and couldn't find it myself.  It turns out that actually I was thinking of a paper by Bob Jones, so I provided that citation.  But the Jones paper wasn't even as relevant as it seemed at the time I wrote the review, so I apologized and told the authors they didn't really need to cover it that closely.

Now, you might be thinking to yourself: aha, now I know that Hal was the reviewer of my paper!  I remember that happening to me!

But, sadly, this is not true.  I get reviews like this all the time, and I feel it's one of the most irresponsible things reviewers can do.  In fact, I don't think a single reviewing cycle has passed where I don't get a review like this.  The problem with such reviews is that it enables a reviewer to make whatever claim they want, without any expectation that they have to back it up.  And the claims are usually wrong.  They're not necessarily being mean (I wasn't trying to be mean), but sometimes they are.

Here are some of the most ridiculous cases I've seen.  I mention these just to show how often this problem occurs.  These are all on papers of mine.

  • One reviewer wrote "This idea is so obvious this must have been done before."  This is probably the most humorous example I've seen, but the reviewer was clearly serious.  And no, this was not in a review for one of the "frustratingly easy" papers.
  • In an NSF grant review for an educational proposal, we were informed by 4 of 7 reviewers (who each wrote about a paragraph) that our ideas had been done in SIGCSE several times.  Before submitting, we had skimmed/read the past 8 years of SIGCSE and could find nothing.  (Maybe it's true and we just were looking in the wrong place, but that still isn't helpful.)  In the end, it strongly seemed that this was basically their way of saying "you are not one of us."
  • In a paper on technique X for task A, we were told hands down that it's well known that technique Y works better, with no citations.  The paper was rejected, we went and implemented Y, and found that it worked worse on task A.  We later found one paper saying that Y works better than X on task B, for B fairly different from A.
  • In another paper, we were told that what we were doing had been done before and in this case a citation was provided.  The citation was to one of our own papers, and it was quite different by any reasonable metric.  At least a citation was provided, but it was clear that the reviewer hadn't bothered reading it.
  • We were told that we missed an enormous amount of related work that could be found by a simple web search.  I've written such things in reviews, often saying something like "search for 'non-parametric Bayesian'" or something like that.  But here, no keywords were provided.  It's entirely possible (especially when someone moves into a new domain) that you can miss a large body of related work because you don't know how to find it: that's fine -- just tell me how to find it if you don't want to actually provide citations.
There are other examples I could cite from my own experience, but I think you get the idea.

I'm posting this not to gripe (though it's always fun to gripe about reviewing), but to try to draw attention to this problem.  It's really just an issue of laziness.  If I had bothered trying to look up a reference for Alice Smith's paper, I would have immediately realized I was wrong.  But I was lazy.  Luckily this didn't really adversely affect the outcome of the acceptance of this paper (journals are useful in that way -- authors can push back -- and yes, I know you can do this in author responses too, but you really need two rounds to make it work in this case).

I've really really tried ever since my experience above to not ever do this again.  And I would encourage future reviewers to try to avoid the temptation to do this: you may find your memory isn't as good as you think.  I would also encourage area chairs and co-reviewers to push their colleagues to actually provide citations for otherwise unsubstantiated claims.

24 September 2010

ACL / ICML Symposium?

ACL 2011 ends on June 24, in Portland (that's a Friday). ICML 2011 begins on June 28, near Seattle (the following Tuesday). This is pretty much as close to a co-location as we're probably going to get in a long time. A few folks have been discussing the possibility of having a joint NLP/ML symposium in between. The current thought is to have it on June 27 at the ICML venue (for various logistical reasons).  There are buses and trains that run between the two cities, and we might even be able to charter some buses.

One worry is that it might only attract ICML folks due to the weekend between the end of ACL and the beginning of said symposium. As a NLPer/MLer, I believe in data. So please provide data by filling out the form below and, if you wish, adding comments.

If you wouldn't attend any, you don't need to fill out the poll :).

The last option is there if you want to tell me "I'm going to go to ACL, and I'd really like to go to the symposium, but the change in venue and the intervening weekend is too problematic to make it possible."

Which would you attend?
Only ACL
Only ICML
ACL and the symposium
ICML and the symposium
Only ACL and ICML
All three
Only ACL, because of convenience



  

15 September 2010

Very sad news....

I heard earlier this morning that Fred Jelinek passed away last night.  Apparently he had been working during the day: a tenacious aspect of Fred that probably has a lot to do with his many successes.

Fred is probably most famous for the infamous "Every time I fire a linguist the performance of the recognizer improves" quote, which Jurafsky+Martin's textbook says is actually supposed to be the more innocuous "Anytime a linguist leaves the group the recognition rate goes up."  And in Fred's 2009 ACL Lifetime Achievement Award speech, he basically said that such a thing never happened.  I doubt that will have any effect on how much the story is told.

Fred has had a remarkable influence on the field.  So much so that I won't attempt to list anything here: you can find out all about him all over the internet.  Let me just say that the first time I met him, I was intimidated.  Not only because he was Fred, but because I knew (and still know) next to nothing about speech, and the conversation inevitably turned to speech.  Here's roughly how a segment of our conversation went:

Hal: What new projects are going on these days?
Fred: (Excitedly.)  We have a really exciting new speech recognition problem.  We're trying to map speech signals directly to fluent text.
Hal: (Really confused.) Isn't that the speech recognition problem?
Fred: (Playing the "teacher role" now.)  Normally when you transcribe speech, you end up with a transcript that includes disfluencies like "uh" and "um" and also false starts [Ed note: like "I went... I went to the um store"].
Hal: So now you want to produce the actual fluent sentence, not the one that was spoken?
Fred: Right.

Apparently (who knew) in speech recognition you try to transcribe disfluencies and are penalized for missing them!  We then talked for a while about how they were doing this, and other fun topics.

A few weeks later, I got a voicemail on my home message machine from Fred.  That was probably one of the coolest things that has ever happened to me in life.  I actually saved it (but subsequently lost it, which saddens me greatly).  The content is irrelevant: the point is that Fred -- Fred! -- called me -- me! -- at home!  Amazing.

I'm sure that there are lots of other folks who knew Fred better than I did, and they can add their own stories in comments if they'd like.  Fred was a great asset to the field, and I will certainly miss his physical presence in the future, though his work will doubtless continue to affect the field for years and decades to come.

13 September 2010

AIStats 2011 Call for Papers

The full call, and some changes to the reviewing process. The submission deadline is Nov 1, and the conference is April 11-13, in Fort Lauderdale, Florida. Promises to be warm :).

The changes to the reviewing process are interesting.  Basically the main change is that the author response is replaced by a journal-esque "revise and resubmit."  That is, you get 2 reviews, edit your paper, submit a new version, and get a 3rd review.  The hope is that this will reduce author frustration from the low bandwidth of author response.  Like with a journal, you'll also submit a "diff" saying what you've changed.  I can see this going really well: the third reviewer will presumably see a (much) better paper than the first two did.  The disadvantage, which irked me at ICML last year, is that it often seemed like the third reviewer made the deciding call, and I would want to make sure that the first two reviewers also see the updated version.  I can also see it going poorly: authors invest even more time in "responding" and no one listens.  That would just mean increased frustration :).

The other change is that there'll be more awards.  I'm very much in favor of this; I spent two years on the NAACL exec trying to get NAACL to do the same thing, but always got voted down :).  Oh well.  The reason I think it's a good idea is two-fold.  First, I think we're bad at selecting single best papers: a committee decision can often lead to selecting the least offensive paper rather than one that really pushes the boundary.  I also think there are lots of ways for papers to be great: they can introduce new awesome algorithms, have new theory, have a great application, introduce a cool new problem, utilize a new linguistic insight, etc., etc., etc...  Second, best papers are most useful at promotion time (hiring and tenure), where you're being compared with people from other fields.  Why should our field put our people at a disadvantage by not awarding great work that they can list on their CVs?

Anyway, it'll be an interesting experiment, and I encourage folks to submit!

07 September 2010

Manifold Assumption versus Margin Assumption

[This post is based on some discussions that came up while talking about manifold learning with Ross Whitaker and Sam Gerber, who had a great manifold learning paper at ICCV last year.]

There are two assumptions that are often used in statistical learning (both theory and practice, though probably more of the latter), especially in the semi-supervised setting.  Unfortunately, they're incompatible.

The margin assumption states that your data are well separated.  Usually it's in reference to linear, possibly kernelized, classifiers, but that need not be the case.  As most of us know, there are lots of other assumptions that boil down to the same thing, such as the low-weight-norm assumption, or the Gaussian prior assumption.  At the end of the day, it means your data looks like what you have on the left, below, not what you have on the right.
The manifold assumption, which is particularly popular in semi-supervised learning but also shows up in supervised learning, says that your data lie on a low dimensional manifold embedded in a higher dimensional space.  One way of thinking about this is saying that your features cannot covary arbitrarily, but the manifold assumption is quite a bit stronger.  It usually assumes a Riemannian (i.e., locally Euclidean) structure, with data points "sufficiently" densely sampled.  In other words, life looks like the left, not the right, below:
Okay, yes, I know that the "Bad" one is a 2D manifold embedded in 2D, but that's only because I can't draw 3D images :).  And anyway, this is a "weird" manifold in the sense that at one point (where the +s and -s meet), it drops down to 1D.  This is fine in math-manifold land, but usually not at all accounted for in ML-manifold land.

The problem, of course, is that once you say "margin" and "manifold" in the same sentence, things just can't possibly work out.  You'd end up with a picture like:

This is fine from a margin perspective, but it's definitely not a (densely sampled) manifold any more.

In fact, almost by definition, once you stick a margin into a manifold (which is okay, since you'll define margin Euclideanly, and manifolds know how to deal with Euclidean geometry locally), you're hosed.

So I guess the question is: who do you believe?

31 August 2010

Online Learning Algorithms that Work Harder

It seems to be a general goal in practical online learning algorithm development to have the updates be very very simple.  Perceptron is probably the simplest, and involves just a few adds.  Winnow takes a few multiplies.  MIRA takes a bit more, but still nothing hugely complicated.  Same with stochastic gradient descent algorithms for, eg., hinge loss.

I think this maybe used to make sense.  I'm not sure that it makes sense any more.  In particular, I would be happier with online algorithms that do more work per data point, but require only one pass over the data.  There are really only two examples I know of: the StreamSVM work that my student Piyush did with me and Suresh, and the confidence-weighted work by Mark Dredze, Koby Crammer and Fernando Pereira (note that they maybe weren't trying to make a one-pass algorithm, but it does seem to work well in that setting).

Why do I feel this way?

Well, if you look even at standard classification tasks, you'll find that if you have a highly optimized, dual-threaded implementation of stochastic gradient descent, then your bottleneck becomes I/O, not learning.  This is what John Langford observed in his Vowpal Wabbit implementation.  He has to do multiple passes.  He deals with the I/O bottleneck by creating an I/O friendly, proprietary version of the input file during the first pass, and then careening through it on subsequent passes.

In this case, basically what John is seeing is that I/O is too slow.  Or, phrased differently, learning is too fast :).  I never thought I'd say that, but I think it's true.  Especially when you consider that just having two threads is a pretty low requirement these days, it would be nice to put 8 or 16 threads to good use.

But I think the problem is actually quite a bit more severe.  You can tell this by realizing that the idealized world in which binary classifier algorithms usually get developed is, well, idealized.  In particular, someone has already gone through the effort of computing all your features for you.  Even running something simple like a tokenizer, stemmer and stop word remover over documents takes a non-negligible amount of time (to convince yourself: run it over Gigaword and see how long it takes!), easily much longer than a silly perceptron update.

So in the real world, you're probably going to be computing your features and learning on the fly.  (Or at least that's what I always do.)  In which case, if you have a few threads computing features and one thread learning, your learning thread is always going to be stalling, waiting for features.
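Here's a small sketch of that setup: a few producer threads doing (fake) feature extraction into a queue, and one consumer thread doing perceptron updates.  The data, the "featurizer" and the thread count are all made up for illustration; the point is just that the learner spends most of its time blocked in get():

    import threading, queue, random

    examples = [("raw text for example %d" % i, random.choice([-1, 1])) for i in range(1000)]
    feature_q = queue.Queue(maxsize=100)
    NUM_FEATURE_THREADS = 4

    def featurize(text):
        # Stand-in for tokenization, stemming, stop-word removal, etc.
        return {tok: 1.0 for tok in text.split()}

    def feature_worker(chunk):
        for text, label in chunk:                     # slow producers...
            feature_q.put((featurize(text), label))

    def learner():
        w, seen = {}, 0
        while seen < len(examples):
            feats, label = feature_q.get()            # ...fast consumer, mostly waiting
            seen += 1
            margin = sum(w.get(f, 0.0) * v for f, v in feats.items())
            if label * margin <= 0:                   # perceptron update: just a few adds
                for f, v in feats.items():
                    w[f] = w.get(f, 0.0) + label * v

    chunks = [examples[i::NUM_FEATURE_THREADS] for i in range(NUM_FEATURE_THREADS)]
    producers = [threading.Thread(target=feature_worker, args=(c,)) for c in chunks]
    learn_thread = threading.Thread(target=learner)
    for t in producers + [learn_thread]:
        t.start()
    for t in producers + [learn_thread]:
        t.join()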

One way to partially circumvent this is to do a variant of what John does: create a big scratch file as you go and write everything to this file on the first pass, so you can just read from it on subsequent passes.  In fact, I believe this is what Ryan McDonald does in MSTParser (he can correct me in the comments if I'm wrong :P).  I've never tried this myself because I am lazy.  Plus, it adds unnecessary complexity to your code, requires you to chew up disk, and of course adds its own delays since you now have to be writing to disk (which gives you tons of seeks to go back to where you were reading from initially).

A similar problem crops up in structured problems.  Since you usually have to run inference to get a gradient, you end up spending way more time on your inference than your gradients.  (This is similar to the problems you run into when trying to parallelize the structured perceptron.)

Anyway, at the end of the day, I would probably be happier with an online algorithm that spent a little more energy per-example and required fewer passes; I hope someone will invent one for me!

27 August 2010

Calibrating Reviews and Ratings

NIPS decisions are going out soon, and then we're done with submitting and reviewing for a blessed few months. Except for journals, of course.

If you're not interested in paper reviews, but are interested in sentiment analysis, please skip the first two paragraphs :).

One thing that anyone who has ever area chaired, or probably even ever reviewed, has noticed is that different people have different "baseline" ratings. Conferences try to adjust for this; for instance, NIPS defines their 1-10 rating scale as something like "8 = Top 50% of papers accepted to NIPS" or something like that. Even so, some people are just harsher than others in scoring, and it seems like the area chair's job to calibrate for this. (For instance, I know I tend to be fairly harsh -- I probably only give one 5 (out of 5) for every ten papers I review, and I probably give two or three 1s in the same size batch. I have friends who never give a one -- except in the case of something just being wrong -- and often give 5s. Perhaps I should be nicer; I know CS tends to be harder on itself than other fields.) As an aside, this is one reason why I'm generally in favor of fewer reviewers and more reviews per reviewer: it allows easier calibration.
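One simple thing an area chair could do (I'm not claiming any conference actually does this) is z-score each reviewer's scores before aggregating, so a harsh reviewer's 3 and a generous reviewer's 4 land in roughly the same place; a sketch with made-up numbers:

    import numpy as np

    # Hypothetical raw scores: same papers, one harsh and one generous reviewer.
    scores = {
        "harsh_reviewer": {"paper_A": 2, "paper_B": 3, "paper_C": 1},
        "nice_reviewer":  {"paper_A": 4, "paper_B": 5, "paper_C": 4},
    }

    calibrated = {}
    for reviewer, revs in scores.items():
        vals = np.array(list(revs.values()), dtype=float)
        mu, sigma = vals.mean(), vals.std() + 1e-6    # guard against zero variance
        for paper, s in revs.items():
            calibrated.setdefault(paper, []).append((s - mu) / sigma)

    for paper, zs in sorted(calibrated.items()):      # average per-reviewer z-scores
        print(paper, round(sum(zs) / len(zs), 2))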

There's also the issue of areas. Some areas simply seem to be harder to get papers into than others (which can lead to some gaming of the system). For instance, if I have a "new machine learning technique applied to parsing," do I want it reviewed by parsing people or machine learning people? How do you calibrate across areas, other than by some form of affirmative action for less-represented areas?

A similar phenomenon occurs in sentiment analysis, as was pointed out to me at ACL this year by Franz Och. The example he gives is very nice. If you go to TripAdvisor and look up The French Laundry, which is definitely one of the best restaurants in the U.S. (some people say the best), you'll see that it got 4.0/5.0 stars and a 79% recommendation. On the other hand, if you look up In'N'Out Burger, an LA-based burger chain (which, I'll admit, was one of my favorite places to eat in high school growing up in LA, back when I ate stuff like that), you see another 4.0/5.0 stars and a 95% recommendation.

So now, we train a machine learning system to predict that the rating for The French Laundry is 79% and In'N'Out Burger is 95%. And we expect this to work?!

Probably the main issue here is calibrating for expectations. As a teacher, I figured out quickly that managing student expectations is a big part of getting good teaching reviews. If you go to In'N'Out expecting a Big Mac, you'll be pleasantly surprised. If you go to The French Laundry expecting a meal worth selling your soul (and your children's souls) for, then you'll probably be disappointed (though I can't really say: I've never been).

One way that a similar problem has been dealt with on Hotels.com is that they'll show you ratings for the hotel you're looking at, and statistics of ratings for other hotels within a 10 mile radius (or something). You could do something similar for restaurants, though distance probably isn't the right categorization: maybe price. For "$", In'N'Out is probably near the top, and for "$$$$" The French Laundry probably is.
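Sketching that idea: score each restaurant against the average of its own price band rather than against everything.  The numbers below are made up, apart from the two 4.0s from the TripAdvisor example.

    # Toy "rate relative to peers": stars minus the average stars of the same
    # price band ("$", "$$$$", ...).  All data here is invented.
    from collections import defaultdict
    from statistics import mean

    def relative_ratings(restaurants):
        """restaurants: list of (name, price_band, stars); returns stars minus band average."""
        by_band = defaultdict(list)
        for _, band, stars in restaurants:
            by_band[band].append(stars)
        band_avg = {band: mean(stars) for band, stars in by_band.items()}
        return {name: stars - band_avg[band] for name, band, stars in restaurants}

    print(relative_ratings([("In'N'Out", "$", 4.0),
                            ("Generic Burger", "$", 3.2),
                            ("The French Laundry", "$$$$", 4.0),
                            ("Generic Tasting Menu", "$$$$", 3.5)]))

Two identical 4.0s now come out as "above average for its band" by different margins, which is at least a more honest target for a learner than the raw score.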

(Anticipating comments, I don't think this is just an "aspect" issue. I don't care how bad your palate is, even just considering the "quality of food" aspect, Laundry has to trump In'N'Out by a large margin.)

I think the problem is that in all of these cases -- papers, restaurants, hotels -- and others (movies, books, etc.), there simply isn't a total order on the "quality" of the objects you're looking at. (For instance, as soon as a book becomes a best seller, or is advocated by Oprah, I am probably less likely to read it.) There is maybe a situation-dependent order, and the distance to a hotel, the "$" rating, or the area classes are heuristics for describing this "situation." But without knowing the situation, or having a way to approximate it, I worry that we might be entering a garbage-in-garbage-out scenario here.

23 August 2010

Finite State NLP with Unlabeled Data on Both Sides

(Can you tell, by the recent frequency of posts, that I'm trying not to work on getting ready for classes next week?)

[This post is based partially on some conversations with Kevin Duh, though not in the finite state models formalism.]

The finite state machine approach to NLP is very appealing (I mean both string and tree automata) because you get to build little things in isolation and then chain them together in cool ways. Kevin Knight has a great slide about how to put these things together that I can't seem to find right now, but trust me that it's awesome, especially when he explains it to you :).

The other thing that's cool about them is that because you get to build them in isolation, you can use different data sets, which means data sets with different assumptions about the existence of "labels", to build each part. For instance, to do speech to speech transliteration from English to Japanese, you might build a component system like:

English speech --A--> English phonemes --B--> Japanese phonemes --C--> Japanese speech --D--> Japanese speech LM

You'll need a language model (D) for Japanese speech, which can be trained just on acoustic Japanese signals; then parallel Japanese speech/phonemes (for C), parallel English speech/phonemes (for A), and parallel English phonemes/Japanese phonemes (for B). [Plus, of course, if you're missing any of these, EM comes to your rescue!]
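To show the "build pieces in isolation, then chain them" flavor without any real WFST machinery, here's a toy composition of conditional probability tables over single symbols.  The tables are invented; actual transducers work over strings and carry weights, which is where the real difficulty (and Kevin Knight's slide) lives.

    # Toy stand-in for transducer composition: marginalize out the shared
    # intermediate alphabet.  Real finite state composition handles strings.
    def compose(p_b_given_a, p_c_given_b):
        """Return p(c | a) = sum_b p(b | a) * p(c | b)."""
        out = {}
        for a, dist_b in p_b_given_a.items():
            out[a] = {}
            for b, pb in dist_b.items():
                for c, pc in p_c_given_b.get(b, {}).items():
                    out[a][c] = out[a].get(c, 0.0) + pb * pc
        return out

    # Invented mini-tables: step B (English phoneme -> Japanese phoneme) chained
    # with a made-up next step (Japanese phoneme -> toy acoustic symbol).
    B = {"L": {"r": 1.0}, "TH": {"s": 0.7, "z": 0.3}}
    C = {"r": {"r1": 0.5, "r2": 0.5}, "s": {"s1": 1.0}, "z": {"z1": 1.0}}
    print(compose(B, C)["TH"])    # {'s1': 0.7, 'z1': 0.3}

The point is just that B and C never had to see the same training data; they only have to agree on the intermediate alphabet.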

Let's take a simpler example, though the point I want to make applies to long chains, too.

Suppose I want to just do translation from French to English. I build an English language model (off of monolingual English text) and then an English-to-French transducer (remember that in the noisy channel, things flip direction). For the E2F transducer, I'll need parallel English/French text, of course. The English LM gives me p(e) and the transducer gives me p(f|e), which I can put together via Bayes' rule to get something proportional to p(e|f), which will let me translate new sentences.
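Spelled out (nothing here beyond what the paragraph already says), the noisy-channel decomposition is just Bayes' rule with the constant p(f) dropped:

    \hat{e} \;=\; \arg\max_e \; p(e \mid f)
            \;=\; \arg\max_e \; \frac{p(f \mid e)\, p(e)}{p(f)}
            \;=\; \arg\max_e \; \underbrace{p(f \mid e)}_{\text{transducer}} \; \underbrace{p(e)}_{\text{LM}}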

But, presumably, I also have lots of monolingual French text. Forgetting the math for a moment (which seems to suggest that monolingual French text can't help me), we can ask: why should it help?

Well, it probably won't help with my English language model, but it should be able to help with my transducer. Why? Because my transducer is supposed to give me p(f|e). If I have some French sentence in my GigaFrench corpus to which my transducer assigns zero probability (for instance, max_e p(f|e) = 0), then this is probably a sign that something bad is happening.

More generally, I feel like the following two operations should probably give roughly the same probabilities:

  1. Drawing an English sentence from the language model p(e).
  2. Picking a French sentence at random from GigaFrench, and drawing an English sentence from p(e|f), where p(e|f) is the composition of the English LM and the transducer.
If you buy this, then perhaps one thing you could do is to try to learn a transducer q(f|e) that gives low KL divergence between 1 and 2, above. If you work through the (short) math, and throw away terms that are independent of the transducer, then you end up wanting to maximize [ sum_e p(e) log sum_f q(f|e) ] (equivalently, minimize its negative). Here, the sum over f is a finite sum over GigaFrench, and the sum over e is an infinite sum over the English sentences assigned positive probability by the English LM p(e).
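Here's the algebra I'm gesturing at, as a rough sketch: write q(e|f) for the composed LM-plus-transducer posterior, let F stand for GigaFrench, and treat the per-f normalizer of q(e|f) as a constant (that last step is a real cheat, since the normalizer depends on q too):

    \mathrm{KL}(1 \,\|\, 2) \;=\; \sum_e p(e) \log \frac{p(e)}{p_2(e)},
    \qquad p_2(e) \;=\; \frac{1}{|F|} \sum_{f \in F} q(e \mid f),
    \qquad q(e \mid f) \;\propto\; p(e)\, q(f \mid e),

    \min_q \; \mathrm{KL}(1 \,\|\, 2) \;\approx\; \max_q \; \sum_e p(e) \log \sum_{f \in F} q(f \mid e).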

One could then apply something like posterior regularization (Kuzman Ganchev, Graça and Taskar) to do the learning. There's the nasty bit about how to compute these things, but that's why you get to be friends with Jason Eisner so he can tell you how to do anything you could ever want to do with finite state models.

Anyway, it seems like an interesting idea. I'm definitely not aware of anyone who has tried it.

21 August 2010

Readers kill blogs?

I try to avoid making meta-posts, but the timing here was just too impeccable for me to avoid a short post on something that's been bothering me for a year or so.

I actually completely agree with both points. The problem is that I worry that they are actually fairly opposed. I comment much less on other people's blogs now that I use a reader, because the 10-second overhead of clicking through to the blog, being redirected, entering a comment, blah blah blah, is just too high. Plus, I worry that no one (except the blog author) will see my comment, since most readers don't (by default) show comments in with posts.

Hopefully the architects behind readers will pick up on this and make these things (adding and viewing comments, within the reader -- yes, I realize that it's then not such a "reader") easier. That is, unless they want to lose out to tweets!

Until then, I'd like to encourage people to continue commenting here.

19 August 2010

Multi-task learning: should our hypothesis classes be the same?

It is almost an unspoken assumption in multitask learning (and domain adaptation) that you use the same type of classifier (or, more formally, the same hypothesis class) for all tasks. In NLP-land, this usually means that everything is a linear classifier, and the feature sets are the same for all tasks; in ML-land, this usually means that the same kernel is used for every task. In neural-networks land (ala Rich Caruana), this is enforced by the symmetric structure of the networks used.

I probably would have gone on without ever questioning this unspoken assumption, had I not seen, a few years ago, a couple of papers that challenged it, albeit indirectly. One was Factorizing Complex Models: A Case Study in Mention Detection by Radu (Hans) Florian, Hongyan Jing, Nanda Kambhatla and Imed Zitouni, all from IBM. They're actually considering solving tasks separately rather than jointly, but joint learning and multi-task learning are very closely related. What they see is that different features are useful for spotting entity spans than for labeling entity types.

That year, or the next, I saw another paper (can't remember who or what -- if someone knows what I'm talking about, please comment!) that basically showed a similar thing, where a linear kernel was doing best for spotting entity spans, and a polynomial kernel was doing best for labeling the entity types (with the same feature sets, if I recall correctly).

Now, to some degree this is not surprising. If I put on my feature engineering hat, then I probably would design slightly different features for these two tasks. On the other hand, coming from a multitask learning perspective, this is surprising: if I believe that these tasks are related, shouldn't I also believe that I can do well solving them in the same hypothesis space?

This raises an important (IMO) question: if I want to allow my hypothesis classes to be different, what can I do?

One way is to punt: you can just concatenate your feature vectors and cross your fingers. Or, more nuanced, you can have some set of shared features and some set of features unique to each task. This is similar (the nuanced version, not the punting version) to what Jenny Finkel and Chris Manning did in their ACL paper this year, Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data.
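The nuanced version of the punt is basically a feature-copying trick: give every example a shared block of features plus a block that only fires for its own task.  (If I recall correctly, Jenny and Chris's model instead ties task-specific weights through a shared prior, so this is at best a cheap flat cousin of it.)  A sketch, with made-up feature names:

    # Sketch of shared + task-specific feature blocks for linear models.
    def augment(features, task):
        """features: dict of name -> value for one example; task: e.g. 'spans' or 'types'."""
        out = {}
        for name, value in features.items():
            out["shared:" + name] = value    # weights shared across all tasks
            out[task + ":" + name] = value   # weights private to this task
        return out

    print(augment({"word=bank": 1.0, "prev=river": 1.0}, task="types"))

The learner can then decide, feature by feature, whether the shared copy or the private copy carries the weight.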

An alternative approach is to let the two classifiers "talk" via unlabeled data. Although motivated differently, this was something of the idea behind my EMNLP 2008 paper on Cross-Task Knowledge-Constrained Self Training, where we run two models on unlabeled data and look for where they "agree."
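In caricature (this is the flavor of the idea, not the actual procedure from that paper), the unlabeled-data handshake looks something like the following; agree(), the model objects, and retrain1() are all placeholders.

    # Caricature of agreement-based cross-task self-training.
    def cross_task_self_train(model1, model2, unlabeled, agree, retrain1):
        """Keep unlabeled examples where the two task models' outputs are compatible,
        then retrain model 1 on them.  All five arguments are placeholders."""
        new_data = []
        for x in unlabeled:
            y1, y2 = model1.predict(x), model2.predict(x)
            if agree(y1, y2):                # task-specific compatibility check
                new_data.append((x, y1))
        return retrain1(new_data)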

A final idea that comes to mind, though I don't know if anyone has tried anything like this, would be to try to do some feature extraction over the two data sets. That is, basically think of it as a combination of multi-view learning (since we have two different hypothesis classes) and multi-task learning. Under the assumption that we have access to examples labeled for both tasks simultaneously (i.e., not the settings for either Jenny's paper or my paper), then one could do a 4-way kernel CCA, where data points are represented in terms of their task-1 kernel, task-2 kernel, task-1 label and task-2 label. This would be sort of a blending of CCA-for-multiview-learning and CCA-for-multi-task learning.

I'm not sure what the right way to go about this is, but I think it's something important to consider, especially since it's an assumption that usually goes unstated, even though empirical evidence seems to suggest it's not (always) the right assumption.

02 August 2010

Why Discourse Structure?

I come from a strong lineage of discourse folks. Writing a parser for Rhetorical Structure Theory was one of the first class projects I had when I was a grad student. Recently, with the release of the Penn Discourse Treebank, there has been a bit of a flurry of interest in this problem (I had some snarky comments right after ACL about this). I've also talked about why this is a hard problem, but never really about why it is an interesting problem.

My thinking about discourse has changed a lot over the years. My current thinking about it is in an "interpretation as abduction" sense. (And I sincerely hope all readers know what that means... if not, go back and read some classic papers by Jerry Hobbs.) This is a view I've been nursing for a while, but I finally started putting it into words (probably mostly Jerry's words) in a conversation at ACL with Hoifung Poon and Joseph Turian (I think it was Joseph... my memory fades quickly these days :P).

This view is that discourse is the thing that gives you an interpretation above and beyond whatever interpretations you get from the individual sentences. Here's a slightly refined version of the example I came up with on the fly at ACL:

  1. I only like traveling to Europe. So I submitted a paper to ACL.
  2. I only like traveling to Europe. Nevertheless, I submitted a paper to ACL.
What does the hearer (H) infer from these sentences? Well, if we look at the sentences on their own, then H infers something like Hal-likes-travel-to-Europe-and-only-Europe, and H infers something like Hal-submitted-a-paper-to-ACL. But when you throw discourse in, you can derive two additional bits of information. In example (1), you can infer ACL-is-in-Europe-this-year, and in (2) you can infer the negation of that.

Pretty amazing stuff, huh? Replacing a "so" with a "nevertheless" completely changes this interpretation.

What does this have to do with interpretation as abduction? Well, we're going to assume that this discourse is coherent. Given that assumption, we have to ask ourselves: in (1), what do we have to assume about the world to make this discourse coherent? The answer is that you have to assume that ACL is in Europe. And similarly for (2).

Of course, there are other things you could assume that would make this discourse coherent. For (1), you could assume that I have a rich benefactor who likes ACL submissions and will send me to Europe every time I submit something to ACL. For (2), you could assume that I didn't want my paper to get in, but I wanted a submission to get reviews, and so I submitted a crappy paper. Or something. But these fail the Occam's Razor test. Or, perhaps they are a priori simply less likely (i.e., you have to assume more to get the same result).

Interestingly, I can change the interpretation of (2), for instance, by adding a third sentence to the discourse: "I figured that it would be easy to make my way to Europe after going to Israel." Here, we would abduce that ACL is in Israel, and that I'm willing to travel to Israel on my way to Europe. For you GOFAI folks, this would be something like non-monotonic reasoning.

Whenever I talk about discourse to people who don't know much about it, I always get this nagging sense of "yes, but why do I care that you can recognize that sentence 4 is background to sentence 3, unless I want to do summarization?" I hope that this view provides some alternative answer to that question. Namely, that there's some information you can get from sentences, but there is additional information in how those sentences are glued together.

Of course, one of the big problems we have is that we have no idea how to represent sentence-level interpretations -- or, at least, we have some ideas, but no way to get there in the general case. In the sentence-level case, we've seen some progress recently in terms of representing semantics in a sort of substitutability manner (ala paraphrasing), which is nice because the representation is still text. One could ask whether something similar might be possible at the discourse level. Obviously you could paraphrase discourse connectives, but that's missing the point. What else could you do?