29 April 2007

Problems with Sometimes-hidden Variables

Following in the footsteps of Andrew, I may start posting NLP-related questions I receive by email to the blog (if the answer is sufficiently interesting). At least its a consistent source of fodder. Here's one from today:

I am looking for a task, preferably in nlp and info retriv, in which we have labeled data where some are completely observed (x,h) and some are partially observed x. (h is a latent variable and is observed for some data points and not observed for others) Briefly, the data should be like {(x_i,h_i,y_i),(x_j,y_j)} where y is the class label.
I think there are a lot of such examples. The first one to spring to mind is alignments for MT, but please don't use the Aachen alignment data; see Alex's ACL06 paper for one way to do semi-supervised learning when the majority of the work is coming from the unsupervised part, not from the supervised part. You could try to do something like the so-called end-to-end system, but with semi-supervised training on the alignments. Of course, building an MT system is a can of worms we might not all want to open.

Depending on exactly what you're looking for, there may be (many) other options. One might be interested in the case where the ys are named entity tags or entities with coref annotated and the hs are parse trees (ala the BBN corpus). By this reasoning, pretty much any pipelined system can be considered of this sort, provided that there is data annotated for both tasks simultaneously (actually, I think a very interesting research question is what to do when this is not the case). Unfortunately, typically if the amount of annotated data differs between the two tasks, it is almost certainly going to be the "easier" task for which more data is available, and typically you would want this easier task to be the hidden task.

Other than that, I can't think of any obvious examples. You could of course fake it (do parsing, with POS tags as the hidden, and pretend that you don't have fully annotated POS data, but I might reject such a paper on the basis that this is a totally unrealistic setting -- then again, I might not if the results were sufficiently convincing). You might also look at multitask learning literature; I know that in such settings, it is often useful to treat the "hidden" variable as just another output variable that's sometimes present and sometimes not. This is easiest to do in the case of neural networks, but can probably be generalized.

27 April 2007

EMNLP papers, tales from the trenches

First, EMNLP/CoNLL papers have been posted. I must congratulate the chairs for publishing this so fast -- for too many conferences we have to wait indefinitely for the accepted papers to get online. Playing the now-standard game of looking at top terms, we see:

1. model (24)
2. translat (17)
3. base (16)
4. learn (13) -- woohoo!
5. machin (12) -- mostly as in "machin translat"
6. word (11) -- mostly from WSD
7. structur (10) -- yay, as in "structured"
8. disambigu (10) -- take a wild guess
9. improv (9) -- boring
10. statist, semant, pars, languag, depend, approach (all 7)
EMNLP/CoNLL this year was the first time I served as area chair for a conference. Overall it was a very interesting and enlightening experience. It has drastically changed my view of the conference process. I must say that overall it was a very good experience: Jason Eisner did a fantastic job at keeping things on track.

I want to mention a few things that I noticed about the process:
1. The fact that reviewers only had a few weeks, rather than a few months to review didn't seem to matter. Very few people I asked to review declined. (Thanks to all of you who accepted!) It seems that, in general, people are friendly and happy to help out with the process. My sense is that 90% of the reviews get done in the last few days anyway, so having 3 weeks or 10 weeks is irrelevant.

2. The assignment process of papers to reviewers is hard, especially when I don't personally know many of the reviewers (this was necessary because my area was broader than me). Bidding helps a lot here, but the process is not perfect. (People bid differently: some "want" to review only 3 papers, so "want" half...) If done poorly, this can lead to uninformative reviews.

3. Papers were ranked 1-5, with half-points allowed if necessary. None of my papers got a 5. One got a 4.5 from one reviewer. Most got between 2.5 and 3.5, which is highly uninformative. I reviewed for AAAI this year, where you had to give 1,2,3 or 4, which forces you to not abstain. This essentially shifts responsibility from the area chair to the reviewer. I'm not sure which approach is better.

4. EMNLP asked for many criteria to be evaluated by reviewers; more than in conferences past. I thought this was very useful to help me make my decisions. (Essentially, in addition to high overall recommendation, I looked for high scores on "depth" and "impact.") So if you think (like I used to) that these other scores are ignored: be assured, they are not (unless other area chairs behave differently).

5. Blindness seems like a good thing. I've been very back and forth on this until now. This was the first time I got to see author names with papers and I have to say that it is really hard to not be subconsciously biased by this information. This is a debate that will never end, but for the time being, I'm happy with blind papers.

6. 20-25% acceptance rate is more than reasonable (for my area -- I don't know about others). I got 33 papers, of which three basically stood out as "must accepts." There were then a handful over the bar, some of which got in, some of which didn't. There is certainly some degree of randomness here (I believe largely due to the assignment of papers to reviewers), and if that randomness hurt your paper, I truly apologize. Not to belittle the vast majority of papers in my area, but I honestly don't think that the world would be a significantly worse place is only those top three papers had gotten in. This would make for a harsh 9% acceptance rate, but I don't have a problem with this.

I know that this comment will probably not make me many friends, but probably about half of the papers in my area were clear rejects. It seems like some sort of cascaded approach to reviewing might be worth consideration. The goal wouldn't be to reduce the workload for reviewers, but to have them concentrate their time on papers that stand a chance. (Admittedly, some reviewers do this anyway internally, but it would probably be good to make it official.)

7. Discussion among the reviewers was very useful. I want to thank those reviewers who added extra comments at the end to help me make the overall decisions (which then got passed up to the "higher powers" to be interpolated with other area chair's decisions). There were a few cases where I had to recruit extra reviewers, either to get an additional opinion or because one of my original reviewers went AWOL (thanks, all!). I'm happy to say that overall, almost all reviews were in on time, and without significant harassment.
So that was my experience. Am I glad I did it? Yes. Would I do it again? Absolutely.

Thanks again to all my reviewers, all the authors, and especially Jason for doing a great job organizing.

19 April 2007

Searn versus HMMs

Searn is my baby and I'd like to say it can solve (or at least be applied to) every problem we care about. This is of course not true. But I'd like to understand the boundary between reasonable and unreasonable. One big apparently weakness of Searn as it stands is that it appears applicable only to fully supervised problems. That is, we can't do hidden variables, we can't do unsupervised learning.

I think this is a wrong belief. It's something I've been thinking about for a long time and I think I finally understand what's going on. This post is about showing that you can actually recover forward-backward training of HMMs as an instance of Searn with a particular choice of base classifier, optimal policy, loss function and approximation method. I'll not prove it (I haven't even done this myself), but I think that even at a hand-waving level, it's sufficiently cool to warrant a post.

I'm going to have to assume you know how Searn works in order to proceed. The important aspect is essentially that we train on the basis of an optimal policy (which may be stochastic) and some loss function. Typically I've made the "optimal policy" assumption, which means that when computing the loss for a certain prediction along the way, we approximate the true expected loss with the loss given by the optimal policy. This makes things efficient, but we can't do it in HMMs.

So here's the problem set up. We have a sequence of words, each of which will get a label (for simplicity, say the labels are binary). I'm going to treat the prediction task as predicting both the labels and the words. (This looks a lot like estimating a joint probability, which is what HMMs do.) The search strategy will be to first predict the first label, then predict the first word, then predict the second label and so on. The loss corresponding to an entire prediction (of both labels and words) is just going to be the Hamming loss over the words, ignoring the labels. Since the loss doesn't depend on the labels (which makes sense because they are latent so we don't know them anyway), the optimal policy has to be agnostic about their prediction.

Thus, we set up the optimal policy as follows. For predictions of words, the optimal policy always predicts the correct word. For predictions of labels, the optimal policy is stochastic. If there are K labels, it predicts each with probability 1/K. Other optimal policies are possible and I'll discuss that later.

Now, we have to use a full-blown version of Searn that actually computes expected losses as true expectations, rather than with an optimal policy assumption. Moreover, instead of sampling a single path from the current policy to get to a given state, we sample all paths from the current policy. In other words, we marginalize over them. This is essentially akin to not making the "single sample" assumption on the "left" of the current prediction.

So what happens in the first iteration? Well, when we're predicting the Nth word, we construct features over the current label (our previous prediction) and predict. Let's use a naive Bayes base classifier. But we're computing expectations to the left and right, so we'll essentially have "half" an example for predicting the Nth word from state 0 and half an example for predicting it from state 1. For predicting the Nth label, we compute features over the previous label only and again use a naive Bayes classifier. The examples thus generated will look exactly like a training set for the first maximization in EM (with all the expectations equal to 1/2). We then learn a new base classifier and repeat.

In the second iteration, the same thing happens, except now when we predict a label, there can be an associated loss due to messing up future word predictions. In the end, if you work through it, the weight associated with each example is given by an expectation over previous decisions and an expectation over future decisions, just like in forward-backward training. You just have to make sure that you treat your learned policy as stochastic as well.

So with this particular choice of optimal policy, loss function, search strategy and base learner, we recover something that looks essentially like forward-backward training. It's not identical because in true F-B, we do full maximization each time, while in Searn we instead take baby steps. There are two interesting things here. First, this means that in this particular case, where we compute true expectations, somehow the baby steps aren't necessary in Searn. This points to a potential area to improve the algorithm. Second, and perhaps more interesting, it means that we don't actually have to do full F-B. The Searn theorem holds even if you're not computing true expectations (you'll just wind up with higher variance in your predictions). So if you want to do, eg., Viterbi F-B but are worried about convergence, this shows that you just have to use step sizes. (I'm sure someone in the EM community must have shown this before, but I haven't seen it.)

Anyway, I'm about 90% sure that the above actually works out if you set about to prove it. Assuming its validity, I'm about 90% sure it holds for EM-like structured prediction problems in general. If so, this would be very cool. Or, at least I would think it's very cool :).

17 April 2007

ACL papers up

List is here.

Top stemmed, non-stop words from titles are:

• Model (22)
• Tranlat (16)
• Languag (16)
• Machin (15)
• Learn (15)
• Base (14) -- as in "syntax-based"
• Word (13)
• Statist (13)
• Pars (11) -- as in parsing
• Semant (10) -- wow!
• Approach (10)

12 April 2007

Multinomial on a graph

I'm going to try something here to see if it works; my guess is not, but maybe it will at least spark some discussion. (I'm also testing LaTeX in Blogger.) In "text modeling" applications, we're usually dealing with lots of multinomials, especially multinomials over our vocabulary (i.e., unigram language models). These are parameterized by a vector $\theta$ of length equal to the size of the vocabulary. $\theta_i$ should be the probability of seeing word $i$. The probability of observing an entire document of length $N$, containing $w_1$ copies of word 1, $w_2$ copies of word 2, and so on, is:

$Pr(w\|N,\theta) = \frac {N!} {\prod_i w_i!} \prod_i \theta_i^{w_i}$

What I'm interested in is extensions to this where we don't assume independence. In particular, suppose that we have an ontology. Let's say that our ontology is just some arbitrary graph. What I want is a distribution that essentially prefers to keep drawing values "locally" within the graph. That is, if I've already drawn "computer" then I'm not so suprised to also see "program."

One way of thinking about this is by having a sort of random walk model. Think about placing the individual $\theta$s on the graph and then allowing them to disperse. So that maybe there is some $i$ for which $\theta_i = 1$ (and the rest are zero). We certainly want our distribution to prefer draws for word $i$, but draws of word $j$ where $j$ is "near" $i$ should also be acceptable.

My first cut attempt at this is as follows. Let $G$ be a graph over our vocabulary; let $N(i)$ be a neighborhood of $i$ (maybe, all nodes within 10 steps); let $N^{-1}(j)$ be the inverse neighborhood (the set of all $i$ that have $j$ in $N(i)$). Let $d_{ij}$ be the distance between two nodes and let $0 < \gamma < 1$ be a dampening parameter (how fast probability diffuses on the graph). Define our probabilistic model as:

$Pr(w \| N,G,\gamma,\theta) \prop \prod_i \sum_{j \in N^{-1}(i)} \left( \theta_j \frac {\gamma^{d_{ij}}} {S_j} \right)^{w_i}$

Where we let $S_j = \sum_{k \in N(j)} \gamma^{d_{ij}}$.

The idea behind this model is as follows. When we observe word i, we have to assign a probability to it. This probability is given by any node j that contains i in its neighborhood. Each such node has some mass $\theta_j$ which it can distribute. It distributes this mass proportional to $\gamma^{d_{ij}}$ to all nodes $i$, where $S_j$ serves to normalize this diffusion.

I'm pretty sure that the normalizing constant for this distribution is the same as the multinomial. After all, there is a one-to-one mapping between the multinomial parameters and the "graph multinomial" parameters, so we should still get $N!/\prod_i w_i!$ to normalize.

Now, because I feel an incredible need to be Bayesian, I need a conjugate prior for this "graph multinomial" distribution. Since the graph multinomial is essentially just a reparameterization of the standard multinomial, it seems that a vanilla Dirichlet would be an appropriate prior. However, I cannot prove that it is conjugate -- in particular, I can't seem to come up with appropriate posterior updates.

So this is where y'all come in to save the day! I can't imagine that no one has come up with such a "graph multinomial" distribution -- I just don't know what keywords to search for. If someone has seen something at all like this, I'd love to know. Alternatively, if someone can tell me what the appropriate conjugate prior is for this distribution, plus the posterior update rules, that would also be fantastic.

07 April 2007

AI Stats post mortem

Got back from AIStats a week or so ago; overall, I enjoyed it. Though by the end, I was totally sick of Puerto Rican food. (I'm a huge fan of Cuban food and expected stronger similarities. Oh well.) Here are some things I learned at AIStats:

• Multiple instance learning. There was a paper on this topic. I can't comment on how good the paper was (in comparison to other MIL papers), but I didn't know about the MIL task before. The problem is as follows. Instead of getting a set of labeled training points, you get a set of sets of points. Each set of points comes with a +1/-1 label. +1 means that at least one example in the set is positive; -1 means none are. Vision people like this because they have an image that contains a person but they don't know where the person is, so the scan a box around the image and generate lots of examples. We know at least one will contain a person. I realized after seeing the poster that I've actually worked on this problem! The BayeSum paper is actually a MIL algorithm, but I didn't know the right terminology. I wonder how it compares in vision problems.
• Deep belief networks. These seem to be growing in popularity -- both Hinton's group and LeCun's group have their own version of DBNs. You can basically think of these as neural nets with tons of layers of hidden units. They can presumably learn really complicated functions of inputs. It's also possible to train them unsupervisedly (is that a word?) and then just tune the unsupervised model to some classification task. I don't really understand how they work, but they seem to be doing something interesting.
• It is possible to learn (probably approximate) underestimates for the heuristic in A* search. The problem is that you have a given, fixed path cost and you want to learn an admissible heuristic. Seems quite similar to Berkeley paper at NAACL this year, though different in important ways. Seems like a reasonable thing to try if you're stuck with a fixed path cost. I'm curious how these techniques compare to keeping a heuristic fixed and learning a path cost, ala Drew -- seems that on the surface they'd be similar, but learning the path cost seems a bit easier to me.
There were, of course, other interesting papers. Yee Whye has listed a few.

05 April 2007

Discourse is darn hard

Discourse is an area I've been peripherally interested in for quite a while; probably unduly influence by my advisor's (ex-?)interest in the problem. The general sense I get is that discourse relations are relatively easy to identify when cue-phrases exist. This is essentially the approach taken by the Penn Discourse Treebank and Daniel's original parser. Unfortunately, cue phrases exist only for a small subset of sentences, so it's tempting to look at lexical techniques, since they work so well in other problems. The intuition is that if one sentence says "love" and the next says "hate," then maybe they are in some sort of contrastive relation. Unfortunately, the world is rarely this nice. The amount of inference necessary to reliably identify seemingly simple, coarse-grained relationships, seems astonishing.

Here are some examples of contrast relations:

1. The interest-only securities will be sold separately by BT Securities.
The principal-only securities will be repackaged by BT Securities into a Freddie Mac Remic, Series 103, that will have six classes.
2. James Earl Jones is cast to play the Rev. Mr. Johns.
The network deals a lot with unknowns, including Scott Wentworth, who portrayed Mr. Anderson, and Bill Alton as Father Jenco.
3. At the same time, second-tier firms will continue to lose ground.
In personal computers, Apple, Compaq and IBM are expected to tighten their hold on their business.
In (1), we need to know that interest-only and principal-only securities are essentially themselves contrastive. We then need to combine this information with the coreference of BT Securities in both cases (should be easy). This seems like maybe we could possibly get it with enough data. For (2), we essentially need to know that James Earl Jones is famous, and that "deals with unknowns" is the same as "casts." For (3), we need to know that Apple, Compaq and IBM are not second-tier firms. These latter two seem quite difficult.

Next, let's consider background. Here are some short examples of background relations:
1. The March 24 oil spill soiled hundreds of miles of shoreline along Alaska's southern coast and wreaked havoc with wildlife and the fishing industry.
Exxon Corp. is resigning from the National Wildlife Federation 's corporate advisory panel, saying the conservation group has been unfairly critical of the Exxon Valdez oil spill along the Alaskan coast.
2. Insurers typically retain a small percentage of the risks they underwrite and pass on the rest of the losses.
After Hugo hit, many insurers exhausted their reinsurance coverage and had to tap reinsurers to replace that coverage in case there were any other major disasters before the end of the year.
The first example here might be identifiable by noting that the tense in the two sentences is different: one is present, one is past. Beyond this, however, we'd essentially have to just recognize that the oil spill referred to in both sentences is, in fact, the same spill. We'd have to infer this from the fact that otherwise the first sentence is irrelevant. In (2), we'd have to recognize that the "passed on" losses go to reinsurers (I didn't even know such things existed).

As a final example, let's look at some evidence relations:
1. A month ago, the firm started serving dinner at about 7:30 each night; about 50 to 60 of the 350 people in the investment banking operation have consistently been around that late.
Still, without many actual deals to show off, Kidder is left to stress that it finally has "a team" in place, and that everyone works harder.
2. Yesterday, Mobil said domestic exploration and production operations had a \$16 million loss in the third quarter, while comparable foreign operations earned \$234 million.
Individuals familiar with Mobil's strategy say that Mobil is reducing its U.S. work force because of declining U.S. output.
These examples also seem quite difficult. In (1), the fact that the "team" is supposed to refer to the "50 to 60 ... people" and that having dinner, within the context of a company, is a "team-like" activity. (2) is somewhat easier -- knowing that a "loss" is related to a "decline" helps a lot, though it is entirely possible that the type of relation would differ. You also pretty much have to know that "domestic" means "U.S."

I feel like, in many ways, this discourse relation problem subsumes the now-popular "textual entailment" problem. In fact, some of the RST relations look a lot like textual entailment: elaboration-general-specific, eleboration-set-member, Contrast (as a negative entailment), consequence, etc.

I don't have a strong conclusion here other than: this is a hard problem. There've been some attempts at learning lexical things for this task, but I feel like it needs more than that.