25 October 2007

Non-parametric versus model selection/averaging

Non-parametric approaches (probably the most familiar of which is the Dirichlet process, but there are a whole host of them) are nice because they don't require us to pre-specify a bunch of things that in standard parametric inference would essentially be a model selection issue. For instance, in the DP, we needn't specify how many "clusters" gave rise to our data (in the context of a mixture model).
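To make the "no pre-specification" point concrete, here is a tiny sketch (my own toy illustration, not from any particular library) of sampling a partition from the Chinese restaurant process view of the DP: the number of clusters falls out of the draw rather than being fixed in advance.

```python
import random

def crp_sample(n, alpha=1.0, seed=0):
    """Sample cluster assignments for n points from a Chinese restaurant
    process with concentration alpha (the CRP is the partition marginal
    induced by a DP). The number of clusters is not fixed in advance --
    it is determined by the draw itself."""
    rng = random.Random(seed)
    assignments = []   # cluster index for each point
    counts = []        # number of points in each existing cluster
    for i in range(n):
        # point i joins cluster k with prob counts[k]/(i+alpha),
        # or starts a new cluster with prob alpha/(i+alpha)
        r = rng.uniform(0, i + alpha)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                counts[k] += 1
                assignments.append(k)
                break
        else:
            counts.append(1)
            assignments.append(len(counts) - 1)
    return assignments, len(counts)

_, num_clusters = crp_sample(1000, alpha=2.0)
print(num_clusters)   # varies from draw to draw; roughly alpha * log(n)
```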

This brings up the immediate question, though: instead of doing inference in a non-parametric model, why don't you just do model selection (eg., by comparing marginals) or model averaging? You can just vary whatever it is that constitutes the "non-parametric" part of the model. For instance, for a DP, you run a bunch of inferences with different numbers of clusters and either choose the best (model selection) or average with respect to the marginals (model averaging). For something like an IBP, you can run with different numbers of latent factors and select or average.
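Here is a rough sketch of what that parametric alternative looks like in practice. I'm using scikit-learn's GaussianMixture and BIC purely as stand-ins; a real comparison might use held-out likelihood or a proper marginal estimate instead.

```python
# A minimal sketch of the "just do model selection" alternative: fit
# finite mixtures over a range of K and compare, using BIC as a cheap
# stand-in for the marginal likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# toy data: three well-separated 1-d clusters (purely illustrative)
X = np.concatenate([rng.normal(m, 0.5, 200) for m in (-5, 0, 5)]).reshape(-1, 1)

scores = {}
for k in range(1, 11):                      # try K = 1..10 clusters
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    scores[k] = gmm.bic(X)                  # lower BIC = better

best_k = min(scores, key=scores.get)        # model selection
print(best_k)

# Model averaging would instead weight each K's predictions by (an
# approximation to) its posterior probability, eg. proportional to exp(-BIC/2).
```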

I've been asked this general question a few times by non-ML people and I rarely feel like I can give a compelling answer. In particular, I'm not aware of any non-toy experimental comparisons between doing model selection/averaging in any of these models. And even toy ones are hard to come by. But even beyond empirical evidence, I often have a hard time even formulating a coherent qualitative argument.

Here are some points I've come up with, but maybe commenters can either debunk them or add more...

  1. In some cases, there are lots of parts of the model for which we don't know the structure, so to do model selection/averaging would require trying a ridiculously large number of models. For instance, I might have two components in my model that are DP-ish, so now I have to try quadratically many models.
  2. I may not know a good upper/lower bound on the number of components (eg., in a DP), so I'm going to have to try a really large range. In fact, although it's well known that the expected number of clusters in a DP grows as O(log N), where N is the number of data points, the number of clusters itself is unbounded as N grows (and there's a conjecture that it's ω(log log N), which isn't terribly slow). A small simulation of this growth is sketched just after this list.
  3. Comparing marginal likelihoods across models with different numbers of parameters is just plain hard. In fact, for most cases, I don't know how to do it, especially if you want to live in MCMC world. (In variational world you could compare the lower bounds on the marginals, but it's always a bit nerve-wracking to compare two lower bounds -- you'd rather compare a lower bound against an upper bound.) I'm aware of things like reversible jump MCMC and so on, but in most cases these aren't actually applicable to the models you want. Alternatively, if all you want to do is select (not average), you could always do something with held-out data.
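Here is the small simulation promised in point (2): the expected number of occupied clusters under a DP prior, computed directly from the CRP probabilities. It's just a toy calculation to show the roughly logarithmic growth.

```python
# Expected number of DP clusters after N points:
#   E[K_N] = sum_{i=0}^{N-1} alpha / (alpha + i)  ~  alpha * log(N),
# so there is no obvious a-priori cap on K to sweep over.
import math

def expected_clusters(n, alpha=1.0):
    return sum(alpha / (alpha + i) for i in range(n))

for n in (100, 10_000, 1_000_000):
    print(n, round(expected_clusters(n)), round(math.log(n), 1))
```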
The problem is that I can think of counter-arguments to most of these points. In the case of (1), you could argue that if the space is too big, then your sampler isn't going to hit everywhere anyway. In the case of (2), my guess is that for most of these models the marginal will be semi-convex (roughly unimodal as a function of the number of components), so you can just start small and keep increasing until things seem to get worse. For (3), this seems to be an argument for developing better MCMC techniques for comparing marginals, not necessarily an argument in favor of non-parametric methods.

But I can go back yet again. To counter the counter to (1), you can argue that the sampler is at least guaranteed after a long time to hit what you care about, whereas if you construct some arbitrary search policy, you may not be. For (2), well...I don't know...I'm pretty convinced by the counter-argument to (2) :P... For (3), you could just disagree and say: why should we develop better MCMC techniques for comparing marginals when we can get away from this whole business by doing non-parametric inference.

Overall, I think non-parametric inference is interesting, useful and fun. But I'd like better arguments against the nay-sayers (who, in my experience, are actually typically non-ML people).

(Note that I'm ignoring the case where the non-parametric model is actually known--or roughly known--to be the right model for your problem. Of course if it's the right model, then you should use it. I'm more referring to the question of using non-parametric methods to get around model selection issues.)

19 October 2007

Gender and text, gender and speech

For some crazy reason I decided a while ago that I wanted to learn Japanese. Essentially, I wanted to learn a language as unlike English as I could find. So I did some summer intensive thing before college (that amounted to a year of class) and then continued taking class for all three years of undergrad. At the end, I could get by passably for most conversation topics (business, politics, current events, etc.) other than research stuff (at some point I learned how to say NLP, but I don't remember anymore...I wonder if en-eru-pi would be understood...). During the whole time we were required to meet weekly with conversation partners so as to practice our speaking skills.

For the first "semester" during the summer, I had a male professor. For all remaining seven semesters, my profs were female. With the exception of one conversation partner (who was from Hokkaido and spoke quicky with a strong accent and who was quickly replaced by someone who I could understand a bit more), all of my conversation partners were female.

At the end of my four years, I was speaking to a friend (who was neither a conversation partner nor a prof) in Japanese and, after about three turns of conversation, he said to me (roughly): "you talk like a girl."

Based on the set up of this post, you may have seen that coming. But the thing that is most interesting is that Japanese is not one of those languages where the speaker's gender is encoded in (eg.) verb morphology. In fact, as best I could tell at that point, the only thing that I did that was effeminate was to use too many sentence ending particles that were more commonly used by women (-ka-na, I think, was one, but it's been too long now to really remember). The guy who said this to me was a close enough friend that I tried to figure out what it was about my speech that made him assess that I talk like a girl. The sentence particle thing was part of it, but he said that there was also something else that he couldn't really figure out; he was hypothesizing it was something to do with emphasis patterns.

It's not at all surprising that, given that the majority of the native speakers I talked to were female, if there were some underlying bias sufficiently subtle that the profs weren't able to intentionally avoid it, I would have picked it up.

Now, getting back to en-eru-pi. There's been a reasonable amount of work in the past few years on identifying the gender of the authors of texts. I know both Moshe Koppel and Shlomo Argamon, to name two, have worked on this problem. I also remember seeing a web site a year or so ago where you could enter a few sentences that you wrote and it would guess your gender. I don't remember what it cued off of -- I think distributions of types of verbs and adjectives, mostly -- but I do remember that, given a short paragraph, it's shockingly accurate.

What I don't know is (a) whether anyone has done this for something other than English and (b) whether someone has done it for speech. Of course, if you have speech, you have extra information (eg., pitch) which might be useful. But given my Japanese friend's reaction to my speech pattern (my voice is rather low), there has to be more going on. And I'm not convinced that what is going on will be the same between written text and (say) transcribed speech. If someone wanted to try such an experiment for non-English text, you could probably just mine non-English text from some social networking site (like myspace or facebook), where people tend to list their genders. I'm not sure how to do it for speech. Maybe there's some speech transcription corpus out there that's annotated with gender, but I don't know what it is. Although I don't see a huge financial market out there for an answer, I'm personally curious what it is about my English writing patterns that made the web site I referred to earlier strongly convinced that I'm male, and what it is about my Japanese speech patterns that makes it clear that I'm not.
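In case anyone wants to try the text side of this, here's a bare-bones sketch of the experiment. The load_corpus function and the profiles.tsv file are hypothetical placeholders for whatever labeled text you manage to scrape; character n-gram features are just one choice that happens to work across languages, not what the web site above actually used.

```python
# Sketch: predict author gender from text with a bag-of-features classifier.
# load_corpus() and its file format are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def load_corpus(path):
    """Hypothetical loader: one 'label<TAB>text' line per author."""
    texts, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, text = line.rstrip("\n").split("\t", 1)
            labels.append(label)
            texts.append(text)
    return texts, labels

texts, labels = load_corpus("profiles.tsv")   # hypothetical scraped data
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # language-agnostic features
    LogisticRegression(max_iter=1000),
)
print(cross_val_score(clf, texts, labels, cv=5).mean())
```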

04 October 2007

F-measure versus Accuracy

I had a bit of a revelation a few years ago. In retrospect, it's obvious. And I'm hoping someone else out there hasn't realized this because otherwise I'll feel like an idiot. The realization was that F-measure (for a binary classification problem) is not invariant under label switching. That is, if you just change which class it is that you call "positive" and which it is that you call "negative", then your overall F-measure will change.

What this means is that you have to be careful, when using F-measure, about how you choose which class is the "positive" class.

On the other hand, the simple "accuracy" metric is (of course) invariant under label switching. So when using accuracy, you needn't worry about which class you consider "positive."
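Here's a tiny concrete demonstration (my own toy numbers, not from any real task): swapping which class you call "positive" changes F, while accuracy stays put.

```python
# F1 is not invariant under label switching; accuracy is.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

print(accuracy_score(y_true, y_pred))            # 0.8 either way
print(f1_score(y_true, y_pred, pos_label=1))     # ~0.67 with class 1 as "positive"
print(f1_score(y_true, y_pred, pos_label=0))     # ~0.86 with class 0 as "positive"
```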

In the olden days, when people pretty much just used F-measure to analyze things like retrieval quality, this wasn't a problem. It was "clear" which class was positive (good documents) and which was negative. (Or maybe it wasn't...) But now, when people use F-measure to compare results on a large variety of tasks, it makes sense to ask: when is accuracy the appropriate measure and when is F the appropriate measure?

I think that, if you were to press anyone on an immediate answer to this question, they would say that they favor F when one of the classes is rare. That is, if one class occurs only in 1% of the instances, then a classifier that always reports "the other class" will get 99% accuracy, but terrible F.
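For example (again with made-up numbers), here is the rarity argument in two lines of the same toy setup:

```python
# With a 1%-positive class, always predicting the majority class gives
# 99% accuracy but an F of zero on the rare class.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1] * 10 + [0] * 990      # 1% positive class
y_pred = [0] * 1000                # always predict "the other class"

print(accuracy_score(y_true, y_pred))              # 0.99
print(f1_score(y_true, y_pred, zero_division=0))   # 0.0
```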

I'm going to try to convince you that while rarity is a reasonable heuristic, there seems to be something deeper going on.

Suppose I had a bunch of images of people drinking soda (from a can) and your job was to classify whether they were drinking Coke or Pepsi. I think it would be hard to argue that F is a better measure here than accuracy: how would I choose which one is "positive"? Now, suppose the task were to distinguish between Coke and Diet Dr. Pepper. Coke is clearly going to be the majority class here (by a long shot), but I still feel that F is just the wrong measure; accuracy still seems to make more sense. Now, suppose the task were to distinguish between Coke and "anything else." All of a sudden, F is much more appealing, even though Coke probably isn't much of a minority (maybe 30%).

What seems to be important here is the notion of X versus not-X, rather than X versus Y. In other words, the question seems to be: does the "not-X" space make sense?

Let's consider named entity recognition (NER). Despite the general suggestion that F is a bad metric for NER, I would argue that it makes more sense than accuracy. Why? Because it just doesn't make sense to try to specify what "not a name" is. For instance, consider the string "Bill Clinton used to be the president; now it's Bush." Clearly "Bill Clinton" is a person. But is "Clinton used"? Not really. What about "Bill the"? Or "Bill Bush"? I think a substantial part of the problem here is not that names are rare, but that it's just not reasonable to develop an algorithm that finds all not-names. They're just not well defined.

This suggests that F is a more appropriate measure for NER than, at least, accuracy.

One might argue--and I did initially myself--that this is an artifact of the fact that names are often made up of multiple words, and so there's a segmentation issue. (The same goes for the computer vision problem of trying to draw bounding boxes around humans in images, for which, again, F seems to make more sense.)

But I think I've been convinced that this isn't actually the key issue. It seems, again, that what it boils down to is that it makes sense to ask one to find Entities, but it doesn't make sense to ask one to find non-Entities, in the same way that it doesn't make sense to ask one to find non-Cokes.

(Of course, in my heart of hearts I believe that you need to use a real--i.e., type 4--evaluation metric, but if you're stuck without this, then perhaps this yields a reasonable heuristic.)