31 January 2010

Coordinate ascent and inverted indices...

Due to a small off-the-radar project I'm working on right now, I've been building my own inverted indices. (Yes, I'm vaguely aware of discussions in DB/Web search land about whether you should store your inverted indices in a database or whether you should handroll your own. This is tangential to the point of this post.)

For those of you who don't remember your IR 101, here's the deal with inverted indices. We're a search engine and want to be able to quickly find pages that contain query terms. One way of storing our set of documents (e.g., the web) is to store a list of documents, each of which is a list of words appearing in that document. If there are N documents of average length L, then answering a query is O(N*L), since we have to look over each document to see if it contains the word we care about. The alternative is to store an inverted index, where we have a list of words and, for each word, the list of documents it appears in. Answering a query here is something like O(1) if we hash the words, O(log |V|) if we do binary search (V = vocabulary), etc. Why it's called an inverted index is beyond me: it's really just like the index you find at the back of a textbook. And the computation difference is like trying to find mentions of "Germany" in a textbook by reading every page and looking for "Germany" versus going to the index in the back of the book.
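To make that contrast concrete, here's a minimal sketch in Python of the two layouts. The toy documents and the function names (find_forward, build_inverted_index) are mine, purely for illustration:

    from collections import defaultdict

    # Forward layout: a list of documents, each a list of words.
    docs = [
        ["germany", "exports", "cars"],    # doc 0
        ["france", "imports", "cars"],     # doc 1
        ["germany", "borders", "france"],  # doc 2
    ]

    def find_forward(docs, word):
        # O(N*L): scan every word of every document.
        return [i for i, doc in enumerate(docs) if word in doc]

    def build_inverted_index(docs):
        # Invert once: word -> list of documents containing it.
        index = defaultdict(list)
        for i, doc in enumerate(docs):
            for word in set(doc):
                index[word].append(i)
        return index

    index = build_inverted_index(docs)
    print(find_forward(docs, "germany"))  # [0, 2], by scanning everything
    print(index["germany"])               # [0, 2], by one hash lookup

The lookup at the end is the whole point: once the index is built, answering "which documents mention Germany?" never touches documents that don't.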

Now, let's say we have an inverted index for, say, the web. It's pretty big (and in all honesty, probably distributed across multiple storage devices or multiple databases or whatever). But regardless, a linear scan of the index would give you something like: here's word 1 and here are the documents it appears in; here's word 2 and its documents; ...; here's word V and its documents.

Suppose that, outside of the index, we have a classification task over the documents on the web. That is, for any document, we can (efficiently -- say O(1) or O(log N)) get the "label" of this document. It's either +1, -1 or ? (? == unknown, or unlabeled).

My argument is that this is a very plausible setup for a very large-scale problem.

Now, if we're trying to solve this problem, doing a "common" optimization like stochastic (sub)gradient descent is just not going to work, because it would require us to iterate over documents rather than over words (where I'm assuming words == features, for now...), and the index only gives us efficient access by word: reconstructing each document from the index would be ridiculously expensive.

The alternative is to do some sort of coordinate ascent algorithm. These actually used to be quite popular in maxent land, and, in particular, Joshua Goodman had a coordinate ascent algorithm for maxent models that apparently worked quite well. (In searching for that paper, I just came across a 2009 paper on roughly the same topic that I hadn't seen before.)

Some other algorithms have a coordinate ascent feel, for instance the LASSO (and relatives, including the Dantzig selector + LASSO = DASSO), but they wouldn't really scale well on this problem because they require a full pass over the entire index to make a single update. Other approaches, such as boosting, would fare very poorly in this setting.
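Just to make the posting-list-at-a-time idea concrete, here's a rough sketch of what coordinate ascent for logistic regression might look like on top of an inverted index. This is only an illustration under assumptions I'm making up: binary word features, a label oracle get_label that returns +1, -1 or None, and an arbitrary fixed step size. It is not Goodman's algorithm or anything anyone has tuned:

    import math
    from collections import defaultdict

    def coordinate_ascent(index, get_label, epochs=10, step=0.1):
        # index: word -> list of doc ids containing that word (binary features).
        # get_label: doc id -> +1, -1, or None (unlabeled).
        # Each coordinate update reads exactly one posting list; we never
        # have to reconstruct a document from the index.
        w = defaultdict(float)       # one weight per word
        scores = defaultdict(float)  # cached w . x for each document
        for _ in range(epochs):
            for word, postings in index.items():  # one linear scan of the index
                grad = 0.0
                for d in postings:
                    y = get_label(d)
                    if y is None:
                        continue                  # skip unlabeled documents
                    grad += y / (1.0 + math.exp(y * scores[d]))
                delta = step * grad               # ascent on the log-likelihood
                w[word] += delta
                for d in postings:                # keep cached scores in sync
                    scores[d] += delta
        return w

The point is just that the access pattern matches the storage: each update touches one word's posting list, so a full pass over all features is exactly one linear scan of the index.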

This observation first led me to wonder if we can do something LASSO- or boosting-like in this setting. But then that made me wonder if this is a special case, or whether there are other cases in the "real world" where your data is naturally laid out as features * data points rather than data points * features. Sadly, I cannot think of any. But perhaps that's not because there aren't any.

(Note that I also didn't really talk about how to do semi-supervised learning in this setting... this is also quite unclear to me right now!)

26 January 2010

A machine learner's apology

Andrew Gelman recently announced an upcoming talk by John Lafferty. This reminded me of a post I've been meaning to write for ages (years, really) but haven't gotten around to. Well, now I'm getting around to it.

A colleague from Utah (not in ML) went on a trip and spent some time talking to a computational statistician, who will remain anonymous. But let's call this person Alice. The two were talking about various topics and at one point machine learning came up. Alice commented:

"Machine learning is just non-rigorous computational statistics."
Or something to that effect.

A first reaction is to get defensive: no it's not! But Alice has a point. Some subset of machine learning, in particular the more Bayesian side, tends to overlap quite a bit with compstats, so much so that in some cases the two are probably not really distinguishable. (That is to say, there's a subset of ML that's very very similar to a subset of compstats... you could probably fairly easily find some antipoles that are amazingly different.)

And there's a clear intended interpretation to the comment: it's not that we're not rigorous, it's that we're not rigorous in the way that computational statisticians are. To that end, let me offer a glib retort:
Computational statistics is just machine learning where you don't care about the computation.
In much the same way that I think Alice's claim is true, I think this claim is also true. The part of machine learning that's really strong on the CS side cares a lot about computation: how long, how much space, how many samples, etc., it will take to learn something. This is something I've rarely seen in compstats, where the big questions really have to do with things like: is this distribution well defined, can I sample from it, etc. -- now let me run Metropolis-Hastings. (Okay, I'm still being glib.)

I recently saw a discussion on a theory blog saying that STOC/FOCS is about "THEORY of algorithms" while SODA is about "theory of ALGORITHMS," or something like that. (Given the capitalization, perhaps it was Bill Gasarch :)?) Likewise, I think it's fair to say that classic ML is "MACHINE learning" or "COMPUTATIONAL statistics" and classic compstats is "machine LEARNING" or "computational STATISTICS." We're really working on very similar problems, but the things that we value tend to be different.

Given that, I've always found it odd that there's not more interaction between compstats and ML. Sure, there's some... but not very much. Maybe it's cultural, maybe it's institutional (conferences versus journals), maybe we really know everything we need to know about the other side and talking wouldn't really get us anywhere. But if it's just a sense of "I don't like you because you're treading on my field," then that's not productive (in either direction), even if it is an initial gut reaction.

19 January 2010

Interviewing Follies

Continuing on the theme of applying for jobs, I thought I'd share some interviewing follies that have happened to me, that I've observed others do, and that I've heard about. There is a moral to this story; if you want to skip the stories and get to the moral, scroll down past the numbered list.

  1. Missing your plane. I had an interview in a place that was about a 1-2 hour flight away. I flew out first thing in the morning and back last thing at night. Except I didn't fly out first thing in the morning: I missed my flight. Why? Because I cut flights close (someone once said "if you've never missed a flight, you're spending too much time in the airport") and the particular flight I was on left not out of a normal gate, but out of one of those gates you have to take a shuttle bus to. I didn't know that, didn't factor in the extra 5 minutes, and missed the flight. I called the place I was interviewing at, meetings got re-arranged, and the day proceeded with a small hiccup.

    I ended up getting an offer from this place.
  2. Missing a meeting. I was interviewing at a different place, going through my daily meetings, when I got really tired and misread my schedule. I thought I was done when in fact I had one meeting to go. I caught a cab to the airport, flew back home, and later noticed a few frantic emails trying to figure out where I was (this was before I had an email-capable cell phone). (As an aside, someone once told me that they would intentionally skip meetings on interview days with people outside their area, figuring that neither the candidate nor the interviewer really wanted such a meeting. They would hang out in the restroom or something, and blame the miss on a previous meeting running long. This was not the case in my setting.)

    I did not end up getting an offer from this place.
  3. Someone interviewing here a long time ago was scheduled to give their talk using transparencies. Two minutes before the talk they realized that they had left them in the hotel room. The already-assembled audience was asked to stay put, the speaker was quickly driven to the hotel and back, and proceeded to give one of the best interview talks on record here.

    This person ended up getting a job offer.
  4. Someone interviewing somewhere I know left their laptop in their hotel, just like number 3. But instead of having their host drive them back to the hotel, they borrowed someone's car to drive back to the hotel. They crashed the car, but managed to get their laptop, and gave a great talk.

    This person ended up getting a job offer.
  5. I flew in late to an interview, getting to my hotel around midnight. I woke up the next morning at seven for an 8:30 breakfast meeting. I unpacked my suit, tie, belt, socks and shoes. And realized I had forgotten to pack a dress shirt. All I had was the shirt I had worn on the plane: a red graphic tee shirt. My mind raced to figure out whether there was anywhere I could buy a dress shirt in the middle of nowhere at 7am. I quickly realized that it was impossible, wore my tee shirt under my suit jacket, and went through the day as if that was how I had planned it.

    I ended up getting a job offer.
The moral of this story is that bad things happen during interviews. I can't compare any of my stories to the crash-the-car story, but we've all been there, done stupid things, and gotten through it unscathed. I think the trick is to pretend like it was intentional, or at least not get flustered. Yes I missed my flight, yes I forgot my shirt, yes I crashed your car. But it doesn't affect the rest of my day. You have to be able to relax and forgive yourself minor mistakes: the interviewers really are looking at the bigger picture.

That said, there are definitely things you can do to botch an interview. They have to do with things like giving a bad talk (my experience is that a huge weight is placed on the quality of your job talk) and not having a clear vision for where you're going in research life. Don't be afraid to disagree with people you're talking to: we usually aren't trying to hire yes-men or yes-women. Once you get a job, especially a faculty job, you are the one who is going to make things happen for you. You have a world view and that's part of what we're hiring. Let us see it. We might not always agree, but if you have reasons for your view that you can articulate, we'll listen.

But don't focus on the little things, and don't get flustered.

04 January 2010

arXiv and NLP, ML and Computer Science

Arxiv is something of an underutilized resource in computer science. Indeed, many computer scientists seem not to even know it exists, despite its having been around for nearly two decades now! On the other hand, it is immensely popular among (some branches of) mathematics and physics. This used to strike me as odd: arxiv is a computational service, so why haven't computer scientists jumped on it? Indeed, I spent a solid day a few months ago putting all my (well, almost all my) papers on arxiv. One can always point to "culture" for such things, but I suspect there are more rational reasons why it hasn't taken hold with us as much as it has with others.

I ran into arxiv first when I was in math land. The following is a cartoon view of how (some branches of) math research gets published:

  1. Authors write a paper
  2. Authors submit paper to a journal
  3. Authors simultaneously post paper on arxiv
  4. Journal publishes (or doesn't publish) paper
We can contrast this with how life goes in CS land:
  1. Conference announces deadline
  2. One day before deadline, authors write a paper
  3. Conference publishes (or rejects) paper
I think there are a few key differences that matter. Going back to the mathematician model, we can ask ourselves: why do they do #3? It's a way to get the results out without having to wait for a journal to come back with a go/no-go response. Basically, in the mathematician model, arxiv is used for advertising while the journal is used for a stamp of approval (or correctness).

So then why don't we do arxiv too? I think there are two reasons. First, we think that conference turnaround is good enough -- we don't need anything faster. Second, it completely screws up our notions of blind review. If everyone simultaneously posted a paper on arxiv when submitting to a conference, we could no longer claim, at all, to be blind. (Please, I beg of you, do not start commenting about blind review versus non-blind review -- I hate this topic of conversation and it never goes anywhere!) Basically, we rely on our conferences to do both the advertising and the stamp of approval. Of course, the speed advantage of conferences is offset by the fact that you sometimes have to go through two or three of them before your paper gets in, which can make them as slow as, or slower than, journals.

In a sense, I think that largely because of the blind thing, and partially because conferences tend to be faster than journals, the classic usage of arxiv is not really going to happen in CS.

(There's one other potential use for arxiv, which I'll refer to as the tech-report effect. I've many times seen short papers posted on people's web pages either as tech reports or as unpublished documents. I don't mean tutorial-like things, like I have, but rather real semi-research papers. These are papers that contain a nugget of an idea, but for which the authors seem unwilling to go all the way to "make it work." One could imagine posting such things on arxiv. Unfortunately, I really dislike such papers. It's very much a "flag planting" move in my opinion, and it makes life difficult for people who follow. That is, if I have an idea that's in someone else's semi-research paper, do I need to cite them? Ideas are a dime a dozen: making it work is often the hard part. I don't think you should get to flag plant without going through the effort of making it work. But that's just me.)

However, there is one purpose that arxiv could serve that I think would be quite valuable: literally, as an archive. Right now, ACL has the ACL Anthology. UAI has its own repository. ICML has a rather sad state of affairs where, from what I can tell, papers from ICML #### are just on the ICML #### web page, and if that happens to go down, oh well. All of these things could equally well be hosted on arxiv, which has strong government support for long-term sustainability, is open access, blah blah blah.

This brings me to a question for you all: how would you feel if all (or nearly all) ICML papers were to be published on arxiv? That is, if your paper is accepted, instead of uploading a camera-ready PDF to the ICML conference manager website, you would upload it to arxiv and then send your arxiv link to the ICML folks.

How do you feel about arxiving ICML?
  1. No, please don't put my paper on arxiv.
  2. I'm happy to have my paper on arxiv, but you should do it for me!
  3. I'm happy to upload my paper to arxiv.

Obviously there are some constraints, so there would need to be an opt-out policy, but I'm curious how everyone feels about this....