If you compare vision research with NLP research, there are a lot of interesting parallels. Like we both like linear models. And conditional random fields. And our problems are a lot harder than binary classification. And there are standard data sets that we've been evaluating on for decades and continue to evaluate on (I'm channeling Bob here :P).
But there's one thing that happens, the difference of which is so striking, that I'd like to call it to center stage. It has to do with "messing with our inputs."
I'll spend a bit more time describing the vision approach, since it's probably less familiar to the average reader. Suppose I'm trying to handwriting recognition to identify digits from zero to nine (aka MNIST). I get, say, 100 labeled zeros, 100 labeled ones, 100 labeled twos and so on. So a total of 1000 data points. I can train any off the shelf classifier based on pixel level features and get some reasonable performance (maybe 80s-90s, depending).
Now, I want to insert knowledge. The knowledge that I want to insert is some notion of invariance. I.e., if I take an image of a zero and translate it left a little bit, it's still a zero. Or up a little bit. Of if I scale it up 10%, it's still a zero. Or down 10%. Or if I rotate it five degrees. Or negative five. All zeros. Same hold for all the other digits.
One way to insert this knowledge is to muck with the learning algorithm. That's too complicated for me: I want something simpler. So what I'll do is take my 100 zeros and 100 ones and so on and just manipulate them a bit. That is, I'll sample a random zero, and apply some small random transformations to it, and call it another labeled example, also a zero. Now I have 100,000 training points. I train my off the shelf classifier based on pixel level features and get 99% accuracy or more. The same trick works for other vision problem (eg., recognizing animals). (This process is so common that it's actually described in Chris Bishop's new-ish PRML book!)
This is what I mean by small changes (to the input) begetting good example. A slightly transformed zero is still a zero.
Of course, you have to be careful. If you rotate a six by 180 degrees, you get a nine. If you rotate a cat by 180 degrees, you get an unhappy cat. More seriously, if you're brave, you might start looking at a class of transformations called diffeomorphisms, which are fairly popular around here. These are nice because of their nice mathematical properties, but un-nice because they can be slightly too flexible for certain problems.
Now, let's go over to NLP land. Do we ever futz with our inputs?
Sure!
In language modeling, we'll sometimes permute words or replace one word with another to get a negative example. Noah Smith futzed with his inputs in contrastive estimation to produce negative examples by swapping adjacent words, or deleting words.
In fact, try as I might, I cannot think of a single case in NLP where we make small changes to an input to get another good input: we always do it to get a bad input!
In a sense, this means that one thing that vision people have that we don't have is a notion of semantics preserving transformations. Sure, linguists (especially those from that C-guy) study transformations. And there's a vague sense that work in paraphrasing leads to transformations that maintain semantic equivalence. But the thing is that we really don't know any transformations that preserve semantics. Moreover, some transformations that seem benign (eg., passivization) actually are not: one of my favorite papers at NAACL this year by Greene and Resnik showed that syntactic structure affects sentiment (well, them, drawing on a lot of psycholinguistics work)!
I don't have a significant point to this story other than it's kind of weird. I mentioned this to some people at ICML and got a reaction that replacing words with synonyms should be fine. I remember doing this in high school, when word processors first started coming with thesauri packed in. The result seemed to be that if I actually knew the word I was plugging in, life was fine... but if not, it was usually a bad replacement. So this seems like something of a mixed bag: depending on how liberal you are with defining "synonym" you might be okay do this, but you might also not be.
05 July 2009
Small changes beget good or bad examples?
Posted by
hal
at
7/05/2009 07:55:00 AM
| 13
comments
Links to this post
Labels: linguistics, machine learning
30 June 2009
ICML/COLT/UAI 2009 retrospective
This will probably be a bit briefer than my corresponding NAACL post because even by day two of ICML, I was a bit burnt out; I was also constantly swapping in other tasks (grants, etc.). Note that John has already posted his list of papers.
- #317: Multi-View Clustering via Canonical Correlation Analysis (Chaudhuri, Kakade, Livescu, Sridharan). This paper shows a new application of CCA to clustering across multiple views. They use some wikipedia data in experiments and actually prove something about the fact that (under certain multi-view-like assumptions), CCA does the "right thing."
- #295: Learning Nonlinear Dynamic Models (Langford, Salakhutdinov,, Zhang). The cool idea here is to cut a deterministic classifier in half and use its internal state as a sort of sufficient statistic. Think about what happens if you represent your classifier as a circuit (DAG); then anywhere you cut along the circuit gives you a sufficient representation to predict. To avoid making circuits, they use neural nets, which have an obvious "place to cut" -- namely, the internal nodes.
- #364: Online Dictionary Learning for Sparse Coding (Mairal, Bach, Ponce, Sapiro). A new approach to sparse coding; the big take-away is that it's online and fast.
- 394: MedLDA: Maximum Margin Supervised Topic Models for Regression and Classification (Zhu, Ahmed, Xing). This is a very cute idea for combining objectives across topic models (namely, the variational objective) and classification (the SVM objective) to learn topics that are good for performing a classification task.
- #393: Learning from Measurements in Exponential Families (Liang, Jordan, Klein). Suppose instead of seeing (x,y) pairs, you just see some statistics on (x,y) pairs -- well, you can still learn. (In a sense, this formalizes some work out of the UMass group; see also the Bellare, Druck and McCallum paper at UAI this year.)
- #119: Curriculum Learning (Bengio, Louradour, Collobert, Weston). The idea is to present examples in a well thought-out order rather than randomly. It's a cool idea; I've tried it in the context of unsupervised parsing (the unsearn paper at ICML) and it never helped and often hurt (sadly). I curriculum-ified by sentence length, though, which is maybe not a good model, especially when working with WSJ10 -- maybe using vocabulary would help.
- #319: A Stochastic Memoizer for Sequence Data (Wood, Archambeau, Gasthaus, James, Whye Teh). If you do anything with Markov models, you should read this paper. The take away is: how can I learn a Markov model with (potentially) infinite memory in a linear amount of time and space, and with good "backoff" properties. Plus, there's some cool new technology in there.
- A Uniqueness Theorem for Clustering Reza Bosagh Zadeh, Shai Ben-David. I already talked about this issue a bit, but the idea here is that if you fix k, then the clustering axioms become satisfiable, and are satisfied by two well known algorithms. Fixing k is a bit unsatisfactory, but I think this is a good step in the right direction.
- Convex Coding David Bradley, J. Andrew Bagnell. The idea is to make coding convex by making it infinite! And then do something like boosting.
- On Smoothing and Inference for Topic Models Arthur Asuncion, Max Welling, Padhraic Smyth, Yee Whye Teh. If you do topic models, read this paper: basically, none of the different inference algorithms do any better than the others (perplexity-wise) if you estimate hyperparameters well. Come are, of course, faster though.
- Correlated Non-Parametric Latent Feature Models Finale Doshi-Velez, Zoubin Ghahramani. This is an indian-buffet-process-like model that allows factors to be correlated. It's somewhat in line with our own paper from NIPS last year. There's still something a bit unsatisfactory in both our approach and their approach that we can't do this "directly."
- Domain Adaptation: Learning Bounds and Algorithms. Yishay Mansour, Mehryar Mohri and Afshin Rostamizadeh. Very good work on some learning theory for domain adaptation based on the idea of stability.
Posted by
hal
at
6/30/2009 06:50:00 AM
| 2
comments
Links to this post
Labels: conferences
25 June 2009
Should there be a shared task for semi-supervised learning in NLP?
(Guest post by Kevin Duh. Thanks, Kevin!)
At the NAACL SSL-NLP Workshop recently, we discussed whether there ought to be a "shared task" for semi-supervised learning in NLP. The panel discussion consisted of Hal Daume, David McClosky, and Andrew Goldberg as panelists and audience input from Jason Eisner, Tom Mitchell, and many others. Here we will briefly summarize the points raised and hopefully solicit some feedback from blog readers.
Three motivations for a shared task
A1. Fair comparison of methods: A common dataset will allow us to compare different methods in an insightful way. Currently, different research papers use different datasets or data-splits, making it difficult to draw general intuitions from the combined body of research.
A2. Establish a methodology for evaluating SSLNLP results: How exactly should a semi-supervised learning method be evaluated? Should we would evaluate the same method for both low-resource scenarios (few labeled points, many unlabeled points) and high-resource scenarios (many labeled points, even more unlabeled points)? Should we evaluate the same method under different ratios of labeled/unlabeled data? Currently there is no standard methodology for evaluating SSLNLP results, which means that the completeness/quality of experimental sections in research papers varies considerably.
A3. Encourage more research in the area: A shared task can potentially lower the barrier of entry to SSLNLP, especially if it involves pre-processed data and community support network. This will make it easier for researchers in outside fields, or researchers with smaller budgets to contribute their expertise to the field. Furthermore, a shared task can potentially direct the community research efforts in a particular subfield. For example, "online/lifelong learning for SSL" and "SSL as joint inference of multiple tasks and heterogeneous labels" (a la Jason Eisner's keynote) were identified as new, promising areas to focus on in the panel discussion. A shared task along those lines may help us rally the community behind these efforts.
Arguments against the above points
B1. Fair comparison: Nobody really argues against fair comparison of methods. The bigger question, however, is whether there exist a *common* dataset or task where everyone is interested in. At the SSLNLP Workshop, for example, we had papers in a wide range of areas ranging from information extraction to parsing to text classification to speech. We also had papers where the need for unlabeled data is intimately tied in to particular components of a larger system. So, a common dataset is good, but what dataset can we all agree upon?
B2. Evaluation methodology: A consistent standard for evaluating SSLNLP results is nice to have, but this can be done independently from a shared task through, e.g. an influential paper or gradual recognition of its importance by reviewers. Further, one may argue: what makes you think that your particular evaluation methodology is the best? What makes you think people will adopt it generally, both inside and outside of the shared task?
B3. Encourage more research: It is nice to lower the barriers to entry, especially if we have pre-processed data and scripts. However, it has been observed in other shared tasks that often it is the pre-processing and features that matter most (more than the actual training algorithm). This presents a dilemma: If the shared task pre-processes the data to make it easy for anyone to join, will we lose the insights that may be gained via domain knowledge? On the other hand, if we present the data in raw form, will this actually encourage outside researchers to join the field?
Rejoinder
A straw poll at the panel discussion showed that people are generally in favor of looking into the idea of a shared task. The important question is how to make it work, and especially how to address counterpoints B1 (what task to choose) and B3 (how to prepare the data). We did not have enough time during the panel discussion to go through the details, but here are some suggestions:
- We can view NLP problems as several big "classes" of problems: sequence prediction, tree prediction, multi-class classification, etc. In choosing a task, we can pick a representative task in each class, such as name-entity recognition for sequence prediction, dependency parsing for tree prediction, etc. This common dataset won't attract everyone in NLP, but at least it will be relevant for a large subset of researchers.
- If participants are allowed to pre-process their own data, the evaluation might require participant to submit a supervised system along with their semi-supervised system, using the same feature set and setup, if possible. This may make it easier to learn from results if there are differences in pre-processing.
- There should also be a standard supervised and semi-supervised baseline (software) provided by the shared task organizer. This may lower the barrier of entry for new participants, as well as establish a common baseline result.
Posted by
hal
at
6/25/2009 02:40:00 PM
| 8
comments
Links to this post
Labels: community, conferences, machine learning
20 June 2009
Why I Don't Buy Clustering Axioms
In NIPS 15, Jon Kleinberg presented some impossibility results for clustering. The idea is to specify three axioms that all clustering functions should obey and examine those axioms.
Let (X,d) be a metric space (so X is a discrete set of points and d is a metric over the points of X). A clustering function F takes d as input and produces a clustering of the data. The three axioms Jon states that all clustering functions should satisfy are:
- Scale invariance: For all d, for all a>0, F(ad) = F(d). In other words, if I transform all my distances by scaling them uniformly, then the output of my clustering function should stay the same.
- Richness: The range of F is the set of all partitions. In other words, there isn't any bias that prohibits us from producing some particular partition.
- Consistency: Suppose F(d) gives some clustering, C. Now, modify d by shrinking distances within clusters of C and expanding distances between clusters in C. Call the new metric d'. Then F(d') = C.
If you think about these axioms a little bit, they kind of make sense. My problem is that if you think about them a lot of bit, you (or at least I) realize that they're broken. The biggest broken one is consistency, which becomes even more broken when combined with scale invariance.
What I'm going to do to convince you that consistency is broken is start with some data in which there is (what I consider) a natural clustering into two clusters. I'll then apply consistency a few times to get something that (I think) should yield a different clustering.
Let's start with some data. The colors are my opinion as to how the data should be clustered:
I hope you agree with my color coding. Now, let's apply consistency. In particular, let's move some of the red points, only reducing inter-clustering distances. Formally, we find the closest pair of points and move things toward those.
The arrows denote the directions these points will be moved. To make the situation more visually appealing, let's move things into lines:
Okay, this is already looking funky. Let's make it even worse. Let's apply consistency again and start moving some blue points:
Again, let's organize these into a line:
And if I had given you this data to start with, my guess is the clustering you'd have come up with is more like:
This is a violation of consistency.So, what I'd like someone to do is to argue to my why consistency is actually a desirable property.
I can come up with lots of other examples. One reason why this invariance is bad is because it renders the notion of "reference sizes" irrelevant. This is of course a problem if you have prior knowledge (eg., one thing measured in millimeters, the other in kilometers). But even in the case where you don't know knowledge, what you can do is take the following. Take data generated by thousands of well separated Gaussians, so that the clearly right thing to do is have one cluster per Gaussian. Now, for each of these clusters except for one, shrink them down to single points. This is possible by consistency. Now, your data basically looks like thousands-1 of clusters with zero inter-cluster distances and then one cluster that's spread out. But now it seems that the reasonable thing is to put each data point that was in this old cluster into its own cluster, essentially because I feel like the other data shows you what clusters should look like. If you're not happy with this, apply scaling and push these points out super far from each other. (I don't think this example is as compelling as the one I drew in pictures, but I still think it's reasonable.
Now, in the UAI paper this year, they show that if you fix the number of clusters, these axioms are now consistent. (Perhaps this has to do with the fact that all of my "weird" examples change the number of clusters -- though frankly I don't think this is necessary... I probably could have arranged it so that the resulting green and blue clusters look like a single line that maybe should just be one cluster by itself.) But I still feel like consistency isn't even something we want.
(Thanks to the algorithms group at Utah for discussions related to some of these issues.)
UPDATE 20 JUNE 2009, 3:49PM EST
Here's some data to justify the "bad things happen even when the number of clusters stays the same" claim.
Start with this data:

Now, move some points toward the middle (note they have to spread to the side a bit so as not to decrease intra-cluster distances).

Yielding data like the following:

Now, I feel like two horizontal clusters are most natural here. But you may disagree. What if I add some more data (ie., this is data that would have been in the original data set too, where it clearly would have been a third cluster):

And if you still disagree, well then I guess that's fine. But what if there were hundreds of other clusters like that.
I guess the thing that bugs me is that I seem to like clusters that have similar structures. Even if some of these bars were rotated arbitrarily (or, preferably, in an axis-aligned manner), I would still feel like there's some information getting shared across the clusters.
Posted by
hal
at
6/20/2009 09:43:00 AM
| 10
comments
Links to this post
Labels: machine learning
14 June 2009
NL Generation: A new problem
Those who talk to me a lot over the years know that I really think that generation is a cool and interesting problem, but one that is hampered by a lack of clarity of what it is, or at least what the input/output is. It's like the problem with defining summarization, but one hundred times worse.
I have no idea how it came up. I think I was talking to a bunch of people from Microsoft Research and we were talking about translation on the web and what not. And how this is not the business model of Language Weaver. And how when you're on the web you make money by advertising.
And voila! The most incredible NL Generation task occurred to me! (I apologize if you ran in to me at all during NAACL because this was pretty much all I could talk about :P.) The initial idea for the task was embedded in MT, though it needn't be. But I'll tell it in the MT setting.
So I go to some web page in some weirdo language (say, French) that I don't understand (because I was a moron and took Latin instead of French or Spanish in middle school and high school). So I ask my favorite translation system (Google or Microsoft or Babelfish or whatever) to translate it. However, the translation system takes certain liberties with the translation. In particular, it might embed a few "product placements" in the text. For instance, maybe it's translating "Je suis vraiment soif" into English (if this is incorrect, blame Google). And perhaps it decides that instead of translating this as "I'm really thirsty," it will translate it as "I'm really thirsty for a Snapple," or "I'm really thirsty: I could go for a Snapple," perhaps with a link to snapple.com.
Product placement is all over the place, even in America where it's made fun of and kept a bit at bay. Not so in China: any American remotely turned off by the Coca-cola cups from which the judges on American Idol drink twice a week would be appalled by the ridiculous (my sentiment) product placement that goes on over there. The presence of the link would probably give away that it's an ad, though of course you could leave this off.
But why limit this to translation. You could put such a piece of technology directly on blogger, or on some proxy server that you can go through to earn some extra cash browsing the web (thanks to someone -- can't remember who -- at NAACL for this latter suggestion). I mean, you could just have random ads pop up in the middle of text on any web page, for instance one you could host on webserve.ca!
(See what I did there?)
So now here's a real generation problem! You're given some text. And maybe you're even given adwords or something like that, so you can assume that the "select which thing to advertise" problem has been solved. (Yes, I know it's not.) Your job is just to insert the ad in the most natural way in the text. You could evaluate in at least two ways: click through (as is standard in a lot of this advertising business) and human judgments of naturalness. I think the point of product placement is to (a) get your product on the screen more or less constantly, rather than just during commercial breaks which most people skip anyway, and (b) perhaps appeal to people's subconscious. I don't know. My parents (used to) do advertising like things but I know next to nothing about that world.
Okay, so this is slightly tongue in cheek, but not entirely. And frankly, I wouldn't be surprised if something like it were the norm in five years. (If you want to get more fancy, insert product placements into youtube videos!)
Posted by
hal
at
6/14/2009 09:14:00 PM
| 9
comments
Links to this post
Labels: problems
12 June 2009
NAACL-HLT 2009 Retrospective
I hope this post will be a small impetus to get other people to post comments about papers they saw at NAACL (and associated workshops) that they really liked.
As usual, I stayed for the whole conference, plus workshops. As usual, I also hit that day -- about halfway through the first workshop day -- where I was totally burnt out and was wondering why I always stick around for the entire week. That's not to say anything bad about the workshops specifically (there definitely were good ones going on, in fact, see some comments below), but I was just wiped.
Anyway, I saw a bunch of papers and missed even more. I don't think I saw any papers that I actively didn't like (or wondered how they got in), including short papers, which I think is fantastic. Many thanks to all the organizers (Mari Ostendorf for organizing everything, Mike Collins, Lucy Vanderwende, Doug Oard and Shri Narayanan for putting together a great program, James Martin and Martha Palmer for local arrangements -- which were fantastic -- and all the other organizers who sadly we -- i.e., the NAACL board -- didn't get a chance to thank publicly).
Here are some things I thought were interesting:
- Classifier Combination Techniques Applied to Coreference Resolution (Vemulapalli, Luo, Pitrelli and Zitouni). This was a student research workshop paper; in fact, it was the one that I was moderating (together with Claire Cardie). The student author, Smita, performed this work while at IBM; though her main research is on similar techniques applied to really cool sounding problems in recognizing stuff that happens in the classroom. Somehow classifier combination, and general system combination, issues came up a lot at this conference (mostly in the hallways where someone was begrudgingly admitting to working on something as dirty as system combination). I used to think system combination was yucky, but I don't really feel that way anymore. Yes, it would be nice to have one huge monolithic system that does everything, but that's often infeasible. My main complaint with system combination stuff is that in many cases I don't really understand why it's helping, which means that unless it's applied to a problem I really care about (of which there are few), it's hard for me to take anything away. But I think it's interesting. Getting back to Smita's paper, the key thing she did to make this work is introduce the notion of alignments between different clusterings, which seemed like a good idea. The results probably weren't as good as they were hoping for, but still interesting. My only major pointers as a panelist were to try using different systems, rather than bootstrapped versions of the same system, and to take a look at the literature on consensus clustering, which is fairly relevant for this problem.
- Graph-based Learning for Statistical Machine Translation (Alexandrescu and Kirchhoff). I'd heard of some of this work before in small group meetings with Andrei and Kathrin, but this is the first time I'd seen the results they presented. This is an MT paper, but really it's about how to do graph-based semi-supervised learning in a structured prediction context, when you have some wacky metric (read: BLEU) on which you're evaluating. Computation is a problem, but we should just hire some silly algorithms people to figure this out for us. (Besides, there was a paper last year at ICML -- I'm too lazy to dig it up -- that showed how to do graph-based stuff on billions of examples.)
- Intersecting Multilingual Data for Faster and Better Statistical Translations (Chen, Kay and Eisele). This is a very simple idea that works shockingly well. Had I written this paper, "Frustrating" would probably have made it into the title. Let's say we want an English to French phrase table. Well, we do phrase table extraction and we get something giant and ridiculous (have you ever looked at those phrase pairs) that takes tons of disk space and memory, and makes translation slow (it's like the "grammar constant" in parsing that means that O(n^3) for n=40 is impractical). Well, just make two more phrase tables, English to German and German to French and intersect. And viola, you have tiny phrase tables and even slightly better performance. The only big caveat seems to be that they estimate all these things on Europarl. What if your data sets are disjoint: I'd be worried that you'd end up with nothing in the resulting phrase table except the/le and sometimes/quelquefois (okay, I just used that example because I love that word).
- Quadratic Features and Deep Architectures for Chunking (Turian, Bergstra and Bengio). I definitely have not drunk the deep architectures kool-aid, but I still think this sort of stuff is interesting. The basic idea here stems from some work Bergstra did for modeling vision, where they replaced a linear classifier(y = w'x) with a low rank approximation to a quadratic classifier (y = w'x + sqrt[(a'x)^2 + (b'x)^2 + ... ]). Here, the a,b,... vectors are all estimated as part of the learning process (eg., by stochastic gradient descent). If you use a dozen of them, you get some quadratic style features, but without the expense of doing, say, an implicit (or worse, explicit) quadratic kernel. My worry (that I asked about during the talk) is that you obviously can't initialize these things to zero or else you're in a local minimum, so you have to do some randomization and maybe that makes training these things a nightmare. Joseph reassured me that they have initialization methods that make my worries go away. If I have enough time, maybe I'll give it a whirl.
- Exploring Content Models for Multi-Document Summarization (Haghighi and Vanderwende). This combines my two favorite things: summarization and topic models. My admittedly biased view was they started with something similar to BayeSum and then ran a marathon. There are a bunch of really cool ideas in here for content-based summarization.
- Global Models of Document Structure using Latent Permutations (Chen, Branavan, Barzilay and Karger). This is a really cool idea (previously mentioned in a comment on this blog) based on using generalized Mallow's models for permutation modeling (incidentally, see a just-appeared JMLR paper for some more stuff related to permutations!). The idea is that documents on a similar topic (eg., "cities") tend to structure their information in similar ways, which is modeled as a permutation over "things that could be discussed." It's really cool looking, and I wonder if something like this could be used in conjunction with the paper I talk about below on summarization for scientific papers (9, below). One concern raised during the questions that I also had was how well this would work for things not as standardized as cities, where maybe you want to express preferences of pairwise ordering, not overall permutations. (Actually, you can do this, at least theoretically: a recent math visitor here, Mark Huber, has some papers on exact sampling from permutations under such partial order constraints using coupling from the past.) The other thing that I was thinking during that talk that I thought would be totally awesome would be to do a hierarchical Mallow's model. Someone else asked this question, and Harr said they're thinking about this. Oh, well... I guess I'm not the only one :(.
- Dan Jurafsky's invited talk was awesome. It appealed to me in three ways: as someone who loves language, as a foodie, and as an NLPer. You just had to be there. I can't do it justice in a post.
- More than Words: Syntactic Packaging and Implicit Sentiment (Greene and Resnik). This might have been one of my favorite papers of the conference. The idea is that how you say things can express your point of view as much as what you say. They look specifically at effects like passivization in English, where you might say something like "The truck drove into the crowd" rather than "The soldier drove the truck into the crowd." The missing piece here seems to be identifying the "whodunnit" in the first sentence. This is like figuring out subjects in languages that like the drop subjects (like Japanese). Could probably be done; maybe it has been (I know it's been worked on in Japanese; I don't know about English).
- Using Citations to Generate Surveys of Scientific Paradigms (Mohammad, Dorr, Egan, Hassan, Muthukrishan, Qazvinian, Radev and Zajic). I really really want these guys to succeed. They basically study how humans and machines create summaries of scientific papers when given either the text of the paper, or citation snippets to the paper. The idea is to automatically generate survey papers. This is actually an area I've toyed with getting in to for a while. The summarization aspect appeals to me, and I actually know and understand the customer very well. The key issue I would like to see addressed is how these summaries vary across different users. I've basically come to the conclusion that in summarization, if you don't pay attention to the user, you're sunk. This is especially true here. If I ask for a summary of generalization bound stuff, it's going to look very different than if Peter Bartlett asks for it.
- Online EM for Unsupervised Models (Liang and Klein). If you want to do online EM read this paper. On the other hand, you're going to have to worry about things like learning rate and batch size (think Pegasos). I was thinking about stuff like this a year or two ago and was wondering how this would compare to doing SGD on the log likelihood directly and not doing EM at all. Percy says that asymptotically they're the same, but who knows what they're like in the real world :). I think it's interesting, but I'm probably not going to stop doing vanilla EM.
I spent the first morning in the Computational Approaches to Linguistic Creativity workshop, which was just a lot of fun. I really liked all of the morning talks: if you love language and want to see stuff somewhat off the beaten path, you should definitely read these. I went by the Semantic Evaluation Workshop for a while and learned that the most frequent sense baseline is hard to beat. Moreover, there might be something to this discourse thing after all: Marine tells us that translators don't like to use multiple translations when one will do (akin to the one sense per discourse observation). The biggest question in my head here is how much the direction of translation matters (eg., when this heuristic is violated, is it violated by the translator, or the original author)? Apparently this is under investigation. But it's cool because it says that even MT people shouldn't just look at one sentence at a time!
Andrew McCallum gave a great, million-mile-an-hour invited talk on joint inference in CoNLL. I'm pretty interested in this whole joint inference business, which also played a big role in Jason Eisner's invited talk (that I sadly missed) at the semi-supervised learning workshop. To me, the big question is: what happens if you don't actually care about some of the tasks. In a probabilistic model, I suppose you'd marginalize them out... but how should you train? In a sense, since you don't care about them, it doesn't make sense to have a real loss associated with them. But if you don't put a loss, what are you doing? Again,in probabilistic land you're saved because you're just modeling a distribution, but this doesn't answer the whole question.
Al Aho gave a fantastically entertaining talk in the machine translation workshop about unnatural language processing. How the heck they managed to get Al Aho to give an invited talk is beyond me, but I suspect we owe Dekai some thanks for this. He pointed to some interesting work that I wasn't familiar with, both in raw parsing (eg., how to parse errorfull strings with a CFG when you want to find the closest in edit distance string that is parseable by a CFG) and natural language/programming language interfaces. (In retrospect, the first result is perhaps obvious had I actually thought much about it, though probably not so back in 1972: you can represent edit distance by a lattice and then parse the lattice, which we know is efficient.)
Anyway, there were other things that were interesting, but those are the ones that stuck in my head somehow (note, of course, that this list is unfairly biased toward my friends... what can I say? :P).
So, off to ICML on Sunday. I hope to see many of you there!
Posted by
hal
at
6/12/2009 03:30:00 PM
| 4
comments
Links to this post
Labels: conferences, papers
09 June 2009
Navigating ICML
I couldn't find a schedule for ICML that had paper titles/authors written in, so I joined the existing schedule with the abstracts to create a printable schedule with titles. You can find organizer-created schedules for UAI and COLT already. In addition, I've put ICML 2009 up on WhatToSee, so have fun! (I haven't done UAI and/or COLT because their papers haven't appeared online yet.)
Posted by
hal
at
6/09/2009 08:25:00 AM
| 1 comments
Links to this post
Labels: conferences
08 June 2009
The importance of input representations
As some of you know, I run a (machine learning) reading group every semester. This summer we're doing "assorted" topics, which basically means students pick a few papers from the past 24 months that are related and present on them. The week before I went out of town, we read two papers about inferring features from raw data; one was a deep learning approach; the other was more Bayesian. (As a total aside, I found it funny that in the latter paper they talk a lot about trying to find independent features, but in all cog sci papers I've seen where humans list features of objects/categories, they're highly dependent: eg., "has fur" and "barks" are reasonable features that humans like to produce that are very much not independent. In general, I tend to think that modeling things as explicitly dependent is a good idea.)
Papers like this love to use vision examples, I guess because we actually have some understanding of how the visual cortex words (from a neuroscience perspective), which we sorely lack for language (it seems much more complicated). They also love to start with pixel representations; perhaps this is neurologically motivated: I don't really know. But I find it kind of funny, primarily because there's a ton of information hard wired into the pixel representation. Why not feed .jpg and .png files directly into your system?
On the language side, an analogy is the bag of words representation. Yes, it's simple. But only simple if you know the language. If I handed you a bunch of text files in Arabic (suppose you'd never done any Arabic NLP) and asked you to make a bag of words, what would you do? What about Chinese? There, it's well known that word segmentation is hard. There's already a huge amount of information in a bag of words format.
The question is: does it matter?
Here's an experiment I did. I took the twenty newsgroups data (standard train/test split) and made classification data. To make the classification data, I took a posting, fed it through a module "X". "X" produced a sequence of tokens. I then extract n-gram features over these tokens and throw out anything that appears less than ten times. I then train a multiclass SVM on these (using libsvm). The only thing that varies in this setup is what "X" does. Here are four "X"s that I tried:
- Extract words. When composed with extracting n-gram tokens, this leads to a bag of words, bag of bigrams, bag of trigrams, etc., representation.
- Extract characters. This leads to character unigrams, character bigrams, etc.
- Extract bits from characters. That is, represent each character in its 8 bit ascii form and extract a sequence of zeros and ones.
- Extract bits from a gzipped version of the posting. This is the same as (3), but before extracting the data, we gzip the file.
As we can see, characters do well, even at the same bit sizes. Basically you get a ton of binary sequence features from raw bits that are just confusing the classifier. Zipped bits do markedly worse than raw bits. The reason the bit-based models don't extend further is because it started taking gigantic amounts of memory (more than my poor 32gb machine could handle) to process and train on those files. But 40 bits is about five characters, which is just over a word, so in theory the 40 bit models have the same information that the bag of words model (at 79% accuracy) has.So yes, it does seem that the input representation matters. This isn't shocking, but I've never seen anyone actually try something like this before.
Posted by
hal
at
6/08/2009 11:36:00 AM
| 4
comments
Links to this post
Labels: machine learning
01 June 2009
NAACL-HLT 2006 has been WhatToSee-d!
See here.
Posted by
hal
at
6/01/2009 09:23:00 AM
| 1 comments
Links to this post
Labels: conferences
30 May 2009
Semi-supervised or Semi-unsupervised? (SSL-NLP invited papers)
Kevin Duh not-so-recently asked me to write a "position piece" for the workshop he's co-organizing on semi-supervised learning in NLP. I not-so-recently agreed. And recently I actually wrote said position piece. You can also find a link off the workshop page. I hope people recognize that it's intentionally a bit tongue-in-cheek. If you want to discuss this stuff or related things in general, come to the panel at NAACL from 4:25 to 5:25 on 4 June at the workshop! You can read the paper for more information, but my basic point is that we can typically divide semi-supervised approached into one lump (semi-supervised) that work reasonably well with only labeled data and are just improved with unlabeled data and one lump (semi-unsupervised) that work reasonably well with only unlabeled data and are just improved with labeled data. The former are typically encode lots of prior information; the latter do not. Let's combine! (Okay, my claim is more nuanced than that, but that's the high-order bit.)
Posted by
hal
at
5/30/2009 10:18:00 PM
| 2
comments
Links to this post
Labels: machine learning
29 May 2009
How to reduce reviewing overhead?
It's past most reviewing time (for the year), so somehow conversations I have with folks I visit tend to gravitate toward the awfulness of reviewing. That is, there is too much to review and too much garbage among it (of course, garbage can be slightly subjective). Reviewing plays a very important role but is a very fallible system, as everyone knows, both in terms of precision and recall. Sometimes there even seems to be evidence of abuse.
But this post isn't about good reviewing and bad reviewing. This is about whether it's possible to cut down on the sheer volume of reviewing. The key aspect of cutting down reviewing, to me, is that in order to be effective, the reduction has to be significant. I'll demonstrate by discussing a few ideas that have come up, and some notes about why I think they would or wouldn't work:
- Tiered reviewing (this was done at ICML this year). The model at ICML was that everyone was guaranteed two reviews, and only a third if your paper was "good enough." I applaud ICML for trying this, but as a reviewer I found it useless. First, it means that at most 1/3 of reviews are getting cut (assuming all papers are bad), but in practice it's probably more like 1/6 that get reduced. This means that if on average a reviewer would have gotten six papers to review, he will now get five. First, this is a very small decrease. Second, it comes with an additional swapping overhead: effectively I now have to review for ICML twice, which makes scheduling a much bigger pain. It's also harder for me to be self-consistent in my evaluations.
- Reject without review (this was suggested to me at dinner last night: if you'd like to de-anonymize yourself, feel free in comments). Give area chairs the power that editors of journals have to reject papers out of hand. This gives area chairs much more power (I actually think this is a good thing: area chairs are too often too lazy in my experience, but that's another post), so perhaps there would be a cap on the number of reject without reviews. If this number is less that about 20%, then my reviewing load will drop in expectation from 5 to 4, which, again, is not a big deal for me.
- Cap on submissions (again, a suggestion from dinner last night): authors may only submit one paper to any conference on which their name comes first. (Yes, I know, this doesn't work in theory land where authorship is alphabetical, but I'm trying to address our issues, not someone else's.) I've only twice in my life had two papers at a conference where my name came first, and maybe there was a third where I submitted two and one was rejected (I really can't remember). At NAACL this year, there are four such papers; at ACL there are two. If you assume these are equally distributed (which is probably a bad assumption, since the people who submit multiple first author papers at a conference probably submit stronger papers), then this is about 16 submissions to NAACL and 8 submissions to ACL. Again, which is maybe 1-4% of submitted papers: again, something that won't really affect me as a reviewer (this, even less than the above two).
- Strong encouragement of short papers (my idea, finally, but with tweaks from others): right now I think short papers are underutilized, perhaps partially because they're seen (rightly or wrongly) as less significant than "full" papers. I don't think this need be the case. Short papers definitely take less time to review. A great "short paper tweak" that was suggested to me is to allow only 3 pages of text, but essentially arbitrarily many pages of tables/figures (probably not arbitrarily, but at least a few... plus, maybe just make data online). This would encourage experimental evaluation papers to be submitted as shorts (currently these papers typically just get rejected as being longs because they don't introduce really new ideas, and rejected as shorts because its hard to fit lots of experiments in four pages). Many long papers that appear in ACL could easily be short papers (I would guesstimate somewhere around 50%), especially ones that have the flavor of "I took method X and problem Y and solved Y with X (where both are known)" or "I took known system X, did new tweak Y and got better results." One way to start encouraging short papers is to just have an option that reviews can say something like "I will accept this paper as a short paper but not a long paper -- please rewrite" and then just have it accepted (with some area chair supervision) without another round of reviewing. The understanding would have to be that it would be poor form as an author to pull your paper out just because it got accepted short rather than accepted long, and so authors might be encouraged just to submit short versions. (This is something that would take a few years to have an effect, since it would be partially social.)
- Multiple reviewer types (an idea that's been in the ether for a while). The idea would be that you have three reviewers for each paper, but each serves a specific role. For instance, one would exclusively check technical details. The other two could then ignore these. Or maybe one would be tasked with "does this problem/solution make sense." This would enable area chairs (yes, again, more work for area chairs) to assign reviewers to do things that they're good at. You'd still have to review as many papers, but you wouldn't have to do the same detailed level of review for all of them.
- Require non-student authors on papers to review 3 times as many papers as they submit to any given conference, no exceptions ("three" because that's how many reviews they will get for any given paper). I know several faculty to follow the model of "if there is a deadline, we will submit." I don't know how widespread this is. The idea is that even half-baked ideas will get garner reviews that can help direct the research. I try to avoid offending people here, but that's what colleagues are for: please stop wasting my time as a reviewer by submitting papers like this. If they get rejected, you've wasted my time; if they get accepted, it's embarrassing for you (unless you take time by camera ready to make them good, which happens only some of the time). Equating "last author" = "senior person", there were two people at NAACL who have three papers and nine who have two. This means that these two people (who in expectation submitted 12 papers -- probably not true, probably more like 4 or 5) should have reviewed 12-15 papers. The nine should probably have reviewed 9-12 papers. I doubt they did. (I hope these two people know that I'm not trying to say they're evil in any way :P.) At ACL, there are four people with three papers (one is a dupe with a three from NAACL -- you know who you are!) and eight people with two. This would have the added benefit of having lots of reviews done by senior people (i.e., no crummy grad student reviews) with the potential downside that these people would gain more control over the community (which could be good or bad -- it's not a priori obvious that being able to do work that leads to many publications is highly correlated with being able to identify good work done by others).
- Make the job of the reviewer more clear. Right now, most reviews I read have a schizophrenic feel. On the one hand, the reviewer justifies his rating to the area chair. On the other, he provides (what he sees as) useful feedback to the authors. I know that in my own reviewing, I have cut down on the latter. This is largely in reaction to the "submit anything and everything" model that some people have. I'll typically give (what I hope is) useful feedback to papers I rate highly, largely because I have questions whose answers I am curious about, but for lower ranked papers (1-3), I tend to say things like "You claim X but your experiments only demonstrate Y." Rather than "[that] + ... and in order to show Y you should do Z." Perhaps I would revert to my old ways if I had less to review, but this was a choice I made about a year ago.
I actually think that together, some of these ideas could have a significant impact. For instance, I would imagine 2 and 4 together would probably cut a 5-6 paper review down to a 3-4 paper review, and doing 6 on top of this would probably take the average person's review load down maybe one more. Overall, perhaps a 50% reduction in number of papers to review, unless you're one of the types who submits lots of papers. I'd personally like to see it done!
Posted by
hal
at
5/29/2009 07:33:00 PM
| 13
comments
Links to this post
Labels: community, conferences
26 April 2009
Viterbi search, minimum Bayes risk and laplacian eigenmaps
I've been slow at posting because I wanted to post on this current topic for a while, but wanted to work out some more details before doing so. Well, it didn't happen. So I'm writing sans details.
Let's think for a minute about non-linear dimensionality reduction, aka manifold learning. If we compare what, say, ISOMAP does with what laplacian eigenmaps (LE) does, they're really quite similar. In both cases, we construct a graph over our data points, usually a kNN graph or something, sometimes with edge weights, sometimes not. We then attempt to embed the points in a low dimensional space to minimize some sort of distortion. In ISOMAP, the distortion is based on the shortest path distance between the two points in our constructed graph. In LE, the distance is measures according to the Laplacian of the graph. The notion of a Laplacian of a graph is basically a discrete version of the more standard notion of the differential operator by the same name (that comes up in most standard analysis courses). In continuous land, it is the divergence of the differential, which essentially means that measures some form of diffusion (and has its original applications in physics). In discrete land, where we live, it can be thought of as a measure of flow on a graph.
The key difference, then, between ISOMAP and LE is whether you measure distances according to shortest path or to flow, which has a very "average path" feel to it. The advantage to LE is that it tends to be less susceptible to noise, which is easy to understand when viewed this way.
Now, getting back to NLP stuff, we often find ourselves doing some sort of shortest path search. It's well known that the much loved Viterbi algorithm is exactly an application of dynamic programming to search in the lattice defined by an HMM. This extends in well known ways to other structures. Of course, Viterbi search isn't the only game in town. There are two other popular approaches to "decoding" in HMMs. One is marginal decoding, where we individually maximize the probability of each node. The other is minimum Bayes risk decoding. Here, we take a user-supplied risk (aka loss) function and try to find the output that minimizes the expected risk, where the probabilities are given by our current model. MBR has been shown to outperform Viterbi in many applications (speech, MT, tagging, etc.). If your risk (loss) function is 0/1 loss, then these recover the same solution.
What I'm curious about is whether this is a connection here. I'm not exactly sure how the construction would go -- I'm thinking something like a graph defined over the lattice vertices with edges that reflect the loss function -- but there definitely seems to be a similarity between MBR and average path, which is approximately equal to a Laplacian operation (aka spectral operation). The reason I think this would be interesting is because a lot is known about spectral computations and we might be able to use this knowledge to our advantage (coming up with MBR algorithms is sometimes a bit painful). An additional complication (or bit of fun, if you see it that way) is that there are at least three standard ways to generalize the notion of a Laplacian to a hypergraph, which is what we would really need to do. Perhaps we can help pick out the "right" one through the MBR connection.
Posted by
hal
at
4/26/2009 08:01:00 AM
| 8
comments
Links to this post
Labels: algorithms, machine learning, questions
24 April 2009
...and here it is for ACL
Papers are here. Words:
- translat (19)
- model (19)
- base (19)
- learn (16)
- semant (12)
- supervis (10)
- machin (10)
- depend (10)
- automat (10)
- word (9)
- pars (9)
- approach (8)
- system (7)
- relat (7)
- gener (7)
- web (6)
- unsupervis (6)
- train (6)
- languag (6)
- label (6)
- decod (6)
- align (6)
Posted by
hal
at
4/24/2009 08:59:00 AM
| 3
comments
Links to this post
Labels: conferences
23 April 2009
NAACL program up
See here. I see a bunch that I reviewed and really liked, so I'm overall pretty happy. (Though I also see one or two in the "other" category :P.)
At any rate, here are the top stemmed/stopped words with frequencies:
- model (30)
- base (19)
- translat (17)
- languag (16)
- speech (14)
- improv (14)
- statist (12)
- machin (12)
- unsupervis (11)
- recognit (10)
- pars (10)
- system (9)
- learn (9)
- word (8)
- depend (8)
- approach (8)
- spoken (7)
- semant (7)
- search (7)
- linear (7)
- extract (7)
Posted by
hal
at
4/23/2009 10:00:00 AM
| 4
comments
Links to this post
Labels: conferences
22 March 2009
Programming Language of Choice
Some of you know that I (at least used to be) a bit of a programming language snob. In fact, on several occasions, I've met (in NLP or ML land) someone who recognizes my name from PL land and is surprised that I'm not actually a PL person. My favorite story is after teaching machine learning for the second time, I had Ken Shan, a friend from my PL days, visit. I announced his visit and got an email from a student who had taken ML from me saying:
I _knew_ your name was familiar! I learned a ton about Haskell from your tutorial, for what's worth.. Great read back in my freshman year in college. (Belatedly) Thanks for writing it!
And it's not like my name is particularly common!
At any rate (and, admittedly, this is a somewhat an HBC-related question) I'm curious what programming language(s) other NLP folks tend to use. I've tried to include a subset of the programming language shootout list here that I think are most likely to be used, but if you need to write-in, feel free to do so in a comment. You can select as many as you would like, but please just try to vote for those that you actually use regularly, and that you actually use for large projects. Eg., I use Perl a lot, but only for o(100) line programs... so I wouldn't select Perl.
Posted by
hal
at
3/22/2009 09:13:00 AM
| 22
comments
Links to this post
07 March 2009
n-gram words an language Ordering model with
N-gram language models have been fairly successful at the task of distinguishing homophones, in the context of speech recognition. In machine translation (and other tasks, such as summarization, headline generation, etc.), this is not their job. Their job is to select fluent/grammatical sentences, typically ones which have undergone significant reordering. In a sense, they have to order words. A large part of the thesis of my academic sibling, Radu Soricut, had to do with exploring how well ngram language models can reorder sentences. Briefly, they don't do very well. This is something that our advisor, Daniel Marcu, likes to talk about when he gives invited talk; he shows a 15 word sentence and the preferred reorderings by a ngram LM and they're total hogwash, even though audience members can fairly quickly solve the exponential time problem of reordering the words to make a good sounding sentence. (As an aside, Radu found that if you add in a syntactic LM, things get better... if you don't want to read the whole thesis, just skip forward to section 8.4.2.)
Let's say we like ngram models. They're friendly for many reasons. What could we do to make them more word-order sensitive? I'm not claiming that none of these things have been tried; just that I'm not aware of them having been tried :).
- Discriminative training. There's lots of work on discriminative training of language models, but, from what I've seen, it usually has to do with trying to discriminate true sentences from fake sentences, where the fake sentences are generated by some process (eg., an existing MT or speech system, a trigram LM, etc.). The alternative is to directly train a language model to order words. Essentially think of it as a structured prediction problem and try to predict the 8th word based on (say) the two previous. The correct answer is the actual 8th word; the incorrect answer is any other word in the sentence. Words that don't appear in the sentence are "ignored." This is easy to implement and seems to do something reasonable (on a small set of test data).
- Add syntactic features to words, eg., via cluster-based language models. My thought here is to look at syntactic features of words (for instance, CCG-style lexicon information) and use these to create descriptors of the words; these can then be clustered (eg., use tree-kernel-style-features) to give a cluster LM. This is similar to how people have added CCG/supertag information to phrase-based MT, although they don't usually do the clustering step. The advantage to clustering is then you (a) get generalization to new words and (b) it fits in nicely with the cluster LM framework.
Posted by
hal
at
3/07/2009 11:10:00 AM
| 11
comments
Links to this post
Labels: language modeling, machine translation
02 March 2009
Mixture models: clustering or density estimation
My colleague Suresh Venkatasubramanian is running as seminar on clustering this semester. Last week we discussed EM and mixture of Gaussians. I almost skipped because it's a relatively old hat topic for me (how many times have I given this lecture?!), and had some grant stuff going out that day. But I decided to show up anyway. I'm glad I did.
We discussed a lot of interesting things, but something that had been bugging me for a while finally materialized in a way about which I can be precise. I basically have two (purely qualitative) issues with mixture of Gaussians as a clustering method. (No, I'm not actually suggesting there's anything wrong with using it in practice.) My first complaint is that many times, MoG is used to get the cluster assignments, or to get soft-cluster assignments... but this has always struck me as a bit weird because then we should be maximizing over the cluster assignments and doing expectations over everything else. Max Welling has done some work related to this in the Bayesian setting. (I vaguely remember that someone else did basically the same thing at basically the same time, but can't remember any more who it was.)
But my more fundamental question is this. When we start dealing with MoG, we usually say something like... suppose we have a density F which can be represented at F = pi_0 F_0 + pi_1 F_1 + ... + pi_K F_K, where the pis give a convex combination of "simpler" densities F_k. This question arose in the context of density estimation (if my history is correct) and the maximum likelihood solution via expectation maximization was developed to solve the density estimation problem. That is, the ORIGINAL goal in this case was to do density estimation; the fact that "cluster assignments" were produced as a byproduct was perhaps not the original intent.
I can actually give a fairly simple example to try to make this point visually. Here is some data generated by a mixture of uniform distributions. And I'll even tell you that K=2 in this case. There are 20,000 points if I recall correctly:
Can you tell me what the distribution is? Can you give me the components? Can you give me cluster assignments?
The problem is that I've constructed this to be non-identifiable. Here are two ways of writing down the components. (I've drawn this in 2D, but only pay attention to the x dimension.) They give rise to exactly the same distribution. One is equally weighted components, one uniform on the range (-3,1) and one uniform on the range (-1,3). The other is to have to components, one with 2/3 weight on the range (-3,3) and one with 1/3 weight on the range (-1,1).
I could imagine some sort of maximum likelihood parameter estimation giving rise to either of these (EM is hard to get to work here because once a point is outside the bounds of a uniform, it has probability zero). They both correctly recover the distribution, but would give rise to totally different (and sort of weird) cluster assignments.
I want to quickly point out that this is a very different issue from the standard "non-identifiability in mixture models issue" that has to do with the fact that any permutation of cluster indices gives rise to the same model.
So I guess that all this falls under the category of "if you want X, go for X." If you want a clustering, go for a clustering -- don't go for density estimation and try to read off clusters as a by-product. (Of course, I don't entirely believe this, but I still think it's worth thinking about.)
Posted by
hal
at
3/02/2009 09:13:00 PM
| 8
comments
Links to this post
Labels: clustering, machine learning
19 February 2009
Mixing of gibbs samplers for topic (and mixture) models
This is just a quick note because it's something that had somehow never occurred to me until a few days ago. If you've ever talked to me in person about the standard (collapsed) Gibbs samplers for topic models, you know that I get concerned that these things don't mix. And it's not just the generic "how do we know" feeling that can get levied at (almost) any sampler, but a very specific we know for certain that our Gibbs samplers for topic models aren't mixing. How do we know? Because, for the most part, they don't mode switch. You can figure this out quite easily by simply watching the topic assignments. (The standard collapsed samplers for Gaussian or Multinomial mixture models exhibit the same problem.) At least if you have a reasonable amount of data.
The reason this always bugged me is because we now have definitive evidence that these things aren't mixing in the sense of mode hopping, which leads me to worry that they might also not be mixing in other, less easy to detect ways.
Well, worry no more. Or at least worry less. The mode hopping is a red herring.
Maybe the following observation is obvious, but I'll say it anyway.
Let's take our topic model Gibbs sampler and introduce a new Metropolis-Hastings step. This MH step simply takes two topic indices (say topics i and j) and swaps them. It picks i and j uniformly at random from the (K choose 2) possibilities. It's easy to verify that the acceptance probability for this move will be one (the qs will cancel and the ps are identical), which means that it's really more like a Gibbs step than an MH step (in the sense that Gibbs steps are MH steps that are always accepted).
The observation is that (a) this doesn't actually change what the sampler is doing in any real, meaningful way -- that is, re-indexing things is irrelevant; but (b) you now cannot claim that the sampler isn't mode switching. It's mode switching a lot.
Sure, it might still have poor mixing properties for other reasons, but at least now there isn't this elephant-in-the-room reason it can't possibly be mixing.
So this is a totally "useless" practical observation (sometimes we like to try to exploit the fact that it's not mixing), but from a theoretical perspective it might be interesting. For instance, if you want to prove something about a fast mixing rate for these samplers (if you believe they are actually mixing fast, which I'm somewhat keen to believe), then you're not going to be able to prove anything if you don't make this trivial change to the sampler (because without it they're not really mixing fast). So it might have interesting theoretical consequences.
Posted by
hal
at
2/19/2009 06:36:00 AM
| 5
comments
Links to this post
Labels: machine learning, mcmc, topic models
04 February 2009
Beating the state of the art, big-O style
I used to know a fair amount about math; in fact, I even applied to a few grad schools to do logic (long story, now is not the time). I was never involved in actual math research (primarily because of the necessary ramp-up time -- thank goodness this doesn't exist as much in CS!), but I did get a sense from my unofficial undergrad advisor how things worked. The reason I started thinking of this is because I recently made my way about halfway through a quite dense book on long games by a prof from grad school (I cross-enrolled at UCLA for a few semesters). The basic idea of a countable game is that there is a fixed subset A of [0,1] (subset of the reals) and two players alternate play. In each move, they play an integer from 0 to 9. They play for a countably infinite number of moves, essentially writing down (in decimal form) a real number. If, at the "end", this number is in A, then player 1 wins; otherwise, player 2 wins. Both players know A. The set A is said to be determined if there is a strategy that will force a win; it is undetermined otherwise. A long game is the obvious generalization where you play for longer than countable time. The details don't matter.
This led me to think, as someone who's moved over to working a lot on machine learning: is there an analogous question for online learning? There are several ways to set this up and I won't bore you with the details (I doubt any reader here really cares), but depending on how you set it up, you can prove several relatively trivial, but kind of cute, results (I say trivial because they took me on the order of hours, which means that someone who knows what they're doing probably would see them immediately). I basically did this as a mental exercise, not for any real reason.
But it got me thinking: obviously machine learning people wouldn't care about this because it's too esoteric and not at all a realistic setting (even for COLTers!). I strongly doubt that logicians would care either, but for a totally different reason. From my interaction, they would be interested if and only if two things were satisfied: (a) the result showed some interesting connection between a new model and existing models; (b) the proofs were non-trivial and required some new insight that could be then applied to other problems. Obviously this is not my goal in life, so I've dropped it.
This led to me introspect: what is is that we as a community need in order to find some result interesting? What about other fields that I claim to know a bit about?
Let's take algorithms for a minute. Everything here is about big-O. Like the math types, a result without an interesting proof is much less interesting than a result with an interesting proof, though if you start reading CS theory blogs, you'll find that there's a bit of divide in the community on whether this is good or not. But my sense (which could be totally broken) is that if you have a result with a relatively uninteresting proof that gets you the same big-O running time as the current state of the art, you're in trouble.
I think it's interesting to contrast this with what happens in both NLP and ML. Big-O works a bit differently here. My non-technical description of big-O to someone who knows nothing is that it measure "order of magnitude" improvements. (Okay, O(n log n) versus O(n log log n) is hard to call an order of magnitude, but you get the idea.) An equivalent on the experimental side would seem to be something like: you cut the remaining error on a problem by half or more. In other words, if state of the art is 60% accuracy, then an order of magnitude improvement would be 80% accuracy or better. At 80% it would be 90%. At 90% it would be 95% and so on. 90.5% to 91.2% is not order of magnitude.
I actually like this model for looking at experimental results. Note that this has absolutely nothing to do with statistical significance. It's kind of like reading results graphs with a pair of thick glasses on (for those who don't wear glasses) or no glasses on (for those who wear think glasses). I think the justification is that for less than order of magnitude improvement, it's really just hard to say whether the improvement is due to better tweaking or engineering or dumb luck in how some feature was implemented or what. For order of magnitude improvement, there almost has to be something interesting going on.
Now, I'm not proposing that a paper isn't publishable if it doesn't have order of magnitude improvement. Very few papers would be published this way. I'm just suggesting that improving the state of the art not be -- by itself -- a reason for acceptance unless it's an order of magnitude improvement. That is, you'd either better have a cool idea, be solving a new problem, analyzing the effect of some aspect of a problem that's important, etc., or work on a well trod task and get a big-O improvement.
What I'm saying isn't novel, of course... the various exec boards at the ACL conferences have been trying to find ways to get more "interesting" papers into the conferences for (at least) a few years. This is just a concrete proposal. Obviously it requires buy-in at least of area chairs and probably reviewers. And there are definitely issues with it. Like any attempt to make reviewing non-subjective, there are obviously corner cases (eg., you have a sort-of-interesting idea and an almost-order-of-magnitude improvement). You can't mechanize the reviewing process. But frankly, when I see paper reviews that gush over tiny improvements in the state of the art in an otherwise drab paper, I just get annoyed :).
Posted by
hal
at
2/04/2009 08:22:00 AM
| 8
comments
Links to this post
Labels: acl, community, conferences
31 January 2009
Boosting, neural networks and deep learning
I'm on a neural networks kick of late, apparently.
The following was pointed out to me recently. Maybe other people think it's obvious, and it's obvious once you hear it, but I'd never seen it that way. Suppose we're doing boosting and the weak learner that we choose is a thresholded linear model (eg., perceptron, SVM, logistic regression). Then what we get out of boosting is basically a neural network with a single hidden unit. The hidden units themselves are the weak learners, and the "alphas" that arise as part of the boosting procedure (eg., AdaBoost) are the weights connecting the hidden units to the output. (Thanks to Antonio Paiva for pointing this out to me.) So, in a sense, boosting a linear model can be seen as a kind-of-unusual method for training a two layer network. Essentially the number of boosting iterations that you run determines the number of hidden units (i.e., the network architecture).
Okay, that's great. But let's say we don't care about two layer networks, we care about deep learning. For a 3-layer network you could simple boost boosted-linear-models. For a 4-layer network you could boost boosted-boosted-linear models. And so on.
Is there an alternative?
Thinking purely procedurally, let's say my weak learner is a linear model. I start boosting and I've got 5 linear models trained in the standard AdaBoost manner. Now I have a choice. Should I train a 6th linear model to throw in to the standard boosting set? Or should I treat the 5 boosted linear models as a new base classifier and boost against the combination? If I choose the latter, I've now gone from two layers to three layers.
Why might it be a good idea to boost against the 5 collectively? Well, if you believe the whole deep learning propaganda, then it's a good idea because deep = good. From a more theoretical perspective, you might how that the extra level of recursion might get you an increased rate of improvement in the error rate. I.e., the recursion could potentially lead to stronger boosting results than the standard linear boosting. Of course, this is just a hunch: I haven't at all looked to try to figure out if it would actually work in theory. But it seems plausible. For instance, in neural networks theory, we know that a 2 layer network can approximate any (reasonable) function, but you might need an exponential number of hidden units; the number of required hidden units goes down if you make deeper networks (under assumptions).
Of course, maybe someone's already done this, maybe it doesn't work, and maybe it's a stupid idea :).
Posted by
hal
at
1/31/2009 09:57:00 AM
| 6
comments
Links to this post
Labels: machine learning
