<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-19803222</id><updated>2012-01-29T10:46:46.015-07:00</updated><category term='graphical models'/><category term='clustering'/><category term='journals'/><category term='theory'/><category term='PL'/><category term='linguistics'/><category term='research'/><category term='news'/><category term='domain adaptation'/><category term='discourse'/><category term='loss functions'/><category term='acl'/><category term='random'/><category term='community'/><category term='structured prediction'/><category term='poll'/><category term='coreference'/><category term='parsing'/><category term='algorithms'/><category term='bayesian'/><category term='sentiment'/><category term='hiring'/><category term='classification'/><category term='language modeling'/><category term='machine translation'/><category term='online learning'/><category term='problems'/><category term='topic models'/><category term='evaluation'/><category term='information retrieval'/><category term='finite state methods'/><category term='survey'/><category term='software'/><category term='ACS'/><category term='chunking'/><category term='advising'/><category term='speech'/><category term='reviewing'/><category term='summarization'/><category term='statistics'/><category term='machine learning'/><category term='mcmc'/><category term='data'/><category term='questions'/><category term='conferences'/><category term='papers'/><category term='teaching'/><title type='text'>natural language processing blog</title><subtitle type='html'>my biased thoughts on the fields of natural language processing (NLP), computational linguistics (CL) and 
related topics (machine learning, math, funding, etc.)</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default?start-index=101&amp;max-results=100'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>258</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-19803222.post-6489777617753556390</id><published>2011-12-12T14:58:00.000-07:00</published><updated>2011-12-13T04:23:34.290-07:00</updated><title type='text'>It's that magical time of year...</title><content type='html'>By which I mean NIPS, and the incumbent exhaustion of 14 hour days.&amp;nbsp; (P.s., if you're at NIPS, see the meta-comment I have at the end of this post, even if you skip the main body :P.)&lt;br /&gt;&lt;br /&gt;Today I went to two tutorials: one by Yee Whye Teh and Peter Orbanz (who is starting shortly at Columbia) on non-parametric Bayesian stuff, and one by Naftali Tishby on information theory in learning and control.&amp;nbsp; They were streamed online; I'm sure the videos will show up at some point on some web page, but I can't find them right now (incidentally, and shamelessly, &lt;a href="http://www.naacl.org/elections"&gt;I think NAACL should 
have video tutorials&lt;/a&gt;, too -- actually my &lt;a href="https://twitter.com/#%21/ccb"&gt;dear friend Chris&lt;/a&gt; wants that too, and since Kevin Knight has already promised that ACL approves all my blog posts, I suppose I can only additionally promise that I will do everything I can to keep at least a handful of MT papers appearing in each NAACL despite the fact that no one really works on it anymore :P).&amp;nbsp; Then there were spotlights followed by posters, passed (as well as sat down) hors d'oeuvres, free wine/sangria/beer/etc, and friends and colleagues.&lt;br /&gt;&lt;br /&gt;The first half of the Teh/Orbanz tutorial is roughly what I would categorize as "NP Bayes 101" -- stuff that everyone should know, with the addition of some pointers to results about consistency, rates of convergence of the posterior, etc.&amp;nbsp; The second half included a lot of stuff that's recently become interesting, in particular topics like completely random measures, coagulation/fragmentation processes, and the connection between gamma processes (an example of a completely random measure) and Dirichlet processes (which we all know and love/hate).&lt;br /&gt;&lt;br /&gt;One of the more interesting things toward the end was what I would roughly characterize as variants of the de Finetti theorem on exchangeable objects.&amp;nbsp; What follows is from memory, so please forgive errors: you can look it up in the tutorial.&amp;nbsp; De Finetti's theorem states that if p(X1, X2, ..., Xn, ...) is exchangeable, then p has a representation as a mixture model, with (perhaps) infinite-dimensional mixing coefficients.&amp;nbsp; This is a fairly well-known result, and was apparently part of the initial reason Bayesians started looking into non-parametrics.&lt;br /&gt;&lt;br /&gt;The generalizations (due to people like Kingman, Pitman, Aldous, etc...) 
are basically what happens for other types of data (i.e., other than just exchangeable).&amp;nbsp; For instance, if a sequence of data is block-exchangeable (think of a time-series, which is obviously &lt;i&gt;not&lt;/i&gt; exchangeable, but which you could conceivably cut into a bunch of contiguous pieces such that these pieces would be exchangeable) then it has a representation as a mixture of Markov chains.&amp;nbsp; For graph-structured data, if the nodes are exchangeable (i.e., all that matters is the pattern of edges, not precisely which nodes they happen to connect), then this also has a mixture parameterization, though I've forgotten the details.&lt;br /&gt;&lt;br /&gt;The Tishby tutorial started off with some very interesting connections between information theory, statistics, and machine learning, essentially from the point of view of hypothesis testing.&amp;nbsp; The first half of the tutorial centered around the information bottleneck, which is a very beautiful idea. You should all go read about it if you don't know it already.&lt;br /&gt;&lt;br /&gt;What actually really struck me was a comment that Tishby made somewhat off-hand, and I'm wondering if anyone can help me out with a reference.&amp;nbsp; The statement has to do with the question "why KL?"&amp;nbsp; His answer had two parts.&amp;nbsp; For the first part, consider mutual information (which is closely related to KL).&amp;nbsp; MI has the property that if "Z -&amp;gt; X -&amp;gt; Y" is a Markov chain, then the amount of information that Y gives you about Z is at most the amount of information that X gives you about Z.&amp;nbsp; In other words, if you think of Y as a "processed" version of X, then this processing cannot give you more information.&amp;nbsp; This property is more general than just MI, and I believe anything that obeys it is a Csiszar divergence.&amp;nbsp; The second part is the part that I'm not so sure of.&amp;nbsp; It originated with the observation that if you have a product, take 
a log, you now get an additive term.&amp;nbsp; This is really nice because you can apply results like the central limit theorem to this additive term.&amp;nbsp; (Many of the results in the first half of his tutorial hinged on this additivity.)&amp;nbsp; The claim was something like: the only divergences that have this additivity are Bregman divergences.&amp;nbsp; (This is not obvious to me, and actually it is not entirely obvious what the right definition of additivity is, so if someone wants to help out, please do so!)&amp;nbsp; But the connection is that MI and KL are the "intersection" of Bregman divergences and Csiszar divergences.&amp;nbsp; In other words, if you want the decreasing information property and you want the additivity property, then you MUST use information theoretic measures.&lt;br /&gt;&lt;br /&gt;I confess that roughly the middle third of the talk went over my head, but I did learn about an interesting connection between Gaussian information bottleneck and CCA: basically they're the same, up to a trimming of the eigenvalues.&amp;nbsp; This is in a 2005 JMLR paper by Amir Globerson and others.&amp;nbsp; In the context of this, actually, Tishby made a very offhand comment that I couldn't quite parse as either a theorem or a hope.&amp;nbsp; Basically the context was that when working with Gaussian distributed random variables, you can do information bottleneck "easily," but that it's hard for other distributions.&amp;nbsp; So what do we do?&amp;nbsp; We do a kernel mapping into a high-dimensional space (they use an RBF kernel) where the data will look "more Gaussian."&amp;nbsp; As I said, I couldn't quite parse whether this is "where the data will provably look more Gaussian" or "where we hope that maybe by dumb luck the data will look more Gaussian" or something in between.&amp;nbsp; If anyone knows the answer, again, I'd love to know.&amp;nbsp; And if you're here at NIPS and can answer either of these two questions to my satisfaction, I'll buy 
you a glass of wine (or beer, but why would you want beer? :P).&lt;br /&gt;&lt;br /&gt;Anyway, that's my report for day one of NIPS!&lt;br /&gt;&lt;br /&gt;p.s. I made the silly decision of taking a flight from Granada to Madrid at 7a on Monday 19 Dec.&amp;nbsp; This is way too early to take a bus, and I really don't want to take a bus Sunday night.&amp;nbsp; Therefore, I will probably take a cab.&amp;nbsp; I think it will be about 90 euros.&amp;nbsp; If you also were silly and booked early morning travel on Monday and would like to &lt;i&gt;share&lt;/i&gt; said cab, please email me (me AT hal3 DOT name).&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-6489777617753556390?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/6489777617753556390/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=6489777617753556390' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/6489777617753556390'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/6489777617753556390'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2011/12/its-that-magical-time-of-year.html' title='It&apos;s that magical time of year...'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-860379158736563644</id><published>2011-10-14T08:27:00.003-06:00</published><updated>2011-10-14T08:27:50.068-06:00</updated><title type='text'>You need a job and I have $$$</title><content type='html'>If you're an NLP or ML person and graduating in the next six months or so and are looking for a postdoc position with very very open goals, read on.&amp;nbsp; The position would be at UMD College Park, in the greater DC area, with lots of awesome people around (as well as JHU and other universities a short drive/train away).&lt;br /&gt;&lt;br /&gt;This position could start as early as January 2012, probably more likely around June 2012 and could be as late as September 2012 for the right person.&amp;nbsp; Even if you're not graduating until the one-year-from-now time frame, please contact me now!&amp;nbsp; I'm looking more for a &lt;i&gt;brilliant, hard-working, creative person&lt;/i&gt; than anyone with any particular skills.&amp;nbsp; That said, you probably know what sort of problems I tend to work on, so it would be good if you're at least interested in things roughly in that space (regardless of whether you've worked on them before or not).&lt;br /&gt;&lt;br /&gt;The position would be for one year, with an extension to two if things are working out well for both of us (not subject to funding).&lt;br /&gt;&lt;br /&gt;If you're interested, please email me at &lt;a href="mailto:postdoc@hal3.name"&gt;postdoc@hal3.name&lt;/a&gt; with the following information:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Inline in the email:&lt;/li&gt;&lt;ol&gt;&lt;li&gt;Your PhD institution and advisor, thesis title, and expected graduation date.&lt;/li&gt;&lt;li&gt;Links to the two or three most awesome papers you have, together with titles and venue.&lt;/li&gt;&lt;li&gt;Link to your homepage.&lt;/li&gt;&lt;li&gt;A list of three references (names, positions 
and email addresses). &lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;li&gt;Attached to the email:&lt;/li&gt;&lt;ol&gt;&lt;li&gt;A copy of your CV, in PDF format.&lt;/li&gt;&lt;li&gt;A brief (one page) research statement that focuses mostly on what problem(s) you'd most like to work on in a postdoc position with me.&amp;nbsp; Also in PDF format.&lt;/li&gt;&lt;/ol&gt;&lt;/ol&gt;&amp;nbsp;I need this information by &lt;b&gt;November 1st&lt;/b&gt; so please reply quickly!!!&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-860379158736563644?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/860379158736563644/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=860379158736563644' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/860379158736563644'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/860379158736563644'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2011/10/you-need-job-and-i-have.html' title='You need a job and I have $$$'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-5091600234929283373</id><published>2011-10-11T13:29:00.003-06:00</published><updated>2011-10-11T13:29:50.652-06:00</updated><title type='text'>Active learning: far from solved</title><content type='html'>As &lt;a href="http://hunch.net/?p=1800"&gt;Daniel Hsu and John Langford 
pointed out recently&lt;/a&gt;, there has been a lot of recent progress in active learning.  This is to the point where I might actually be tempted to suggest some of these algorithms to people to use in practice, for instance the one John has that learns faster than supervised learning because it's very careful about what work it performs.  That is, in particular, I might suggest that people try it out instead of the usual query-by-uncertainty (QBU) or query-by-committee (QBC).  This post is a brief overview of what I understand of the state of the art in active learning (paragraphs 2 and 3) and then a discussion of why I think (a) researchers don't tend to make much use of active learning and (b) the problem is far from solved.  (a will lead to b.)&lt;br /&gt;&lt;br /&gt;If you know what QBU and QBC are, skip this paragraph.  The idea with QBU is exactly what you think: when choosing the next point to ask for the label of, choose the one on which your current model is maximally uncertain.  If you're using a probabilistic model, this means something like "probability is closest to 0.5," or, in the non-binary case, something like "highest entropy of p(y|x)."  If you're using something like an SVM, perhaps margin (aka distance to the hyperplane) is a reasonable measure of uncertainty.  In QBC, the idea is still to query on uncertain points, but the uncertainty is computed by the amount of disagreement among a committee of classifiers, for instance, classifiers trained in a bootstrap manner on whatever data you have previously labeled.&lt;br /&gt;&lt;br /&gt;One of the issues with QBU and QBC and really a &lt;i&gt;lot&lt;/i&gt; of the classic methods for active learning is that you end up with a biased set of training data.  This makes it really hard to prove anything about how well your algorithm is going to do on future test examples, because you've intentionally selected examples that are not random (and probably not representative).  
One of the "obvious in retrospect" ideas that's broken this barrier is to always train your classifier on &lt;i&gt;all&lt;/i&gt; examples: the label for those that you've queried on is given by the human, and the label for those that you haven't queried on is given by your model from the previous iteration.  Thus, you are &lt;i&gt;always&lt;/i&gt; training on an iid sample from the distribution you care about (at least from a p(x) perspective).  This observation, plus a lot of other work, leads to some of the breakthroughs that John mentions.&lt;br /&gt;&lt;br /&gt;An easy empirical observation is that not many people (in my sphere) actually use active learning.  In fact, the only case that I know of was back in 2004, when IBM annotated extra coreference data for the Automatic Content Extraction (ACE) challenge using active learning.  Of course people use it to write papers about active learning, but that hardly counts.  (Note that the way that previously learned taggers, for instance the Brill Tagger, were used in the construction of the Penn Treebank does not fall under the auspices of active learning, at least as I'm thinking about it here.)&lt;br /&gt;&lt;br /&gt;It is worth thinking about why this is.  I think that the main issue is that you end up with a biased training set.  If you use QBC or QBU, this is very obvious.  If you use one of the new approaches that self-label the rest of the data to ensure that you don't have a biased training set, then of course p(x) is unbiased, but p(y|x) is very biased by whatever classifier you are using.&lt;br /&gt;&lt;br /&gt;I think the disconnect is the following.  The predominant view of active learning is that the goal is a &lt;i&gt;classifier&lt;/i&gt;.  The data that is labeled is a byproduct that will be thrown away once the classifier exists.&lt;br /&gt;&lt;br /&gt;The problem is that this view flies in the face of the whole point of active learning: that labeling is expensive.  
If labeling is so expensive, we should be able to &lt;i&gt;reuse&lt;/i&gt; this data so that the cost is amortized.  That is, yes, of course we care about a classifier.  But just as much, we care about having a data set (or "corpus" in the case of NLP).&lt;br /&gt;&lt;br /&gt;Consider, for instance, the Penn Treebank.  The sorts of techniques that are good at parsing &lt;i&gt;now&lt;/i&gt; were just flat-out not available (and perhaps not even conceivable) back in the early 1990s when the Treebank was being annotated.  If we had done active learning for the Treebank under a non-lexicalized, non-parent-annotated PCFG that gets 83% accuracy, maybe worse because we didn't know how to smooth well, then how well would this data set work for modern-day state-splitting grammars with all sorts of crazy Markovization and whatnot going on?&lt;br /&gt;&lt;br /&gt;The answer is: I have no idea.  I have never seen an experiment that looks at this issue.  And it would be so easy!  Run your standard active learning algorithm with one type of classifier.  Plot your usual active-versus-passive learning curves.  Now, using the &lt;i&gt;same sequence of data&lt;/i&gt;, train another classifier.  Plot that learning curve.  Does it still beat passive selection?  By how much?  And then, of course, can we say anything formal about how well this will work?&lt;br /&gt;&lt;br /&gt;There are tons of ways that this problem can arise.  For instance, when I don't have much data I might use a generative model and then when I have lots of data I might use a discriminative model.  Or, as I get more data, I add more features.  Or someone finds awesome features 5 years later for my problem.  Or new machine learning techniques are developed.  Or anything.  I don't want my data to become obsolete when this happens.&lt;br /&gt;&lt;br /&gt;I am happy to acknowledge that this is a very hard problem.  In fact, I suspect that there's some sort of no-free-lunch theorem lurking in here.  
Essentially, if the inductive biases of the classifier that you use to do the active learning and the classifier you train at the end are too different, then you could do (arbitrarily?) badly.  But in the real world, our hypothesis classes aren't all that different, or perhaps you can assume you're using a universal function approximator or a universal kernel or something.  Assume what you want to start, but I think it's an interesting question.&lt;br /&gt;&lt;br /&gt;And then, while we're on the topic of active learning, I'd also like to see whether an active learning algorithm's performance asymptotes &lt;i&gt;before&lt;/i&gt; all your training data is exhausted.  That is, the usual model in active learning experiments is that you have 1000 training examples because that's what someone labeled.  You then do active learning up to 1000 examples, and of course at that point, everything has been labeled, so active learning performance coincides precisely with passive learning performance.  But this is a poor reflection of many problems in the world, where new inputs are almost always free.  I want the &lt;a href="http://dl.acm.org/citation.cfm?id=1073017&amp;amp;bnc=1"&gt;Banko and Brill&lt;/a&gt; paper for active learning... perhaps it's out there, and if you've seen it, I'd love a pointer.  I ran a couple of experiments along these lines (nothing concrete), but it actually seemed that active learning from a &lt;i&gt;smaller&lt;/i&gt; pool was better, perhaps because you have fewer outliers (I was using QBU).  But these results are by no means concrete, so don't cite me as saying this &lt;tt&gt;:)&lt;/tt&gt;.&lt;br /&gt;&lt;br /&gt;At any rate, I agree that active learning has come a long way.  I would humbly suggest that the goal of simply building a classifier is not in the real spirit of trying to save money.  If you wanted to save money, you would save your data and share it (modulo lawyers).  
In the long run, passive learning currently seems &lt;i&gt;much&lt;/i&gt; less expensive than active learning to me.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-5091600234929283373?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/5091600234929283373/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=5091600234929283373' title='17 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/5091600234929283373'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/5091600234929283373'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2011/10/active-learning-far-from-solved.html' title='Active learning: far from solved'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>17</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-5169776816311940991</id><published>2011-09-29T06:32:00.001-06:00</published><updated>2011-09-29T06:32:55.325-06:00</updated><title type='text'>A technique for me is a task for you</title><content type='html'>Originally in the context of &lt;a href="http://braque.cc/"&gt;Braque&lt;/a&gt; and now in the context of &lt;a href="http://www.iarpa.gov/solicitations_fuse.html"&gt;FUSE&lt;/a&gt;, I've thought a bit about understanding the role of techniques and tasks in scientific papers (admittedly, mostly NLP and ML, which I realize are odd and biased).&amp;nbsp; I worked with &lt;a 
href="http://www.cs.utah.edu/%7Esandeepp/"&gt;Sandeep Pokkunuri&lt;/a&gt;, an MS student at Utah, looking at the following problem: given a paper (title, abstract, fulltext), determine what &lt;i&gt;task&lt;/i&gt; is being solved and what &lt;i&gt;technique&lt;/i&gt; is being used to solve it.&amp;nbsp; For instance, for a paper like "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," the &lt;i&gt;task&lt;/i&gt; would be "segmenting and labeling sequence data" and the &lt;i&gt;technique&lt;/i&gt; would be "conditional random fields."&lt;br /&gt;&lt;br /&gt;You can actually go a long way just looking for simple patterns in paper titles, like "TECH for TASK" or "TASK by TECH" and a few things like that (after doing some NP chunking and clean-up).&amp;nbsp; From there you can get a good list of seed tasks and techniques, and could conceivably bootstrap your way from there.&amp;nbsp; We never got a solid result out of this, and sadly I moved and Sandeep graduated and it never went anywhere.&amp;nbsp; What we &lt;i&gt;wanted&lt;/i&gt; to do was automatically generate tables of "for this TASK, here are all the TECHs that have been applied (and maybe here are some results) oh and by the way maybe applying these other TECHs would make sense."&amp;nbsp; Or vice versa: this TECH has been applied to blah blah blah tasks.&amp;nbsp; You might even be able to tell what TECHs are better for what types of tasks, but that's quite a bit more challenging.&lt;br /&gt;&lt;br /&gt;At any rate, a sort of "obvious in retrospect" thing that we noticed was that what I might consider a technique, you might consider a task.&amp;nbsp; And you can construct a chain, typically &lt;a href="http://xkcd.com/435/"&gt;all the way back to math&lt;/a&gt;.&amp;nbsp; For instance, I might consider movie recommendations a task.&amp;nbsp; To solve recommendations, I apply the technique of sparse matrix factorization.&amp;nbsp; But then to you, sparse matrix factorization 
is a task and to solve it, you apply the technique of compressive sensing.&amp;nbsp; But to Scott Tanner, compressive sensing is a task, and he applies the technique of smoothed analysis (okay this is now false, but you get the idea).&amp;nbsp; But to Daniel Spielman, smoothed analysis is the task, and he applies the technique of some other sort of crazy math.&amp;nbsp; And then eventually you get to set theory (or some might claim you get to category theory, but they're weirdos :P).&lt;br /&gt;&lt;br /&gt;(Note: I suspect the same thing happens in other fields, like bio, chem, physics, etc., but I cannot offer such an example because I don't know those areas.&amp;nbsp; Although not so obvious, I &lt;i&gt;do&lt;/i&gt; think it holds in math: I use the proof technique of Shelah35 to prove blah -- there, both theorems and proof techniques are objects.)&lt;br /&gt;&lt;br /&gt;At first, this was an annoying observation.&amp;nbsp; It meant that our ontology of the world into tasks and techniques was broken.&amp;nbsp; But it did imply something of a &lt;i&gt;richer&lt;/i&gt; structure than this simple ontology.&amp;nbsp; For instance, one might posit as a theory of science and technology studies (STS, a subfield of social science concerned with related things) that the most basic thing that matters is that you have objects (things of study) and an &lt;i&gt;appliedTo&lt;/i&gt; relationship.&amp;nbsp; So recommender systems, matrix factorization, compressive sensing, smoothed analysis, set theory, etc., are all objects, and they are linked by &lt;i&gt;appliedTo&lt;/i&gt;s.&lt;br /&gt;&lt;br /&gt;You can then start thinking about what sort of properties &lt;i&gt;appliedTo&lt;/i&gt; might have.&amp;nbsp; It's certainly not a function (many things can be applied to any X, and any Y can be applied to many things).&amp;nbsp; I'm pretty sure it should be antireflexive (you cannot apply X to solve X).&amp;nbsp; It should probably also be antisymmetric (if X is applied to Y, 
probably Y cannot be applied to X).&amp;nbsp; Transitivity is not so obvious, but I think you could argue that it might hold: if I apply gradient descent to an optimization problem, and my particular implementation of gradient descent uses line search, then I kind of am applying line search to my problem, though perhaps not directly.&amp;nbsp; &lt;i&gt;(I'd certainly be interested to hear of counter-examples if any come to mind!)&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;If this is true, then what we're really talking about is something like a directed acyclic graph, which at least at a first cut seems like a reasonable model for this world.&amp;nbsp; Probably you can find exceptions to almost everything I've said, but that's why you need statistical models or other things that can deal with "noise" (aka model misspecification).&lt;br /&gt;&lt;br /&gt;Actually something more like a directed acyclic hypergraph might make sense, since often you simultaneously apply several techniques in tandem to solve a problem.&amp;nbsp; For instance, I apply subgradient descent and L1 regularization to my binary classification problem -- the fact that these two are being applied &lt;i&gt;together&lt;/i&gt; rather than separately seems important somehow.&lt;br /&gt;&lt;br /&gt;Not that we've gone anywhere with modeling the world like this, but I definitely think there are some interesting questions buried in this problem.&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-5169776816311940991?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/5169776816311940991/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=5169776816311940991' title='2 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/19803222/posts/default/5169776816311940991'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/5169776816311940991'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2011/09/technique-for-me-is-task-for-you.html' title='A technique for me is a task for you'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-3111213233602280741</id><published>2011-09-26T10:51:00.005-06:00</published><updated>2011-09-26T10:51:52.539-06:00</updated><title type='text'>Four months without blogs</title><content type='html'>As you've noticed, I haven't posted in a while.&amp;nbsp; I've also not been reading blogs.&amp;nbsp; My number of unread posts is now 462.&amp;nbsp; Clearly I'm not going to go back and read all 462 posts that I missed.&amp;nbsp; I will claim that this was an experiment to see what a (nearly) blog-free world is like.&lt;br /&gt;&lt;br /&gt;I actually found that I missed both the reading and the writing, so now (especially now that I've switched over to public transportation and so have about an hour to kill in transportation time) I'm going to go back to reading while being transported and blogging when I have time.&lt;br /&gt;&lt;br /&gt;I figured I'd return to blogging by saying a bit about a recent experience.&amp;nbsp; Less than a month ago I had the honor of serving on Jurgen Van Gael's Ph.D. 
examination committee.&amp;nbsp; Jurgen did an excellent job and, as perhaps expected, passed.&amp;nbsp; But what I want to talk about is how the UK model (or at least the Cambridge model) is different from the US model.&lt;br /&gt;&lt;br /&gt;In the UK, the examination is done by two faculty members, one internal (this was Stephen Clark) and one external (that was me).&amp;nbsp; It does not involve the advisor/supervisor, though this person can sit in the room without speaking :).&amp;nbsp; There is no public presentation and the process we followed was basically to go through the dissertation chapter-by-chapter, ask clarification questions, perhaps some things to get Jurgen to think on his toes, and so on.&amp;nbsp; This took about two hours.&lt;br /&gt;&lt;br /&gt;Contrast this to the (prototypical) US model, where a committee consists of 5 people, perhaps one external (either external to CS or to the university, depending on how your institution sets it up), and includes the advisor.&amp;nbsp; The defense is typically a 45 minute public presentation followed by questions from the committee in a closed-room environment with the student.&lt;br /&gt;&lt;br /&gt;Having been involved, now, in both types, I have to say they each have their pros and cons.&amp;nbsp; I think the lack of a public presentation in the UK model is a bit of a shame, though of course students could decide to do these anyway.&amp;nbsp; But it's nice to have something official for parents or spouses to come to if they'd like.&amp;nbsp; However, in the US, the public presentation, plus the larger committee, probably leads to the situation students often joke about: that not even their committee reads their dissertation.&amp;nbsp; You can always fall back on the presentation, much like students skip class reading when they know that the lecture will cover it all.&amp;nbsp; When it was just me, Stephen and Jurgen, there was really no hiding in the background :).&lt;br /&gt;&lt;br /&gt;I also like 
how in the UK model, you can skip over the easy stuff and really spend time talking with the student about the deep material.&amp;nbsp; I found myself much more impressed with how well Jurgen knows his stuff &lt;i&gt;after&lt;/i&gt; the examination than before, and this is not a feeling I usually get with US students because their defense is typically quite high-level.&amp;nbsp; And after 45 minutes of a presentation, plus 15 minutes of audience questions, the last thing anyone wants to do is sit around for another two hours examining the details of the defense chapter-by-chapter.&lt;br /&gt;&lt;br /&gt;Regarding the issue of having the advisor there or not, I don't have a strong preference.&amp;nbsp; The one thing I will say is that having the advisor missing removes the &lt;i&gt;potential&lt;/i&gt; for weird politics.&amp;nbsp; For instance, I have seen one or two defenses in which an advisor tends to answer questions for the student, without the student first attempting an answer.&amp;nbsp; If I were on these committees, with a relatively senior advisor, it might be politically awkward to ask them not to do this.&amp;nbsp; Luckily this issue hasn't come up for me, but I could imagine it happening.&lt;br /&gt;&lt;br /&gt;Obviously I don't really expect anyone's policies to change, and I'm not even sure that they should, but I like thinking about things that I've grown used to taking for granted.&amp;nbsp; Plus, after having gone through the UK model, I think I will grill students a bit more during the Q/A time.&amp;nbsp; And if this means that fewer students ask me to be on their committees, then there's more time to blog :).&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-3111213233602280741?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://nlpers.blogspot.com/feeds/3111213233602280741/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=3111213233602280741' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/3111213233602280741'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/3111213233602280741'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2011/09/four-months-without-blogs.html' title='Four months without blogs'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-1346998247683034686</id><published>2011-07-07T07:49:00.000-06:00</published><updated>2011-07-07T07:49:49.791-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='papers'/><title type='text'>Introducing Braque, your paper discovery friend</title><content type='html'>&lt;a href="http://www.artchive.com/artchive/b/braque/wmn_guit.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" height="320" src="http://www.artchive.com/artchive/b/braque/wmn_guit.jpg" width="179" /&gt;&lt;/a&gt;(Shameless plug/advertisement follows.)&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;Want to be informed of new interesting papers that show up online?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;Tired of trolling conference proceedings to find that one gem? 
&lt;br /&gt;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;Want to make sure interested parties hear about your newest results? &lt;br /&gt;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;Want to know when a new paper comes out that cites you? &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-large;"&gt;Braque (&lt;a class="moz-txt-link-freetext" href="http://braque.cc/"&gt;http://braque.cc&lt;/a&gt;) can help. &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://braque.cc/images/braque.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://braque.cc/images/braque.png" /&gt;&lt;/a&gt;&lt;/div&gt;Braque is a news service for research papers (currently focusing primarily on NLP and ML, though it needn't be that way).&amp;nbsp; You can create &lt;i&gt;&lt;span class="moz-txt-underscore"&gt;&lt;span class="moz-txt-tag"&gt;&lt;/span&gt;channels&lt;span class="moz-txt-tag"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/i&gt; that provide email or RSS feeds for topics you care about. You can add your own publications page as a &lt;i&gt;resource&lt;/i&gt; to Braque so it knows to crawl your papers and send them out to interested parties.&lt;br /&gt;&lt;br /&gt;Braque is something I built ages ago with &lt;a href="http://www.cs.berkeley.edu/%7Epliang/"&gt;Percy Liang&lt;/a&gt;, but it's finally more or less set up after my move. Feel free to email me questions and comments or (preferably) use the online comment system.&lt;br /&gt;&lt;br /&gt;As a bit of warning: Braque is neither a paper search engine nor a paper archive.&amp;nbsp; And please be a bit forgiving if you go there immediately after this post shows up and it's a bit slow.... 
we only have one server :).&lt;br /&gt;&lt;br /&gt;ps., yes, Braque is sort of like &lt;a href="http://www.cs.utah.edu/%7Ehal/WhatToSee/"&gt;WhatToSee&lt;/a&gt; on crack.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-1346998247683034686?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/1346998247683034686/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=1346998247683034686' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/1346998247683034686'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/1346998247683034686'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2011/07/introducing-braque-your-paper-discovery.html' title='Introducing Braque, your paper discovery friend'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-9151843719420526696</id><published>2011-07-06T18:40:00.000-06:00</published><updated>2011-07-06T18:40:26.380-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>The conference(s) post: ACL and ICML</title><content type='html'>I'm using ACL/ICML as an excuse to jumpstart my resumed, hopefully regular, posting.&amp;nbsp; The usual "I didn't see/read everything" applies to all of this.&amp;nbsp; My general feeling about ACL (which was echoed by several other participants) was that the program was quite 
strong, but there weren't many papers that really stood out as especially great.&amp;nbsp; Here are some papers I liked and some attached thoughts, from ACL:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://aclweb.org/anthology-new/P/P11/P11-1002.pdf"&gt;P11-1002&lt;/a&gt; [&lt;a href="http://aclweb.org/anthology-new/P/P11/P11-1002.bib"&gt;bib&lt;/a&gt;]: &lt;b&gt;Sujith Ravi; Kevin Knight&lt;/b&gt;&lt;br /&gt;&lt;i&gt;Deciphering Foreign Language&lt;/i&gt;&lt;br /&gt;This paper is about building MT systems without parallel data.&amp;nbsp; There's been a bunch of work in this area.&amp;nbsp; The idea here is that if I have English text, I can build an English LM.&amp;nbsp; If you give me some French text and I hallucinate a F2E MT system, then its output had better score high on the English LM.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://aclweb.org/anthology-new/P/P11/P11-1020.pdf"&gt;P11-1020&lt;/a&gt; [&lt;a href="http://aclweb.org/anthology-new/P/P11/P11-1020.bib"&gt;bib&lt;/a&gt;] [&lt;a href="http://aclweb.org/supplementals/P/P11/P11-1020.Datasets.txt"&gt;dataset&lt;/a&gt;]: &lt;b&gt;David Chen; William Dolan&lt;/b&gt;&lt;br /&gt;&lt;i&gt;Collecting Highly Parallel Data for Paraphrase Evaluation&lt;/i&gt;&lt;br /&gt;Although this paper is about paraphrasing, the fun part is the YouTube stuff they did.&amp;nbsp; Read it and see :).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://aclweb.org/anthology-new/P/P11/P11-1060.pdf"&gt;P11-1060&lt;/a&gt; [&lt;a href="http://aclweb.org/anthology-new/P/P11/P11-1060.bib"&gt;bib&lt;/a&gt;]: &lt;b&gt;Percy Liang; Michael Jordan; Dan Klein&lt;/b&gt;&lt;br /&gt;&lt;i&gt;Learning Dependency-Based Compositional Semantics&lt;/i&gt;&lt;br /&gt;This paper is along the lines of semantic parsing stuff that various people (Ray Mooney, Luke Zettlemoyer/Mike Collins, etc.) 
have been doing.&amp;nbsp; It's a nice compositional model that is learned online.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://aclweb.org/anthology-new/P/P11/P11-1099.pdf"&gt;P11-1099&lt;/a&gt; [&lt;a href="http://aclweb.org/anthology-new/P/P11/P11-1099.bib"&gt;bib&lt;/a&gt;]: &lt;b&gt;Vanessa Wei Feng; Graeme Hirst&lt;/b&gt;&lt;br /&gt;&lt;i&gt;Classifying arguments by scheme&lt;/i&gt;&lt;br /&gt;This paper is about argumentation (in the "debate" sense) and identifying different argumentation types.&amp;nbsp; There are some nice correlations with discourse theory, but in a different context.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://aclweb.org/anthology-new/P/P11/P11-2037.pdf"&gt;P11-2037&lt;/a&gt; [&lt;a href="http://aclweb.org/anthology-new/P/P11/P11-2037.bib"&gt;bib&lt;/a&gt;]: &lt;b&gt;Shu Cai; David Chiang; Yoav Goldberg&lt;/b&gt;&lt;br /&gt;&lt;i&gt;Language-Independent Parsing with Empty Elements&lt;/i&gt;&lt;br /&gt;I'm really glad to see that people are starting to take this problem seriously again.&amp;nbsp; This falls under the category of "if you've ever actually tried to use a parser to &lt;i&gt;do something&lt;/i&gt; then you need this."&lt;br /&gt;&lt;br /&gt;Okay so that's not that many papers, but I did "accidentally" skip some sections.&amp;nbsp; So you're on your own for the rest.&lt;br /&gt;&lt;br /&gt;For ICML, I actually felt it was more of a mixed bag.&amp;nbsp; Here are some things that stood out as cool:&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;a href="http://www.icml-2011.org/papers.php#480"&gt;Minimum Probability Flow Learning&lt;/a&gt;&amp;nbsp;&lt;/i&gt;&lt;br /&gt;&lt;b&gt;Jascha Sohl-Dickstein; Peter Battaglino; Michael DeWeese&lt;/b&gt;&lt;br /&gt;This is one that I need to actually go read, because it seems too good to be true.&amp;nbsp; If computing a partition function ever made you squirm, read this paper.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;a href="http://www.icml-2011.org/papers.php#432"&gt;Tree-Structured Infinite Sparse Factor 
Model&lt;/a&gt;&amp;nbsp;&lt;/i&gt;&lt;br /&gt;&lt;b&gt;XianXing Zhang; David Dunson; Lawrence Carin&lt;/b&gt;&lt;br /&gt;This is trying to do factor analysis with tree factors; they use a "multiplicative gamma process" to accomplish it. This is something we tried to do a while ago, but could never really figure out how to do it.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;a href="http://www.icml-2011.org/papers.php#534"&gt;Sparse Additive Generative Models of Text&lt;/a&gt;&amp;nbsp;&lt;/i&gt;&lt;br /&gt;&lt;b&gt;Jacob Eisenstein; Amr Ahmed; Eric  Xing&lt;/b&gt;&lt;br /&gt;The idea here is that if you're learning a model of text, don't re-learn the same "general background" distribution over and over again.&amp;nbsp; Then learn class- or topic-specific stuff as a sparse amendment to that background.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;a href="http://www.icml-2011.org/papers.php#373"&gt;OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning&lt;/a&gt;&amp;nbsp;&lt;/i&gt;&lt;br /&gt;&lt;b&gt;Arvind Sujeeth; HyoukJoong Lee; Kevin Brown; Tiark Rompf; Hassan Chafi; Michael Wu; Anand Atreya; Martin Odersky; Kunle Olukotun&lt;/b&gt;&lt;br /&gt;Two words: MATLAB KILLER.&lt;br /&gt;Six more words: Most authors ever on ICML paper. 
&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;a href="http://www.icml-2011.org/papers.php#626"&gt;Generalized Boosting Algorithms for Convex Optimization&lt;/a&gt;&amp;nbsp;&lt;/i&gt;&lt;br /&gt;&lt;b&gt;Alexander Grubb; Drew Bagnell&lt;/b&gt;&lt;br /&gt;Suppose you want to boost something that's non-smooth?&amp;nbsp; Now you can do it.&amp;nbsp; Has nice applications in imitation learning, which is I suppose why I like it.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;a href="http://www.icml-2011.org/papers.php#275"&gt;Learning from Multiple Outlooks&lt;/a&gt;&amp;nbsp;&lt;/i&gt;&lt;br /&gt;&lt;b&gt;Maayan Harel; Shie Mannor&lt;/b&gt;&lt;br /&gt;This is a nice approach based on distribution mapping to the problem of multiview learning when you don't have data with parallel views.&amp;nbsp; (I'm not sure that we need a new name for this task, but I still like the paper.)&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;a href="http://www.icml-2011.org/papers.php#125"&gt;Parsing Natural Scenes and Natural Language with Recursive Neural Networks&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;b&gt;Richard Socher; Cliff Chiung-Yu Lin; Andrew  Ng; Chris Manning&lt;/b&gt;&lt;br /&gt;This is basically about learning compositional semantics for vector space models of text, something that I think is really interesting and understudied (Mirella Lapata has done some stuff).&amp;nbsp; The basic idea is that if "red" is embedded at position x, and "sparrow" is embedded at y, then the embedding of the phrase "red sparrow" should be at f([x y]) where f is some neural network.&amp;nbsp; Trained to get good representations for parsing.&lt;br /&gt;&lt;b&gt; &lt;/b&gt;&lt;br /&gt;&lt;b&gt;Please reply in comments if you had other papers you liked!!!&lt;/b&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-9151843719420526696?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://nlpers.blogspot.com/feeds/9151843719420526696/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=9151843719420526696' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/9151843719420526696'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/9151843719420526696'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2011/07/conferences-post-acl-and-icml.html' title='The conference(s) post: ACL and ICML'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-3943224356416646889</id><published>2011-05-07T14:47:00.000-06:00</published><updated>2011-05-07T14:47:00.958-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hiring'/><category scheme='http://www.blogger.com/atom/ns#' term='community'/><title type='text'>CI Fellows Program, again</title><content type='html'>Are you graduating and interested in doing a fun postdoc in your area of choosing on your project of choosing?&amp;nbsp; Apply to be a &lt;a href="http://cifellows.org/match/"&gt;NSF CI Fellow&lt;/a&gt;!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-3943224356416646889?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/3943224356416646889/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=3943224356416646889' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/3943224356416646889'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/3943224356416646889'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2011/05/ci-fellows-program-again.html' title='CI Fellows Program, again'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-4693793438588242974</id><published>2011-04-12T16:03:00.000-06:00</published><updated>2011-04-12T16:03:26.107-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='random'/><title type='text'>Google scholar "cited by" changed recently???</title><content type='html'>Someone (can't remember who anymore, even though it was just a couple days ago!) 
pointed out to me that Google scholar seems to be doing weird things with citations.&amp;nbsp; In particular, it seems to think that the citation relation is symmetric (at least in some cases).&lt;br /&gt;&lt;br /&gt;Here's an easy example.&amp;nbsp; Look up Khalid El-Arini's 2009 paper "&lt;a href="http://scholar.google.com/scholar?q=Turning+Down+the+Noise+in+the+Blogosphere+Khalid+El-Arini+Gaurav+Veda+Dafna+Shahaf+Carlos+Guestrin"&gt;Turning down the noise in the blogosphere&lt;/a&gt;" on Google scholar (or just follow that link).&amp;nbsp; Apparently it's been cited by 24 papers.&amp;nbsp; Let's look at &lt;a href="http://scholar.google.com/scholar?cites=16086253890751957637&amp;amp;as_sdt=20000005&amp;amp;sciodt=0,21&amp;amp;hl=en"&gt;who cites them&lt;/a&gt;.&amp;nbsp; Apparently in 2003, in addition to inventing LDA, its author also invented a time machine so that he could cite Khalid's paper!&lt;br /&gt;&lt;br /&gt;The weird thing is that this doesn't seem to be a systematic error!&amp;nbsp; It only happens sometimes.&lt;br /&gt;&lt;br /&gt;Oh well, I won't complain -- it just makes my H index look better :)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-4693793438588242974?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/4693793438588242974/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=4693793438588242974' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/4693793438588242974'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/4693793438588242974'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2011/04/google-scholar-cited-by-changed.html' 
title='Google scholar &quot;cited by&quot; changed recently???'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-50543944029625227</id><published>2011-04-06T18:33:00.000-06:00</published><updated>2011-04-06T18:33:43.307-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='linguistics'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Seeding, transduction, out-of-sample error and the Microsoft approach...</title><content type='html'>My past master's student Adam Teichert (now at JHU) did some work on inducing part of speech taggers using typological information.&amp;nbsp; We wanted to compare the usefulness of using small amounts of linguistic information with small amounts of lexical information in the form of seeds.&amp;nbsp; (Other papers give seeds different names, like initial dictionaries or prototypes or whatever... 
it's all the same basic idea.)&lt;br /&gt;&lt;br /&gt;The basic result was that if you &lt;i&gt;don't&lt;/i&gt; use seeds, then typological information can help a lot.&amp;nbsp; If you do use seeds, then your baseline performance jumps from like 5% to about 40% and then using typological information on top of this isn't really that beneficial.&lt;br /&gt;&lt;br /&gt;This was a bit frustrating, and led us to think more about the problem.&amp;nbsp; The way we got seeds was to look at the wikipedia page about Portuguese (for instance) and use &lt;i&gt;their&lt;/i&gt; example list of words for each tag.&amp;nbsp; An alternative popular way is to use labeled data and extract the few most frequent words for each part of speech type.&amp;nbsp; They're not identical, but there is definitely quite a bit of overlap between the words that Wikipedia lists as examples of determiners and the most frequent determiners (this correlation is especially strong for closed-class words).&lt;br /&gt;&lt;br /&gt;In terms of end performance, there are two reasons seeds can help.&amp;nbsp; The first, which is the &lt;i&gt;interesting&lt;/i&gt; case, is that knowing that "the" is a determiner helps you find other determiners (like "a") and perhaps also nouns (for instance, knowing that determiners often precede nouns in Portuguese).&amp;nbsp; The second, which is the &lt;i&gt;uninteresting&lt;/i&gt; case, is that now every time you see one of your seeds, you pretty much always get it right.&amp;nbsp; In other words, just by specifying seeds, especially by frequency (or approximately by frequency ala Wikipedia), you're basically ensuring that you get 90% accuracy (due to ambiguity) on some large fraction of the corpus (again, especially for closed-class words which have short tails).&lt;br /&gt;&lt;br /&gt;This phenomenon is mentioned in the text (but not the tables :P), for instance, in Haghighi &amp;amp; Klein's 2006 NAACL paper on prototype-driven POS tagging, wherein they say: "Adding 
prototypes ... gave an accuracy of 68.8% on all tokens, but only 47.7% on non-prototype occurrences, which is only a marginal improvement over [a baseline system with no prototypes]."&amp;nbsp; Their improved system remedies this and achieves better accuracy on non-prototypes as well as prototypes (aka seeds).&lt;br /&gt;&lt;br /&gt;This is very similar to the idea of transductive learning in machine learning land.&amp;nbsp; Transduction is an alternative to semi-supervised learning.&amp;nbsp; The setting is that you get a bunch of data, some of which is labeled and some of which is unlabeled.&amp;nbsp; Your goal is to simply label the unlabeled data.&amp;nbsp; You &lt;i&gt;need not&lt;/i&gt; "induce" the labeling function (though many approaches do, in passing).&lt;br /&gt;&lt;br /&gt;The interesting thing is that learning with seeds is very similar to transductive learning, though perhaps with a bit stronger assumption of noise on the "labeled" part.&amp;nbsp; The irony is that in machine learning land, you would &lt;i&gt;never&lt;/i&gt; report "combined training and test accuracy" -- this would be ridiculous.&amp;nbsp; Yet this is what we seem to like to do in NLP land.&amp;nbsp; This is itself related to an old idea in machine learning wherein you rate yourself only on test examples that you &lt;i&gt;didn't&lt;/i&gt; see at training time.&amp;nbsp; This is your out-of-sample error, and is obviously much harder than your standard generalization error.&amp;nbsp; (The famous no-free-lunch theorems are from an out-of-sample analysis.)&amp;nbsp; The funny thing about out-of-sample error is that sometimes you prefer &lt;i&gt;not&lt;/i&gt; to get more training examples, because you then know you won't be tested on them!&amp;nbsp; If you were getting them right already, this just hurts you.&amp;nbsp; (Perhaps you should be allowed to see &lt;i&gt;x&lt;/i&gt; and say "no I don't want to see &lt;i&gt;y&lt;/i&gt;"?)&lt;br /&gt;&lt;br /&gt;I think the key question is: what are we trying 
to do.&amp;nbsp; If we're trying to build good taggers (i.e., we're engineers) then overall accuracy is what we care about and including "seed" performance in our evaluations makes sense.&amp;nbsp; But when we're talking about 45% tagging accuracy (like Adam and I were), then this is a pretty pathetic claim.&amp;nbsp; In the case that we're trying to understand learning algorithms and study their performance on real data (i.e., we're scientists) then accuracy on non-seeds is perhaps more interesting.&amp;nbsp; (Please don't jump on me for the engineer/scientist distinction: it's obviously much more subtle than this.)&lt;br /&gt;&lt;br /&gt;This also reminds me of something Eric Brill said to me when I was working with him as a summer intern in MLAS at Microsoft (back when MLAS existed and back when Eric was in MLAS....).&amp;nbsp; We were working on web search stuff.&amp;nbsp; His comment was that he really didn't care about doing well on the 1000 most frequent queries.&amp;nbsp; Microsoft could always hire a couple annotators to manually do a good job on these queries.&amp;nbsp; And in fact, this is what is often done.&amp;nbsp; What we care about is the heavy tail, where there are too many somewhat common things to have humans annotate them all.&amp;nbsp; This is precisely the same situation here.&amp;nbsp; I can easily get 1000 seeds for a new language.&amp;nbsp; Do I actually care how well I do on those, or do I care how well I do on the other 20000+ things?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-50543944029625227?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/50543944029625227/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=50543944029625227' title='2 Comments'/><link rel='edit' 
type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/50543944029625227'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/50543944029625227'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2011/04/seeding-transduction-out-of-sample.html' title='Seeding, transduction, out-of-sample error and the Microsoft approach...'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-3369394059086205008</id><published>2011-03-10T11:58:00.000-07:00</published><updated>2011-03-10T11:58:16.774-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hiring'/><title type='text'>Postdoc Position at CLIP (@UMD)</title><content type='html'>Okay, now is when I take serious unfair advantage of having this blog.&amp;nbsp; We have a postdoc opening.&amp;nbsp; See the official ad below for details:&lt;br /&gt;&lt;br /&gt;A postdoc position is available in the Computational Linguistics and&lt;br /&gt;Information Processing (CLIP) Laboratory in the Institute for Advanced&lt;br /&gt;Computer Studies at University of Maryland.&amp;nbsp; We are seeking a talented&lt;br /&gt;researcher in natural language processing, with strong interests in&lt;br /&gt;the processing of scientific literature.&lt;br /&gt;&lt;br /&gt;A successful candidate should have a strong NLP background with a&lt;br /&gt;track record of top-tier research publications.&amp;nbsp; A Ph.D. 
in computer&lt;br /&gt;science and strong organizational and coordination skills are a must.&lt;br /&gt;In addition to pursuing original research in scientific literature&lt;br /&gt;processing, the ideal candidate will coordinate the efforts of the&lt;br /&gt;other members of that project.&amp;nbsp; While not necessary, experience in one&lt;br /&gt;or more of the following areas is highly advantageous: summarization,&lt;br /&gt;NLP or data mining for scientific literature, machine learning, and&lt;br /&gt;the use of linguistic knowledge in computational systems. &lt;br /&gt;Additionally, experience with large-data NLP and system building will&lt;br /&gt;be considered favorably.&lt;br /&gt;&lt;br /&gt;The successful candidate will work closely with current CLIP faculty,&lt;br /&gt;especially Bonnie Dorr, Hal Daume III and Ken Fleischmann, while&lt;br /&gt;interacting with a large team involving NLP researchers across several&lt;br /&gt;other prominent institutions.&amp;nbsp; The duration of the position is one &lt;br /&gt;year, starting Summer or Fall 2011, and is potentially extendible.&lt;br /&gt;&lt;br /&gt;CLIP is a dynamic interdisciplinary computational linguistics &lt;br /&gt;program with faculty from across the university, and major research &lt;br /&gt;efforts in machine translation, information retrieval, semantic &lt;br /&gt;analysis, generation, and development of large-scale statistical&lt;br /&gt;language processing tools.&lt;br /&gt;&lt;br /&gt;Please send a CV and names and contact information of 3 referees,&lt;br /&gt;preferably by e-mail, to:&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Jessica Touchard&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; jessica AT cs DOT umd DOT edu&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Department of Computer Science&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; A.V. 
Williams Building, Room 1103&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; University of Maryland&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; College Park, MD 20742&lt;br /&gt;&lt;br /&gt;Specific questions about the position may be addressed to Hal Daume&lt;br /&gt;III at hal AT umiacs DOT umd DOT edu.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-3369394059086205008?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/3369394059086205008/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=3369394059086205008' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/3369394059086205008'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/3369394059086205008'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2011/03/postdoc-position-at-clip-umd.html' title='Postdoc Position at CLIP (@UMD)'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-5080250926854229169</id><published>2011-03-08T06:47:00.000-07:00</published><updated>2011-03-08T06:47:44.361-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='reviewing'/><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>Some thoughts on supplementary materials</title><content type='html'>Having the option of authors submitting supplementary materials is becoming popular in NLP/ML 
land.&amp;nbsp; NIPS was one of the first conferences I submit to that allowed this; I think ACL allowed it this past year, at least for specific types of materials (code, data), and EMNLP is thinking of allowing it at some point in the near future.&lt;br /&gt;&lt;br /&gt;Here is a snippet of the &lt;a href="http://nips.cc/PaperInformation/AuthorSubmissionInstructions"&gt;NIPS call for papers&lt;/a&gt; (see section 5) that describes the role of supplementary materials:&lt;br /&gt;&lt;blockquote&gt;In addition to the submitted PDF paper, authors can additionally submit supplementary material for their paper... Such extra material may include long technical  proofs that do not fit into the paper, image, audio or video sample  outputs from your algorithm, animations that describe your algorithm,  details of experimental results, or even source code for running  experiments.&amp;nbsp; &lt;i&gt;Note that the reviewers and the program committee  reserve the right to judge the paper solely on the basis of the 8 pages,  9 pages including citations, of the paper; looking at any extra  material is up to the discretion of the reviewers and is not required.&lt;/i&gt;&lt;/blockquote&gt;(Emphasis mine.)&amp;nbsp; Now, before everyone goes misinterpreting what I'm about to say, let me make it clear that &lt;b&gt;in general I like the idea of supplementary materials, given our current publishing model.&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;You can think of the emphasized part of the call as a form of &lt;i&gt;reviewer protection.&lt;/i&gt;&amp;nbsp; It basically says: look, we know that reviewers are overloaded; if your paper isn't very interesting, the reviewers aren't required to read the supplement.&amp;nbsp; (As an aside, I feel the same thing happens with pages 2-8 given page 1 in a lot of cases :P.)&lt;br /&gt;&lt;br /&gt;I think it's good to have such a form of reviewer protection.&amp;nbsp; What I wonder is whether it also makes sense to &lt;b&gt;add a form of author 
protection.&lt;/b&gt;&amp;nbsp; In other words, the current policy -- which seems only explicitly stated in the case of NIPS, but seems to be generally understood elsewhere, too -- is that reviewers are protected from overzealous authors.&amp;nbsp; I think we need to have additional clauses that protect authors from overzealous reviewers.&lt;br /&gt;&lt;br /&gt;Why?&amp;nbsp; Already I get annoyed with reviewers who seem to think that extra experiments, discussion, proofs or whatever can somehow magically fit in an already crammed 8 page paper.&amp;nbsp; A general suggestion to reviewers is that if you're suggesting things to add, you should also suggest things to cut.&lt;br /&gt;&lt;br /&gt;This situation is exacerbated infinity-fold with the "option" of supplementary material.&amp;nbsp; There now is no length-limit reason why an author couldn't include everything under the sun.&amp;nbsp; And it's too easy for a reviewer just to say that XYZ should have been included because, well, it could just have gone in the supplementary material!&lt;br /&gt;&lt;br /&gt;So what I'm proposing is that supplementary material clauses should have &lt;i&gt;two&lt;/i&gt; forms of protection.&amp;nbsp; The first being the existing one, protecting reviewers from overzealous authors.&amp;nbsp; The second being the reverse, something like:&lt;br /&gt;&lt;blockquote&gt;Authors are not obligated to include supplementary materials.&amp;nbsp; The paper should stand on its own, excluding any supplement.&amp;nbsp; Reviewers must take into account the strict 8 page limit when evaluating papers.&lt;/blockquote&gt;Or something like that: the wording isn't quite right.&amp;nbsp; But without this, I fear that supplementary materials will, in the limit, simply turn into an arms race.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-5080250926854229169?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link 
rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/5080250926854229169/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=5080250926854229169' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/5080250926854229169'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/5080250926854229169'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2011/03/some-thoughts-on-supplementary.html' title='Some thoughts on supplementary materials'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-806876833341250075</id><published>2011-03-02T05:57:00.000-07:00</published><updated>2011-03-02T05:57:37.394-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='survey'/><title type='text'>Grad school survey, revisited</title><content type='html'>You may recall a while ago I ran a &lt;a href="http://nlpers.blogspot.com/2009/09/where-did-you-apply-to-grad-school.html"&gt;survey on where people applied to grad school&lt;/a&gt;.  
Obviously I've been sitting on these results for a while now, but I figured since it's that time of year when people are &lt;i&gt;choosing&lt;/i&gt; grad schools, that I would say how things turned out.&amp;nbsp; Here's a summary of things that people thought were most important (deciding factor), and moderately important (contributing factor, in parens):&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;Academic Program&lt;/b&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;Specialty degree programs in my research area, 48%&lt;/b&gt;&lt;/li&gt;&lt;li&gt;(Availability of interesting courses, 16%)&lt;/li&gt;&lt;li&gt;(Time to completion, 4%)&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;b&gt;Application Process&lt;/b&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Nothing&lt;b&gt;&amp;nbsp;&lt;/b&gt; &lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;b&gt;Faculty Member(s)&lt;/b&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;Read research papers by faculty member, 44%&lt;/b&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;b&gt;Geographic Area&lt;/b&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;(Outside interests/personal preference, 15%)&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;b&gt;Recommendations from People&lt;/b&gt;&lt;b&gt;&amp;nbsp;&lt;/b&gt;&lt;b&gt; &lt;/b&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;Professors in technical area, 45%&lt;/b&gt;&lt;/li&gt;&lt;li&gt;(Teachers/academic advisors, 32%)&lt;/li&gt;&lt;li&gt;(Technical colleagues, 20%)&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;b&gt;Reputation&lt;/b&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;... of research group, 61%&lt;/b&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;... 
of department/college, 50%&lt;/b&gt;&lt;/li&gt;&lt;li&gt;(Ranking of university, 35%)&lt;/li&gt;&lt;li&gt;(Reputation of university, 34%)&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;b&gt;Research Group&lt;/b&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;Research group works on interesting problems, 55% &lt;/b&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Many faculty in a specialty area (e.g., ML), 44%&lt;/b&gt;&lt;/li&gt;&lt;li&gt;(Many faculty/students in general area (e.g., AI), 33%)&lt;/li&gt;&lt;li&gt;(Research group publishes a lot, 26%)&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;b&gt;Web Presence&lt;/b&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;(Learned about group via web search, 37%)&lt;/li&gt;&lt;li&gt;(Learned about dept/univ via web search, 24%)&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;b&gt;General&lt;/b&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;Funding availability, 49%&lt;/b&gt; &lt;/li&gt;&lt;li&gt;(High likelihood of being accepted, 12%) &lt;/li&gt;&lt;li&gt;(Size of dept/university, 5%)&lt;b&gt; &lt;/b&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;Overall these seem pretty reasonable.&amp;nbsp; And of course they all point to the fact that everyone should come to Maryland :P.&amp;nbsp; Except for the fact that we don't have specialty degree programs, but that's the one thing on the list that I actually think is a bit silly: it might make sense for MS, but I don't really think it should be an important consideration for Ph.D.s.&amp;nbsp; You can get the &lt;a href="http://hal3.name/tmp/gradschool-survey.pdf"&gt;full results&lt;/a&gt; if you want to read them and the comments: they're pretty interesting, IMO.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-806876833341250075?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/806876833341250075/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=806876833341250075' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/806876833341250075'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/806876833341250075'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2011/03/grad-school-survey-revisited.html' title='Grad school survey, revisited'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-6982589199507253162</id><published>2011-02-28T13:27:00.001-07:00</published><updated>2011-02-28T15:10:15.133-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='community'/><category scheme='http://www.blogger.com/atom/ns#' term='teaching'/><title type='text'>What are your plans between ACL and ICML?</title><content type='html'>I'll tell you what they &lt;i&gt;should be&lt;/i&gt;: attending the &lt;a href="http://www.ttic.edu/sigml/symposium2011/"&gt;Symposium on Machine Learning in Speech and Language Processing&lt;/a&gt;, jointly sponsored by IMLS, ICML and ISCA, that I'm co-organizing with Dan Roth, Geoff Zweig and Joseph Keshet (the exact date is June 27, in the same venue as ICML in Bellevue, Washington).&amp;nbsp; So far we've got a great list of invited speakers from all of these areas, including Mark Steedman, Stan Chen, Yoshua Bengio, Lawrence Saul, Sanjoy Dasgupta and more.&amp;nbsp; (See the web page for more details.)&amp;nbsp; We'll also be organizing some sort of day trips (together with the local organizers of ICML) for people who want to join!&amp;nbsp; You should also consider submitting papers 
(deadline is April 15).&lt;br /&gt;&lt;br /&gt;I know I said a month ago that I would blog more.&amp;nbsp; I guess that turned out to be a lie.&amp;nbsp; The problem is that I only have so much patience for writing and I've been spending a lot of time writing non-blog things recently.&amp;nbsp; I decided to use my time-off-teaching doing &lt;a href="http://hal3.name/ciml"&gt;something far more time consuming than teaching&lt;/a&gt;.  This has been a wondrously useful exercise for me and I hope that, perhaps starting in 2012, other people can take advantage of this work.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-6982589199507253162?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/6982589199507253162/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=6982589199507253162' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/6982589199507253162'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/6982589199507253162'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2011/02/what-are-your-plans-between-acl-and.html' title='What are your plans between ACL and ICML?'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-8527687550339950613</id><published>2011-01-17T12:20:00.002-07:00</published><updated>2011-01-17T13:01:48.591-07:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='parsing'/><title type='text'>Parsing with Transformations</title><content type='html'>I remember when I took my first "real" Syntax class, where by "real" I mean "Chomskyan."  It was at USC in Fall 2001, taught by &lt;a href="http://www-bcf.usc.edu/%7Epancheva/"&gt;Roumyana Pancheva&lt;/a&gt;.  It was hard as hell but I loved it.  However, as a computationally minded guy, I remember snickering to myself the whole time we were talking about movements that get you from &lt;a href="http://en.wikipedia.org/wiki/Transformational_grammar"&gt;deep structure to surface structure&lt;/a&gt;.  This stuff was all computationally ridiculous.&lt;br /&gt;&lt;br /&gt;But &lt;i&gt;why&lt;/i&gt; was it computationally ridiculous?  It was ridiculous because my mindset, and I think the mindset of most computational folks at the time, was that of n^3 CKY or Earley style parsing.  Namely exact parsing in a context free manner.  This whole idea of transformations would kill anything like that in a very bad way.&lt;br /&gt;&lt;br /&gt;However, there's been a recent shift in attitudes.  Sure, people still do their n^3 parsing, but of course none of it is exact anyway (due to pruning).  But more than that, things like linear time parsing algorithms as popularized by people like Joakim Nivre and Kenji Sagae and Brian Roark and Joseph Turian, have proved very useful.  They work well, are incredibly efficient, and are easy to implement.  
They're also a bit more psychologically plausible (as Eugene Charniak said recently "we don't know what people are doing, but they're definitely not doing CKY.").&lt;br /&gt;&lt;br /&gt;So I'm led to wonder: &lt;b&gt;could we actually do parsing in a transformational grammar using all the new stuff we know about (for instance) left-to-right parsing?&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;One thing that stands in our way, of course, is the stupid Penn Treebank, which was annotated only with very simple transformations (mostly noun phrase movements) and not really "deep" transformations as most Chomskyan linguists would recognize them.&lt;br /&gt;&lt;br /&gt;But I think you could still do it.&amp;nbsp; It would end up as being partially unsupervised, but at least from a minimum description length perspective, I can either spend weights learning more special cases, or I can learn general transformational rules.&amp;nbsp; It would take some thought and effort to write it out and figure out how to actually optimize such a thing, but I bet it could be done in a semester.&lt;br /&gt;&lt;br /&gt;So then the question is: aside from smaller models (potentially), is there any other reason to do it?&lt;br /&gt;&lt;br /&gt;I can think of at least one: parsing non-declarative sentences.&amp;nbsp; Since almost all sentences in the Treebank are declarative, parsers do pretty crappy when tested on other things.&amp;nbsp; Slav Petrov had &lt;a href="http://www.petrovi.de/data/emnlp10b.pdf"&gt;a paper at EMNLP 2010 on parsing questions&lt;/a&gt;.&amp;nbsp; Here is the abstract, which says pretty much everything:&lt;br /&gt;&lt;blockquote&gt;... We show that dependency parsers have more difficulty parsing questions than constituency parsers. In particular, deterministic shift-reduce dependency parsers ... drop to 60% labeled accuracy on a question test set. 
We propose an uptraining procedure in which a deterministic parser is trained on the output of a more accurate, but slower, latent variable constituency parser (converted to dependencies). Uptraining with 100K unlabeled questions achieves results comparable to having 2K labeled questions for training. With 100K unlabeled and 2K labeled questions, uptraining is able to improve parsing accuracy to 84%, closing the gap between in-domain and out-of-domain performance.&lt;/blockquote&gt;Now, at least in principle, if you can parse declarative sentences, you should be able to parse questions.&amp;nbsp; At least if you know about some basic syntactic transformations in English.&amp;nbsp; (As an aside, the "uptraining" idea is almost exactly the same as the &lt;a href="http://hal3.name/docs/daume08flat.pdf"&gt;structure compilation idea&lt;/a&gt; that Percy, Dan and I had at ICML 2008, though Slav and colleagues apply it to a domain adaptation problem, while we just did simple semi-supervised learning.)&lt;br /&gt;&lt;br /&gt;We have observed similar effects in the parsing of commands, such as "Put your head in a noose" where parsers -- even constituency ones -- really really want "Put" to be a noun!&amp;nbsp; Again, if you know simple transformations -- like subject dropping -- you should be able to parse commands if you can already parse declarations.&lt;br /&gt;&lt;br /&gt;As with any generalization, the hope is that by realizing the generalization, you don't need to store so many specific cases.&amp;nbsp; So if you can learn that commands and questions are simple transformations of declarative sentences, and you can learn to parse declaratives, you should be able to handle the other cases.&lt;br /&gt;&lt;br /&gt;(Anticipating comments: yes, I know you could try to pre-transform your data, like they do in MT, but that's quite inelegant.&amp;nbsp; And yes, I know you could probably take the treebank and turn a lot of the sentences into commands or questions to create a new 
data set.&amp;nbsp; But that's kind of missing the point: I don't want to just handle commands or questions... I want to handle anything, even things that I might not have anticipated.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-8527687550339950613?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/8527687550339950613/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=8527687550339950613' title='22 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/8527687550339950613'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/8527687550339950613'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2011/01/parsing-with-transformations.html' title='Parsing with Transformations'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>22</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-3555243437241244556</id><published>2011-01-11T15:25:00.002-07:00</published><updated>2011-01-13T10:06:19.532-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>NIPS 2010 Retrospective</title><content type='html'>Happy New Year and I know I've been silent but I've been busy.&amp;nbsp; But no teaching this semester (YAY!) 
so maybe you'll see more posts.&lt;br /&gt;&lt;br /&gt;At any rate, I'm really late to the table, but here are my comments about this past year's NIPS.&amp;nbsp; Before we get to that, I hope that everyone knows that this coming NIPS will be in Granada, and then for (at least) the next five years will be in Tahoe.&amp;nbsp; Now that I'm not in ski-land, it's nice to have a yearly ski vacation ... erm I mean scientific conference.&lt;br /&gt;&lt;br /&gt;But since this was the last year of NIPS in Vancouver, I thought I'd share a conversation that occurred this year at NIPS, with participants anonymized.&amp;nbsp; (I hope everyone knows to take this in good humor: I'm perfectly happy to poke fun at people from the States, too...).&amp;nbsp; The context is that one person in a large group, which was going to find lunch, had a cell phone with a data plan that worked in Canada:&lt;br /&gt;&lt;blockquote&gt;&lt;b&gt;A: &lt;/b&gt;Wow, that map is really taking a long time to load.&lt;br /&gt;&lt;b&gt;B:&lt;/b&gt; I know.&amp;nbsp; It's probably some socialized Canadian WiFi service.&lt;br /&gt;&lt;b&gt;C:&lt;/b&gt; No, it's probably just slow because every third bit has to be a Canadian bit?&lt;br /&gt;&lt;b&gt;D: &lt;/b&gt;No no, it's because every bit has to be sent in both English and French!&lt;/blockquote&gt;Okay it's not that funny, but it was funny at the time.&amp;nbsp; (And really "B" is as much a joke about the US as it was about Canada :P.)&lt;br /&gt;&lt;br /&gt;But I'm sure you are here to hear about papers, not stupid Canada jokes.&amp;nbsp; So here's my take.&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://www.cs.wisc.edu/%7Eswright/nips2010/sjw-nips10.pdf"&gt;tutorial on optimization&lt;/a&gt; by Stephen Wright was awesome.&amp;nbsp;  I hope this shows up on videolectures soon.&amp;nbsp;&lt;b&gt;(Update: &lt;a href="http://videolectures.net/nips2010_wright_oaml/"&gt;it has&lt;/a&gt;!)&lt;/b&gt; I will make it required reading / watching for 
students.&amp;nbsp; There's just too much great stuff in it to go into, but how about this: momentum is the same as CG!&amp;nbsp; Really?!?!&amp;nbsp; There's tons of stuff that I want to look more deeply into, such as robust mirror descent, some work by Candes about SVD when we don't care about near-zero SVs, regularized stochastic gradient (Xiao) and sparse eigenvector work.&amp;nbsp; Lots of awesome stuff.&amp;nbsp; My favorite part of NIPS.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Some papers I saw that I really liked: &lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_1265.pdf"&gt;A Theory of Multiclass Boosting&lt;/a&gt; (Indraneel Mukherjee, Robert Schapire): Formalizes boosting in a multiclass setting.&amp;nbsp; The crux is a clever generalization of the "weak learning" notion from binary.&amp;nbsp; The idea is that a weak binary classifier is one that has a small &lt;i&gt;advantage&lt;/i&gt; over random guessing (which, in the binary case, gives 50/50).&amp;nbsp; Generalize this and it works.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0875.pdf"&gt;Structured sparsity-inducing norms through submodular functions&lt;/a&gt; (Francis Bach): I need to read this.&amp;nbsp; This was one of those talks where I understood the first half and then got lost.&amp;nbsp; But the idea is that you can go back-and-forth between submodular functions and sparsity-inducing norms.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0071.pdf"&gt;Construction of Dependent Dirichlet Processes based on Poisson Processes&lt;/a&gt; (Dahua Lin, Eric Grimson, John Fisher): The title says it all!&amp;nbsp; It's an alternative construction to the Polya urn scheme and also to the stick-breaking scheme.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0743.pdf"&gt;A Reduction from Apprenticeship Learning to Classification&lt;/a&gt; (Umar Syed, 
Robert Schapire): Right up my alley: some surprising results about apprenticeship learning (aka Hal's version of structured prediction) and classification.&amp;nbsp; Similar to a recent paper by Stephane Ross and Drew Bagnell on &lt;a href="http://www.ri.cmu.edu/publication_view.html?pub_id=6569"&gt;Efficient Reductions for Imitation Learning&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0243.pdf"&gt;Variational Inference over Combinatorial Spaces&lt;/a&gt; (Alexandre Bouchard-Cote, Michael Jordan): When you have complex combinatorial spaces (think traveling salesman), how can you construct generic variational inference algorithms?&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0426.pdf"&gt;Implicit Differentiation by Perturbation&lt;/a&gt; (Justin Domke): This is a great example of a paper that I never would have read, looked at, seen, visited the poster of, known about etc., were it not for serendipity at conferences (basically Justin was the only person at his poster when I showed up early for the session, so I got to see this poster).&amp;nbsp; The idea is that you have a graphical model, and some loss function L(.) 
which is defined over the marginals mu(theta), where theta are the parameters of the model, and you want to optimize L(mu(theta)) as a function of theta.&amp;nbsp; Without making any serious assumptions about the form of L, you can actually do gradient descent, where each gradient computation costs two runs of belief propagation.&amp;nbsp; I think this is amazing.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_1179.pdf"&gt;Probabilistic Deterministic Infinite Automata&lt;/a&gt; (David Pfau, Nicholas Bartlett, Frank Wood): Another one where the title says it all.&amp;nbsp; DP-style construction of infinite automata.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0455.pdf"&gt;Graph-Valued Regression&lt;/a&gt; (Han Liu, Xi Chen, John Lafferty, Larry Wasserman): The idea here is to define a regression function over a graph.&amp;nbsp; It should be regularized in a sensible way.&amp;nbsp; Very LASSO-esque model, as you might expect given the author list :).&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Other papers I saw that I liked but not enough to write mini summaries of:&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0566.pdf"&gt;Word Features for Latent Dirichlet Allocation&lt;/a&gt; (James Petterson, Alexander Smola, Tiberio Caetano, Wray Buntine, Shravan Narayanamurthy) &lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0146.pdf"&gt;Tree-Structured Stick Breaking for Hierarchical Data&lt;/a&gt; (Ryan Adams, Zoubin Ghahramani, Michael Jordan) &lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0513.pdf"&gt;Categories and Functional Units: An Infinite Hierarchical Model for Brain Activations&lt;/a&gt; (Danial Lashkari, Ramesh Sridharan, Polina Golland) &lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_1297.pdf"&gt;Trading off Mistakes and Don't-Know Predictions&lt;/a&gt; (Amin Sayedi, 
Morteza Zadimoghaddam, Avrim Blum)&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0417.pdf"&gt;Joint Analysis of Time-Evolving Binary Matrices and Associated Documents&lt;/a&gt; (Eric Wang, Dehong Liu, Jorge Silva, David Dunson, Lawrence Carin) &lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_1083.pdf"&gt;Learning Efficient Markov Networks&lt;/a&gt; (Vibhav Gogate, William Webb, Pedro Domingos) &lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0071.pdf"&gt;Construction of Dependent Dirichlet Processes based on Poisson Processes&lt;/a&gt; (Dahua Lin, Eric Grimson, John Fisher) &lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0427.pdf"&gt;Supervised Clustering&lt;/a&gt; (Pranjal Awasthi, Reza Bosagh Zadeh) &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Two students who work with me (though one isn't actually mine :P), who went to NIPS also shared their favorite papers.&amp;nbsp; The first is a list from &lt;a href="http://www.blogger.com/www.cs.utah.edu/%7Eavishek"&gt;Avishek Saha&lt;/a&gt;:&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_1265.pdf"&gt;A Theory of Multiclass Boosting&lt;/a&gt; (Indraneel Mukherjee, Robert Schapire) &lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_1173.pdf"&gt;Repeated Games against Budgeted Adversaries&lt;/a&gt; (Jacob Abernethy, Manfred Warmuth) &lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_1073.pdf"&gt;Non-Stochastic Bandit Slate Problems&lt;/a&gt; (Satyen Kale, Lev Reyzin, Robert Schapire) &lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_1297.pdf"&gt;Trading off Mistakes and Don't-Know 
Predictions&lt;/a&gt; (Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum) &lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0731.pdf"&gt;Learning Bounds for Importance Weighting&lt;/a&gt; (Corinna Cortes, Yishay Mansour, Mehryar Mohri) &lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0427.pdf"&gt;Supervised Clustering&lt;/a&gt; (Pranjal Awasthi, Reza Bosagh Zadeh) &lt;br /&gt;&lt;br /&gt;&lt;b&gt;The second list is from &lt;a href="http://www.blogger.com/www.cs.utah.edu/%7Epiyush"&gt;Piyush Rai&lt;/a&gt;, who apparently aimed for recall (though not with a lack of precision) :P:&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_1269.pdf"&gt;Online Learning: Random Averages, Combinatorial Parameters, and Learnability&lt;/a&gt; (Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari): defines several complexity measures for online learning akin to what we have for the batch setting (e.g., Rademacher averages, covering numbers etc.).&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_1015.pdf"&gt;Online Learning in The Manifold of Low-Rank Matrices&lt;/a&gt; (Uri Shalit, Daphna Weinshall, Gal Chechik): nice general framework applicable in a number of online learning settings. 
could also be used for online multitask learning.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_1048.pdf"&gt;Fast global convergence rates of gradient methods for high-dimensional   statistical recovery&lt;/a&gt; (Alekh Agarwal, Sahand Negahban, Martin Wainwright): shows that the properties of sparse estimation problems that lead to statistical efficiency also lead to computational efficiency which explains the faster practical convergence of gradient methods than what the theory guarantees.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0784.pdf"&gt;Copula Processes&lt;/a&gt; (Andrew Wilson, Zoubin Ghahramani): how do you determine the relationship between random variables which could have different marginal distributions (say one has gamma and the other has gaussian distribution)? copula process gives an answer to this.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0455.pdf"&gt;Graph-Valued Regression&lt;/a&gt; (Han Liu, Xi Chen, John Lafferty, Larry Wasserman): usually undirected graph structure learning involves a set of random variables y drawn from a distribution p(y). but what if y depends on another variable x? this paper is about learning the graph structure of the distribution p(y|x=x).&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0875.pdf"&gt;Structured sparsity-inducing norms through submodular functions&lt;/a&gt; (Francis Bach): standard sparse recovery uses l1 norm as a convex proxy for the l0 norm (which constrains the number of nonzero coefficients to be small). 
this paper proposes several more general set functions and their corresponding convex proxies, and links them to known norms.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_1297.pdf"&gt;Trading off Mistakes and Don't-Know Predictions&lt;/a&gt; (Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum): an interesting paper -- what if in an online learning setting you could abstain from making a prediction on some of the training examples and just say "i don't know"? on others, you may or may not make the correct prediction. lies somewhere in the middle of always predicting right or wrong (i.e., standard mistake driven online learning) versus the recent work on only predicting correctly or otherwise saying "i don't know".&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0243.pdf"&gt;Variational Inference over Combinatorial Spaces&lt;/a&gt; (Alexandre Bouchard-Cote, Michael Jordan): cool paper. applicable to lots of settings.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_1265.pdf"&gt;A Theory of Multiclass Boosting&lt;/a&gt; (Indraneel Mukherjee, Robert Schapire): we know that boosting in binary case requires "slightly better than random" weak learners. this paper characterizes conditions on the weak learners for the multi-class case, and also gives a boosting algorithm.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0853.pdf"&gt;Multitask Learning without Label Correspondences&lt;/a&gt; (Novi Quadrianto, Alexander Smola, Tiberio Caetano, S.V.N. Vishwanathan, James Petterson): usually mtl assumes that the output space is the same for all the tasks but in many cases this may not be true. for instance, we may have two related prediction problems on two datasets but the output spaces for both may be different and may have some complex (e.g., hierarchical, and potentially time varying) output spaces. 
the paper uses a mutual information criterion to learn the correspondence between the output spaces.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0462.pdf"&gt;Learning Multiple Tasks with a Sparse Matrix-Normal Penalty&lt;/a&gt; (Yi Zhang, Jeff Schneider): presents a general multitask learning framework and many recently proposed mtl models turn out to be special cases. models both feature covariance and task covariance matrices.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0791.pdf"&gt;Efficient algorithms for learning kernels from multiple similarity matrices with general convex loss functions&lt;/a&gt; (Achintya Kundu, Vikram Tankasali, Chiranjib Bhattacharyya, Aharon Ben-Tal): the title says it all. :) multiple kernel learning is usually applied in the classification setting but due to the applicability of the proposed method for a wide variety of loss functions, one can possibly also use it for unsupervised learning problems (e.g., spectral clustering, kernel pca, etc.).&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0929.pdf"&gt;Getting lost in space: Large sample analysis of the resistance distance&lt;/a&gt; (Ulrike von Luxburg, Agnes Radl, Matthias Hein): large sample analysis of the commute distance: shows a rather surprising result: the commute distance between two vertices is meaningless if the graph is "large" and the nodes represent high dimensional variables. 
the paper proposes a correction and calls it "amplified commute distance".&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_1334.pdf"&gt;A Bayesian Approach to Concept Drift&lt;/a&gt; (Stephen Bach, Mark Maloof): gives a bayesian approach for segmenting a sequence of observations such that each "block" of observations has the same underlying concept.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0951.pdf"&gt;MAP Estimation for Graphical Models by Likelihood Maximization&lt;/a&gt; (Akshat Kumar, Shlomo Zilberstein): they show that you can think of an mrf as a mixture of bayes nets and then the map problem on the mrf corresponds to solving a form of the maximum likelihood problem on the bayes net. em can be used to solve this in a pretty fast manner. they say that you can use this method with the max-product lp algorithms to yield even better solutions, with a quicker convergence.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_1272.pdf"&gt;Energy Disaggregation via Discriminative Sparse Coding&lt;/a&gt; (J. Zico Kolter, Siddharth Batra, Andrew Ng): about how sparse coding could be used to save energy. 
:)&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0684.pdf"&gt;Semi-Supervised Learning with Adversarially  Missing Label Information&lt;/a&gt; (Umar Syed, Ben Taskar): standard ssl assumes that labels for the unlabeled data are missing at random but in many practical settings this isn't actually true. this paper gives an algorithm to deal with the case when the labels could be adversarially missing.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0787.pdf"&gt;Multi-View Active Learning in the Non-Realizable Case&lt;/a&gt; (Wei Wang, Zhi-Hua Zhou): shows that (under certain assumptions) exponential improvements in the sample complexity of active learning are still possible if you have a multiview learning setting.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0772.pdf"&gt;Self-Paced Learning for Latent Variable Models&lt;/a&gt; (M. Pawan Kumar, Benjamin Packer, Daphne Koller): an interesting paper, somewhat similar in spirit to curriculum learning. basically, the paper suggests that in learning a latent variable model, it helps if you provide the algorithm easy examples first.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0809.pdf"&gt;More data means less inference: A pseudo-max approach to structured learning&lt;/a&gt; (David Sontag, Ofer Meshi, Tommi Jaakkola, Amir Globerson): a pseudo-max approach to structured learning: this is somewhat along the lines of the paper on svm's inverse dependence on training size from icml a couple of years back. 
:)&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0757.pdf"&gt;Hashing Hyperplane Queries to Near Points  with Applications to Large-Scale Active Learning&lt;/a&gt; (Prateek Jain, Sudheendra Vijayanarasimhan, Kristen Grauman): selecting the most uncertain example in a pool based active learning can be expensive if the number of candidate examples is very large. this paper suggests some hashing tricks to expedite the search.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0613.pdf"&gt;Active Instance Sampling via Matrix Partition&lt;/a&gt; (Yuhong Guo): frames batch mode active learning as a matrix partitioning problem and proposes a local optimization technique for solving it.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0097.pdf"&gt;A Discriminative Latent Model of Image Region and Object Tag Correspondence&lt;/a&gt; (Yang Wang, Greg Mori): it's kind of doing correspondence lda on image+captions but they additionally infer the correspondences between tags and objects in the images, and show that this gives improvements over corr-lda.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0434.pdf"&gt;Factorized Latent Spaces with Structured Sparsity&lt;/a&gt; (Yangqing Jia, Mathieu Salzmann, Trevor Darrell): a multiview learning algorithm that uses sparse coding to learn shared as well as private features of different views of the data.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0566.pdf"&gt;Word Features for Latent Dirichlet Allocation&lt;/a&gt; (James Petterson, Alexander Smola, Tiberio Caetano, Wray Buntine, Shravan Narayanamurthy): extends lda for the case when you have access to features for each word in the vocabulary&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/19803222-3555243437241244556?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/3555243437241244556/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=3555243437241244556' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/3555243437241244556'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/3555243437241244556'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2011/01/nips-2010-retrospective.html' title='NIPS 2010 Retrospective'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-7142612846170713721</id><published>2010-11-18T15:52:00.000-07:00</published><updated>2010-11-18T15:52:36.547-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='community'/><title type='text'>Crowdsourcing workshop (/tutorial) decisions</title><content type='html'>Everyone at conferences (with multiple tracks) always complains that there are time slots with nothing interesting, and other time slots with too many interesting papers.&amp;nbsp; People have suggested crowdsourcing this, enabling participants to say -- well ahead of the conference -- which papers they'd go to... 
then let an algorithm schedule.&lt;br /&gt;&lt;br /&gt;I think there are various issues with this model, but don't want to talk about it.&amp;nbsp; What I do want to talk about is &lt;b&gt;applying the same ideas to workshop acceptance decisions.&lt;/b&gt;&amp;nbsp; This comes up because I'm one of the two workshop chairs for ACL this year, and because John Langford just pointed to the ICML call for tutorials.&amp;nbsp; (I think what I have to say applies equally to tutorials as to workshops.)&lt;br /&gt;&lt;br /&gt;I feel like a workshop (or tutorial) is &lt;i&gt;successful&lt;/i&gt; if it is well attended.&amp;nbsp; This applies both from a monetary perspective, as well as a scientific perspective.&amp;nbsp; (Note, though, that I think that small workshops can also be successful, especially if they are either fostering a small community, bringing people in, or serving other purposes.&amp;nbsp; That is to say, size is not &lt;i&gt;all &lt;/i&gt;that matters.&amp;nbsp; But it is a big part of what matters.)&lt;br /&gt;&lt;br /&gt;We have 30-odd workshop proposals for three of us to sort through (John Carroll and I are the two workshop chairs for ACL, and Marie Candito is the workshop chair for EMNLP; workshops are being reviewed jointly -- which actually makes the allocation process &lt;i&gt;more&lt;/i&gt; difficult).&amp;nbsp; The idea would be that I could create a poll, like the following:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Are you going to ACL?&amp;nbsp; Yes, maybe, no&lt;/li&gt;&lt;li&gt;Are you going to EMNLP?&amp;nbsp; Yes, maybe, no&lt;/li&gt;&lt;li&gt;If workshop A were offered at a conference you were going to, would you go to workshop A?&lt;/li&gt;&lt;li&gt;If workshop B...&lt;/li&gt;&lt;li&gt;And so on&lt;/li&gt;&lt;/ol&gt;This gives you two forms of information.&amp;nbsp; First it can help estimate expected attendance (though we ask proposers to estimate that, too, and I think they do a reasonable job if you skew their estimates down by about 10%).&amp;nbsp; 
But more importantly, &lt;i&gt;it gives correlations between workshops&lt;/i&gt;.&amp;nbsp; This lets you be sure that you're not scheduling things on top of each other that people might want to go to.&amp;nbsp; Some of these are obvious (for instance, if we got 10 MT workshop proposals... which didn't actually happen but is moderately conceivable :P), but some are not.&amp;nbsp; For instance, maybe people who care about annotation also care about ML, but maybe not?&amp;nbsp; I actually have no idea.&lt;br /&gt;&lt;br /&gt;Of course we're not going to do this this year.&amp;nbsp; It's too late already, and it would be unfair to publicise all the proposals, given that we didn't tell proposers in advance that we would do so.&amp;nbsp; And of course &lt;b&gt;I don't think this should exclusively be a popularity contest&lt;/b&gt;.&amp;nbsp; But I do believe that &lt;b&gt;popularity should be a factor.&lt;/b&gt;&amp;nbsp; And it should probably be a reasonably big factor.&amp;nbsp; Workshop chairs could then use the output of an optimization algorithm as a starting point, and use this as additional data for making decisions.&amp;nbsp; Especially since two or three people are being asked to make decisions that cover--essentially--all areas of NLP, this actually seems like a good idea to me.&lt;br /&gt;&lt;br /&gt;I actually think something like this is more likely to actually happen at a conference like ICML than ACL, since ICML seems (much?) 
more willing to try new things than ACL (for better or for worse).&lt;br /&gt;&lt;br /&gt;But I do think it would be interesting to try to see what sort of response you get.&amp;nbsp; Of course, just polling on this blog wouldn't be sufficient: you'd want to spam, perhaps all of last year's attendees.&amp;nbsp; But this isn't particularly difficult.&lt;br /&gt;&lt;br /&gt;Is there anything I'm not thinking of that would make this obviously not work?&amp;nbsp; I could imagine someone saying that maybe people won't propose workshops/tutorials if the proposals will be made public?&amp;nbsp; I find that a bit hard to swallow.&amp;nbsp; Perhaps there's a small embarrassment factor if you're public and then don't get accepted.&amp;nbsp; But I &lt;b&gt;wouldn't advocate making the voting results public&lt;/b&gt; -- they would be private to the organizers / workshop chairs.&lt;br /&gt;&lt;br /&gt;I guess -- I feel like I'm channeling Fernando here? -- that another possible issue is that &lt;b&gt;you might not be able to decide which workshops you'd go to without seeing what papers are there and who is presenting&lt;/b&gt;.&amp;nbsp; This is probably true.&amp;nbsp; But this is also the same problem that the workshop chairs face anyway: we have to guess that good enough papers/people will be there to make it worthwhile.&amp;nbsp; I doubt I'm any better at guessing this than any other random NLP person...&lt;br /&gt;&lt;br /&gt;So what am I forgetting?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-7142612846170713721?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/7142612846170713721/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=7142612846170713721' title='8 Comments'/><link rel='edit' 
type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/7142612846170713721'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/7142612846170713721'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/11/crowdsourcing-workshop-tutorial.html' title='Crowdsourcing workshop (/tutorial) decisions'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-8682169235051598883</id><published>2010-11-09T07:41:00.000-07:00</published><updated>2010-11-09T07:41:40.341-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='community'/><category scheme='http://www.blogger.com/atom/ns#' term='advising'/><title type='text'>Managing group papers</title><content type='html'>Every time a major conference deadline (ACL, NIPS, EMNLP, ICML, etc...) 
comes around, we usually have a slew of papers (&amp;gt;=3, typically) that are getting prepared.&amp;nbsp; I would say on average 1 doesn't make it, but that's par for the course.&lt;br /&gt;&lt;br /&gt;For AI-Stats, whose deadline just passed, I circulated student paper drafts to all of my folks to solicit comments at any level that they desired.&amp;nbsp; Anywhere from not understanding the problem/motivation to typos or errors in equations.&amp;nbsp; My experience was that it was useful, both for distributing some of my workload and getting an alternative perspective, and for keeping everyone abreast of what everyone else is working on.&lt;br /&gt;&lt;br /&gt;In fact, it was so successful that two students suggested to me that I &lt;i&gt;require&lt;/i&gt; more-or-less complete drafts of papers at least one week in advance so that this can take place.&amp;nbsp; How you require something like this is another issue, but the suggestion they came up with was that I'll only cover conference travel if this occurs.&amp;nbsp; It's actually not a bad idea, but I don't know if I'm enough of a hard-ass (or perceived as enough of a hard-ass) to really pull it off.&amp;nbsp; Maybe I'll try it though.&lt;br /&gt;&lt;br /&gt;The bigger question is how to manage such a thing.&amp;nbsp; I was thinking of installing some conference management software locally (e.g., &lt;a href="http://www.cs.ucla.edu/%7Ekohler/hotcrp/"&gt;HotCRP&lt;/a&gt;, which I really like) and giving students "reviewer" access.&amp;nbsp; Then, they could upload their drafts, perhaps with an email circulated when a new draft is available, and other students (and me!) 
could "review" them.&amp;nbsp; (Again, perhaps with an email circulated -- I'm a big fan of "push" technology: I don't have time to "pull" anymore!)&lt;br /&gt;&lt;br /&gt;The only concern I have is that it would be really nice to be able to track updates, or to have the ability for authors to "check off" things that reviewers suggested.&amp;nbsp; Or to allow discussion.&amp;nbsp; Or something like that.&lt;br /&gt;&lt;br /&gt;I'm curious if anyone has ever tried anything like this and whether it was successful or not.&amp;nbsp; It seems like if you can get a culture of this established, it could actually be quite useful.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-8682169235051598883?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/8682169235051598883/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=8682169235051598883' title='13 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/8682169235051598883'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/8682169235051598883'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/11/managing-group-papers.html' title='Managing group papers'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>13</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-6296520072559484676</id><published>2010-10-21T10:13:00.001-06:00</published><updated>2010-10-21T10:14:16.977-06:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='theory'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Comparing Bounds</title><content type='html'>This is something that's bothered me for quite a while, and I don't know of a good answer.&amp;nbsp; I used to think it was something that theory people didn't worry about, but then this exact issue was brought up by a reviewer of a theory-heavy paper that we have at NIPS this year (with Avishek Saha and &lt;a href="http://www.umiacs.umd.edu/%7Eabhishek/"&gt;Abhishek Kumar&lt;/a&gt;). There are (at least?) two issues with comparing bounds: the first is the obvious "these are both upper bounds, what does it mean to compare them?"&amp;nbsp; The second is the slightly less obvious "but your empirical losses may be totally different" issue.&amp;nbsp; It's actually the &lt;i&gt;second&lt;/i&gt; one that I want to talk about, but I have much less of a good internal feel about it.&lt;br /&gt;&lt;br /&gt;Let's say that I'm considering two learning approaches.&amp;nbsp; Say it's SVMs versus logistic regression.&amp;nbsp; Both regularized.&amp;nbsp; Or something.&amp;nbsp; Doesn't really matter.&amp;nbsp; At the end of the day, I'll have a bound that looks roughly like:&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; expected test error &amp;lt;= empirical training error + f(complexity / N)&lt;br /&gt;&lt;br /&gt;Here, f is often "sqrt", but could really be any function.&amp;nbsp; And N is the number of data points.&lt;br /&gt;&lt;br /&gt;Between two algorithms, both "f" and "complexity" can vary.&amp;nbsp; For instance, one might have a linear dependence on the dimensionality of the data (i.e., complexity looks like O(D), where D is dimensionality) and the other might have a superlinear dependence (e.g., O(D log D)).&amp;nbsp; Or one might have a square root.&amp;nbsp; Who knows.&amp;nbsp; Sometimes there's an inf or sup hiding in 
there, too, for instance in a lot of the margin bounds.&lt;br /&gt;&lt;br /&gt;At the end of the day, we of course want to say "my algorithm is better than your algorithm."&amp;nbsp; (What else is there in life?)&amp;nbsp; The standard way to say this is that "my f(complexity / N) looks better than your f'(complexity' / N)."&lt;br /&gt;&lt;br /&gt;Here's where two issues crop up.&amp;nbsp; The first is that our bound is just an upper bound.&amp;nbsp; For instance, Alice could come up to me and say "I'm thinking of a number between 1 and 10" and Bob could say "I'm thinking of a number between 1 and 100."&amp;nbsp; Even though the bound is lower for Alice, it doesn't mean that Alice is actually thinking of a smaller number -- maybe Alice is thinking of 9 and Bob of 5.&amp;nbsp; In this way, the bounds can be misleading.&lt;br /&gt;&lt;br /&gt;My general approach with this issue is to squint, as I do for experimental results.&amp;nbsp; I don't actually care about constant factors: I just care about things like "what does the dependence on D look like."&amp;nbsp; Since D is usually huge for problems I care about, a linear or sublinear dependence on D looks really good to me.&amp;nbsp; Beyond that I don't really care.&amp;nbsp; I especially don't care if the proof techniques are quite similar.&amp;nbsp; For instance, if they both use Rademacher complexities, then I'm more willing to compare them than if one uses Rademacher complexities and the other uses covering numbers.&amp;nbsp; They somehow feel more comparable: I'm less likely to believe that the differences are due to the method of analysis.&lt;br /&gt;&lt;br /&gt;(You can also get around this issue with some techniques, like Rademacher complexities, which give you both upper and lower bounds, but I don't think anyone really does that...)&lt;br /&gt;&lt;br /&gt;The other issue I don't have as good a feeling for.&amp;nbsp; The issue is that we're entirely ignoring the "empirical training error" question.&amp;nbsp; 
In fact, this is often measured differently between different algorithms!&amp;nbsp; For instance, for SVMs, the formal statement is more like "expected &lt;i&gt;0/1 loss&lt;/i&gt; on test &amp;lt;= empirical &lt;i&gt;hinge loss&lt;/i&gt; on training + ..."&amp;nbsp; Whereas for logistic regression, you might be comparing expected 0/1 loss with empirical &lt;i&gt;log loss.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Now I really don't know what to do.&lt;br /&gt;&lt;br /&gt;We ran into this issue because we were trying to compare some bounds between EasyAdapt and a simple model trained just on source data.&amp;nbsp; The problem is that the source training error might be totally incomparable to the (source + target) training error.&lt;br /&gt;&lt;br /&gt;But the issue is for sure more general.&amp;nbsp; For instance, what if your training error is measured in squared error?&amp;nbsp; Now this can be &lt;i&gt;huge&lt;/i&gt; when hinge loss is still rather small.&amp;nbsp; In fact, your squared error could be quadratically large in your hinge loss.&amp;nbsp; Actually it could be arbitrarily larger, since hinge goes to zero for any sufficiently correct classification, but squared error does not.&amp;nbsp; (Neither does log loss.)&lt;br /&gt;&lt;br /&gt;This worries me greatly, much more than the issue of comparing upper bounds.&lt;br /&gt;&lt;br /&gt;Does this bother everyone, or is it just me?&amp;nbsp; Is there a good way to think about this that gets you out of this conundrum?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-6296520072559484676?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/6296520072559484676/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=6296520072559484676' title='3 Comments'/><link 
rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/6296520072559484676'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/6296520072559484676'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/10/comparing-bounds.html' title='Comparing Bounds'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-4790628870864598174</id><published>2010-10-05T11:14:00.000-06:00</published><updated>2010-10-05T11:14:21.486-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='community'/><category scheme='http://www.blogger.com/atom/ns#' term='reviewing'/><title type='text'>My Giant Reviewing Error</title><content type='html'>I try to be a good reviewer, but like everything, reviewing is a learning process.&amp;nbsp; About five years ago, I was reviewing a journal paper and made an error.&amp;nbsp; I don't want to give up anonymity in this post, so I'm going to be vague in places that don't matter.&lt;br /&gt;&lt;br /&gt;I was reviewing a paper, which I thought was overall pretty strong.&amp;nbsp; I thought there was an interesting connection to some paper from Alice Smith (not the author's real name) in the past few years and mentioned this in my review.&amp;nbsp; Not a connection that made the current paper irrelevant, but something the authors should probably talk about.&amp;nbsp; In the revision response, the authors said that they had looked to try to find Smith's paper, but couldn't figure out which one I was talking about, and asked for a pointer.&amp;nbsp; I spent the next five hours looking for the reference and couldn't find it myself.&amp;nbsp; It 
turns out that actually I was thinking of a paper by Bob Jones, so I provided that citation.&amp;nbsp; But the Jones paper wasn't even as relevant as it seemed at the time I wrote the review, so I apologized and told the authors they didn't really need to cover it that closely.&lt;br /&gt;&lt;br /&gt;Now, you might be thinking to yourself: aha, now I know that Hal was the reviewer of my paper!&amp;nbsp; I remember that happening to me!&lt;br /&gt;&lt;br /&gt;But, sadly, this is not true.&amp;nbsp; &lt;b&gt;I get reviews like this all the time, and I feel it's one of the most irresponsible things reviewers can do.&lt;/b&gt;&amp;nbsp; In fact, I don't think a single reviewing cycle has passed where I don't get a review like this.&amp;nbsp; The problem with such reviews is that they enable a reviewer to make whatever claim they want, without any expectation that they have to back it up.&amp;nbsp; And the claims are usually wrong.&amp;nbsp; They're not necessarily being mean (I wasn't trying to be mean), but sometimes they are.&lt;br /&gt;&lt;br /&gt;Here are some of the most ridiculous cases I've seen.&amp;nbsp; I mention these just to show how often this problem occurs.&amp;nbsp; These are all on papers of mine.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;One reviewer wrote "This idea is so obvious this must have been done before."&amp;nbsp; This is probably the most humorous example I've seen, but the reviewer was clearly serious.&amp;nbsp; And no, this was &lt;i&gt;not&lt;/i&gt; in a review for one of the "frustratingly easy" papers.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;In an NSF grant review for an educational proposal, we were informed by 4 of 7 reviewers (who each wrote about a paragraph) that our ideas had been done in SIGCSE several times.&amp;nbsp; Before submitting, we had skimmed/read the past 8 years of SIGCSE and could find nothing.&amp;nbsp; (Maybe it's true and we just were looking in the wrong place, but that still isn't helpful.)&amp;nbsp; It turned out to strongly 
seem that this was basically their way of saying "you are not one of us."&lt;br /&gt;&lt;/li&gt;&lt;li&gt;In a paper on technique X for task A, we were told hands down that it's well known that technique Y works better, with no citations.&amp;nbsp; The paper was rejected, we went and implemented Y, and found that it worked worse on task A.&amp;nbsp; We later found one paper saying that Y works better than X on task B, for B fairly different from A.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;In another paper, we were told that what we were doing had been done before and in this case a citation was provided.&amp;nbsp; The citation was to one of our own papers, and it was quite different by any reasonable metric.&amp;nbsp; At least a citation was provided, but it was clear that the reviewer hadn't bothered reading it.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;We were told that we missed an enormous amount of related work that could be found by a simple web search.&amp;nbsp; I've written such things in reviews, often saying something like "search for 'non-parametric Bayesian'" or something like that.&amp;nbsp; But here, no keywords were provided.&amp;nbsp; It's entirely possible (especially when someone moves into a new domain) that you can miss a large body of related work because you don't know how to find it: that's fine -- just tell me how to find it if you don't want to actually provide citations.&lt;/li&gt;&lt;/ul&gt;There are other examples I could cite from my own experience, but I think you get the idea.&lt;br /&gt;&lt;br /&gt;I'm posting this not to gripe (though it's always fun to gripe about reviewing), but to try to draw attention to this problem.&amp;nbsp; It's really just an issue of laziness.&amp;nbsp; If I had bothered trying to look up a reference for Alice Smith's paper, I would have immediately realized I was wrong.&amp;nbsp; But I was lazy.&amp;nbsp; Luckily this didn't really adversely affect the outcome of the acceptance of this paper (journals are useful in that way 
-- authors can push back -- and yes, I know you can do this in author responses too, but you really need two rounds to make it work in this case).&lt;br /&gt;&lt;br /&gt;I've really really tried ever since my experience above to not ever do this again.&amp;nbsp; And I would encourage future reviewers to try to avoid the temptation to do this: you may find your memory isn't as good as you think.&amp;nbsp; I would also encourage area chairs and co-reviewers to push their colleagues to actually provide citations for otherwise unsubstantiated claims.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-4790628870864598174?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/4790628870864598174/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=4790628870864598174' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/4790628870864598174'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/4790628870864598174'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/10/my-giant-reviewing-error.html' title='My Giant Reviewing Error'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-4584126750147794314</id><published>2010-09-24T19:25:00.000-06:00</published><updated>2010-09-24T19:25:36.438-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='community'/><category 
scheme='http://www.blogger.com/atom/ns#' term='survey'/><title type='text'>ACL / ICML Symposium?</title><content type='html'>ACL 2011 ends on June 24, in Portland (that's a Friday).  ICML 2011 begins on June 28, near Seattle (the following Tuesday).  This is pretty much as close to a co-location as we're probably going to get in a long time.  A few folks have been discussing the possibility of having a joint NLP/ML symposium in between.  The current thought is to have it on June 27 at the ICML venue (for various logistical reasons).&amp;nbsp; There are buses and trains that run between the two cities, and we might even be able to charter some buses. &lt;br /&gt;&lt;br /&gt;One worry is that it might &lt;i&gt;only&lt;/i&gt; attract ICML folks due to the weekend between the end of ACL and the beginning of said symposium.  As an NLPer/MLer, I believe in data.  So please provide data by filling out the form below and, if you wish, adding comments.&lt;br /&gt;&lt;br /&gt;If you wouldn't attend any, you don't need to fill out the poll :).&lt;br /&gt;&lt;br /&gt;The last option is there if you want to tell me "I'm going to go to ACL, and I'd really like to go to the symposium, but the change in venue and the intervening weekend are too problematic to make it possible."&lt;br /&gt;&lt;br /&gt;&lt;form action="http://poll.pollcode.com/2w9s" method="post"&gt;&lt;table bgcolor="#eeeeee" border="0" cellpadding="2" cellspacing="0" style="width: 400px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td colspan="2"&gt;&lt;span style="color: black; font-family: Verdana;"&gt;&lt;b&gt;Which would you attend?&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width="5"&gt;&lt;input name="answer" type="radio" value="1" /&gt;&lt;/td&gt;&lt;td&gt;&lt;span style="color: black; font-family: Verdana;"&gt;Only ACL&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width="5"&gt;&lt;input name="answer" type="radio" value="2" /&gt;&lt;/td&gt;&lt;td&gt;&lt;span style="color: black; font-family: 
Verdana;"&gt;Only ICML&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width="5"&gt;&lt;input name="answer" type="radio" value="3" /&gt;&lt;/td&gt;&lt;td&gt;&lt;span style="color: black; font-family: Verdana;"&gt;ACL and the symposium&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width="5"&gt;&lt;input name="answer" type="radio" value="4" /&gt;&lt;/td&gt;&lt;td&gt;&lt;span style="color: black; font-family: Verdana;"&gt;ICML and the symposium&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width="5"&gt;&lt;input name="answer" type="radio" value="5" /&gt;&lt;/td&gt;&lt;td&gt;&lt;span style="color: black; font-family: Verdana;"&gt;Only ACL and ICML&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width="5"&gt;&lt;input name="answer" type="radio" value="6" /&gt;&lt;/td&gt;&lt;td&gt;&lt;span style="color: black; font-family: Verdana;"&gt;All three&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width="5"&gt;&lt;input name="answer" type="radio" value="7" /&gt;&lt;/td&gt;&lt;td&gt;&lt;span style="color: black; font-family: Verdana;"&gt;Only ACL, because of convenience&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td colspan="2"&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;center&gt;&lt;input type="submit" value="Vote" /&gt;&amp;nbsp;&amp;nbsp;&lt;input name="view" type="submit" value="View" /&gt;&lt;/center&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="right" bgcolor="white" colspan="2"&gt;&lt;span style="color: black; font-family: Verdana;"&gt;pollcode.com &lt;a href="http://pollcode.com/"&gt;&lt;span style="color: navy;"&gt;free polls&lt;/span&gt;&lt;/a&gt;&lt;span style="color: navy;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/form&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-4584126750147794314?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://nlpers.blogspot.com/feeds/4584126750147794314/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=4584126750147794314' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/4584126750147794314'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/4584126750147794314'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/09/acl-icml-symposium.html' title='ACL / ICML Symposium?'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-5842730110981448091</id><published>2010-09-15T08:42:00.002-06:00</published><updated>2010-09-15T10:09:35.664-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='community'/><title type='text'>Very sad news....</title><content type='html'>I heard earlier this morning that &lt;a href="http://www.clsp.jhu.edu/%7Ejelinek/"&gt;Fred Jelinek&lt;/a&gt; passed away last night.&amp;nbsp; Apparently he had been working during the day: a tenacious aspect of Fred that probably has a lot to do with his many successes.&lt;br /&gt;&lt;br /&gt;Fred is probably most infamous for the famous "Every time I fire a linguist the performance of the recognizer improves" quote, which Jurafsky+Martin's textbook says is actually supposed to be the more innocuous "Anytime a linguist leaves the group the recognition rate goes up."&amp;nbsp; And in Fred's 2009 &lt;a href="http://www.mitpressjournals.org/doi/abs/10.1162/coli.2009.35.4.35401"&gt;ACL Lifetime Achievement Award speech&lt;/a&gt;, he basically said that such a 
thing never happened.&amp;nbsp; I doubt that will have any effect on how much the story is told.&lt;br /&gt;&lt;br /&gt;Fred has had a remarkable influence on the field.&amp;nbsp; So much so that I won't attempt to list anything here: you can find all about him all over the internet.&amp;nbsp; Let me just say that the first time I met him, I was intimidated.&amp;nbsp; Not only because he was Fred, but because I knew (and still know) next to nothing about speech, and the conversation inevitably turned to speech.&amp;nbsp; Here's roughly how a segment of our conversation went:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Hal:&lt;/b&gt; What new projects are going on these days?&lt;br /&gt;&lt;b&gt;Fred:&lt;/b&gt; (Excitedly.)&amp;nbsp; We have a really exciting new speech recognition problem.&amp;nbsp; We're trying to map speech signals directly to fluent text.&lt;br /&gt;&lt;b&gt;Hal:&lt;/b&gt; (Really confused.) Isn't that the speech recognition problem?&lt;br /&gt;&lt;b&gt;Fred:&lt;/b&gt; (Playing the "teacher role" now.)&amp;nbsp; Normally when you transcribe speech, you end up with a transcript that includes disfluencies like "uh" and "um" and also false starts &lt;i&gt;[Ed note: like "I went... 
I went to the um store"]&lt;/i&gt;.&lt;br /&gt;&lt;b&gt;Hal:&lt;/b&gt; So now you want to produce the actual fluent sentence, not the one that was spoken?&lt;br /&gt;&lt;b&gt;Fred: &lt;/b&gt;Right.&lt;br /&gt;&lt;br /&gt;Apparently (who knew) in speech recognition you try to transcribe disfluencies and are penalized for missing them!&amp;nbsp; We then talked for a while about how they were doing this, and other fun topics.&lt;br /&gt;&lt;br /&gt;A few weeks later, I got a voicemail on my home message machine from Fred.&amp;nbsp; That was probably one of the coolest things that have ever happened to me in life.&amp;nbsp; I actually saved it (but subsequently lost it, which saddens me greatly).&amp;nbsp; The content is irrelevant: the point is that Fred -- &lt;i&gt;Fred!&lt;/i&gt; -- called me -- &lt;i&gt;me!&lt;/i&gt; -- at &lt;i&gt;home!&lt;/i&gt;&amp;nbsp; Amazing.&lt;br /&gt;&lt;br /&gt;I'm sure that there are lots of other folks who knew Fred better than me, and they can add their own stories in comments if they'd like.&amp;nbsp; Fred was a great asset to the field, and I will certainly miss his physical presence in the future, though his work will doubtless continue to affect the field for years and decades to come.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-5842730110981448091?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/5842730110981448091/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=5842730110981448091' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/5842730110981448091'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/5842730110981448091'/><link 
rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/09/very-sad-news.html' title='Very sad news....'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-328454048805025605</id><published>2010-09-13T20:09:00.000-06:00</published><updated>2010-09-13T20:09:48.905-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>AIStats 2011 Call for Papers</title><content type='html'>The &lt;a href="http://www.aistats.org/cfp.php"&gt;full call&lt;/a&gt;, and &lt;a href="http://www.aistats.org/reviewing.php"&gt;some changes to the reviewing process&lt;/a&gt;.  The submission deadline is Nov 1, and the conference is April 11-13, in Fort Lauderdale, Florida.  
Promises to be warm :).&lt;br /&gt;&lt;br /&gt;The changes to the reviewing process are interesting.&amp;nbsp; Basically the main change is that the author response is replaced by a journal-esque "revise and resubmit."&amp;nbsp; That is, you get 2 reviews, edit your paper, submit a new version, and get a 3rd review.&amp;nbsp; The hope is that this will reduce author frustration from the low bandwidth of author response.&amp;nbsp; Like with a journal, you'll also submit a "diff" saying what you've changed.&amp;nbsp; I can see this going really well: the third reviewer will presumably see a (much) better paper than the first two.&amp;nbsp; The disadvantage, which irked me at ICML last year, is that it often seemed like the third reviewer made the &lt;i&gt;deciding call&lt;/i&gt;, and I would want to make sure that the first two reviewers also get updated.&amp;nbsp; I can also see it going poorly: authors invest even more time in "responding" and no one listens.&amp;nbsp; That will be increased frustration :).&lt;br /&gt;&lt;br /&gt;The other change is that there'll be more awards.&amp;nbsp; I'm very much in favor of this, and I spent two years on the NAACL exec trying to get NAACL to do the same thing, but always got voted down :).&amp;nbsp; Oh well.&amp;nbsp; The reason I think it's a good idea is two-fold.&amp;nbsp; First, I think we're bad at selecting single best papers: a committee decision can often lead to selecting least offensive papers rather than ones that really push the boundary.&amp;nbsp; I also think there are lots of ways for papers to be great: they can introduce new awesome algorithms, have new theory, have a great application, introduce a cool new problem, utilize a new linguistic insight, etc., etc., etc... 
Second, best papers are most useful at promotion time (hiring and tenure), where you're being compared with people from other fields.&amp;nbsp; Why should &lt;i&gt;our&lt;/i&gt; field put &lt;i&gt;our&lt;/i&gt; people at a disadvantage by not awarding great work that they can list on their CVs?&lt;br /&gt;&lt;br /&gt;Anyway, it'll be an interesting experiment, and I encourage folks to submit!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-328454048805025605?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/328454048805025605/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=328454048805025605' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/328454048805025605'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/328454048805025605'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/09/aistats-2011-call-for-papers.html' title='AIStats 2011 Call for Papers'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-8465870987707832413</id><published>2010-09-07T19:10:00.000-06:00</published><updated>2010-09-07T19:10:06.223-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Manifold Assumption versus Margin Assumption</title><content type='html'>&lt;div style="color: #660000;"&gt;&lt;span style="font-size: 
x-small;"&gt;[This post is based on some discussions that came up while talking about manifold learning with &lt;a href="http://www.cs.utah.edu/%7Ewhitaker/"&gt;Ross Whitaker&lt;/a&gt; and &lt;a href="http://www.cs.utah.edu/%7Esgerber/"&gt;Sam Gerber&lt;/a&gt;, who had &lt;a href="http://www.cs.utah.edu/%7Esgerber/research/kmm_iccv09.pdf"&gt;a great manifold learning paper&lt;/a&gt; at ICCV last year.]&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;There are two assumptions that are often used in statistical learning (both theory and practice, though probably more of the latter), especially in the semi-supervised setting.&amp;nbsp; Unfortunately, they're incompatible.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The margin assumption&lt;/b&gt; states that your data are well separated.&amp;nbsp; Usually it's in reference to linear, possibly kernelized, classifiers, but that need not be the case.&amp;nbsp; As most of us know, there are lots of other assumptions that boil down to the same thing, such as the low-weight-norm assumption, or the Gaussian prior assumption.&amp;nbsp; At the end of the day, it means your data looks like what you have on the left, below, not what you have on the right.&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/_EZRdsBDhqno/TIbfhQiZWhI/AAAAAAAAABU/h5UilJsHmCw/s1600/margin.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/_EZRdsBDhqno/TIbfhQiZWhI/AAAAAAAAABU/h5UilJsHmCw/s320/margin.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;b&gt;The manifold assumption&lt;/b&gt; that is particularly popular in semi-supervised learning, but also shows up in supervised learning, says that your data lie on a low dimensional manifold embedded in a higher dimensional space.&amp;nbsp; One way of thinking about this is saying that your features cannot covary arbitrarily, but the manifold assumption is quite a bit stronger.&amp;nbsp; It usually assumes a 
Riemannian (i.e., locally Euclidean) structure, with data points "sufficiently" densely sampled.&amp;nbsp; In other words, life looks like the left, not the right, below:&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/_EZRdsBDhqno/TIbgdchU-kI/AAAAAAAAABc/Bgzjkt31efM/s1600/manifold.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/_EZRdsBDhqno/TIbgdchU-kI/AAAAAAAAABc/Bgzjkt31efM/s320/manifold.png" /&gt;&lt;/a&gt;&lt;/div&gt;Okay, yes, I know that the "Bad" one is a 2D manifold embedded in 2D, but that's only because I can't draw 3D images :).&amp;nbsp; And anyway, this is a "weird" manifold in the sense that at one point (where the +s and -s meet), it drops down to 1D.&amp;nbsp; This is fine in math-manifold land, but usually not at all accounted for in ML-manifold land.&lt;br /&gt;&lt;br /&gt;The problem, of course, is that once you say "margin" and "manifold" in the same sentence, things just can't possibly work out.&amp;nbsp; You'd end up with a picture like:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/_EZRdsBDhqno/TIbhStF5ZkI/AAAAAAAAABk/LVzCOksX8-g/s1600/badifold.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/_EZRdsBDhqno/TIbhStF5ZkI/AAAAAAAAABk/LVzCOksX8-g/s320/badifold.png" /&gt;&lt;/a&gt;&lt;/div&gt;This is fine from a margin perspective, but it's definitely not a (densely sampled) manifold any more.&lt;br /&gt;&lt;br /&gt;In fact, almost by definition, once you stick a margin into a manifold (which is okay, since you'll define margin Euclideanly, and manifolds know how to deal with Euclidean geometry locally), you're hosed.&lt;br /&gt;&lt;br /&gt;So I guess the question is: who do you believe?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/19803222-8465870987707832413?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/8465870987707832413/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=8465870987707832413' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/8465870987707832413'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/8465870987707832413'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/09/manifold-assumption-versus-margin.html' title='Manifold Assumption versus Margin Assumption'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_EZRdsBDhqno/TIbfhQiZWhI/AAAAAAAAABU/h5UilJsHmCw/s72-c/margin.png' height='72' width='72'/><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-455954435539811009</id><published>2010-08-31T19:09:00.000-06:00</published><updated>2010-08-31T19:09:27.705-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='online learning'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Online Learning Algorithms that Work Harder</title><content type='html'>It seems to be a general goal in practical online learning algorithm development to have the updates be very, very simple.&amp;nbsp; Perceptron is probably the simplest, and involves just a few adds.&amp;nbsp; Winnow takes a few multiplies.&amp;nbsp; MIRA 
takes a bit more, but still nothing hugely complicated.&amp;nbsp; Same with stochastic gradient descent algorithms for, e.g., hinge loss.&lt;br /&gt;&lt;br /&gt;I think this maybe used to make sense.&amp;nbsp; I'm not sure that it makes sense any more.&amp;nbsp; In particular, &lt;b&gt;I would be happier with online algorithms that do more work per data point, but require only one pass over the data.&lt;/b&gt;&amp;nbsp; There are really only two examples I know of: the &lt;a href="http://hal3.name/docs/daume09onepass.pdf"&gt;StreamSVM work&lt;/a&gt; that my student &lt;a href="http://www.cs.utah.edu/%7Epiyush"&gt;Piyush&lt;/a&gt; did with me and &lt;a href="http://www.cs.utah.edu/%7Esuresh/"&gt;Suresh&lt;/a&gt;, and the &lt;a href="http://www.cs.jhu.edu/%7Emdredze/publications/icml_variance.pdf"&gt;confidence-weighted&lt;/a&gt; work by Mark Dredze, Koby Crammer and Fernando Pereira (note that they maybe weren't&lt;i&gt; trying&lt;/i&gt; to make a one-pass algorithm, but it does seem to work well in that setting).&lt;br /&gt;&lt;br /&gt;Why do I feel this way?&lt;br /&gt;&lt;br /&gt;Well, if you look even at standard classification tasks, you'll find that if you have a highly optimized, dual threaded implementation of stochastic gradient descent, then your bottleneck becomes I/O, not learning.&amp;nbsp; This is what John Langford observed in his &lt;a href="http://hunch.net/%7Evw/"&gt;Vowpal Wabbit&lt;/a&gt; implementation.&amp;nbsp; He has to do multiple passes.&amp;nbsp; He deals with the I/O bottleneck by creating an I/O friendly, proprietary version of the input file during the first pass, and then careening through it on subsequent passes.&lt;br /&gt;&lt;br /&gt;In this case, basically what John is seeing is that I/O is too slow.&amp;nbsp; Or, phrased differently, learning is too fast :).&amp;nbsp; I never thought I'd say that, but I think it's true.&amp;nbsp; Especially when you consider that just having two threads is a pretty low requirement these days, it 
would be nice to put 8 or 16 threads to good use.&lt;br /&gt;&lt;br /&gt;But I think the problem is actually quite a bit more severe.&amp;nbsp; You can tell this by realizing that the idealized world in which binary classifier algorithms usually get developed is, well, idealized.&amp;nbsp; In particular, &lt;i&gt;someone has already gone through the effort of computing all your features for you.&lt;/i&gt;&amp;nbsp; Even running something simple like a tokenizer, stemmer and stop word remover over documents takes a non-negligible amount of time (to convince yourself: run it over Gigaword and see how long it takes!), &lt;i&gt;easily&lt;/i&gt; much longer than a silly perceptron update.&lt;br /&gt;&lt;br /&gt;So in the real world, you're probably going to be computing your features and learning on the fly.&amp;nbsp; (Or at least that's what I always do.)&amp;nbsp; In which case, if you have a few threads computing features and one thread learning, your learning thread is &lt;i&gt;always&lt;/i&gt; going to be stalling, waiting for features.&lt;br /&gt;&lt;br /&gt;One way to partially circumvent this is to do a variant of what John does: create a big scratch file as you go and write everything to this file on the first pass, so you can just read from it on subsequent passes.&amp;nbsp; In fact, I believe this is what Ryan McDonald does in MSTParser (he can correct me in the comments if I'm wrong :P).&amp;nbsp; I've never tried this myself because I am lazy.&amp;nbsp; Plus, it adds unnecessary complexity to your code, requires you to chew up disk, and of course adds its own delays since you now have to be writing to disk (which gives you tons of seeks to go back to where you were reading from initially).&lt;br /&gt;&lt;br /&gt;A similar problem crops up in structured problems.&amp;nbsp; Since you usually have to run inference to get a gradient, you end up spending way more time on your inference than your gradients.&amp;nbsp; (This is similar to the problems you run into 
when trying to &lt;a href="http://www.ryanmcd.com/papers/parallel_perceptronNAACL2010.pdf"&gt;parallelize the structured perceptron&lt;/a&gt;.)&lt;br /&gt;&lt;br /&gt;Anyway, at the end of the day, I would probably be happier with an online algorithm that spent a little more energy per-example and required fewer passes; I hope someone will invent one for me!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-455954435539811009?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/455954435539811009/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=455954435539811009' title='11 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/455954435539811009'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/455954435539811009'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/08/online-learning-algorithms-that-work.html' title='Online Learning Algorithms that Work Harder'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>11</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-2085596220220817325</id><published>2010-08-27T11:14:00.001-06:00</published><updated>2010-08-27T12:14:50.576-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sentiment'/><category scheme='http://www.blogger.com/atom/ns#' term='community'/><title type='text'>Calibrating Reviews and Ratings</title><content type='html'>NIPS decisions are going out soon, 
and then we're done with submitting and reviewing for a blessed few months.  Except for journals, of course.&lt;br /&gt;&lt;br /&gt;If you're not interested in paper reviews, but are interested in sentiment analysis, please skip the first two paragraphs :).&lt;br /&gt;&lt;br /&gt;One thing that anyone who has ever area chaired, or probably even ever reviewed, has noticed is that different people have different "baseline" ratings.  Conferences try to adjust for this, for instance NIPS defines their 1-10 rating scale as something like "8 = Top 50% of papers accepted to NIPS" or something like that.  Even so, some people are just harsher than others in scoring, and it seems like the area chair's job to calibrate for this.  (For instance, I know I tend to be fairly harsh -- I probably only give one 5 (out of 5) for every ten papers I review, and I probably give two or three 1s in the same size batch.  I have friends who never give a one -- except in the case of something just being &lt;span style="font-style: italic;"&gt;wrong&lt;/span&gt; -- and often give 5s.  Perhaps I should be nicer; I know CS tends to be harder on itself than other fields.)  As an aside, this is one reason why I'm generally in favor of fewer reviewers and more reviews per reviewer: it allows easier calibration.&lt;br /&gt;&lt;br /&gt;There's also the issue of areas.  Some areas simply seem to be harder to get papers into than others (which can lead to some gaming of the system).  For instance, if I have a "new machine learning technique applied to parsing," do I want it reviewed by parsing people or machine learning people?  How do you calibrate across areas, other than by some form of affirmative action for less-represented areas?&lt;br /&gt;&lt;br /&gt;A similar phenomenon occurs in sentiment analysis, as was pointed out to me at ACL this year by Franz Och.  The example he gives is very nice.  
If you go to TripAdvisor and look up &lt;a href="http://www.tripadvisor.com/Restaurant_Review-g33300-d493634-Reviews-The_French_Laundry-Yountville_Napa_Valley_California.html"&gt;The French Laundry&lt;/a&gt;, which is definitely one of the best restaurants in the U.S. (some people say &lt;span style="font-style: italic;"&gt;the best&lt;/span&gt;), you'll see that it got 4.0/5.0 stars, and a 79% recommendation.  On the other hand, if you look up &lt;a href="http://www.tripadvisor.com/Restaurant_Review-g32655-d1006054-Reviews-In_N_Out_burger-Los_Angeles_California.html"&gt;In'N'Out Burger&lt;/a&gt;, an LA-based burger chain (which, having grown up in LA, was admittedly one of my favorite places to eat in high school, back when I ate stuff like that) you see another 4.0/5.0 stars and a 95% recommendation.&lt;br /&gt;&lt;br /&gt;So now, we train a machine learning system to predict that the rating for The French Laundry is 79% and In'N'Out Burger is 95%.  And we expect this to work?!&lt;br /&gt;&lt;br /&gt;Probably the main issue here is calibrating for &lt;span style="font-style: italic;"&gt;expectations.&lt;/span&gt;  As a teacher, I've figured out quickly that managing student expectations is a big part of getting good teaching reviews.  If you go to In'N'Out, and have expectations for a Big Mac, you'll be pleasantly surprised.  If you go to The French Laundry with expectations of having a meal worth selling your soul, your children's souls, etc., for, then you'll probably be disappointed (though I can't really say: I've never been).&lt;br /&gt;&lt;br /&gt;One way that a similar problem has been dealt with on Hotels.com is that they'll show you ratings for the hotel you're looking at, and statistics of ratings for other hotels within a 10 mile radius (or something).  You could do something similar for restaurants, though distance probably isn't the right categorization: maybe price.  
For "$", In'N'Out is probably near the top, and for "$$$$" The French Laundry probably is.&lt;br /&gt;&lt;br /&gt;(Anticipating comments, I don't think this is just an "aspect" issue.  I don't care how bad your palate is, even just considering the "quality of food" aspect, Laundry has to trump In'N'Out by a large margin.)&lt;br /&gt;&lt;br /&gt;I think the problem is that in all of these cases -- papers, restaurants, hotels -- and others (movies, books, etc.) there simply isn't a total order on the "quality" of the objects you're looking at.  (For instance, as soon as a book becomes a best seller, or is advocated by Oprah, I am probably &lt;span style="font-style: italic;"&gt;less&lt;/span&gt; likely to read it.)  There is maybe a situation-dependent order, and the distance to hotel, or "$" rating, or area classes are heuristics for describing this "situation."  But without knowing the situation, or having a way to approximate it, I worry that we might be entering a garbage-in-garbage-out scenario here.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-2085596220220817325?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/2085596220220817325/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=2085596220220817325' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/2085596220220817325'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/2085596220220817325'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/08/calibrating-reviews-and-ratings.html' title='Calibrating Reviews and 
Ratings'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-7219466186164349152</id><published>2010-08-23T10:45:00.004-06:00</published><updated>2010-08-23T11:11:56.392-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='finite state methods'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Finite State NLP with Unlabeled Data on Both Sides</title><content type='html'>(Can you tell, by the recent frequency of posts, that I'm trying not to work on getting ready for classes next week?)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(153, 0, 0);"&gt;[This post is based partially on some conversations with Kevin Duh, though not in the finite state models formalism.]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The finite state machine approach to NLP is very appealing (I mean both string and tree automata) because you get to build little things in isolation and then chain them together in cool ways.  Kevin Knight has a great slide about how to put these things together that I can't seem to find right now, but trust me that it's awesome, especially when he explains it to you :).&lt;br /&gt;&lt;br /&gt;The other thing that's cool about them is that because you get to build them in isolation, you can use different data sets, which means data sets with different assumptions about the existence of "labels", to build each part.  
For instance, to do speech to speech transliteration from English to Japanese, you might build a component system like:&lt;br /&gt;&lt;br /&gt;English speech --A--&gt; English phonemes --B--&gt; Japanese phonemes --C--&gt; Japanese speech --D--&gt; Japanese speech LM&lt;br /&gt;&lt;br /&gt;You'll need a language model (D) for Japanese speech, that can be trained just on acoustic Japanese signals, then parallel Japanese speech/phonemes (for C), parallel English speech/phonemes (for A) and parallel English phonemes/Japanese phonemes (for B).  [Plus, of course, if you're missing any of these, EM comes to your rescue!]&lt;br /&gt;&lt;br /&gt;Let's take a simpler example, though the point I want to make applies to long chains, too.&lt;br /&gt;&lt;br /&gt;Suppose I want to just do translation from French to English.  I build an English language model (off of monolingual English text) and then an &lt;span style="font-style: italic;"&gt;English-to-French transducer&lt;/span&gt; (remember that in the noisy channel, things flip direction).  For the E2F transducer, I'll need parallel English/French text, of course.  The English LM gives me p(e) and the transducer gives me p(f|e), which I can put together via Bayes' rule to get something proportional to p(e|f), which will let me translate new sentences.&lt;br /&gt;&lt;br /&gt;But, presumably, I also have lots of monolingual French text.  Forgetting math for a moment, which seems to suggest that this can't help me, we can ask: &lt;span style="font-style: italic;"&gt;why&lt;/span&gt; should this help?&lt;br /&gt;&lt;br /&gt;Well, it probably won't help with my English language model, but it &lt;span style="font-style: italic;"&gt;should&lt;/span&gt; be able to help with my transducer.  Why?  Because my transducer is supposed to give me p(f|e).  
If I have some French sentence in my GigaFrench corpus to which my transducer assigns zero probability (for instance, max_e p(f|e) = 0), then this is probably a sign that something bad is happening.&lt;br /&gt;&lt;br /&gt;More generally, I feel like the following two operations should probably give roughly the same probabilities:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Drawing an English sentence from the language model p(e).&lt;/li&gt;&lt;li&gt;Picking a French sentence at random from GigaFrench, and drawing an English sentence from p(e|f), where p(e|f) is the composition of the English LM and the transducer.&lt;/li&gt;&lt;/ol&gt;If you buy this, then perhaps one thing you could do is to try to learn a transducer q(f|e) that has low KL divergence between 1 and 2, above.  If you work through the (short) math, and throw away terms that are independent of the transducer, then you end up wanting to minimize &lt;span style="font-style: italic;"&gt;[ &lt;/span&gt;sum&lt;span style="font-style: italic;"&gt;_e p(e) &lt;/span&gt;log sum&lt;span style="font-style: italic;"&gt;_f q(f|e) ]&lt;/span&gt;.  Here, the sum over f is a &lt;span style="font-style: italic;"&gt;finite&lt;/span&gt; sum over GigaFrench, and the sum over e is an &lt;span style="font-style: italic;"&gt;infinite&lt;/span&gt; sum over positive probability English sentences given by the English LM p(e).&lt;br /&gt;&lt;br /&gt;One could then apply something like &lt;a href="http://www.seas.upenn.edu/%7Etaskar/pubs/pr_jmlr10.pdf"&gt;posterior regularization&lt;/a&gt; (Kuzman Ganchev, Graça and Taskar) to do the learning.  There's the nasty bit about how to compute these things, but that's why you get to be friends with Jason Eisner so he can tell you how to do anything you could ever want to do with finite state models.&lt;br /&gt;&lt;br /&gt;Anyway, it seems like an interesting idea.  
I'm not aware of anyone who has tried it.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-7219466186164349152?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/7219466186164349152/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=7219466186164349152' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/7219466186164349152'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/7219466186164349152'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/08/finite-state-nlp-with-unlabeled-data-on.html' title='Finite State NLP with Unlabeled Data on Both Sides'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-4819320652193757740</id><published>2010-08-21T12:42:00.003-06:00</published><updated>2010-08-21T12:49:18.956-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='community'/><title type='text'>Readers kill blogs?</title><content type='html'>I try to avoid making meta-posts, but the timing here was just too impeccable for me to avoid a short post on something that's been bothering me for a year or so.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;On the one hand, yesterday, &lt;a href="http://www.stat.columbia.edu/%7Ecook/movabletype/archives/2010/08/why_i_blog.html"&gt;Aleks stated that the main reason he blogs is to see comments&lt;/a&gt;.  
(Similarly, &lt;a href="http://blog.computationalcomplexity.org/2010/08/comments.html"&gt;Lance also thinks comments are a very important part of having an "open" blog&lt;/a&gt;.)&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;On the other hand, &lt;a href="http://earningmyturns.blogspot.com/2010/08/how-do-you-consume-media.html"&gt;people are more and more moving to systems like Google Reader, as re-blogged by Fernando&lt;/a&gt;, also yesterday.&lt;/li&gt;&lt;/ul&gt;I actually completely agree with both points.  The problem is that I worry that they are actually fairly opposed.  I comment &lt;span style="font-style: italic;"&gt;much less&lt;/span&gt; on other people's blogs now that I use reader, because the 10 second overhead of clicking on the blog, being redirected, entering a comment, blah blah blah, is just too high.  Plus, I worry that no one (except the blog author) will see my comment, since most readers don't (by default) show comments in with posts.&lt;br /&gt;&lt;br /&gt;Hopefully the architects behind readers will pick up on this and make these things (adding and viewing comments, within the reader -- yes, I realize that it's then not such a "reader") easier.  
That is, unless they want to lose out to tweets!&lt;br /&gt;&lt;br /&gt;Until then, I'd like to encourage people to continue commenting here.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-4819320652193757740?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/4819320652193757740/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=4819320652193757740' title='14 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/4819320652193757740'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/4819320652193757740'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/08/readers-kill-blogs.html' title='Readers kill blogs?'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>14</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-726284263469295772</id><published>2010-08-19T13:45:00.003-06:00</published><updated>2010-08-19T14:09:28.911-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='domain adaptation'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Multi-task learning: should our hypothesis classes be the same?</title><content type='html'>It is almost an unspoken assumption in multitask learning (and domain adaptation) that you use the same type of classifier (or, more formally, the same &lt;span style="font-style: italic;"&gt;hypothesis class&lt;/span&gt;) for all tasks.  
In NLP-land, this usually means that everything is a linear classifier, and the feature sets are the same for all tasks; in ML-land, this usually means that the same kernel is used for every task.  In neural-networks land (ala Rich Caruana), this is enforced by the symmetric structure of the networks used.&lt;br /&gt;&lt;br /&gt;I probably would have gone on not even considering this unspoken assumption, until a few years ago I saw a couple papers that challenged it, albeit indirectly.  One was &lt;a href="http://acl.ldc.upenn.edu/P/P06/P06-1060.pdf"&gt;Factorizing Complex Models: A Case Study in Mention Detection&lt;/a&gt; by Radu (Hans) Florian, Hongyan Jing, Nanda Kambhatla and Imed Zitouni, all from IBM.  They're actually considering solving tasks separately rather than jointly, but joint learning and multi-task learning are very closely related.  What they see is that different features are useful for spotting entity spans, and for labeling entity types.&lt;br /&gt;&lt;br /&gt;That year, or the next, I saw another paper (can't remember who or what -- if someone knows what I'm talking about, please comment!) that basically showed a similar thing, where a linear kernel was doing best for spotting entity spans, and a polynomial kernel was doing best for labeling the entity types (with the same feature sets, if I recall correctly).&lt;br /&gt;&lt;br /&gt;Now, to some degree this is not surprising.  If I put on my feature engineering hat, then I probably &lt;span style="font-style: italic;"&gt;would&lt;/span&gt; design slightly different features for these two tasks.  
On the other hand, coming from a multitask learning perspective, this is surprising: if I believe that these tasks are related, shouldn't I also believe that I can do well solving them in the same hypothesis space?&lt;br /&gt;&lt;br /&gt;This raises an important (IMO) question: if I want to allow my hypothesis classes to be different, what can I do?&lt;br /&gt;&lt;br /&gt;One way is to punt: you can just concatenate your feature vectors and cross your fingers.  Or, more nuanced, you can have some set of shared features and some set of features unique to each task.  This is similar (the nuanced version, not the punting version) to what Jenny Finkel and Chris Manning did in their ACL paper this year, &lt;a href="http://www.stanford.edu/%7Ejrfinkel/papers/hier-joint.pdf"&gt;Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;An alternative approach is to let the two classifiers "talk" via unlabeled data.  Although motivated differently, this was something of the idea behind my EMNLP 2008 paper on &lt;a href="http://hal3.name/docs/daume08hints.pdf"&gt;Cross-Task Knowledge-Constrained Self Training&lt;/a&gt;, where we run two models on unlabeled data and look for where they "agree."&lt;br /&gt;&lt;br /&gt;A final idea that comes to mind, though I don't know if anyone has tried anything like this, would be to try to do some feature extraction over the two data sets.  That is, basically think of it as a combination of multi-view learning (since we have two different hypothesis classes) and multi-task learning.  Under the assumption that we have access to examples labeled for &lt;span style="font-style: italic;"&gt;both&lt;/span&gt; tasks simultaneously (i.e., not the settings for either Jenny's paper or my paper), then one could do a 4-way kernel CCA, where data points are represented in terms of their task-1 kernel, task-2 kernel, task-1 label and task-2 label.  
This would be sort of a blending of CCA-for-multiview-learning and CCA-for-multi-task learning.&lt;br /&gt;&lt;br /&gt;I'm not sure what the right way to go about this is, but I think it's something important to consider, especially since it's an assumption that usually goes unstated, even though empirical evidence seems to suggest it's not (always) the right assumption.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-726284263469295772?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/726284263469295772/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=726284263469295772' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/726284263469295772'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/726284263469295772'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/08/multi-task-learning-should-our.html' title='Multi-task learning: should our hypothesis classes be the same?'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-8966372274793856563</id><published>2010-08-02T03:49:00.001-06:00</published><updated>2010-08-02T16:24:05.370-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='discourse'/><title type='text'>Why Discourse Structure?</title><content type='html'>I come from a strong lineage of discourse folks.  
Writing a parser for Rhetorical Structure Theory was one of the first class projects I had when I was a grad student.  Recently, with the release of the &lt;a href="http://nlpers.blogspot.com/2006/05/penn-discourse-treebank.html"&gt;Penn Discourse Treebank&lt;/a&gt;, there has been a bit of a flurry of interest in this problem (I had &lt;a href="http://nlpers.blogspot.com/2009/09/acl-and-emnlp-retrospective-many-days.html"&gt;some snarky comments&lt;/a&gt; right after ACL about this).  I've also talked about why this is a &lt;a href="http://nlpers.blogspot.com/2007/04/discourse-is-darn-hard.html"&gt;hard problem&lt;/a&gt;, but never really about why it is an &lt;i&gt;interesting&lt;/i&gt; problem.&lt;br /&gt;&lt;br /&gt;My thinking about discourse has changed a lot over the years.  My current thinking about it is in an "interpretation as abduction" sense.  (And I sincerely hope all readers know what that means... if not, go back and read some classic papers by Jerry Hobbs.)  This is a view I've been rearing for a while, but I finally started putting it into words (probably mostly Jerry's words) in a conversation at ACL with Hoifung Poon and Joseph Turian (I think it was Joseph... my memory fades quickly these days :P).&lt;br /&gt;&lt;br /&gt;This view is that discourse is that &lt;span style="font-style: italic;"&gt;thing&lt;/span&gt; that gives you an interpretation above and beyond whatever interpretations you get from a sentence.  Here's a slightly refined version of the example I came up with on the fly at ACL:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;I only like traveling to Europe.  So I submitted a paper to ACL.&lt;/li&gt;&lt;li&gt;I only like traveling to Europe.  Nevertheless, I submitted a paper to ACL.&lt;/li&gt;&lt;/ol&gt;What does the hearer (H) infer from these sentences?  
Well, if we look at the sentences &lt;span style="font-style: italic;"&gt;on their own&lt;/span&gt;, then H infers something like Hal-likes-travel-to-Europe-and-only-Europe, and H infers something like Hal-submitted-a-paper-to-ACL.  But when you throw discourse in, you can derive two additional bits of information.  In example (1), you can infer ACL-is-in-Europe-this-year and in (2) you can infer the negation of that.&lt;br /&gt;&lt;br /&gt;Pretty amazing stuff, huh?  Replacing a "so" with a "nevertheless" completely changes this interpretation.&lt;br /&gt;&lt;br /&gt;What does this have to do with interpretation as abduction?  Well, we're going to &lt;span style="font-style: italic;"&gt;assume&lt;/span&gt; that this discourse is coherent.  Given that assumption, we have to ask ourselves: in (1), what do we have to assume about the world to make this discourse coherent?  The answer is that you have to assume that ACL is in Europe.  And similarly for (2).&lt;br /&gt;&lt;br /&gt;Of course, there are other things you could assume that would make this discourse coherent.  For (1), you could assume that I have a rich benefactor who likes ACL submissions and will send me to Europe every time I submit something to ACL.  For (2), you could assume that I didn't want my paper to get in, but I wanted a submission to get reviews, and so I submitted a crappy paper.  Or something.  But these fail the Occam's Razor test.  Or, perhaps they are a priori simply less likely (i.e., you have to assume more to get the same result).&lt;br /&gt;&lt;br /&gt;Interestingly, I can change the interpretation of (2), for instance, by adding a third sentence to the discourse: "I figured that it would be easy to make my way to Europe after going to Israel."  Here, we would abduce that ACL is in Israel, and that I'm willing to travel to Israel on my way to Europe.  
For you GOFAI folks, this would be something like non-monotonic reasoning.&lt;br /&gt;&lt;br /&gt;Whenever I talk about discourse to people who don't know much about it, I always get this nagging sense of "yes, but why do I &lt;span style="font-style: italic;"&gt;care&lt;/span&gt; that you can recognize that sentence 4 is background to sentence 3, unless I want to do summarization?"  I hope that this view provides some alternative answer to that question.  Namely, that there's some information you can get from sentences, but there is additional information in how those sentences are glued together.&lt;br /&gt;&lt;br /&gt;Of course, one of the big problems we have is that we have no idea how to represent sentence-level interpretations, or at least some ideas but no way to get there in the general case.  In the sentence-level case, we've seen some progress recently in terms of representing semantics in a sort of substitutability manner (ala paraphrasing), which is nice because the representation is still text.  One could ask if something similar might be possible at a discourse level.  Obviously you could paraphrase discourse connectives, but that's missing the point.  
What else could you do?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-8966372274793856563?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/8966372274793856563/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=8966372274793856563' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/8966372274793856563'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/8966372274793856563'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/08/why-discourse-structure.html' title='Why Discourse Structure?'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-4723287240737022045</id><published>2010-07-24T07:35:00.002-06:00</published><updated>2010-07-29T18:27:39.508-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='papers'/><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>ACL 2010 Retrospective</title><content type='html'>&lt;a href="http://acl2010.org/general.htm"&gt;ACL 2010&lt;/a&gt; finished up in Sweden a week ago or so.  Overall, I enjoyed my time there (the local organization was great, though I think we got hit with unexpected heat, so those of us who didn't feel like booking a room at the Best Western -- hah!  why would I have done that?! 
-- had no A/C and my room was about 28-30 every night).&lt;br /&gt;&lt;br /&gt;But you don't come here to hear about sweltering nights, you come to hear about papers.  My list is actually pretty short this time.  I'm not quite sure why that happened.  Perhaps NAACL sucked up a lot of the really good stuff, or I went to the wrong sessions, or something.  (Though my experience was echoed by a number of people (n=5) I spoke to after the conference.)  Anyway, here are the things I found interesting.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;    &lt;div class="ms_author_list"&gt;    &lt;div class="ms_author_list"&gt;&lt;span style="font-style: italic;"&gt;&lt;a href="http://aclweb.org/anthology-new/P/P10/P10-1160.pdf"&gt;Beyond NomBank: A Study of Implicit  Arguments for Nominal Predicates&lt;/a&gt;&lt;/span&gt;, by Matthew Gerber and Joyce Chai (this was the Best Long Paper award recipient).  This was &lt;span style="font-style: italic;"&gt;by far&lt;/span&gt; my favorite paper of the conference.  For all you students out there (mine included!), pay attention to this one.  It was great because they looked at a fairly novel problem, in a fairly novel way, put clear effort into doing something (they annotated a bunch of data by hand), developed features that were &lt;span style="font-style: italic;"&gt;significantly&lt;/span&gt; more interesting than the usual off-the-shelf set, and got impressive results on what is clearly a very hard problem.  Congratulations to Matthew and Joyce -- this was a great paper, and the award is highly deserved.&lt;br /&gt;&lt;/div&gt;&lt;span style="font-style: italic;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;/li&gt;&lt;li&gt;&lt;div class="ms_author_list"&gt;&lt;span style="font-style: italic;"&gt;&lt;a href="http://aclweb.org/anthology-new/P/P10/P10-1010.pdf"&gt;Challenge Paper: The Human Language Project: Building a  Universal Corpus of the World’s Languages&lt;/a&gt;&lt;/span&gt;, by Steven Abney and Steven Bird.  
Basically this would be awesome if they can pull it off -- a giant structured database with stuff from tons of languages.  Even just having &lt;span style="font-style: italic;"&gt;tokenization&lt;/span&gt; in tons of languages would be useful for me.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/li&gt;&lt;li&gt;    &lt;div class="ms_author_list"&gt;&lt;span style="font-style: italic;"&gt;&lt;a href="http://aclweb.org/anthology-new/P/P10/P10-1015.pdf"&gt;Extracting Social Networks from  Literary Fiction&lt;/a&gt;&lt;/span&gt;, by David Elson, Nicholas Dames and Kathleen McKeown.  (This was the IBM best student paper.) Basically they construct networks of characters from British fiction and try to analyze some literary theories in terms of those networks, and find that there might be holes in the existing theories.  My biggest question, as someone who's not a literary theorist, is &lt;span style="font-style: italic;"&gt;why&lt;/span&gt; did those theories exist in the first place?  The analysis was over 80 or so books, surely literary theorists have read and pondered all of them.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/li&gt;&lt;li&gt;    &lt;div class="ms_author_list"&gt;&lt;span style="font-style: italic;"&gt;&lt;a href="http://aclweb.org/anthology-new/P/P10/P10-1047.pdf"&gt;Syntax-to-Morphology Mapping in Factored  Phrase-Based Statistical Machine Translation from English to Turkish&lt;/a&gt;&lt;/span&gt;, by Reyyan Yeniterzi and Kemal Oflazer.  You probably know that I think translating morphology and translating out of English are both interesting topics, so it's perhaps no big surprise that I liked this paper.  
The other thing I liked about this paper is that they presented things that worked, as well as things that might well have worked but didn't.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/li&gt;&lt;li&gt;    &lt;div class="ms_author_list"&gt;&lt;a href="http://aclweb.org/anthology-new/P/P10/P10-2034.pdf"&gt;Learning Common Grammar from Multilingual  Corpus&lt;/a&gt;, by Tomoharu Iwata, Daichi Mochihashi and Hiroshi Sawada.  I wouldn't go so far as to say that I thought this was a great paper, but I would say there is the beginning of something interesting here.  They basically learn a coupled PCFG in Jenny Finkel hierarchical-Bayes style, over multiple languages.  The obvious weakness is that languages don't all have the same structure. If only there were an &lt;a href="http://hal3.name/docs/daume07implication.pdf"&gt;area of linguistics that studies how they differ&lt;/a&gt;....  (Along similar lines, see&lt;br /&gt;   &lt;div class="ms_author_list"&gt;&lt;span style="font-style: italic;"&gt;&lt;a href="http://aclweb.org/anthology-new/P/P10/P10-1131.pdf"&gt;Phylogenetic Grammar Induction&lt;/a&gt;&lt;/span&gt;, by Taylor Berg-Kirkpatrick and Dan Klein, which has a similar approach/goal.)  &lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/li&gt;&lt;li&gt;    &lt;div class="ms_author_list"&gt;&lt;span style="font-style: italic;"&gt;&lt;a href="http://aclweb.org/anthology-new/P/P10/P10-1088.pdf"&gt;Bucking the Trend: Large-Scale Cost-Focused  Active Learning for Statistical Machine Translation&lt;/a&gt;&lt;/span&gt;, by Michael Bloodgood and Chris Callison-Burch.  The "trend" referenced in the title is that active learning always asymptotes depressingly early.  They have turkers translate bits of sentences in context (i.e., in a whole sentence, translate the highlighted phrase) and get a large bang-for-the-buck.  
Right now they're looking primarily at out-of-vocabulary stuff, but there's a lot more to do here.&lt;/div&gt;&lt;/li&gt;&lt;/ul&gt;A few papers that I didn't see, but other people told me good things about:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;    &lt;div class="ms_author_list"&gt;&lt;span style="font-style: italic;"&gt;&lt;a href="http://aclweb.org/anthology-new/P/P10/P10-1018.pdf"&gt;“Was It Good? It Was Provocative.” Learning  the Meaning of Scalar Adjectives&lt;/a&gt;&lt;/span&gt;, by Marie-Catherine de Marneffe, Christopher D. Manning and Christopher Potts.  &lt;/div&gt;&lt;/li&gt;&lt;li&gt;    &lt;div class="ms_author_list"&gt;&lt;span style="font-style: italic;"&gt;&lt;a href="http://aclweb.org/anthology-new/P/P10/P10-1031.pdf"&gt;Unsupervised Ontology Induction from Text&lt;/a&gt;&lt;/span&gt;, by Hoifung Poon and Pedro Domingos.&lt;/div&gt;&lt;/li&gt;&lt;li&gt;    &lt;div class="ms_author_list"&gt;&lt;a href="http://aclweb.org/anthology-new/P/P10/P10-1046.pdf"&gt;Improving the Use of Pseudo-Words for  Evaluating Selectional Preferences&lt;/a&gt;, by Nathanael Chambers and Daniel Jurafsky.&lt;/div&gt;&lt;/li&gt;&lt;li&gt;    &lt;div class="ms_author_list"&gt;&lt;span style="font-style: italic;"&gt;&lt;a href="http://aclweb.org/anthology-new/P/P10/P10-1083.pdf"&gt;Learning to Follow Navigational Directions&lt;/a&gt;&lt;/span&gt;, by Adam Vogel and Daniel Jurafsky.&lt;/div&gt;&lt;/li&gt;&lt;li&gt;    &lt;div class="ms_author_list"&gt;&lt;span style="font-style: italic;"&gt;&lt;a href="http://aclweb.org/anthology-new/P/P10/P10-1093.pdf"&gt;Compositional Matrix-Space Models of  Language&lt;/a&gt;&lt;/span&gt;, by Sebastian Rudolph and Eugenie Giesbrecht.  
(This was described to me as "thought provoking" though not necessarily more.)&lt;/div&gt;&lt;/li&gt;&lt;li&gt;    &lt;div class="ms_author_list"&gt;&lt;span style="font-style: italic;"&gt;&lt;a href="http://aclweb.org/anthology-new/P/P10/P10-2037.pdf"&gt;Top-Down K-Best A* Parsing&lt;/a&gt;&lt;/span&gt;, by Adam Pauls, Dan Klein and Chris Quirk.&lt;/div&gt;&lt;/li&gt;&lt;/ul&gt;At any rate, I guess that's a reasonably long list.  There were definitely good things, but with a fairly heavy tail.  If you have anything you'd like to add, feel free to comment.  (As an experiment, I've turned comment moderation on as a way to try to stop the spam... I'm not sure I'll do it indefinitely; I hadn't turned it on before because I always thought/hoped that Google would just start doing spam detection and/or putting hard captcha's up or &lt;span style="font-style: italic;"&gt;something&lt;/span&gt; to try to stop spam, but sadly they don't seem interested.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-4723287240737022045?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/4723287240737022045/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=4723287240737022045' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/4723287240737022045'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/4723287240737022045'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/07/acl-2010-retrospective.html' title='ACL 2010 Retrospective'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-4429690407859784904</id><published>2010-06-28T19:42:00.002-06:00</published><updated>2010-06-28T19:49:52.409-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='papers'/><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>ICML 2010 Retrospective</title><content type='html'>Just got back from Israel for &lt;a href="http://www.icml2010.org/"&gt;ICML&lt;/a&gt;, which was a great experience: I'd wanted to go there for a while and this was a perfect opportunity. I'm very glad I spent some time afterwards out of Haifa, though.&lt;br /&gt;&lt;br /&gt;Overall, I saw a lot of really good stuff.  The usual caveats apply (I didn't see everything, it's a biased sample, blah blah blah).  Here are some things that stood out:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.icml2010.org/papers/522.pdf"&gt;Structured Output Learning with Indirect Supervision&lt;/a&gt; (M.-W. Chang, V. Srikumar, D. Goldwasser, D. Roth).  This was probably one of my favorite papers of the conference, even though I had learned some about the work when I visited UIUC a few months ago.  Let's say you're trying to do word alignment, and you have a few labeled examples of alignments.  But then you also have a bunch of parallel data.  What can you do?  You can turn the parallel data into a &lt;i&gt;classification&lt;/i&gt; problem: are these two sentences translations of each other?  You can pair random sentences to get negative examples.  A very clever observation is basically that the weight vector for this binary classifier should point in the same direction as the weight vector for the (latent variable) structured problem!  
(Basically the binary classifier should say "yes" only when there exists an alignment that renders these good translations.)  Tom Dietterich asked a question during Q/A: these binary classification problems seem very hard: is that bad?  Ming-Wei reassured him that it wasn't. In thinking about it after the fact, I wonder if it is actually &lt;i&gt;really important&lt;/i&gt; that they're hard: namely, if they were easy, then you could potentially answer the question without bothering to make up a reasonable alignment.  I suspect this might be the case.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.icml2010.org/papers/384.pdf"&gt;A Language-based Approach to Measuring Scholarly Impact&lt;/a&gt; (S. Gerrish, D. Blei).  The idea here is that without using citation structure, you can model influence in large document collections.  The basic idea is that when someone has a new idea, they often introduce new terminology to a field that wasn't there before.  The important bit is that they don't change all of science, or even all of ACL: they only change what gets talked about in their particular sub-area (aka topic :P).  It was asked during Q/A what would happen if you did use citations, and my guess based on my own small forays in this area is that the two sources would really reinforce each other.  That is, you might regularly cite the original EM paper even if your paper has almost nothing to do with it.  (The example from the talk was the Penn Treebank paper: one that has a bajillion citations, but hasn't &lt;i&gt;lexically&lt;/i&gt; affected how people talk about research.)&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.icml2010.org/papers/495.pdf"&gt;Hilbert Space Embeddings of Hidden Markov Models&lt;/a&gt; (L. Song, B. Boots, S. Siddiqi, G. Gordon, A. Smola).  This received one of the best paper awards.  
While I definitely liked this paper, what I actually liked more was that it taught me something from COLT last year that I hadn't known (thanks to Percy Liang for giving me more details on this).  That paper was &lt;a href="http://www.cs.mcgill.ca/%7Ecolt2009/papers/011.pdf"&gt;A spectral algorithm for learning hidden Markov models&lt;/a&gt; (D. Hsu, S. Kakade, T. Zhang) and basically shows that you can use spectral decomposition techniques to "solve" the HMM problem.  You create the matrix of observation pairs (A_ij = how many times did I see observation j follow observation i) and then do some processing and then a spectral decomposition and, voila, you get parameters to an HMM!  In the case that the data was actually generated by an HMM, you get good performance and good guarantees. Unfortunately, if the data was &lt;i&gt;not&lt;/i&gt; generated by an HMM, then the theory doesn't work and the practice does worse than EM.  That's a big downer, since nothing is ever generated by the model we use, but it's a cool direction.  At any rate, the current paper basically asks what happens if your observations are drawn from an RKHS, and then does an analysis.  (Meta-comment: as was pointed out in the Q/A session, and then later to me privately, this has fairly strong connections to some stuff that's been done in Gaussian Process land recently.)&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.icml2010.org/papers/549.pdf"&gt;Forgetting Counts: Constant Memory Inference for a Dependent Hierarchical Pitman-Yor Process&lt;/a&gt; (N. Bartlett, D. Pfau, F. Wood).  This paper shows that if you're building a hierarchical Pitman-Yor language model (think Kneser-Ney smoothing if that makes you feel more comfortable) in an online manner, then you should feel free to throw out entire restaurants as you go through the process.  (A restaurant is just the set of counts for a given context.)  
You do this to maintain a maximum number of restaurants at any given time (it's a fixed memory algorithm).  You can do this intelligently (via a heuristic) or just stupidly: pick them at random.  Turns out it doesn't matter.  The explanation is roughly that if it were important, and you threw it out, you'd see it again and it would get re-added.  The chance that something that occurs a lot keeps getting picked to be thrown out is low.  There's some connection to using &lt;a href="http://www.blogger.com/post-create.g?blogID=19803222"&gt;approximate counting for language modeling&lt;/a&gt;, but the Bartlett et al. paper is being even stupider than we were being!&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.icml2010.org/papers/587.pdf"&gt;Learning efficiently with approximate inference via dual losses&lt;/a&gt; (O. Meshi, D. Sontag, T. Jaakkola, A. Globerson).  Usually when you train structured models, you alternate between running inference (a maximization to find the most likely output for a given training instance) and running some optimization (a minimization to move your weight vector around to achieve lower loss).  The observation here is that by taking the dual of the inference problem, you turn the maximization into a minimization.  You now have a dual minimization, which you can solve &lt;i&gt;simultaneously&lt;/i&gt;, meaning that when your weights are still crappy, you aren't wasting time finding perfect outputs.  Moreover, you can "warm start" your inference for the next round.  It's a very nice idea.  I have to confess I was a bit disappointed by the experimental results, though: the gains weren't quite what I was hoping.  However, most of the graphs they were using weren't very large, so maybe as you move toward harder problems, the speed-ups will be more obvious.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.icml2010.org/papers/458.pdf"&gt;Deep learning via Hessian-free optimization&lt;/a&gt; (J. Martens).  
Note that I neither saw this presentation nor read the paper (skimmed it!), but I talked with James about this over lunch one day.  The "obvious" take away message is that you should read up on your optimization literature, and start using second order methods instead of your silly gradient methods (and don't store that giant Hessian: use efficient matrix-vector products).  But the less obvious take away message is that some of the prevailing attitudes about optimizing deep belief networks may be wrong.  For those who don't know, the usual deal is to train the networks layer by layer in an auto-encoder fashion, and then at the end apply back-propagation.  The party line that I've already heard is that the layer-wise training is &lt;i&gt;very important&lt;/i&gt; to getting the network near a "good" local optimum (whatever that means).  But if James' story holds up, this seems to not be true: he doesn't do any clever initialization and still finds good local optima!&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.icml2010.org/papers/638.pdf"&gt;A theoretical analysis of feature pooling in vision algorithms&lt;/a&gt; (Y.-L. Boureau, J. Ponce, Y. LeCun). Yes, that's right: a vision paper.  Why should you read this paper? Here's the question they're asking: after you do some blah blah blah feature extraction stuff (specifically: SIFT features), you get something that looks like a multiset of features (hrm.... sounds familiar).  These are often turned into a histogram (basically taking averages) and sometimes just used as a bag: did I see this feature or not.  (Sound familiar yet?)  The analysis is: why should one of these be better and, in particular, why (in practice) do vision people see multiple regimes.  Y-Lan et al. provide a simple, obviously broken, model (that assumes feature independence... 
okay, this has to sound familiar now) to look at the discriminability of these features (roughly the ratio of between-class variances and overall variances) to see how these regimes work out.  And they look basically how they do in practice (modulo one "advanced" model, which doesn't quite work out how they had hoped).&lt;br /&gt;&lt;br /&gt;Some other papers that I liked, but don't want to write too much about:&lt;br /&gt;&lt;ul&gt; &lt;li&gt;&lt;a href="http://www.icml2010.org/papers/568.pdf"&gt;Learning Programs: A Hierarchical   Bayesian Approach&lt;/a&gt; (P. Liang, M. Jordan, D. Klein).  Structured models over programs are &lt;span style="font-style: italic;"&gt;very&lt;/span&gt; hard; this paper gives one approach to modeling them.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.icml2010.org/papers/433.pdf"&gt;Budgeted Nonparametric Learning from   Data Streams&lt;/a&gt; (R. Gomes, A. Krause).  Shows that a clustering problem and a Gaussian process problem are submodular, goes from there.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.icml2010.org/papers/442.pdf"&gt;Internal Rewards Mitigate Agent   Boundedness&lt;/a&gt; (J. Sorg, S. Singh, R. Lewis).  Exactly what the title says.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.icml2010.org/papers/248.pdf"&gt;The Translation-invariant Wishart-Dirichlet Process for Clustering Distance Data&lt;/a&gt; (J. Vogt, S. Prabhakaran, T. Fuchs, V. Roth).  Been wanting to do something like this for a while, but they did it better than I would have!&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.icml2010.org/papers/636.pdf"&gt;Sparse Gaussian Process Regression   via L_1 Penalization&lt;/a&gt; (F. Yan, Y. Qi).  Very interesting way to get sparsity in a GP basically by changing your approximating distribution. 
&lt;/li&gt;&lt;/ul&gt;Some papers that other people said they liked were:&lt;br /&gt;&lt;ul&gt; &lt;li&gt;&lt;a href="http://www.icml2010.org/papers/569.pdf"&gt;Multi-Class Pegasos on a Budget&lt;/a&gt; (Z. Wang, K. Crammer, S. Vucetic)&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.icml2010.org/papers/376.pdf"&gt;Risk minimization, probability   elicitation, and cost-sensitive SVMs&lt;/a&gt; (H. Masnadi-Shirazi,   N. Vasconcelos)&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.icml2010.org/papers/107.pdf"&gt;Asymptotic Analysis of Generative Semi-Supervised Learning&lt;/a&gt; (J. Dillon, K. Balasubramanian, G. Lebanon) &lt;/li&gt;&lt;/ul&gt;Hope to see you at ACL!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-4429690407859784904?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/4429690407859784904/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=4429690407859784904' title='12 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/4429690407859784904'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/4429690407859784904'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/06/icml-2010-retrospective.html' title='ICML 2010 Retrospective'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>12</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-1650900416182826748</id><published>2010-06-07T20:11:00.000-06:00</published><updated>2010-06-07T20:13:33.044-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='papers'/><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>NAACL 2010 Retrospective</title><content type='html'>I just returned from &lt;a href="http://www.naacl2010.org/"&gt;NAACL 2010&lt;/a&gt;, which was simultaneously located in my home town of Los Angeles and located nowhere near my home town of Los Angeles.  (That's me trying to deride downtown LA as being nothing like real LA.)&lt;br /&gt;&lt;br /&gt;Overall I was pleased with the program.  I saw a few talks that changed (a bit) how I think about some problems.  There were only one or two talks I saw that made me wonder how "that paper" got in, which I think is an acceptable level.  Of course I spend a great deal of time not at talks, but no longer feel bad about doing so.&lt;br /&gt;&lt;br /&gt;On tutorials day, I saw Hoifung Poon's tutorial on Markov Logic Networks.  I think Hoifung did a great job of targeting the tutorial at just the right audience, which probably wasn't exactly me (though I still quite enjoyed it myself).  I won't try to describe MLNs, but my very brief summary is "language for compactly expressing complex factor graphs (or CRFs, if you prefer)."  That's not exactly right, but I think it's pretty close.  You can check back in a few months and see if there are going to be any upcoming "X, Y and Daume, 2011" papers using MLNs :P.  At any rate, I think it's a topic worth knowing about, especially if you really just want to get a system up and running quickly.  
(I'm also interested in trying Andrew McCallum's &lt;a href="http://code.google.com/p/factorie/"&gt;Factorie&lt;/a&gt; system, which, to some degree, trades ease of use for added functionality.  But honestly, I don't really have time to just try things these days: students have to do that for me.)&lt;br /&gt;&lt;br /&gt;One of my favorite papers of the conference was one that I hadn't even planned to go see!  It is &lt;a href="http://www.blogger.com/aclweb.org/anthology-new/N/N10/N10-1120.pdf"&gt;Dependency Tree-based Sentiment Classification using CRFs with Hidden Variables&lt;/a&gt; by Tetsuji Nakagawa, Kentaro Inui and Sadao Kurohashi. (I saw it basically because by the end of the conference, I was too lazy to switch rooms after the previous talk.)  There are two things I really like about this paper.  The first is that the type of sentiment they're going after is really broad.  Example sentences included things that I'd love to look up, but apparently were only in the slides... but definitely more than "I love this movie."  The example in the paper is "Tylenol prevents cancer," which is a nice positive case. &lt;br /&gt;&lt;br /&gt;The basic idea is that some words give you sentiment.  For instance, by itself, "cancer" is probably negative.  But then some words &lt;i&gt;flip&lt;/i&gt; polarity.  Like "prevents."  Or negation.  Or other things.  They set up a model based on sentence level annotations with latent variables for the "polarity" words and for the "flipping" words.  The flipping words are allowed to flip any sentiment below them in the dependency tree.  Cool idea!  Of course, I have to nit-pick the paper a bit.  It probably would be better to allow arguments/adjuncts to flip polarity, too.  Otherwise, negation (which is usually a leaf) will never flip anything.  And adjectives/adverbs can't flip either (e.g., going from "happy" to "barely happy").  
But overall I liked the paper.&lt;br /&gt;&lt;br /&gt;A second thing I learned is that XOR problems &lt;i&gt;do&lt;/i&gt; exist in real life, which &lt;a href="http://hunch.net/?p=245"&gt;I had previously questioned&lt;/a&gt;.  The answer came (pretty much unintentionally) from the paper &lt;a href="http://www.aclweb.org/anthology/N/N10/N10-1119.pdf"&gt;The viability of web-derived polarity lexicons&lt;/a&gt; by Leonid Velikovich, Sasha Blair-Goldensohn, Kerry Hannan and Ryan McDonald.  I won't talk much about this paper other than to say that if you have 4 billion web pages, you can get some pretty good sentimenty words, if you're careful to not blindly apply graph propagation.  But at the end, they throw a meta classifier on the polarity classification task, whose features include things like (1) how many positive terms are in the text, (2) how many negative terms are in the text, (3) how many negations are in the text.  Voila!  XOR!  (Because negation XORs terms.)&lt;br /&gt;&lt;br /&gt;I truly enjoyed Owen Rambow's poster on &lt;a href="http://www.aclweb.org/anthology/N/N10/N10-1049.pdf"&gt;The   Simple Truth about Dependency and Phrase Structure Representations:   An Opinion Piece&lt;/a&gt;.  If you've ever taken a class in mathematical logic, it is very easy for me to summarize this paper: parse trees (dependency or phrase structure) are your language, but unless you have a theory of that language (in the model-theoretic sense) then whatever you do is meaningless.  In more lay terms: you can always push symbols around, but unless you tie a semantics to those symbols, you're really not doing anything.  Take home message: pay attention to the meaning of your symbols!&lt;br /&gt;&lt;br /&gt;In the category of "things everyone should know about", there was &lt;a href="http://www.aclweb.org/anthology/N/N10/N10-1083.pdf"&gt;Painless unsupervised learning with features&lt;/a&gt; by Taylor Berg-Kirkpatrick, Alexandre Bouchard Côté, John DeNero and Dan Klein.  
The idea is that you can replace your multinomials in an HMM (or other graphical model) with little maxent models.  Do EM in this for unsupervised learning and you can throw in a bunch of extra features. I would have liked to have seen a comparison against naive Bayes with the extra features, but &lt;a href="http://hal3.name/docs/daume09unsearn.pdf"&gt;my prior belief&lt;/a&gt; is sufficiently strong that I'm willing to believe that it's helpful.  The only sucky thing about this training regime is that training maxent models with (tens of) thousands of classes is pretty painful.  Perhaps a reduction like tournaments or SECOC would help bring it down to a log factor.&lt;br /&gt;&lt;br /&gt;I didn't see the presentation for &lt;a href="http://www.aclweb.org/anthology/N/N10/N10-1116.pdf"&gt;From baby steps to leapfrog: How "Less is More" in unsupervised dependency parsing&lt;/a&gt; by Valentin Spitkovsky, Hiyan Alshawi and Dan Jurafsky, but I read it. The idea is that you can do better unsupervised dependency parsing by giving your learner progressively harder examples.  I really really really tried to get something like this to work for &lt;a href="http://hal3.name/docs/daume09unsearn.pdf"&gt;unsearn&lt;/a&gt;, but nothing helped and most things hurt.  (I only tried adding progressively longer sentences: other ideas, based on conversations with other folks, include looking at vocabulary size, part of speech (e.g., human babies learn words in a particular order), etc.)  I'm thrilled it actually works.&lt;br /&gt;&lt;br /&gt;Again, I didn't see &lt;a href="http://www.aclweb.org/anthology/N/N10/N10-1066.pdf"&gt;Discriminative Learning over Constrained Latent Representations&lt;/a&gt; by Ming-Wei Chang, Dan Goldwasser, Dan Roth and Vivek Srikumar, but I learned about the work when I visited UIUC recently (thanks again for the invitation, Dan R.!).  
This paper does exactly what you would guess from the title: learns good discriminative models when you have complex latent structures that you know something about a priori.&lt;br /&gt;&lt;br /&gt;I usually ask people at the end of conferences what papers they liked. Here are some papers that were spoken highly of by my fellow NAACLers. (This list is almost unadulterated: one person actually nominated one of the papers I thought shouldn't have gotten in, so I've left it off the list.  Other than that, I think I've included everything that was specifically mentioned to me.) &lt;ol&gt; &lt;li&gt;&lt;a href="http://www.aclweb.org/anthology/N/N10/N10-1118.pdf"&gt;Optimal     Parsing Strategies for Linear Context-Free Rewriting Systems&lt;/a&gt;     by Daniel Gildea.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.aclweb.org/anthology/N/N10/N10-1003.pdf"&gt;Products     of Random Latent Variable Grammars&lt;/a&gt; by Slav Petrov.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.aclweb.org/anthology/N/N10/N10-1015.pdf"&gt;Joint     Parsing and Alignment with Weakly Synchronized Grammars&lt;/a&gt; by     David Burkett, John Blitzer and Dan Klein.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.aclweb.org/anthology/N/N10/N10-1056.pdf"&gt; For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia&lt;/a&gt; by Mark Yatskar, Bo Pang, Cristian   Danescu-Niculescu-Mizil and Lillian Lee.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.aclweb.org/anthology/N/N10/N10-1082.pdf"&gt;Type-Based     MCMC&lt;/a&gt; by Percy Liang, Michael I. Jordan and Dan Klein. &lt;/li&gt;&lt;/ol&gt;I think I probably have two high level "complaints" about the program this year.  &lt;b&gt;First&lt;/b&gt;, I feel like we're seeing more and more "I downloaded blah blah blah data and trained a model using entirely standard features to predict something and it kind of worked" papers. 
I apologize if I've just described your paper, but these papers really rub me the wrong way.  I feel like I just don't learn anything from them: we already know that machine learning works surprisingly well and I don't really need more evidence of that.  Now, if my sentence described your paper, but your paper additionally had a really interesting &lt;i&gt;analysis&lt;/i&gt; that helps us understand something about language, then you rock!  &lt;b&gt;Second&lt;/b&gt;, I saw a &lt;i&gt;lot&lt;/i&gt; of presentations where speakers were somewhat embarrassingly unaware of very prominent, very relevant prior work.  (In none of these cases was the prior work my own: it was work that's much more famous.) Sometimes the papers were cited (and it was more of a "why didn't you compare against that" issue) but very frequently they were not. Obviously not everyone knows about all papers, but I recognized this even for papers that aren't even close to my area.&lt;br /&gt;&lt;br /&gt;Okay, I just ranted, so let's end on a positive note.  I'm leaving the conference knowing more than when I went, and I had fun at the same time.  Often we complain about the obvious type I errors and not-so-obvious type II errors, but overall I found the program strong.  
Many thanks to the entire program committee for putting together an on-average very good set of papers, and many thanks to all of &lt;i&gt;you&lt;/i&gt; for writing these papers!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-1650900416182826748?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/1650900416182826748/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=1650900416182826748' title='16 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/1650900416182826748'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/1650900416182826748'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/06/naacl-2010-retrospective.html' title='NAACL 2010 Retrospective'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>16</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-5939551284766191861</id><published>2010-04-29T09:43:00.002-06:00</published><updated>2010-04-29T09:53:33.235-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hiring'/><category scheme='http://www.blogger.com/atom/ns#' term='community'/><title type='text'>Graduating?  Want a post-doc?  Let NSF pay!</title><content type='html'>I get many emails of the form "I'm looking for a postdoc...."  I'm sure that other, more senior, more famous people get lots of these.  My internal reaction is always "Great: I wish I could afford that!"  
NSF has a solution: let them pay for it!  This is the second year of the &lt;a href="http://cifellows.org/"&gt;CI fellows program&lt;/a&gt;, and I know of two people who did it last year (one in NLP, one in theory).  I think it's a great program, both for faculty and for graduates (especially since the job market is so sucky this year).&lt;br /&gt;&lt;br /&gt;If you're graduating, you should apply (unless you have other, better, job options already).  But beware, the deadline is &lt;span style="font-weight: bold;"&gt;May 23&lt;/span&gt;!!! Here's more info directly from NSF's mouth:&lt;br /&gt;&lt;blockquote style="font-style: italic;"&gt;The CIFellows Project is an opportunity for recent Ph.D. graduates in computer science and closely related fields to obtain one- to two-year postdoctoral positions at universities, industrial research laboratories, and other organizations that advance the field of computing and its positive impact on society. The goals of the CIFellows project are to retain new Ph.D.s in research and teaching and to support intellectual renewal and diversity in the computing fields at U.S. organizations.&lt;br /&gt;.....&lt;br /&gt;Every CIFellow application must identify 1-3 host mentors. &lt;a href="http://match.cifellows.org/"&gt;Click here&lt;/a&gt; to visit a website where prospective mentors have posted their information. 
In addition, openings that have been posted over the past year (and may be a source of viable mentors/host organizations for the CIFellowships) are &lt;a href="http://cifellows.org/opportunities"&gt;available here&lt;/a&gt;.&lt;br /&gt;&lt;/blockquote&gt;Good luck!&lt;br /&gt;&lt;blockquote&gt;&lt;/blockquote&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-5939551284766191861?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/5939551284766191861/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=5939551284766191861' title='16 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/5939551284766191861'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/5939551284766191861'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/04/graduating-want-post-doc-let-nsf-pay.html' title='Graduating?  Want a post-doc?  Let NSF pay!'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>16</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-6521860775208709654</id><published>2010-04-20T11:40:00.001-06:00</published><updated>2010-04-20T11:40:58.695-06:00</updated><title type='text'>((A =&gt; B) and (not B)) =&gt; (not A)</title><content type='html'>I remember back in middle school or high school -- added uncertainty so as not to date myself too much -- I first learned of the existence of file compression software.  We used pkzip, mostly.  
I remember one of the first things I did was: compress myfile.exe to myfile.exe.zip.  Then compress that to myfile.exe.zip.zip.  Then myfile.exe.zip.zip.zip.  I cannot tell you how disappointed I was when, at one stage, the file size went up after "compression."&lt;br /&gt;&lt;br /&gt;We read papers all the time that demonstrate something roughly of the form "if A then B."  This happens most obviously the closer you get to theory (when such statements can be made precise), but happens also in non-theoretical work.  The point of this post is: &lt;b&gt;if you believe A =&gt; B, then you have to ask yourself: which do I believe more?  A, or not B?&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The simplest case is the compression case.  Let's say a weak compressor is one that always reduces a (non-empty) file's size by one bit.  A strong compressor is one that cuts the file down to one bit.  I can easily prove to you that if you give me a weak compressor, I can turn it into a strong compressor by running it N-1 times on files of size N.  Trivial, right?  But what do you conclude from this?  You're certainly not happy, I don't think.  For what I've proved really is that weak compressors don't exist, not that strong compressors do.  That is, you believe &lt;i&gt;so strongly&lt;/i&gt; that a strong compressor is impossible, that you must conclude from (weak) =&gt; (strong) that (weak) cannot possibly exist.&lt;br /&gt;&lt;br /&gt;Let's take the most obvious example from machine learning: boosting.  The basic result of boosting is that I can turn a "weak classifier" into a "strong classifier."  A strong classifier is one that (almost always) does a really good job at classification.  A weak classifier is one that (almost always) does a not completely crappy job at classification.  That is, a strong classifier will get you 99.9% accuracy (most of the time) while a weak classifier will only get you 50.1% accuracy (most of the time).  
Boosting works by repeatedly applying the weak classifier to modified data sets in order to get a strong classifier out.&lt;br /&gt;&lt;br /&gt;In all the boosting papers, the results are presented as positive.  That is: look, obviously we want strong classifiers.  But weak classifiers look much more attainable.  And voila, by boosting magic, we can turn the weak into the strong.  This is an A =&gt; B setting: (existence of weak classifiers) =&gt; (existence of strong classifiers).&lt;br /&gt;&lt;br /&gt;Sounds great, right?  But let me ask you: do you believe more strongly that (weak classifiers exist) or more strongly that (strong classifiers do not exist)?  For me, it's the latter.  To some degree, no free lunch theorems apply here.  This yields a &lt;b&gt;totally different read&lt;/b&gt; of boosting results, namely: &lt;b&gt;weak classifiers don't even exist!!!&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;More on the language side, one of the more classic experimental results we have is that if you give me a really good semantic representation, I can do an awesome job doing natural language generation from those semantics.  In order to do translation, for instance, I just have to generate the semantics of a French sentence and then I can generate a near-perfect English translation.  (French semantics) =&gt; (Perfect translations).  But do I believe more strongly that we can get perfect French semantics or that we can not get perfect translation?  Right now, almost certainly the latter.&lt;br /&gt;&lt;br /&gt;My point is really one about critical reading.  When you read a paper, things will be presented in one light.  But that's often not the only light in which they can be interpreted.  
Apply your own interpretation.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-6521860775208709654?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/6521860775208709654/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=6521860775208709654' title='25 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/6521860775208709654'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/6521860775208709654'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/04/b-and-not-b-not.html' title='((A =&gt; B) and (not B)) =&gt; (not A)'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>25</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-2082227438864109984</id><published>2010-04-12T14:57:00.002-06:00</published><updated>2010-04-12T15:09:26.672-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='teaching'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>How I teach machine learning</title><content type='html'>I've had discussions about this with tons of people, and it seems like my approach is fairly odd.  
So I thought I'd blog about it because I've put a lot of thought into it over the past four offerings of the machine learning course here at Utah.&lt;br /&gt;&lt;br /&gt;At a high level, if there is one thing I want my students to remember after the semester is over it's the idea of &lt;span style="font-style: italic;"&gt;generalization&lt;/span&gt; and how it relates to function complexity&lt;span style="font-style: italic;"&gt;.&lt;/span&gt;  That's it.  Now, more operationally, I'd like them to learn SVMs (and kernels) and EM for generative models.&lt;br /&gt;&lt;br /&gt;In my opinion, the whole tenor of the class is set by how it starts.  Here's how I start.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Decision trees.  No entropy.  No mutual information.  Just decision trees based on classification accuracy.  &lt;span style="font-style: italic;"&gt;Why?&lt;/span&gt;  Because the point isn't to teach them decision trees.  The point is to get as quickly as possible to the point where we can talk about things like generalization and function complexity.  Why decision trees?  Because EVERYONE gets them.  They're so intuitive.  And analogies to 20 questions abound.  We also talk about the whole notion of data being drawn from a distribution and what it means to predict well in the future.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Nearest neighbor classifiers.  No radial basis functions, no locally weighted methods, etc.  &lt;span style="font-style: italic;"&gt;Why&lt;/span&gt;?  Because I want to introduce the idea of thinking of data as points in high dimensional space.  This is a &lt;span style="font-style: italic;"&gt;big&lt;/span&gt; step for a lot of people, and one that takes some getting used to.  We then do k-nearest neighbor and relate it to generalization, overfitting, etc.  The punch line of this section is the idea of a decision boundary and the complexity of decision boundaries.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Linear algebra and calculus review.  
At this point, they're ready to see why these things matter.  We've already hinted at learning as some sort of optimization (via decision trees) and data in high dimensions, hence calculus and linear algebra.  Note: no real probability here.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Linear classifiers as methods for directly optimizing a decision boundary.  We start with 0-1 loss and then move to perceptron.  Students love perceptron because it's so procedural.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;The rest follows mostly as almost every other machine learning course out there.  But IMO these first four days are &lt;span style="font-style: italic;"&gt;crucial.&lt;/span&gt;  I've tried (in the past) starting with linear regression or linear classification and it's just a disaster.  You spend too much time talking about unimportant stuff.  The intro with error-based decision trees moving to kNN is amazingly useful.&lt;br /&gt;&lt;br /&gt;The sad thing is that there are basically no books that follow any order even remotely like this.  Except...drum roll... it's actually not far from what Mitchell's book does.  Except he does kNN much later.  It's really depressing how bad most machine learning books are from a pedagogical perspective... you'd think that in 12 years someone would have written something that works better.&lt;br /&gt;&lt;br /&gt;On top of that, the most recent time I taught ML, I structured everything around recommender systems.  You can actually make it all work, and it's a lot of fun.  We actually did recommender systems for classes here at the U (I had about 90-odd students from AI the previous semester fill out ratings on classes they'd taken in the past).  The data was a bit sparse, but I think it was a lot of fun.&lt;br /&gt;&lt;br /&gt;The other thing I changed most recently that I'm very happy with is that I have a full project on feature engineering.  (It ties in to the course recommender system idea.)  Why?  
Because most people who take ML, if they ever use it at all, will need to do this.  It's maybe one of the most important things that they'll have to learn.  We should try to teach it.  Again, something that no one ever talks about in books.&lt;br /&gt;&lt;br /&gt;Anyway, that's my set of tricks.  If you have some that you particularly like, feel free to share!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-2082227438864109984?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/2082227438864109984/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=2082227438864109984' title='29 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/2082227438864109984'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/2082227438864109984'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/04/how-i-teach-machine-learning.html' title='How I teach machine learning'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>29</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-8857615392509433634</id><published>2010-04-07T10:26:00.002-06:00</published><updated>2010-04-09T12:46:40.890-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>When Maximum Likelihood Doesn't Make Sense</title><content type='html'>One of the most fun AHA moments 
in ML or stats is when you see that for multinomial distributions, your naive idea of relative frequency corresponds (through a bunch of calculus) to maximum likelihood estimation.  Ditto means of Gaussians, though that's never quite as amazing because Gaussians seem like sort of odd beasts anyway.  It's nice when our intuitions match what stats tells us.&lt;br /&gt;&lt;br /&gt;I'll often watch a DVD, and then leave it in the DVD player until I want to watch something else.  Usually it will return to the menu screen and the timer display will go through 0:00, 0:01, 0:02, ..., until it hits the length of the title screen loop, and then go back to 0:00.  What I always wonder when I glance up at it is "what is the actual length of the title screen loop?"  That is, what is the highest value it'll count up to.&lt;br /&gt;&lt;br /&gt;Being a good statistician, I set up a model.  First I play the frequentist game.  Since I glance up at pretty arbitrary times, my observations can be modeled as a uniform random variable from 0 to some upper bound U, which is the quantity I want to infer.&lt;br /&gt;&lt;br /&gt;Supposing that I only make one observation, say 0:27, I want a maximum likelihood estimate of U.  It's easy to see that U=27 is the maximum likelihood solution.  Anything less than 27 will render my observation impossible.  For any U&gt;=27, my observation has probability 1/(U+1), so the ML solution is precisely U=27.  Note that even if I make five observations a &lt;= b &lt;= c &lt;= d &lt;= e, the ML solution is still U=e.  Huh.  The math works.  The model is as correct as I could really imagine it.  But this answer doesn't really seem reasonable to me.  Somehow I expect a number &lt;span style="font-style: italic;"&gt;greater than&lt;/span&gt; my observation to come about.&lt;br /&gt;&lt;br /&gt;Perhaps I need a prior.  That'll solve it.  I'll add a prior p(U) and then do maximum a posteriori.  
Well, it's easy to see that if p(U) is unimodal, and its mode is less than my (highest) observation, then the MAP solution will still be the (highest) observation.  If its mode is more than my (highest) observation, then the MAP solution will be the mode of p(U).  If I think about this a bit, it's hard for me to justify a multi-modal p(U), and also hard for me to be happy with a system in which my prior essentially completely determines my solution.&lt;br /&gt;&lt;br /&gt;Or I could be fully Bayesian and look at a posterior distribution p(U | data).  This will just be a left-truncated version of my prior, again, not very satisfying.&lt;br /&gt;&lt;br /&gt;(Note that the "Maximum Likelihood Sets" idea, which I also like quite a bit, doesn't fix this problem either.)&lt;br /&gt;&lt;br /&gt;It also really bugs me that only my largest observation really affects the answer.  If I get one hundred observations, most of them centered around 0:10, and then one at 0:30, then I'd guess that 0:30 or 0:31 or 0:32 is probably the right answer.  But if I observe a bunch of stuff spread roughly uniformly between 0:00 and 0:30, then I'd be more willing to believe something larger.&lt;br /&gt;&lt;br /&gt;I don't really have a solution to this dilemma.  Perhaps you could argue that my difficulties arise from the use of a Uniform distribution -- namely, that were I to use another distribution, these problems wouldn't really arise. I don't think this is quite true, as described below:&lt;br /&gt;&lt;br /&gt;We actually run into this problem in NLP fairly often.  I observe a billion documents and see, in these 1b documents, 100k unique words, many of which are singletons.  But I'm not willing to believe that when I see document 1,000,000,001, there won't be a new unseen word.  
Sure it's less likely that I see a new unique word than in document 1001, where it is almost guaranteed, but there's still a relatively large probability that some new word will show up in that next document.&lt;br /&gt;&lt;br /&gt;The whole point is that somehow we expect random samples to be representatives of the underlying distribution.  We don't expect them to &lt;span style="font-style: italic;"&gt;define&lt;/span&gt; the underlying distribution.  I don't actually expect to ever see the corner cases, unless I've seen everything else.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;UPDATE: &lt;/span&gt;The Bayesian approach actually does do something reasonable.  Here is &lt;a href="http://hal3.name/nlpers/uniform.m"&gt;some Matlab code&lt;/a&gt; for computing posteriors with a uniform prior, and results with an upper bound of 100 and various observations are plotted below:&lt;br /&gt;&lt;br /&gt;&lt;img src="http://hal3.name/nlpers/uniform-obs20.png"&gt;&lt;br/&gt;&lt;br /&gt;&lt;img src="http://hal3.name/nlpers/uniform-obs50.png"&gt;&lt;br/&gt;&lt;br /&gt;&lt;img src="http://hal3.name/nlpers/uniform-obs20-50.png"&gt;&lt;br/&gt;&lt;br /&gt;&lt;img src="http://hal3.name/nlpers/uniform-obs1-50.png"&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-8857615392509433634?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/8857615392509433634/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=8857615392509433634' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/8857615392509433634'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/19803222/posts/default/8857615392509433634'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/04/when-maximum-likelihood-doesnt-make.html' title='When Maximum Likelihood Doesn&apos;t Make Sense'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-4986175089813985432</id><published>2010-04-01T08:18:00.002-06:00</published><updated>2010-04-01T08:32:55.783-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='domain adaptation'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Classification weirdness, regression simplicity</title><content type='html'>In the context of some work on multitask learning, we came to realize that classification is kind of weird.  Or at least linear classification.  It's not that it's weird in a way that we didn't already know: it's just sort of a law of unexpected consequences.&lt;br /&gt;&lt;br /&gt;If we're doing linear (binary) classification, we all know that changing the magnitude of the weight vector doesn't change the predictions.  A standard exercise in a machine learning class might be to show that if your data is linearly separable, then for some models (for instance, unregularized models), the best solution is usually an infinite norm weight vector that's pointing in the right direction.&lt;br /&gt;&lt;br /&gt;This is definitely not true of (linear) regression.  Taking a good (or even perfect) linear regressor and blowing up the weights by some constant will kill your performance.  
By adding a regularizer, what you're basically doing is just saying how big you want that norm to be.&lt;br /&gt;&lt;br /&gt;Of course, by regression I simply mean minimizing something like squared error and by classification I mean something like 0/1 loss or hinge loss or logistic loss or whatever.&lt;br /&gt;&lt;br /&gt;I think this is stuff that we all know.&lt;br /&gt;&lt;br /&gt;Where this can bite you in unexpected ways is the following.  In lots of problems, like domain adaptation and multitask learning, you end up making assumptions roughly of the form "my weight vector for domain A should look like my weight vector for domain B" where "look like" is really the place where you get to be creative and define things how you feel best.&lt;br /&gt;&lt;br /&gt;This is all well and good in the regression setting.  A magnitude 5 weight in regression means a magnitude 5 weight in regression.  But not so in classification.  Since you can arbitrarily scale your weight vectors and still get the same decision boundaries, a magnitude 5 weight kind of means nothing.  Or at least it means something that has to do more with the difficulty of the problem and how you chose to set your regularization parameter, rather than something to do with the task itself.&lt;br /&gt;&lt;br /&gt;Perhaps we should be looking for definitions of "look like" that are &lt;span style="font-style: italic;"&gt;insensitive&lt;/span&gt; to things like magnitude.  Sure you can always normalize all your weight vectors to unit norm before you co-regularize them, but that loses information as well.&lt;br /&gt;&lt;br /&gt;Perhaps this is a partial explanation of some negative transfer.  One thing that you see, when looking at the literature in DA and MTL, is that all the tasks are typically of about the same difficulty.  My expectation is that if you have two tasks that are highly related, but one is way harder than the other, this is going to lead to negative transfer.  Why?  
Because the easy task will get low norm weights, and the hard task will get high norm weights.  The high norm weights will pull the low norm weights toward them too much, leading to worse performance on the "easy" task.  In a sense, we actually want the &lt;span style="font-style: italic;"&gt;opposite&lt;/span&gt; to happen: if you have a really hard task, it shouldn't screw up everyone else that's easy!  (Yes, I know that being Bayesian might help here since you'd get a lot of uncertainty around those high norm weight vectors!)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-4986175089813985432?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/4986175089813985432/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=4986175089813985432' title='15 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/4986175089813985432'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/4986175089813985432'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/04/classification-weirdness-regression.html' title='Classification weirdness, regression simplicity'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>15</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-474582300475348739</id><published>2010-02-20T14:22:00.002-07:00</published><updated>2010-02-20T14:35:07.656-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data'/><title 
type='text'>Google 5gram corpus has unreasonable 5grams</title><content type='html'>In the context of something completely unrelated, I was looking for a fairly general pattern in the Google 1TB corpus.  In particular, I was looking for verbs that are sort of transitive.  I did a quick grep for 5grams of the form "the SOMETHING BLAHed the SOMETHING."  Or, more specifically:&lt;br /&gt;&lt;pre&gt;    grep -i '^the [a-z][a-z]* [a-z][a-z]*ed the [a-z]*'&lt;br /&gt;&lt;/pre&gt;I then took these, lower cased them, and then merged the counts.  Here are the top 25, sorted and with counts:&lt;br /&gt;&lt;pre&gt;     1  101500  the surveyor observed the use&lt;br /&gt;    2   30619  the rivals shattered the farm&lt;br /&gt;    3   27999  the link entitled the names&lt;br /&gt;    4   22928  the trolls ambushed the dwarfs&lt;br /&gt;    5   22843  the dwarfs ambushed the trolls&lt;br /&gt;    6   21427  the poet wicked the woman&lt;br /&gt;    7   15644  the software helped the learning&lt;br /&gt;    8   13481  the commission released the section&lt;br /&gt;    9   12273  the mayor declared the motion&lt;br /&gt;   10   11046  the player finished the year&lt;br /&gt;   11   10809  the chicken crossed the road&lt;br /&gt;   12    8968  the court denied the motion&lt;br /&gt;   13    8198  the president declared the bill&lt;br /&gt;   14    7890  the board approved the following&lt;br /&gt;   15    7848  the bill passed the house&lt;br /&gt;   16    7373  the fat feed the muscle&lt;br /&gt;   17    7362  the report presented the findings&lt;br /&gt;   18    7115  the committee considered the report&lt;br /&gt;   19    6956  the respondent registered the domain&lt;br /&gt;   20    6923  the chairman declared the motion&lt;br /&gt;   21    6767  the court rejected the argument&lt;br /&gt;   22    6307  the court instructed the jury&lt;br /&gt;   23    5962  the complaint satisfied the formal&lt;br /&gt;   24    5688  the lord blessed the sabbath&lt;br /&gt;   25    5486  the 
bill passed the senate&lt;br /&gt;&lt;/pre&gt;What the heck?!  First of all, the first one is shocking, but maybe you could convince me.  How about numbers 4 and 5?  "The trolls ambushed the dwarfs" (and vice versa)?  These things are the fourth and fifth most common five grams matching my pattern on the web?  "The poet wicked the woman"?  What does "wicked" even mean?  And yet these all beat out "The bill passed the house" and "The court instructed the jury".  But then #23: "The prince compiled the Mishna"???  (#30 is also funny: "the matrix reloaded the matrix" is an amusing segmentation issue.)&lt;br /&gt;&lt;br /&gt;If we do a vanilla google search for the counts of some of these, we get:&lt;br /&gt;&lt;pre&gt;     1     10900  the surveyor observed the use&lt;br /&gt;    4      7750  the trolls ambushed the dwarfs&lt;br /&gt;    5      7190  the dwarfs ambushed the trolls&lt;br /&gt;    6     &lt;span style="font-weight: bold;"&gt;ZERO!&lt;/span&gt;  the poet wicked the woman&lt;br /&gt;   15  20200000  the bill passed the house&lt;br /&gt;   22   3600000  the court instructed the jury&lt;br /&gt;&lt;/pre&gt;This just flabbergasts me.  I'm told that lots of people have expressed worries over the Google 1TB corpus, but have never actually heard anything myself...  And never seen anything myself.&lt;br /&gt;&lt;br /&gt;Does anyone have an explanation for these effects?  
How can I expect to get anything done with such ridiculous data!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-474582300475348739?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/474582300475348739/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=474582300475348739' title='74 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/474582300475348739'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/474582300475348739'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/02/google-5gram-corpus-has-unreasonable.html' title='Google 5gram corpus has unreasonable 5grams'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>74</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-211401623882542010</id><published>2010-02-17T09:32:00.003-07:00</published><updated>2010-02-17T11:29:33.455-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='linguistics'/><category scheme='http://www.blogger.com/atom/ns#' term='problems'/><title type='text'>Senses versus metaphors</title><content type='html'>I come from a tradition of not really believing in word senses.  I fondly remember a talk &lt;a href="http://www.isi.edu/%7Ehovy"&gt;Ed Hovy&lt;/a&gt; gave when I was a grad student.  
He listed the following example sentences and asked each audience member to group them into senses:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;John drove his car to work.&lt;/li&gt;&lt;li&gt;We drove to the university every morning.&lt;/li&gt;&lt;li&gt;She drove me to school every day.&lt;/li&gt;&lt;li&gt;He drives me crazy.&lt;/li&gt;&lt;li&gt;She is driven by her passion.&lt;/li&gt;&lt;li&gt;He drove the enemy back.&lt;/li&gt;&lt;li&gt;She finally drove him to change jobs.&lt;/li&gt;&lt;li&gt;He drove a nail into the wall.&lt;/li&gt;&lt;li&gt;Bill drove the ball far out into the field.&lt;/li&gt;&lt;li&gt;My students are driving away at ACL papers.&lt;/li&gt;&lt;li&gt;What are you driving at?&lt;/li&gt;&lt;li&gt;My new truck drives well.&lt;/li&gt;&lt;li&gt;He drives a taxi in New York.&lt;/li&gt;&lt;li&gt;The car drove around the corner.&lt;/li&gt;&lt;li&gt;The farmer drove the cows into the barn.&lt;/li&gt;&lt;li&gt;We drive the turnpike to work.&lt;/li&gt;&lt;li&gt;Sally drove a golf ball clear across the green.&lt;/li&gt;&lt;li&gt;Mary drove the baseball with the bat.&lt;/li&gt;&lt;li&gt;We drove a tunnel through the hill.&lt;/li&gt;&lt;li&gt;The steam drives the engine in the train.&lt;/li&gt;&lt;li&gt;We drove the forest looking for game.&lt;/li&gt;&lt;li&gt;Joe drove the game from their hiding holes.&lt;/li&gt;&lt;/ol&gt;Most people in the audience came up with 5 or 6 senses.  One came up with two (basically the physical versus mental distinction).  According to WordNet, each of these is a separate sense.  (And this is only for the verb form!)  A common "mistake" people made was to group 1, 2, 3, 13 and 14, all of which seem to have to do with driving cars.  The key distinction is that 1 expresses the operation of the vehicle, 2 expresses being transported, 3 expresses being caused to move and 13 expresses driving for a job.  
You can read the &lt;a href="http://wordnetweb.princeton.edu/perl/webwn?s=drive&amp;amp;sub=Search+WordNet&amp;amp;o2=&amp;amp;o0=1&amp;amp;o7=&amp;amp;o5=&amp;amp;o1=1&amp;amp;o6=&amp;amp;o4=&amp;amp;o3=&amp;amp;h="&gt;full WordNet descriptions&lt;/a&gt; if you don't believe me.&lt;br /&gt;&lt;br /&gt;Now, the point of this isn't to try to argue that WordNet is wacky in any way.  The people who put it together really know what they're talking about.  After all, these senses are all really different, in the sense that there really is a deep interpretive difference between 1, 2, 3 and 13.  It's just sufficiently subtle that unless it's pointed out to you, it's not obvious.  There's been a lot of work recently from Ed and others on "consolidating" senses in the OntoNotes project: in fact, they have &lt;a href="http://www.bbn.com/ontonotes/components/word_sense"&gt;exactly the same example&lt;/a&gt; (how convenient) where they've grouped the verb drive into seven senses, rather than 22.  These are:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;operating or traveling via a vehicle (WN 1, 2, 3, 12, 13, 14, 16)&lt;/li&gt;&lt;li&gt;force to a position or stance (WN 4, 6, 7, 8, 15, 22)&lt;/li&gt;&lt;li&gt;exert energy on behalf of something (WN 5, 10)&lt;/li&gt;&lt;li&gt;cause object to move rapidly by striking it (WN 9, 17, 18)&lt;/li&gt;&lt;li&gt;a directed course of conversation (WN 11)&lt;/li&gt;&lt;li&gt;excavate horizontally, as in mining (WN 19)&lt;/li&gt;&lt;li&gt;cause to function or operate (WN 20)&lt;/li&gt;&lt;/ol&gt;Now, again, I'm not here to argue that these are better or worse or anything in comparison to WordNet.&lt;br /&gt;&lt;br /&gt;The point is that there are (at least) two ways of explaining the wide senses of a word like "drive."  One is through senses and this is the typical approach, at least in NLP land.  
The other is &lt;a href="http://www.pineforge.com/upm-data/6031_Chapter_10_O%27Brien_I_Proof_5.pdf"&gt;metaphor&lt;/a&gt; (and yes, that is a &lt;span style="font-style: italic;"&gt;different&lt;/span&gt; Mark Johnson).  I'm not going to go so far as &lt;a href="http://cogweb.ucla.edu/CogSci/Lakoff.html"&gt;to claim that everything is a metaphor&lt;/a&gt;, but I do think it provides an alternative perspective on this issue.  And IMO, alternative perspectives, if plausible, are always worth looking at.&lt;br /&gt;&lt;br /&gt;Let's take a really simple "off the top of my head" example based on "drive."  Let's unrepentantly claim that there is exactly one sense of drive.  Which one?  It seems like the most reasonable is probably OntoNotes' sense 2; Merriam-Webster claims that drive derives from Old-High-German "triban" which, from what I can tell in about a five minute search, has more to do with driving cattle than anything else.  (But even if I'm wrong, this is just a silly example.)&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Obviously we don't drive cars like we drive cattle.  For one, we're actually inside the cars.  But the whole point of driving cattle is to get them to go somewhere.  If we think of cars as metaphorical cattle, then by operating them, we are "driving" them (in the drive-car sense).&lt;br /&gt;&lt;/li&gt;&lt;li&gt;These are mostly literal.  However, for "drive a nail", we need to think of the nail as like a cow that we're trying to get into a pen (the wall).&lt;br /&gt;&lt;/li&gt;&lt;li&gt;This is, I think, the most clear metaphorical usage.  "He is driving away at his thesis" really means that he's trying to get his thesis to go somewhere (where == to completion).&lt;/li&gt;&lt;li&gt;Driving balls is like driving cattle, except you have to work harder to do it because they aren't self-propelled.  
This is somewhat like driving nails.&lt;/li&gt;&lt;li&gt;"What are you driving at" is analogous to driving-at-thesis to me.&lt;/li&gt;&lt;li&gt;"Drive a tunnel through the mountain" is less clear to me.  But it's also not a sense of this word that I think I have ever used or would ever use.  So I can't quite figure it out.&lt;/li&gt;&lt;li&gt;"Steam drives an engine" is sort of a double metaphor.  Engine is standing in for cow and steam is standing in for cowboy.  But otherwise it's basically the same as driving cattle.&lt;/li&gt;&lt;/ol&gt;Maybe this isn't the greatest example, but hopefully at least it's a bit thought-worthy.  (And yes, I know I'm departing from Lakoff... in a Lakoff style, there's always a concrete thing and a non-concrete thing in the Lakoff setup from what I understand.)&lt;br /&gt;&lt;br /&gt;This reminds me of the annoying thing my comrades and I used to do as children.  "I come from a tradition..." Yields "You &lt;span style="font-style: italic;"&gt;literally&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;come&lt;/span&gt; from a tradition?"  (No, I was educated in such a tradition.... although even that you could ask whether I was really inside a tradition.)  "A talk Ed Hovy gave..."  Yields  "Ed &lt;span style="font-style: italic;"&gt;literally&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;gave&lt;/span&gt; a talk?"  (No, he spoke to an audience.)  "I drove the golf ball across the field" Yields "You got in the golf ball and drove it across the field?"  Sigh.  Kids are annoying.&lt;br /&gt;&lt;br /&gt;Why should I care which analysis I use (senses or metaphor)?  I'm not sure.  
It's very rare that I actually feel like I'm being seriously hurt by the word sense issue, and it seems that if you want to use sense to do a real task like translation, you have to &lt;a href="http://www.cs.ust.hk/%7Emarine/papers/CarpuatWu_EMNLP2007.pdf"&gt;depart from human-constructed sense inventories&lt;/a&gt; anyway.&lt;br /&gt;&lt;br /&gt;But I can imagine a system roughly like the following.  First, find the verb and its frame and true literal meaning (maybe it actually does have more than one).  This verb frame will impose some restrictions on its arguments (for instance, drive might say that both the agent and theme have to be animate).  If you encounter something where this is not true (eg., a "car" as a theme or "passion" as an agent), you know that this must be a metaphorical usage.  At this point, you have to deduce what it must mean.  That is, if we have some semantics associated with the literal interpretation, we have to figure out how to munge it to work in the metaphorical case.  For instance, for drive, we might say that the semantics are roughly "E = theme moves &amp;amp; E' = theme executes E &amp;amp; agent causes E'"  If the patient cannot actually execute things (it's a nail), then we have to figure that something else (eg., in this case, the agent) did the actual executing.  Etc.&lt;br /&gt;&lt;br /&gt;So it seems like the options are: come up with semantics and frames for every sense (this is what's done, eg., in &lt;a href="http://verbs.colorado.edu/verb-index/vn/drive-11.5.php#drive-11.5"&gt;VerbNet&lt;/a&gt;).  Or, have a single (or small number) of semantics and frames and have some generic rules (hopefully generic!) 
for how to derive metaphorical uses from them.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-211401623882542010?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/211401623882542010/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=211401623882542010' title='18 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/211401623882542010'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/211401623882542010'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/02/senses-versus-metaphors.html' title='Senses versus metaphors'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>18</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-4949780778665499024</id><published>2010-02-07T10:29:00.004-07:00</published><updated>2010-02-07T11:00:03.369-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hiring'/><category scheme='http://www.blogger.com/atom/ns#' term='community'/><title type='text'>Blog heads east</title><content type='html'>I started this blog ages ago while still in grad school in California at  &lt;a href="http://www.cs.usc.edu/"&gt;USC&lt;/a&gt;/&lt;a href="http://www.isi.edu/"&gt;ISI&lt;/a&gt;.  It came with me three point five years ago when I started as an Assistant Professor at the &lt;a href="http://www.cs.utah.edu/"&gt;University of Utah&lt;/a&gt;.  
Starting some time this coming summer, I will take it even further east: to &lt;a href="http://www.cs.umd.edu/"&gt;CS at the University of Maryland&lt;/a&gt; where I have just accepted a faculty offer.&lt;br /&gt;&lt;br /&gt;These past (almost) four years at Utah have been fantastic for me, which has made this decision to move very difficult.  I feel very lucky to have been able to come here.  I've had enormous freedom to work in directions that interest me, great teaching opportunities (which have taught &lt;span style="font-style: italic;"&gt;me&lt;/span&gt; a lot), and great colleagues.  Although I know that moving doesn't mean forgetting one's friends, it does mean that I won't run into them in hallways, or grab lunch, or afternoon coffee or whatnot anymore.  &lt;a href="http://www.cs.utah.edu/%7Eriloff"&gt;Ellen&lt;/a&gt;, &lt;a href="http://www.cs.utah.edu/%7Esuresh"&gt;Suresh&lt;/a&gt;, &lt;a href="http://www.cs.utah.edu/%7Eelb"&gt;Erik&lt;/a&gt;, &lt;a href="http://www.cs.utah.edu/~jmh/"&gt;John&lt;/a&gt;, &lt;a href="http://www.sci.utah.edu/%7Efletcher/"&gt;Tom&lt;/a&gt;, &lt;a href="http://www.sci.utah.edu/%7Etolga/"&gt;Tolga&lt;/a&gt;, &lt;a href="http://www.cs.utah.edu/%7Ewhitaker"&gt;Ross&lt;/a&gt;, and everyone else here have made my time wonderful.  I will miss all of them.&lt;br /&gt;&lt;br /&gt;The University here has been incredibly supportive in every way, and I've thoroughly enjoyed my time here.  Plus, having &lt;a href="http://www.alta.com/"&gt;world-class skiing&lt;/a&gt; a half-hour drive from my house isn't too shabby either.  (Though my # of days skiing per year has declined geometrically since I started: 30-something the first year, then 18, then 10...  so far only a handful this year.  Sigh.)  Looking back, my time here has been great and I'm glad I had the opportunity to come.&lt;br /&gt;&lt;br /&gt;That said, I'm of course also looking forward to moving to Maryland; otherwise I would not have done it!  
There are a number of great people there in natural language processing, machine learning and related fields.  I'd like to think that UMD should be and is one of the go-to places for these topics, and am excited to be a part of it.  Between &lt;a href="http://www.umiacs.umd.edu/%7Ebonnie/"&gt;Bonnie&lt;/a&gt;, &lt;a href="http://www.umiacs.umd.edu/%7Eresnik/"&gt;Philip&lt;/a&gt;, &lt;a href="http://www.umiacs.umd.edu/%7Egetoor/"&gt;Lise&lt;/a&gt;, &lt;a href="http://www.glue.umd.edu/%7Emharper/"&gt;Mary&lt;/a&gt;, &lt;a href="http://www.glue.umd.edu/%7Eoard/"&gt;Doug&lt;/a&gt;, &lt;a href="http://www.umiacs.umd.edu/%7Ejklavans/"&gt;Judith&lt;/a&gt;, &lt;a href="http://www.umiacs.umd.edu/%7Ejimmylin/"&gt;Jimmy&lt;/a&gt; and the other folks in &lt;a href="http://www.umiacs.umd.edu/research/CLIP/people.htm"&gt;CLIP&lt;/a&gt; and related groups, I think it will be a fantastic place for me to be, and a fantastic place for all those PhD-hungry students out there to go!  Plus, having all the great folks at &lt;a href="http://www.clsp.jhu.edu/"&gt;JHU's CLSP&lt;/a&gt; a 45-minute drive away will be quite convenient.&lt;br /&gt;&lt;br /&gt;A part of me is sad to be leaving, but another part of me is excited about new opportunities.  The move will take place some time over the summer (carefully avoiding conferences), so if I blog less then, you'll know why.  
Thanks again to everyone who has made my life here fantastic.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-4949780778665499024?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/4949780778665499024/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=4949780778665499024' title='27 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/4949780778665499024'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/4949780778665499024'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/02/blog-heads-east.html' title='Blog heads east'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>27</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-928704673462621954</id><published>2010-01-31T08:49:00.004-07:00</published><updated>2010-01-31T12:37:08.036-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='information retrieval'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Coordinate ascent and inverted indices...</title><content type='html'>Due to a small off-the-radar project I'm working on right now, I've been building my own inverted indices.  (Yes, I'm vaguely aware of discussions in DB/Web search land about whether you should store your inverted indices in a database or whether you should handroll your own.  
This is tangential to the point of this post.)&lt;br /&gt;&lt;br /&gt;For those of you who don't remember your IR 101, here's the deal with inverted indices.  We're a search engine and want to be able to quickly find pages that contain query terms.  One way of storing our set of documents (e.g., the web) is to store a list of documents, each of which is a list of words appearing in that document.  If there are N documents, each of length L, then answering a query is O(N*L) since we have to look over each document to see if it contains the word we care about.  The alternative is to store an inverted index, where we have a list of &lt;span style="font-style: italic;"&gt;words&lt;/span&gt; and for each word, we store the list of documents it appears in.  Finding a word's document list here is something like O(1) if we hash the words, O(log |V|) if we do binary search (V = vocabulary), etc.  Why it's called an inverted index is beyond me: it's really just like the index you find at the back of a textbook.  And the computational difference is like trying to find mentions of "Germany" in a textbook by reading every page and looking for "Germany" versus going to the index in the back of the book.&lt;br /&gt;&lt;br /&gt;Now, let's say we have an inverted index for, say, the web.  It's pretty big (and in all honesty, probably distributed across multiple storage devices or multiple databases or whatever).  But regardless, a linear scan of the index would give you something like: here's word 1 and here are the documents it appears in; here's word 2 and its documents; and so on up to word &lt;span style="font-style: italic;"&gt;v&lt;/span&gt; and its documents.&lt;br /&gt;&lt;br /&gt;Suppose that, outside of the index, we have a classification task over the documents on the web.  That is, for any document, we can (efficiently -- say O(1) or O(log N)) get the "label" of this document.  It's either +1, -1 or ? (? 
== unknown, or unlabeled).&lt;br /&gt;&lt;br /&gt;My argument is that this is a very plausible setup for a very large-scale problem.&lt;br /&gt;&lt;br /&gt;Now, if we're trying to solve this problem, doing a "common" optimization like stochastic (sub)gradient descent is just not going to work, because it would require us to iterate over &lt;span style="font-style: italic;"&gt;documents&lt;/span&gt; rather than iterating over &lt;span style="font-style: italic;"&gt;words&lt;/span&gt; (where I'm assuming words == features, for now...).  That would be ridiculously expensive.&lt;br /&gt;&lt;br /&gt;The alternative is to do some sort of coordinate ascent algorithm.  These actually used to be quite popular in maxent land, and, in particular, Joshua Goodman had a &lt;a href="http://www.research.microsoft.com/%7Ejoshuago/fasthigh.ps"&gt;coordinate ascent algorithm for maxent models&lt;/a&gt; that apparently worked quite well.  (In searching for that paper, I just came across a 2009 &lt;a href="http://www.aclweb.org/anthology/P/P09/P09-2072.pdf"&gt;paper on roughly the same topic&lt;/a&gt; that I hadn't seen before.)&lt;br /&gt;&lt;br /&gt;Some other algorithms have a coordinate ascent feel, for instance the &lt;a href="http://www-stat.stanford.edu/%7Etibs/lasso/simple.html"&gt;LASSO&lt;/a&gt; (and relatives, including the &lt;a href="http://www-rcf.usc.edu/%7Egareth/research/JRSSB_copy.pdf"&gt;Dantzig selector+LASSO = DASSO&lt;/a&gt;), but they wouldn't really scale well in this problem because they would require a full pass over the entire index to make a single update.  Other approaches, such as boosting, would fare very poorly in this setting.&lt;br /&gt;&lt;br /&gt;This observation first led me to wonder if we can do something LASSO-like or boosting-like in this setting.  But then that made me wonder if this is a special case, or if there are other cases in the "real world" where your data is naturally laid out as features * data points rather than data points * features.  
Sadly, I cannot think of any.  But perhaps that's not because there aren't any.&lt;br /&gt;&lt;br /&gt;(Note that I also didn't really talk about how to do semi-supervised learning in this setting... this is also quite unclear to me right now!)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-928704673462621954?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/928704673462621954/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=928704673462621954' title='39 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/928704673462621954'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/928704673462621954'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/01/coordinate-ascent-and-inverted-indices.html' title='Coordinate ascent and inverted indices...'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>39</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-794943753630408905</id><published>2010-01-26T08:00:00.003-07:00</published><updated>2010-01-26T08:15:19.027-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='community'/><title type='text'>A machine learner's apology</title><content type='html'>Andrew Gelman recently &lt;a href="http://www.stat.columbia.edu/%7Ecook/movabletype/archives/2010/01/three_rivers_in.html"&gt;announced an upcoming talk&lt;/a&gt; by &lt;a href="http://www.cs.cmu.edu/%7Elafferty/"&gt;John 
Lafferty&lt;/a&gt;.  This reminded me of a post I've been meaning to write for ages (years, really) but haven't gotten around to.  Well, now I'm getting around to it.&lt;br /&gt;&lt;br /&gt;A colleague from Utah (not in ML) went on a trip and spent some time talking to a computational statistician, who will remain anonymous.  But let's call this person Alice.  The two were talking about various topics and at one point machine learning came up.  Alice commented:&lt;br /&gt;&lt;blockquote&gt;"Machine learning is just non-rigorous computational statistics."&lt;/blockquote&gt;Or something to that effect.&lt;br /&gt;&lt;br /&gt;A first reaction is to get defensive: &lt;span style="font-style: italic;"&gt;no it's not!&lt;/span&gt;  But Alice has a point.  Some subset of machine learning, in particular the side more Bayesian, tends to overlap quite a bit with compstats, so much so that in some cases they're probably not really that differentiable.  (That is to say, there's a subset of ML that's very very similar to a subset of compstats... you could probably fairly easily find some antipoles that are amazingly different.)&lt;br /&gt;&lt;br /&gt;And there's a clear intended interpretation to the comment: it's not that we're not rigorous, it's that we're &lt;span style="font-style: italic;"&gt;not rigorous in the way that computational statisticians are&lt;/span&gt;.  To that end, let me offer a glib retort:&lt;br /&gt;&lt;blockquote&gt;Computational statistics is just machine learning where you don't care about the computation.&lt;/blockquote&gt;In much the same way that I think Alice's claim is true, I think this claim is also true.  The part of machine learning that's really strong on the CS side, cares a &lt;span style="font-style: italic;"&gt;lot&lt;/span&gt; about computation: how long, how much space, how many samples, etc., will it take to learn something.  
This is something that I've rarely seen in compstats, where the big questions really have to do with things like: is this distribution well defined, can I sample from it, etc., now let me run Metropolis-Hastings.  (Okay, I'm still being glib.)&lt;br /&gt;&lt;br /&gt;I saw a discussion on a theory blog recently that STOC/FOCS is about "THEORY of algorithms" while SODA is about "theory of ALGORITHMS" or something like that.  (Given the capitalization, perhaps it was Bill Gasarch :)?)  Likewise, I think it's fair to say that classic ML is "MACHINE learning" or "COMPUTATIONAL statistics" and classic compstats is "machine LEARNING" or "computational STATISTICS."  We're really working on very similar problems, but the things that we value tend to be different.&lt;br /&gt;&lt;br /&gt;Due to that, I've always found it odd that there's not more interaction between compstats and ML.  Sure, there's some... but not very much.  Maybe it's cultural, maybe it's institutional (conferences versus journals), maybe we really know everything we need to know about the other side and talking wouldn't really get us anywhere.  
But if it's just a sense of "I don't like you because you're treading on my field," then that's not productive (either direction), even if it is an initial gut reaction.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-794943753630408905?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/794943753630408905/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=794943753630408905' title='31 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/794943753630408905'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/794943753630408905'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/01/machine-learners-apology.html' title='A machine learner&apos;s apology'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>31</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-8079573521796335753</id><published>2010-01-19T11:54:00.002-07:00</published><updated>2010-01-19T12:11:08.719-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hiring'/><title type='text'>Interviewing Follies</title><content type='html'>Continuing on the theme of &lt;a href="http://nlpers.blogspot.com/2009/09/some-notes-on-job-search.html"&gt;applying for jobs&lt;/a&gt;, I thought I'd share some interviewing follies that have happened to me, that I've observed others do, and that I've heard about.  
There is a moral to this story; if you want to skip the stories and get to the moral, scroll past the numbered list.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Missing your plane.  I had an interview in a place that was about a 1-2 hour flight away.  I was supposed to fly out first thing in the morning and back last thing at night.  Except I didn't fly out first thing in the morning: I missed my flight.  Why?  Because I cut flights close (someone once said "if you've never missed a flight, you're spending too much time in the airport") and the particular flight I was on left not out of a normal gate, but out of one of those that you have to take a shuttle bus to.  I didn't know that, didn't factor in the extra 5 minutes, and missed the flight.  I called the places I was interviewing at, re-arranged meetings and the day proceeded with a small hiccup.&lt;br /&gt;&lt;br /&gt;I ended up getting an offer from this place.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Missing a meeting.  I was interviewing at a different place, going through my daily meetings, got really tired and misread my schedule.  I thought I was done when in fact I had one meeting to go.  I caught a cab to the airport, flew back home, and noticed a few frantic emails trying to figure out where I was (this was before I had an email-capable cell phone).  (As an aside, someone once told me that they would &lt;span style="font-style: italic;"&gt;intentionally&lt;/span&gt; skip meetings on interview days with people outside their area, figuring that neither the candidate nor the interviewer really wanted such a meeting.  They would hang out in the restroom or something, and blame the miss on a previous meeting running long.  
This was &lt;span style="font-style: italic;"&gt;not&lt;/span&gt; the case in my setting.)&lt;br /&gt;&lt;br /&gt;I did &lt;span style="font-style: italic;"&gt;not&lt;/span&gt; end up getting an offer from this place.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Someone interviewing here a long time ago was scheduled to give their talk using transparencies.  Two minutes before the talk they realized that they had left them in the hotel room.  The already-assembled audience was asked to stay put, the speaker was quickly driven to the hotel and back, and proceeded to give one of the best interview talks on record here.&lt;br /&gt;&lt;br /&gt;This person ended up getting a job offer.&lt;/li&gt;&lt;li&gt;Someone interviewing somewhere I know left their laptop in their hotel, just like number 3.  But instead of having their host drive them back to the hotel, they borrowed someone's car to drive back to the hotel.  They crashed the car, but managed to get their laptop, and gave a great talk.&lt;br /&gt;&lt;br /&gt;This person ended up getting a job offer.&lt;/li&gt;&lt;li&gt;I flew in late to an interview, getting to my hotel around midnight.  I woke up the next morning at seven for an 8:30 breakfast meeting.  I unpacked my suit, tie, belt, socks and shoes.  And realized I had forgotten to pack a dress shirt.  All I had was the shirt I had worn on the plane: a red graphic tee shirt.  My mind quickly raced to figure out if there was a place I could buy a dress shirt in the middle of nowhere at 7am.  I quickly realized that it was impossible, wore my tee shirt under my suit jacket, and went through the day as if that was how I had planned it.&lt;br /&gt;&lt;br /&gt;I ended up getting a job offer.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;The moral of this story is that bad things happen during interviews.  I can't compare any of my stories to the crash-the-car story, but we've all been there, done stupid things, and gotten through it unscathed.  
I think the trick is to pretend like it was intentional, or at least not get flustered.  Yes I missed my flight, yes I forgot my shirt, yes I crashed your car.  But it doesn't affect the rest of my day.  You have to be able to relax and forgive yourself minor mistakes: the interviewers &lt;span style="font-style: italic;"&gt;really&lt;/span&gt; are looking at the bigger picture.&lt;br /&gt;&lt;br /&gt;That said, there are definitely things you can do to botch an interview.  They have to do with things like giving a bad talk (my experience is that a huge weight is placed on the quality of your job talk) and not having a clear vision for where you're going in research life.  Don't be afraid to disagree with people you're talking to: we usually aren't trying to hire yes-men or yes-women.  Once you get a job, especially a faculty job, you are the one who is going to make things happen for you.  You have a world view and that's part of what we're hiring.  Let us see it.  We might not always agree, but if you have &lt;span style="font-style: italic;"&gt;reasons&lt;/span&gt; for your view that you can &lt;span style="font-style: italic;"&gt;articulate&lt;/span&gt;, we'll listen.&lt;br /&gt;&lt;br /&gt;But don't focus on the little things, and don't get flustered.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-8079573521796335753?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/8079573521796335753/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=8079573521796335753' title='31 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/8079573521796335753'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/19803222/posts/default/8079573521796335753'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/01/interviewing-follies.html' title='Interviewing Follies'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>31</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-7221439898325560811</id><published>2010-01-04T10:36:00.002-07:00</published><updated>2010-01-04T11:20:33.893-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='poll'/><category scheme='http://www.blogger.com/atom/ns#' term='community'/><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>ArXiV and NLP, ML and Computer Science</title><content type='html'>Arxiv is something of an underutilized resource in computer science.  Indeed, many computer scientists seem not to even know it exists, despite it having been around for two decades now!  On the other hand, it is immensely popular among (some branches of) mathematics and physics.  This used to strike me as odd: arxiv is a computer service, so why haven't computer scientists jumped on it?  Indeed, I spent a solid day a few months ago putting all my (well almost all my) papers on arxiv.  One can always point to "culture" for such things, but I suspect there are more rational reasons why it hasn't affected us as much as it has others.&lt;br /&gt;&lt;br /&gt;I first ran into arxiv when I was in math land.  
The following is a cartoon view of how (some branches of) math research gets published:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Authors write a paper&lt;/li&gt;&lt;li&gt;Authors submit paper to a journal&lt;/li&gt;&lt;li&gt;Authors simultaneously post paper on arxiv&lt;/li&gt;&lt;li&gt;Journal publishes (or doesn't publish) paper&lt;/li&gt;&lt;/ol&gt;We can contrast this with how life goes in CS land:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Conference announces deadline&lt;/li&gt;&lt;li&gt;One day before deadline, authors write a paper&lt;/li&gt;&lt;li&gt;Conference publishes (or rejects) paper&lt;/li&gt;&lt;/ol&gt;I think there are a few key differences that matter.  Going back to the mathematician model, we can ask ourselves, why do they do #3?  It's a way to get the results out without having to wait for a journal to come back with a go/no-go response.  Basically, in the mathematician model, arxiv is used for advertising while a journal is used for a stamp of approval (or correctness).&lt;br /&gt;&lt;br /&gt;So then why don't we do arxiv too?  I think there are two reasons.  First, we think that conference turnaround is good enough -- we don't need anything faster.  Second, it completely screws up our notions of blind review.  If everyone simultaneously posted a paper on arxiv when submitting to a conference, we could no longer claim, at all, to be blind.  (Please, I beg of you, do not start commenting about blind review versus non-blind review -- I hate this topic of conversation and it never goes anywhere!)  Basically, we rely on our conferences to do &lt;span style="font-style: italic;"&gt;both&lt;/span&gt; advertising &lt;span style="font-style: italic;"&gt;and&lt;/span&gt; stamp of approval.  
Of course, the speed advantage of conferences is offset by the fact that you sometimes have to go through two or three before your paper gets in, which can make it as slow as, or slower than, journals.&lt;br /&gt;&lt;br /&gt;In a sense, I think that largely because of the blind thing, and partially because conferences tend to be faster than journals, the classic usage of arxiv is not really going to happen in CS.&lt;br /&gt;&lt;br /&gt;(There's one other potential use for arxiv, which I'll refer to as the tech-report effect.  I've many times seen short papers posted on people's web pages either as tech-reports or as unpublished documents.  I don't mean tutorial-like things, like I have, but rather real semi-research papers.  These are papers that contain a nugget of an idea, but for which the authors seem unwilling to go all the way to "make it work."  One could imagine posting such things on arxiv.  Unfortunately, I really dislike such papers.  It's very much a "flag planting" move in my opinion, and it makes life difficult for people who follow.  That is, if I have an idea that's in someone else's semi-research paper, do I need to cite them?  Ideas are a dime a dozen: making it work is often the hard part.  I don't think you should get to flag plant without going through the effort of making it work.  But that's just me.)&lt;br /&gt;&lt;br /&gt;However, there is one purpose that arxiv could serve that I think would be quite valuable: literally, as an archive.  Right now, ACL has the ACL anthology.  UAI has its own repository.  ICML has a rather sad state of affairs where, from what I can tell, papers from ICML #### are just on the ICML #### web page and if that happens to go down, oh well.  
All of these things could equally well be hosted on arxiv, which has strong government support to be sustained, is open access, blah blah blah.&lt;br /&gt;&lt;br /&gt;This brings me to a question for you all: how would you feel if &lt;span style="font-style: italic;"&gt;all&lt;/span&gt; (or nearly all) ICML papers were to be published on arxiv?  That is, if your paper is accepted, instead of uploading a camera-ready PDF to the ICML conference manager website, you instead uploaded to arxiv and then sent your arxiv DOI link to the ICML folks?&lt;br /&gt;&lt;br /&gt;&lt;form method="post" action="http://poll.pollcode.com/yyB"&gt;&lt;table style="background-color: rgb(238, 238, 238); color: rgb(0, 0, 0); font-family: 'Verdana'; font-size: 13px;" width="350" border="0" cellpadding="2" cellspacing="0"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td colspan="2" style="padding: 2px;"&gt;&lt;strong&gt;How do you feel about arxiving ICML?&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width="5"&gt;&lt;input name="answer" value="1" type="radio"&gt;&lt;/td&gt;&lt;td style="padding: 2px;"&gt;No, please don't put my paper on arxiv.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width="5"&gt;&lt;input name="answer" value="2" type="radio"&gt;&lt;/td&gt;&lt;td style="padding: 2px;"&gt;I'm happy to have my paper on arxiv, but you should do it for me!&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width="5"&gt;&lt;input name="answer" value="3" type="radio"&gt;&lt;/td&gt;&lt;td style="padding: 2px;"&gt;I'm happy to upload my paper to arxiv.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td colspan="2"&gt;&lt;center&gt;&lt;input value="Vote" type="submit"&gt;  &lt;input name="view" value="View" type="submit"&gt;&lt;/center&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td colspan="2" bg="" style="color: white;" align="right"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/form&gt;Obviously there are some constraints, so there would need to be an opt-out policy, but I'm curious how everyone feels about 
this....&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-7221439898325560811?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/7221439898325560811/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=7221439898325560811' title='43 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/7221439898325560811'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/7221439898325560811'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2010/01/arxiv-and-nlp-ml-and-computer-science.html' title='ArXiV and NLP, ML and Computer Science'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>43</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-6087417128880409526</id><published>2009-12-30T16:18:00.001-07:00</published><updated>2010-01-12T09:04:34.493-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='community'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>Some random NIPS thoughts...</title><content type='html'>I missed the first two days of NIPS due to teaching.  Which is sad -- I heard there were great things on the first day.  I did end up seeing a lot that was nice.  
But since I missed stuff, I'll instead post some paper suggestions from one of my students, &lt;a href="http://www.cs.utah.edu/%7Epiyush/"&gt;Piyush Rai&lt;/a&gt;, who was there.  You can tell his biases from his selections, but that's life :).  More of my thoughts after his notes...&lt;br /&gt;&lt;br /&gt;Says Piyush:&lt;br /&gt;&lt;blockquote&gt;There was an interesting tutorial by &lt;a href="http://nips.cc/Conferences/2009/Program/speaker-info.php?ID=6358"&gt;Gunnar Martinsson&lt;/a&gt; on using randomization to speed up matrix factorization (SVD, PCA etc) of really really large matrices (by "large", I mean something like 10^6 x 10^6). People typically use Krylov subspace methods (e.g., the Lanczos algo) but these require multiple passes over the data. It turns out that with the randomized approach, you can do it in a single pass or a small number of passes (so it can be useful in a streaming setting).  The idea is quite simple. Let's assume you want the top K evals/evecs of a large matrix A. The randomized method draws K *random* vectors from a Gaussian and uses them in some way (details &lt;a href="http://amath.colorado.edu/faculty/martinss/Talks/2009_NIPS_tutorial.pdf"&gt;here&lt;/a&gt;) to get a "smaller version" B of A on which doing SVD can be very cheap. 
Having got the evals/evecs of B, a simple transformation will give you the same for the original matrix A.&lt;br /&gt;The success of many matrix factorization methods (e.g., the Lanczos) also depends on how quickly the spectrum decays (eigenvalues) and they also suggest ways of dealing with cases where the spectrum doesn't quite decay that rapidly.&lt;br /&gt;&lt;br /&gt;Some papers from the main conference that I found interesting:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips22/NIPS2009_0523.pdf"&gt;Distribution Matching for Transduction&lt;/a&gt; (Alex Smola and 2 other guys): They use maximum mean discrepancy (MMD) to do predictions in a transduction setting (i.e., when you also have the test data at training time). The idea is to use the fact that we expect the output functions f(X) and f(X') to be the same or close to each other (X are training and X' are test inputs). So instead of using the standard regularized objective used in the inductive setting, they use the distribution discrepancy (measured by say D) of f(X) and f(X') as a regularizer. D actually decomposes over pairs of training and test examples so one can use a stochastic approximation of D (D_i for the i-th pair of training and test inputs) and do something like an SGD.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.cse.ohio-state.edu/%7Embelkin/papers/SSL_SEB_NIPS_09.pdf"&gt;Semi-supervised Learning using Sparse Eigenfunction Bases&lt;/a&gt; (Sinha and Belkin from Ohio): This paper uses the cluster assumption of semi-supervised learning. They use unlabeled data to construct a set of basis functions and then use labeled data in the LASSO framework to select a sparse combination of basis functions to learn the final classifier.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf"&gt;Streaming k-means approximation&lt;/a&gt; (Nir Ailon et al.): This paper does an online optimization of the k-means objective function. 
The algo is based on the previously proposed kmeans++ algorithm.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://psiexp.ss.uci.edu/research/papers/RankAggregation_Distribute.pdf"&gt;The Wisdom of Crowds in the Recollection of Order Information&lt;/a&gt;.  It's about aggregating rank information from various individuals to reconstruct the global ordering.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.cc.gatech.edu/%7Esyang46/papers/nips09.pdf"&gt;Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora&lt;/a&gt; (by some folks at gatech): The problem setting is interesting here. Here "multi-instance" is a bit of a misnomer. It means that each example in turn can consist of several sub-examples (which they call instances). E.g., a document consists of several paragraphs, or a webpage consists of text, images, videos.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://mlg.eng.cam.ac.uk/porbanz/reports/porbanz_NIPS09TR.pdf"&gt;Construction of Nonparametric Bayesian Models from Parametric Bayes Equations&lt;/a&gt; (Peter Orbanz): If you care about Bayesian nonparametrics. :) It basically builds on the Kolmogorov consistency theorem to formalize, and sort of give a recipe for, the construction of nonparametric Bayesian models from their parametric counterparts. Seemed to be a good step in the right direction.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.nips.cc/papers/files/nips22/NIPS2009_0464.pdf"&gt;Indian Buffet Processes with Power-law Behavior&lt;/a&gt; (YWT and Dilan Gorur): This paper actually does the exact opposite of what I had thought of doing for IBP. The IBP (much like the Dirichlet process) encourages the "rich-gets-richer" phenomenon in the sense that a dish that has been already selected by a lot of customers is highly likely to be selected by future customers as well. This leads to the expected number of dishes (and thus of latent features) being something like O(alpha * log n). 
This paper tries to be even more aggressive and makes the relationship have a power-law behavior. What I wanted to do was a reverse behavior -- maybe more like a "socialist IBP" :) where the customers in IBP are sort of evenly distributed across the dishes.&lt;/blockquote&gt;The rest of this post is random thoughts that occurred to me at NIPS.  Maybe some of them will get other people's wheels turning?  This was originally an email I sent to my students, but I figured I might as well post it for the world.  But forgive the lack of capitalization :):&lt;br /&gt;&lt;br /&gt;persi diaconis' invited talk about reinforcing random walks... that is,  you take a random walk, but every time you cross an edge, you increase  the probability that you re-cross that edge (see coppersmith + diaconis,  rolles + diaconis).... this relates to a post i had a while ago:  nlpers.blogspot.com/2007/04/multinomial-on-graph.html ... i'm thinking  that you could set up a reinforcing random walk on a graph to achieve  this.  the key problem is how to compute things -- basically what you  want is to know for two nodes i,j in a graph and some n &gt;= 0, whether  there exists a walk from i to j that takes exactly n steps.  seems like  you could craft a clever data structure to answer this question, then  set up a graph multinomial based on this, with reinforcement (the  reinforcement basically looks like the additive counts you get from  normal multinomials)... if you force n=1 and have a fully connected  graph, you should recover a multinomial/dirichlet pair.&lt;br /&gt;&lt;br /&gt;also from persi's talk, persi and some guy sergei (sergey?) have a paper  on variable length markov chains that might be interesting to look at,  perhaps related to frank wood's sequence memoizer paper from icml last year.&lt;br /&gt;&lt;br /&gt;finally, also from persi's talk, steve mc_something from ohio has a  paper on using common gamma distributions in different rows to set  dependencies among markov chains... 
this is related to something i was  thinking about a while ago where you want to set up transition matrices  with stick-breaking processes, and to have a common, global, set of  sticks that you draw from... looks like this steve mc_something guy has  already done this (or something like it).&lt;br /&gt;&lt;br /&gt;not sure what made me think of this, but related to a talk we had here a  few weeks ago about unit tests in scheme, where they basically randomly  sample programs to "hope" to find bugs... what about setting this up as  an RL problem where your reward is high if you're able to find a bug  with a "simple" program... something like 0 if you don't find a bug, or  1/&lt;code class="moz-txt-verticalline"&gt;&lt;span class="moz-txt-tag"&gt;|&lt;/span&gt;P&lt;span class="moz-txt-tag"&gt;|&lt;/span&gt;&lt;/code&gt; if you find a bug with program P.  (i think this came up when i  was talking to percy -- liang, the other one -- about some semantics  stuff he's been looking at.)  afaik, no one in PL land has tried  ANYTHING remotely like this... it's a little tricky because of the  infinite but discrete state space (of programs), but something like an  NN-backed Q-learning might do something reasonable :P.&lt;br /&gt;&lt;br /&gt;i also saw a very cool "survey of vision" talk by bill freeman... one of  the big problems they talked about was that no one has a good p(image)  prior model.  the example given was that you usually have de-noising  models like p(image)*p(noisy image|image) and you can weight p(image) by  ^alpha... as alpha goes to zero, you should just get a copy of your  noisy image... as alpha goes to infinity, you should end up getting a  good image, maybe not the one you &lt;b class="moz-txt-star"&gt;&lt;span class="moz-txt-tag"&gt;*&lt;/span&gt;want&lt;span class="moz-txt-tag"&gt;*&lt;/span&gt;&lt;/b&gt;, but an image nonetheless.  this doesn't happen.&lt;br /&gt;&lt;br /&gt;one way you can see that this doesn't happen is in the following task.  
take two images and overlay them.  now try to separate the two.  you  &lt;b class="moz-txt-star"&gt;&lt;span class="moz-txt-tag"&gt;*&lt;/span&gt;clearly&lt;span class="moz-txt-tag"&gt;*&lt;/span&gt;&lt;/b&gt; need a good prior p(image) to do this, since you've lost half  your information.&lt;br /&gt;&lt;br /&gt;i was thinking about what this would look like in language land.  one  option would be to take two sentences and randomly interleave their  words, and try to separate them out.  i actually think that we could  solve this task pretty well.  you could probably formulate it as an FST  problem, backed by a big n-gram language model.  alternatively, you  could take two DOCUMENTS and randomly interleave their sentences, and  try to separate them out.  i think we would fail MISERABLY on this task,  since it requires actually knowing what discourse structure looks like.   a sentence n-gram model wouldn't work, i don't think.  (although maybe  it would?  who knows.)  anyway, i thought it was an interesting thought  experiment.  i'm trying to think if this is actually a real world  problem... it reminds me a bit of a paper a year or so ago where they  try to do something similar on IRC logs, where you try to track who is  speaking when... you could also do something similar on movie transcripts.&lt;br /&gt;&lt;br /&gt;hierarchical topic models with latent hierarchies  drawn from the coalescent, kind of like hdp, but not quite.  (yeah yeah  i know i'm like a parrot with the coalescent, but it's pretty freaking  awesome :P.)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;That's it!   Hope you all had a great holiday season, and enjoy your New Years (I know I'm going skiing.  A lot. So there, Fernando! 
:)).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-6087417128880409526?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/6087417128880409526/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=6087417128880409526' title='40 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/6087417128880409526'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/6087417128880409526'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/12/some-random-nips-thoughts.html' title='Some random NIPS thoughts...'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>40</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-9173611680043894328</id><published>2009-12-16T09:19:00.003-07:00</published><updated>2009-12-16T09:48:50.640-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='online learning'/><category scheme='http://www.blogger.com/atom/ns#' term='classification'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>From Kivenen/Warmuth and EG to CW learning and Adaptive Regularization</title><content type='html'>This post is a bit of a historical retrospective, because it's only recently that these things have aligned themselves in my head.&lt;br /&gt;&lt;br /&gt;This all goes back to &lt;a href="http://dx.doi.org/10.1006/inco.1996.2612"&gt;Jyrki Kivenen and Manfred Warmuth's paper on 
exponentiated gradient descent&lt;/a&gt; that dates back to STOC 1995.  For those who haven't read this paper, or haven't read it recently, it's a great read (although it tends to repeat itself a lot).  It's particularly interesting because they &lt;i&gt;derive&lt;/i&gt; gradient descent and exponentiated gradient descent (GD and EG) as a &lt;i&gt;consequence&lt;/i&gt; of other assumptions.&lt;br /&gt;&lt;br /&gt;In particular, suppose we have an online learning problem, where at each time step we receive an example &lt;i&gt;x&lt;/i&gt;, make a linear prediction &lt;i&gt;(w'x)&lt;/i&gt; and then suffer a loss.  The idea is that if we suffer no loss, then we leave &lt;i&gt;w&lt;/i&gt; as is; if we do suffer a loss, then we want to balance two goals:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Change &lt;span style="font-style: italic;"&gt;w&lt;/span&gt; enough so that we wouldn't make this error again&lt;/li&gt;&lt;li&gt;Don't change &lt;span style="font-style: italic;"&gt;w&lt;/span&gt; too much&lt;/li&gt;&lt;/ol&gt;The key question is how to define "too much."  Suppose that we measure changes in &lt;span style="font-style: italic;"&gt;w&lt;/span&gt; by looking at Euclidean distance between the updated &lt;span style="font-style: italic;"&gt;w&lt;/span&gt; and the old &lt;span style="font-style: italic;"&gt;w&lt;/span&gt;.  If we work through the math for enforcing 1 while minimizing 2, we derive the gradient descent update rule that's been used for optimizing, eg., perceptrons for squared loss for ages.&lt;br /&gt;&lt;br /&gt;The magic is what happens if we use something &lt;span style="font-style: italic;"&gt;other than&lt;/span&gt; Euclidean distance.  If, instead, we assume that the &lt;span style="font-style: italic;"&gt;w&lt;/span&gt;s are all positive, we can use an (unnormalized) KL divergence to measure differences between weight vectors.  Doing this leads to multiplicative updates, or the exponentiated gradient algorithm.&lt;br /&gt;&lt;br /&gt;(Obvious (maybe?) 
open question: what happens if you replace the distance with some other divergence, say a Bregman, or alpha or phi-divergence?)&lt;br /&gt;&lt;br /&gt;This line of thinking leads naturally to &lt;a href="http://jmlr.csail.mit.edu/papers/v7/crammer06a.html"&gt;Crammer et al.'s work on Online Passive Aggressive&lt;/a&gt; algorithms, from JMLR 2006.  Here, the idea remains the same, but instead of simply ensuring that we make a correct classification, ala rule (1) above, we ensure that we make a correct classification &lt;span style="font-style: italic;"&gt;with a margin of at least 1&lt;/span&gt;.  They use Euclidean distance to measure the difference in weight vectors, and, for many cases, can get closed-form updates that look GD-like, but not exactly GD.  (Aside: what happens if you use, eg., KL instead of Euclidean?)&lt;br /&gt;&lt;br /&gt;Two years later, &lt;a href="http://webee.technion.ac.il/people/koby/publications/icml08_variance.pdf"&gt;Mark Dredze, Koby Crammer and Fernando Pereira presented Confidence-Weighted Linear Classification&lt;/a&gt;.  The idea here is the same: don't change the weight vectors too much, but achieve good classification.  The insight here is to represent weight vectors by &lt;span style="font-style: italic;"&gt;distributions&lt;/span&gt; over weight vectors, and the goal is to change these &lt;span style="font-style: italic;"&gt;distributions&lt;/span&gt; enough, but not too much.  Here, we go back to KL, because KL makes more sense for distributions, and make a Gaussian assumption on the weight vector distribution.  (This has close connections both to PAC-Bayes and, if I'm wearing my Bayesian hat, Kalman filtering when you make a Gaussian assumption on the posterior, even though it's not really Gaussian... it would be interesting to see how these similarities play out.)&lt;br /&gt;&lt;br /&gt;The cool thing here is that you effectively get variable learning rates on different parameters, where confident parameters get moved less.  
(In practice, one really awesome effect is that you tend to only need one pass over your training data to do a good job!)  If you're interested in the Bayesian connection, you can get a very similar style of algorithm if you do &lt;a href="http://research.microsoft.com/apps/pubs/default.aspx?id=79460"&gt;EP on a Bayesian classification algorithm (by Stern, Herbrich and Graepel)&lt;/a&gt;, which is what Microsoft Bing uses for online ads.&lt;br /&gt;&lt;br /&gt;This finally brings us to NIPS this year, where &lt;a href="http://books.nips.cc/papers/files/nips22/NIPS2009_0611.pdf"&gt;Koby Crammer, Alex Kulesza and Mark Dredze presented work on Adaptive Regularization of Weight Vectors&lt;/a&gt;.  Here, they take Confidence Weighted classification and turn the constraints into pieces of the regularizer (somewhat akin to doing a Lagrangian trick).  Doing so allows them to derive a representer theorem.  But again, the intuition is exactly the same: don't change the classifier too much, but enough.&lt;br /&gt;&lt;br /&gt;All in all, this is a very interesting line of work.  
The reason I'm posting about it is because I think seeing the connections makes it easier to sort these different ideas into bins in my head, depending on what your loss is (squared versus hinge), what your classifier looks like (linear versus distribution over linear) and what your notion of "similar classifiers" is (Euclidean or KL).&lt;br /&gt;&lt;br /&gt;(Aside: &lt;a href="http://www.stat.rutgers.edu/%7Etzhang/papers/nips00-rwinnow.pdf"&gt;Tong Zhang has a paper on regularized winnow methods&lt;/a&gt;, which fits in here somewhere, but not quite as cleanly.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-9173611680043894328?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/9173611680043894328/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=9173611680043894328' title='35 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/9173611680043894328'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/9173611680043894328'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/12/from-kivenenwarmuth-and-eg-to-cw.html' title='From Kivenen/Warmuth and EG to CW learning and Adaptive Regularization'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>35</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-8565049129965680682</id><published>2009-11-17T08:30:00.003-07:00</published><updated>2009-11-17T08:57:14.265-07:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='graphical models'/><category scheme='http://www.blogger.com/atom/ns#' term='clustering'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>K-means vs GMM, sum-product vs max-product</title><content type='html'>I finished K-means and Gaussian mixture models in class last week or maybe the week before.  I've &lt;a href="http://nlpers.blogspot.com/2009/03/mixture-models-clustering-or-density.html"&gt;previously discussed&lt;/a&gt; the fact that these two are really solving different problems (despite being structurally so similar), but today's post is about something different.&lt;br /&gt;&lt;br /&gt;There are two primary differences between the typical presentation of K-means and the typical presentation of GMMs.  (I say "typical" because you can modify these algorithms fairly extensively as you see fit.)  &lt;span style="font-weight: bold;"&gt;The first difference&lt;/span&gt; is that GMMs just have more parameters.  The parameters of K-means are typically the cluster assignments ("z") and the means ("mu").  The parameters of a GMM are typically these (z and mu) as well as the class prior probabilities ("pi") and cluster covariances ("Sigma").  The GMM model is just richer.  Of course, you can restrict it so all clusters are isotropic and all prior probabilities are even, in which case you've effectively removed this difference (or you can add these things into K-means).  &lt;span style="font-weight: bold;"&gt;The second difference&lt;/span&gt; is that GMMs operate under the regime of "soft assignments," meaning that points aren't wed to clusters: they only prefer (in a probabilistic sense) some clusters to others.  
This falls out naturally from the EM formalization, where the soft assignments are simply the expectations of the assignments under the current model.&lt;br /&gt;&lt;br /&gt;One can get rid of the second difference by running "hard EM" (also called "Viterbi EM" in NLP land), where the expectations are clamped at their most likely value.  This leads to something that has much more of a K-means feel.&lt;br /&gt;&lt;br /&gt;This "real EM" versus "hard EM" distinction comes up a lot in NLP, where computing exact expectations is often really difficult.  (Sometimes you get complex variants, like the "pegging" approaches in the IBM machine translation models, but my understanding from people who run in these circles is that pegging is much ado about nothing.)  My general feeling has always been "if you don't have much data, do real EM; if you have tons of data, hard EM is probably okay."  (This is purely from a practical perspective.)  The idea is that when you have tons and tons of data, you can approximate expectations reasonably well by averaging over many data points.  (Yes, this is hand-wavy and it's easy to construct examples where it fails.  But it seems to work many times.)  Of course, you can get pedantic and say "hard EM sucks: it's maximizing p(x,z) but I really want to maximize p(x)" to which I say: ho hum, who cares, you don't actually care about p(x), you care about some extrinsic evaluation metric which, crossing your fingers, you hope correlates with p(x), but for all I know it correlates better with p(x,z).&lt;br /&gt;&lt;br /&gt;Nevertheless, a particular trusted friend has told me he's always disappointed when he can't do full EM and has to do hard EM: he's never seen a case where full EM doesn't help.  (Or maybe "rarely" would be more fair.)  
Of course, this comes at a price: for many models, maximization (search) can be done in polynomial time, but computing expectations can be #P-hard (basically because you have to enumerate -- or count -- over every possible assignment).&lt;br /&gt;&lt;br /&gt;Now let's think about approximate inference in graphical models.  Let's say I have a graphical model with some nodes I want to maximize over (call them "X") and some nodes I want to marginalize out (call them "Z").  For instance, in GMMs, the X nodes would be the means, covariances and cluster priors; the Z nodes would be the assignments.  (Note that this is departing slightly from typical notation for EM.)  Suppose I want to do inference in such a model.  Here are three things I can do:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Just run max-product.  That is, maximize p(X,Z) rather than p(X).&lt;/li&gt;&lt;li&gt;Just run sum-product.  That is, compute expectations over X and Z, rather than just over Z.&lt;/li&gt;&lt;li&gt;Run EM, by alternating something like sum-product on Z and something like max-product on X.&lt;/li&gt;&lt;/ol&gt;Of these, only (3) is really doing the "right thing."  Further, let's get away from the notion of p(X) not correlating with some extrinsic evaluation by just measuring ourselves against &lt;span style="font-style: italic;"&gt;exact inference.&lt;/span&gt;  (Yes, this limits us to relatively small models with 10 or 20 binary nodes.)&lt;br /&gt;&lt;br /&gt;What do you think happens?  Well, first, things vary as a function of the number of X nodes versus Z nodes in the graph.&lt;br /&gt;&lt;br /&gt;When most of the nodes are X (maximization) nodes, then max-product does best and EM basically does the same.&lt;br /&gt;&lt;br /&gt;When most of the nodes are Z (marginalization) nodes, then EM does best and sum-product does almost the same.  
&lt;span style="font-weight: bold;"&gt;But max product also does almost the same.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This is an effect that we've been seeing regularly, regardless of what the models look like (chains or lattices), what the potentials look like (high temperature or low temperature) and how you initialize these models (eg., in the chain case, EM converges to different places depending on initialization, while sum- and max-product do not).  &lt;span style="font-weight: bold;"&gt;Max product is just unbeatable.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In a sense, from a practical perspective, this is nice.  It says: if you have a mixed model, just run max product and you'll do just as well as if you had done something more complicated (like EM).  But it's also frustrating: we &lt;span style="font-style: italic;"&gt;should&lt;/span&gt; be getting some leverage out of marginalizing over the nodes that we should marginalize over.  Especially in the high temperature case, where there is lots of uncertainty in the model, max product should start doing worse and worse (note that when we evaluate, we only measure performance on the "X" nodes -- the "Z" nodes are ignored).&lt;br /&gt;&lt;br /&gt;Likening this back to K-means versus GMM, for the case where the models are the same (GMM restricted to not have priors or covariances), the analogy is that &lt;span style="font-style: italic;"&gt;as far as the means go, it doesn't matter which one you use.&lt;/span&gt;  Even if there's lots of uncertainty in the data.  Of course, you may get much better &lt;span style="font-style: italic;"&gt;assignments&lt;/span&gt; from GMM (or you may not, I don't really know).  But if all you really care about at the end of the day are the Xs (the means), then our experience with max-product suggests that it just won't matter.  At all.  
Ever.&lt;br /&gt;&lt;br /&gt;Part of me finds this hard to believe, and note that I haven't actually run experiments with K-means and GMM, but the results in the graphical model cases are sufficiently strong and reproducible that I'm beginning to trust them.  Shouldn't someone have noticed this before, though?  For all the effort that's gone into various inference algorithms for graphical models, why haven't we ever noticed that you just can't beat max-product?&lt;br /&gt;&lt;br /&gt;(Yes, I'm aware of some theoretical results, eg., the Wainwright result that sum-product + randomized rounding is a provably good approximation to the MAP assignment, but this result actually goes the other way, and contradicts many of our experimental studies where sum-product + rounding just flat out sucks.  Maybe there are other results out there that we just haven't been able to dig up.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-8565049129965680682?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/8565049129965680682/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=8565049129965680682' title='60 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/8565049129965680682'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/8565049129965680682'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/11/k-means-vs-gmm-sum-product-vs-max.html' title='K-means vs GMM, sum-product vs max-product'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' 
height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>60</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-5649405799149328916</id><published>2009-11-07T10:38:00.001-07:00</published><updated>2009-11-09T07:59:52.117-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='linguistics'/><category scheme='http://www.blogger.com/atom/ns#' term='problems'/><category scheme='http://www.blogger.com/atom/ns#' term='community'/><category scheme='http://www.blogger.com/atom/ns#' term='discourse'/><category scheme='http://www.blogger.com/atom/ns#' term='structured prediction'/><title type='text'>NLP as a study of representations</title><content type='html'>&lt;a href="http://www.cs.utah.edu/%7Eriloff/"&gt;Ellen Riloff&lt;/a&gt; and I run an NLP reading group pretty much every semester.  Last semester we covered "old school NLP."  We independently came up with lists of what we consider some of the most important ideas (idea = paper) from pre-1990 (most are much earlier) and let students select which to present.  There was a lot of overlap between Ellen's list and mine (not surprisingly).  &lt;span style="color: rgb(255, 0, 0);"&gt;&lt;del&gt;If people are interested, I can provide the whole list (just post a comment and I'll dig it up)&lt;/del&gt;&lt;/span&gt;.  &lt;span style="color: rgb(51, 51, 255);"&gt;The whole list of topics is posted as a comment.&lt;/span&gt;  The topics that were actually selected are &lt;a href="http://www.cs.utah.edu/nlp/nlpmtg-spring09.html"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I hope the students have found this exercise useful.  It gets you thinking about language in a way that papers from the 2000s typically do not.  It brings up a bunch of issues that we no longer think about frequently.  Like language.  (Joking.)  
(Sort of.)&lt;br /&gt;&lt;br /&gt;One thing that's really stuck out for me is how much "old school" NLP comes across essentially as a study of &lt;i&gt;representations&lt;/i&gt;.  Perhaps this is a result of the fact that AI -- as a field -- was (and, to some degree, still is) enamored with knowledge representation problems.  To be more concrete, let's look at a few examples.  It's already been a while since I read these last (I had meant to write this post during the spring when things were fresh in my head), so please forgive me if I goof a few things up.&lt;br /&gt;&lt;br /&gt;I'll start with one I know well: Mann and Thompson's rhetorical structure theory paper from 1988.  This is basically "the" RST paper.  I think that when many people think of RST, they think of it as a list of ways that sentences can be organized into hierarchies.  E.g., this sentence provides background for that one, and together they argue in favor of yet a third.  But this isn't really where RST begins.  It begins by trying to understand the communicative role of text structure.  That is, when I write, I am trying to communicate something.  Everything that I write (if I'm writing "well") is toward that end.  For instance, in this post, I'm trying to communicate that old school NLP views representation as the heart of the issue.  This current paragraph is supporting that claim by providing a concrete example, which I am using to try to &lt;span style="font-style: italic;"&gt;convince&lt;/span&gt; you of my claim.&lt;br /&gt;&lt;br /&gt;As a more detailed example, take the "Evidence" relation from RST.  M+T have the following characterization of "Evidence."  
Herein, "N" is the nucleus of the relation, "S" is the satellite (think of these as sentences), "R" is the reader and "W" is the writer:&lt;br /&gt;&lt;span style="font-style: italic;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-style: italic;"&gt;relation name:&lt;/span&gt;           Evidence&lt;br /&gt;&lt;span style="font-style: italic;"&gt;constraints on N:  &lt;/span&gt;     R might not believe N to a degree satisfactory to W&lt;br /&gt;&lt;span style="font-style: italic;"&gt;constraints on S:&lt;/span&gt;        R believes S or will find it credible&lt;br /&gt;&lt;span style="font-style: italic;"&gt;constraints on N+S:&lt;/span&gt;  R's comprehending S increases R's belief of N&lt;br /&gt;&lt;span style="font-style: italic;"&gt;the effect:&lt;/span&gt;                  R's belief of N is increased&lt;br /&gt;&lt;span style="font-style: italic;"&gt;locus of effect:&lt;/span&gt;           N&lt;br /&gt;&lt;/blockquote&gt;This is a totally different way of thinking about things than I think we see nowadays.  I kind of liken it to how I tell students &lt;span style="font-style: italic;"&gt;not&lt;/span&gt; to program.  If you're implementing something moderately complex (say, the forward/backward algorithm), first write down all the math, then start implementing.  Don't start implementing first.  I think nowadays (and sure, I'm guilty!) we see a lot of implementing without the math.  Or rather, with plenty of math, but without a representational model of what it is that we're studying.&lt;br /&gt;&lt;br /&gt;The central claim of the RST paper is that one can think of texts as being organized into elementary discourse units, and these are connected into a tree structure by relations like the one above.  (Or at least this is my reading of it.)  
That is, they have &lt;span style="font-style: italic;"&gt;laid out a representation of text&lt;/span&gt; and claimed that this is how texts get put together.&lt;br /&gt;&lt;br /&gt;As a second example (this will be shorter), take Wendy Lehnert's 1982 paper, "Plot units and narrative summarization."  Here, the story is about how stories get put together.  The most interesting thing about the plot units model to me is that it breaks from how one might naturally think about stories.  That is, I would naively think of a story as a series of events.  The claim that Lehnert makes is that this is not the right way to think about it.  Rather, we should think about stories as sequences of &lt;span style="font-style: italic;"&gt;affect states&lt;/span&gt;.  Effectively, an affect state is how a character is feeling at any time.  (This isn't quite right, but it's close enough.)  For example, Lehnert presents the following story:&lt;br /&gt;&lt;blockquote&gt;When John tried to start his car this morning, it wouldn't turn over.  He asked his neighbor Paul for help.  Paul did something to the carburetor and got it going.  John thanked Paul and drove to work.&lt;/blockquote&gt;The representation put forward for this story is something like: (1) negative-for-John (the car won't start), which leads to (2) motivation-for-John (to get it started), which leads to (3) positive-for-John (it's started), which then links back and &lt;span style="font-style: italic;"&gt;resolves&lt;/span&gt; (1).  You can also analyze the story from Paul's perspective, and then add links that go between the two characters showing how things interact.  The rest of the paper describes how these relations work, and how they can be put together into more complex event sequences (such as "promised request bungled").  
Again, a high level representation of how stories work &lt;span style="font-style: italic;"&gt;from the perspective of the characters.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;So now I, W, hope that you, R, have an increased belief in the title of the post.&lt;br /&gt;&lt;br /&gt;Why do I think this is interesting?  Because at this point, we know &lt;span style="font-style: italic;"&gt;a lot&lt;/span&gt; about how to deal with structure in language.  From a machine learning perspective, if you give me a structure and some data (and some features!), I will learn something.  It can even be unsupervised if it makes you feel better.  So in a sense, I think we're getting to a point where we can go back, look at some really hard problems, use the deep linguistic insights from two decades (or more) ago, and start taking a crack at things that are really deep.  Of course, features are a big problem; as a very wise man once said to me: "Language is hard.  The fact that statistical association mining at  the word level made it appear easy for the past decade doesn't alter  the basic truth.  &lt;span class="moz-smiley-s1"&gt;&lt;span&gt; :-)."  
We've got many of the ingredients to start making progress, but it's not going to be easy!&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-5649405799149328916?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/5649405799149328916/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=5649405799149328916' title='62 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/5649405799149328916'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/5649405799149328916'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/03/nlp-as-study-of-representations.html' title='NLP as a study of representations'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>62</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-1816082572594767533</id><published>2009-11-06T10:53:00.002-07:00</published><updated>2009-11-06T10:58:31.224-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='news'/><category scheme='http://www.blogger.com/atom/ns#' term='papers'/><category scheme='http://www.blogger.com/atom/ns#' term='bayesian'/><category scheme='http://www.blogger.com/atom/ns#' term='survey'/><title type='text'>Getting Started In: Bayesian NLP</title><content type='html'>This isn't so much a post in the "GSI" series, but just two links that recently came out.  
&lt;a href="http://www.isi.edu/%7Eknight/"&gt;Kevin Knight&lt;/a&gt; and &lt;a href="http://www.umiacs.umd.edu/%7Eresnik/"&gt;Philip Resnik&lt;/a&gt; both just came out with tutorials for Bayesian NLP.  They're both excellent, and almost entirely non-redundant.  I highly recommend reading both.  And I thank Kevin and Philip from the bottom of my heart, since I'd been toying with the idea of writing such a thing (for a few years!) and they've saved me the effort.  I'd probably start with Kevin's and then move on to Philip's (which is more technically meaty), but either order is really fine.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.isi.edu/natural-language/people/bayes-with-tears.pdf"&gt;Bayesian Inference with Tears&lt;/a&gt; by Kevin&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.umiacs.umd.edu/%7Eresnik/pubs/gibbs.pdf"&gt;Gibbs Sampling for the Uninitiated&lt;/a&gt; by Philip&lt;/li&gt;&lt;/ul&gt;Thanks again to both of them.  (And if you haven't read &lt;a href="http://www.isi.edu/natural-language/mt/wkbk.pdf"&gt;Kevin's previous workbook on SMT&lt;/a&gt; -- which promises free beer! 
-- I highly recommend that, too.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-1816082572594767533?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/1816082572594767533/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=1816082572594767533' title='55 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/1816082572594767533'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/1816082572594767533'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/11/getting-started-in-bayesian-nlp.html' title='Getting Started In: Bayesian NLP'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>55</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-9119507162108309109</id><published>2009-10-21T09:18:00.002-06:00</published><updated>2009-10-21T09:40:40.071-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bayesian'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><category scheme='http://www.blogger.com/atom/ns#' term='structured prediction'/><title type='text'>Convex or not, my schizophrenic personality</title><content type='html'>Machine learning as a field has been very convex-happy for the past decade or so.  
So much so that when I saw a tutorial on submodular optimization in ML (one of the best tutorials I've seen), they said something along the lines of "submodularity will be for this decade what convexity was for the last decade."  (Submodularity is cool and I'll post about it more in the future, but it's kind of a discrete analog of convexity.  There's a NIPS workshop on the topic coming up.)  This gives a sense of how important convexity has been.&lt;br /&gt;&lt;br /&gt;There's also a bit of an undercurrent of "convexity isn't so great" from other sides of the ML community (roughly from the neural nets folks); see, for instance, Yann LeCun's talk &lt;a href="http://videolectures.net/eml07_lecun_wia/"&gt;Who's Afraid of Non-convex Loss Functions&lt;/a&gt;, a great and entertaining talk.&lt;br /&gt;&lt;br /&gt;There's a part of me that loves convexity.  Not having to do random restarts, being assured of global convergence, etc., all sounds very nice.  I use logistic regression/maxent for almost all of my classification needs, have never run a neural network, and have only occasionally used SVMs (though of course they are convex, too).  When I teach ML (as I'm doing now), I make a big deal about convexity: it makes life easy in many ways.&lt;br /&gt;&lt;br /&gt;That said, almost none of my recent papers reflect this.  In fact, in the &lt;a href="http://hal3.name/docs/daume08flat.pdf"&gt;structure compilation paper&lt;/a&gt;, we flat out say that non-linearity in the model (which leads to a non-convex loss function) is the major reason why CRFs outperform independent classifiers in structured prediction tasks!  Moreover, whenever I start doing Bayesian stuff, usually solved with some form of &lt;a href="http://hal3.name/HBC"&gt;MCMC&lt;/a&gt;, I've completely punted on everything convex.  In a "voting with my feet" world, I couldn't care less about convexity!  For the most part, if you're using EM or sampling or whatever, you don't care much about it either.  
Somehow we (fairly easily!) tolerate whatever negative effects there are of non-convex optimization.&lt;br /&gt;&lt;br /&gt;I think one reason why such things don't bother us, as NLPers, as much as they bother the average machine learning person is that we are willing to invest some energy in intelligent initialization.  This already puts us in a good part of the optimization landscape, and doing local hillclimbing from there is not such a big deal.  A classic example is the "Klein and Manning" smart initializer for unsupervised parsing, where a small amount of human knowledge goes a long way above a random initializer.&lt;br /&gt;&lt;br /&gt;Another style of initialization is the IBM alignment model style.  IBM model 4 is, of course, highly non-convex and ridiculously difficult to optimize (the E step is intractable).  So they do a smart initialization, using the output of model 3.  Model 3, too, is highly non-convex (but not quite so much so), so they initialize with model 2.  And so on, down to model 1, which is actually convex and fairly easy to optimize.  This sequencing of simple models to complex models also happens in some statistical analysis, where you first fit first order effects and then later fit higher order effects.  The danger, of course, is that you get to a bad hill to climb, but this overall generally appears to be a bigger win than starting somewhere in the middle of a random swamp.  (Of course, later, Bob Moore had this cute argument that even though model 1 is convex, we don't actually ever optimize it to the global optimum, so doing clever initialization for model 1 is also a good idea!)&lt;br /&gt;&lt;br /&gt;These two ideas, clever initialization and sequential initialization, seem like powerful ideas that I would like to see make their way into more complex models.  For instance, in the original LDA paper, Dave Blei used an initialization where they pick some random documents as seeds for topics.  
As far as I know, no one really does this anymore (does anyone know why: does it really not matter?), but as we keep building more and more complex models, and lose hope that our off the shelf optimizer (or sampler) is going to do anything reasonable, we're probably going to need to get back to this habit, perhaps trying to formalize it in the meantime.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-9119507162108309109?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/9119507162108309109/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=9119507162108309109' title='30 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/9119507162108309109'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/9119507162108309109'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/10/convex-or-not-my-schizophrenic.html' title='Convex or not, my schizophrenic personality'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>30</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-7690788007844261334</id><published>2009-09-25T08:29:00.003-06:00</published><updated>2009-09-25T08:54:01.876-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hiring'/><title type='text'>Some notes on job search</title><content type='html'>There are tons of "how to apply for academic jobs" write-ups out there; this is not one of them.  
It's been four years (egads!) since I began my job search and there are lots of things I think I did well and lots of things I wish I had done differently.&lt;br /&gt;&lt;br /&gt;When I entered grad school, I was fairly sure that I eventually wanted a university job.  During high school, my career goal was to be a high school math teacher.  Then I went to college and realized that, no, I wanted to teach math to undergraduates.  Then I was an advanced undergraduate and realized that I wanted to teach grads and do research.  Teaching was always very important to me, though of course I fell in love with research later.  It was unfortunate that it took so long for me to actually get involved in research, but my excuse was that I wasn't in CS, where REU-style positions are plentiful and relatively easy to come by (system development, anyone?).&lt;br /&gt;&lt;br /&gt;However, the more time I spent in grad school, including an internship at MSR with Eric Brill (during which I befriended many in the NLP group at MSR, a group that I still love), the more I realized that industry labs were a totally great place to go, too.&lt;br /&gt;&lt;br /&gt;I ended up applying to basically everything under the sun, provided they had a non-zero number of faculty in either NLP or ML.  I talked (mostly off the record) with a few people about post-doc positions (I heard later that simultaneously exploring post-docs and academic positions is not a good idea: hiring committees don't like to "reconsider" people; I don't know how true this is, but I heard it too late myself to make any decisions based on it), applied for some (okay, many) tenure-track positions, some research-track positions (okay, few) and to the big three industry labs.  I wrote three cover letters, one more tailored to NLP, one more to ML and one more combined, three research statements (ditto) and one teaching statement.  In retrospect, they were pretty reasonable, I think, though not fantastic.  
I don't think I did enough to make my future research plans &lt;span style="font-style:italic;"&gt;not&lt;/span&gt; sound like "more of the same."&lt;br /&gt;&lt;br /&gt;I suppose my biggest piece of advice for applying is (to the extent possible) find someone you know and trust at the institution and try to figure out exactly what they're looking for.  Obviously you can't change who you are and the work you've done, but you definitely can sell it in slightly different ways.  This is why I essentially had three application packages -- the material was the same, the focus was different.  But, importantly, they were all &lt;span style="font-style:italic;"&gt;true&lt;/span&gt;.  The more this person trusts you, the more of the inside scoop they can give you.  For instance, we had a robotics/ML position open (which, sadly, we had to close due to budget issues), but in talking to several ML people, they felt that they weren't sufficiently "robotics"; I think I was able to disabuse them of this opinion and we ended up getting a lot of excellent applicants before we shut down the slot.&lt;br /&gt;&lt;br /&gt;Related, it's hard to sell yourself across two fields.  At the time I graduated, I saw myself as basically straddling NLP and ML.  This can be a hard sell to make.  I feel in retrospect that you're often better off picking something and really selling that aspect.  From the other side of the curtain, what often happens is that you need an advocate (or two) in the department to which you're applying.  If you sell yourself as an X person, you can get faculty in X behind you; if you sell yourself as a Y person, you can get faculty in Y behind you.  However, if you sell yourself as a mix, the X faculty might prefer a pure X and the Y faculty might prefer a pure Y.  Of course, this isn't always true: &lt;a href="http://www.cs.umd.edu"&gt;Maryland&lt;/a&gt; is basically looking for a combined NLP/ML person this year to complement their existing strengths.  
Of course, this doesn't always hold: this is something that you should try to find out from friends at the places to which you're applying.&lt;br /&gt;&lt;br /&gt;For the application process itself, my experience here and what I've heard from &lt;span style="font-style:italic;"&gt;most&lt;/span&gt; (but not all) universities is that interview decisions (who to call in) get made by a topic-specific hiring committee.  This means that to get in the door, you have to appeal to the hiring committee, which is typically people in your area, if it's an area-specific call for applications.  Typically your application will go to an admin, first, who will filter based on your cover letter to put you in the right basket (if there are multiple open slots) or the waste basket (for instance, if you don't have a PhD).  It then goes to the hiring committee.  Again, if you have a friend in the department, it's not a bad idea to let them know by email that you've applied after everything has been submitted (including letters) to make sure that you don't end up in the waste bin.&lt;br /&gt;&lt;br /&gt;Once your application gets to the hiring committee, the hope is that they've already heard of you.  But if they haven't, hopefully they've heard of at least one of your letter writers.  When we get applications, I typically first sort by whether I've heard of the applicant, then by the number of letter writers they have that I've heard of, then loosely by the reputation of their university.  And I make my way down the list, not always all the way to the bottom.  (Okay, I've only done this once, and I think I got about 90% of the way through.)&lt;br /&gt;&lt;br /&gt;In my experience, what we've looked for in applications is (a) a good research statement, including where you're going so as to distinguish yourself from your advisor, (b) a not-bad teaching statement (it's hard to get a job at a research university on a great teaching statement, but it's easy to lose an offer on a bad one... 
my feeling here is just to be concrete and not to pad it with BS -- if you don't have much to say, don't say much), (c) great letters, and (d) an impressive CV.  You should expect that the hiring committee &lt;span style="font-style:italic;"&gt;will&lt;/span&gt; read some of your papers before interviewing you.  This means that if you have dozens, you should highlight somewhere (probably the research statement) which are the best ones that they should read.  Otherwise they'll choose essentially randomly, and (depending on your publishing style) this could hurt.  As always, put your best foot forward and make it &lt;span style="font-style:italic;"&gt;easy&lt;/span&gt; for the hiring committee to find out what's so great about you.&lt;br /&gt;&lt;br /&gt;Anyway, that's basically it.  There's lots more at interview stage, but these are my feelings for application stage.  I'd be interested to hear if my characterization of the hiring process is vastly different than at other universities; plus, if there are other openings that might be relevant to NLP/ML folks, I'm sure people would be very pleased to see them in the comments section.&lt;br /&gt;&lt;br /&gt;Good luck, all you graduating folks!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-7690788007844261334?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/7690788007844261334/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=7690788007844261334' title='53 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/7690788007844261334'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/7690788007844261334'/><link rel='alternate' type='text/html' 
href='http://nlpers.blogspot.com/2009/09/some-notes-on-job-search.html' title='Some notes on job search'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>53</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-1399106941578360738</id><published>2009-09-09T15:15:00.001-06:00</published><updated>2009-09-09T15:17:35.664-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='community'/><category scheme='http://www.blogger.com/atom/ns#' term='survey'/><title type='text'>Where did you Apply to Grad School?</title><content type='html'>&lt;a href="http://www.cs.utah.edu/%7Eriloff"&gt;Ellen&lt;/a&gt; and I are interested (for obvious reasons) in how people choose what schools to apply to for grad school.  Note that this is &lt;i&gt;not&lt;/i&gt; the question of how you chose where to go.  This is about what made the list of where you actually applied.  We'd really appreciate if you'd fill out our 10-15 minute survey and pass it along to your friends (and enemies).  
If you're willing, please go &lt;a href="http://www.surveymonkey.com/s.aspx?sm=oyWkBK96uKCxs2_2fIGBnyBw_3d_3d"&gt;here&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-1399106941578360738?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/1399106941578360738/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=1399106941578360738' title='34 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/1399106941578360738'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/1399106941578360738'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/09/where-did-you-apply-to-grad-school.html' title='Where did you Apply to Grad School?'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>34</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-3100762567405425285</id><published>2009-09-07T13:34:00.002-06:00</published><updated>2009-09-07T14:15:07.321-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='papers'/><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>ACL and EMNLP retrospective, many days late</title><content type='html'>Well, ACL and EMNLP are long gone.  And sadly I missed one day of each due either to travel or illness, so most of my comments are limited to Mon/Tue/Fri.  C'est la vie.  
At any rate, here are the papers I saw or read that I really liked.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt; &lt;/p&gt;&lt;p&gt;&lt;a href="http://aclweb.org/anthology-new/P/P09/P09-1010.pdf"&gt;P09-1010&lt;/a&gt; [&lt;a href="http://aclweb.org/anthology-new/P/P09/P09-1010.bib"&gt;bib&lt;/a&gt;]: &lt;b&gt;S.R.K. Branavan; Harr Chen; Luke Zettlemoyer; Regina Barzilay&lt;/b&gt;&lt;br /&gt;&lt;i&gt;Reinforcement Learning for Mapping Instructions to Actions&lt;br /&gt;&lt;br /&gt;&lt;/i&gt;and&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;a href="http://aclweb.org/anthology-new/P/P09/P09-1011.pdf"&gt;P09-1011&lt;/a&gt; [&lt;a href="http://aclweb.org/anthology-new/P/P09/P09-1011.bib"&gt;bib&lt;/a&gt;]: &lt;b&gt;Percy Liang; Michael Jordan; Dan Klein&lt;/b&gt;&lt;br /&gt;&lt;i&gt;Learning Semantic Correspondences with Less Supervision&lt;br /&gt;&lt;br /&gt;&lt;/i&gt;these papers both address what might roughly be called the grounding problem, or at least trying to learn something about semantics by looking at data.  I really really like this direction of research, and both of these papers were really interesting.  Since I really liked both, and since I think the directions are great, I'll take this opportunity to say what I felt was a bit lacking in each.  In the Branavan paper, the particular choice of reward was both clever and a bit of a kludge.  I can easily imagine that it wouldn't generalize to other domains: thank goodness those Microsoft UI designers happened to call the Start Button something like UI_STARTBUTTON.  In the Liang paper, I worry that it relies too heavily on things like lexical match and other very domain specific properties.  
They also should have cited Fleischman and Roy, which Branavan et al did, but which many people in this area seem to miss out on -- in fact, I feel like the Liang paper is in many ways a cleaner and more sophisticated version of the Fleischman paper.&lt;br /&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="http://aclweb.org/anthology-new/P/P09/P09-1054.pdf"&gt;P09-1054&lt;/a&gt; [&lt;a href="http://aclweb.org/anthology-new/P/P09/P09-1054.bib"&gt;bib&lt;/a&gt;]: &lt;b&gt;Yoshimasa Tsuruoka; Jun’ichi Tsujii; Sophia Ananiadou&lt;/b&gt;&lt;br /&gt;&lt;i&gt;Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty&lt;br /&gt;&lt;br /&gt;&lt;/i&gt;This paper is kind of an extension of the truncated gradient approach to learning L1-regularized models that John, Lihong and Tong had last year at NIPS.  The paper did a great job of motivating why L1 penalties are hard.  The first observation is that L1 regularizers optimized by gradient steps like to "step over zero."  This is essentially&lt;i&gt; &lt;/i&gt;the observation in truncated gradient and frankly kind of an obvious one (I always thought this is how &lt;span style="font-style: italic;"&gt;everyone&lt;/span&gt; optimized these models, though of course John, Lihong and Tong actually proved something about it).  The second observation, which goes into this current paper, is that you often end up with a lot of non-zeros simply because you haven't run enough gradient steps since the last increase.  They have a clever way of accumulating these penalties lazily and applying them at the end.  It seems to do very well, is easy to implement, etc.  
But they can't (or haven't) proved anything about it.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="http://aclweb.org/anthology-new/P/P09/P09-1057.pdf"&gt;P09-1057&lt;/a&gt; [&lt;a href="http://aclweb.org/anthology-new/P/P09/P09-1057.bib"&gt;bib&lt;/a&gt;]: &lt;b&gt;Sujith Ravi; Kevin Knight&lt;/b&gt;&lt;br /&gt;&lt;i&gt;Minimized Models for Unsupervised Part-of-Speech Tagging&lt;br /&gt;&lt;br /&gt;&lt;/i&gt;I didn't actually see this paper (I think I was chairing a session at the time), but I know about it from talking to Sujith.  Anyone who considers themselves a Bayesian in the sense of "let me put a prior on that and it will solve all your ills" should read this paper.  Basically they show that sparse priors don't give you things that are sparse enough, and that by doing some ILP stuff to minimize dictionary size, you can get tiny POS tagger models that do very well.&lt;br /&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1006.pdf"&gt;D09-1006&lt;/a&gt;: [&lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1006.bib"&gt;bib&lt;/a&gt;] &lt;b&gt;&lt;author&gt;&lt;first&gt;Omar F.&lt;/first&gt; &lt;last&gt;Zaidan&lt;/last&gt;&lt;/author&gt;; &lt;author&gt;&lt;first&gt;Chris&lt;/first&gt; &lt;last&gt;Callison-Burch&lt;/last&gt;&lt;/author&gt;&lt;/b&gt;&lt;br /&gt;&lt;i&gt;Feasibility of Human-in-the-loop Minimum Error Rate Training&lt;br /&gt;&lt;br /&gt;&lt;/i&gt;Chris told me about this stuff back in March when I visited JHU and I have to say I was totally intrigued.  
Adam already discussed this paper in an earlier post, so I won't go into more details, but it's definitely a fun paper.&lt;br /&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1011.pdf"&gt;D09-1011&lt;/a&gt;: [&lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1011.bib"&gt;bib&lt;/a&gt;] &lt;b&gt;&lt;author&gt;&lt;first&gt;Markus&lt;/first&gt; &lt;last&gt;Dreyer&lt;/last&gt;&lt;/author&gt;; &lt;author&gt;&lt;first&gt;Jason&lt;/first&gt; &lt;last&gt;Eisner&lt;/last&gt;&lt;/author&gt;&lt;/b&gt;&lt;br /&gt;&lt;i&gt;Graphical Models over Multiple Strings&lt;br /&gt;&lt;br /&gt;&lt;/i&gt;This paper is just fun from a technological perspective.  The idea is to have graphical models, but where nodes are distributions over strings represented as finite state automata.  You do message passing, where your messages are now automata and you get to do all your favorite operations (or at least all of Jason's favorite operations) like intersection, composition, etc. to compute beliefs.  
Very cool results.&lt;br /&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1024.pdf"&gt;D09-1024&lt;/a&gt;: [&lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1024.bib"&gt;bib&lt;/a&gt;] &lt;b&gt;&lt;author&gt;&lt;first&gt;Ulf&lt;/first&gt; &lt;last&gt;Hermjakob&lt;/last&gt;&lt;/author&gt;&lt;/b&gt;&lt;br /&gt;&lt;i&gt;Improved Word Alignment with Statistics and Linguistic Heuristics&lt;br /&gt;&lt;br /&gt;&lt;/i&gt;Like the Haghighi coreference paper below, here we see how to do word alignment without fancy math!&lt;i&gt;&lt;br /&gt;&lt;br /&gt;&lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1120.pdf"&gt;D09-1120&lt;/a&gt;: [&lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1120.bib"&gt;bib&lt;/a&gt;] &lt;b&gt;&lt;author&gt;&lt;first&gt;Aria&lt;/first&gt; &lt;last&gt;Haghighi&lt;/last&gt;&lt;/author&gt;; &lt;author&gt;&lt;first&gt;Dan&lt;/first&gt; &lt;last&gt;Klein&lt;/last&gt;&lt;/author&gt;&lt;/b&gt;&lt;br /&gt;&lt;i&gt;Simple Coreference Resolution with Rich Syntactic and Semantic Features&lt;br /&gt;&lt;br /&gt;&lt;/i&gt;How to do coreference without math!&lt;i&gt; &lt;/i&gt; I didn't know you could still get papers accepted if they didn't have equations in them!&lt;/li&gt;&lt;/ul&gt;In general, here's a trend I've seen in both ACL and EMNLP this year.  It's the "I find a new data source and write a paper about it" trend.  I don't think this trend is either good or bad: it simply is.  A lot of these data sources are essentially Web 2.0 sources, though some are not.  Some are Mechanical Turk'd sources.  Some are the Penn Discourse Treebank (about which there were a ridiculous number of papers: it's totally unclear to me why everyone all of a sudden thinks discourse is cool just because there's a new data set -- what was wrong with the RST treebank that it turned everyone off from discourse for ten years?!  
Okay, that's being judgmental and I don't totally feel that way.  But I partially feel that way.)&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;p&gt;&lt;/p&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-3100762567405425285?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/3100762567405425285/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=3100762567405425285' title='58 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/3100762567405425285'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/3100762567405425285'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/09/acl-and-emnlp-retrospective-many-days.html' title='ACL and EMNLP retrospective, many days late'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>58</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-6973000274164228070</id><published>2009-08-14T09:40:00.002-06:00</published><updated>2009-08-14T09:58:34.125-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='evaluation'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Classifier performance: alternative metrics of success</title><content type='html'>I really enjoyed Mark Dredze's talk at EMNLP on &lt;a href="http://www.cs.jhu.edu/%7Emdredze/publications/emlnp09_mccw.pdf"&gt;multiclass confidence weighted algorithms&lt;/a&gt;, where they take 
their CW binary predictors and extend them in two (basically equivalent) ways to a multiclass/structured setting (warning: I haven't read the paper!).  Mark did a great job presenting, as always, and dryly suggested that we should all throw away our perceptrons and our MIRAs and SVMs and just switch to CW permanently.  It was a pretty compelling case.&lt;br /&gt;&lt;br /&gt;Now, I'm going to pick on basically every "yet another classifier" paper I've read in the past ten years (read: ever).  I'm not trying to point fingers, but just trying to better understand why I, personally, haven't yet switched to using these things and continue to use either logistic regression or averaged perceptron for all of my classification needs (aside from the fact that I am rather fond of a particular &lt;a href="http://hal3.name/megam"&gt;software package&lt;/a&gt; for doing these things -- note, though, that it does support PA and maybe soon CW if I decide to spend 10 minutes implementing it!).&lt;br /&gt;&lt;br /&gt;Here's the deal.  Let's look at SVM versus logreg.  Whether this is &lt;i&gt;actually&lt;/i&gt; true or not, I have this gut feeling that logreg is much less sensitive to hyperparameter selection than are SVMs.  This is not at all based on any science, and the experience that it's based on is somewhat unfair (comparing megam to libSVM, for instance, which use very different optimization methods, and libSVM doesn't do early stopping while megam does).  However, I've heard from at least two other people that they have the same internal feelings.  In other words, here's a caricature of how I believe logreg and SVM behave:&lt;br /&gt;&lt;br /&gt;&lt;img src="http://hal3.name/nlpers/lrsvm.png" /&gt;&lt;br /&gt;&lt;br /&gt;That is, if you really tune the regularizer (lambda) well, then SVMs will win out.  But for the majority of the settings, they're either the same or logreg is a bit better.&lt;br /&gt;&lt;br /&gt;As a result, what do I do?  I use logreg with lambda=1.  
That's it.  No tuning, no nothing.&lt;br /&gt;&lt;br /&gt;(Note that, as I said before, I haven't ever run experiments to verify this.  I think it would be a moderately interesting thing to try to see if it really holds up when all else -- e.g., the optimization algorithm, early stopping, implementation, choice of regularizer (L1, L2, KL, etc.), and so on -- are held constant... maybe it's not true.  But if it is, then it's an interesting theoretical question: hinge loss and log loss don't look &lt;i&gt;that&lt;/i&gt; different, despite the fact that &lt;a href="http://hunch.net/?p=85"&gt;John seems to not like how log loss diverges&lt;/a&gt;: why should this be true?)&lt;br /&gt;&lt;br /&gt;This is also why I use averaged perceptron: there aren't &lt;i&gt;any&lt;/i&gt; hyperparameters to select.  It just runs.&lt;br /&gt;&lt;br /&gt;What I'd really like to see in future "yet another classifier" papers is an analysis of sensitivity to hyperparameter selection.  You could provide graphs and stuff, but these get hard to read.  I like numbers.  I'd like a single number that I can look at.  Here are two concrete proposals for what such a number could be (note: I'm assuming you're also going to provide performance numbers at the best possible selection of hyperparameters from development data or cross validation... I'm talking about something in addition):&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Performance at a &lt;span style="font-style: italic;"&gt;default setting&lt;/span&gt; of the hyperparameter.  For instance, SVM-light uses something like average inverse norm of the data vectors as the &lt;span style="font-style: italic;"&gt;C&lt;/span&gt; parameter. Or you could just use &lt;span style="font-style: italic;"&gt;1&lt;/span&gt;, like I do for logreg.  In particular, suppose you're testing your algorithm on 20 data sets from UCI.  
Pick a single regularization parameter (or parameter selection scheme, ala SVM-light) to use for &lt;span style="font-style: italic;"&gt;all&lt;/span&gt; of them and report results using that value.  If this is about the same as the "I carefully tuned" setting, I'm happy.  If it's way worse, I'm not so happy.&lt;/li&gt;&lt;li&gt;Performance within a range.  Let's say that if I do careful hyperparameter selection then I get an accuracy of &lt;span style="font-style: italic;"&gt;X.&lt;/span&gt;  How large is the range of hyperparameters for which my accuracy is at least &lt;span style="font-style: italic;"&gt;X*0.95&lt;/span&gt;?  I.e., if I'm willing to suffer 5% multiplicative loss, how lazy can I be about hp selection?  For this, you'll probably need to grid out your performance and then do empirical integration to approximate this.  Of course, you'll need to choose a bounded range for your hp (usually zero will be a lower bound, but you'll have to pick an upper bound, too -- but this is fine: as a practitioner, if you don't give me an upper bound, I'm going to be somewhat unhappy).&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;Neither of these is totally ideal, but I think they'd be a lot better than the current situation of really having no idea!  Maybe there are other proposals out there that I don't know about, or maybe other readers have good ideas.  But for me, if you're going to convince me to switch to your algorithm, this is something that I really really want to know.&lt;br /&gt;&lt;br /&gt;(As an aside, Mark, if you're reading this, I can imagine the whole CW thing getting a bit confused if you're using feature hashing: have you tried this?  
Or has someone else?)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-6973000274164228070?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/6973000274164228070/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=6973000274164228070' title='82 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/6973000274164228070'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/6973000274164228070'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/08/classifier-performance-alternative.html' title='Classifier performance: alternative metrics of success'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>82</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-5864752401833622021</id><published>2009-08-03T17:38:00.004-06:00</published><updated>2009-08-03T17:46:07.367-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='machine translation'/><category scheme='http://www.blogger.com/atom/ns#' term='ACS'/><title type='text'>ACS: Machine Translation Papers at EMNLP</title><content type='html'>&lt;span style="color: rgb(255, 0, 0);font-family:arial;" &gt;[Guest Post by &lt;a href="http://homepages.inf.ed.ac.uk/alopez/"&gt;Adam Lopez&lt;/a&gt;... thanks, Adam!  
Hal's comment: you may remember that a while ago I proposed the idea of &lt;a href="http://nlpers.blogspot.com/search/label/ACS"&gt;conference area chairs posting summaries of their areas&lt;/a&gt;; well, Adam is the first to take me up on this idea... I still think it's a good idea, so anyone else who wants to do so in the future, let me know!]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Conferences can be exhausting, and back-to-back conferences can be &lt;i&gt;really&lt;/i&gt; exhausting, so I want to convince you to pace yourself and save some energy for EMNLP at the end of the week, because we have some really interesting MT papers.  I'll focus mainly on oral presentations, because unlike poster sessions, the parallel format of the oral sessions entails a hard choice between mutually exclusive options, and part of my motivation is to help you make that choice. That being said, there are many interesting papers at the poster session, so do take a look at them!&lt;br /&gt;&lt;br /&gt;MT is a busy research area, and we have a really diverse set of papers covering the whole spectrum of ideas: from blue sky research on novel models, formalisms, and algorithms, to the hard engineering problems of wringing higher accuracy and speed out of a mature, top-scoring NIST system.  I &lt;i&gt;occasionally&lt;/i&gt; feel that my colleagues on far reaches of either side of this spectrum are too dismissive of work on the other side; we need both if we're going to improve translation.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Outside the Box&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Before giving you a guided tour through that spectrum, I want to highlight one paper that I found thought-provoking, but hard to classify.  &lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1006.pdf"&gt;Zaidan &amp;amp; Callison-Burch&lt;/a&gt; question a basic assumption underlying most machine learning approaches to NLP: that we must optimize on an easily computable approximation to the true loss function.  
They ask: why not optimize for human judgement?  They design a metric that uses judgements on small snippets of a target sentence (defined by a spanning nonterminal in a parse tree of the aligned source sentence) and figure how many judgements they would need to collect (using Amazon Mechanical Turk) to cover an iteration of MERT, exploiting the fact that these snippets reoccur repeatedly during optimization.  How hard is this exactly?  I would say, in terms of &lt;a href="http://nlpers.blogspot.com/2006/02/art-of-loss-functions.html"&gt;this scale of loss functions&lt;/a&gt;, that their metric is a 2.  Yet, it turns out to be cheap and fast to compute.  The paper doesn't report results of an actual optimization run, but it's in the works... hopefully you'll learn more at the conference.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Connecting Theory and Practice&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;A few papers combine deep theoretical insight with convincing empirical results.  &lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1007.pdf"&gt;Hopkins &amp;amp; Langmead&lt;/a&gt; improve on &lt;a href="http://aclweb.org/anthology-new/J/J07/J07-2003.pdf"&gt;cube pruning&lt;/a&gt;, a popular approximate search technique for structured models with non-local features (i.e. translation with an integrated language model).  They move cube pruning from its ad hoc roots to a firm theoretical basis by constructing a reduction to A* search, connecting it to classical AI search literature.  This informs the derivation of new heuristics for a syntax-based translation model, including an admissible heuristic to perform &lt;i&gt;exact&lt;/i&gt; cube pruning.  It's still globally approximate, but exact for the local prediction problem that cube pruning solves (i.e., what are the n-best state splits of an item, given the n-best input states from previous deductions?).  
Amazingly, this is only slightly slower than the inexact version and improves the accuracy of a strong baseline on a large-scale Arabic-English task.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1005.pdf"&gt;Li &amp;amp; Eisner&lt;/a&gt; show how to compute a huge number of statistics efficiently over a combinatorially large number of hypotheses represented in a hypergraph.  The statistics include expected hypothesis length, feature expectation, entropy, cross-entropy, KL divergence, Bayes risk, variance of hypothesis length, gradient of entropy and Bayes risk, covariance and Hessian matrix.  It's beautifully simple: they recast the quantities of interest as semirings and run the inside (or inside-outside) algorithm.  As an example application, they perform minimum risk training on a small Chinese-English task,  reporting gains in accuracy.  For a related paper on minimum risk techniques, see the poster by &lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1147.pdf"&gt;Pauls et al.&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Novel Modeling and Learning Approaches&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1105.pdf"&gt;Tromble &amp;amp; Eisner&lt;/a&gt; also connect translation to theory by way of a novel model, framing reordering as an instance of the &lt;i&gt;linear ordering problem&lt;/i&gt;: given a matrix of pairwise ordering preferences between all words in a sentence, can we find a permutation that optimizes the global score?  This is NP-hard, but they give a reasonable approximation based on ITG, with some clever dynamic programming tricks to make it work.  Then they show how to learn the matrix and use it to reorder test sentences prior to translation, improving over the lexicalized reordering model of Moses on German-English.&lt;br /&gt;&lt;br /&gt;However, most of the new models at EMNLP are syntax-based.  
In the last few years, syntax-based modeling has focused primarily on variants of synchronous context-free grammar (SCFG).  This year there's a lot of work investigating more expressive formalisms.&lt;br /&gt;&lt;br /&gt;Two papers model translation with restricted variants of synchronous tree-adjoining grammar (STAG).  &lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1021.pdf"&gt;Carreras &amp;amp; Collins&lt;/a&gt; model syntax atop phrase pairs with a parser using sister adjunction (as in their &lt;a href="http://aclweb.org/anthology-new/W/W08/W08-2102.pdf"&gt;2008 parser&lt;/a&gt;).  The model resembles a synchronous version of Markov grammar, which also connects it to recent dependency models of translation (e.g. &lt;a href="http://aclweb.org/anthology-new/P/P08/P08-1066.pdf"&gt;Shen et al. 2008&lt;/a&gt;, &lt;a href="http://aclweb.org/anthology-new/P/P09/P09-1087.pdf"&gt;Galley et al. 2009&lt;/a&gt;, Gimpel &amp;amp; Smith below, and  &lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1123.pdf"&gt;Hassan et al.&lt;/a&gt; in the poster session).  Decoding is NP-complete, and devising efficient beam search is a key point in the paper.  The resulting system outperforms Pharaoh on German-English.  &lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1076.pdf"&gt;DeNeefe &amp;amp; Knight&lt;/a&gt; model target-side syntax via synchronous tree insertion grammar (STIG).  It's similar to synchronous tree substitution grammar (STSG; previously realized in MT as &lt;a href="http://aclweb.org/anthology-new/N/N04/N04-1035.pdf"&gt;GHKM&lt;/a&gt;) with added left- and right-adjunction operations to model optional arguments.  
They show how to reuse a lot of the STSG machinery via a grammar transformation from STIG to STSG, and the results improve on a strong Arabic-English baseline.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1023.pdf"&gt;Gimpel &amp; Smith&lt;/a&gt; use a relatively new formalism: &lt;a href="http://aclweb.org/anthology-new/W/W06/W06-3104.pdf"&gt;quasi-synchronous&lt;/a&gt; dependency grammar (QDG).  In quasi-synchronous grammar, the generation of a target syntax tree is conditioned on (but not necessarily isomorphic to) a source syntax tree.  Formally, each target node can be annotated with &lt;i&gt;any&lt;/i&gt; source node.  Since in dependency grammar the nodes are words, their QDG  model resembles a word-to-word model.  Decoding with QDG was not obvious given past work, and is one of several novel contributions of the paper.  Another is the idea that all possible biphrases can fire an associated feature, regardless of overlap.  &lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1107.pdf"&gt;Kääriäinen&lt;/a&gt; makes this idea central.  Instead of reasoning over the latent derivations of a generative model, his model directly optimizes a feature-based representation of the target sentence, where the features consist of any biphrase in the training set (per standard heuristics).  This raises some new problems -- such as how to find the target sentence given the optimal feature vector -- which are solved with dynamic programming.  The decoder doesn't quite beat Moses when used with a language model, but it's an order of magnitude faster!&lt;br /&gt;&lt;br /&gt;Three other papers operate on STSG models, with an emphasis on learning techniques. 
&lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1037.pdf"&gt;Cohn &amp;amp; Blunsom&lt;/a&gt; reformulate tree-to-string STSG induction as a problem in non-parametric Bayesian inference, extending &lt;a href="http://aclweb.org/anthology-new/N/N09/N09-1062.pdf"&gt;their TSG model&lt;/a&gt; for monolingual parsing, and removing the dependence on heuristics over noisy GIZA++ word alignments.   The model produces more compact rules, and outperforms GHKM on a Chinese-English task. This is a hot topic: check out &lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1136.pdf"&gt;Liu &amp;amp; Gildea&lt;/a&gt;'s poster for an alternative Bayesian formulation of the same problem and language pair.  &lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1039.pdf"&gt;Galron et al.&lt;/a&gt; look at &lt;i&gt;tree-to-tree&lt;/i&gt; STSG (from a Data-Oriented Parsing perspective), with an eye towards discriminatively learning STSG rules to optimize for translation accuracy.&lt;br /&gt;&lt;br /&gt;Bayesian inference also figures in the model of &lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1075.pdf"&gt;Chung &amp; Gildea&lt;/a&gt;, who aim at bilingually-informed segmentation of a source language.  The model is like IBM Model 1, except that the source positions are actually substrings of the source instead of single positions.  Reasoning over the substring boundaries makes it resemble an HMM, and they use a sparse prior to avoid overfitting.  Tokenizing new text uses the marginal distribution on source language segmentations, and this performs almost as well as a supervised segmenter on Chinese, and better on Korean, in end-to-end translation.&lt;br /&gt;&lt;br /&gt;SCFG models aren't completely forgotten: &lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1073.pdf"&gt;Zhang &amp; Li&lt;/a&gt; offer a new twist on reordering in binary-branching SCFG. 
Given a source parse, we could &lt;a href="http://aclweb.org/anthology-new/C/C08/C08-1127.pdf"&gt;train a maximum entropy classifier&lt;/a&gt; to decide whether any binary production should be inverted; this requires a lot of computation over sparse vectors.  They instead represent the features implicitly using a &lt;a href="http://books.nips.cc/papers/files/nips14/AA58.pdf"&gt;tree convolution kernel&lt;/a&gt;, showing nice gains in Chinese-English.&lt;br /&gt;&lt;br /&gt;On the algorithmic side, &lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1079.pdf"&gt;Levenberg &amp; Osborne&lt;/a&gt; look at language modeling under the condition that we have unbounded data streams in both source and target language, bounded computation, and the desire to bias our language model towards more recent language use without constantly retraining it.  They accomplish this with online perfect hashing (extending &lt;a href="http://aclweb.org/anthology-new/P/P08/P08-1058.pdf"&gt;previous&lt;/a&gt; &lt;a href="http://aclweb.org/anthology-new/P/P07/P07-1065.pdf"&gt;work&lt;/a&gt;) in a succinct data structure that supports deletions, showing that they can draw on recent information in both the source and the target to incrementally update the model while keeping a bounded memory footprint.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1050.pdf"&gt;Bai et al.&lt;/a&gt; focus on the problem of acquiring multiword expressions (i.e. idioms), showing why typical word alignment methods fail, and using a combination of statistical association measures and heuristics to fix the problem, with small gains in Chinese-English.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Decoding&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Since SCFG models have become mainstream, there's been a greater emphasis on decoding.  
Following a &lt;a href="http://aclweb.org/anthology-new/N/N06/N06-1033.pdf"&gt;recent strand&lt;/a&gt; of &lt;a href="http://aclweb.org/anthology-new/N/N09/N09-1026.pdf"&gt;research&lt;/a&gt; on grammar transformations for SCFG, &lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1038.pdf"&gt;Xiao et al.&lt;/a&gt; observe that, in the space of possible transformations, many will pair source yields with huge numbers of target yields, which compete during decoding and thus result in more search errors.  The trick is to select a transform that distributes target yields more evenly across source yields.  They pose this as an optimization problem and give a greedy algorithm; the resulting grammar is reliably better under a variety of conditions on a Chinese-English task. Meanwhile, &lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1108.pdf"&gt;Zhang et al.&lt;/a&gt; engineer more efficient STSG decoding for the case in which the source is a parse forest and source units are tree fragments.  The trick is to encode translation rules in the tree path equivalent of a prefix tree. On Chinese-English this improves decoding speed and ultimately translation accuracy, because the decoder can consider larger fragments much more efficiently.  Finally, see &lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1117.pdf"&gt;Finch &amp; Sumita&lt;/a&gt;'s comprehensive poster on bidirectional phrase-based decoding for a huge number of language pairs.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Onwards and Upwards&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The align/extract/MERT pipeline popularized by Moses and other NIST-style systems is incredibly hard to improve, but several papers manage just that.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1024.pdf"&gt;Hermjakob&lt;/a&gt;'s word aligner starts from lexical translation parameters learned by a statistical alignment model. 
Then, following some fairly general observations on different linguistic classes of words, it uses some well-motivated heuristics to fix a whole bunch of little things that many more principled models ignore: the different behavior of content words (improved via careful manipulation of pointwise mutual information) and function words (improved via constraints from parse structure) is treated along with careful handling of numbers, transliterations, and morphology to give strong improvements in Arabic-English.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1106.pdf"&gt;Liu et al.&lt;/a&gt; then extract phrases by relaxing &lt;a href="http://aclweb.org/anthology-new/N/N03/N03-1017.pdf"&gt;standard heuristic constraints&lt;/a&gt;.  Given a posterior probability for every alignment point, they simply calculate the probability that a phrase would be extracted, and use this as their count in the typical frequency-based estimate.  It's efficient and improves Chinese-English.&lt;br /&gt;&lt;br /&gt;Three papers incorporate new feature types into strong baseline translation models, following a &lt;a href="http://aclweb.org/anthology-new/D/D08/D08-1024.pdf"&gt;recent&lt;/a&gt; &lt;a href="http://aclweb.org/anthology-new/N/N09/N09-1025.pdf"&gt;trend&lt;/a&gt;. &lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1008.pdf"&gt;Shen et al.&lt;/a&gt; devise some clever local features using source-side context, derivation span length, and dependency modeling to make impressive improvements on an already &lt;a href="http://aclweb.org/anthology-new/P/P08/P08-1066.pdf"&gt;impressive baseline system&lt;/a&gt; in both Chinese-English and Arabic-English.  
&lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1074.pdf"&gt;Matsoukas et al.&lt;/a&gt; then show how a mixed-genre system can effectively be adapted for a particular target domain, by using a small amount of data to tune weights tied to genre and collection types in the training corpus, again with strong results in Arabic-English.  &lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1022.pdf"&gt;Mauser et al.&lt;/a&gt; take their previous &lt;a href="http://aclweb.org/anthology-new/D/D08/D08-1039.pdf"&gt;triplet lexicon model&lt;/a&gt; (a probabilistic feature using an outside source word as additional conditioning context) and move it from a reranking step into the decoding step, with a nice experimental treatment showing improvements in large-scale Chinese-English and Arabic-English.&lt;br /&gt;&lt;br /&gt;If you've seen the latest NIST results, you know that system combination gives &lt;i&gt;huge&lt;/i&gt; improvements.   Check out posters by &lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1125.pdf"&gt;He &amp; Toutanova&lt;/a&gt;, &lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1114.pdf"&gt;Duan et al.&lt;/a&gt;, and &lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1115.pdf"&gt;Feng et al.&lt;/a&gt; to learn the latest techniques.  Last but not least, if you need a strategy for language pairs with very little parallel data, the poster by &lt;a href="http://aclweb.org/anthology-new/D/D09/D09-1141.pdf"&gt;Nakov &amp; Ng&lt;/a&gt; will interest you.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Thanks&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;EMNLP was the first time I've been area chair for a conference, and it was really rewarding to work with such great volunteers and to see the great papers that were selected (I should note here that I included two papers &lt;i&gt;not&lt;/i&gt; on my track that I'm quite familiar with -- the ones from Edinburgh).  It was also very enlightening, but that's another story.  
Many thanks to Hal for &lt;a href="http://nlpers.blogspot.com/2008/06/old-school-conference-blogging.html"&gt;offering this forum&lt;/a&gt; to share the results!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-5864752401833622021?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/5864752401833622021/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=5864752401833622021' title='151 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/5864752401833622021'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/5864752401833622021'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/08/acs-machine-translation-papers-at-emnlp.html' title='ACS: Machine Translation Papers at EMNLP'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>151</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-4852035910034519912</id><published>2009-08-01T17:17:00.003-06:00</published><updated>2009-08-01T18:10:24.840-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>Destination: Singapore</title><content type='html'>Welcome to everyone to &lt;a href="http://www.acl-ijcnlp-2009.org/"&gt;ACL&lt;/a&gt;!  It's pretty rare for me to end up conferencing in a country I've been to before, largely because I try to avoid it.  
When I was here last time, I stayed with &lt;a href="http://www.blogger.com/www.gatsby.ucl.ac.uk/%7Eywteh"&gt;Yee Whye&lt;/a&gt;, who was here at the time as a postdoc at NUS, and lived here previously in his youth.  As a result, he was an excellent "tour guide."  With his help, here's a list of mostly food related stuff that you should definitely try while here (see also the &lt;a href="http://www.colips.org/blog/acl-ijcnlp-2009/"&gt;ACL blog&lt;/a&gt;):&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Pepper crab.  The easiest to find are the "No Signboard" restaurant chain.  Don't wear a nice shirt unless you plan on doing laundry.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Chicken rice.  This sounds lame.  Sure, chicken is kind of tasty.  Rice is kind of tasty.  But the key is that the rice is cooked in or with melted chicken fat.  It's probably the most amazingly simple and delicious dish I've ever had.  "Yet Kun" (or something like that) is along Purvis street.&lt;/li&gt;&lt;li&gt;Especially for dessert, there's Ah Chew, a Chinese place around Liang Seah street in the Bugis area (lots of other stuff there too).&lt;/li&gt;&lt;li&gt;Hotpot is another local specialty: there is very good  spicy Szechuan hotpot around Liang Seah street.&lt;/li&gt;&lt;li&gt;For real Chinese tea, &lt;a href="http://www.tea-chapter.com.sg/"&gt;here&lt;/a&gt;.  (Funny aside: when I did this, they first asked "have you had tea before?"  Clearly the meaning is "have you had real Chinese tea prepared traditionally and tasted akin to a wine tasting?"  But I don't think I would ever ask someone "have you had wine before?"  But I also can't really think of a better way to ask this!)&lt;/li&gt;&lt;li&gt;Good late night snacks can be found at Prata stalls (eg., indian roti with curry).&lt;/li&gt;&lt;li&gt;The food court at Vivocity, despite being a food court, is very good.  
You should have some hand-pressed sugar cane juice -- very sweet, but very tasty (goes well with some spicy hotpot).&lt;/li&gt;&lt;li&gt;Chinatown has good Chinese dessert (eg., bean stuff) and frog leg porridge.&lt;/li&gt;&lt;/ul&gt;Okay, so this list is &lt;span style="font-style: italic;"&gt;all&lt;/span&gt; food.  But frankly, what else are you going to do here?  Go to malls?  :).  There's definitely nice architecture to be seen; I would recommend the Mosque off of Arab street; of course you have to go to the Esplanade (the durian-looking building); etc.  You can see a &lt;a href="http://www.cs.utah.edu/%7Ehal/photos_singapore.html"&gt;few photos&lt;/a&gt; from my last trip here.&lt;br /&gt;&lt;br /&gt;Now, I realize that most of the above list is not particularly friendly to my happy cow friends.  Here's a &lt;a href="http://www.happycow.net/gmaps/searchaddmaps.php?distance=20&amp;amp;address=Central%20Singapore,%20Singapore"&gt;list of restaurants&lt;/a&gt; that happy cow provides.  There are quite a few vegetarian options, probably partially because of the large Muslim population here.  There aren't as many vegan places, but certainly enough.  For the vegan minded, there is a &lt;a href="http://living-vegan.blogspot.com/"&gt;good blog about being vegan in Singapore&lt;/a&gt; (first post is about a recent local talk by Campbell, the author of &lt;a href="http://www.amazon.com/China-Study-Comprehensive-Nutrition-Implications/dp/1932100660/ref=sr_1_1?ie=UTF8&amp;amp;s=books&amp;amp;qid=1249169838&amp;amp;sr=8-1"&gt;The China Study&lt;/a&gt;, which I recommend everyone at least reads).  
I can't vouch for the quality of these places, but here's a short list drawn from Living Vegan:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Mushroom hotpot at &lt;a href="http://living-vegan.blogspot.com/2009/07/mushroom-hotpot-and-dimsum-buffet-ling.html"&gt;Ling Zhi&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Fried fake meat &lt;a href="http://living-vegan.blogspot.com/2009/06/delivege-udon-noodles.html"&gt;udon noodles&lt;/a&gt; (though frankly I'm not a big fan of fake meat)&lt;/li&gt;&lt;li&gt;&lt;a href="http://living-vegan.blogspot.com/2009/06/new-green-pasture-cafe.html"&gt;Green Pasture cafe&lt;/a&gt;; looks like you probably have to be a bit careful here in terms of what you order&lt;/li&gt;&lt;li&gt;&lt;a href="http://living-vegan.blogspot.com/2009/01/yes-natural-and-lotus-heart.html"&gt;Yes Natural&lt;/a&gt;; seems like it has a bunch of good options&lt;/li&gt;&lt;li&gt;&lt;a href="http://living-vegan.blogspot.com/2008/08/set-meals-at-lotus-veg-restaurant.html"&gt;Lotus Veg restaurant&lt;/a&gt;, seems to have a bunch of dim sum (see &lt;a href="http://living-vegan.blogspot.com/2008/04/dim-sum-lotus-vegetarian-restaurant.html"&gt;here&lt;/a&gt;, too)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;If you must, there's &lt;a href="http://living-vegan.blogspot.com/2008/08/vegan-pizzas-in-singapore.html"&gt;pizza&lt;/a&gt;&lt;/li&gt;&lt;li&gt;And oh-my-gosh, there's actually &lt;a href="http://living-vegan.blogspot.com/2008/03/vegetarian-chicken-rice-at-fork-and.html"&gt;veggie chicken rice&lt;/a&gt;, though it doesn't seem like it holds up to the same standards as real chicken rice (if it did, that would be impressive)&lt;/li&gt;&lt;/ul&gt;Okay, you can find more for yourself if you go through their links :).&lt;br /&gt;&lt;br /&gt;Enjoy your time here!&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Quick update:&lt;/i&gt; Totally forgot about coffee.  If you need your espresso kick, Highlander coffee (49 Kampong Bahru Road) comes the most recommended, but is a bit of a hike from the conference area.  
Of course, you could also try the local specialty: burnt crap with condensed milk (lots and lots of discussion especially on page two &lt;a href="http://www.coffeegeek.com/forums/worldregional/australasia/310510?Page=2"&gt;here&lt;/a&gt;).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-4852035910034519912?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/4852035910034519912/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=4852035910034519912' title='142 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/4852035910034519912'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/4852035910034519912'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/08/destination-singapore.html' title='Destination: Singapore'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>142</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-7278192582658457328</id><published>2009-07-21T07:41:00.000-06:00</published><updated>2009-07-21T08:41:05.699-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='linguistics'/><category scheme='http://www.blogger.com/atom/ns#' term='bayesian'/><title type='text'>Non-parametric as memorizing, in exactly the wrong way?</title><content type='html'>There is a cool view of the whole non-parametric Bayes thing that I think is very instructive.  
It's easiest to see in the case of the Pitman-Yor language modeling work by Frank Wood and Yee Whye Teh.  The view is "memorize what you can, and back off (to a parametric model) when you can't."  This is basically the "backoff" view... where NP Bayes comes in is to control the whole "what you can" aspect.  In other words, if you're doing language modeling, try to memorize two grams; but if you haven't seen enough to be confident in your memorization, back off to one grams; and if you're not confident there, back off to a uniform distribution (which is our parametric model -- the base distribution).&lt;br /&gt;&lt;br /&gt;Or, if you think about the state-splitting PCFG work (done both at Berkeley and at Stanford), basically what's going on is that we're memorizing as many differences of tree labels as we can, and then backing off to the "generic" label when memorization fails. Or if we look at Trevor Cohn's NP-Bayes answer to DOP, we see a similar thing: memorize big tree chunks, but if you can't, fall back to a simple CFG (our parametric model).&lt;br /&gt;&lt;br /&gt;Now, the weird thing is that this mode of memorization is kind of backwards from how I (as an outsider) typically interpret cognitive models (any cogsci people out there, feel free to correct me!).  If you take, for instance, morphology, there's evidence that this is exactly &lt;span style="font-style: italic;"&gt;not&lt;/span&gt; what humans do.  We (at least from a popular science perspective) basically memorize simple rules and then remember exceptions.  That is, we remember that to make the past tense of a verb, we add "-ed" (the sound, not the characters) but for certain verbs, we don't: go/went, do/did, etc.  
You do little studies where you ask people to inflect fake words and they generally follow the rule, not the exceptions (but see * below).&lt;br /&gt;&lt;br /&gt;If NP Bayes had its way on this problem (or at least if the standard models I'm familiar with had their way), they would memorize "talk" -&gt; "talked" and "look" -&gt; "looked" and so on because they're so familiar.  Sure, it would still memorize the exceptions, but it would also memorize the really common "rule" cases too... why?  Because of course it &lt;span style="font-style:italic;"&gt;could&lt;/span&gt; fall back to the parametric model, but these are so common that the standard models would really like to take advantage of the rich-get-richer phenomenon on things like DPs, thus saving themselves cost by memorizing a new "cluster" for each common word.  (Okay, this is just my gut feeling about what such models would do, but I think it's at least defensible.)  Yes, you could turn the DP "alpha" parameter down, but I guess I'm just not convinced this would do the right thing.  Maybe I should just implement such a beast but, well, this is a blog post, not a *ACL paper :P.&lt;br /&gt;&lt;br /&gt;Take as an alternative example the language modeling stuff.  Basically what it says is "if you have enough data to substantiate memorizing a 5 gram, you should probably memorize a 5 gram."  But why?  If you could get the same effect with a 2 or 3 gram, why waste the storage/time?!&lt;br /&gt;&lt;br /&gt;I guess you could argue "your prior is wrong," which is probably true for most of these models.  In which case I guess the question is "what prior does what I want?"  I don't have a problem with rich get richer -- in fact, I think it's good in this case.  
I also don't have a problem with a logarithmic growth rate in the number of exceptions (though I'd be curious how this holds up empirically -- in general, I'm a big fan of checking if your prior makes sense; for instance, Figure 3 (p16) of &lt;a href="http://hal3.name/docs#daume05dpscm"&gt;my supervised clustering paper&lt;/a&gt;).  I just don't like the notion of memorizing when you don't have to.&lt;br /&gt;&lt;br /&gt;(*) I remember back in grad school a linguist from Yale came and gave a talk at USC.  Sadly, I can't remember who it was: if anyone wants to help me out, I'd really appreciate it!  The basic claim of the talk is that humans actually memorize a lot more than we give them credit for.  The argument was in favor of humans basically memorizing all morphology and &lt;span style="font-style:italic;"&gt;not&lt;/span&gt; backing off to rules like "add -ed."  One piece of evidence in favor of this was timing information for asking people to inflect words: the timing seemed to indicate a &lt;span style="font-style:italic;"&gt;linear search&lt;/span&gt; through a long list of words that could possibly be inflected.  I won't say much more about this because I'm probably already misrepresenting it, but it's an interesting idea.  
And, if true, maybe the NP models are doing exactly what they should be doing!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-7278192582658457328?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/7278192582658457328/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=7278192582658457328' title='147 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/7278192582658457328'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/7278192582658457328'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/07/non-parametric-as-memorizing-in-exactly.html' title='Non-parametric as memorizing, in exactly the wrong way?'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>147</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-7392957434501063185</id><published>2009-07-05T07:55:00.000-06:00</published><updated>2009-07-05T08:56:01.602-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='linguistics'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Small changes beget good or bad examples?</title><content type='html'>If you compare vision research with NLP research, there are a lot of interesting parallels.  Like we both like linear models.  And conditional random fields.  And our problems are a lot harder than binary classification.  
And there are standard data sets that we've been evaluating on for decades and continue to evaluate on (I'm channeling Bob here :P).&lt;br /&gt;&lt;br /&gt;But there's one thing that happens, the difference of which is so striking, that I'd like to call it to center stage.  It has to do with "messing with our inputs."&lt;br /&gt;&lt;br /&gt;I'll spend a bit more time describing the vision approach, since it's probably less familiar to the average reader.  Suppose I'm trying to do handwriting recognition to identify digits from zero to nine (aka MNIST).  I get, say, 100 labeled zeros, 100 labeled ones, 100 labeled twos and so on.  So a total of 1000 data points.  I can train any off the shelf classifier based on pixel level features and get some reasonable performance (maybe accuracy in the 80s-90s, depending).&lt;br /&gt;&lt;br /&gt;Now, I want to insert knowledge.  The knowledge that I want to insert is some notion of invariance.  I.e., if I take an image of a zero and translate it left a little bit, it's still a zero.  Or up a little bit.  Or if I scale it up 10%, it's still a zero.  Or down 10%.  Or if I rotate it five degrees.  Or negative five.  All zeros.  The same holds for all the other digits.&lt;br /&gt;&lt;br /&gt;One way to insert this knowledge is to muck with the learning algorithm.  That's too complicated for me: I want something simpler.  So what I'll do is take my 100 zeros and 100 ones and so on and just manipulate them a bit.  That is, I'll sample a random zero, and apply some small random transformations to it, and call it another labeled example, also a zero.  Now I have 100,000 training points.  I train my off the shelf classifier based on pixel level features and get 99% accuracy or more.  The same trick works for other vision problems (e.g., recognizing animals).  (This process is so common that it's actually described in Chris Bishop's new-ish PRML book!)&lt;br /&gt;&lt;br /&gt;This is what I mean by small changes (to the input) begetting good examples.  
A slightly transformed zero is still a zero.&lt;br /&gt;&lt;br /&gt;Of course, you have to be careful.  If you rotate a six by 180 degrees, you get a nine.  If you rotate a cat by 180 degrees, you get an unhappy cat.  More seriously, if you're brave, you might start looking at a class of transformations called diffeomorphisms, which are fairly popular around here.  These are nice because of their nice mathematical properties, but un-nice because they can be slightly too flexible for certain problems.&lt;br /&gt;&lt;br /&gt;Now, let's go over to NLP land.  Do we ever futz with our inputs?&lt;br /&gt;&lt;br /&gt;Sure!&lt;br /&gt;&lt;br /&gt;In language modeling, we'll sometimes permute words or replace one word with another to get a negative example.  Noah Smith futzed with his inputs in contrastive estimation to produce negative examples by swapping adjacent words, or deleting words.&lt;br /&gt;&lt;br /&gt;In fact, try as I might, I cannot think of a single case in NLP where we make small changes to an input to get another good input: we always do it to get a &lt;span style="font-style: italic;"&gt;bad&lt;/span&gt; input!&lt;br /&gt;&lt;br /&gt;In a sense, this means that one thing that vision people have that we don't have is a notion of semantics preserving transformations.  Sure, linguists (especially those from that C-guy) study transformations.  And there's a vague sense that work in paraphrasing leads to transformations that maintain semantic equivalence.  But the thing is that we really don't know any transformations that preserve semantics.  
Moreover, some transformations that seem benign (eg., passivization) actually are not: one of my favorite papers at NAACL this year by &lt;a href="http://umiacs.umd.edu/~resnik/pubs/greene_resnik_naacl2009.pdf"&gt;Greene and Resnik showed that syntactic structure affects sentiment&lt;/a&gt; (well, them, drawing on a lot of psycholinguistics work)!&lt;br /&gt;&lt;br /&gt;I don't have a significant point to this story other than it's kind of weird.  I mentioned this to some people at ICML and got a reaction that replacing words with synonyms should be fine.  I remember doing this in high school, when word processors first started coming with thesauri packed in.  The result seemed to be that if I actually knew the word I was plugging in, life was fine... but if not, it was usually a bad replacement.  So this seems like something of a mixed bag: depending on how liberal you are with defining "synonym" you might be okay doing this, but you might also not be.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-7392957434501063185?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/7392957434501063185/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=7392957434501063185' title='174 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/7392957434501063185'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/7392957434501063185'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/07/small-changes-beget-good-or-bad.html' title='Small changes beget good or bad 
examples?'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>174</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-5007776794380796144</id><published>2009-06-30T06:50:00.003-06:00</published><updated>2009-06-30T07:13:09.141-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>ICML/COLT/UAI 2009 retrospective</title><content type='html'>This will probably be a bit briefer than my corresponding NAACL post because even by day two of ICML, I was a bit burnt out; I was also constantly swapping in other tasks (grants, etc.).  Note that John has already posted his &lt;a href="http://hunch.net/?p=813"&gt;list of papers&lt;/a&gt;.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;a href="http://www.cs.utah.edu/%7Ehal/tmp/icml/abstracts.html#317"&gt;#317&lt;/a&gt;: Multi-View Clustering via Canonical Correlation Analysis (&lt;i&gt;Chaudhuri, Kakade, Livescu, Sridharan&lt;/i&gt;).  This paper shows a new application of CCA to clustering across multiple views.  They use some Wikipedia data in experiments and actually prove something about the fact that (under certain multi-view-like assumptions), CCA does the "right thing."&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.cs.utah.edu/%7Ehal/tmp/icml/abstracts.html#295"&gt;#295&lt;/a&gt;: Learning Nonlinear Dynamic Models (&lt;i&gt;Langford, Salakhutdinov, Zhang&lt;/i&gt;).  The cool idea here is to cut a deterministic classifier in half and use its internal state as a sort of sufficient statistic.  Think about what happens if you represent your classifier as a circuit (DAG); then anywhere you cut along the circuit gives you a sufficient representation to predict.  
To avoid making circuits, they use neural nets, which have an obvious "place to cut" -- namely, the internal nodes.&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.cs.utah.edu/%7Ehal/tmp/icml/abstracts.html#364"&gt;#364&lt;/a&gt;: Online Dictionary Learning for Sparse Coding (&lt;i&gt;Mairal, Bach, Ponce, Sapiro&lt;/i&gt;).  A new approach to sparse coding; the big take-away is that it's online and fast.&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.cs.utah.edu/%7Ehal/tmp/icml/abstracts.html#394"&gt;#394&lt;/a&gt;: MedLDA: Maximum Margin Supervised Topic Models for Regression and Classification (&lt;i&gt;Zhu, Ahmed, Xing&lt;/i&gt;).  This is a very cute idea for combining objectives across topic models (namely, the variational objective) and classification (the SVM objective) to learn topics that are good for performing a classification task.&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.cs.utah.edu/%7Ehal/tmp/icml/abstracts.html#393"&gt;#393&lt;/a&gt;: Learning from Measurements in Exponential Families (&lt;i&gt;Liang, Jordan, Klein&lt;/i&gt;).  Suppose instead of seeing (x,y) pairs, you just see some statistics on (x,y) pairs -- well, you can still learn.  (In a sense, this formalizes some work out of the UMass group; see also the Bellare, Druck and McCallum paper at UAI this year.)&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.cs.utah.edu/%7Ehal/tmp/icml/abstracts.html#119"&gt;#119&lt;/a&gt;: Curriculum Learning (&lt;i&gt;Bengio, Louradour, Collobert, Weston&lt;/i&gt;).  The idea is to present examples in a well thought-out order rather than randomly.  It's a cool idea; I've tried it in the context of unsupervised parsing (the unsearn paper at ICML) and it never helped and often hurt (sadly).  
I curriculum-ified by sentence length, though, which is maybe not a good model, especially when working with WSJ10 -- maybe using vocabulary would help.&lt;/li&gt;&lt;li&gt; &lt;a href="http://www.cs.utah.edu/%7Ehal/tmp/icml/abstracts.html#319"&gt;#319&lt;/a&gt;: A Stochastic Memoizer for Sequence Data (&lt;i&gt;Wood, Archambeau, Gasthaus, James, Teh&lt;/i&gt;).  If you do anything with Markov models, you should read this paper.  The take away is: how can I learn a Markov model with (potentially) infinite memory in a linear amount of time and space, and with good "backoff" properties.  Plus, there's some cool new technology in there.&lt;/li&gt;&lt;li&gt; A Uniqueness Theorem for Clustering     &lt;i&gt;Reza Bosagh Zadeh, Shai Ben-David.  &lt;/i&gt;I already talked about this issue a bit, but the idea here is that if you fix k, then the clustering axioms become satisfiable, and are satisfied by two well known algorithms.  Fixing k is a bit unsatisfactory, but I think this is a good step in the right direction.&lt;/li&gt;&lt;li&gt;Convex Coding&lt;i&gt; David Bradley, J. Andrew Bagnell.  &lt;/i&gt;The idea is to make coding convex by making it infinite!  And then do something like boosting.&lt;/li&gt;&lt;li&gt;On Smoothing and Inference for Topic Models&lt;i&gt;  Arthur Asuncion, Max Welling, Padhraic Smyth, Yee Whye Teh.  &lt;/i&gt;If you do topic models, read this paper: basically, none of the different inference algorithms do any better than the others (perplexity-wise) if you estimate hyperparameters well.  Some are, of course, faster though.&lt;/li&gt;&lt;li&gt;Correlated Non-Parametric Latent Feature Models &lt;i&gt;Finale Doshi-Velez, Zoubin Ghahramani.&lt;/i&gt;  This is an Indian-buffet-process-like model that allows factors to be correlated.  It's somewhat in line with our &lt;a href="http://www.cs.utah.edu/%7Ehal/docs/#daume08ihfrm"&gt;own paper from NIPS&lt;/a&gt; last year.  
There's still something a bit unsatisfactory in both our approach and their approach that we can't do this "directly."&lt;/li&gt;&lt;li&gt;Domain Adaptation: Learning Bounds and Algorithms.      &lt;i&gt;Yishay Mansour, Mehryar Mohri and Afshin Rostamizadeh&lt;/i&gt;.  Very good work on some learning theory for domain adaptation based on the idea of stability.&lt;/li&gt;&lt;/ol&gt;Okay, that's it.  Well, not really: there's lots more good stuff, but those were the things that caught my eye.  Feel free to tout your own favorites in the comments.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-5007776794380796144?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/5007776794380796144/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=5007776794380796144' title='45 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/5007776794380796144'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/5007776794380796144'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/06/icmlcoltuai-2009-retrospective.html' title='ICML/COLT/UAI 2009 retrospective'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>45</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-952319119169860390</id><published>2009-06-25T14:40:00.001-06:00</published><updated>2009-06-25T14:43:08.653-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='community'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>Should there be a shared task for semi-supervised learning in NLP?</title><content type='html'>&lt;span style="color: rgb(255, 0, 0);"&gt;(Guest post by &lt;/span&gt;&lt;a style="color: rgb(255, 0, 0);" href="http://ssli.ee.washington.edu/people/duh/"&gt;Kevin Duh&lt;/a&gt;&lt;span style="color: rgb(255, 0, 0);"&gt;.  Thanks, Kevin!)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;At the &lt;a href="http://sites.google.com/site/sslnlp/"&gt;NAACL SSL-NLP Workshop&lt;/a&gt; recently, we discussed whether there ought to be a "shared task" for semi-supervised learning in NLP. The panel discussion consisted of &lt;a href="http://www.cs.utah.edu/%7Ehal/"&gt;Hal Daume&lt;/a&gt;, &lt;a href="http://www.cs.brown.edu/%7Edmcc"&gt;David McClosky&lt;/a&gt;, and &lt;a href="http://pages.cs.wisc.edu/%7Egoldberg/"&gt;Andrew Goldberg&lt;/a&gt; as panelists and audience input from Jason Eisner, Tom Mitchell, and many others. Here we will briefly summarize the points raised and hopefully solicit some feedback from blog readers.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Three motivations for a shared task&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;A1. Fair comparison of methods: A common dataset will allow us to compare different methods in an insightful way. Currently, different research papers use different datasets or data-splits, making it difficult to draw general intuitions from the combined body of research.&lt;br /&gt;&lt;br /&gt;A2. Establish a methodology for evaluating SSLNLP results: How exactly should a semi-supervised learning method be evaluated? Should we evaluate the same method for both low-resource scenarios (few labeled points, many unlabeled points) and high-resource scenarios (many labeled points, even more unlabeled points)? 
Should we evaluate the same method under different ratios of labeled/unlabeled data? Currently there is no standard methodology for evaluating SSLNLP results, which means that the completeness/quality of experimental sections in research papers varies considerably.&lt;br /&gt;&lt;br /&gt;A3. Encourage more research in the area: A shared task can potentially lower the barrier of entry to SSLNLP, especially if it involves pre-processed data and a community support network. This will make it easier for researchers in outside fields, or researchers with smaller budgets, to contribute their expertise to the field. Furthermore, a shared task can potentially direct the community research efforts in a particular subfield. For example, "online/lifelong learning for SSL" and "SSL as joint inference of multiple tasks and heterogeneous labels" (a la Jason Eisner's keynote) were identified as new, promising areas to focus on in the panel discussion. A shared task along those lines may help us rally the community behind these efforts.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Arguments against the above points&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;B1. Fair comparison: Nobody really argues against fair comparison of methods. The bigger question, however, is whether there exists a *common* dataset or task that everyone is interested in. At the SSLNLP Workshop, for example, we had papers in a wide range of areas ranging from information extraction to parsing to text classification to speech. We also had papers where the need for unlabeled data is intimately tied in to particular components of a larger system. So, a common dataset is good, but what dataset can we all agree upon?&lt;br /&gt;&lt;br /&gt;B2. Evaluation methodology: A consistent standard for evaluating SSLNLP results is nice to have, but this can be done independently from a shared task through, e.g., an influential paper or gradual recognition of its importance by reviewers. 
Further, one may argue: what makes you think that your particular evaluation methodology is the best? What makes you think people will adopt it generally, both inside and outside of the shared task?&lt;br /&gt;&lt;br /&gt;B3. Encourage more research: It is nice to lower the barriers to entry, especially if we have pre-processed data and scripts. However, it has been observed in other shared tasks that often it is the pre-processing and features that matter most (more than the actual training algorithm). This presents a dilemma: If the shared task pre-processes the data to make it easy for anyone to join, will we lose the insights that may be gained via domain knowledge? On the other hand, if we present the data in raw form, will this actually encourage outside researchers to join the field?&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Rejoinder&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;A straw poll at the panel discussion showed that people are generally in favor of looking into the idea of a shared task. The important question is how to make it work, and especially how to address counterpoints B1 (what task to choose) and B3 (how to prepare the data). We did not have enough time during the panel discussion to go through the details, but here are some suggestions:&lt;br /&gt;&lt;ul&gt;&lt;li&gt; We can view NLP problems as several big "classes" of problems: sequence prediction, tree prediction, multi-class classification, etc. In choosing a task, we can pick a representative task in each class, such as named-entity recognition for sequence prediction, dependency parsing for tree prediction, etc. This common dataset won't attract everyone in NLP, but at least it will be relevant for a large subset of researchers.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt; If participants are allowed to pre-process their own data, the evaluation might require participants to submit a supervised system along with their semi-supervised system, using the same feature set and setup, if possible. 
This may make it easier to learn from results if there are differences in pre-processing.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt; There should also be a standard supervised and semi-supervised baseline (software) provided by the shared task organizer. This may lower the barrier of entry for new participants, as well as establish a common baseline result.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Any suggestions? Thoughts?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-952319119169860390?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/952319119169860390/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=952319119169860390' title='40 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/952319119169860390'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/952319119169860390'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/06/should-there-be-shared-task-for-semi.html' title='Should there be a shared task for semi-supervised learning in NLP?'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>40</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-3983224320399781458</id><published>2009-06-20T09:43:00.000-06:00</published><updated>2009-06-20T13:53:55.391-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Why I Don't Buy Clustering Axioms</title><content type='html'>In NIPS 15, 
Jon Kleinberg presented some &lt;a href="http://www.cs.cornell.edu/home/kleinber/nips15.pdf"&gt;impossibility results for clustering&lt;/a&gt;.  The idea is to specify three axioms that all clustering functions should obey and examine those axioms.&lt;br /&gt;&lt;br /&gt;Let (X,d) be a metric space (so X is a discrete set of points and d is a metric over the points of X).  A clustering function F takes d as input and produces a clustering of the data.  The three axioms Jon states that all clustering functions should satisfy are:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Scale invariance:&lt;/span&gt; For all d, for all a&gt;0, F(ad) = F(d).  In other words, if I transform all my distances by scaling them uniformly, then the output of my clustering function should stay the same.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Richness:&lt;/span&gt; The range of F is the set of all partitions.  In other words, there isn't any bias that prohibits us from producing some particular partition.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Consistency:&lt;/span&gt; Suppose F(d) gives some clustering, C.  Now, modify d by shrinking distances within clusters of C and expanding distances between clusters in C.  Call the new metric d'.  Then F(d') = C.&lt;/li&gt;&lt;/ol&gt;Kleinberg's result is that there is no function &lt;span style="font-style: italic;"&gt;F&lt;/span&gt; that simultaneously satisfies all these requirements.  Functions can satisfy two, but never all three.  There have been a bunch of follow on papers, including one at &lt;a href="http://books.nips.cc/papers/files/nips21/NIPS2008_0383.pdf"&gt;NIPS&lt;/a&gt; last year and one that I just saw at &lt;a href="http://www.andrew.cmu.edu/user/rbosaghz/papers/slunique.pdf"&gt;UAI&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;If you think about these axioms a little bit, they kind of make sense.  
My problem is that if you think about them a lot of bit, you (or at least I) realize that they're broken.  The biggest broken one is consistency, which becomes even more broken when combined with scale invariance.&lt;br /&gt;&lt;br /&gt;What I'm going to do to convince you that consistency is broken is start with some data in which there is (what I consider) a natural clustering into two clusters.  I'll then apply consistency a few times to get something that (I think) should yield a different clustering.&lt;br /&gt;&lt;br /&gt;Let's start with some data.  The colors are my opinion as to how the data should be clustered:&lt;br /&gt;&lt;br /&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 256px; height: 196px;" src="http://www.cs.utah.edu/%7Ehal/nlpers/data1.png" alt="" border="0" /&gt;I hope you agree with my color coding.  Now, let's apply consistency.  In particular, let's move some of the red points, only reducing intra-cluster distances.  Formally, we find the closest pair of points and move things toward those.&lt;br /&gt;&lt;br /&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 256px; height: 196px;" src="http://www.cs.utah.edu/%7Ehal/nlpers/data2.png" alt="" border="0" /&gt;The arrows denote the directions these points will be moved.    To make the situation more visually appealing, let's move things into lines:&lt;br /&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 256px; height: 196px;" src="http://www.cs.utah.edu/%7Ehal/nlpers/data3.png" alt="" border="0" /&gt;Okay, this is already looking funky.  Let's make it even worse.  
Let's apply consistency again and start moving some blue points:&lt;br /&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 256px; height: 196px;" src="http://www.cs.utah.edu/%7Ehal/nlpers/data4.png" alt="" border="0" /&gt;Again, let's organize these into a line:&lt;br /&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 256px; height: 196px;" src="http://www.cs.utah.edu/%7Ehal/nlpers/data5.png" alt="" border="0" /&gt;And if I had given you this data to start with, my guess is the clustering you'd have come up with is more like:&lt;br /&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 256px; height: 196px;" src="http://www.cs.utah.edu/%7Ehal/nlpers/data6.png" alt="" border="0" /&gt;This is a violation of consistency.&lt;br /&gt;&lt;br /&gt;So, what I'd like someone to do is to argue to me &lt;span style="font-style: italic;"&gt;why&lt;/span&gt; consistency is actually a desirable property.&lt;br /&gt;&lt;br /&gt;I can come up with lots of other examples.  One reason this invariance is bad is that it renders the notion of "reference sizes" irrelevant.  This is of course a problem if you have prior knowledge (e.g., one thing measured in millimeters, the other in kilometers).  But even in the case where you don't have such knowledge, you can do the following.  Take data generated by thousands of well separated Gaussians, so that the clearly right thing to do is have one cluster per Gaussian.  Now, for each of these clusters except for one, shrink them down to single points.  This is possible by consistency.  Now, your data basically looks like thousands minus one clusters with zero intra-cluster distances and then one cluster that's spread out.  
But now it seems that the reasonable thing is to put each data point that was in this old cluster into its own cluster, essentially because I feel like the other data shows you what clusters should look like.  If you're not happy with this, apply scaling and push these points out super far from each other.  (I don't think this example is as compelling as the one I drew in pictures, but I still think it's reasonable.)&lt;br /&gt;&lt;br /&gt;Now, in the UAI paper this year, they show that if you fix the number of clusters, these axioms are now consistent.  (Perhaps this has to do with the fact that all of my "weird" examples change the number of clusters -- though frankly I don't think this is necessary... I probably could have arranged it so that the resulting green and blue clusters look like a single line that maybe should just be one cluster by itself.)  But I still feel like consistency isn't even something we want.&lt;br /&gt;&lt;br /&gt;(Thanks to the &lt;a href="http://apollonius.cs.utah.edu/mediawiki/index.php/Algorithms_Seminar"&gt;algorithms group&lt;/a&gt; at Utah for discussions related to some of these issues.)&lt;br /&gt;&lt;br /&gt;&lt;b&gt;UPDATE 20 JUNE 2009, 3:49PM EST&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Here's some data to justify the "bad things happen even when the number of clusters stays the same" claim.&lt;br /&gt;&lt;br /&gt;Start with this data:&lt;br /&gt;&lt;br /&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://www.cs.utah.edu/%7Ehal/nlpers/clusters1.png" alt="" border="0" /&gt;&lt;br /&gt;Now, move some points toward the middle (note they have to spread to the side a bit so as not to decrease inter-cluster distances).&lt;br /&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://www.cs.utah.edu/%7Ehal/nlpers/clusters2.png" alt="" border="0" /&gt;&lt;br /&gt;Yielding data like the following:&lt;br /&gt;&lt;img style="margin: 0px auto 
10px; display: block; text-align: center; cursor: pointer;" src="http://www.cs.utah.edu/%7Ehal/nlpers/clusters3.png" alt="" border="0" /&gt;&lt;br /&gt;Now, I feel like two horizontal clusters are most natural here.  But you may disagree.  What if I add some more data (i.e., this is data that would have been in the original data set too, where it clearly would have been a third cluster):&lt;br /&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://www.cs.utah.edu/%7Ehal/nlpers/clusters4.png" alt="" border="0" /&gt;&lt;br /&gt;And if you still disagree, well then I guess that's fine.  But what if there were hundreds of other clusters like that?&lt;br /&gt;&lt;br /&gt;I guess the thing that bugs me is that I seem to like clusters that have similar structures.  Even if some of these bars were rotated arbitrarily (or, preferably, in an axis-aligned manner), I would still feel like there's some information getting shared across the clusters.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-3983224320399781458?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/3983224320399781458/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=3983224320399781458' title='61 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/3983224320399781458'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/3983224320399781458'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/06/why-i-dont-buy-clustering-axioms.html' title='Why I Don&apos;t Buy Clustering 
Axioms'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>61</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-689871759604573903</id><published>2009-06-14T21:14:00.000-06:00</published><updated>2009-06-14T21:29:19.515-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='problems'/><title type='text'>NL Generation: A new problem</title><content type='html'>Those who have talked to me a lot over the years know that I really think that generation is a cool and interesting problem, but one that is hampered by a lack of clarity about what it is, or at least what the input/output is.  It's like the problem with defining summarization, but one hundred times worse.&lt;br /&gt;&lt;br /&gt;I have no idea how it came up.  I think I was talking to a bunch of people from Microsoft Research and we were talking about translation on the web and what not.  And how this is not the business model of Language Weaver.  And how when you're on the web you make money by advertising.&lt;br /&gt;&lt;br /&gt;And voila!  The most incredible NL Generation task occurred to me!  (I apologize if you ran into me at all during NAACL because this was pretty much all I could talk about :P.)  The initial idea for the task was embedded in MT, though it needn't be.  But I'll tell it in the MT setting.&lt;br /&gt;&lt;br /&gt;So I go to some web page in some weirdo language (say, French) that I don't understand (because I was a moron and took Latin instead of French or Spanish in middle school and high school).  So I ask my favorite translation system (Google or Microsoft or Babelfish or whatever) to translate it.  However, the translation system takes certain &lt;span style="font-style: italic;"&gt;liberties&lt;/span&gt; with the translation.  
In particular, it might embed a few "product placements" in the text.  For instance, maybe it's translating "Je suis vraiment soif" into English (if this is incorrect, blame Google).  And perhaps it decides that instead of translating this as "I'm really thirsty," it will translate it as "I'm really thirsty for a Snapple," or "I'm really thirsty: I could go for a Snapple," perhaps with a link to snapple.com.&lt;br /&gt;&lt;br /&gt;Product placement is all over the place, even in America where it's made fun of and kept a bit at bay.  Not so in China: any American remotely turned off by the Coca-Cola cups from which the judges on American Idol drink twice a week would be appalled by the ridiculous (my sentiment) product placement that goes on over there.  The presence of the link would probably give away that it's an ad, though of course you could leave this off.&lt;br /&gt;&lt;br /&gt;But why limit this to translation?  You could put such a piece of technology directly on blogger, or on some proxy server that you can go through to earn some extra cash browsing the web (thanks to someone -- can't remember who -- at NAACL for this latter suggestion).  I mean, you could just have random ads pop up in the middle of text on any web page, for instance one you could host on webserve.ca!&lt;br /&gt;&lt;br /&gt;(See what I did there?)&lt;br /&gt;&lt;br /&gt;So now here's a real generation problem!  You're given some text.  And maybe you're even given adwords or something like that, so you can assume that the "select which thing to advertise" problem has been solved.  (Yes, I know it's not.)  Your job is just to insert the ad in the most natural way in the text.  You could evaluate in at least two ways: click through (as is standard in a lot of this advertising business) and human judgments of naturalness.  
I think the point of product placement is to (a) get your product on the screen more or less constantly, rather than just during commercial breaks which most people skip anyway, and (b) perhaps appeal to people's subconscious.  I don't know.  My parents (used to) do advertising like things but I know next to nothing about that world.&lt;br /&gt;&lt;br /&gt;Okay, so this is slightly tongue in cheek, but not entirely.  And frankly, I wouldn't be surprised if something like it were the norm in five years.  (If you want to get more fancy, insert product placements into youtube videos!)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-689871759604573903?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/689871759604573903/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=689871759604573903' title='37 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/689871759604573903'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/689871759604573903'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/06/nl-generation-new-problem.html' title='NL Generation: A new problem'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>37</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-4754975319897954564</id><published>2009-06-12T15:30:00.000-06:00</published><updated>2009-06-12T16:38:04.722-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='papers'/><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>NAACL-HLT 2009 Retrospective</title><content type='html'>I hope this post will be a small impetus to get other people to post comments about papers they saw at NAACL (and associated workshops) that they really liked.&lt;br /&gt;&lt;br /&gt;As usual, I stayed for the whole conference, plus workshops.  As usual, I also hit that day -- about halfway through the first workshop day -- where I was totally burnt out and was wondering why I always stick around for the entire week.  That's not to say anything bad about the workshops specifically (there definitely were good ones going on, in fact, see some comments below), but I was just wiped.&lt;br /&gt;&lt;br /&gt;Anyway, I saw a bunch of papers and missed even more.  I don't think I saw any papers that I actively didn't like (or wondered how they got in), including short papers, which I think is fantastic.  Many thanks to all the &lt;a href="http://clear.colorado.edu/NAACLHLT2009/committee.html"&gt;organizers&lt;/a&gt; (Mari Ostendorf for organizing everything, Mike Collins, Lucy Vanderwende, Doug Oard and Shri Narayanan for putting together a great program, James Martin and Martha Palmer for local arrangements -- which were fantastic -- and all the other organizers who sadly we -- i.e., the NAACL board -- didn't get a chance to thank publicly).&lt;br /&gt;&lt;br /&gt;Here are some things I thought were interesting:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;a href="http://aclweb.org/anthology-new/N/N09/N09-3001.pdf"&gt;Classifier Combination Techniques Applied to Coreference Resolution&lt;/a&gt; (Vemulapalli, Luo, Pitrelli and Zitouni).  This was a student research workshop paper; in fact, it was the one that I was moderating (together with Claire Cardie).  
The student author, Smita, performed this work while at IBM; though her main research is on similar techniques applied to &lt;a href="http://users.ece.gatech.edu/%7Esmita/research.html"&gt;really cool sounding problems in recognizing stuff that happens in the classroom&lt;/a&gt;.  Somehow classifier combination, and general system combination, issues came up a lot at this conference (mostly in the hallways where &lt;span style="font-style: italic;"&gt;someone&lt;/span&gt; was begrudgingly admitting to working on something as dirty as system combination).  I used to think system combination was yucky, but I don't really feel that way anymore.  Yes, it would be nice to have one huge monolithic system that does everything, but that's often infeasible.  My main complaint with system combination stuff is that in many cases I don't really understand why it's helping, which means that unless it's applied to a problem I really care about (of which there are few), it's hard for me to take anything away.  But I think it's interesting.  Getting back to Smita's paper, the key thing she did to make this work is introduce the notion of alignments between different clusterings, which seemed like a good idea.  The results probably weren't as good as they were hoping for, but still interesting.  My only major pointers as a panelist were to try using different systems, rather than bootstrapped versions of the same system, and to take a look at the literature on consensus clustering, which is fairly relevant for this problem.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://aclweb.org/anthology-new/N/N09/N09-1014.pdf"&gt;Graph-based Learning for Statistical Machine Translation&lt;/a&gt; (Alexandrescu and Kirchhoff).  I'd heard of some of this work before in small group meetings with Andrei and Kathrin, but this is the first time I'd seen the results they presented.  
This is an MT paper, but really it's about how to do graph-based semi-supervised learning in a structured prediction context, when you have some wacky metric (read: BLEU) on which you're evaluating.  Computation is a problem, but we should just hire some silly algorithms people to figure this out for us.  (Besides, there was a paper last year at ICML -- I'm too lazy to dig it up -- that showed how to do graph-based stuff on billions of examples.)&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://aclweb.org/anthology-new/N/N09/N09-1015.pdf"&gt;Intersecting Multilingual Data for Faster and Better Statistical Translations&lt;/a&gt; (Chen, Kay and Eisele).  This is a very simple idea that works shockingly well.  Had I written this paper, "Frustrating" would probably have made it into the title.  Let's say we want an English to French phrase table.  Well, we do phrase table extraction and we get something giant and ridiculous (have you ever &lt;span style="font-style: italic;"&gt;looked&lt;/span&gt; at those phrase pairs?) that takes tons of disk space and memory, and makes translation slow (it's like the "grammar constant" in parsing that means that O(n^3) for n=40 is impractical).  Well, just make two more phrase tables, English to German and German to French, and intersect.  And voila, you have tiny phrase tables and even slightly better performance.  The only big caveat seems to be that they estimate all these things on Europarl.  What if your data sets are disjoint?  I'd be worried that you'd end up with nothing in the resulting phrase table except the/le and sometimes/quelquefois (okay, I just used that example because I love that word).&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://aclweb.org/anthology-new/N/N09/N09-2062.pdf"&gt;Quadratic Features and Deep Architectures for Chunking&lt;/a&gt; (Turian, Bergstra and Bengio).  
I definitely have &lt;span style="font-style: italic;"&gt;not&lt;/span&gt; drunk the deep architectures kool-aid, but I still think this sort of stuff is interesting.  The basic idea here stems from some work Bergstra did for modeling vision, where they replaced a linear classifier (&lt;span style="font-style: italic;"&gt;y = w'x&lt;/span&gt;) with a low rank approximation to a quadratic classifier (&lt;span style="font-style: italic;"&gt;y = w'x + sqrt[(a'x)^2 + (b'x)^2 + ... ]&lt;/span&gt;).  Here, the &lt;span style="font-style: italic;"&gt;a,b,...&lt;/span&gt; vectors are all estimated as part of the learning process (e.g., by stochastic gradient descent).  If you use a dozen of them, you get some quadratic style features, but without the expense of doing, say, an implicit (or worse, explicit) quadratic kernel.  My worry (that I asked about during the talk) is that you obviously can't initialize these things to zero or else you're in a local minimum, so you have to do some randomization and maybe that makes training these things a nightmare.  Joseph reassured me that they have initialization methods that make my worries go away.  If I have enough time, maybe I'll give it a whirl.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://aclweb.org/anthology-new/N/N09/N09-1041.pdf"&gt;Exploring Content Models for Multi-Document Summarization&lt;/a&gt; (Haghighi and Vanderwende).  This combines my two favorite things: summarization and topic models.  My admittedly biased view was they started with something similar to &lt;a href="http://hal3.name/docs#daume06bqfs"&gt;BayeSum&lt;/a&gt; and then ran a marathon.  There are a bunch of really cool ideas in here for content-based summarization.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://aclweb.org/anthology-new/N/N09/N09-1042.pdf"&gt;Global Models of Document Structure using Latent Permutations&lt;/a&gt; (Chen, Branavan, Barzilay and Karger).  
This is a really cool idea (previously mentioned in a comment on this blog) based on using generalized Mallows models for permutation modeling (incidentally, see a &lt;a href="http://www.jmlr.org/papers/volume10/huang09a/huang09a.pdf"&gt;just-appeared JMLR paper&lt;/a&gt; for some more stuff related to permutations!).  The idea is that documents on a similar topic (e.g., "cities") tend to structure their information in similar ways, which is modeled as a permutation over "things that could be discussed."  It's really cool looking, and I wonder if something like this could be used in conjunction with the paper I talk about below on summarization for scientific papers (9, below).  One concern raised during the questions that I also had was how well this would work for things not as standardized as cities, where maybe you want to express preferences of pairwise ordering, not overall permutations.  (Actually, you can do this, at least theoretically: a recent math visitor here, &lt;a href="http://www.math.duke.edu/%7Emhuber/"&gt;Mark Huber&lt;/a&gt;, has some papers on exact sampling from permutations under such partial order constraints using coupling from the past.)  The other thing that I was thinking during that talk that I thought would be totally awesome would be to do a hierarchical Mallows model.  Someone else asked this question, and Harr said they're thinking about this.  Oh, well... I guess I'm not the only one :(.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Dan Jurafsky's invited talk was awesome.  It appealed to me in three ways: as someone who loves language, as a foodie, and as an NLPer.  You just had to be there.  I can't do it justice in a post.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://aclweb.org/anthology-new/N/N09/N09-1057.pdf"&gt;More than Words: Syntactic Packaging and Implicit Sentiment&lt;/a&gt; (Greene and Resnik).  This might have been one of my favorite papers of the conference.  
The idea is that how you say things can express your point of view as much as what you say.  They look specifically at effects like passivization in English, where you might say something like "The truck drove into the crowd" rather than "The soldier drove the truck into the crowd."  The missing piece here seems to be identifying the "whodunnit" in the first sentence.  This is like figuring out subjects in languages that like to drop subjects (like Japanese).  Could probably be done; maybe it has been (I know it's been worked on in Japanese; I don't know about English).&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://aclweb.org/anthology-new/N/N09/N09-1066.pdf"&gt;Using Citations to Generate Surveys of Scientific Paradigms&lt;/a&gt; (Mohammad, Dorr, Egan, Hassan, Muthukrishan, Qazvinian, Radev and Zajic).  I really really want these guys to succeed.  They basically study how humans and machines create summaries of scientific papers when given either the text of the paper, or citation snippets to the paper.  The idea is to automatically generate survey papers.  This is actually an area I've toyed with getting into for a while.  The summarization aspect appeals to me, and I actually know and understand the customer very well.  The key issue I would like to see addressed is how these summaries vary across different users.  I've basically come to the conclusion that in summarization, if you don't pay attention to the user, you're sunk.  This is especially true here.  If I ask for a summary of generalization bound stuff, it's going to look very different than if Peter Bartlett asks for it.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://aclweb.org/anthology-new/N/N09/N09-1069.pdf"&gt;Online EM for Unsupervised Models&lt;/a&gt; (Liang and Klein).  If you want to do online EM, read this paper.  On the other hand, you're going to have to worry about things like learning rate and batch size (think Pegasos).  
I was thinking about stuff like this a year or two ago and was wondering how this would compare to doing SGD on the log likelihood directly and not doing EM at all.  Percy says that asymptotically they're the same, but who knows what they're like in the real world :).  I think it's interesting, but I'm probably not going to stop doing vanilla EM.&lt;/li&gt;&lt;/ol&gt;I then spent some time at workshops.&lt;br /&gt;&lt;br /&gt;I spent the first morning in the &lt;a href="http://www.aclweb.org/aclwiki/index.php?title=CALC-09"&gt;Computational Approaches to Linguistic Creativity&lt;/a&gt; workshop, which was just a lot of fun.  I really liked all of the morning talks: if you love language and want to see stuff somewhat off the beaten path, you should definitely read these.  I went by the &lt;a href="http://www.lsi.upc.edu/%7Elluism/sew2009/index.html"&gt;Semantic Evaluation Workshop&lt;/a&gt; for a while and learned that the most frequent sense baseline is hard to beat.  Moreover, there might be something to this discourse thing after all: &lt;a href="http://www1.ccls.columbia.edu/%7Emarine"&gt;Marine&lt;/a&gt; tells us that translators don't like to use multiple translations when one will do (akin to the one sense per discourse observation).  The biggest question in my head here is how much the direction of translation matters (e.g., when this heuristic is violated, is it violated by the translator, or the original author)?  Apparently this is under investigation.  But it's cool because it says that even MT people shouldn't just look at one sentence at a time!&lt;br /&gt;&lt;br /&gt;Andrew McCallum gave a great, million-mile-an-hour invited talk on joint inference at &lt;a href="http://www.cnts.ua.ac.be/conll"&gt;CoNLL&lt;/a&gt;.  
I'm pretty interested in this whole joint inference business, which also played a big role in Jason Eisner's invited talk (that I sadly missed) at the &lt;a href="http://sites.google.com/site/sslnlp/"&gt;semi-supervised learning workshop&lt;/a&gt;.  To me, the big question is: what happens if you don't actually care about some of the tasks?  In a probabilistic model, I suppose you'd marginalize them out... but how should you train?  In a sense, since you don't care about them, it doesn't make sense to have a real loss associated with them.  But if you don't put a loss, what are you doing?  Again, in probabilistic land you're saved because you're just modeling a distribution, but this doesn't answer the whole question.&lt;br /&gt;&lt;br /&gt;Al Aho gave a fantastically entertaining talk in the &lt;a href="http://www.cs.ust.hk/%7Edekai/ssst/"&gt;machine translation workshop&lt;/a&gt; about unnatural language processing.  How the heck they managed to get Al Aho to give an invited talk is beyond me, but I suspect we owe Dekai some thanks for this.  He pointed to some interesting work that I wasn't familiar with, both in &lt;a href="http://link.aip.org/link/?SMJCAT/1/305/1"&gt;raw parsing&lt;/a&gt; (e.g., how to parse errorful strings with a CFG when you want to find the closest in edit distance string that is parseable by a CFG) and natural language/programming &lt;a href="http://en.scientificcommons.org/40792271"&gt;language&lt;/a&gt; &lt;a href="http://portal.acm.org/citation.cfm?id=1270244"&gt;interfaces&lt;/a&gt;.  (In retrospect, the first result is perhaps obvious had I actually thought much about it, though probably not so back in 1972: you can represent edit distance by a lattice and then parse the lattice, which we know is efficient.)&lt;br /&gt;&lt;br /&gt;Anyway, there were other things that were interesting, but those are the ones that stuck in my head somehow (note, of course, that this list is unfairly biased toward my friends... what can I say? 
:P).&lt;br /&gt;&lt;br /&gt;So, off to ICML on Sunday.  I hope to see many of you there!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-4754975319897954564?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/4754975319897954564/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=4754975319897954564' title='34 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/4754975319897954564'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/4754975319897954564'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/06/naacl-hlt-2009-retrospective.html' title='NAACL-HLT 2009 Retrospective'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>34</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-9107259395898924805</id><published>2009-06-09T08:25:00.001-06:00</published><updated>2009-06-09T08:30:05.672-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>Navigating ICML</title><content type='html'>I couldn't find a schedule for ICML that had paper titles/authors written in, so I joined the existing schedule with the abstracts to create &lt;a href="http://www.cs.utah.edu/%7Ehal/tmp/icml/"&gt;a printable schedule with titles&lt;/a&gt;.  
You can find organizer-created schedules for &lt;a href="http://www.cs.mcgill.ca/%7Euai2009/schedule.html"&gt;UAI&lt;/a&gt; and &lt;a href="http://www.cs.mcgill.ca/%7Ecolt2009/schedule.html"&gt;COLT&lt;/a&gt; already.  In addition, I've put ICML 2009 up on &lt;a href="http://www.blogger.com/post-create.g?blogID=19803222"&gt;WhatToSee&lt;/a&gt;, so have fun!  (I haven't done UAI and/or COLT because their papers haven't appeared online yet.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-9107259395898924805?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/9107259395898924805/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=9107259395898924805' title='121 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/9107259395898924805'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/9107259395898924805'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/06/navigating-icml.html' title='Navigating ICML'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>121</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-1357760324510812830</id><published>2009-06-08T11:36:00.000-06:00</published><updated>2009-06-08T11:54:27.856-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>The importance of input representations</title><content type='html'>As some of you know, I run a (machine learning) 
reading group every semester.  &lt;a href="http://apollonius.cs.utah.edu/mediawiki/index.php/MLRG"&gt;This summer&lt;/a&gt; we're doing "assorted" topics, which basically means students pick a few papers from the past 24 months that are related and present on them.  The week before I went out of town, we read two papers about inferring features from raw data; one was &lt;a href="http://jmlr.csail.mit.edu/proceedings/papers/v5/salakhutdinov09a/salakhutdinov09a.pdf"&gt;a deep learning approach&lt;/a&gt;; the other was &lt;a href="http://cocosci.berkeley.edu/tom/papers/features1.pdf"&gt;more Bayesian&lt;/a&gt;.  (As a total aside, I found it funny that in the latter paper they talk a lot about trying to find independent features, but in all cog sci papers I've seen where humans list features of objects/categories, they're highly dependent: e.g., "has fur" and "barks" are reasonable features that humans like to produce that are very much not independent.  In general, I tend to think that modeling things as &lt;a href="http://hal3.name/docs#daume08ihfrm"&gt;explicitly dependent&lt;/a&gt; is a good idea.)&lt;br /&gt;&lt;br /&gt;Papers like this love to use vision examples, I guess because we actually have some understanding of how the visual cortex works (from a neuroscience perspective), which we sorely lack for language (it seems much more complicated).  They also love to start with pixel representations; perhaps this is neurologically motivated: I don't really know.  But I find it kind of funny, primarily because there's a ton of information hard wired into the pixel representation.  Why not feed .jpg and .png files directly into your system?&lt;br /&gt;&lt;br /&gt;On the language side, an analogy is the bag of words representation.  Yes, it's simple.  But only simple if you know the language.  If I handed you a bunch of text files in Arabic (suppose you'd never done any Arabic NLP) and asked you to make a bag of words, what would you do?  What about Chinese?  
There, it's well known that word segmentation is hard.  There's already a huge amount of information in a bag of words format.&lt;br /&gt;&lt;br /&gt;The question is: does it matter?&lt;br /&gt;&lt;br /&gt;Here's an experiment I did.  I took the twenty newsgroups data (standard train/test split) and made classification data.  To make the classification data, I took a posting and fed it through a module "X".  "X" produced a sequence of tokens.  I then extracted n-gram features over these tokens and threw out anything that appeared fewer than ten times.  I then trained a multiclass SVM on these (using libsvm).  The only thing that varies in this setup is what "X" does.  Here are four "X"s that I tried:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Extract words.  When composed with extracting n-gram tokens, this leads to a bag of words, bag of bigrams, bag of trigrams, etc., representation.&lt;/li&gt;&lt;li&gt;Extract characters.  This leads to character unigrams, character bigrams, etc.&lt;/li&gt;&lt;li&gt;Extract bits from characters.  That is, represent each character in its 8 bit ascii form and extract a sequence of zeros and ones.&lt;/li&gt;&lt;li&gt;Extract bits from a gzipped version of the posting.  This is the same as (3), but before extracting the data, we gzip the file.&lt;/li&gt;&lt;/ol&gt;The average word length for me is 3.55 characters, so a character ngram with length 4.5 is approximately equivalent to a bag of words model.  I've plotted the results below for everything except words (words were boring: BOW got 79% accuracy, going to higher ngram length hurt by 2-3%).  The x-axis is number of bits, so the unigram character model starts out at eight bits.  
The y-axis is accuracy:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://hal3.name/nlpers/repr.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 415px; height: 324px;" src="http://hal3.name/nlpers/repr.png" alt="" border="0" /&gt;&lt;/a&gt;As we can see, characters do well, even at the same bit sizes.  Basically you get a ton of binary sequence features from raw bits that are just confusing the classifier.  Zipped bits do markedly worse than raw bits.  The bit-based models don't extend further because it started taking gigantic amounts of memory (more than my poor 32GB machine could handle) to process and train on those files.  But 40 bits is about five characters, which is just over a word, so in theory the 40 bit models have the same information that the bag of words model (at 79% accuracy) has.&lt;br /&gt;&lt;br /&gt;So yes, it does seem that the input representation matters.  
This isn't shocking, but I've never seen anyone actually try something like this before.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-1357760324510812830?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/1357760324510812830/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=1357760324510812830' title='27 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/1357760324510812830'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/1357760324510812830'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/06/importance-of-input-representations.html' title='The importance of input representations'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>27</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-1510406730035463265</id><published>2009-06-01T09:23:00.001-06:00</published><updated>2009-06-01T09:23:52.698-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>NAACL-HLT 2006 has been WhatToSee-d!</title><content type='html'>See &lt;a href="http://www.cs.utah.edu/~hal/WhatToSee/"&gt;here&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-1510406730035463265?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://nlpers.blogspot.com/feeds/1510406730035463265/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=1510406730035463265' title='20 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/1510406730035463265'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/1510406730035463265'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/06/naacl-hlt-2006-has-been-whattosee-d.html' title='NAACL-HLT 2006 has been WhatToSee-d!'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>20</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-7446893917312833513</id><published>2009-05-30T22:18:00.000-06:00</published><updated>2009-05-30T22:18:18.355-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Semi-supervised or Semi-unsupervised?  (SSL-NLP invited papers)</title><content type='html'>&lt;a href="http://ssli.ee.washington.edu/people/duh/"&gt;Kevin Duh&lt;/a&gt; not-so-recently asked me to write a "position piece" for the workshop he's co-organizing on &lt;a href="http://sites.google.com/site/sslnlp/"&gt;semi-supervised learning in NLP&lt;/a&gt;.  I not-so-recently agreed.  And recently I actually wrote said &lt;a href="http://www.cs.utah.edu/%7Ehal/docs/daume09sslnlp.pdf"&gt;position piece&lt;/a&gt;.  You can also find a link off the workshop page.  I hope people recognize that it's intentionally a bit tongue-in-cheek.  If you want to discuss this stuff or related things in general, come to the panel at NAACL from 4:25 to 5:25 on 4 June at the workshop!  
You can read the paper for more information, but my basic point is that we can typically divide semi-supervised approaches into one lump (semi-supervised) that work reasonably well with only labeled data and are just improved with unlabeled data and one lump (semi-unsupervised) that work reasonably well with only unlabeled data and are just improved with labeled data.  The former typically encode lots of prior information; the latter do not.  Let's combine!  (Okay, my claim is more nuanced than that, but that's the high-order bit.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-7446893917312833513?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/7446893917312833513/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=7446893917312833513' title='29 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/7446893917312833513'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/7446893917312833513'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/05/semi-supervised-or-semi-unsupervised.html' title='Semi-supervised or Semi-unsupervised?  
(SSL-NLP invited papers)'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>29</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-5365060044632284141</id><published>2009-05-29T19:33:00.000-06:00</published><updated>2009-05-29T19:35:54.787-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='community'/><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>How to reduce reviewing overhead?</title><content type='html'>It's past most reviewing time (for the year), so somehow conversations I have with folks I visit tend to gravitate toward the awfulness of reviewing.  That is, there is too much to review and too much garbage among it (of course, garbage can be slightly subjective).  Reviewing plays a very important role but is a very fallible system, as everyone knows, both in terms of precision and recall.  Sometimes there even seems to be evidence of abuse.&lt;br /&gt;&lt;br /&gt;But this post isn't about good reviewing and bad reviewing.  This is about whether it's possible to cut down on the sheer volume of reviewing.  The key aspect of cutting down reviewing, to me, is that in order to be effective, the reduction has to be significant.  I'll demonstrate by discussing a few ideas that have come up, and some notes about why I think they would or wouldn't work:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Tiered reviewing (this was done at ICML this year).  The model at ICML was that everyone was guaranteed two reviews, and only a third if your paper was "good enough."  I applaud ICML for trying this, but as a reviewer I found it useless.  First, it means that at most 1/3 of reviews are getting cut (assuming all papers are bad), but in practice it's probably more like 1/6 that get reduced.  
This means that if on average a reviewer would have gotten six papers to review, he will now get five.  First, this is a very small decrease.  Second, it comes with an additional swapping overhead: effectively I now have to review for ICML &lt;span style="font-style: italic;"&gt;twice&lt;/span&gt;, which makes scheduling a much bigger pain.  It's also harder for me to be self-consistent in my evaluations.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Reject without review (this was suggested to me at dinner last night: if you'd like to de-anonymize yourself, feel free in comments).  Give area chairs the power that editors of journals have to reject papers out of hand.  This gives area chairs much more power (I actually think this is a good thing: area chairs are too often too lazy in my experience, but that's another post), so perhaps there would be a cap on the number of reject without reviews.  If this number is less than about 20%, then my reviewing load will drop in expectation from 5 to 4, which, again, is not a big deal for me.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Cap on submissions (again, a suggestion from dinner last night): authors may only submit one paper to any conference on which their name comes first.  (Yes, I know, this doesn't work in theory land where authorship is alphabetical, but I'm trying to address our issues, not someone else's.)  I've only twice in my life had two papers at a conference where my name came first, and maybe there was a third where I submitted two and one was rejected (I really can't remember).  At NAACL this year, there are four such papers; at ACL there are two.  If you assume these are equally distributed (which is probably a bad assumption, since the people who submit multiple first author papers at a conference probably submit stronger papers), then this is about 16 submissions to NAACL and 8 submissions to ACL.  
This is maybe 1-4% of submitted papers: again, something that won't really affect me as a reviewer (this, even less than the above two).&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Strong encouragement of short papers (my idea, finally, but with tweaks from others): right now I think short papers are underutilized, perhaps partially because they're seen (rightly or wrongly) as less significant than "full" papers.  I don't think this need be the case.  Short papers &lt;span style="font-style: italic;"&gt;definitely&lt;/span&gt; take less time to review.  A great "short paper tweak" that was suggested to me is to allow only 3 pages of text, but essentially arbitrarily many pages of tables/figures (probably not arbitrarily, but at least a few... plus, maybe just make data available online).  This would encourage experimental evaluation papers to be submitted as shorts (currently these papers typically just get rejected as being longs because they don't introduce really new ideas, and rejected as shorts because it's hard to fit lots of experiments in four pages).  Many long papers that appear in ACL could easily be short papers (I would guesstimate somewhere around 50%), especially ones that have the flavor of "I took method X and problem Y and solved Y with X (where both are known)" or "I took known system X, did new tweak Y and got better results."  One way to start encouraging short papers is to just have an option that reviewers can say something like "I will accept this paper as a short paper but not a long paper -- please rewrite" and then just have it accepted (with some area chair supervision) without another round of reviewing.  The understanding would have to be that it would be poor form as an author to pull your paper out just because it got accepted short rather than accepted long, and so authors might be encouraged just to submit short versions.  
(This is something that would take a few years to have an effect, since it would be partially social.)&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Multiple reviewer types (an idea that's been in the ether for a while).  The idea would be that you have three reviewers for each paper, but each serves a specific role.  For instance, one would exclusively check technical details.  The other two could then ignore these.  Or maybe one would be tasked with "does this problem/solution make sense."  This would enable area chairs (yes, again, more work for area chairs) to assign reviewers to do things that they're good at.  You'd still have to review as many papers, but you wouldn't have to do the same detailed level of review for all of them.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Require non-student authors on papers to review 3 times as many papers as they submit to any given conference, no exceptions ("three" because that's how many reviews they will get for any given paper).  I know several faculty who follow the model of "if there is a deadline, we will submit."  I don't know how widespread this is.  The idea is that even half-baked ideas will garner reviews that can help direct the research.  I try to avoid offending people here, but that's what colleagues are for: please stop wasting my time as a reviewer by submitting papers like this.  If they get rejected, you've wasted my time; if they get accepted, it's embarrassing for you (unless you take time by camera ready to make them good, which happens only some of the time).  Equating "last author" = "senior person", there were two people at NAACL who have three papers and nine who have two.  This means that these two people (who in expectation submitted 12 papers -- probably not true, probably more like 4 or 5) should have reviewed 12-15 papers.  The nine should probably have reviewed 9-12 papers.  I doubt they did.  (I hope these two people know that I'm not trying to say they're evil in any way :P.)  
At ACL, there are four people with three papers (one is a dupe with a three from NAACL -- you know who you are!) and eight people with two.  This would have the added benefit of having lots of reviews done by senior people (i.e., no crummy grad student reviews) with the potential downside that these people would gain more control over the community (which could be good or bad -- it's not a priori obvious that being able to do work that leads to many publications is highly correlated with being able to identify good work done by others).&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Make the job of the reviewer more clear.  Right now, most reviews I read have a schizophrenic feel.  On the one hand, the reviewer justifies his rating to the area chair.  On the other, he provides (what he sees as) useful feedback to the authors.  I know that in my own reviewing, I have cut down on the latter.  This is largely in reaction to the "submit anything and everything" model that some people have.  I'll typically give (what I hope is) useful feedback to papers I rate highly, largely because I have questions whose answers I am curious about, but for lower ranked papers (1-3), I tend to say things like "You claim X but your experiments only demonstrate Y."  Rather than "[that] + ... and in order to show Y you should do Z."  Perhaps I would revert to my old ways if I had less to review, but this was a choice I made about a year ago.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;I'd be interested to hear if others have additional "small changes" ideas.  There are "large delta" ideas, such as Fernando's "everything is a journal" model, which I actually like, but is likely to be hard to implement because it's hard to make sweeping changes to a system (though VLDB -- or was it SIGMOD -- managed to do it).&lt;br /&gt;&lt;br /&gt;I actually think that together, some of these ideas could have a significant impact.  
For instance, I would imagine 2 and 4 together would probably cut a 5-6 paper review down to a 3-4 paper review, and doing 6 on top of this would probably take the &lt;span style="font-style: italic;"&gt;average&lt;/span&gt; person's review load down maybe one more.  Overall, perhaps a 50% reduction in number of papers to review, unless you're one of the types who submits lots of papers.  I'd personally like to see it done!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-5365060044632284141?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/5365060044632284141/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=5365060044632284141' title='36 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/5365060044632284141'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/5365060044632284141'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/05/how-to-reduce-reviewing-overhead.html' title='How to reduce reviewing overhead?'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>36</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-6740851294672958732</id><published>2009-04-26T08:01:00.000-06:00</published><updated>2009-04-26T08:15:34.160-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='algorithms'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><category 
scheme='http://www.blogger.com/atom/ns#' term='questions'/><title type='text'>Viterbi search, minimum Bayes risk and laplacian eigenmaps</title><content type='html'>I've been slow at posting because I wanted to post on this current topic for a while, but wanted to work out some more details before doing so.  Well, it didn't happen.  So I'm writing &lt;span style="font-style:italic;"&gt;sans&lt;/span&gt; details.&lt;br /&gt;&lt;br /&gt;Let's think for a minute about non-linear dimensionality reduction, aka manifold learning.  If we compare what, say, ISOMAP does with what laplacian eigenmaps (LE) does, they're really quite similar.  In both cases, we construct a graph over our data points, usually a kNN graph or something, sometimes with edge weights, sometimes not.  We then attempt to embed the points in a low dimensional space to minimize some sort of distortion.  In ISOMAP, the distortion is based on the &lt;span style="font-style:italic;"&gt;shortest path&lt;/span&gt; distance between the two points in our constructed graph.  In LE, the distance is measured according to the Laplacian of the graph.  The notion of a Laplacian of a graph is basically a discrete version of the more standard notion of the differential operator by the same name (that comes up in most standard analysis courses).  In continuous land, it is the divergence of the gradient, which essentially means that it measures some form of diffusion (and has its original applications in physics).  In discrete land, where we live, it can be thought of as a measure of flow on a graph.&lt;br /&gt;&lt;br /&gt;The key difference, then, between ISOMAP and LE is whether you measure distances according to shortest path or to flow, which has a very "average path" feel to it.  
The advantage to LE is that it tends to be less susceptible to noise, which is easy to understand when viewed this way.&lt;br /&gt;&lt;br /&gt;Now, getting back to NLP stuff, we often find ourselves doing some sort of shortest path search.  It's well known that the much loved Viterbi algorithm is exactly an application of dynamic programming to search in the lattice defined by an HMM.  This extends in well known ways to other structures.  Of course, Viterbi search isn't the only game in town.  There are two other popular approaches to "decoding" in HMMs.  One is marginal decoding, where we individually maximize the probability of each node.  The other is minimum Bayes risk decoding.  Here, we take a user-supplied risk (aka loss) function and try to find the output that minimizes the expected risk, where the probabilities are given by our current model.  MBR has been shown to outperform Viterbi in many applications (speech, MT, tagging, etc.).  If your risk (loss) function is 0/1 loss, then MBR and Viterbi recover the same solution.&lt;br /&gt;&lt;br /&gt;What I'm curious about is whether there is a connection here.  I'm not exactly sure how the construction would go -- I'm thinking something like a graph defined over the lattice vertices with edges that reflect the loss function -- but there definitely seems to be a similarity between MBR and average path, which is approximately equal to a Laplacian operation (aka spectral operation).  The reason I think this would be interesting is that a &lt;span style="font-style:italic;"&gt;lot&lt;/span&gt; is known about spectral computations and we might be able to use this knowledge to our advantage (coming up with MBR algorithms is sometimes a bit painful).  An additional complication (or bit of fun, if you see it that way) is that there are at least three standard ways to generalize the notion of a Laplacian to a hypergraph, which is what we would really need to do.  
Perhaps we can help pick out the "right" one through the MBR connection.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-6740851294672958732?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/6740851294672958732/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=6740851294672958732' title='33 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/6740851294672958732'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/6740851294672958732'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/04/viterbi-search-minimum-bayes-risk-and.html' title='Viterbi search, minimum Bayes risk and laplacian eigenmaps'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>33</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-1821143759987652555</id><published>2009-04-24T08:59:00.000-06:00</published><updated>2009-04-24T09:00:21.238-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>...and here it is for ACL</title><content type='html'>Papers are &lt;a href="http://www.acl-ijcnlp-2009.org/main/acceptedfullpapers.html"&gt;here&lt;/a&gt;.  
Words:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt; translat (19)&lt;br /&gt;&lt;li&gt; model (19)&lt;br /&gt;&lt;li&gt; base (19)&lt;br /&gt;&lt;li&gt; learn (16)&lt;br /&gt;&lt;li&gt; semant (12)&lt;br /&gt;&lt;li&gt; supervis (10)&lt;br /&gt;&lt;li&gt; machin (10)&lt;br /&gt;&lt;li&gt; depend (10)&lt;br /&gt;&lt;li&gt; automat (10)&lt;br /&gt;&lt;li&gt; word (9)&lt;br /&gt;&lt;li&gt; pars (9)&lt;br /&gt;&lt;li&gt; approach (8)&lt;br /&gt;&lt;li&gt; system (7)&lt;br /&gt;&lt;li&gt; relat (7)&lt;br /&gt;&lt;li&gt; gener (7)&lt;br /&gt;&lt;li&gt; web (6)&lt;br /&gt;&lt;li&gt; unsupervis (6)&lt;br /&gt;&lt;li&gt; train (6)&lt;br /&gt;&lt;li&gt; languag (6)&lt;br /&gt;&lt;li&gt; label (6)&lt;br /&gt;&lt;li&gt; decod (6)&lt;br /&gt;&lt;li&gt; align (6)&lt;br /&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-1821143759987652555?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/1821143759987652555/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=1821143759987652555' title='21 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/1821143759987652555'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/1821143759987652555'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/04/and-here-it-is-for-acl.html' title='...and here it is for ACL'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>21</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-3872214256776830756</id><published>2009-04-23T10:00:00.001-06:00</published><updated>2009-04-23T10:03:26.911-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>NAACL program up</title><content type='html'>See &lt;a href="http://clear.colorado.edu/NAACLHLT2009/program.html"&gt;here&lt;/a&gt;.  I see a bunch that I reviewed and really liked, so I'm overall pretty happy.  (Though I also see one or two in the "other" category :P.)&lt;br /&gt;&lt;br /&gt;At any rate, here are the top stemmed/stopped words with frequencies:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;model (30)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;base (19)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;translat (17)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;languag (16)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;speech (14)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;improv (14)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;statist (12)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;machin (12)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;unsupervis (11)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;recognit (10)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;pars (10)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;system (9)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;learn (9)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;word (8)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;depend (8)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;approach (8)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;spoken (7)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;semant (7)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;search (7)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;linear (7)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;extract (7)&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Looks like machine translation and unsupervised learning are the winners.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-3872214256776830756?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://nlpers.blogspot.com/feeds/3872214256776830756/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=3872214256776830756' title='16 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/3872214256776830756'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/3872214256776830756'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/04/naacl-program-up.html' title='NAACL program up'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>16</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-4755520838852805533</id><published>2009-03-22T09:13:00.000-06:00</published><updated>2009-03-22T09:28:55.253-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='community'/><category scheme='http://www.blogger.com/atom/ns#' term='PL'/><category scheme='http://www.blogger.com/atom/ns#' term='survey'/><title type='text'>Programming Language of Choice</title><content type='html'>Some of you know that I am (or at least used to be) a bit of a &lt;a href="http://www.cs.utah.edu/~hal/htut/"&gt;programming&lt;/a&gt; &lt;a href="http://www.google.com/search?q=hal+daume+haskell"&gt;language&lt;/a&gt; &lt;a href="http://www.cs.utah.edu/~hal/docs/why-not-c.txt"&gt;snob&lt;/a&gt;.  In fact, on several occasions, I've met (in NLP or ML land) someone who recognizes my name from PL land and is surprised that I'm not actually a PL person.  My favorite story is that after teaching machine learning for the second time, I had &lt;a href="http://www.cs.rutgers.edu/~ccshan/"&gt;Ken Shan&lt;/a&gt;, a friend from my PL days, visit. 
 I announced his visit and got an email from a student who had taken ML from me saying:&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;I _knew_ your name was familiar! I learned a ton about Haskell from your tutorial, for what's worth.. Great read back in my freshman year in college. (Belatedly) Thanks for writing it!&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;And it's not like my name is particularly common!&lt;br /&gt;&lt;br /&gt;At any rate (and, admittedly, this is a somewhat &lt;a href="http://nlpers.blogspot.com/2007/08/hierarchical-bayes-compiler.html"&gt;HBC&lt;/a&gt;-related question) I'm curious what programming language(s) other NLP folks tend to use.  I've tried to include a subset of the &lt;a href="http://shootout.alioth.debian.org/"&gt;programming language shootout&lt;/a&gt; list here that I think are most likely to be used, but if you need to write-in, feel free to do so in a comment.  You can select as many as you would like, but please just try to vote for those that you actually use regularly, and that you actually use for large projects.  Eg., I use Perl a lot, but only for o(100) line programs... 
so I wouldn't select Perl.&lt;br /&gt;&lt;br /&gt;&lt;!-- // Begin Pollhost.com Poll Code // --&gt;&lt;br /&gt;&lt;form method=post action=http://poll.pollhost.com/vote.cgi&gt;&lt;table border=0 width=150 bgcolor=#EEEEEE cellspacing=0 cellpadding=2&gt;&lt;tr&gt;&lt;td colspan=2&gt;&lt;font face="Arial" size=-1 color="#000000"&gt;&lt;b&gt;What programming language(s) do you use for large-ish projects?&lt;/b&gt;&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width=5&gt;&lt;input type=checkbox name=answer value=1&gt;&lt;/td&gt;&lt;td&gt;&lt;font face="Arial" size=-1 color="#000000"&gt;C/C#/C++/Objective-C&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width=5&gt;&lt;input type=checkbox name=answer value=2&gt;&lt;/td&gt;&lt;td&gt;&lt;font face="Arial" size=-1 color="#000000"&gt;D&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width=5&gt;&lt;input type=checkbox name=answer value=3&gt;&lt;/td&gt;&lt;td&gt;&lt;font face="Arial" size=-1 color="#000000"&gt;Eiffel&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width=5&gt;&lt;input type=checkbox name=answer value=4&gt;&lt;/td&gt;&lt;td&gt;&lt;font face="Arial" size=-1 color="#000000"&gt;Erlang&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width=5&gt;&lt;input type=checkbox name=answer value=5&gt;&lt;/td&gt;&lt;td&gt;&lt;font face="Arial" size=-1 color="#000000"&gt;F#&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width=5&gt;&lt;input type=checkbox name=answer value=6&gt;&lt;/td&gt;&lt;td&gt;&lt;font face="Arial" size=-1 color="#000000"&gt;Haskell&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width=5&gt;&lt;input type=checkbox name=answer value=7&gt;&lt;/td&gt;&lt;td&gt;&lt;font face="Arial" size=-1 color="#000000"&gt;Java&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width=5&gt;&lt;input type=checkbox name=answer value=8&gt;&lt;/td&gt;&lt;td&gt;&lt;font face="Arial" size=-1 color="#000000"&gt;Lisp&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width=5&gt;&lt;input type=checkbox 
name=answer value=9&gt;&lt;/td&gt;&lt;td&gt;&lt;font face="Arial" size=-1 color="#000000"&gt;Matlab&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width=5&gt;&lt;input type=checkbox name=answer value=10&gt;&lt;/td&gt;&lt;td&gt;&lt;font face="Arial" size=-1 color="#000000"&gt;OCaml/SML/ML&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width=5&gt;&lt;input type=checkbox name=answer value=11&gt;&lt;/td&gt;&lt;td&gt;&lt;font face="Arial" size=-1 color="#000000"&gt;Perl&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width=5&gt;&lt;input type=checkbox name=answer value=12&gt;&lt;/td&gt;&lt;td&gt;&lt;font face="Arial" size=-1 color="#000000"&gt;Python&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width=5&gt;&lt;input type=checkbox name=answer value=13&gt;&lt;/td&gt;&lt;td&gt;&lt;font face="Arial" size=-1 color="#000000"&gt;R&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width=5&gt;&lt;input type=checkbox name=answer value=14&gt;&lt;/td&gt;&lt;td&gt;&lt;font face="Arial" size=-1 color="#000000"&gt;Ruby&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width=5&gt;&lt;input type=checkbox name=answer value=15&gt;&lt;/td&gt;&lt;td&gt;&lt;font face="Arial" size=-1 color="#000000"&gt;Scala&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width=5&gt;&lt;input type=checkbox name=answer value=16&gt;&lt;/td&gt;&lt;td&gt;&lt;font face="Arial" size=-1 color="#000000"&gt;Scheme&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width=5&gt;&lt;input type=checkbox name=answer value=17&gt;&lt;/td&gt;&lt;td&gt;&lt;font face="Arial" size=-1 color="#000000"&gt;Smalltalk&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td width=5&gt;&lt;input type=checkbox name=answer value=18&gt;&lt;/td&gt;&lt;td&gt;&lt;font face="Arial" size=-1 color="#000000"&gt;Other&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=2&gt;&lt;input type=hidden name=config value="aGNkYXVtZTMJMTIzNzczNTIwNglFRUVFRUUJMDAwMDAwCUFyaWFsCUFzc29ydGVk"&gt;&lt;center&gt;&lt;input type=submit 
value=Vote&gt;&amp;nbsp;&amp;nbsp;&lt;input type=submit name=view value=View&gt;&lt;/center&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td bgcolor=#FFFFFF colspan=2 align=right&gt;&lt;font face="Arial" size=-2 color="#000000"&gt;&lt;a href=http://www.pollhost.com/&gt;&lt;font color=#000099&gt;Free polls from Pollhost.com&lt;/font&gt;&lt;/a&gt;&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/form&gt;&lt;br /&gt;&lt;!-- // End Pollhost.com Poll Code // --&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-4755520838852805533?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/4755520838852805533/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=4755520838852805533' title='60 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/4755520838852805533'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/4755520838852805533'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/03/programming-language-of-choice.html' title='Programming Language of Choice'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>60</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-5792160030119921828</id><published>2009-03-07T11:10:00.000-07:00</published><updated>2009-03-07T11:47:43.189-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='language modeling'/><category scheme='http://www.blogger.com/atom/ns#' term='machine translation'/><title 
type='text'>n-gram words an language Ordering model with</title><content type='html'>N-gram language models have been fairly successful at the task of distinguishing homophones, in the context of speech recognition.  In machine translation (and other tasks, such as summarization, headline generation, etc.), this is not their job.  Their job is to select fluent/grammatical sentences, typically ones which have undergone significant reordering.  In a sense, they have to order words.  A large part of the &lt;a href="http://www.blogger.com/www.isi.edu/%7Eradu"&gt;thesis&lt;/a&gt; of my academic sibling, &lt;a href="http://www.blogger.com/www.isi.edu/%7Eradu"&gt;Radu Soricut&lt;/a&gt;, had to do with exploring how well ngram language models can reorder sentences.  Briefly, they don't do very well.  This is something that our advisor, &lt;a href="http://www.isi.edu/%7Emarcu/"&gt;Daniel Marcu&lt;/a&gt;, likes to talk about when he gives invited talks; he shows a 15-word sentence and the reorderings preferred by an ngram LM, and they're total hogwash, even though audience members can fairly quickly solve the exponential time problem of reordering the words to make a good sounding sentence.  (As an aside, Radu found that if you add in a syntactic LM, things get better... if you don't want to read the whole thesis, just skip forward to section 8.4.2.)&lt;br /&gt;&lt;br /&gt;Let's say we like ngram models.  They're friendly for many reasons.  What could we do to make them more word-order sensitive?  I'm not claiming that none of these things have been tried; just that I'm not aware of them having been tried :).&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Discriminative training.  There's lots of work on discriminative training of language models, but, from what I've seen, it usually has to do with trying to discriminate true sentences from fake sentences, where the fake sentences are generated by some process (e.g., an existing MT or speech system, a trigram LM, etc.).  
The alternative is to directly train a language model to order words.  Essentially think of it as a structured prediction problem and try to predict the 8th word based on (say) the two previous.  The correct answer is the actual 8th word; the incorrect answer is any other word in the sentence.  Words that don't appear in the sentence are "ignored."  This is easy to implement and seems to do something reasonable (on a small set of test data).&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Add syntactic features to words, eg., via cluster-based language models.  My thought here is to look at syntactic features of words (for instance, CCG-style lexicon information) and use these to create descriptors of the words; these can then be clustered (eg., use tree-kernel-style-features) to give a cluster LM.  This is similar to how people have added CCG/supertag information to phrase-based MT, although they don't usually do the clustering step.  The advantage to clustering is then you (a) get generalization to new words and (b) it fits in nicely with the cluster LM framework.&lt;/li&gt;&lt;/ol&gt;These both seem like such obvious ideas that they must have been tried... maybe they didn't work?  Or maybe I just couldn't dig up papers.  
 Or maybe they're just not good ideas so everyone else dismissed them :).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-5792160030119921828?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/5792160030119921828/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=5792160030119921828' title='39 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/5792160030119921828'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/5792160030119921828'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/03/n-gram-words-language-ordering-model.html' title='n-gram words an language Ordering model with'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>39</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-23230210546515760</id><published>2009-03-02T21:13:00.000-07:00</published><updated>2009-03-02T21:29:56.243-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='clustering'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Mixture models: clustering or density estimation</title><content type='html'>My colleague &lt;a href="http://www.cs.utah.edu/~suresh/"&gt;Suresh Venkatasubramanian&lt;/a&gt; is running a seminar on &lt;a href="http://apollonius.cs.utah.edu/mediawiki/index.php/Algorithms_Seminar/Spring09"&gt;clustering&lt;/a&gt; this semester.  
 Last week we discussed EM and mixture of Gaussians.  I almost skipped because it's a relatively old hat topic for me (how many times have I given this lecture?!), and I had some grant stuff going out that day.  But I decided to show up anyway.  I'm glad I did.&lt;br /&gt;&lt;br /&gt;We discussed a lot of interesting things, but something that had been bugging me for a while finally materialized in a way about which I can be precise.  I basically have two (purely qualitative) issues with mixture of Gaussians as a clustering method.  (No, I'm not actually suggesting there's anything wrong with using it in practice.)  My first complaint is that many times, MoG is used to get the cluster assignments, or to get soft-cluster assignments... but this has always struck me as a bit weird because then we should be maximizing over the cluster assignments and doing expectations over everything else.  Max Welling has &lt;a href="http://www.ics.uci.edu/~welling/publications/papers/BKMSiam06-short-v2.pdf"&gt;done some work related to this&lt;/a&gt; in the Bayesian setting.  (I vaguely remember that someone else did basically the same thing at basically the same time, but can't remember any more who it was.)&lt;br /&gt;&lt;br /&gt;But my more fundamental question is this.  When we start dealing with MoG, we usually say something like... suppose we have a density F which can be represented as F = pi_1 F_1 + pi_2 F_2 + ... + pi_K F_K, where the pis give a convex combination of "simpler" densities F_k.  This question arose in the context of density estimation (if my history is correct) and the maximum likelihood solution via expectation maximization was developed to solve the density estimation problem.  That is, the ORIGINAL goal in this case was to do density estimation; the fact that "cluster assignments" were produced as a byproduct was perhaps not the original intent.&lt;br /&gt;&lt;br /&gt;I can actually give a fairly simple example to try to make this point visually.  
 Here is some data generated by a mixture of uniform distributions.  And I'll even tell you that K=2 in this case.  There are 20,000 points if I recall correctly:&lt;br /&gt;&lt;br /&gt;&lt;img src="http://www.cs.utah.edu/~hal/nlpers/munif.png"&gt;&lt;br /&gt;&lt;br /&gt;Can you tell me what the distribution is?  Can you give me the components?  Can you give me cluster assignments?&lt;br /&gt;&lt;br /&gt;The problem is that I've constructed this to be non-identifiable.  Here are two ways of writing down the components.  (I've drawn this in 2D, but only pay attention to the x dimension.)  They give rise to exactly the same distribution.  One is equally weighted components, one uniform on the range (-3,1) and one uniform on the range (-1,3).  The other is to have two components, one with 3/4 weight on the range (-3,3) and one with 1/4 weight on the range (-1,1).  (Either way, the density is 1/8 on (-3,-1) and on (1,3), and 1/4 on (-1,1).)&lt;br /&gt;&lt;br /&gt;I could imagine some sort of maximum likelihood parameter estimation giving rise to either of these (EM is hard to get to work here because once a point is outside the bounds of a uniform, it has probability zero).  They both correctly recover the distribution, but would give rise to totally different (and sort of weird) cluster assignments.&lt;br /&gt;&lt;br /&gt;I want to quickly point out that this is a very different issue from the standard "non-identifiability in mixture models" issue that has to do with the fact that any permutation of cluster indices gives rise to the same model.&lt;br /&gt;&lt;br /&gt;So I guess that all this falls under the category of "if you want X, go for X."  If you want a clustering, go for a clustering -- don't go for density estimation and try to read off clusters as a by-product.  
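&lt;br /&gt;&lt;br /&gt;As a quick sanity check on the uniform example above, here is a minimal Python sketch. It is a hypothetical illustration, not the code behind the figure, and the names overlap, mixture_mass and responsibilities are made up for this note. It assumes the nested parameterization puts weight 3/4 on (-3,3) and 1/4 on (-1,1), which is what matching the equally weighted (-3,1)/(-1,3) mixture requires. Both densities are constant on each of (-3,-1), (-1,1) and (1,3), so comparing the mass on those three regions is enough to compare the densities.&lt;br /&gt;&lt;pre&gt;
```python
from fractions import Fraction as F


def overlap(a, b, lo, hi):
    """Length of the intersection of the intervals (a, b) and (lo, hi)."""
    return max(F(0), min(b, hi) - max(a, lo))


def mixture_mass(components, a, b):
    """Probability that a mixture of uniforms assigns to the interval (a, b).

    components is a list of (weight, lo, hi) triples.
    """
    return sum(w * overlap(a, b, lo, hi) / (hi - lo) for w, lo, hi in components)


def responsibilities(components, a, b):
    """Posterior over components for a point known to lie in (a, b).

    Only meaningful when every component density is constant on (a, b).
    """
    joint = [w * overlap(a, b, lo, hi) / (hi - lo) for w, lo, hi in components]
    total = sum(joint)
    return [j / total for j in joint]


# Parameterization 1: two equally weighted, overlapping uniforms.
shifted = [(F(1, 2), F(-3), F(1)), (F(1, 2), F(-1), F(3))]
# Parameterization 2 (assumed weights 3/4 and 1/4): a wide uniform plus a narrow one.
nested = [(F(3, 4), F(-3), F(3)), (F(1, 4), F(-1), F(1))]

regions = [(F(-3), F(-1)), (F(-1), F(1)), (F(1), F(3))]
for a, b in regions:
    print("region", (int(a), int(b)))
    print("  mass:", mixture_mass(shifted, a, b), "vs", mixture_mass(nested, a, b))
    print("  responsibilities:", responsibilities(shifted, a, b),
          "vs", responsibilities(nested, a, b))
```
&lt;/pre&gt;Under these assumptions the two parameterizations agree on the mass of every region, yet a point in (1,3) belongs entirely to the second component in the first parameterization and entirely to the first (wide) component in the other: same density, different clusterings.&lt;br /&gt;&lt;br /&gt;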
(Of course, I don't entirely believe this, but I still think it's worth thinking about.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-23230210546515760?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/23230210546515760/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=23230210546515760' title='26 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/23230210546515760'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/23230210546515760'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/03/mixture-models-clustering-or-density.html' title='Mixture models: clustering or density estimation'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>26</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-5650953100314700092</id><published>2009-02-19T06:36:00.000-07:00</published><updated>2009-02-19T06:48:39.399-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mcmc'/><category scheme='http://www.blogger.com/atom/ns#' term='topic models'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Mixing of gibbs samplers for topic (and mixture) models</title><content type='html'>This is just a quick note because it's something that had somehow never occurred to me until a few days ago.  
 If you've ever talked to me in person about the standard (collapsed) Gibbs samplers for topic models, you know that I get concerned that these things don't mix.  And it's not just the generic "how do we know" feeling that can get leveled at (almost) &lt;i&gt;any&lt;/i&gt; sampler, but a very specific &lt;i&gt;we &lt;b&gt;know&lt;/b&gt; for certain that our Gibbs samplers for topic models aren't mixing.&lt;/i&gt;  How do we know?  Because, for the most part, they don't mode switch.  You can figure this out quite easily by simply watching the topic assignments.  (The standard collapsed samplers for Gaussian or Multinomial mixture models exhibit the same problem.)  At least if you have a reasonable amount of data.&lt;br /&gt;&lt;br /&gt;The reason this always bugged me is that we now have definitive evidence that these things aren't mixing in the sense of mode hopping, which leads me to worry that they might also not be mixing in other, less easy to detect ways.&lt;br /&gt;&lt;br /&gt;Well, worry no more.  Or at least worry less.  The mode hopping is a red herring.&lt;br /&gt;&lt;br /&gt;Maybe the following observation is obvious, but I'll say it anyway.&lt;br /&gt;&lt;br /&gt;Let's take our topic model Gibbs sampler and introduce a new Metropolis-Hastings step.  This MH step simply takes two topic indices (say topics i and j) and swaps them.  It picks i and j uniformly at random from the (K choose 2) possibilities.  It's easy to verify that the acceptance probability for this move will be one (the qs will cancel and the ps are identical), which means that it's really more like a Gibbs step than an MH step (in the sense that Gibbs steps are MH steps that are always accepted).&lt;br /&gt;&lt;br /&gt;The observation is that (a) this doesn't actually change what the sampler is doing in any real, meaningful way -- that is, re-indexing things is irrelevant; but (b) you now cannot claim that the sampler isn't mode switching.  
It's mode switching a &lt;i&gt;lot&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;Sure, it might still have poor mixing properties for other reasons, but at least now there isn't this elephant-in-the-room reason it can't possibly be mixing.&lt;br /&gt;&lt;br /&gt;So this is a totally "useless" practical observation (sometimes we like to try to exploit the fact that it's not mixing), but from a theoretical perspective it might be interesting.  For instance, if you want to prove something about a fast mixing rate for these samplers (if you believe they are actually mixing fast, which I'm somewhat keen to believe), then you're not going to be able to prove anything if you don't make this trivial change to the sampler (because without it they're not &lt;i&gt;really&lt;/i&gt; mixing fast).  So it might have interesting theoretical consequences.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/19803222-5650953100314700092?l=nlpers.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpers.blogspot.com/feeds/5650953100314700092/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=19803222&amp;postID=5650953100314700092' title='28 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/5650953100314700092'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/19803222/posts/default/5650953100314700092'/><link rel='alternate' type='text/html' href='http://nlpers.blogspot.com/2009/02/mixing-of-gibbs-samplers-for-topic-and.html' title='Mixing of gibbs samplers for topic (and mixture) models'/><author><name>hal</name><uri>http://www.blogger.com/profile/02162908373916390369</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://www.cs.utah.edu/~hal/portrait.jpg'/></author><thr:total>28</thr:total></entry><entry><id>tag:blogger.com,1999:blog-19803222.post-5891668590533621105</id><published>2009-02-04T08:22:00.000-07:00</published><updated>2009-02-04T08:22:30.975-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='community'/><category scheme='http://www.blogger.com/atom/ns#' term='acl'/><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>Beating the state of the art, big-O style</title><content type='html'>I used to know a fair amount about math; in fact, I even applied to a &lt;a href="http://www.math.ucla.edu/"&gt;few&lt;/a&gt; &lt;a href="http://www.math.huji.ac.il/"&gt;grad&lt;/a&gt; &lt;a href="http://www.math.uiuc.edu/ResearchAreas/logic/"&gt;schools&lt;/a&gt; to do logic (long story, now is not the time).  I was never involved in actual math research (primarily because of the necessary ramp-up time -- thank goodness this doesn't exist as much in CS!), but I did get a sense from my &lt;a href="http://www.math.cmu.edu/%7Erami"&gt;unofficial undergrad advisor&lt;/a&gt; how things worked.  The reason I started thinking of this is because I recently made my way about halfway through a quite dense book on &lt;a href="http://www.amazon.com/Determinacy-Games-Gruyter-Logic-Applications/dp/3110183412/ref=sr_11_1?ie=UTF8&amp;amp;qid=1233759484&amp;amp;sr=11-1"&gt;long games&lt;/a&gt; by a prof from grad school (I cross-enrolled at UCLA for a few semesters).  The basic idea of a countable game is that there is a fixed subset A of [0,1] (subset of the reals) and two players alternate play.  In each move, they play an integer from 0 to 9.  They play for a countably infinite number of moves, essentially writing down (in decimal form) a real number.  If, at the "end", this number is in A, then player 1 wins; otherwise, player 2 wins.  Both players know A.  
 The set A is said to be determined if one of the players has a strategy that will force a win; it is undetermined otherwise.  A long game is the obvious generalization where you play for longer than countable time.  The details don't matter.&lt;br /&gt;&lt;br /&gt;This led me to think, as someone who's moved over to working a lot on machine learning: is there an analogous question for online learning?  There are several ways to set this up and I won't bore you with the details (I doubt any reader here really cares), but depending on how you set it up, you can prove several relatively trivial, but kind of cute, results (I say trivial because they took me on the order of hours, which means that someone who knows what they're doing probably would see them immediately).  I basically did this as a mental exercise, not for any real reason.&lt;br /&gt;&lt;br /&gt;But it got me thinking: obviously machine learning people wouldn't care about this because it's too esoteric and not at all a realistic setting (even for COLTers!).  I strongly doubt that logicians would care either, but for a totally different reason.  From my interactions with them, they would be interested if and only if two things were satisfied: (a) the result showed some interesting connection between a new model and existing models; (b) the proofs were non-trivial and required some new insight that could be then applied to other problems.  Obviously this is not my goal in life, so I've dropped it.&lt;br /&gt;&lt;br /&gt;This led me to introspect: what is it that we as a community need in order to find some result interesting?  What about other fields that I claim to know a bit about?&lt;br /&gt;&lt;br /&gt;Let's take algorithms for a minute.  Everything here is about big-O.  Like the math types, a result without an interesting proof is much less interesting than a result with an interesting proof, though if you start reading CS theory blogs, you'll find that there's a bit of a divide in the community on whether this is good or not.  
 But my sense (which could be totally broken) is that if you have a result with a relatively uninteresting proof that gets you the same big-O running time as the current state of the art, you're in trouble.&lt;br /&gt;&lt;br /&gt;I think it's interesting to contrast this with what happens in both NLP and ML.  Big-O works a bit differently here.  My non-technical description of big-O to someone who knows nothing is that it measures "order of magnitude" improvements.  (Okay, O(n log n) versus O(n log log n) is hard to call an order of magnitude, but you get the idea.)  An equivalent on the experimental side would seem to be something like: you cut the remaining error on a problem by half or more.  In other words, if state of the art is 60% accuracy, then an order of magnitude improvement would be 80% accuracy or better.  At 80% it would be 90%.  At 90% it would be 95% and so on.  Going from 90.5% to 91.2% is not an order of magnitude improvement.&lt;br /&gt;&lt;br /&gt;I actually like this model for looking at experimental results.  Note that this has absolutely nothing to do with statistical significance.  It's kind of like reading results graphs with a pair of thick glasses on (for those who don't wear glasses) or no glasses on (for those who wear thick glasses).  I think the justification is that for less than order of magnitude improvement, it's really just hard to say whether the improvement is due to better tweaking or engineering or dumb luck in how some feature was implemented or what.  For order of magnitude improvement, there almost &lt;i&gt;has&lt;/i&gt; to be something interesting going on.&lt;br /&gt;&lt;br /&gt;Now, I'm not proposing that a paper isn't publishable if it doesn't have order of magnitude improvement.  Very few papers would be published this way.  I'm just suggesting that improving the state of the art &lt;i&gt;not&lt;/i&gt; be -- by itself -- a reason for acceptance &lt;i&gt;unless&lt;/i&gt; it's an order of magnitude improvement.  
That is, you'd either better have a cool idea, be solving a new problem, analyzing the effect of some aspect of a problem that's important, etc., &lt;i&gt;or&lt;/i&gt; work on a well trod task and get a big-O improvement.&lt;br /&gt;&lt;br /&gt;What I'm saying isn't novel, of course... the various exec boards at the ACL conferences have been trying to find ways to get more "interesting" papers into the conferences for (at least) a few years.  This is just a concrete proposal.  Obviously it requires buy-in at least of area chairs and probably reviewers.  And there are definitely issues with it.  Like any attempt to make reviewing non-subjective, there are obviously corner cases (eg., you have a sort-of-interesting idea and an almost-order-of-magnitude improvement).  You can't mechanize the reviewing process.  But frankly, when I see paper reviews that gush
