
17 February 2010

Senses versus metaphors

I come from a tradition of not really believing in word senses. I fondly remember a talk Ed Hovy gave when I was a grad student. He listed the following example sentences and asked each audience member to group them into senses:

  1. John drove his car to work.
  2. We drove to the university every morning.
  3. She drove me to school every day.
  4. He drives me crazy.
  5. She is driven by her passion.
  6. He drove the enemy back.
  7. She finally drove him to change jobs.
  8. He drove a nail into the wall.
  9. Bill drove the ball far out into the field.
  10. My students are driving away at ACL papers.
  11. What are you driving at?
  12. My new truck drives well.
  13. He drives a taxi in New York.
  14. The car drove around the corner.
  15. The farmer drove the cows into the barn.
  16. We drive the turnpike to work.
  17. Sally drove a golf ball clear across the green.
  18. Mary drove the baseball with the bat.
  19. We drove a tunnel through the hill.
  20. The steam drives the engine in the train.
  21. We drove the forest looking for game.
  22. Joe drove the game from their hiding holes.
Most people in the audience came up with 5 or 6 senses. One came up with two (basically the physical versus mental distinction). According to WordNet, each of these is a separate sense. (And this is only for the verb form!) A common "mistake" people made was to group 1, 2, 3, 13 and 14, all of which seem to have to do with driving cars. The key distinction is that 1 expresses the operation of the vehicle, 2 expresses being transported, 3 expresses being caused to move and 13 expresses driving for a job. You can read the full WordNet descriptions if you don't believe me.

Now, the point of this isn't to try to argue that WordNet is wacky in any way. The people who put it together really know what they're talking about. After all, these senses are all really different, in the sense that there really is a deep interpretive difference between 1, 2, 3 and 13. It's just sufficiently subtle that unless it's pointed out to you, it's not obvious. There's been a lot of work recently from Ed and others on "consolidating" senses in the OntoNotes project: in fact, they have exactly the same example (how convenient) where they've grouped the verb drive into seven senses, rather than 22. These are:
  1. operating or traveling via a vehicle (WN 1, 2, 3, 12, 13, 14, 16)
  2. force to a position or stance (WN 4, 6, 7, 8, 15, 22)
  3. exert energy on behalf of something (WN 5, 10)
  4. cause object to move rapidly by striking it (WN 9, 17, 18)
  5. a directed course of conversation (WN 11)
  6. excavate horizontally, as in mining (WN 19)
  7. cause to function or operate (WN 20)
Now, again, I'm not here to argue that these are better or worse or anything in comparison to WordNet.

The point is that there are (at least) two ways of explaining the wide range of senses of a word like "drive." One is through senses, and this is the typical approach, at least in NLP land. The other is metaphor (and yes, that is a different Mark Johnson). I'm not going to go so far as to claim that everything is a metaphor, but I do think it provides an alternative perspective on this issue. And IMO, alternative perspectives, if plausible, are always worth looking at.

Let's take a really simple "off the top of my head" example based on "drive." Let's unrepentantly claim that there is exactly one sense of drive. Which one? It seems like the most reasonable is probably OntoNotes' sense 2; Merriam-Webster claims that drive derives from Old High German "triban" which, from what I can tell in about a five-minute search, has more to do with driving cattle than anything else. (But even if I'm wrong, this is just a silly example.)
  1. Obviously we don't drive cars like we drive cattle. For one, we're actually inside the cars. But the whole point of driving cattle is to get them to go somewhere. If we think of cars as metaphorical cattle, then by operating them, we are "driving" them (in the drive-car sense).
  2. These are mostly literal. However, for "drive a nail", we need to think of the nail as like a cow that we're trying to get into a pen (the wall).
  3. This is, I think, the most clear metaphorical usage. "He is driving away at his thesis" really means that he's trying to get his thesis to go somewhere (where == to completion).
  4. Driving balls is like driving cattle, except you have to work harder to do it because they aren't self-propelled. This is somewhat like driving nails.
  5. "What are you driving at" is analogous to driving-at-thesis to me.
  6. "Drive a tunnel through the mountain" is less clear to me. But it's also not a sense of this word that I think I have ever used or would ever use. So I can't quite figure it out.
  7. "Steam drives an engine" is sort of a double metaphor. Engine is standing in for cow and steam is standing in for cowboy. But otherwise it's basically the same as driving cattle.
Maybe this isn't the greatest example, but hopefully at least it's a bit thought-worthy. (And yes, I know I'm departing from Lakoff: in the Lakoff setup, from what I understand, there's always a concrete thing and a non-concrete thing.)

This reminds me of the annoying thing my comrades and I used to do as children. "I come from a tradition..." Yields "You literally come from a tradition?" (No, I was educated in such a tradition.... although even that you could ask whether I was really inside a tradition.) "A talk Ed Hovy gave..." Yields "Ed literally gave a talk?" (No, he spoke to an audience.) "I drove the golf ball across the field" Yields "You got in the golf ball and drove it across the field?" Sigh. Kids are annoying.

Why should I care which analysis I use (senses or metaphor)? I'm not sure. It's very rare that I actually feel like I'm being seriously hurt by the word sense issue, and it seems that if you want to use sense to do a real task like translation, you have to depart from human-constructed sense inventories anyway.

But I can imagine a system roughly like the following. First, find the verb and its frame and true literal meaning (maybe it actually does have more than one). This verb frame will impose some restrictions on its arguments (for instance, drive might say that both the agent and theme have to be animate). If you encounter something where this is not true (eg., a "car" as a theme or "passion" as an agent), you know that this must be a metaphorical usage. At this point, you have to deduce what it must mean. That is, if we have some semantics associated with the literal interpretation, we have to figure out how to munge it to work in the metaphorical case. For instance, for drive, we might say that the semantics are roughly "E = theme moves & E' = theme executes E & agent causes E'". If the theme cannot actually execute things (it's a nail), then we have to figure that something else (eg., in this case, the agent) did the actual executing. Etc.
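Here's a sketch of that system, just to make it concrete. Everything in it (the single literal frame, the animacy features, the repair rule) is made up for illustration; a real system would need much richer lexical semantics.

```python
# One literal frame, a check of its selectional restrictions, and a crude
# "repair" when they are violated. All frames and features are invented.

LITERAL_FRAME = {
    "drive": {
        "agent": {"animate"},   # the driver must be animate
        "theme": {"animate"},   # literally, you drive cattle (animate things)
        "semantics": "E = theme moves & E' = theme executes E & agent causes E'",
    }
}

# Toy lexicon of semantic features.
FEATURES = {
    "farmer": {"animate"}, "cows": {"animate"}, "he": {"animate"},
    "nail": set(), "car": set(), "passion": set(), "steam": set(),
}

def interpret(verb, agent, theme):
    frame = LITERAL_FRAME[verb]
    violations = []
    if not frame["agent"] & FEATURES.get(agent, set()):
        violations.append("agent")
    if not frame["theme"] & FEATURES.get(theme, set()):
        violations.append("theme")
    if not violations:
        return "literal", frame["semantics"]
    # Metaphorical repair: if the theme cannot execute anything (it's a nail),
    # assume the agent did the executing.
    semantics = frame["semantics"]
    if "theme" in violations:
        semantics = semantics.replace("theme executes E", "agent executes E")
    return "metaphorical (violated: %s)" % ", ".join(violations), semantics

if __name__ == "__main__":
    print(interpret("drive", "farmer", "cows"))   # literal
    print(interpret("drive", "he", "nail"))       # metaphorical: inanimate theme
    print(interpret("drive", "passion", "he"))    # metaphorical: inanimate agent
```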

So it seems like the options are: come up with semantics and frames for every sense (this is what's done, eg., in VerbNet). Or, have a single (or small number) of semantics and frames and have some generic rules (hopefully generic!) for how to derive metaphorical uses from them.

07 November 2009

NLP as a study of representations

Ellen Riloff and I run an NLP reading group pretty much every semester. Last semester we covered "old school NLP." We independently came up with lists of what we consider some of the most important ideas (idea = paper) from pre-1990 (most are much earlier) and let students select which to present. There was a lot of overlap between Ellen's list and mine (not surprisingly). (I had originally offered to dig up the whole list if people were interested; it is now posted as a comment.) The topics that were actually selected are here.

I hope the students have found this exercise useful. It gets you thinking about language in a way that papers from the 2000s typically do not. It brings up a bunch of issues that we no longer think about frequently. Like language. (Joking.) (Sort of.)

One thing that's really stuck out for me is how much "old school" NLP comes across essentially as a study of representations. Perhaps this is a result of the fact that AI -- as a field -- was (and, to some degree, still is) enamored with knowledge representation problems. To be more concrete, let's look at a few examples. It's already been a while since I read these last (I had meant to write this post during the spring when things were fresh in my head), so please forgive me if I goof a few things up.

I'll start with one I know well: Mann and Thompson's rhetorical structure theory paper from 1988. This is basically "the" RST paper. I think that when many people think of RST, they think of it as a list of ways that sentences can be organized into hierarchies. Eg., this sentence provides background for that one, and together they argue in favor of yet a third. But this isn't really where RST begins. It begins by trying to understand the communicative role of text structure. That is, when I write, I am trying to communicate something. Everything that I write (if I'm writing "well") is toward that end. For instance, in this post, I'm trying to communicate that old school NLP views representation as the heart of the issue. This current paragraph is supporting that claim by providing a concrete example, which I am using to try to convince you of my claim.

As a more detailed example, take the "Evidence" relation from RST. M+T have the following characterization of "Evidence." Herein, "N" is the nucleus of the relation, "S" is the satellite (think of these as sentences), "R" is the reader and "W" is the writer:

relation name: Evidence
constraints on N: R might not believe N to a degree satisfactory to W
constraints on S: R believes S or will find it credible
constraints on N+S: R's comprehending S increases R's belief of N
the effect: R's belief of N is increased
locus of effect: N
This is a totally different way of thinking about things from what I think we see nowadays. I kind of liken it to how I tell students to program. If you're implementing something moderately complex (say, the forward/backward algorithm), first write down all the math, then start implementing. Don't start implementing first. I think nowadays (and sure, I'm guilty!) we see a lot of implementing without the math. Or rather, with plenty of math, but without a representational model of what it is that we're studying.
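For concreteness, here's roughly what I mean by "math first" on the forward algorithm: the recurrence written down as comments before any code, and then a direct transcription of it. The two-state transition and emission tables are invented for illustration.

```python
import numpy as np

# The math, written down first:
#   alpha[0, s] = pi[s] * B[s, x_0]
#   alpha[t, s] = ( sum_{s'} alpha[t-1, s'] * A[s', s] ) * B[s, x_t]
#   P(x_{0:T-1}) = sum_s alpha[T-1, s]

def forward(pi, A, B, obs):
    """Forward algorithm for an HMM: pi (S,), A (S,S), B (S,V), obs a list of symbol ids."""
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()

# Toy two-state, two-symbol HMM (all numbers invented).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
alpha, likelihood = forward(pi, A, B, [0, 1, 1])
print(likelihood)
```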

The central claim of the RST paper is that one can think of texts as being organized into elementary discourse units, and these are connected into a tree structure by relations like the one above. (Or at least this is my reading of it.) That is, they have laid out a representation of text and claimed that this is how texts get put together.

As a second example (this will be shorter), take Wendy Lehnert's 1982 paper, "Plot units and narrative summarization." Here, the story is about how stories get put together. The most interesting thing about the plot units model to me is that it breaks from how one might naturally think about stories. That is, I would naively think of a story as a series of events. The claim that Lehnert makes is that this is not the right way to think about it. Rather, we should think about stories as sequences of affect states. Effectively, an affect state is how a character is feeling at any time. (This isn't quite right, but it's close enough.) For example, Lehnert presents the following story:
When John tried to start his car this morning, it wouldn't turn over. He asked his neighbor Paul for help. Paul did something to the carburetor and got it going. John thanked Paul and drove to work.
The representation put forward for this story is something like: (1) negative-for-John (the car won't start), which leads to (2) motivation-for-John (to get it started), which leads to (3) positive-for-John (it's started), which then links back and resolves (1). You can also analyze the story from Paul's perspective, and then add links that go between the two characters showing how things interact. The rest of the paper describes how these relations work, and how they can be put together into more complex event sequences (such as "promised request bungled"). Again, a high level representation of how stories work from the perspective of the characters.
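Just to make the representation concrete, here's a rough encoding of that analysis as data. The affect-state names and link labels are simplified stand-ins for Lehnert's actual inventory, not a faithful reproduction of it.

```python
# Affect states: (id, character, type, gloss), where "-" and "+" are negative
# and positive events and "M" is a mental (motivation) state.
affect_states = [
    (1, "John", "-", "car won't start"),
    (2, "John", "M", "wants to get it started"),
    (3, "John", "+", "car is started"),
    (4, "Paul", "M", "agrees to help"),
    (5, "Paul", "+", "fixes the carburetor"),
]

# Links between states: (source, target, label).
links = [
    (1, 2, "motivation"),       # the failure motivates the goal
    (2, 3, "actualization"),    # the goal is achieved
    (3, 1, "resolution"),       # the achievement resolves the original problem
    (2, 4, "cross-character request"),
    (4, 5, "actualization"),
    (5, 3, "cross-character: Paul's success is John's success"),
]

for src, dst, label in links:
    print(f"{src} -> {dst}: {label}")
```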

So now I, W, hope that you, R, have an increased belief in the title of the post.

Why do I think this is interesting? Because at this point, we know a lot about how to deal with structure in language. From a machine learning perspective, if you give me a structure and some data (and some features!), I will learn something. It can even be unsupervised if it makes you feel better. So in a sense, I think we're getting to a point where we can go back, look at some really hard problems, use the deep linguistic insights from two decades (or more) ago, and start taking a crack at things that are really deep. Of course, features are a big problem; as a very wise man once said to me: "Language is hard. The fact that statistical association mining at the word level made it appear easy for the past decade doesn't alter the basic truth. :-)." We've got many of the ingredients to start making progress, but it's not going to be easy!

14 June 2009

NL Generation: A new problem

Those who talk to me a lot over the years know that I really think that generation is a cool and interesting problem, but one that is hampered by a lack of clarity of what it is, or at least what the input/output is. It's like the problem with defining summarization, but one hundred times worse.

I have no idea how it came up. I think I was talking to a bunch of people from Microsoft Research and we were talking about translation on the web and what not. And how this is not the business model of Language Weaver. And how when you're on the web you make money by advertising.

And voila! The most incredible NL Generation task occurred to me! (I apologize if you ran into me at all during NAACL because this was pretty much all I could talk about :P.) The initial idea for the task was embedded in MT, though it needn't be. But I'll tell it in the MT setting.

So I go to some web page in some weirdo language (say, French) that I don't understand (because I was a moron and took Latin instead of French or Spanish in middle school and high school). So I ask my favorite translation system (Google or Microsoft or Babelfish or whatever) to translate it. However, the translation system takes certain liberties with the translation. In particular, it might embed a few "product placements" in the text. For instance, maybe it's translating "Je suis vraiment soif" into English (if this is incorrect, blame Google). And perhaps it decides that instead of translating this as "I'm really thirsty," it will translate it as "I'm really thirsty for a Snapple," or "I'm really thirsty: I could go for a Snapple," perhaps with a link to snapple.com.

Product placement is all over the place, even in America where it's made fun of and kept a bit at bay. Not so in China: any American remotely turned off by the Coca-cola cups from which the judges on American Idol drink twice a week would be appalled by the ridiculous (my sentiment) product placement that goes on over there. The presence of the link would probably give away that it's an ad, though of course you could leave this off.

But why limit this to translation? You could put such a piece of technology directly on Blogger, or on some proxy server that you can go through to earn some extra cash browsing the web (thanks to someone -- can't remember who -- at NAACL for this latter suggestion). I mean, you could just have random ads pop up in the middle of text on any web page, for instance one you could host on webserve.ca!

(See what I did there?)

So now here's a real generation problem! You're given some text. And maybe you're even given adwords or something like that, so you can assume that the "select which thing to advertise" problem has been solved. (Yes, I know it's not.) Your job is just to insert the ad in the most natural way in the text. You could evaluate it in at least two ways: click-through (as is standard in a lot of this advertising business) and human judgments of naturalness. I think the point of product placement is to (a) get your product on the screen more or less constantly, rather than just during commercial breaks which most people skip anyway, and (b) perhaps appeal to people's subconscious. I don't know. My parents (used to) do advertising-like things but I know next to nothing about that world.
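To make the task concrete, here's a deliberately naive baseline: split the text into sentences, pick the sentence whose vocabulary overlaps most with the ad's keywords, and splice the placement in right after it. The ad format and the scoring are invented; a real system would need an actual model of naturalness (and presumably of click-through).

```python
import re

def insert_ad(text, ad_keywords, ad_text):
    """Insert ad_text after the sentence that best matches the ad keywords."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    keywords = {k.lower() for k in ad_keywords}
    def overlap(sentence):
        return len(set(re.findall(r"\w+", sentence.lower())) & keywords)
    best = max(range(len(sentences)), key=lambda i: overlap(sentences[i]))
    sentences.insert(best + 1, ad_text)
    return " ".join(sentences)

print(insert_ad(
    "I'm really thirsty. It has been a long, hot walk across town.",
    ["thirsty", "drink", "cold"],
    "I could go for a Snapple.",
))
```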

Okay, so this is slightly tongue in cheek, but not entirely. And frankly, I wouldn't be surprised if something like it were the norm in five years. (If you want to get more fancy, insert product placements into youtube videos!)

24 November 2008

Supplanting vs Augmenting Human Language Capabilities

With an analogy to robotics, I've seen two different approaches. The first is to develop humanoid robots. The second is to develop robots that enhance human performance. The former supplants a human (eg., the long awaited robot butler); the latter augments a human. There are parallels in many AI fields.

What about NLP?

I would say that most NLP research aims to supplant humans. Machine translation puts translators out of work. Summarization puts summarizers out of work (though there aren't as many of these). Information extraction puts (one form of) information analysts out of work. Parsing puts, well... hrm...

There seems actually to be quite little in the way of trying to augment human capabilities. Web search might be one such area, though this is only tenuously an "NLP" endeavor. Certainly there is a reasonable amount of work in translation assistance: essentially fancy auto-completion for translators. Some forms of IE might look like this: find all the names, coreferenceify them and then present "interesting" results to a real human analyst who just doesn't have time to look through all the documents... though this looks to me like a fancy version of some IR task that happens to use IE at the bottom.

Where else might NLP technology be used to augment, rather than supplant?

  • A student here recently suggested the following. When learning a new language, you are often reading and encounter unknown words. These words could be looked up in a dictionary and described in a little pop-up window. Of course, if the definition (hopefully sense-disambiguated) itself contains unknown words, you'd need to recurse. He then suggested a similar model for reading Wikipedia pages: tell Wikipedia everything you know and then have it regenerate the "variational EM" page explaining things that you don't know about (again, recursively). This could either be interactive or not. Wikipedia is nice here because you can probably look up most things that a reader might not know via internal Wikipedia links.

  • Although really a field of IR, there's the whole interactive track at TREC that essentially aims for an interactive search experience, complete with suggestions, refinements, etc.

  • I can imagine electronic tutorials that automatically follow your progress in some task (eg., learning to use photoshop) and auto-generate text explaining parts where you seem to be stuck, rather than just providing you with random, consistent advice. (Okay, this starts to sound a bit like our mutual enemy clippy... but I suspect it could actually be done well, especially if it were really in the context of learning.)

  • Speaking of learning, someone (I don't even remember anymore! Sorry!) suggested the following task to me a while ago. When trying to learn a foreign language, there could be some proxy server you go through that monitors when you are reading pages in this language that you want to learn. It can keep track of what you know and offer mouseover suggestions for words you don't know. This is a bit like the first suggestion above.
That's all I can come up with now.

One big "problem" with working on such problems is that you then cannot avoid actually doing user studies, and we all know how much we love doing this in NLP these days.

03 February 2008

The behemoth, PubMed

The friend I crashed with while attending SODA is someone I've known since we were five years old. (Incidentally, there's actually someone in the NLP world who I've actually known from earlier...small world.) Anyway, the friend I stayed with is just finishing med school at UCSF and will soon be staying there for residency. His specialty is neurosurgery, and his interests are in neural pathologies. He spent some time doing research on Alzheimer's disease, effectively by studying mice (there's something I feel sort of bad about finding slightly amusing about mice with Alzheimer's disease). Needless to say, in the process of doing research, he made nearly daily use out of PubMed. (For those of you who don't know, PubMed is like the ACL anthology, but with hundreds of thousands of papers, with new ones being added by the truckload daily, and with a bunch of additional things, like ontologies and data sets.)

There are two things I want to talk about regarding PubMed. I think both of these admit very interesting problems that we, as NLPers, are qualified to tackle. I think the most important thing, however, is opening and maintaining a wide channel of communication. There seems to be less interaction between people who do (for instance) bio-medical informatics (we have a fairly large group here) and what I'll term as mainstream NLPers. Sure, there have been BioNLP workshops at ACLs, but I really think that both communities would be well-served to interact more. And for those of you who don't want to work on BioNLP because it's "just a small domain problem", let me assure you: it is not easy... don't think of it in the same vein as a true "sublanguage" -- it is quite broad.

I suppose I should give a caveat that my comments below are based on a sample size of one (my friend), so it may not be totally representative. But I think it generalizes.

Search in PubMed, from what I've heard, is good in the same ways that web search is good and bad in the same ways that web search is bad. It is good when you know what you're looking for (i.e., you know the name for it) and bad otherwise. One of the most common sorts of queries that my friend wants to do is something like "show me all the research on proteins that interact in some way with XXX in the context of YYY" where XXX is (eg) a gene and YYY is (eg) a disease. The key is that we don't know which proteins these are and so it's hard to query for them directly. I know that this is something that the folks at Penn (and probably elsewhere) are working on, and I get the impression that a good solution to this problem would make lots and lots of biologists much happier (and more productive). One thing that was particularly interesting, however, is that he was pretty averse to using structured queries like the one I gave above. He effectively wants to search for "XXX YYY" and have it realize that XXX is a gene, YYY is a disease, and that it's "obvious" that what he wants is proteins that interact with (or even, for instance, pathways that contain) XXX in the context of disease YYY. On the other hand, if YYY were another gene, then probably he'd be looking for diseases or pathways that are regulated by both XXX and YYY. It's a bit complex, but I don't think this is something particularly beyond our means.
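Here's a toy sketch of the kind of query interpretation I mean: type the two terms, then choose a structured interpretation based on the type pair. The dictionaries and templates are invented; real typing would come from resources like Entrez Gene and MeSH, not hard-coded lists.

```python
# Tiny, made-up gazetteers standing in for real entity typing.
GENES = {"brca1", "app", "tp53"}
DISEASES = {"alzheimer's", "ovarian cancer", "diabetes"}

def entity_type(term):
    t = term.lower()
    if t in GENES:
        return "gene"
    if t in DISEASES:
        return "disease"
    return "unknown"

# Interpretation templates keyed by the pair of entity types.
TEMPLATES = {
    ("gene", "disease"): "proteins or pathways that interact with {0} in the context of {1}",
    ("gene", "gene"): "diseases or pathways regulated by both {0} and {1}",
}

def interpret_query(x, y):
    tx, ty = entity_type(x), entity_type(y)
    if (tx, ty) in TEMPLATES:
        return TEMPLATES[(tx, ty)].format(x, y)
    if (ty, tx) in TEMPLATES:
        return TEMPLATES[(ty, tx)].format(y, x)
    return f"fall back to keyword search: {x} {y}"

print(interpret_query("APP", "Alzheimer's"))    # gene + disease
print(interpret_query("BRCA1", "TP53"))         # gene + gene
```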

The other thing I want to talk about is summarization. PubMed actually archives a fairly substantial collection of human-written summaries. These fall into one of two categories. The first, called "systematic reviews," is more or less what we would think of as summaries. However, they are themselves quite long and complex. They're really not anything like sentence extracts. The second, called "meta analyses," is really not like summaries at all. In a meta analysis, an author will consider a handful of previously published papers on, say, the effects of smoking on lifespan. He will take the data and results published in these individual papers, and actually do novel statistical analyses on them to see how well the conclusions hold.

From a computational perspective, the automatic creation of meta analyses would essentially be impossible, until we have machines that can actually run experiments in the lab. "Systematic reviews", on the other hand, while totally outside the scope of our current technology, are things we could hope to do. And they give us lots of training data. There are somewhere around ten to forty thousand systematic reviews on PubMed, each about 20 pages long, and each with references back to papers, almost all of which are themselves in PubMed. Finding systematic reviews older than a few years ago (when they began being tagged explicitly) has actually sprouted a tiny cottage industry. And PubMed nicely makes all of their data available for download, without having to crawl, something that makes life much easier for us.

My friend warns that it might not be a good idea to use all systematic reviews, but only those from top journals. (They tend to be less biased, and better written.) However, in so far as I don't think we'd even have hope of getting something as good as a systematic review from the worst journal in the world, I'm not sure this matters much. Maybe all it says is that for evaluation, we should be careful to evaluate against the top.

Now, I should point out that people in biomedical informatics have actually been working on the summarization problem too. From what I can tell, the majority of effort there is on rule-based systems that build on top of more rule-based systems that extract genes, pathways and relations. People at the National Library of Medicine, Rindflesch and Fiszman, use SemRep to do summarization, and they have tried applying it to some medical texts. Two other people that I know are doing this kind of work are Kathy McKeown and Greg Whalen, both at Columbia. The Columbia group has access to a medically informed NLP concept extractor called MedLEE, which gives them a leg up on the low-level processing details. If you search for 'summarization medical OR biomedical' in GoogleScholar, you'll get a raft of hits (~9000).

Now, don't get me wrong -- I'm not saying that this is easy -- but for summarization folks who are constantly looking for "natural artifacts" of their craft, this is an enormous repository.

22 January 2008

An NLPer in the Algorithmists Court

I just returned from SODA (the Symposium on Discrete Algorithms). Obviously (I suppose), I didn't have a paper there, but I was interested in learning new things. Moreover, it was in SF, which is a short cheap flight and admits free housing by crashing with one of my very close friends. My fellow professorial newbie, Suresh Venkatasubramanian, ran a seminar this past Fall on approximate high-dimensional geometry that I attended. This is a topic that is excruciatingly close to many areas of machine learning, and I suspect that there will be a lot of cross-pollination in the next few years. I'll try to get around at some point to post about things I learned in the seminar. Some are more ML-related, some more NLP. Anyway, I wanted to (a) see what a theory conference was like and (b) learn about the latest, greatest techniques. (Sadly, I had to miss the last day because I teach early Tuesday mornings.)

Below are a few of the papers I saw that I liked enough to put them on my "to read" list. I'm not quite sure what my qualification for "liked" is, so take this with a huge grain of salt. If you want more authoritative opinions, see either Suresh's blog or one of the other theory blogs. I also warn you guys that most of these things really have nothing to do with NLP.

The first invited talk was by Bonnie Berger, talking about computational biology. Most of the topics were pretty standard: RNA/protein structure prediction, protein-protein interaction graphs, etc. There are some surface connections to NLP (eg., some other people use stochastic CFGs to do structure prediction, though she does not). The twist that she is taking is to try to solve these problems simultaneously for more than two organisms (typically: yeast, worm, fly, mouse, human; these are the organisms for which we have the most data). The rationale is based on the hypothesis that if a structure is biologically important, then it is (roughly) conserved across species. I feel like a lot of linguistics is based on roughly the same hypothesis, and this is probably a topic I'll try to blog about in the near(ish) future.

The second invited talk was by Persi Diaconis, who I've now seen give two talks and I just love him (I wish I had been able to take probability from him as an undergrad). At a very high level, his talk was a bit of evangelizing functional combinatorics, which is essentially a paradigm for understanding combinatorial problems that unifies a large body of disparate problems in terms of the analysis of symmetric polynomials (which have, themselves, a rich theory). The particular example he gave was a relationship between carrying (like, what you do when you add numbers with pen and paper) and shuffling. I confess I learned a lot more about shuffling (which seems to be a bit of a love for Persi), but it was still pretty interesting. Not enough for me to learn functional combinatorics, but still interesting.

One of my favorite papers was Fast dimension reduction using Rademacher series on dual BCH codes by Nir Ailon and Edo Liberty. Johnson-Lindenstrauss gives us a method of embedding N points in D dimensions into K<<D dimensions that preserves l2 distances with high probability, using random projection matrices (lots of follow-on work, too). One problem is that these methods require storing a D*K matrix and each projection takes D*K operations. The result of this paper is that we can get the same results using much smaller space and time for each projection (or, in certain boundary cases, a matched space/time complexity). The key idea is to force oneself to use a random diagonal K-matrix composed with "something else" that gives us the desired property. The paper analyzes what properties the "something else" must have, and then constructs such a (deterministic) matrix based on Fourier codes. I thought it was both interesting theoretically and likely to be quite practical (though time will tell on the latter).
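To give the flavor of that recipe (this is not their construction -- they use dual BCH codes -- just the general pattern of a random diagonal composed with a fixed fast transform, followed by subsampling; all the sizes are illustrative):

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(0)

def make_fast_projection(D, K):
    signs = rng.choice([-1.0, 1.0], size=D)        # random +/-1 diagonal
    coords = rng.choice(D, size=K, replace=False)  # output coordinates to keep
    def project(x):
        # O(D log D) time and O(D) storage per projection, versus O(D*K) for a
        # dense random matrix.
        return dct(signs * x, norm="ortho")[coords] * np.sqrt(D / K)
    return project

D, K = 1024, 64
project = make_fast_projection(D, K)
x, y = rng.standard_normal(D), rng.standard_normal(D)
print(np.linalg.norm(x - y), np.linalg.norm(project(x) - project(y)))  # roughly equal
```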

Another paper I really liked was Declaring independence via the sketching of sketches by Piotr Indyk and Andrew McGregor. The problem they are tackling is trying to determine whether two variables are correlated, when the variables come in pairs in a stream. (For those who don't know streaming, think of it as an algorithm that gets one data point at a time and has to do something quickly using small space.) The two quantities that are relevant are the joint distribution over the pairs, and the product-of-marginals distribution over the pairs. If these are very similar, the two items in the pair are likely to be independent. They measure distance in three ways: Euclidean (l2) distance, variational (l1) distance, and mutual information. For l2, they have a very clever analysis that is very straightforward to understand if you know anything about sketching (it's basically a natural extension of previous results by Indyk). They are able to get pretty good results. They have similar results for the other metrics. Again, I can already think of a handful of possible applications of this stuff. It also leaves open an interesting question: what if you get k-tuples (instead of pairs) and you want to extract the M-most-correlated pairs. I suspect you can do this efficiently using a combination of this result with known results on frequent item sets. This would have immediate application to Bayes net structure learning, among other things.
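For reference, here is the quantity being approximated, computed the brute-force (non-streaming) way; the whole point of the paper is to estimate these distances in small space without ever storing the joint distribution.

```python
import numpy as np

def dependence_scores(pairs, nx, ny):
    """Compare the empirical joint to the product of its marginals."""
    joint = np.zeros((nx, ny))
    for x, y in pairs:
        joint[x, y] += 1
    joint /= joint.sum()
    prod = np.outer(joint.sum(axis=1), joint.sum(axis=0))  # product of marginals
    l2 = np.sqrt(((joint - prod) ** 2).sum())
    l1 = np.abs(joint - prod).sum()
    mask = joint > 0
    mi = (joint[mask] * np.log(joint[mask] / prod[mask])).sum()
    return l2, l1, mi

rng = np.random.default_rng(0)
independent = [(rng.integers(4), rng.integers(4)) for _ in range(10000)]
correlated = [(x, x) for x in rng.integers(4, size=10000)]
print(dependence_scores(independent, 4, 4))  # all three near zero
print(dependence_scores(correlated, 4, 4))   # all three clearly nonzero
```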

Venkatesan Guruswami, James Lee and Alexander Razborov had a paper on Almost Euclidean subspaces of l_1^N via expander codes (I can't find a link because Google doesn't like latex codes) that I thought was both interesting and very well presented. This paper is a bit harder for me to get across (much less understand!). At a high level, their observation is that for a lot of metric embedding problems, the only known way to get a good embedding (i.e., one that preserves distances or norms) is to use a randomized construction (and then prove that it happens with high probability). Their result is a deterministic, constructive embedding matrix for embedding l1 into l2 (under some qualifications). The other plus of this method is that the embedding matrix is sparse, which means that the image of the vectors under the embedding are sparse. There are also interesting connections to compressed sensing, which are mentioned in the paper.

There were other papers that I liked, but that I'm not going to try to summarize. Timothy Chan had an impressive (if somewhat ugly) result On the bichromatic k-set problem, which is effectively the problem of trying to figure out how many bad halfspace classifiers there are for labeled data (translated into machine learning lingo). Hubert Chan, Anupam Gupta and Kunal Talwar had a nice result on Ultra-low-dimensional embeddings for doubling metrics (I could only find a previous version that appeared in a NIPS workshop) that shows that if your metric doesn't have nearly uniform volume (characterized by the doubling dimension) then you can embed into Euclidean space with low distortion (this is somewhat surprising: classic results of Bourgain show that in general this is not possible). Despite the niceness of their result, the thing I found most interesting in the talk was a reference to a construction due to Satish Rao on Small distortion and volume preserving embeddings for planar and Euclidean metrics that they basically are extending.

Overall, I enjoyed the conference, despite getting pretty lost in a bunch of the talks. I wish I could have stayed for the last day, since there are a handful of papers that I think would probably be interesting. But I suppose that's what this 1300 page proceedings they dumped on me is for. In particular, the ones that look interesting from titles and skimming are: Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm (actually, I've read the TR version of this paper and it's very good); Sampling algorithms and coresets for l_p regression (only considers the N>>D case, which is a bit limiting) and Linked decompositions of networks and power of choice in Polya urns, which has interesting connections to (shockingly) Polya urn schemes. In particular, they show that if you have a Polya urn where in each step you sample two urns proportional to their occupancy, but then insert into the smaller of the two, then you get a balanced distribution. This is in contrast to typical Polya urn schemes that show up in Dirichlet processes and Pitman-Yor processes where you get power law distributions. (The "power of two" reference is to hashing strategies where you hash by two independent hash functions and insert into the lesser occupied of the two bins.)

(p.s., I apologize if I misquoted the exact results in any of these papers; most of this post is based on memory and very short notes I took.)

19 July 2007

What's the Use of a Crummy Translation?

I'm currently visiting Microsoft Research Asia (in Beijing) for two weeks (thanks for having me, guys!). I speak basically no Chinese. I took one half of a semester about 6 years ago. I know much more Japanese; enough so that I can read signs that indicate direction, dates and times, but that's about it... the remainder is too divergent for me to make out at all (perhaps a native Japanese speaker would feel differently, but certainly not a gaijin like me).

My experience here has reminded me of a paper that Ken Church and Ed Hovy wrote almost 15 years ago now, Good Applications for Crummy Machine Translation. I'm not sure how many people have read it recently, but it essentially makes the following point: MT should enter the user's world in small steps, only insofar as it is actually going to work. To say that MT quality has improved significantly in 15 years is probably already an understatement, but it is still certainly far from something that can even compare to the translation quality of a human, even in the original training domain.

That said, I think that maybe we are a bit too modest as a community. MT output is actually relatively readable these days, especially for relatively short input sentences. The fact that "real world companies" such as Google and LanguageWeaver seem to anticipate making a profit off of MT shows that at least a few crazies out there believe that it is likely to work well enough to be useful.

At this point, rather than gleefully shouting the glories of MT, I would like to point out the difference between the title of this post and the title of the Church/Hovy paper. I want to know what to do with a crummy translation. They want to know what to do with crummy machine translation. This brings me back to the beginning of this post: my brief experience in Beijing. (Discourse parsers: I challenge you to get that dependency link!)

  • This voucher can not be encashed and can be used for one sitting only.
  • The management reserves the right of explanation.
  • Office snack is forbidden to take away.
  • Fizzwater bottles please recycle.
The first two are at my hotel, which is quite upscale; the second two are here on the fridge at Microsoft. There are so many more examples, in subway stations, on the bus, on tourism brochures, in trains, at the airport, I could go on collecting these forever. The interesting thing is that although two of these use words that aren't even in my vocabulary (encashed and fizzwater), one is grammatical but semantically nonsensical (what are they explaining?) and one is missing an indirect object (but if it had one, it would be semantically meaningless), I still know what they all mean. Yes, they're sometimes amusing and worth a short chuckle, but overall the important points are gotten across: no cash value; you can be kicked out; don't steal snacks; recycle bottles.

The question I have to ask myself is: are these human translations really better than something a machine could produce? My guess is that machine translation outputs would be less entertaining, but I have a hard time imagining that they would be less comprehensible. I guess I want to know: if we're holding ourselves to the standard of a level of human translation, what level is this? Clearly it's not the average translation level that large tourism companies in China hold themselves to. Can we already beat these translations? If so, why don't we relish this fact?

25 May 2007

Math as a Natural Language

Mathematics (with a capital "M") is typically considered a formal language. While I would agree that it tends to be more formal than, say, English, I often wonder whether it's truly formal (finding a good definition of formal would be useful, but is apparently somewhat difficult). In other words, are there properties that the usual natural languages have that math does not? I would say there are only a few, and they're mostly unimportant. Secondly, note that I bolded the "a" above. This is because I also feel that Mathematics is more like a collection of separate languages (or, dialects if you prefer, since none of them has an army -- except perhaps category theory) than a single language.

First regarding the formality. Typically when I think of formal languages, I think of something lacking ambiguity. This is certainly the case for the sort of math that one would type into matlab or a theorem prover or mathematica. But what one types in here is what I would call "formal math" precisely because it is not the language that mathematicians actually speak. This is perhaps one reason why these tools are not more widely used: translating between the math that we speak and the math that these tools speak is somewhat non-trivial. The ambiguity issue is perhaps the most forceful argument that math is not completely formal: operators get overloaded, subscripts and conditionings get dropped, whole subexpressions get elided (typically with a "..."), etc. And this is not just an issue of venue: it happens in formal math journals as well.

It is often also the case that even what gets published is not really the mathematics that people speak. Back when I actually interacted with real mathematicians, there was inevitably this delay for "formally" writing up results, essentially breaking apart developed shorthand and presenting things cleanly. But the mathematics that appears in math papers is not really in its natural form -- at least, it's often not the language that mathematicians actually work with.

To enumerate a few points: Math...

  • has a recursive structure (obviously)
  • is ambiguous
  • is self-referential ("see Eq 5" or "the LHS" or simply by defining something and giving it a name)
  • has an alphabet and rules for combining symbols
To some degree, it even has a phonology. One of the most painful classes I ever took (the only class I ever dropped) was an "intro to logic for grad CS students who don't know math" class. This class pained me because the pedagogical tool used was for the professor to hand out typed notes and have the students go around and each read a definition/lemma/theorem. After twenty minutes of "alpha superscript tee subscript open parenthesis eye plus one close parenthesis" I was ready to kill myself. It has no morphology that I can think of, but neither really does Chinese.

Moving on to the "a" part, I propose a challenge. Go find a statistician (maybe you are one!). Have him/her interpret the following expression: . Next, go find a logician and have him/her interpret . For people in the respective fields, these expressions have a very obvious meaning (I'm guessing that most of the readers here know what the second is). I'm sure that if I had enough background to drag up examples from other fields, we could continue to perplex people. In fact, even though about 8 years ago I was intimately familiar with the first sort of expression, I actually had to look it up to ensure that I got the form right (and to ensure that I didn't make a false statement). This reminded me somewhat of having to look up even very simple words and expressions in Japanese long after having used it (somewhat) regularly. I think that the diversity of meaning of symbols and subexpressions is a large part of the reason why most computer algebra systems handle only a subset of possible fields (some do algebra, some calculus, some logic, etc.). I believe in my heart that it would be virtually impossible to pin down a single grammar (much less a semantics!) for all of math.

So what does this have to do with an NLP blog? Well, IMO Math is a natural language, at least in all the ways that matter. So why don't we study it more? In particular, when I download a paper, what I typically do is first read the abstract, then flip to the back to skim to results, and then hunt for the "main" equation that will explain to me what they're doing. For me, at least, this is much faster than trying to read all of the text, provided that I can somewhat parse the expression (which is only a problem when people define too much notation). So much information, even in ACL-style papers, but more-so in ICML/NIPS-style papers, is contained in the math. I think we should try to exploit it.

29 April 2007

Problems with Sometimes-hidden Variables

Following in the footsteps of Andrew, I may start posting NLP-related questions I receive by email to the blog (if the answer is sufficiently interesting). At least it's a consistent source of fodder. Here's one from today:

I am looking for a task, preferably in nlp and info retriv, in which we have labeled data where some are completely observed (x,h) and some are partially observed x. (h is a latent variable and is observed for some data points and not observed for others) Briefly, the data should be like {(x_i,h_i,y_i),(x_j,y_j)} where y is the class label.
I think there are a lot of such examples. The first one to spring to mind is alignments for MT, but please don't use the Aachen alignment data; see Alex's ACL06 paper for one way to do semi-supervised learning when the majority of the work is coming from the unsupervised part, not from the supervised part. You could try to do something like the so-called end-to-end system, but with semi-supervised training on the alignments. Of course, building an MT system is a can of worms we might not all want to open.

Depending on exactly what you're looking for, there may be (many) other options. One might be interested in the case where the ys are named entity tags or entities with coref annotated and the hs are parse trees (ala the BBN corpus). By this reasoning, pretty much any pipelined system can be considered of this sort, provided that there is data annotated for both tasks simultaneously (actually, I think a very interesting research question is what to do when this is not the case). Unfortunately, typically if the amount of annotated data differs between the two tasks, it is almost certainly going to be the "easier" task for which more data is available, and typically you would want this easier task to be the hidden task.

Other than that, I can't think of any obvious examples. You could of course fake it (do parsing, with POS tags as the hidden, and pretend that you don't have fully annotated POS data, but I might reject such a paper on the basis that this is a totally unrealistic setting -- then again, I might not if the results were sufficiently convincing). You might also look at multitask learning literature; I know that in such settings, it is often useful to treat the "hidden" variable as just another output variable that's sometimes present and sometimes not. This is easiest to do in the case of neural networks, but can probably be generalized.
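Here's a minimal sketch of that multitask trick: a shared encoder with two heads, where the loss for the h head is simply masked out on examples where h is unobserved. The architecture and dimensions are made up; the masking is the only point.

```python
import torch
import torch.nn as nn

class SharedModel(nn.Module):
    def __init__(self, d_in, d_hid, n_y, n_h):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU())
        self.y_head = nn.Linear(d_hid, n_y)  # always-supervised output y
        self.h_head = nn.Linear(d_hid, n_h)  # sometimes-observed variable h

    def forward(self, x):
        z = self.encoder(x)
        return self.y_head(z), self.h_head(z)

def loss_fn(model, x, y, h, h_observed):
    """h_observed is a boolean mask saying which examples have h labeled."""
    y_logits, h_logits = model(x)
    loss = nn.functional.cross_entropy(y_logits, y)
    if h_observed.any():
        loss = loss + nn.functional.cross_entropy(h_logits[h_observed], h[h_observed])
    return loss

# Toy usage: 8 examples, h observed for the first 3 only (labels are random).
model = SharedModel(d_in=10, d_hid=16, n_y=2, n_h=5)
x = torch.randn(8, 10)
y = torch.randint(0, 2, (8,))
h = torch.randint(0, 5, (8,))
mask = torch.tensor([1, 1, 1, 0, 0, 0, 0, 0], dtype=torch.bool)
print(loss_fn(model, x, y, h, mask))
```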

05 April 2007

Discourse is darn hard

Discourse is an area I've been peripherally interested in for quite a while; probably unduly influenced by my advisor's (ex-?)interest in the problem. The general sense I get is that discourse relations are relatively easy to identify when cue-phrases exist. This is essentially the approach taken by the Penn Discourse Treebank and Daniel's original parser. Unfortunately, cue phrases exist only for a small subset of sentences, so it's tempting to look at lexical techniques, since they work so well in other problems. The intuition is that if one sentence says "love" and the next says "hate," then maybe they are in some sort of contrastive relation. Unfortunately, the world is rarely this nice. The amount of inference necessary to reliably identify seemingly simple, coarse-grained relationships seems astonishing.

Here are some examples of contrast relations:

  1. The interest-only securities will be sold separately by BT Securities.
    The principal-only securities will be repackaged by BT Securities into a Freddie Mac Remic, Series 103, that will have six classes.
  2. James Earl Jones is cast to play the Rev. Mr. Johns.
    The network deals a lot with unknowns, including Scott Wentworth, who portrayed Mr. Anderson, and Bill Alton as Father Jenco.
  3. At the same time, second-tier firms will continue to lose ground.
    In personal computers, Apple, Compaq and IBM are expected to tighten their hold on their business.
In (1), we need to know that interest-only and principal-only securities are essentially themselves contrastive. We then need to combine this information with the coreference of BT Securities in both cases (should be easy). This seems like maybe we could possibly get it with enough data. For (2), we essentially need to know that James Earl Jones is famous, and that "deals with unknowns" is the same as "casts." For (3), we need to know that Apple, Compaq and IBM are not second-tier firms. These latter two seem quite difficult.

Next, let's consider background. Here are some short examples of background relations:
  1. The March 24 oil spill soiled hundreds of miles of shoreline along Alaska's southern coast and wreaked havoc with wildlife and the fishing industry.
    Exxon Corp. is resigning from the National Wildlife Federation 's corporate advisory panel, saying the conservation group has been unfairly critical of the Exxon Valdez oil spill along the Alaskan coast.
  2. Insurers typically retain a small percentage of the risks they underwrite and pass on the rest of the losses.
    After Hugo hit, many insurers exhausted their reinsurance coverage and had to tap reinsurers to replace that coverage in case there were any other major disasters before the end of the year.
The first example here might be identifiable by noting that the tense in the two sentences is different: one is present, one is past. Beyond this, however, we'd essentially have to just recognize that the oil spill referred to in both sentences is, in fact, the same spill. We'd have to infer this from the fact that otherwise the first sentence is irrelevant. In (2), we'd have to recognize that the "passed on" losses go to reinsurers (I didn't even know such things existed).

As a final example, let's look at some evidence relations:
  1. A month ago, the firm started serving dinner at about 7:30 each night; about 50 to 60 of the 350 people in the investment banking operation have consistently been around that late.
    Still, without many actual deals to show off, Kidder is left to stress that it finally has "a team" in place, and that everyone works harder.
  2. Yesterday, Mobil said domestic exploration and production operations had a $16 million loss in the third quarter, while comparable foreign operations earned $234 million.
    Individuals familiar with Mobil's strategy say that Mobil is reducing its U.S. work force because of declining U.S. output.
These examples also seem quite difficult. In (1), we need the fact that the "team" is supposed to refer to the "50 to 60 ... people" and that having dinner, within the context of a company, is a "team-like" activity. (2) is somewhat easier -- knowing that a "loss" is related to a "decline" helps a lot, though it is entirely possible that the type of relation would differ. You also pretty much have to know that "domestic" means "U.S."

I feel like, in many ways, this discourse relation problem subsumes the now-popular "textual entailment" problem. In fact, some of the RST relations look a lot like textual entailment: elaboration-general-specific, elaboration-set-member, Contrast (as a negative entailment), consequence, etc.

I don't have a strong conclusion here other than: this is a hard problem. There've been some attempts at learning lexical things for this task, but I feel like it needs more than that.

25 February 2007

Language and X

I feel it is increasingly interesting to look at problems that have both a linguistic and a non-linguistic signal. The simplest example I can think of here is recent work in looking at images and their captions (eg., the matching words and pictures approach, though this is not the only such example). The idea in many of these techniques is that we have a single data set that contains two types of data. One is language (eg., the captions) and the other is something else (eg., the images). There are many other potentially interesting sources of data like this that I'll discuss after saying why I think these problems are interesting.

I think there are two basic reasons why such "dual" problems are interesting:

  1. By looking at multiple modalities, we can get cross-modality reinforcement of what we're learning. I think a great example of that is this paper, which uses face recognition together with textual coreference in order to do unsupervised coref in captions. The idea is that if you can tell that two different pictures contain the entity we call "Bill Clinton" in them, then it's more likely that a name in their caption will corefer. This gives us a greater ability to share data across a single data set.
  2. When language is treated in the context of some other source of information, we get a sort of grounding effect. That is, continuing the coref example from #1, we have--in some sense--grounded the string "Bill Clinton" to a pictorial representation of a person. Sure, this representation is still just a bunch of symbols, but they're markedly different symbols from the ones we use to represent the entity in purely linguistic terms. Perhaps hard-core cognitive scientists wouldn't call this grounding, but it's getting there. (Also along these lines, one of my favorite papers from a few years ago was my ex-ISI friend Mike doing grounding by looking at language and action in video games.)
There are probably other reasons why such tasks are interesting, but these are the two that appeal to me most strongly.

It seems that vision+language is the hot topic, or at least the warmest of the bunch. This is probably because vision people and language people tend to overlap and meet at machine learning conferences. It's probably also because the sorts of techniques used by the two communities are perhaps the most similar. But I think there's lots of room for looking at other "X"s. For instance, biology. There are lots of data sets (eg., GEO) that contain textual information ("these are cell lines depicting ovarian cancer") plus the actual cell lines. Heck, many papers in PubMed contain such information, albeit in figures rather than matrices. Robotics is another option. Ray Mooney's group down in Texas has worked on understanding orders given to RoboCup competitors based on language information (eg., this paper). Perhaps the oldest that actually lives within the NLP community is NLP + databases, which we really haven't seen much of in the past 5-10 years.

I think this is an interesting and fruitful area of future research and is one that I'll probably be exploring myself (but I won't give away what "X"s!).

07 January 2007

What Irks Me about E-mail Customer Service

I hate dealing with customer service for large corporations, and it has little to do with outsourcing. I hate it because in the past few months, I have sent out maybe three or four emails to customer service peeps, at places like BofA, Chase, Comcast, Ebay, etc. Having worked in a form of customer service previously (I worked at the computer services help desk as an undergrad at CMU to earn some extra money), I completely understand what's going on. But "understand" does not imply "accept." I post this here not as a rant, but because I think there are some interesting NLP problems under the hood.

So what's the problem? What has happened in all these cases is that I have some problem that I want to solve, can't find information about it in the FAQ or help pages on the web site, and so I email customer service with a question. As an example, I wanted to contest an Ebay charge but was two days past the 60 day cutoff (this was over Thanksgiving). So I asked customer service if, given the holiday, they could waive the cutoff. As a reply I get a form email, clearly copied directly out of the FAQ page, saying that there is a 60 day cutoff for filing contests to charges. Well no shit.

So here's my experience from working at the help desk. When we got emails, we had the option of either replying by crafting an email, or replying by selecting a prewritten document from a database. This database was pretty huge -- many thousands of problems, neatly categorized and searchable. For emails for which the answer existed in the database, it took maybe 10 seconds to send the reply out.

What seems to be happening nowadays is that this is being taken to the extreme. A prewritten form letter is always used, regardless of whether it is appropriate or not. If it is a person doing this bad routing, that's a waste of 10 seconds of person time (probably more for these large companies). If it's a machine, it's no big deal from their perspective, but it makes me immediately hate the company with my whole heart.

But this seems to be a really interesting text categorization/routing problem. Basically, you have lots of normal classes (the prewritten letters) plus a "needs human attention" class. There's a natural precision/recall/$$$ trade-off, which is somewhat different and more complex than is standardly considered. But it's clearly an NLP/text categorization problem, and clearly one that should be worked on. I know from my friends at AT&T that they have something similar for routing calls, but my understanding is that this is quite different. Their routing happens based on short bits of information into a small number of categories. The customer service routing problem would presumably be based on lots of information in a large number of categories.
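Here's a minimal sketch of the routing decision: send a canned reply only when the expected cost of sending the wrong one is lower than the cost of having a person read the email. The costs and the classifier's probabilities are invented; the point is just that the threshold falls out of the relative costs.

```python
COST_HUMAN = 1.0         # cost of a person reading and answering the email
COST_WRONG_CANNED = 8.0  # cost (in customer goodwill) of an irrelevant form letter

def route(template_probs):
    """template_probs: dict mapping canned-reply id to estimated probability."""
    best_template, p = max(template_probs.items(), key=lambda kv: kv[1])
    expected_cost_canned = (1.0 - p) * COST_WRONG_CANNED
    if expected_cost_canned < COST_HUMAN:
        return "send_canned", best_template
    return "escalate_to_human", None

print(route({"billing_faq": 0.95, "password_reset": 0.03, "other": 0.02}))
print(route({"billing_faq": 0.40, "password_reset": 0.35, "other": 0.25}))
```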

You could argue that this is no different from providing a "help search" option on the Ebay web page. But I think it's different, if for no other reason than how it appears to the user. If the user thinks he is writing an email to a person, he will write a good email with full sentences and lots of information. If he's just "searching" then he'll only write a few keywords.