23 April 2008
Some out there might find this amusing (if you know me, you'll know why). Semester's over; back to regular blogging soon!
06 April 2008
ICWSM Report
(Guest Post by Kevin Duh -- Thanks, Kevin!!!)
I recently attended ICWSM (International Conference on Weblogs and Social Media), which consisted of an interesting mix of researchers from NLP, Data Mining, Psychology, Sociology, and Information Sciences. Social media (which, defined broadly, includes blogs, newsgroups, and online communities like Facebook, Flickr, YouTube, and del.icio.us) now accounts for the majority of content produced and consumed on the Web. As the area grows in importance, people are getting really interested in finding ways to better understand the phenomenon and to build better applications on top of it. This conference, the second in the series, had nearly 200 participants this year. I think this is a rewarding area for NLPers and MLers to test their wits on: there are many interesting applications and open problems.
In the following, I'll pick out some papers, just to give a flavor of the range of work in this area. For a full list of papers, see the conference program. Most papers are available online (do a search); some are linked from the conference blog.
Interesting new applications:
1) International sentiment analysis for News and Blogs -- M. Bautin, L. Vijayarenu, S. Skiena (Stony Brook) Suppose you want to monitor the sentiment of particular named entities (e.g. Bush, Putin) on news and blogs across different countries for comparison. This may be useful for, e.g., political scientists analyzing global reactions to the same event. There are two approaches: one is to apply sentiment analyzers trained in different languages; another is to apply machine translation to the foreign text and then apply an English sentiment analyzer. Their approach is the latter (using an off-the-shelf MT engine). Their system generates very-fun-to-watch "heat maps" of named entities that are popular/unpopular across the globe. I think this paper opens up a host of interesting questions for NLPers: Is sentiment polarity something that can be translated across languages? How would one modify an MT system for this particular task? Is it more effective to apply MT, or to build multilingual sentiment analyzers?
2) Recovering Implicit Thread Structure in Newsgroup Style Conversations -- Y-C. Wang, M. Joshi, C. Rose, W. Cohen (CMU) Internet newsgroups can be quite messy in terms of conversation structure. One long thread can actually represent several different conversations among multiple parties. This work uses natural language cues to tease apart the conversations in a newsgroup thread. The output is a conversation graph that shows the series of posts and replies in a more coherent manner.
3) BLEWS: Using blogs to provide context for news articles -- M. Gamon, S. Basu, D. Belenko, D. Fisher, M. Hurst, C. Konig (Microsoft) Every news article has its bias (e.g. liberal vs. conservative). A reader who wishes to be well-educated on an issue should ideally peruse articles from all sides of the spectrum. This paper presents a system that helps the reader quickly understand the political leaning (and emotional charge) of an article. It does so essentially by looking at how many conservative vs. liberal blogs link to a news article. I think this paper is a good example of how one can creatively combine a few existing technologies (NLP, visualization, link analysis) to produce an application with a lot of added value.
Methods and algorithms adapted for social media data:
4) Document representation and query expansion models for blog recommendation -- J. Arguello, J. Elsas, J. Callan, J. Carbonell (CMU) This is an information retrieval paper, where the goal is to retrieve blogs relevant to a user query. This is arguably a harder problem than traditional webpage retrieval, since blogs are composed of many posts, which can be on slightly different topics. The paper adopts a language modeling approach and asks: should we model blogs at the blog level or at the post level? They also explore what kind of query expansion works for blog retrieval. This paper is a nice example of how one can apply traditional methods to a new problem and then discover a whole range of new and interesting research problems due to domain differences.
Understanding and analyzing social communities:
5) Wikipedian Self-governance in action: Motivating the policy-lens -- I. Beschastnikh, T. Kriplean, D. McDonald (UW) [Best paper award] Wikipedia is an example of self-governance, where participant editors discuss and argue about what should and can be edited. Over the years, a number of community-generated policies and guidelines have formed, such as "all sources need to be verified" and "no original research should be included in Wikipedia". Policies are themselves subject to modification, and they are often used as justification by different editors arguing from different perspectives. How are these policies used in practice? Are they being wielded by knowledgeable Wikipedian "lawyers" or administrators at the expense of everyday editors? This paper analyzes Wikipedia's Talk pages to see how policies are used, and makes some very interesting observations about the evolution of Wikipedia.
6) Understanding the efficiency of social tagging systems using information theory -- E. Chi, T. Mytkowicz (PARC) Social tagging sites such as del.icio.us allow users to tag webpages with arbitrary terms; how efficient is this evolving vocabulary of tags at picking out the webpages of interest? Is there a way to measure whether a social community is "doing well"? This paper looks at the problem with the tools of information theory. For example, they compute the conditional entropy of documents given tags, H(doc|tag), over time and observe that efficiency is actually decreasing as popular tags become overused.
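For concreteness, here is a minimal sketch (my own illustration, not the paper's code) of the quantity being tracked: the conditional entropy H(doc|tag) estimated from observed (document, tag) assignments. The function name and data format are invented for the example.

```python
from collections import Counter
from math import log2

def conditional_entropy(doc_tag_pairs):
    """Estimate H(doc | tag) from observed (doc, tag) assignments:
    H(doc|tag) = -sum_{d,t} p(d,t) * log2 p(d|t)."""
    joint = Counter(doc_tag_pairs)                     # counts of (doc, tag) pairs
    tag_totals = Counter(t for _, t in doc_tag_pairs)  # counts of each tag
    n = len(doc_tag_pairs)
    h = 0.0
    for (d, t), c in joint.items():
        h -= (c / n) * log2(c / tag_totals[t])
    return h

# If each tag points to exactly one document, H(doc|tag) is 0 (maximally
# "efficient"); as popular tags get spread over many documents, it grows.
pairs = [("page1", "python"), ("page2", "python"), ("page3", "recipes")]
print(conditional_entropy(pairs))  # ~0.667 bits
```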
Overall, I see three general directions of research for an NLPer in this field. The first focuses on building novel web applications that need NLP as a sub-component to provide the added value. NLPers in industry or large research groups are well suited to build these applications; this is where start-ups may spring up. The second is more technical: it focuses on adapting existing NLP techniques to new data such as blogs and social media.
This is a great area for individual researchers and grad student projects, since the task is challenging but clearly defined: beat the baseline (the old NLP technique) by introducing novel modifications, new features, and new models. Successes in this space may be picked up by the groups that build the large applications. The third avenue of research, which is less examined (as far as I know), is to apply NLP to help analyze social phenomena. The Web provides an incredible record of human artifacts. If we can study all that is said and written on the Web, we can really understand a lot about social systems and human behavior.
I don't know when NLP technology will be ready, but I think it would be really cool to use NLP to study language for language's sake and, more importantly, to study language in its social context; perhaps we could call that "Social Computational Linguistics". I imagine this area of research will require collaboration with social scientists; it is not yet clear what NLP technology is needed in this space, but papers (5) and (6) above may be a good place to start.
04 April 2008
More complaining about automatic evaluation
I remember that a few years ago, complaining about automatic evaluation at conferences was the thing to do. (Ironically, so was writing papers about automatic evaluation!) Things are saner now on both sides. While what I'm writing here could be read as a gripe, it's really intended as a "did anyone else notice this?" observation, because the issue is somewhat subtle.
The evaluation metric I care about is Rouge, designed for summarization. The primary difference between Rouge and Bleu is that Rouge is recall-oriented while Bleu is precision-oriented. The way Rouge works is as follows. Pick an n-gram size. Get a single system summary H and a single reference summary R (we'll get to multiple references shortly). Let |H| denote the size of the n-gram bag defined by H, and let |H^R| denote the size of the bag intersection; that is, the number of times an n-gram is counted in H^R is the minimum of the number of times it appears in H and in R. Take |H^R| and divide by |R|. This is the n-gram recall of our system on this one example.
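As a concrete reference, here is a minimal sketch of single-reference Rouge-n recall computed exactly as described above, with n-grams treated as bags (multisets). This is my own illustration, not the official Rouge implementation:

```python
from collections import Counter

def ngram_bag(tokens, n):
    """Bag (multiset) of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(hypothesis, reference, n=2):
    """Single-reference Rouge-n recall: |H ^ R| / |R| over n-gram bags."""
    H, R = ngram_bag(hypothesis, n), ngram_bag(reference, n)
    overlap = sum((H & R).values())  # bag intersection: min count per n-gram
    return overlap / sum(R.values()) if R else 0.0
```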
To extend this to more than one summary, we simply average the per-summary Rouge scores.
Now, suppose we have multiple references, R_1, R_2, ..., R_K. In the original Rouge papers and implementation, the score for a single system summary is computed as the max over references of the Rouge against each individual reference. In other words, our score is the score against a single reference, where that reference is chosen optimistically.
In later Rouge papers and implementations, this changed. In the single-reference case, our score was |H^R|/|R|. In the multiple-reference setting, it is |H^(R_1 + R_2 + ... + R_K)| / |R_1 + R_2 + ... + R_K|, where + denotes bag union. Apparently this makes the evaluation more stable.
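Continuing the sketch above (reusing ngram_bag and rouge_n_recall), the two multi-reference schemes might look like the following. This is my own reading of the description: I take "bag union" to be the element-wise maximum of counts (the usual multiset union); if it instead means summing counts, only the marked line changes.

```python
from collections import Counter

def rouge_n_max(hypothesis, references, n=2):
    """Original scheme: best single-reference score, chosen optimistically."""
    return max(rouge_n_recall(hypothesis, ref, n) for ref in references)

def rouge_n_pooled(hypothesis, references, n=2):
    """Later scheme: score once against the bag union of all references."""
    H = ngram_bag(hypothesis, n)
    pooled = Counter()
    for ref in references:
        pooled |= ngram_bag(ref, n)  # bag union = max count per n-gram (assumption; use += to sum counts instead)
    overlap = sum((H & pooled).values())
    return overlap / sum(pooled.values()) if pooled else 0.0
```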
(As an aside, there is no notion of a "too long" penalty because all system output is capped at some fixed length, e.g., 100 words.)
Enough about how Rouge works. Let's talk about how my DUC summarization system worked back in 2006. First, we run BayeSum to get a score for each sentence. Then, based on the score and about 10 other features, we perform sentence extraction, optimized against Rouge. Many of these features are simple patterns; the most interesting (for this post) is my "MMR-like" feature.
MMR (Maximal Marginal Relevance) is a now-standard technique in summarization that aims to keep your sentence extractor from extracting wholly redundant sentences. It works as follows. We score each sentence and pick the highest-scoring one as our first sentence. We then rescore each remaining sentence as a weighted combination of its original score minus its similarity to the sentence(s) already selected. Essentially, we want to punish redundancy, weighted by some parameter a.
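Here is a minimal greedy sketch of that kind of selection rule (my own illustration; the actual feature in my system and the exact combination differ in details). The similarity function, scores, and parameter names are placeholders:

```python
def mmr_select(sentences, scores, similarity, a=0.5, k=5):
    """Greedily pick k sentences, penalizing similarity to already-selected
    ones by weight a. Note that with a < 0 the "penalty" becomes a bonus,
    i.e. the selector actively favors redundancy (the situation described
    below)."""
    selected, remaining = [], list(range(len(sentences)))
    while remaining and len(selected) < k:
        def adjusted(i):
            redundancy = max((similarity(sentences[i], sentences[j]) for j in selected),
                             default=0.0)
            return scores[i] - a * redundancy
        best = max(remaining, key=adjusted)
        selected.append(best)
        remaining.remove(best)
    return [sentences[i] for i in selected]
```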
This parameter is something I tune in max-Rouge training. What I found is that, at the end of the day, the value of a that the system settles on is always negative, which means that instead of disfavoring redundancy, we're actually favoring it. I always took this as a sign that human summaries really aren't that diverse.
The take-home message is that if you can opportunistically pick one good sentence to go in your summary, the remaining sentences you choose should be as similar to that one as possible. It's sort of an exploitation (not exploration) issue.
The problem is that I don't think this is true. I think it's an artifact, and probably a pretty bad one, of the "new" version of Rouge with multiple references. In particular, suppose I opportunistically choose one good sentence. It will match a bunch of n-grams in, say, reference 1. Now, suppose as my second sentence I choose something that is actually diverse. Sure, maybe it matches something diverse in one of the references. But maybe not. Suppose instead that I pick (roughly) the same sentence that I chose for sentence 1. It won't re-match n-grams from reference 1, but if it's really an important sentence, it will match the equivalent sentence in reference 2. And so on.
So this is all nice, but does it happen? It seems so. Below, I've taken all of the systems from DUC 2006 and plotted (on X) their human-graded Non-Redundancy scores (higher means less redundant) against (on Y) their Rouge-2 scores.
Here, we clearly see (though there aren't even many data points) that high non-redundancy means low Rouge-2. Below is the same plot for Rouge-SU4, another version of the metric:
Again, we see the same trend. If you want high Rouge scores, you had better be redundant.
The point here is not to gripe about the metric, but to point out something that people may not be aware of. I certainly wasn't until I actually started looking at what my system was learning. Perhaps this is something that deserves some attention.
01 April 2008
Should NAACL and ICML co-locate?
It's been a while since we've had a survey, so here's a new one. The question is: should NAACL and ICML attempt to co-locate in the near future (probably no sooner than 2010, since planning for 2009 is already fairly well established)? The reason the question is NAACL-specific and not, e.g., about ACL or EMNLP, is that I don't have any control over what ACL or EMNLP does. (For that matter, I don't have any control over what ICML does either.)
The obvious question is: why should this happen? Well, for one, it will save me travel time and money! What more convincing do you want?
More seriously, I think the NLP community and the ML community have been bumping up against each other for a while now. More importantly, the model is no longer simply: the NLP world supplies data to the ML world, which supplies algorithms back to the NLP world. Our problems are often different from those commonly studied in ML, or come with different prior biases (not necessarily in the Bayesian sense). But not only do we have interesting problems, I think we also have a different way of viewing the world (though not so different as to make communication impossible). I think such an endeavor would be marked most by shared tutorials (so each side can see what the other does) and shared workshops.