(Guest Post by Kevin Duh -- Thanks, Kevin!!!)
I recently attended ICWSM (International Conference on Weblogs and Social Media), which consisted of an interesting mix of researchers from NLP, Data Mining, Psychology, Sociology, and Information Sciences. Social media (which, defined generally, can include blogs, newsgroups, and online communities like Facebook, Flickr, YouTube, and del.icio.us) now accounts for the majority of content produced and consumed on the Web. As the area grows in importance, people are getting really interested in finding ways to better understand the phenomenon and to build better applications on top of it. This conference, the second in the series, had nearly 200 participants this year. I think this is a rewarding area for NLPers and MLers to test their wits on: there are many interesting applications and open problems.
In the following, I'll pick out some papers, just to give a flavor of the range of work in this area. For a full list of papers, see the conference program. Most papers are available online (do a search); some are linked from the conference blog.
Interesting new applications:
1) International sentiment analysis for News and Blogs -- M. Bautin, L. Vijayarenu, S. Skiena (Stony Brook) Suppose you want to monitor the sentiment of particular named entities (e.g. Bush, Putin) on news and blogs across different countries for comparison. This may be useful for, e.g., political scientists analyzing global reactions to the same event. There are two approaches: one is to apply sentiment analyzers trained in different languages; the other is to apply machine translation to foreign text, then apply an English sentiment analyzer. Their approach is the latter (using an off-the-shelf MT engine). Their system generates very-fun-to-watch "heat maps" of named entities that are popular/unpopular across the globe. I think this paper opens up a host of interesting questions for NLPers: Is sentiment polarity something that can be translated across languages? How would one modify an MT system for this particular task? Is it more effective to apply MT, or to build multilingual sentiment analyzers?
2) Recovering Implicit Thread Structure in Newsgroup Style Conversations, by Y-C. Wang, M. Joshi, C. Rose, W. Cohen (CMU) Internet newsgroups can get quite messy in terms of conversation structure. One long thread can actually represent different conversations among multiple parties. This work aims to use natural language cues to tease apart the conversations of a newsgroup thread. Their output is a conversation graph that shows the series of post-replies in a more coherent manner.
3) BLEWS: Using blogs to provide context for news articles -- M. Gamon, S. Basu, D. Belenko, D. Fisher, M. Hurst, C. Konig (Microsoft) Every news article has its bias (e.g. liberal vs. conservative). A reader who wishes to be well-educated on an issue should ideally peruse articles on all sides of the spectrum. This paper presents a system that aids the reader in quickly understanding the political leaning (and emotional charge) of an article. It does so by basically looking at how many conservative vs. liberal blogs link to a news article. I think this paper is a good example of how one can creatively combine a few existing technologies (NLP, visualization, link analysis) to produce an application that has a lot of value-added.
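The link-counting idea behind this kind of system can be sketched in a few lines. This is a toy illustration of my own, not the BLEWS implementation: the function name, data layout, and scoring scheme are all assumptions made for the example.

```python
def political_leaning(article_url, blog_links, blog_leaning):
    """Score an article by the leanings of the blogs linking to it.

    blog_links:   dict mapping blog name -> set of article URLs it links to
    blog_leaning: dict mapping blog name -> "liberal" or "conservative"
    Returns a score in [-1, 1]: -1 means all inlinks are from liberal
    blogs, +1 means all are from conservative blogs, 0 means balanced
    (or no inlinks at all).
    """
    lib = con = 0
    for blog, links in blog_links.items():
        if article_url in links:
            if blog_leaning[blog] == "liberal":
                lib += 1
            else:
                con += 1
    total = lib + con
    return 0.0 if total == 0 else (con - lib) / total

# Hypothetical toy data: two conservative blogs and one liberal blog
# all link to the same article.
blog_links = {
    "blue-notes": {"article-1"},
    "red-state": {"article-1"},
    "gop-daily": {"article-1"},
}
blog_leaning = {
    "blue-notes": "liberal",
    "red-state": "conservative",
    "gop-daily": "conservative",
}
print(political_leaning("article-1", blog_links, blog_leaning))  # 0.333...
```

The real system presumably does much more (classifying the blogs themselves, visualizing emotional charge), but the core signal is this kind of aggregate over the link graph.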
Methods and algorithms adapted for social media data:
4) Document representation and query expansion models for blog recommendation -- J. Arguello, J. Elsas, J. Callan, J. Carbonell (CMU) This is an information retrieval paper, where the goal is to retrieve blogs relevant to a user query. This is arguably a harder problem than traditional webpage retrieval, since blogs are composed of many posts, and the posts can be on slightly different topics. The paper adopts a language modeling approach and asks the question: should we model blogs at the blog level, or at the post level? They also explored what kind of query expansion would work for blog retrieval. This paper is a nice example of how one can apply traditional methods to a new problem, and then discover a whole range of interesting new research problems due to domain differences.
Understanding and analyzing social communities:
5) Wikipedian Self-governance in action: Motivating the policy-lens -- I. Beschastnikh, T. Kriplean, D. McDonald (UW) [Best paper award] Wikipedia is an example of self-governance, where participant editors discuss/argue about what should and can be edited. Over the years, a number of community-generated policies and guidelines have formed. These include policies such as "all sources need to be verified" and "no original research should be included in Wikipedia". Policies are themselves subject to modification, and they are often used as justification by different editors under different perspectives. How are these policies used in practice? Are they being used by knowledgeable Wikipedian "lawyers" or administrators at the expense of everyday editors? This paper analyzes the Talk pages of Wikipedia to see how policies are used, and draws some very interesting observations about the evolution of Wikipedia.
6) Understanding the efficiency of social tagging systems using information theory -- E. Chi, T. Mytkowicz (PARC) Social communities such as del.icio.us allow users to tag webpages with arbitrary terms; how efficient is this evolving vocabulary of tags for categorizing the webpages of interest? Is there a way to measure whether a social community is "doing well"? This paper looks at this problem with the tools of information theory. For example, they compute the conditional entropy of documents given tags, H(doc|tag), over time and observe that the efficiency is actually decreasing as popular tags become overused.
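The measure itself is simple to compute: estimate H(doc|tag) from (document, tag) co-occurrence counts. Below is a minimal plug-in estimate on made-up data; the data and the exact estimator are illustrative, not the authors'. Intuitively, a tag that points to a single document contributes no uncertainty, while a popular tag spread over many documents raises H(doc|tag), i.e. tags become less informative.

```python
from collections import Counter
from math import log2

def conditional_entropy(pairs):
    """H(doc | tag) in bits, estimated from a list of (doc, tag) pairs."""
    joint = Counter(pairs)                    # counts of (doc, tag) pairs
    tag_counts = Counter(t for _, t in pairs) # marginal counts of tags
    n = len(pairs)
    h = 0.0
    for (doc, tag), c in joint.items():
        p_joint = c / n                       # p(doc, tag)
        p_doc_given_tag = c / tag_counts[tag] # p(doc | tag)
        h -= p_joint * log2(p_doc_given_tag)
    return h

# Focused tags: each tag identifies one document -> no uncertainty.
focused = [("d1", "python"), ("d1", "python"), ("d2", "numpy")]
# A diffuse tag: "cool" is spread evenly over three documents.
diffuse = [("d1", "cool"), ("d2", "cool"), ("d3", "cool")]
print(conditional_entropy(focused))  # 0.0
print(conditional_entropy(diffuse))  # log2(3) ~= 1.585 bits
```

Tracking this quantity over monthly snapshots of a tagging system gives the kind of efficiency-over-time curve the paper reports.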
Overall, I see three general directions of research for an NLPer in this field: The first approach focuses on building novel web applications that require NLP as a sub-component for the value-added. NLPers in industry or large research groups are well-suited to build these applications; this is where start-ups may spring up. The second approach is more technical: it focuses on how to adapt existing NLP techniques to new data such as blogs and social media.
This is a great area for individual researchers and grad student projects, since the task is challenging but clearly defined: beat the baseline (old NLP technique) by introducing novel modifications, new features, and new models. Success in this space may be picked up by the groups that build the large applications.

The third avenue of research, which is less examined (as far as I know), is to apply NLP to help analyze social phenomena. The Web provides an incredible record of human artifacts. If we can study all that is said and written on the web, we can really understand a lot about social systems and human behavior.
I don't know when NLP technology will be ready, but I think it would be really cool to use NLP to study language for language's sake, and more importantly, to study language in its social context--perhaps we could call that "Social Computational Linguistics". I imagine this area of research will require collaboration with the social scientists; it is not yet clear what NLP technology is needed in this space, but papers (5) and (6) above may be a good place to start.
06 April 2008