natural language processing blog: 2012/02

18 February 2012

Making sense of Wikipedia categories

Wikipedia's category hierarchy forms a graph. It's definitely cyclic (Category:Ethology belongs to Category:Behavior, which in turn belongs to Category:Ethology).

At any rate, did you know that "Chicago Stags coaches" are a subcategory of "Natural sciences"? If you don't believe me, go to the Wikipedia entry for the Natural sciences category, and expand the following list of subcategories:

Biology
Zoology
Subfields of zoology
Ethology
Behavior
Human behavior
Recreation
Games
Ball games
Basketball
Basketball teams
Defunct basketball teams
Defunct National Basketball Association teams
Chicago Stags
Chicago Stags coaches

I guess it kind of makes sense. There are some other fun ones, like "Rhaeto-Romance languages", "American World War I flying aces" and "1911 films". Of course, these are all quite deep in the "hierarchy" (all of those are at depth 15 or higher).

So if you're trying to actually find pages about Natural sciences, maybe it's enough to limit the depth of your breadth-first search down the graph.
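A minimal sketch of that depth-limited breadth-first search, assuming the category graph has already been loaded into a plain dict mapping each category to its subcategories. The function name and the toy `subcats` fragment are illustrative (the names come from the list above, but this is not a real Wikipedia dump).

```python
from collections import deque

def categories_within_depth(subcats, root, max_depth):
    """Return {category: depth} for every category within max_depth of root."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        cat = queue.popleft()
        if depth[cat] == max_depth:
            continue  # do not expand past the cutoff
        for child in subcats.get(cat, ()):
            # The first visit in BFS is the shallowest one; skipping
            # already-seen categories also keeps cycles from looping forever.
            if child not in depth:
                depth[child] = depth[cat] + 1
                queue.append(child)
    return depth

# Toy fragment of the graph, including the Ethology <-> Behavior cycle.
subcats = {
    "Natural sciences": ["Biology"],
    "Biology": ["Zoology"],
    "Zoology": ["Subfields of zoology"],
    "Subfields of zoology": ["Ethology"],
    "Ethology": ["Behavior"],
    "Behavior": ["Ethology", "Human behavior"],
}

shallow = categories_within_depth(subcats, "Natural sciences", 4)
print(sorted(shallow, key=shallow.get))
```

With a cutoff of four, the toy graph stops at Ethology and never reaches Behavior or anything below it.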

This works reasonably well: things up to and including depth four are quite reasonable, including topics like "Neurochemistry", "Planktology" and "Chemical elements". There are a few outliers, like "Earth observation satellites of Israel", which you could certainly argue is not natural science.

At depth five, things become much more mixed. On the one hand, you get categories you might like to include, like "Statins", "Hematology", "Lagoons" and "Satellites" (interesting that Satellites is actually deeper than the Israel thing). But you also get a substantial number of weird things, like "Animals in popular culture" and "Human body positions". It's still not 50/50, but it's getting murky.

At depth six, based on my quick perusal, it's about 50/50.

And although I haven't tried it, I suspect that if you use a starting point other than Natural sciences, the depth at which things get weird is going to be very different.

So I guess the question is how to deal with this.

One thought is to "hope" that editors of Wikipedia pages will list the categories of pages roughly in order of importance, so that you can assume that the first category listed for a page is "the" category for that page. This would turn the structure into a tree. For the above example, this would cut the list at "Subfields of zoology" because the first listed category for the Ethology category is "Behavioral sciences", not "Subfields of zoology."
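Here is a rough sketch of the first-listed-category idea, assuming you have, for each category, its parent categories in the order they appear on the page. The `parents` data is a toy: only the ordering for Ethology comes from the example above, the rest is made up for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

def first_parent_children(parents):
    """Keep only each category's first-listed parent; return parent -> children."""
    children = defaultdict(list)
    for cat, listed in parents.items():
        if listed:
            children[listed[0]].append(cat)
    return dict(children)

def descendants(children, root):
    """Everything reachable from root by following the kept links downward."""
    seen = {root}  # guard in case the kept links loop back to the root
    stack = [root]
    while stack:
        cat = stack.pop()
        for child in children.get(cat, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    seen.discard(root)
    return seen

# Toy data: category -> parent categories in listed order.
parents = {
    "Biology": ["Natural sciences"],
    "Zoology": ["Biology"],
    "Subfields of zoology": ["Zoology"],
    "Ethology": ["Behavioral sciences", "Subfields of zoology"],
    "Behavior": ["Ethology"],
}

tree = first_parent_children(parents)
print(sorted(descendants(tree, "Natural sciences")))
```

Because Ethology hangs off "Behavioral sciences" once only first parents are kept, the walk from Natural sciences stops at "Subfields of zoology" in this toy data.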

Doing this seems to make life somewhat better; you cut out the Stags coaches, but you still get the "Chicago Stags draft picks" (at depth 17). The path, if you care, is (Natural sciences -> Physical sciences -> Physics -> Fundamental physics concepts -> Matter -> Structure -> Difference -> Competition -> Competitions -> Sports competitions -> Sports leagues -> Sports leagues by country -> Sports leagues in the United States -> Basketball leagues in the United States -> National Basketball Association -> National Basketball Association draft picks). Still doesn't feel like Natural sciences to me. In fairness, at depth 6, life is much better. You still get "Heating, ventilating, and air conditioning" but many of the weird entries have gone away.

Another idea is the following. Despite not being a tree or DAG, there is a root to the Wikipedia hierarchy (called Category:Contents). For each page/category you can compute its minimum depth from that Contents page. Now, when you consider subpages of Natural sciences, you can limit yourself to pages whose shortest path goes through Natural sciences. Basically, this tries to encode the idea that if the shallowest way to reach Biology is through Natural sciences, it's probably a natural science.
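One way to sketch this, assuming the same dict-of-subcategories representation: run a breadth-first search from the root and another from Natural sciences, and keep a category only if its depth from the root equals the depth of Natural sciences plus its distance below Natural sciences (that is, at least one shortest path from the root passes through it). The function names are hypothetical, and the toy graph, including the intermediate "Lists" category and "Example list", is invented for illustration.

```python
from collections import deque

def bfs_depths(subcats, start):
    """Shortest distance (in edges) from start to every reachable category."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        cat = queue.popleft()
        for child in subcats.get(cat, ()):
            if child not in depth:
                depth[child] = depth[cat] + 1
                queue.append(child)
    return depth

def reached_via(subcats, root, target):
    """Categories below target whose shortest path from root can pass through target.

    Assumes target itself is reachable from root.
    """
    from_root = bfs_depths(subcats, root)
    from_target = bfs_depths(subcats, target)
    return {
        cat
        for cat, dist in from_target.items()
        if cat != target and from_root[cat] == from_root[target] + dist
    }

# Toy graph: "Biology lists" is reachable both via Biology and via a shorter route.
subcats = {
    "Contents": ["Natural sciences", "Lists"],
    "Natural sciences": ["Biology"],
    "Biology": ["Zoology", "Biology lists"],
    "Lists": ["Biology lists"],
    "Biology lists": ["Example list"],
}

print(sorted(reached_via(subcats, "Contents", "Natural sciences")))
```

In this toy graph Biology and Zoology are kept, while "Biology lists" and everything under it are dropped because a shorter route from the root bypasses Natural sciences.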

This also fails. For instance, the depth of "Natural sciences" (=5) is the same as the depth of "Natural sciences good articles", so if you start from Natural sciences, you'll actually exclude all the good articles! Moreover, even if you insist that a shortest path go through Natural sciences, you'll notice that many editors have depth 5, so any page they've edited will be allowed. Maybe this is a fluke, but "Biology lists" has a depth of only 4, which means that anything that can be reached through "Biology lists" would be excluded, something we certainly wouldn't want to do. There's also the issue that the hierarchy might be much bushier for some high-level topics than others, which makes comparing depths very difficult.

So, that leaves me not really knowing what to do. Yes, I could compute unigram distributions over the pages in topics and cut when those distributions get too dissimilar, but (a) that's annoying and very computationally expensive, (b) requires you to look at the text of the pages which seems silly, (c) you now just have more hyperparameters to tune. You could annotate it by hand ("is this a natural science") but that doesn't scale. You could compute the graph Laplacian and look at flow and use "average path length" rather than shortest paths, but this is a pretty big graph that we're talking about.
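For concreteness, the unigram-distribution option might look something like the sketch below. It assumes whitespace tokenization and uses Jensen-Shannon divergence as the dissimilarity measure; both are choices made here for illustration, not something settled above, and the cutoff you would compare against is exactly the extra hyperparameter from (c).

```python
import math
from collections import Counter

def unigram_dist(text):
    """Relative word frequencies; assumes text contains at least one word."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    vocab = set(p) | set(q)
    mix = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl(a):
        # KL divergence from a to the mixture; mix[w] > 0 wherever a[w] > 0.
        return sum(pa * math.log2(pa / mix[w]) for w, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

# Made-up snippets standing in for the text of pages in two categories.
parent = unigram_dist("cells proteins enzymes cells genes")
child = unigram_dist("coaches seasons playoffs coaches draft")
print(js_divergence(parent, child))
```

You would compute this between a category and each candidate subcategory and stop descending once the divergence passes the chosen threshold.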

Has anyone else tried and succeeded at using the Wikipedia category structure?

11 February 2012

De-Authorship attribution

I received the following (slightly edited) question from my colleague Jon Katz a few days ago:

I was thinking about the problem of authorship attribution... Have people thought about the flip side of this problem? Namely, "anonymizing" text so that it would be hard to attribute it to any author?

This is something I've actually wondered about in the context of blogging for a while. I noticed at some point that my "blogger voice" is very similar to my "reviewer voice" and started worrying that I might be too identifiable as a reviewer. This might either be due to lexical choice ("bajillion" or "awesome") or due to some more subtle stylistic choices.

There is quite a bit of work on authorship attribution. I think the first time I heard a talk on this topic was on March 24, 2004, when Shlomo Argamon gave a talk at ISI (no, I don't have an amazing memory, I cheated) on "On Writing, Our Selves: Explorations in Stylistic Text Categorization." The basic hypothesis of the talk, at least as I remember it, was that if you're trying to do authorship attribution, you should throw out content words and focus on things like POS tag sequences, parse tree structures, and things like that.

There's been a lot of subsequent work in this and related areas. One very related area is trying to predict demographic information (age, gender, socio-economic status, education level, and, yes, astrological sign) from tweets, blog posts or emails (or other forms). One of the key distinctions that I think is important in all of this work is whether the original author is intentionally trying to hide information about him or herself. For instance, someone trying to impersonate Shakespeare, or a child predator pretending to be a different age or gender, or a job applicant trying to sound more educated than is true. This latter case is a much harder problem because the stupid, topically stereotypical features that pop out as being indicative (like men talking about "wives" and "football" and women talking about "husbands" and "yoga") and the silly features that don't really tell us anything interesting (on twitter, apparently men tend to put "http://" before URLs more than women -- who knew?) stop being reliable: these "pretenders" are going to intentionally try to hide that information (now that everyone knows to hide "http://" to trick gender recognizers!). It also means that falling back on topic as a surrogate for demography should not work as well. This seems to be a very different problem from trying to identify whether a blog post is written by me or by Jon, which should be 99.9% do-able by just looking at content words.

The reason I bring this all up is because we don't want to anonymize by changing the topic. The topic needs to stay the same: we just need to cut out additional identifying information. So, getting back to Jon's question, the most relevant work that I know of is on text steganography (by Ching-Yun Chang and Stephen Clark), where they use the ability to do paraphrasing to encode messages in text. Aside from the challenge of making the output actually somewhat grammatical, the basic idea is that when you have two ways of saying the same thing (via paraphrases), you can choose the first one to encode a "0" and the second to encode a "1" and then use this to encode a message in seemingly-natural text.
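To make the bit-encoding idea concrete, here is a toy sketch. It is not Chang and Clark's system: it assumes a tiny hand-written table of single-word paraphrase pairs (`PAIRS`, entirely hypothetical) and ignores grammaticality altogether. The first member of a pair encodes a 0 and the second encodes a 1.

```python
# Hypothetical paraphrase table: each pair is two ways of saying the same thing.
PAIRS = [("big", "large"), ("quick", "fast"), ("begin", "start")]

# word -> (index of its pair, bit that this variant encodes)
VARIANT = {w: (i, bit) for i, pair in enumerate(PAIRS) for bit, w in enumerate(pair)}

def encode(words, bits):
    """Rewrite paraphrasable words so that, read in order, they spell out bits."""
    remaining = list(bits)
    out = []
    for word in words:
        if word in VARIANT and remaining:
            pair_index, _ = VARIANT[word]
            out.append(PAIRS[pair_index][remaining.pop(0)])
        else:
            out.append(word)  # not paraphrasable, or message already encoded
    if remaining:
        raise ValueError("cover text has too few paraphrasable words")
    return out

def decode(words, n_bits):
    """Read back the first n_bits from the variants that appear in the text."""
    bits = [VARIANT[word][1] for word in words if word in VARIANT]
    return bits[:n_bits]

cover = "we begin with a big and quick test".split()
stego = encode(cover, [1, 0, 1])
print(" ".join(stego))
print(decode(stego, 3))
```

The receiver needs the same paraphrase table and the message length; everything else about the text looks like ordinary word choice.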

I also remember having a conversation a while ago with a (different) colleague about trying to build a chat system where you could pretend that you're chatting with someone famous (like Obama or Harry Potter or Scooby Doo). A similar problem is trying to paraphrase my own writing to sound like someone else, but zoinks, that seems hard! A basic approach would be to build a Scooby Doo language model (SDLM) and then run my blog posts through a paraphrase engine that uses the SDLM for producing the output. My vague sense is that this would work pretty poorly, primarily because the subtleness in phrase structure selection would be lost on a highly-lexicalized language model. I imagine you'd get some funny stuff out and it might be amusing to do, but I don't have time to try.

As far as pure anonymization goes, it seems like doing something similar to the steganography approach would work. Here, what you could do is generate a random sequence of bits, and then "encode" that random sequence using the steganography system. This would at least remove some identifying information. But the goal of the steganography isn't to change every phrase, but just to change enough phrases that you can encode your message. It also wouldn't solve the problem that perhaps you can identify a bit about an author by the lengths of their sentences, or by their oscillation between long and short sentences. This also wouldn't be hidden.
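As a rough illustration of why sentence lengths would survive, here is a small sketch of the kind of feature an attribution system could compute. The sentence splitter is a naive regex and the "oscillation" measure (mean absolute change in length between consecutive sentences) is just one assumed way to quantify it; swapping words for paraphrases of similar length leaves both numbers essentially untouched.

```python
import re

def sentence_lengths(text):
    """Word counts per sentence, splitting naively on ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def length_oscillation(lengths):
    """Mean absolute change in length between consecutive sentences."""
    if len(lengths) < 2:
        return 0.0
    diffs = [abs(a - b) for a, b in zip(lengths, lengths[1:])]
    return sum(diffs) / len(diffs)

sample = "Short one. This sentence is quite a bit longer than that. Tiny."
lengths = sentence_lengths(sample)
print(lengths, length_oscillation(lengths))
```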

An alternative, human-in-the-loop approach might be simply to have an authorship recognition system running in your word processor, and then any time you type something that enables it to identify you, it could highlight it and you could be tasked with changing it. I suspect this would be a frustrating, but fairly interesting experience (at least the first time).

p.s., I'm now officially tweeting on @haldaume3.