Comments on natural language processing blog: Making sense of Wikipedia categories

What about counting the number of paths to a topic...

2012-03-11T14:52:46.166-06:00

What about counting the number of paths to a topic from a category? It would be linear in the number of nodes times the number of categories.
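A minimal sketch of the path-counting idea, assuming the category graph has been reduced to a DAG (the real Wikipedia category graph contains cycles, which would have to be broken first). The toy graph and the name `count_paths` are made up for illustration.

```python
from functools import lru_cache

# Hypothetical toy graph: category -> direct subcategories or pages.
# Assumed acyclic for this sketch.
CHILDREN = {
    "Science": ["Biology", "Chemistry"],
    "Biology": ["Biochemistry", "Genetics"],
    "Chemistry": ["Biochemistry"],
    "Biochemistry": ["Enzyme"],
    "Genetics": ["Enzyme"],
}


def count_paths(children, source, target):
    """Count distinct downward paths from source to target in a DAG."""

    @lru_cache(maxsize=None)
    def paths_from(node):
        if node == target:
            return 1
        # Memoization means each node is expanded once per starting category.
        return sum(paths_from(child) for child in children.get(node, ()))

    return paths_from(source)


path_count = count_paths(CHILDREN, "Science", "Enzyme")
```

Here "Enzyme" is reached from "Science" along three different paths, so a high count might be read as a stronger association than a single stray link.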

I used the Wikipedia categories to define a vector...

2012-02-27T11:21:53.768-07:00

I used the Wikipedia categories to define a vector space (mostly for disambiguation purposes), which gave OK results. The code I used for doing that is available on our GitHub, in case that's useful. Using them literally though, given the weird things you point out, will probably give very odd results :)

I did a bachelor's thesis on using Wikipedia c...

2012-02-26T12:04:32.233-07:00

I did a bachelor's thesis on using Wikipedia categories for NE recognition, based on this paper: http://www.mt-archive.info/ACL-2008-Richman.pdf . But that uses categories from the bottom up, so to speak, whereas you are talking about top-down.

I wasn't able to reproduce the same level of results as in that paper, but my software was surely much cruder and my level of knowledge much lower than the authors'.

Depending on your use-case, you might find YAGO (h...

2012-02-25T22:13:20.139-07:00

Depending on your use-case, you might find YAGO (http://www.mpi-inf.mpg.de/yago-naga/yago/) useful.
We had started with Wikipedia, but switched to YAGO for our paper (http://www2011india.com/proceeding/companion/p21.pdf) on answering transitive type-entity queries.
The YAGO category hierarchy is much cleaner and more manageable.

My method is stupid but effective: manually prunin...

2012-02-22T04:31:32.368-07:00

My method is stupid but effective: manual pruning. First I generate the hierarchical taxonomy, then skim it by dragging the scrollbar. If I find something irrelevant, I locate where the hierarchy goes astray and add the topmost undesired category to a "blacklist". The next time the taxonomy is generated, the children of this category won't be added to the hierarchy. Iterating this process makes the taxonomy purer and purer.
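A rough sketch of what the regeneration step with a blacklist could look like. The toy graph and the name `build_taxonomy` are hypothetical, and this version assumes the blacklisted category itself is dropped along with everything below it.

```python
# Hypothetical toy graph: category -> direct subcategories.
CHILDREN = {
    "Biology": ["Genetics", "Biologists"],
    "Genetics": ["Genes"],
    "Biologists": ["Biologists by nationality"],
}

# Categories found to lead the hierarchy astray during manual review.
BLACKLIST = {"Biologists"}


def build_taxonomy(children, root, blacklist):
    """Return the hierarchy under root as nested dicts, skipping blacklisted subtrees."""

    def expand(node, ancestors):
        subtree = {}
        for child in children.get(node, ()):
            # Skip blacklisted categories, and skip ancestors to avoid cycles.
            if child in blacklist or child in ancestors:
                continue
            subtree[child] = expand(child, ancestors | {child})
        return subtree

    return {root: expand(root, frozenset([root]))}


taxonomy = build_taxonomy(CHILDREN, "Biology", BLACKLIST)
```

Each review pass adds entries to `BLACKLIST` and reruns `build_taxonomy`, so the cost of a pass is one scroll through the output.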

I have refined the Wikipedia category graph for us...

2012-02-20T18:16:30.380-07:00

I have refined the Wikipedia category graph for use in the INEX 2010 XML Mining track. This was done by finding the shortest paths between a page and any of the 'Main Topic Classifications'. It results in a multi-label category structure. I only used the last 2 vertices of the shortest path sequences and threw away small categories.

You can find the details in the paper at http://eprints.qut.edu.au/41223/.
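As an illustration only (the paper has the actual procedure), the shortest-path step might be sketched like this. The toy graph, the topic names, and `shortest_paths_to_topics` are hypothetical; taking "the last 2 vertices" to mean the main topic plus the category just below it is an assumption, and the filtering of small categories is left out.

```python
from collections import deque

# Hypothetical toy graph: page or category -> parent categories.
PARENTS = {
    "Enzyme": ["Proteins", "Catalysis"],
    "Proteins": ["Biochemistry"],
    "Catalysis": ["Chemistry"],
    "Biochemistry": ["Life"],
    "Life": ["Nature"],
    "Chemistry": ["Science"],
}

# Stand-ins for the 'Main Topic Classifications'.
MAIN_TOPICS = {"Nature", "Science"}


def shortest_paths_to_topics(parents, page, main_topics):
    """BFS upward from page; return {topic: shortest path} for each reachable topic."""
    found = {}
    queue = deque([[page]])
    seen = {page}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in main_topics:
            found[node] = path
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(path + [parent])
    return found


paths = shortest_paths_to_topics(PARENTS, "Enzyme", MAIN_TOPICS)
# Keep only the last two vertices of each path as the page's labels.
labels = {topic: tuple(path[-2:]) for topic, path in paths.items()}
```

A page that reaches several main topics ends up with several labels, which is where the multi-label structure comes from.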

I am not very sure what your use-case is. But I...

2012-02-20T11:50:59.642-07:00

I am not very sure what your use-case is. But I've used the Wikipedia category graph to measure relatedness between two articles. I guess the distance between the categories "Biology" and "Chicago Stags coaches" is a good estimate of how (un)related they are.

But before creating the graph, I excluded the root node (Category:Contents) so as to get rid of "too much generalization".
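A small sketch of this distance measure, assuming the category links are treated as undirected edges and relatedness is the shortest-path length. The edge list and the names `build_graph` and `distance` are made up for illustration.

```python
from collections import deque

# Hypothetical toy edge list: (parent category, child category).
EDGES = [
    ("Category:Contents", "Science"),
    ("Category:Contents", "Sports"),
    ("Science", "Biology"),
    ("Science", "Chemistry"),
    ("Sports", "Basketball"),
    ("Basketball", "Chicago Stags coaches"),
]


def build_graph(edges, excluded=()):
    """Build an undirected adjacency map, dropping edges that touch excluded nodes."""
    graph = {}
    for parent, child in edges:
        if parent in excluded or child in excluded:
            continue
        graph.setdefault(parent, set()).add(child)
        graph.setdefault(child, set()).add(parent)
    return graph


def distance(graph, start, goal):
    """Number of edges on a shortest path, or None if the nodes are not connected."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return hops
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None


with_root = build_graph(EDGES)
without_root = build_graph(EDGES, excluded={"Category:Contents"})
```

In this toy graph, "Biology" and "Chicago Stags coaches" are five hops apart when the root is kept and disconnected once it is removed, while "Biology" and "Chemistry" stay two hops apart either way.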

Look at our work on it: http://airlab.elet.polimi....

2012-02-19T12:12:03.734-07:00

Look at our work on it: http://airlab.elet.polimi.it/images/3/3e/Macro-categories.pdf

Another resource to look for is the DBPedia ontology.

Cheers,
Riccardo Tasso

Maybe you need to calculate article per topics di...

2012-02-19T10:33:10.096-07:00

Maybe you need to calculate the distribution of articles per topic:
P_1_2 = P({a from C_1} | {a from C_2}) and P_2_1 = P({a from C_2} | {a from C_1})
Then, based on these distributions, calculate the topic hierarchy:
P_1_2 = P_2_1 => C_1 = C_2
P_1_2 > P_2_1 => C_1 -> C_2

It is not hard to calculate using some kind of inverted index: C_1 -> {a_1, a_2, ... a_n}
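A minimal sketch of these rules on top of an inverted index, assuming each conditional probability is estimated as the overlap divided by the size of the conditioning category. The index contents and the names `conditional` and `relation` are hypothetical, and every category is assumed to be non-empty.

```python
from fractions import Fraction

# Hypothetical inverted index: category -> set of articles in it.
INDEX = {
    "Biology": {"Enzyme", "Gene", "Cell", "Darwin"},
    "Genetics": {"Gene", "Darwin"},
    "Heredity": {"Gene", "Darwin"},
}


def conditional(index, c1, c2):
    """Estimate P(a in c1 | a in c2) as |c1 & c2| / |c2| (c2 assumed non-empty)."""
    return Fraction(len(index[c1] & index[c2]), len(index[c2]))


def relation(index, c1, c2):
    """Apply the comparison rules: '=' for equal, '->' if P_1_2 > P_2_1, else '<-'."""
    p_1_2 = conditional(index, c1, c2)
    p_2_1 = conditional(index, c2, c1)
    if p_1_2 == p_2_1:
        return "="
    return "->" if p_1_2 > p_2_1 else "<-"


biology_vs_genetics = relation(INDEX, "Biology", "Genetics")
genetics_vs_heredity = relation(INDEX, "Genetics", "Heredity")
```

Both probabilities share the same numerator, so whenever the categories overlap the comparison comes down to which category is larger. Two disjoint categories give 0 on both sides and would be reported as equal under the literal rule, so a minimum-overlap check is probably needed in practice.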

Your analysis is right: the category structure is...

2012-02-19T05:41:55.164-07:00

Your analysis is right: the category structure is just not very reliable. So one way to deal with it is to regard categories more as tags than as truly structured information. I used this idea for NE classification:

http://www.aclweb.org/anthology/W11-3607 (PDF)

I think the global structure is way too messy for meaningful analysis. Maybe for many problems it is sufficient to look at local structure in the category network.

I quite like the paper: Wu, Fei and Weld, Daniel...

2012-02-19T04:26:02.594-07:00

I quite like the paper:

Wu, Fei and Weld, Daniel S. (2008). Automatically Refining the Wikipedia Infobox Ontology. In Proceedings of the 17th International World Wide Web Conference, (WWW-08), Beijing, China, April, 2008.

It did some work on this, but actually I think they introduce MORE categories (as well as assigning more articles to categories).

out of curiosity, have you looked at other dbs tha...

2012-02-19T00:52:34.872-07:00

Out of curiosity, have you looked at other databases that map (at least partially) to Wikipedia and have their own taxonomies? http://freebase.com/ , for example, which I've been working with a bit recently.