21 August 2007

Topic modeling: syntactic versus semantic

Topic modeling has turned into a bit of a cottage industry in the NLP/machine learning world. Most of it stems from latent Dirichlet allocation (LDA), though LDA of course built on previous techniques, the most well-known of which is latent semantic analysis. At the end of the day, such "topic models" really look more like dimensionality reduction techniques (note, e.g., the similarity to multinomial PCA); in practice, however, they're often used as (perhaps soft) clustering methods: words are mapped to topics, topics are used as features, and these features are fed into some learning algorithm.
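To make that pipeline concrete, here is a minimal sketch in Python using scikit-learn (my choice of toolkit; the corpus, labels, and topic count are all made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

# A toy corpus; any document collection would do.
docs = [
    "the new film is a musical comedy",
    "the actor won an award for the drama",
    "students and teachers returned to school",
    "the university opened a new laboratory",
]
labels = [0, 0, 1, 1]  # hypothetical film-vs-school labels

# Bag-of-words counts; LDA ignores word order entirely.
counts = CountVectorizer().fit_transform(docs)

# Reduce each document to a two-dimensional topic mixture --
# this is the dimensionality-reduction view of topic models.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# The topic proportions then serve as features for a downstream learner.
clf = LogisticRegression().fit(doc_topics, labels)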

One thing that's interested me for a while is how, when viewed as clustering algorithms, these topic models compare with more standard word clustering algorithms from the NLP community -- for instance, the Brown clustering technique (built into SRILM), which clusters words based on context. (Lots of other word clustering techniques exist, but they pretty much all cluster based on local context, where "local" means either positionally local or local in a syntactic tree.)
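The following is not SRILM's Brown clustering, just a minimal sketch of the local-context idea it rests on: represent each word by counts of its immediate left and right neighbors and cluster those vectors (the corpus and cluster count are invented):

from collections import Counter, defaultdict
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans

sentences = [
    "the film was a comedy".split(),
    "the movie was a drama".split(),
    "students attend the school".split(),
    "teachers run the university".split(),
]

# Count immediate left/right neighbors for each word type.
contexts = defaultdict(Counter)
for sent in sentences:
    for i, w in enumerate(sent):
        if i > 0:
            contexts[w]["L=" + sent[i - 1]] += 1
        if i + 1 < len(sent):
            contexts[w]["R=" + sent[i + 1]] += 1

words = sorted(contexts)
X = DictVectorizer().fit_transform(contexts[w] for w in words)

# Words with similar local contexts should land in the same cluster.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
for w, c in zip(words, km.labels_):
    print(c, w)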

I think the general high-level story is that "topic models" go for semantics while "clustering models" go for syntax. That is, clustering models will tend to cluster together words that appear in similar local contexts, while topic models will cluster together words that appear in similar global contexts. I've even heard stories that, given the choice between POS tags and Brown clusters as features in a model, it really doesn't make a difference which you use.

I think this sentiment is a bit unfair to clustering models. Saying that context-based clustering models only find syntactically similar words is just not true. Consider the example clusters from the original LDA paper (the top portion of Figure 8). If we look up "film" ("new" seems odd) in CBC (Pantel and Lin's Clustering By Committee), we get: movie, film, comedy, drama, musical, thriller, documentary, flick, etc. (I left out multiword entries.) The LDA list contains: new, film, show, music, movie, play, musical, best, actor, etc. From CBC we never get things like "actor" or "york" (presumably the latter is why "new" appeared), "love", or "theater", but it's unclear whether this is good or not. Perhaps with more topics, these things would have gone into separate topics.

If we look up "school", we get: hospital, school, clinic, center, laboratory, lab, library, institute, university, etc. Again, this is a different sort of list than the LDA list, which contains: school, students, schools, education, teachers, high, public, teacher, bennett, manigat, state, president, etc.

It seems like the syntactic/semantic distinction is not quite right. In some sense, with the first list, LDA is being more liberal in what it considers film-like, with CBC being more conservative. OTOH, with the "school" list, CBC seems to be more liberal.

I realize, of course, that this is comparing apples and oranges... the data sets are different, the models are different, the preprocessing is different, etc. But it's still pretty clear that both sorts of models are getting at the same basic information. It would be cool to see some work that tried to get leverage from both local context and global context, though perhaps this wouldn't be especially beneficial, since these approaches -- at least looking at these two lists -- don't seem to produce results that are strongly complementary. I've also seen questions abound about getting topics out of topic models that are "disjoint" in some sense... this is something CBC does automatically. Perhaps a disjoint-LDA could leverage these ideas.
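For what it's worth, here is one hypothetical shape such a local-plus-global combination might take: give each word both a neighbor-level (local) vector and a document-level (global) vector, concatenate them, and cluster. Everything here -- corpus, weighting, cluster count -- is invented for illustration:

import numpy as np
from collections import Counter, defaultdict
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans

docs = [
    "the film was a comedy and the movie won awards".split(),
    "students attend the school and teachers like the school".split(),
]

local = defaultdict(Counter)    # immediate-neighbor counts
global_ = defaultdict(Counter)  # document-occurrence counts
for d, toks in enumerate(docs):
    for i, w in enumerate(toks):
        global_[w]["doc%d" % d] += 1
        if i > 0:
            local[w]["L=" + toks[i - 1]] += 1
        if i + 1 < len(toks):
            local[w]["R=" + toks[i + 1]] += 1

words = sorted(global_)
L = DictVectorizer().fit_transform(local[w] for w in words).toarray()
G = DictVectorizer().fit_transform(global_[w] for w in words).toarray()

# Concatenate the two views; a real system would weight or normalize them.
X = np.hstack([L, G])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for w, c in zip(words, km.labels_):
    print(c, w)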

7 comments:

Suresh said...

Is this merely a continuum issue? I don't know anything about the context clustering, but it seems to me that there's some implicit local neighborhood used to define "words that bear some relation to this word", and this neighborhood size is the entire document for LDA-type methods and a lot smaller for context-based methods.

If this were the case, a multi-scale approach would immediately come to mind.

hal said...

Good point. I think this is one of two issues. The other is that LDA-type methods throw away ordering, whereas with context-based methods ordering plays a key role (my understanding is that without it, they suck).

PierreD said...

Please excuse a newbie question: from the statistical learning point of view, what is the difference between syntax and semantics? Is it just that syntax is a local concept and semantics a global one, with respect to word co-occurrences?

Bob Carpenter said...

LDA's just a simple Bayesian latent factor model. If you take LDA and swap out the Dirichlet/Multinomial topic models for something else, say n-gram language models (you can even keep the Dirichlet priors if you like, or you can replace them with something more motivated like a process prior), you wind up with something like McCallum and Wang's topic n-grams.

In fact, you can pretty much take whatever you want for the topic model and you wind up with general model-based clustering. Gaussians are quite popular for speech recognition.

I think Suresh was right -- if you cluster on vectors of prior and subsequent words, you get a syntactic clustering; if you take whole documents, then you get something more semantic. In the speech processing language modeling community, folks have combined LSA-style factor models at the "semantic" level (e.g. Jurafsky) and context clustering at the "syntactic" level for language models.
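For instance, in the Gaussian case a minimal sketch looks like the following (scikit-learn on synthetic data; a toy illustration, not anyone's actual system):

import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic "topics", each a Gaussian component in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 2)),
    rng.normal(5.0, 1.0, size=(100, 2)),
])

# Swapping LDA's Dirichlet/multinomial components for Gaussians
# yields an ordinary Gaussian mixture -- model-based clustering.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict(X[:5]), gmm.predict(X[-5:]))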

David said...

echoing hal, good point suresh.

a multi-scale LDA model would be interesting. for example, one can imagine a model of local topics at the word context level and less local topics at the paragraph and document levels.

this could be modeled either as a combination of word distributions, or as higher-order topics being distributions over the lower-order topics. the latter option reminds me a little of andrew mccallum's work on pachinko allocation. (and, i remember that there are probabilistic vision models where increasingly larger patches of the image are modeled.)

Laura Dietz said...

Since both word clustering and topic models rely on a syntactic level -- i.e., word co-occurrences -- it is hard to say which one yields clusters that a human would want to call a topic.

I think topic models are sexy because it is so straightforward to encode domain-specific assumptions. This way one can answer more advanced questions, such as:

How to split a document into topically coherent parts? Latent Dirichlet Co-Clustering

Who is a good reviewer for a given paper? Expertise Modeling for Matching Papers with Reviewers

Which citations are more influential on a given paper than others? Unsupervised Prediction of Citation Influences

AB said...

Topic models sure make interesting conference papers, but the evaluation always sucks. It's either

(1) a rather subjective evaluation ("hey look, these top 10 words in the topic seem to make sense"), or

(2) an evaluation where the authors "forget" to compare with other rather trivial baselines. For instance, McCallum's "expertise modeling" paper above simply disregarded all baselines from expert search from the information retrieval community (some of those from UMass).

In fact, if you check a recent paper (Topic Models over Text Streams: A Study of Batch and Online Unsupervised Learning - Arindam Banerjee and Sugato Basu), the topic-modeling-based techniques are really far from competitive with other simple models.

I'd like to see one of these models perform well in a TREC or CoNLL evaluation. Am I asking for much?
