05 February 2007

Bag of Words citation

I was recently asked by a colleague if I knew what the first paper was that used the bag of words model. I'm pretty certain it would be an IR paper, but have no idea what it would be. Manning+Schutze and Jurafsky+Martin don't have it. I know tf-idf is due to Sparck-Jones, but presumably BOW existed before that. The vector space model is often credited to Salton, which is probably the earliest thing I know of, but my guess is that BOW predated even that. Anyone know a citation?

5 comments:

david blei said...

i don't have the book, but mosteller and wallace (1964) may use BOW.

Anonymous said...

They don't use the term "bag-of-words" but I think Luhn (1957) and Maron & Kuhns (1959) deserve a look. Luhn introduced a concept related to what we know as synsets and the model described by Maron and Kuhns appears to me quite similar to BOW.

The URLs:

http://www.research.ibm.com/journal/rd/014/ibmrd0104D.pdf

http://www.doc.ic.ac.uk/~jmag/classic/1960.On%20Relevance,%20Probabilistic%20Indexing%20and%20Information%20Retrieval.pdf

Anonymous said...

My guess is the early cryptographers. Shannon's 1948 paper A Mathematical Theory of Communication lays out a "first-order word approximation", which is equivalent to a bag of words. Of course, he generalized to n-gram models. In the paper, he cites cryptographers for the word distributions.
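To make the claimed equivalence concrete, here is a minimal Python sketch. It is not taken from Shannon's paper: the toy corpus, the whitespace tokenization, and the function names are all illustrative assumptions. It shows that when words are drawn independently according to their frequencies, the probability of a text depends only on its word counts, so shuffling the words changes nothing.

```python
import math
import random
from collections import Counter


def bag_of_words(text):
    """Return word counts for a text, discarding word order."""
    return Counter(text.lower().split())


def unigram_log_prob(text, word_probs):
    """Log-probability of a text when each word is drawn independently."""
    return sum(math.log(word_probs[w]) for w in text.lower().split())


def unigram_log_prob_from_bag(bag, word_probs):
    """Same quantity, computed from the counts alone."""
    return sum(count * math.log(word_probs[w]) for w, count in bag.items())


# Hypothetical toy corpus used only to estimate word frequencies.
corpus = "the cat sat on the mat the dog sat on the log"
counts = bag_of_words(corpus)
total = sum(counts.values())
word_probs = {w: c / total for w, c in counts.items()}

doc = "the dog sat on the mat"

# Shuffle the words of the document with a fixed seed.
words = doc.split()
random.Random(0).shuffle(words)
shuffled = " ".join(words)

# Word order does not affect the bag or the unigram probability.
assert bag_of_words(doc) == bag_of_words(shuffled)
assert math.isclose(
    unigram_log_prob(doc, word_probs),
    unigram_log_prob(shuffled, word_probs),
)
```

The n-gram generalization mentioned in the same comment breaks this invariance, since the probability of each word then depends on the words before it.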

Patrick said...

The use of individual words to represent a document for retrieval purposes probably goes back to the advent of movable type. In Western civilization this means going back to the mid-15th century. One late 16th century work has a rather complete term index, ordered neither alphabetically nor even by order of appearance. In China movable type appeared in the 11th century. It wouldn't be too difficult to imagine that indices were generated and employed in China in the wake of the invention of movable type there.

Cryptography may indeed be another route to explore the early history of bag-of-words representations. To my (admittedly scant) understanding, most Western ciphers, bound as they are to alphabet-type texts, operated on individual characters. Hence there would be little in the way of representations utilizing word-to-word mappings rather than character-to-character mappings. However, pictogram-based languages may hold promise for finding some bag-of-words representation for a document that precedes the invention of movable type.
