31 January 2010

Coordinate ascent and inverted indices...

Due to a small off-the-radar project I'm working on right now, I've been building my own inverted indices. (Yes, I'm vaguely aware of discussions in DB/Web search land about whether you should store your inverted indices in a database or whether you should handroll your own. This is tangential to the point of this post.)

For those of you who don't remember your IR 101, here's the deal with inverted indices. We're a search engine and want to be able to quickly find pages that contain query terms. One way of storing our set of documents (e.g., the web) is to store a list of documents, each of which is a list of words appearing in that document. If there are N documents of length L, then answering a query is O(N*L) since we have to look over each document to see if it contains the word we care about. The alternative is to store an inverted index, where we have a list of words and for each word, we store the list of documents it appears in. Answering a query here is something like O(1) if we hash the words, O(log |V|) if we do binary search (V = vocabulary), etc. Why it's called an inverted index is beyond me: it's really just like the index you find at the back of a textbook. And the computation difference is like trying to find mentions of "Germany" in a textbook by reading every page and looking for "Germany" versus going to the index in the back of the book.
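To make the two layouts concrete, here's a tiny sketch (the documents and words are made up purely for illustration): the forward layout forces a scan over every document, while the inverted layout answers the same query with a single lookup.

    forward_index = {                  # doc -> words it contains (O(N*L) to query)
        "doc1": ["germany", "france"],
        "doc2": ["spain", "italy"],
        "doc3": ["germany", "spain"],
    }

    inverted_index = {                 # word -> docs it appears in (O(1) with hashing)
        "germany": ["doc1", "doc3"],
        "france":  ["doc1"],
        "spain":   ["doc2", "doc3"],
        "italy":   ["doc2"],
    }

    def query_forward(word):
        # scan every document -- O(N*L)
        return [d for d, words in forward_index.items() if word in words]

    def query_inverted(word):
        # one hash (or binary-search) lookup
        return inverted_index.get(word, [])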

Now, let's say we have an inverted index for, say, the web. It's pretty big (and in all honesty, probably distributed across multiple storage devices or multiple databases or whatever). But regardless, a linear scan of the index would give you something like: here's word 1 and here are the documents it appears in; here's word 2 and its documents; ...; here's word |V| and its documents.

Suppose that, outside of the index, we have a classification task over the documents on the web. That is, for any document, we can (efficiently -- say O(1) or O(log N)) get the "label" of this document. It's either +1, -1 or ? (? == unknown, or unlabeled).

My argument is that this is a very plausible setup for a very large-scale problem.

Now, if we're trying to solve this problem, doing a "common" optimization like stochastic (sub)gradient descent is just not going to work, because it would require us to iterate over documents rather than iterating over words (where I'm assuming words == features, for now...). That would be ridiculously expensive.
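To see why, here's a rough sketch of a single stochastic (sub)gradient step for a linear classifier with hinge loss (forward_index and get_label are hypothetical stand-ins, not anything from above): the update needs the full word list of one sampled document, i.e., a row of the document-by-word matrix, which is exactly the access pattern the inverted index does not give you cheaply.

    import random

    def sgd_step(w, forward_index, get_label, lr=0.1):
        # One stochastic subgradient step for hinge loss on a bag-of-words model.
        # Note the access pattern: we need *all* the words of one document.
        doc = random.choice(list(forward_index))
        y = get_label(doc)                 # +1, -1, or None (unlabeled)
        if y is None:
            return w
        words = forward_index[doc]
        margin = y * sum(w.get(t, 0.0) for t in words)
        if margin < 1.0:                   # hinge loss is active
            for t in words:
                w[t] = w.get(t, 0.0) + lr * y
        return w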

The alternative is to do some sort of coordinate ascent algorithm. These actually used to be quite popular in maxent land, and, in particular, Joshua Goodman had a coordinate ascent algorithm for maxent models that apparently worked quite well. (In searching for that paper, I just came across a 2009 paper on roughly the same topic that I hadn't seen before.)
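A coordinate-wise update, by contrast, matches the index layout exactly: one linear scan hands you each word together with precisely the documents needed to update that word's weight. Here's a hedged sketch of one such pass (my own illustration using a squared-loss update, not Goodman's algorithm; scan_index, get_label, and the per-document score cache are all assumptions):

    def coordinate_pass(scan_index, get_label, w, scores, lr=0.1):
        # scan_index yields (word, postings) pairs -- one linear pass over the
        # inverted index.  scores[d] caches the current model score of doc d
        # and is kept consistent as coordinates change.
        for word, postings in scan_index:
            labeled = [(d, get_label(d)) for d in postings]
            labeled = [(d, y) for d, y in labeled if y is not None]
            if not labeled:
                continue
            # gradient of squared loss with respect to this single coordinate
            grad = sum(y - scores.get(d, 0.0) for d, y in labeled) / len(labeled)
            delta = lr * grad
            w[word] = w.get(word, 0.0) + delta
            # changing one weight only affects docs that contain this word
            for d in postings:
                scores[d] = scores.get(d, 0.0) + delta
        return w

The obvious catch is the per-document score cache, which is O(N) memory; but each coordinate update only touches the postings list of the word just changed, so the access pattern stays index-friendly.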

Some other algorithms have a coordinate ascent feel, for instance the LASSO (and relatives, including the Dantzig selector + LASSO = DASSO), but they wouldn't really scale well in this problem because they would require a full pass over the entire index to make a single update. Other approaches, such as boosting, etc., would fare very poorly in this setting.

This observation first led me to wonder if we can do something LASSO- or boosting-like in this setting. But then that made me wonder if this is a special case, or if there are other cases in the "real world" where your data is naturally laid out as features * data points rather than data points * features. Sadly, I cannot think of any. But perhaps that's not because there aren't any.

(Note that I also didn't really talk about how to do semi-supervised learning in this setting... this is also quite unclear to me right now!)

8 comments:

  1. I'm sorry if this sounds silly, but wouldn't naive bayes be ridiculously easy to train in this setting (even EM for semi-supervised naive bayes)?

  2. Madigan and Lewis's Bayesian Multinomial Regression (BMR) package uses coordinate ascent with L1 priors, and is pretty scalable. They seem to have added hierarchical models since I last checked, and they allow very flexible priors with means and variances specified by dimension.

    More practically speaking, who builds a reverse index and throws away the original docs? Why do you want to train coordinate-wise?

P.S. If you think of the natural indexing as a relation R(doc,term) between a doc and its terms, it naturally corresponds to a function from a doc to its set of terms. If you look at the usual definition of relation inverse, you get R^{-1}(term,doc) = R(doc,term), which corresponds to a map from terms to the set of docs in which they occur.

    P.P.S. Apache Lucene's inverted indexes are flexible, fast, and easy to use, at least from Java. Lucene uses a byte-based variable length encoding of document id sequences, and can store and compute with offset info as well as simple occurrence. If there's no dynamic doc insertion/deletion, building your own reverse indices is really easy. The Managing Gigabytes book has a reasonable discussion of these issues.
