31 January 2010

Coordinate ascent and inverted indices...

Due to a small off-the-radar project I'm working on right now, I've been building my own inverted indices. (Yes, I'm vaguely aware of discussions in DB/Web search land about whether you should store your inverted indices in a database or whether you should handroll your own. This is tangential to the point of this post.)

For those of you who don't remember your IR 101, here's the deal with inverted indices. We're a search engine and want to be able to quickly find pages that contain query terms. One way of storing our set of documents (e.g., the web) is to store a list of documents, each of which is a list of words appearing in that document. If there are N documents of length L, then answering a query is O(N*L) since we have to look over each document to see if it contains the word we care about. The alternative is to store an inverted index, where we have a list of words and for each word, we store the list of documents it appears in. Answering a query here is something like O(1) if we hash the words, O(log |V|) if we do binary search (V = vocabulary), etc. Why it's called an inverted index is beyond me: it's really just like the index you find at the back of a textbook. And the computation difference is like trying to find mentions of "Germany" in a textbook by reading every page and looking for "Germany" versus going to the index in the back of the book.
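To make the two layouts concrete, here's a tiny sketch (the documents and words are made up purely for illustration): the forward layout forces a scan over every document, while the inverted layout answers the same query with a single lookup.

    forward_index = {                  # doc -> words it contains (O(N*L) to query)
        "doc1": ["germany", "france"],
        "doc2": ["spain", "italy"],
        "doc3": ["germany", "spain"],
    }

    inverted_index = {                 # word -> docs it appears in (O(1) with hashing)
        "germany": ["doc1", "doc3"],
        "france":  ["doc1"],
        "spain":   ["doc2", "doc3"],
        "italy":   ["doc2"],
    }

    def query_forward(word):
        # scan every document -- O(N*L)
        return [d for d, words in forward_index.items() if word in words]

    def query_inverted(word):
        # one hash (or binary-search) lookup
        return inverted_index.get(word, [])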

Now, let's say we have an inverted index for, say, the web. It's pretty big (and in all honesty, probably distributed across multiple storage devices or multiple databases or whatever). But regardless, a linear scan of the index would give you something like: here's word 1 and here are the documents it appears in; here's word 2 and its documents; ...; here's word |V| and its documents.

Suppose that, outside of the index, we have a classification task over the documents on the web. That is, for any document, we can (efficiently -- say O(1) or O(log N)) get the "label" of this document. It's either +1, -1 or ? (? == unknown, or unlabeled).

My argument is that this is a very plausible setup for a very large-scale problem.

Now, if we're trying to solve this problem, doing a "common" optimization like stochastic (sub)gradient descent is just not going to work, because it would require us to iterate over documents rather than iterating over words (where I'm assuming words == features, for now...). That would be ridiculously expensive.
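To see why, here's a rough sketch of a single stochastic (sub)gradient step for a linear classifier with hinge loss (forward_index and get_label are hypothetical stand-ins, not anything from above): the update needs the full word list of one sampled document, i.e., a row of the document-by-word matrix, which is exactly the access pattern the inverted index does not give you cheaply.

    import random

    def sgd_step(w, forward_index, get_label, lr=0.1):
        # One stochastic subgradient step for hinge loss on a bag-of-words model.
        # Note the access pattern: we need *all* the words of one document.
        doc = random.choice(list(forward_index))
        y = get_label(doc)                 # +1, -1, or None (unlabeled)
        if y is None:
            return w
        words = forward_index[doc]
        margin = y * sum(w.get(t, 0.0) for t in words)
        if margin < 1.0:                   # hinge loss is active
            for t in words:
                w[t] = w.get(t, 0.0) + lr * y
        return w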

The alternative is to do some sort of coordinate ascent algorithm. These actually used to be quite popular in maxent land, and, in particular, Joshua Goodman had a coordinate ascent algorithm for maxent models that apparently worked quite well. (In searching for that paper, I just came across a 2009 paper on roughly the same topic that I hadn't seen before.)
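A coordinate-wise update, by contrast, matches the index layout exactly: one linear scan hands you each word together with precisely the documents needed to update that word's weight. Here's a hedged sketch of one such pass (my own illustration using a squared-loss update, not Goodman's algorithm; scan_index, get_label, and the per-document score cache are all assumptions):

    def coordinate_pass(scan_index, get_label, w, scores, lr=0.1):
        # scan_index yields (word, postings) pairs -- one linear pass over the
        # inverted index.  scores[d] caches the current model score of doc d
        # and is kept consistent as coordinates change.
        for word, postings in scan_index:
            labeled = [(d, get_label(d)) for d in postings]
            labeled = [(d, y) for d, y in labeled if y is not None]
            if not labeled:
                continue
            # gradient of squared loss with respect to this single coordinate
            grad = sum(y - scores.get(d, 0.0) for d, y in labeled) / len(labeled)
            delta = lr * grad
            w[word] = w.get(word, 0.0) + delta
            # changing one weight only affects docs that contain this word
            for d in postings:
                scores[d] = scores.get(d, 0.0) + delta
        return w

The obvious catch is the per-document score cache, which is O(N) memory; but each coordinate update only touches the postings list of the word just changed, so the access pattern stays index-friendly.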

Some other algorithms have a coordinate ascent feel, for instance the LASSO (and relatives, including the Dantzig selector + LASSO = DASSO), but they wouldn't really scale well in this problem because they would require a full pass over the entire index to make a single update. Other approaches, such as boosting, etc., would fare very poorly in this setting.

This observation first led me to wonder if we can do something LASSO- or boosting-like in this setting. But then that made me wonder if this is a special case, or if there are other cases in the "real world" where your data is naturally laid out as features * data points rather than data points * features. Sadly, I cannot think of any. But perhaps that's not because there aren't any.

(Note that I also didn't really talk about how to do semi-supervised learning in this setting... this is also quite unclear to me right now!)

8 comments:

  1. I'm sorry if this sounds silly, but wouldn't naive bayes be ridiculously easy to train in this setting (even EM for semi-supervised naive bayes)?

  2. Madigan and Lewis's Bayesian Multinomial Regression (BMR) package uses coordinate ascent with L1 priors, and is pretty scalable. They seem to have added hierarchical models since I last checked, and they allow very flexible priors with means and variances specified by dimension.

    More practically speaking, who builds a reverse index and throws away the original docs? Why do you want to train coordinate-wise?

P.S. If you think of the natural indexing as a relation R(doc,term) between a doc and its terms, it naturally corresponds to a function from a doc to its set of terms. If you look at the usual definition of relation inverse, you get R^{-1}(term,doc) = R(doc,term), which corresponds to a map from terms to the set of docs in which they occur.

    P.P.S. Apache Lucene's inverted indexes are flexible, fast, and easy to use, at least from Java. Lucene uses a byte-based variable length encoding of document id sequences, and can store and compute with offset info as well as simple occurrence. If there's no dynamic doc insertion/deletion, building your own reverse indices is really easy. The Managing Gigabytes book has a reasonable discussion of these issues.
