The problem is that the full covariance estimation is computationally really expensive. Not only do we have to play with O(D^2) parameters (D is the dimensionality), but we also have to perform complex operations on the data that typically scale at least as O(N^2) (N Is the number of data points).
This is incredibly frustrating. We have the data to support a complex statistical analysis, but we don't have the computation time to actually perform the analysis. So we either throw out data to get the computation time down and do something more complex (which may now not be supported by the data) or, more often than not, do something simple on the large data set. Now, there is often nothing wrong with doing something simple, but if we cannot even try to do things that are more complex, then it's hard to say for sure whether simple is enough.
So then the question is: how can we scale. I only know a handful of answers to this question, but maybe other people can contribute some.
- Get a job at Google and just use a billion machines (and/or some really clever Google engineers, ala the Google SMT system). This is obviously not a very satisfying option for everyone.
- Subsample the data. This is also not very satisfying (and, perhaps, even worse than the first option).
- Use a randomized algorithm, such as what Deepak did in his thesis. The message here is that if your complexity hinges on pairwise computations that look something like distance metrics, you can introduce randomization and do this in something like O(N) rather than O(N^2) time.
- Use smart data structures. Things like kd-trees are becoming increasingly popular in the ML community for solving pairwise problems. The idea is to recursively divide your data space (in an intelligent fashion) so that you can store sufficient statistics about what's under a node at that node itself. (I think one reason these haven't taken off in NLP is that they appear at first glance to be much better suited to real-valued mid-dimensional data, rather than sparse, discrete, super-high-dimensional data...is there an alternative for us?)
Ironically, if you look around the autonlab (you're link to kd-trees) you'll find AD-trees which are an extension of kd-trees well suited for "sparse, discrete super-high-dimensional data". And within a few weeks I'll be proposing a thesis on the topic of expanding and exploring the possibilities of AD-trees (which is not to say that they're not already quite useful).
ReplyDeleteThat's fantastic! Thanks for the pointer.
ReplyDelete酒店經紀PRETTY GIRL 台北酒店經紀人 ,禮服店 酒店兼差PRETTY GIRL酒店公關 酒店小姐 彩色爆米花酒店兼職,酒店工作 彩色爆米花酒店經紀, 酒店上班,酒店工作 PRETTY GIRL酒店喝酒酒店上班 彩色爆米花台北酒店酒店小姐 PRETTY GIRL酒店上班酒店打工PRETTY GIRL酒店打工酒店經紀 彩色爆米花
ReplyDelete