## 10 October 2006

### Scaling and Data

In NLP, we often live in the idealized learning world where we have more data than we really know what to do with. The oft-cited Banko + Brill results are perhaps extreme in this regard (in the sense that we rarely have quite that much data), but we certainly have far more than most fields. The great thing about having lots of data is that large data sets support complex statistical analysis. As a stupid example, consider estimating a Gaussian. We estimate the mean and covariance (or generate a posterior over these quantities, if you prefer to be Bayesian). In a small data setting, we'd almost always approximate the Gaussian by either a diagonal, or constant diagonal covariance matrix. Especially if the number of data points is less than the number of dimensions (true Bayesians might not do this, but this is probably tangengtial). But if we have billions of data points, there's likely enough information in there to reliably estimate quite a few parameters (or approximate their posteriors) and we can do the full covariance matrix estimation.

The problem is that the full covariance estimation is computationally really expensive. Not only do we have to play with O(D^2) parameters (D is the dimensionality), but we also have to perform complex operations on the data that typically scale at least as O(N^2) (N Is the number of data points).

This is incredibly frustrating. We have the data to support a complex statistical analysis, but we don't have the computation time to actually perform the analysis. So we either throw out data to get the computation time down and do something more complex (which may now not be supported by the data) or, more often than not, do something simple on the large data set. Now, there is often nothing wrong with doing something simple, but if we cannot even try to do things that are more complex, then it's hard to say for sure whether simple is enough.

So then the question is: how can we scale. I only know a handful of answers to this question, but maybe other people can contribute some.

1. Get a job at Google and just use a billion machines (and/or some really clever Google engineers, ala the Google SMT system). This is obviously not a very satisfying option for everyone.
2. Subsample the data. This is also not very satisfying (and, perhaps, even worse than the first option).
3. Use a randomized algorithm, such as what Deepak did in his thesis. The message here is that if your complexity hinges on pairwise computations that look something like distance metrics, you can introduce randomization and do this in something like O(N) rather than O(N^2) time.
4. Use smart data structures. Things like kd-trees are becoming increasingly popular in the ML community for solving pairwise problems. The idea is to recursively divide your data space (in an intelligent fashion) so that you can store sufficient statistics about what's under a node at that node itself. (I think one reason these haven't taken off in NLP is that they appear at first glance to be much better suited to real-valued mid-dimensional data, rather than sparse, discrete, super-high-dimensional data...is there an alternative for us?)
There may be other general solutions, but I'm not aware of them. As it stands, with the exception of Deepak and few others, the solution appears to be basically to hire a bunch of smart people (and/or grad students) to do lots of engineering. But I'd prefer general principles.

Rob Van Dam (BYU) said...

Ironically, if you look around the autonlab (you're link to kd-trees) you'll find AD-trees which are an extension of kd-trees well suited for "sparse, discrete super-high-dimensional data". And within a few weeks I'll be proposing a thesis on the topic of expanding and exploring the possibilities of AD-trees (which is not to say that they're not already quite useful).

hal said...

That's fantastic! Thanks for the pointer.

. said...

Oes Tsetnoc one of the ways in which we can learn seo besides Mengembalikan Jati Diri Bangsa. By participating in the Oes Tsetnoc or Mengembalikan Jati Diri Bangsa we can improve our seo skills. To find more information about Oest Tsetnoc please visit my Oes Tsetnoc pages. And to find more information about Mengembalikan Jati Diri Bangsa please visit my Mengembalikan Jati Diri Bangsa pages. Thank you So much.

seldamuratim said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..
sesli sohbetsesli chatkamerali sohbetseslisohbetsesli sohbet sitelerisesli chat siteleriseslichatsesli sohpetseslisohbet.comsesli chatsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet
seslisohbetsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet

DiSCo said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
kameralı sohbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
seslidunya
seslisehir
sesli sex

Sesli Chat said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
sesli muhabbet siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
görüntülü muhabbet siteleri
kameralı sohbet siteleri
kameralı chat siteleri
kameralı muhabbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
birsesver
birses
seslidunya
seslisehir
sesli sex