## 07 September 2010

### Manifold Assumption versus Margin Assumption

[This post is based on some discussions that came up while talking about manifold learning with Ross Whitaker and Sam Gerber, who had a great manifold learning paper at ICCV last year.]

There are two assumptions that are often used in statistical learning (both theory and practice, though probably more of the latter), especially in the semi-supervised setting.  Unfortunately, they're incompatible.

The margin assumption states that your data are well separated.  Usually it's in reference to linear, possibly kernelized, classifiers, but that need not be the case.  As most of us know, there are lots of other assumptions that boil down to the same thing, such as the low-weight-norm assumption, or the Gaussian prior assumption.  At the end of the day, it means your data looks like what you have on the left, below, not what you have on the right.
The manifold assumption that is particularly popular in semi-supervised learning, but also shows up in supervised learning, says that your data lie on a low dimensional manifold embedded in a higher dimensional space.  One way of thinking about this is saying that your features cannot covary arbitrarily, but the manifold assumption is quite a bit stronger.  It usually assumes a Reimannian (i.e., locally Euclidean) structure, with data points "sufficiently" densely sampled.  In other words, life looks like the left, not the right, below:
Okay, yes, I know that the "Bad" one is a 2D manifold embedded in 2D, but that's only because I can't draw 3D images :).  And anyway, this is a "weird" manifold in the sense that at one point (where the +s and -s meet), it drops down to 1D.  This is fine in math-manifold land, but usually not at all accounted for in ML-manifold land.

The problem, of course, is that once you say "margin" and "manifold" in the same sentence, things just can't possibly work out.  You'd end up with a picture like:

This is fine from a margin perspective, but it's definitely not a (densely sampled) manifold any more.

In fact, almost by definition, once you stick a margin into a manifold (which is okay, since you'll define margin Euclideanly, and manifolds know how to deal with Euclidean geometry locally), you're hosed.

So I guess the question is: who do you believe?

#### 9 comments:

1. I dont agree that the two assumptions are incompatible.

The margin assumption rises from a particular sampling scheme on the manifold.

For example, take the USPS digits dataset that is so often used; the set of all possible handwritten symbols form the manifold, and the sampling (9 digits) form the margin.

There are tradeoffs between regularizing for margins and for smoothness, but regardless it is misleading to claim these two are incompatible.

2. Hi, I have a kind of "meta question": What are good references (e.g. books, articles, etc) for a beginning ML/DM student to gain (mostly) math background on the concepts discussed in this post? For example manifold comes for topology, right? Any good and preferably short introduction to that? Also what about "dense sampling"?

thank you very much!
Stelios

3. I agree with anonymous: using manifold regularization in max-margin classifiers makes sense and shows improvements over either of the methods alone in multiple datasets I've used.

4. I think there are similarities between the two assumptions at a deeper level: both say that labels are locally smooth. The difference is space at which this operates (Euclidean vs Reimannian). Perhaps your paradox comes from the idea that the manifold should be "densely sampled"--I'm not sure about this point, what exactly does this mean?

In practice, it's well known what assumption you ought to use depends a lot on dataset. Chapelle's semi-supervised learning book has a nice experimental comparison suggesting that image data seems more manifold-like, and text data seems more cluster-like.

I guess it won't be too hard to test your hypothesis about whether the two are incompatible: (1) run some manifold learning algorithm to embed the data (e.g. image) in low dimensional Euclidean space. (2) run a margin classifier like Transductive SVM. If it performs horribly compared to a directly applying TSVM or Graph-SSL algorithm, then that might suggest they are incompatible.

5. I'm not sure I understand. If you can't maximize some margin-like quantity given the manifold, what is it good for? I thought manifold learning methods modeled each class as its own manifold, and penalize class boundaries in dense regions. The "Analysis of Representations for Domain Adaptation" paper, by Ben-david, Blitzer, Crammer and Pereira even suggests that an optimal strategy for DA looks like a mixture between fitting the source and target data in the same manifold and maximizing the margin between the classes.

6. Oops -- I posted this, then got sick, so couldn't respond...

@Anonymous: I guess I don't really follow. You basically seem to be arguing "there is a manifold, but then we sample but don't sample near the margin" but then I don't know what it means for "there [to be] a manifold"? Our data is our only access to whatever there "is" and if it's not sampled near the margin, then I have to say that there's nothing near the margin.

Following the same logic I could say something like "no there's no manifold, it fills up the entirety of R^d, it's just that when we sample we don't sample everywhere." It seems tautological to me.

(Plus, "the set of all possible handwritten symbols form a manifold" is a very strong statement, and I think you'd have a hard time backing that up.)

That said, I think the main issue with manifolds that I wanted to talk about was the dense sampling on a single manifold issue. This is required to make most of the theory I know go through (eg., the ISOMAP theory), and is also required in practice otherwise you'll get a meaningless nearest neighbor graph. But if you're densely sampled, and on a single manifold, then there absolutely should be no margin.

@Anonymous2: There really isn't anything :). But, people who do manifold learning almost never use anything complicated about topology. In fact, the fact that the thing is even a manifold (eg., and has local tangent spaces) is almost never used!

@Kevin: Yeah, you're right. Densely sampled means roughly that anywhere there is manifold, you'll find "enough" data points, where "enough" has to do with the noise rate.

@Kevin @Alexandre: Yes, of course these things work in practice :). I guess what I'm saying is that saying that you're using a "densely sampled manifold" comes with a lot of hidden assumptions, which I don't think are ever realized. So the question is: can we pare down these assumptions to something that we might actually believe about our data?

7. To realize the benefits of the manifold assumption, isn't it enough to assume that each class belongs to a separate manifold? And in that case, there is not necessarily a contradiction with the margin assumption, right?

8. Although I see your point, I wonder whether the main problem isn't simply that the manifold assumption is not very likely to hold in many real-world data sets? In particular, the amount of data we need to have a densely sampled manifold grows exponentially with the intrinsic dimensionality of that data. I once saw a study that estimated the intrinsic dimensionality of the space of facial appearance to be about 100, so it seems unlikely we will ever have a data set that densely samples that manifold. Moreover, don't quite a few data sets constitute (fairly) widely separated clusters?

@Anonymous: I am fairly convinced the space of handwritten digits does not form a single manifold, but a bunch of widely separated ones. See, e.g., the t-SNE visualization of the MNIST data at http://homepage.tudelft.nl/19j49/t-SNE_files/mnist_large.jpg

9. I think the ideas in the paper "Inference in the Universum" will give you a way to seamlessly combine the two. The key idea is you may have a fairly densely sampled low-d manifold, but only some of the points on it can actually be classified. The rest of the points are just "background" without a ground-truth label. That paper talks about how to do a large-margin classification where the margin is in respect to the manifold space.