12 April 2010

How I teach machine learning

I've had discussions about this with tons of people, and it seems like my approach is fairly odd. So I thought I'd blog about it because I've put a lot of thought into it over the past four offerings of the machine learning course here at Utah.

At a high level, if there is one thing I want students to remember after the semester is over, it's the idea of generalization and how it relates to function complexity. That's it. Now, more operationally, I'd like them to learn SVMs (and kernels) and EM for generative models.

In my opinion, the whole tenor of the class is set by how it starts. Here's how I start.
  1. Decision trees. No entropy. No mutual information. Just decision trees based on classification accuracy. Why? Because the point isn't to teach them decision trees. The point is to get as quickly as possible to where we can talk about things like generalization and function complexity. Why decision trees? Because EVERYONE gets them. They're so intuitive. And analogies to 20 questions abound. We also talk about the whole notion of data being drawn from a distribution and what it means to predict well in the future. (A minimal split-by-accuracy sketch appears right after this list.)

  2. Nearest neighbor classifiers. No radial basis functions, no locally weighted methods, etc. Why? Because I want to introduce the idea of thinking of data as points in high dimensional space. This is a big step for a lot of people, and one that takes some getting used to. We then do k-nearest neighbor and relate it to generalization, overfitting, etc. The punch line of this section is the idea of a decision boundary and the complexity of decision boundaries. (A tiny k-NN sketch also appears below.)

  3. Linear algebra and calculus review. At this point, they're ready to see why these things matter. We've already hinted at learning as some sort of optimization (via decision trees) and data in high dimensions, hence calculus and linear algebra. Note: no real probability here.

  4. Linear classifiers as methods for directly optimizing a decision boundary. We start with 0-1 loss and then move to perceptron. Students love perceptron because it's so procedural. (A perceptron sketch appears below as well.)
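
To make the "accuracy only" starting point concrete, here is roughly the kind of split rule I have in mind; a minimal sketch assuming binary features in NumPy arrays and 0/1 labels, not the code I actually hand out:

```python
import numpy as np

def majority_accuracy(y):
    """Training accuracy of predicting the majority label on y."""
    if len(y) == 0:
        return 1.0
    return max(np.mean(y == 1), np.mean(y == 0))

def best_split(X, y):
    """Pick the binary feature whose split most improves training accuracy."""
    best_feat, best_acc = None, majority_accuracy(y)
    for j in range(X.shape[1]):
        left, right = y[X[:, j] == 0], y[X[:, j] == 1]
        # accuracy if we predict the majority label on each side of the split
        acc = (len(left) * majority_accuracy(left) +
               len(right) * majority_accuracy(right)) / len(y)
        if acc > best_acc:
            best_feat, best_acc = j, acc
    return best_feat, best_acc
```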
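
The k-nearest-neighbor step is just as small, and it makes the "data as points in high-dimensional space" idea tangible (again, an illustrative sketch rather than assignment code):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Label x by a majority vote over its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]
```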
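
And the perceptron, whose mistake-driven update is exactly the procedural flavor students like (same caveat: a sketch under simple assumptions, with labels in {-1, +1}):

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Mistake-driven perceptron training; X has one example per row."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) <= 0:   # mistake: move the boundary toward x_i
                w += y_i * x_i
                b += y_i
    return w, b
```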

The rest follows much like almost any other machine learning course out there. But IMO these first four days are crucial. I've tried (in the past) starting with linear regression or linear classification, and it's just a disaster: you spend too much time talking about unimportant stuff. The intro with error-based decision trees moving to kNN is amazingly useful.

The sad thing is that there are basically no books that follow any order even remotely like this. Except...drum roll... it's actually not far from what Mitchell's book does. Except he does kNN much later. It's really depressing how bad most machine learning books are from a pedagogical perspective... you'd think that in 12 years someone would have written something that works better.

On top of that, the most recent time I taught ML, I structured everything around recommender systems. You can actually make it all work, and it's a lot of fun. We did recommender systems for classes here at the U (I had 90-odd students from AI the previous semester fill out ratings on classes they'd taken in the past). The data was a bit sparse, but I think it was a lot of fun.

The other thing I changed most recently, and that I'm very happy with, is a full project on feature engineering. (It ties in to the course recommender system idea.) Why? Because most people who take ML, if they ever use it at all, will need to do this. It's maybe one of the most important things that they'll have to learn. We should try to teach it. Again, something that no one ever talks about in books.

Anyway, that's my set of tricks. If you have some that you particularly like, feel free to share!

20 comments:

  1. Hal, thanks for the great post! I really like this way of teaching, which balances nicely between practical use and theoretical reasoning.

    May I ask about the programming language you recommended in the class? Have you chosen a more general language like Java or Python, or a numerical computing environment such as Matlab? I think for an undergraduate class this is a very important question. In our school it is currently done with Java+Weka, but recently I have come to believe that Matlab makes more sense and lets students gain more insight into the algorithms. Any opinion?

  2. Weiwei: I think weka has quite a high barrier to entry if your main objective is understanding and implementing learning algorithms; it's hard to see the point of their complicated class hierarchy before one has tried to solve many different problems. I'm partial to matlab and python+numpy because, by focusing on the linear algebra side of things, they can help students move to a more abstract understanding of what's going on. Matlab has the great advantage of being the de facto standard for quick-and-dirty implementations found on the web, which might come in handy later in life if anyone follows a career in ml.

  3. Weiwei: definitely not Java+Weka. I used to use matlab (for the reasons Top gives), but now I use Python+NumPy. The main reason I changed was that doing feature engineering for the recommender system in matlab was really unpleasant (it was largely text-based) and Python was much nicer.

  4. I also hate that matlab isn't free, and spent a huge amount of time making all my matlab scripts for class Octave compatible, which was no fun. Plus plotting in Octave is less than great.

  5. Hal, are you going to write a book on this anytime soon? Sounds like a good project while you're driving cross country this summer. Ha ha.

  6. @Hal @Top Thanks for the tips! As you said, Matlab is handy but expensive. Besides, I guess using open-source languages is also beneficial to students' future career paths: after all, not many companies develop real-world applications in Matlab. It seems that Python+NumPy is a good way to go. I will keep that in mind. Thanks!

  7. Very cool!

    Yeah, talking about entropy and mutual information seems to waste too much time and be largely orthogonal to the class. The only advantage is that it's in the book (Mitchell).

    I am also not sure if all my students really "got" the importance of generalization as much as they should have. Perhaps underlining it as THE basic concept is the way to go.

    As for programming: I allowed them to code in whatever language they wanted, and it was a bit of a disaster (for one, I couldn't really follow what some students were doing when they called weird libraries in languages I don't use).

    I will probably require them all to learn Matlab/Octave the next time since I am very familiar with Matlab. The non-freeness of Matlab and the differences between Matlab and Octave are definitely annoying, but it still seems like the least painful option for now.

  8. Hi! What do you think about using R for teaching machine learning? I used it for a data mining class and it fared well, but the course was oriented more towards application (data analysis), rather than implementation.

    I'd also note that "statistical learning" books start with linear vs. k-NN classifiers and the discussion from your point 2 (decision trees seem more ML-ish).

  9. Very cool!

    As I come to ML from a more NLP background, I'd also add a simple MLE probabilistic "classifier" before the decision trees (i.e. "choose the most probable class"). This is very intuitive, and sets the ground for HMMs later.

    Also, perceptrons can come before kNN: you can treat the feature vectors as just "feature collections" when doing perceptron and everything still works. Then discuss the vector representation, linear classification, and the fact that other vector-space models are possible (perhaps also touching on the duality of the perceptron).

    For implementation language, I definitely support the python+numpy duo. I would supplement this with IPython as a nifty interactive python shell (which also allows for easy online plotting), and the CVXOPT package for convex optimization (which makes implementing SVMs a pretty easy assignment).
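
    (For concreteness, the kind of thing I have in mind is roughly the sketch below: the dual of a linear soft-margin SVM handed straight to CVXOPT's QP solver. It's just an illustrative sketch, not a worked assignment solution.)

```python
import numpy as np
from cvxopt import matrix, solvers

def linear_svm_dual(X, y, C=1.0):
    """Train a linear soft-margin SVM by solving its dual QP with CVXOPT; y is +1/-1."""
    n = len(y)
    K = np.dot(X, X.T)                                  # linear kernel; swap in any Gram matrix
    P = matrix((np.outer(y, y) * K).astype(float))
    q = matrix(-np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))      # encodes 0 <= alpha_i <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.astype(float).reshape(1, n))           # encodes sum_i alpha_i y_i = 0
    b = matrix(0.0)
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
    w = np.dot(alpha * y, X)                            # recover the primal weight vector
    sv = alpha > 1e-6                                   # support vectors have nonzero alpha
    bias = np.mean(y[sv] - np.dot(X[sv], w))
    return w, bias
```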

    Hal, given this great introductory sequence, I am curious: how do you go about explaining EM?

  10. @luk: I don't know R :).

    @yoav: I *very* intentionally do NOT do probabilistic classification until later. This is because probability scares most students and they don't really get it. I remember taking AI as an undergrad and being lost with naive Bayes.

    @yoav: what you say about perceptron is actually exactly what I do... we talk about perceptron as feature weights, and then ask "what does the decision boundary look like" (just like we did in kNN) and then see that it's linear, blah blah blah and then enter linear algebra.

  11. I would have thought of teaching machine learning following "Collective Intelligence" (http://amzn.to/bHSb2k). It uses Python, is concise and to the point, and shows just the right amount of information.

  12. @boris

    collective intelligence is a great book. But I wouldn't teach a CS ML class based on it -- not enough foundations. It would be a good candidate for a more applicative data-mining / data analysis course (and you would still need to supplement it with some newer stuff like SVMs, which, if I remember correctly, are not covered).

  13. Great ideas, Hal.

    I agree with your point about feature engineering. In my NLP course I emphasize its role in the process of designing good models. All of my labs require some error analysis, and two of them require feature engineering specifically. Some students come to the course with machine learning experience, and some do not, but all seem to enjoy the idea of bringing their knowledge and insights to bear in a machine learning setting. Students also benefit from discussing questions about empiricism versus rationalism in this context.

  14. I'd be happy to be your writing peon. I have always been interested in ML from a pedagogical perspective.

  15. Great post! Thanks :-) I have yet to see an ML book that does not scare people. I wonder if the problem lies with the lack of adequate visualization and correlation tools. For example, I would love to create a single example which I solve using the common ML techniques, demonstrate the effect of different params, and explain what works / does not work and why!

  16. Hal, nice and interesting post... However, I think that linear classifiers should be introduced with their geometric interpretation in view. As far as probabilistic classification is concerned, wouldn't it be great to use graphical models, as they seem to make more sense? I attended a winter school where an introduction to HMMs using graphical models made the point more clearly than the traditional treatment... Anyway, great post...
