One of the most fun AHA moments in ML or stats is when you see that for multinomial distributions, your naive idea of relative frequency corresponds (through a bunch of calculus) to maximum likelihood estimation. Ditto means of Gaussians, though that's never quite as amazing because Gaussians seems sort of odd beasts anyway. It's nice when our intuitions match what stats tells us.

I'll often watch a DVD, and then leave it in the DVD player until I want to watch something else. Usually it will return to the menu screen and the timer display will go through 0:00, 0:01, 0:02, ..., until it hits the length of the title screen loop, and then go back to 0:00. What I always wonder when I glance up at it is "what is the actual length of the title screen loop?" That is, what is the highest value it'll count up to.

Being a good statistician, I set up a model. First I play the frequentist game. Since I glance up and pretty arbitrary times, my observations can be modeled as a uniform random variable from 0 to some upper bound U, which is the quantity I want to infer.

Supposing that I only make one observation, say 0:27, I want a maximum likelihood estimate of U. It's easy to see that U=27 is the maximum likelihood solution. Anything less than 27 will render my observation impossible. For any U>=27, my observation has probability 1/(U+1), so the ML solution is precisely U=27. Note that even if I observe five observations a <= b <= c <= d <= e, the ML solution is still U=e. Huh. The math works. The model is as correct as I could really imagine it. But this answer doesn't really seem reasonable to me. Somehow I expect a number greater than my observation to come about.

Perhaps I need a prior. That'll solve it. I'll add a prior p(U) and then do maximum a posteriori. Well, it's easy to see that if p(U) is unimodal, and its mode is less than my (highest) observation, then the MAP solution will still be the (highest) observation. If it's mode is more than my (highest) observation, then the MAP solution will be the mode of p(U). If I think about this a bit, it's hard for me to justify a multi-modal p(U), and also hard for me to be happy with a system in which my prior essentially completely determines my solution.

Or I could be fully Bayesian and look at a posterior distribution p(U | data). This will just be a left-truncated version of my prior, again, not very satisfying.

(Note that the "Maximum Likelihood Sets" idea, which I also like quite a bit, doesn't fix this problem either.)

It also really bugs me that only my largest observation really affects the answer. If I get one hundred observations, most of them centered around 0:10, and then one at 0:30, then I'd guess that 0:30 or 0:31 or 0:32 is probably the right answer. But if I observe a bunch of stuff spread roughly uniformly between 0:00 and 0:30, then I'd be more willing to believe something larger.

I don't really have a solution to this dilemma. Perhaps you could argue that my difficulties arise from the use of a Uniform distribution -- namely, that were I to use another distribution, these problems wouldn't really arise. I don't think this is quite true, as described below:

We actually run in to this problem in NLP fairly often. I observe a billion documents and see, in this 1b documents, 100k unique words, many of which are singletons. But I'm not willing to believe that if I see that document 1,000,000,001, that there won't be a new unseen word. Sure it's less likely that I see a new unique word than in document 1001, where it is almost guaranteed, but there's still a relatively large probability that some new word will show up in that next document.

The whole point is that somehow we expect random samples to be representatives of the underlying distribution. We don't expect them to define the underlying distribution. I don't actually expect to ever see the corner cases, unless I've seen everything else.

UPDATE: The Bayesian approach actually does do something reasonable. Here is some Matlab code for computing posteriors with a uniform prior, and results with an upper bound of 100 and various observations are plotted below: