When I was visiting Singapore, Lan Man gave a talk at NUS right before mine (for extra fun, try to find her web site without using my link!). The work she presented is, for the purposes of this discussion, essentially a feature weighting technique for document classification.
More specifically, let's consider the standard bag of words model for text classification. Here, we have a feature set that is exactly the size of the vocabulary. The easiest thing to do is to use binary features (word present or not), but other things seem to work better. For instance, the number of times the word appears (term frequency, or tf) often works better. This is not so surprising. The problem with just using tf is that words like "the" appear a lot and are somehow less "important." So instead, what we often do is tf*idf, where idf is the "inverse document frequency." There are a few variants of idf (that use logs, etc.), but the basic idea is to count in how many documents each word appears at least once, and then divide the total number of documents by this count. So the idf value is always at least one, and is exactly one only for words that appear in all documents (e.g., "the"). What Lan Man was proposing was essentially a different value to use instead of idf (I don't recall the details, but it was similarly an easy-to-compute function of the corpus). The result was essentially that binary features do worst, tf does better, tf*idf better still, and tf*her_proposal best (on most data sets).
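As a concrete reference point, here is a minimal sketch of the plain, un-logged variant described above; the three-document corpus and all names are made up for illustration:

```python
from collections import Counter

# Toy three-document corpus; idf here is the plain N/df variant described above (no logs).
docs = [
    "the cat sat on the mat".split(),
    "the dog ate the bone".split(),
    "the cat chased the dog".split(),
]

N = len(docs)
df = Counter()                      # in how many documents each word appears
for doc in docs:
    df.update(set(doc))             # count each word at most once per document

idf = {w: N / df[w] for w in df}    # always >= 1; exactly 1 for words in every doc ("the")

def features(doc, weighting="tfidf"):
    tf = Counter(doc)               # term frequency within this document
    if weighting == "binary":
        return {w: 1.0 for w in tf}
    if weighting == "tf":
        return dict(tf)
    return {w: tf[w] * idf[w] for w in tf}   # tf*idf

print(features(docs[0]))   # "the" stays at 2*1.0; rare words like "mat" get boosted to 3.0
```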
The importance of using tf*idf arose in IR, where people started computing cosine similarities between word vectors. Here, downweighting things like "the" was of utmost importance, due to how cosine works. But it makes much less sense in supervised learning. That is, for linear classifiers (which are the norm, and which she used for many of her experiments), it really shouldn't matter if we apply a constant scaling factor to each feature. The learned feature weight itself is what is supposed to determine the importance of a feature. If we learn weights for tf*idf, we could easily convert them into just-as-good weights for tf by multiplying each weight by the corresponding idf value (equivalently, dividing by df, up to a constant).
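To make the "a constant scaling per feature shouldn't matter" point concrete, here is a minimal numpy sketch (the toy counts and idf-like scaling factors are invented) showing that weights learned on tf*idf features can be folded into equivalent weights on raw tf features with identical decision values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy data: rows are documents, columns are vocabulary items (tf counts).
X_tf = rng.integers(0, 4, size=(5, 8)).astype(float)
idf = 1.0 + 3.0 * rng.random(8)          # per-word scaling factors (idf-like)
X_tfidf = X_tf * idf                     # rescaling = multiplying by a diagonal matrix

# Pretend these weights were learned by some linear classifier on the tf*idf features.
w_tfidf = rng.normal(size=8)
b = 0.3

# Convert them into weights for raw tf features by folding the scaling into each weight.
w_tf = w_tfidf * idf

# The decision values are identical, so (ignoring regularization) nothing should change.
assert np.allclose(X_tfidf @ w_tfidf + b, X_tf @ w_tf + b)
```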
I can think of a few explanations of this phenomenon, but none is at all satisfying. For instance, by changing the scaling we may be increasing the size of the margin while reducing the maximum vector norm, which, for most generalization bounds, would be a good thing. Another answer is an interplay with regularization: it becomes less "expensive" to have a high overall weight (learned weight times idf) for infrequent, informative words than for frequent, uninformative ones. The problem is that this is a problem! It means that even for applications as simple as text categorization, not only is feature engineering an important issue, but feature weighting seems to be as well.
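To spell the regularization explanation out in symbols (a sketch only; nothing here establishes that this is the right explanation): write the rescaling as a diagonal matrix D = diag(idf_1, ..., idf_V). An L2-regularized linear model trained on the rescaled inputs Dx is the same as one trained on the raw inputs x, but with a per-weight penalty that is weaker for high-idf words:

```latex
% With rescaled inputs x' = D x and effective weights v = D w (so v^\top x = w^\top x'):
\min_{w}\; \sum_i \ell\big(y_i,\, w^\top D x_i\big) + \lambda \|w\|_2^2
\;=\;
\min_{v}\; \sum_i \ell\big(y_i,\, v^\top x_i\big)
         + \lambda \sum_j \frac{v_j^2}{\mathrm{idf}_j^2}.
% Large-idf (rare) words are penalized less, so a big effective weight on an
% infrequent informative word is "cheaper" than the same weight on a frequent one.
```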
(Incidentally, I'm curious whether other fields, e.g., vision, have techniques similar to idf. For instance, one could imagine weighting pixel values by an "idf"-like measure of how often each pixel is on across the dataset. I wonder if this helps. If not, it may be because for smallish images the number of features is small to begin with. I wonder if there is an algorithm design for machine learning that would get around this problem.)
Comments:
Where do we get the conclusion that "machine learning has failed"?
Isn't this just a more flexible vector-generation method than tf*idf?
:)
I simply meant that the purpose of supervised learning (in a linear setting) is to find a weight for each feature that leads to good performance. But we can get better performance by supplying our own weighting factors before learning. This is bizarre and really shouldn't be the case (especially when all the information we use to supply the weights is already there in the data).
Why failed? Priors are important! Feature weighting in SVMs (see the Brank & Leskovec article on how they won KDD Cup 2003 with feature weighting: http://ai.ijs.si/kddcup03/kddcup03-explorations.pdf) is on a par with setting informative priors (see http://www.stat.rutgers.edu/~madigan/PAPERS/sigir06v11.pdf).
Aleks.
So is there a way to come up with a prior on weights such that when the likelihood looks like tf, the posterior looks like tf*idf? It seems that some people who do language modeling for IR must have looked at things like this, but I don't know the answer off the top of my head. If possible, though, this would at least be somewhat satisfying. It still bothers me that applying a very simple linear transformation (a diagonal matrix!) can have such a profound effect on performance, especially when the matrix is a function of the training data (and thus isn't really background "prior" knowledge).
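For the L2-regularized (MAP-with-a-Gaussian-prior) case there is at least a partial answer: giving each tf weight a zero-mean Gaussian prior whose standard deviation is proportional to that word's idf yields exactly the same point estimate (and predictions) as ordinary L2-regularized training on tf*idf features. A minimal numeric sketch using squared loss so there is a closed form; the toy data and names are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy tf matrix (documents x vocabulary), targets, and idf-like scalings.
n, V, lam = 30, 10, 0.5
X_tf = rng.integers(0, 5, size=(n, V)).astype(float)
y = rng.normal(size=n)
idf = 1.0 + 4.0 * rng.random(V)
D = np.diag(idf)

# (a) Standard ridge (L2-regularized least squares) on tf*idf features.
X_tfidf = X_tf @ D
w_scaled = np.linalg.solve(X_tfidf.T @ X_tfidf + lam * np.eye(V), X_tfidf.T @ y)

# (b) Same loss on raw tf features, but each weight j gets penalty lam / idf_j^2,
#     i.e., a Gaussian prior whose variance is proportional to idf_j^2.
w_tf = np.linalg.solve(X_tf.T @ X_tf + lam * np.diag(1.0 / idf**2), X_tf.T @ y)

# Identical predictions: the per-weight prior reproduces the tf*idf rescaling.
assert np.allclose(X_tfidf @ w_scaled, X_tf @ w_tf)
assert np.allclose(idf * w_scaled, w_tf)
```

The catch, as noted above, is that the idf values come from the training corpus itself, so calling this a "prior" is a bit of a stretch.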
I agree with Hal that there's definitely something unsatisfying about needing to do feature weighting. Shouldn't the learning algorithm take care of that?
I've done experiments where I had a mixed feature representation consisting of probabilities, binary indicator variables, negative/positive values, and integer ranks. It seems quite crucial, for SVMs at least, that the features all be normalized to similar ranges. For example, it works better to binarize the rank feature into multiple "smaller than rank i" thresholded binary features than to keep it as a single integer (see the sketch after this comment).
Why is this so? Does anyone have any pointers to the literature? Feature weighting in this case cannot be seen as setting informative priors, since the different scales simply arise from different kinds of features.
(Note: here I'm using feature weighting in a slightly different sense from TFIDF; I'm using weighting to normalize the features. But I think the question of why the machine learning algorithm is sensitive to feature weights is the same.)
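A small sketch of the two options mentioned in the comment above (rescaling features to comparable ranges versus binarizing a rank into thresholded indicators); the thresholds and constants are invented for illustration:

```python
import numpy as np

# Illustrative mixed features for one example: a probability, a signed score,
# and an integer rank.
prob, score, rank = 0.87, -3.2, 14

# Option 1: rescale everything to a similar range (roughly [-1, 1]) so that no
# single feature dominates the SVM's margin / kernel computation.
MAX_ABS_SCORE, MAX_RANK = 10.0, 100
scaled = np.array([prob, score / MAX_ABS_SCORE, rank / MAX_RANK])

# Option 2: replace the raw rank with "smaller than rank i" indicators, which
# lets a linear model learn a step-wise (non-linear in rank) response.
thresholds = [1, 5, 10, 25, 50, 100]
rank_bins = np.array([1.0 if rank < t else 0.0 for t in thresholds])

features = np.concatenate(([prob, score / MAX_ABS_SCORE], rank_bins))
print(scaled, features)
```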