- Marco Baroni, Georgiana Dinu & Germán Kruszewski, Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL 2014.
- Omer Levy & Yoav Goldberg, Linguistic Regularities in Sparse and Explicit Word Representations. CoNLL 2014 (best paper award recipient).

Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
Marco Baroni, Georgiana Dinu & Germán Kruszewski

Motivation: these silly deep learning people keep writing papers but don't compare to traditional distributional semantics models. So we will.
Conclusion: okay, those people are actually right.

== Background ==

Distributional semantics = you know a word by the company it keeps

"Count models":
* For each word type, collect context vectors
* Context vectors look at n words on the left and right (with position info? together or separately?); n varied in 2..5
* Each type is represented by the bag of contexts in which it appears
* Contexts are scored by PMI or LLR (downweights frequent but uninformative contexts)
* We might reduce dimensionality to k in { 200, ..., 500 }, using either SVD or NNMF
  (see the count-model sketch at the end of this paper's notes)

"Predict models" (aka deep learning):
* Assume a mapping from word type -> k-dim vector
* Learn a model to predict any word token given the vectors of the n words to its left and right; n varied in {2,5}
* Words are randomly thrown out with probability proportional to their frequency
  - makes things faster
  - reduces the importance of frequent words, like IDF
* Vary k in { 200, ..., 500 }

CW (Collobert & Weston) models:
* freely available online
* 100-dimensional vectors trained for 2 months on Wikipedia
* predict a word from the 5 words to its left and right
* used extensively in other literature

== Tasks ==

* Synonym detection from TOEFL
  - given "levied", choose from { imposed, believed, requested, correlated }
  - compute cosine of word representations
* Concept categorization
  - cluster words like { helicopters, motorcycles } into one class and { dogs, elephants } into another
  - used off-the-shelf clustering algorithms
* Selectional preferences
  - given a verb/noun pair, say whether the verb selects for that noun
  - eg, "eat apples" versus "eat gravity"
  - for each verb, take the 20 most strongly associated nouns, average their representations, and measure cosine similarity to that
* Analogy
  - eg, brother:sister :: grandson:(? granddaughter)
  - find the nearest neighbor of (brother - sister + grandson)

== Summary of results ==

(a "win" requires a margin of >= 5)

                                pred wins   tie   count wins
  tune parameters PER TASK          10       4        0
  tune parameters OVERALL           10       3        1
  worst parameters                  14       0        0
  best from relatedness             11       3        0

- best count model: window=2, PMI, no compression, 300k dimensions
- best predict model: window=5, no hierarchical softmax, negative sampling, 400 dim
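As a concrete reference for the count-model pipeline above (collect windowed contexts, score by PMI, optionally compress with SVD), here is a minimal sketch. The toy corpus, window size, and k are illustrative only, and the positive-PMI clipping is a common variant we assume here, not the paper's exact weighting scheme:

```python
# Minimal sketch of a "count model": PMI-weighted co-occurrence vectors
# plus optional SVD compression. Toy corpus, window, and k are
# illustrative, not the paper's actual setup.
import numpy as np

corpus = "the dog chased the cat the cat chased the mouse".split()
window = 2  # n words on the left and right

vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Collect word/context co-occurrence counts
counts = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            counts[idx[w], idx[corpus[j]]] += 1

# Score contexts by PMI: log P(w,c) / (P(w) P(c))
total = counts.sum()
p_w = counts.sum(axis=1, keepdims=True) / total
p_c = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((counts / total) / (p_w * p_c))
pmi[~np.isfinite(pmi)] = 0.0
ppmi = np.maximum(pmi, 0.0)  # positive-PMI clipping (our assumption)

# Optionally reduce dimensionality with truncated SVD
k = 2  # toy value; the paper varies k in { 200, ..., 500 }
u, s, _ = np.linalg.svd(ppmi)
reduced = u[:, :k] * s[:k]
print(reduced.shape)  # (vocab size, k)
```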

Linguistic Regularities in Sparse and Explicit Word Representations
Omer Levy & Yoav Goldberg

Motivation: neural representations capture analogical reasoning well; why does this happen?
Conclusion: we can do just as well using traditional distributional word representations if we measure similarity in a "multiplicative" way

== Background ==

* neural language models produce representations that can answer analogy questions:
  - gender... man:woman :: king:queen
  - speakers... france:french :: mexico:spanish
  - number... apple:apples :: car:cars
* question: how much of this is a property of *embeddings* (ie, dense and low-dimensional)?
  - the alternative is distributional similarity == bag of contexts (a la the Baroni paper)

== Experiment 1 ==

* Mikolov (word2vec) computes similarity for solving a:b :: a*:b* by finding:

    arg max_{b*} similarity(b*, a* - a + b)    (called 3CosAdd)

  aka similarity(queen, king - man + woman)

* they use cosine similarity: cos(u,v) = dot(u,v) / [ ||u|| ||v|| ]
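A quick sketch of 3CosAdd using that cosine. The embeddings here are random toy vectors just to show the mechanics (with trained vectors the nearest neighbor would be meaningful); the function and variable names are ours:

```python
# Sketch of 3CosAdd: answer a:b :: a*:? by finding the nearest neighbor
# of (b + a* - a) under cosine similarity, excluding the query words.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["man", "woman", "king", "queen", "apple"]
E = {w: rng.standard_normal(50) for w in vocab}  # toy random embeddings

def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def cos_add(a, a_star, b, exclude):
    target = E[b] + E[a_star] - E[a]  # e.g. woman + king - man
    candidates = [w for w in vocab if w not in exclude]
    return max(candidates, key=lambda w: cos(E[w], target))

# man:woman :: king:? -- with trained vectors this should return "queen"
print(cos_add("man", "king", "woman", exclude={"man", "king", "woman"}))
```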

* expanding out (for unit-normalized vectors), you get:

    arg max_{b*} cos(b*,b) + cos(b*,a*) - cos(b*,a)

  aka cos(queen,woman) + cos(queen,king) - cos(queen,man)

Results: on the MSR & Google datasets, embeddings >> explicit (predict >> count);
on SemEval, basically tied (closed vocabulary?)

* an alternative is:

    arg max_{b*} similarity(b* - b, a* - a)    (called PairDirection)

  aka similarity(queen - woman, king - man)

Results: much, much worse on MSR and Google (open vocab), and better (though tied) on SemEval.
Perhaps because scale matters for an open vocabulary? could have been tested explicitly... (drat!)

== Experiment 2 ==

Looking at the expansion of 3CosAdd, it looks like a "noisy or" sort of operation:
- b* should be close to b, close to a*, and far from a...

What about using something more like a noisy and (this is 3CosMul; see the sketch after the results below):

                 cos(b*,b) * cos(b*,a*)
    arg max_{b*} -----------------------
                    cos(b*,a) + eps

  aka

                 cos(queen,king) * cos(queen,woman)
    arg max_{b*} -----------------------------------
                       cos(queen,man) + eps

Results:
                       MSR    Google
    3CosAdd  Predict   54%     63%
             Count     29%     45%
    3CosMul  Predict   59%     67%
             Count     57%     68%

One against the other:
    Both correct         -- 54% of cases
    Both wrong           -- 24%
    Only predict correct -- 11.1%
    Only count correct   -- 11.6%
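For comparison, a self-contained sketch of the multiplicative objective. Since cosines can be negative, we shift them into [0,1] via (x + 1) / 2 before multiplying; that normalization and all names here are our assumptions for illustration:

```python
# Sketch of 3CosMul, the "noisy and" objective: b* should be close to b
# AND close to a*, divided by closeness to a; eps avoids divide-by-zero.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["man", "woman", "king", "queen", "apple"]
E = {w: rng.standard_normal(50) for w in vocab}  # toy random embeddings

def cos01(u, v):
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return (c + 1) / 2  # map cosine from [-1, 1] into [0, 1] (assumption)

def cos_mul(a, a_star, b, exclude, eps=1e-3):
    candidates = [w for w in vocab if w not in exclude]
    return max(
        candidates,
        key=lambda w: cos01(E[w], E[b]) * cos01(E[w], E[a_star])
                      / (cos01(E[w], E[a]) + eps),
    )

# man:woman :: king:? -- with trained vectors this should return "queen"
print(cos_mul("man", "king", "woman", exclude={"man", "king", "woman"}))
```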