In our local summer reading group, I led the discussion of two papers that appeared in Baltimore last month: Baroni, Dinu & Kruszewski's "Don't count, predict!" and Levy & Goldberg's "Linguistic Regularities in Sparse and Explicit Word Representations".
I love handouts, so I made one for this session too; it's pasted below. All good ideas are those of the respective authors; all errors and bad ideas are probably due to bad transcription on my part.
Don't count, predict! A systematic comparison of context-counting
vs. context-predicting semantic vectors
Marco Baroni, Georgiana Dinu & Germán Kruszewski
Motivation: these silly deep learning people keep writing papers but
don't compare to traditional distributional semantics models. So we
will.
Conclusion: okay, those people are actually right.
== Background ==
Distributional semantics = you know a word by the company it keeps
"Count models":
* For each word type, collect context vectors
* Context vectors look at the n words to the left and right
    (with position info? left and right together or separately?)
  - n varied in {2, 5}
* Each type is represented by the bag of contexts in which it appears
* Contexts are scored by PMI or LLR
    (downweights contexts that are frequent but not very informative)
* We might reduce dimensionality to k in { 200, ..., 500 }
- using either SVD or NNMF
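To make the count pipeline concrete, here is a minimal sketch in Python (my own illustration, not the authors' code): the toy corpus, window, and k are stand-ins, and it uses plain PPMI plus SVD rather than the paper's full grid of options.

```python
from collections import Counter

import numpy as np

# Toy corpus standing in for the real training data.
corpus = [["dogs", "chase", "cats"], ["cats", "chase", "mice"]]
window = 2  # n words on the left and right

# 1. Collect co-occurrence counts: one row per word type, one column per context word.
pair_counts = Counter()
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                pair_counts[(w, sent[j])] += 1

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: k for k, w in enumerate(vocab)}
C = np.zeros((len(vocab), len(vocab)))
for (w, c), n in pair_counts.items():
    C[idx[w], idx[c]] = n

# 2. Score contexts with positive PMI: log P(w,c) / (P(w) P(c)), clipped at zero.
total = C.sum()
p_w = C.sum(axis=1, keepdims=True) / total
p_c = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((C / total) / (p_w * p_c))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# 3. Optionally reduce dimensionality to k with truncated SVD.
k = 2  # toy stand-in for the paper's 200..500
U, S, _ = np.linalg.svd(ppmi, full_matrices=False)
reduced = U[:, :k] * S[:k]  # k-dimensional vector per word type
```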
"Predict models" (aka deep learning):
* Assume a mapping from word type -> k-dim vector
* Learn a model to predict any word token given the vectors
  of the n words to its left and right
  - n varied in {2, 5}
* Word tokens are randomly thrown out, more often for frequent words (subsampling)
  - makes things faster
  - reduces the importance of frequent words, much as IDF does
* Vary k in { 200, ..., 500 }
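The predict side, sketched with gensim's word2vec (assuming gensim 4.x parameter names; the toy corpus and the particular hyperparameter values are stand-ins that roughly mirror the best configuration reported below):

```python
from gensim.models import Word2Vec

# Toy corpus standing in for the real training data.
sentences = [["dogs", "chase", "cats"], ["cats", "chase", "mice"]]

model = Word2Vec(
    sentences,
    vector_size=400,  # k
    window=5,         # n words on the left and right
    sg=0,             # CBOW: predict a word from its context vectors
    hs=0,             # no hierarchical softmax ...
    negative=10,      # ... negative sampling instead
    sample=1e-5,      # subsampling: frequent words are thrown out more often
    min_count=1,      # keep every word in this toy example
)

vec = model.wv["cats"]  # the learned k-dimensional vector for a word type
```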
CW (Collobert & Weston) models:
* freely available online
* 100-dimensional vectors trained for 2 months on Wikipedia
* predict a word from the 5 words to its left and right
* used extensively in other literature
== Tasks ==
* Synonym detection from TOEFL
- given "levied" choose from { imposed, believed, requested, correlated }
- compute cosine of word representations
* Concept categorization
- cluster words like { helicopters, motorcycles } into one class
and { dogs, elephants } into another
- used off-the-shelf clustering algorithms
* Selectional preferences
- given a verb/noun pair say whether the verb selects for that noun
- eg, "eat apples" versus "eat gravity"
- for each verb, take its 20 most strongly associated nouns, average
  their representations, and score a candidate noun by cosine similarity
  to that average (see the sketch after this list)
* Analogy
- eg, brother:sister :: grandson:(?granddaughter)
- find nearest neighbor of (grandson - brother + sister)
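To make the cosine-based tasks concrete, a toy sketch (my own illustration): `vectors` is a hypothetical dict from word to numpy array, produced by whichever model, and `nouns_for` is a hypothetical lookup returning a verb's 20 most associated nouns. Concept categorization just feeds the same vectors to an off-the-shelf clusterer, so it's omitted here; the analogy objectives are sketched with the second paper below.

```python
import numpy as np

def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def toefl_choice(vectors, target, options):
    """Synonym detection: pick the option whose vector is closest to the target word."""
    return max(options, key=lambda w: cos(vectors[target], vectors[w]))

def selectional_preference(vectors, typical_nouns, candidate):
    """Average the verb's most associated nouns into a prototype vector,
    then score a candidate noun by cosine similarity to that prototype."""
    prototype = np.mean([vectors[n] for n in typical_nouns], axis=0)
    return cos(prototype, vectors[candidate])

# e.g.:
#   toefl_choice(vectors, "levied", ["imposed", "believed", "requested", "correlated"])
#   selectional_preference(vectors, nouns_for("eat"), "apples")   # should be high
#   selectional_preference(vectors, nouns_for("eat"), "gravity")  # should be low
```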
== Summary of results ==
                                  pred wins   tie   count wins   (win: >= 5)
* tune parameters PER TASK:           10       4        0
* tune parameters OVERALL:            10       3        1
* worst parameters:                   14       0        0
* best from relatedness:              11       3        0
- best count model: window=2, PMI, no compression, 300k dimensions
- best predict model: window=5, no hier softmax, neg sampling, 400 dim
Linguistic Regularities in Sparse and Explicit Word Representations
Omer Levy & Yoav Goldberg
Motivation: neural representations capture analogical reasoning well;
why does this happen?
Conclusion: we can do just as well using traditional distributional word
representations if we measure similarity in a "multiplicative" way
== Background ==
* neural language models produce representations that can answer analogy questions:
- gender... man:woman :: king:queen
- speakers... france:french :: mexico:spanish
- number... apple:apples :: car:cars
* question: how much of this is a property of *embeddings* (ie, dense, low-dimensional vectors)?
  - the alternative is distributional similarity == bag of contexts (a la the Baroni paper)
== Experiment 1 ==
* Mikolov et al. (word2vec) solve a:b :: a*:b* for the unknown b* by finding:
arg max_{b*} similarity(b*, a* - a + b) (called 3CosAdd)
aka similarity(queen, king - man + woman)
* they use cosine similarity: cos(u,v) = dot(u,v) / [ ||u|| ||v|| ]
* expanding out (for unit-length vectors; the denominator ||a* - a + b|| is the
  same for every candidate b*, so the arg max reduces to a sum of dot products), you get:
     arg max_{b*} cos(b*,b) + cos(b*,a*) - cos(b*,a)
aka
cos(queen,woman) + cos(queen,king) - cos(queen,man)
Results: on MSR & Google datasets, embeddings >> explicit (predict >> count)
on SemEval, basically tied (closed vocabulary?)
* an alternative is:
arg max_{b*} similarity(b*-b, a*-a) (called PairDirection)
aka similarity(queen-woman, king-man)
Results: much, much worse on MSR and Google (open vocab), and a bit better
(though basically tied) on SemEval. Perhaps because scale matters for an open
vocabulary? Could have been tested explicitly... (drat!)
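A sketch of the two objectives (again my own illustration; `vectors` and `candidates` are hypothetical, and in the real evaluations the three question words are typically excluded from the candidate set):

```python
import numpy as np

def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def three_cos_add(vectors, a, b, a_star, candidates):
    """Solve a:b :: a*:?  via  arg max_{b*} cos(b*, a* - a + b)."""
    target = vectors[a_star] - vectors[a] + vectors[b]
    return max(candidates, key=lambda w: cos(vectors[w], target))

def pair_direction(vectors, a, b, a_star, candidates):
    """Solve a:b :: a*:?  via  arg max_{b*} cos(b* - b, a* - a)."""
    offset = vectors[a_star] - vectors[a]
    return max(candidates, key=lambda w: cos(vectors[w] - vectors[b], offset))

# e.g. three_cos_add(vectors, "man", "woman", "king", candidates)  ->  hopefully "queen"
```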
== Experiment 2 ==
The expansion of 3CosAdd looks like a "noisy or" sort of operation:
- b* should be close to b, should be close to a*, should be far from a...
What about using something more like noisy and:
     arg max_{b*}  cos(b*,b) * cos(b*,a*) / ( cos(b*,a) + eps )      (called 3CosMul)
   aka
     cos(queen,king) * cos(queen,woman) / ( cos(queen,man) + eps )
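And in code (same hypothetical `vectors` and `candidates` as in the previous sketch; the eps default is an arbitrary small constant to avoid division by zero, and negative cosines would make the product misbehave, a detail glossed over here):

```python
import numpy as np

def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def three_cos_mul(vectors, a, b, a_star, candidates, eps=1e-3):
    """Solve a:b :: a*:?  via  arg max_{b*} cos(b*,b) * cos(b*,a*) / (cos(b*,a) + eps)."""
    def score(w):
        return (cos(vectors[w], vectors[b]) * cos(vectors[w], vectors[a_star])
                / (cos(vectors[w], vectors[a]) + eps))
    return max(candidates, key=score)

# e.g. three_cos_mul(vectors, "man", "woman", "king", candidates)  ->  hopefully "queen"
```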
Results:
                       MSR     Google
  3CosAdd   Predict    54%      63%
            Count      29%      45%
  3CosMul   Predict    59%      67%
            Count      57%      68%
One against the other:
  Both correct          -- 54% of cases
  Both wrong            -- 24%
  Only predict correct  -- 11.1%
  Only count correct    -- 11.6%