31 July 2014

Reading group notes: point/counter-point on "predict models"

In our local summer reading group, I led the discussion of two papers that appeared in Baltimore last month:

I love handouts, so I made a handout for this one too. I paste below the handout. All good ideas are those of the respective authors; all errors and bad ideas are probably due to bad transcription on my  part.

Don't count, predict! A systematic comparison of context-counting
vs. context-predicting semantic vectors

Marco Baroni & Georgiana Dinu & Germ\'an Kruszewski

Motivation: these silly deep learning people keep writing papers but
don't compare to traditional distributional semantics models. So we

Conclusion: okay, those people are actually right.

== Background ==

Distributional semantics = you know a word by the company it keeps

"Count models":
 * For each word type, collect context vectors
 * Context vectors look at n words on the left and right
     (with position info? together or separately?)
     varied in 2..5
 * Each type is represented by the bag of contexts in which it appears
 * Contexts are scored by PMI or LLR
     (gets rid of useful but frequent contexts)
 * We might reduce dimensionality to k in { 200, ..., 500 }
    - using either SVD or NNMF

"Predict models" (aka deep learning):
 * Assume a mapping from word type -> k-dim vector
 * Learn a model to predict any word token given the vectors
     of the n words to its left and right
     varied in {2,5}
 * Words are thrown out proportional to their frequency
    - makes things faster
    - reduces importance of frequent words like IDF
 * Vary k in { 200, ..., 500 }

CW (Collobert & Weston) models:
 * freely available online
 * 100 dimensional vectors trained for 2 months on wikipedia
 * predict a word with 5 words to the left and right
 * used extensively in other literature

== Tasks ==
 * Synonym detection from TOEFL
   - given "levied" choose from { imposed, believed, requested, correlated }
   - compute cosine of word representations
 * Concept categorization
   - cluster words like { helicopters, motorcyles} into one class
     and { dogs, elephants } into another
   - used off-the-shelf clustering algorithms
 * Selectional preferences
    - given a verb/noun pair say whether the verb selects for that noun
    - eg, "eat apples" versus "eat gravity"
    - for each verb, take 20 most strongly associated nouns, average
        their representations, and measure cosine similarity to that
 * Analogy
    - eg, brother:sister :: grandson:(?granddaughter)
    - find nearest neighbor of (brother - sister + grandson)

== Summary of results ==
                                     pred    tie   count    (win: >=5)
                                     wins           wins
 * tune parameters PER TASK:          10      4      0
 * tune parameters OVERALL:           10      3      1
 * worst parameters                   14      0      0
 * best from relatedness:             11      3      0

 - best count   model: window=2, PMI, no compression, 300k dimensions
 - best predict model: window=5, no hier softmax, neg sampling, 400 dim

Linguistic Regularities in Sparse and Explicit Word Representations
Omer Levy & Yoav Goldberg

Motivation: neural representations capture analogical reasoning well;
why does this happen?

Conclusion: we can just as well using tranditional distributed word
representations if we measure similarity in a "multiplicative" way

== Background ==

* neural language models produce representations that can answer analogy questions:
  - gender...   man:woman :: king:queen
  - speakers... france:french :: mexico:spanish
  - number...   apple:apples :: car:cars
* question: how much of this is a property of *embeddings* (ie dense, low-dimensional)
  - alternative is distributional similarity == bag of contexts (ala Baroni paper)

== Experiment 1 ==

* Mikolov (word2vec) computes similarity for solving a:b :: a*:b* by finding:
    arg max_{b*}  similarity(b*, a* - a + b)       (called 3CosAdd)
   aka similarity(queen, king - man + woman)
* they use cosine similarity: cos(u,v) = dot(u,v) / [ ||u||  ||v|| ]
* expanding out, you get:
    arg max_{b*}  cos(b*,b) + cos(b*,a*) - cos(b*,a)
    cos(queen,woman) + cos(queen,king) - cos(queen,man)

Results: on MSR & Google datasets, embeggings >> explicit (predict >> count)
         on SemEval, basically tied (closed vocabulary?)

* an alternative is:
    arg max_{b*}  similarity(b*-b, a*-a)           (called PairDirection)
  aka similarity(queen-woman, king-man)

Results: Much much worse on MSR, Google (open vocab), and better (though tied) on
SemEval. Perhaps because scale matters for open vocabulary? could have been tested
explicitly... (drat!)

== Experiment 2 ==

Looking at the expansion of 3CosAdd, it looks like a "noisy or" sort of operation:
 - b* should be close to b, should be close to a*, should be far from a...

What about using something more like noisy and:

            cos(b*,b)  cos(b*,a*)
  arg max  -----------------------
      b*       cos(b*,a) + eps


            cos(queen,king)  cos(queen,woman)
  arg max  -----------------------------------
      b*          cos(queen,man) + eps


                      MSR       Google
  3CosAdd  Predict    54%       63%
           Count      29%       45%
  3CosMul  Predict    59%       67%
           Count      57%       68%

One against the other:

  Both correct    -- 54% of cases
  Both wrong      -- 24%
  Predict correct -- 11.1%
  Count correct   -- 11.6%

1 comment:

rrenaud said...

You have a misspelling, "embeggings".