08 June 2009

The importance of input representations

As some of you know, I run a (machine learning) reading group every semester. This summer we're doing "assorted" topics, which basically means students pick a few related papers from the past 24 months and present on them. The week before I went out of town, we read two papers about inferring features from raw data; one was a deep learning approach; the other was more Bayesian. (As a total aside, I found it funny that in the latter paper they talk a lot about trying to find independent features, but in all cog sci papers I've seen where humans list features of objects/categories, the features are highly dependent: e.g., "has fur" and "barks" are reasonable features that humans like to produce that are very much not independent. In general, I tend to think that modeling things as explicitly dependent is a good idea.)

Papers like this love to use vision examples, I guess because we actually have some understanding of how the visual cortex works (from a neuroscience perspective), which we sorely lack for language (it seems much more complicated). They also love to start with pixel representations; perhaps this is neurologically motivated, I don't really know. But I find it kind of funny, primarily because there's a ton of information hard-wired into the pixel representation. Why not feed .jpg and .png files directly into your system?

On the language side, an analogy is the bag of words representation. Yes, it's simple, but it's only simple if you know the language. If I handed you a bunch of text files in Arabic (suppose you'd never done any Arabic NLP) and asked you to make a bag of words, what would you do? What about Chinese? There, it's well known that word segmentation is hard. There's already a huge amount of information built into a bag of words representation.

The question is: does it matter?

Here's an experiment I did. I took the twenty newsgroups data (standard train/test split) and turned it into classification data by taking each posting and feeding it through a module "X", which produces a sequence of tokens. I then extract n-gram features over these tokens, throw out anything that appears fewer than ten times, and train a multiclass SVM on the result (using libsvm). The only thing that varies in this setup is what "X" does; a rough sketch of the whole pipeline appears right after the list. Here are four "X"s that I tried:
  1. Extract words. When composed with extracting n-gram tokens, this leads to a bag of words, bag of bigrams, bag of trigrams, etc., representation.
  2. Extract characters. This leads to character unigrams, character bigrams, etc.
  3. Extract bits from characters. That is, represent each character in its 8-bit ASCII form and extract the resulting sequence of zeros and ones.
  4. Extract bits from a gzipped version of the posting. This is the same as (3), but before extracting the data, we gzip the file.
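
Here's roughly what those four "X"s plus the n-gram feature extraction might look like in Python. This is a reconstruction rather than the exact code I ran: the minimum count of ten comes from the description above, but the whitespace word tokenizer, the latin-1 byte handling, and the function names are just illustrative stand-ins.

    import gzip
    from collections import Counter

    def x_words(text):
        # X = extract words (whitespace split as a crude stand-in for real tokenization)
        return text.split()

    def x_chars(text):
        # X = extract characters
        return list(text)

    def x_bits(text):
        # X = extract bits: each character in its 8-bit form, as a sequence of '0'/'1'
        return [b for byte in text.encode("latin-1", "replace")
                  for b in format(byte, "08b")]

    def x_gzip_bits(text):
        # X = gzip the posting first, then extract the bits of the compressed file
        return [b for byte in gzip.compress(text.encode("latin-1", "replace"))
                  for b in format(byte, "08b")]

    def ngrams(tokens, n):
        # sliding-window n-grams over whatever token sequence X produced
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def featurize(postings, x, n, min_count=10):
        # count n-gram features over the corpus and throw out anything
        # appearing fewer than ten times, as described above
        counts = Counter(g for post in postings for g in ngrams(x(post), n))
        vocab = {g for g, c in counts.items() if c >= min_count}
        return [Counter(g for g in ngrams(x(post), n) if g in vocab)
                for post in postings]

The resulting sparse counts then just get dumped in libsvm's sparse format and handed to the multiclass SVM.
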
The average word length for me is 3.55 characters, so a character n-gram of length about 4.5 is approximately equivalent to a bag of words model. I've plotted the results below for everything except words (words were boring: bag of words got 79% accuracy, and going to higher n-gram lengths hurt by 2-3%). The x-axis is the number of bits, so the unigram character model starts out at eight bits. The y-axis is accuracy:

[figure: accuracy (y-axis) versus number of bits (x-axis) for the character, raw-bit, and gzipped-bit models]
As we can see, characters do well, even at the same bit sizes. Basically, you get a ton of binary sequence features from raw bits that just confuse the classifier. Zipped bits do markedly worse than raw bits. The reason the bit-based models don't extend further is that they started taking gigantic amounts of memory (more than my poor 32GB machine could handle) to process and train on. But 40 bits is about five characters, which is just over a word, so in theory the 40-bit models have the same information as the bag of words model (at 79% accuracy).

So yes, it does seem that the input representation matters. This isn't shocking, but I've never seen anyone actually try something like this before.

11 comments:

  1. I really don't think that segmentation brings all that much to the party. I say this because pretty good segmentation can be had by simple methods. What is happening in your results is, as you say, that the sliding window n-gram approach combined with poor statistical segmentation is overwhelming your SVM classifier.

    That doesn't imply that a better unsupervised segmenter wouldn't make the difference between character level and word level retrieval essentially nil. My contention is just that: an unsupervised segmenter would do essentially as well as, if not a bit better than, a human segmentation.

    Here are some useful references on character level retrieval, unsupervised segmentation and retrieval or classification using n-grams.

    Carl de Marcken did some very nice early work in unsupervised word segmentation:

    http://www.demarcken.org/carl/papers/PhD.pdf

    Sproat worked on Chinese segmentation using related statistics:

    http://www.cslu.ogi.edu/~sproatr/newindex/publications.html

    One of the best current approaches to Chinese segmentation can work in an unsupervised setting and is essentially equivalent to Sproat's methods (but is simpler to implement if you have a dictionary):

    http://technology.chtsai.org/mmseg/

    You should take a look at Leong et al.'s work on Chinese retrieval using bigrams. They compared sliding n-gram and segmented retrieval in Chinese:

    http://trec.nist.gov/pubs/trec6/papers/iss.ps.gz

    Damashek did n-gram retrieval work in the dark ages:

    http://www.sciencemag.org/cgi/content/abstract/267/5199/843

    My language ID work was also similar to what you are talking about. I compared different n-gram sizes and amounts of training data:

    https://eprints.kfupm.edu.sa/66788/

  2. Oops -- I think I was ambiguous. I know lots of people do character based stuff. What I meant by "never seen this before" was specifically the bit stuff and the gzip stuff, which really get to the heart of not knowing anything about the representation.

  3. i completely agree that allowing for dependence between features is very important; however, ideally, i think it'd be better for the nature of the dependence to be inferred as well if possible.

    i've played around a bit with inferring conceptual features using the same techniques and interesting dependency problems arise.

    i used an animal data set from pat shafto (paper). so the "raw" input is lists of primitives like "has wings" that are owned by a set of animals, and the features we are inferring are groups of the primitives that go together. the ibp w/ noisy-or likelihood doesn't do too badly, but it has some trouble.
    decent feature:
    Contains parts:
    has tough skin
    has antennae
    has wings
    is slender
    lives in hot climates
    flies
    is an insect
    lays eggs
    lives on land
    Owned by Objects:
    Grasshopper
    Ant
    Bee
    Dragonfly

    bad feature:
    Contains parts:
    lives in groups
    travels in groups
    is black
    is colorful
    Owned by Objects:
    Ant
    Bee
    Sheep
    Finch
    Penguin
    Dragonfly

    to fix it, we used the phylogenetic ibp (features are only cond. indep. given a tree describing the dependencies between objects) to model the dependency between animals via the natural taxonomy (the tree was inferred using some hierarchical clustering techniques on the raw data).

    better features using dependencies:
    -------------------------
    Inferred Feature: 14
    ------------------------
    Contains parts:
    lives in groups
    travels in groups
    Owned by Objects:
    Ant
    Bee
    Sheep
    Seal
    Dolphin
    Monkey
    Jellyfish
    Finch
    Penguin
    Dragonfly
    Bat

    (it knows to break apart the larger feature even though the primitives are correlated, because that dependency is captured by the tree).

    this still isn't an amazingly interesting example, but i am thinking about it :)
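
    in case it helps, here's a rough sketch of the noisy-or likelihood i mean (numpy, with made-up lam/eps values rather than the settings we actually used):

    import numpy as np

    def noisy_or_loglik(X, Z, Y, lam=0.9, eps=0.01):
        # X: (animals x primitives) observed binary data, e.g. "ant has antennae"
        # Z: (animals x features)   latent binary feature ownership
        # Y: (features x primitives) which primitives each latent feature contains
        active = Z @ Y                                    # number of active "causes"
        p_on = 1.0 - (1.0 - eps) * (1.0 - lam) ** active  # noisy-or: on if any cause fires
        return np.sum(X * np.log(p_on) + (1 - X) * np.log(1.0 - p_on))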

    ... and by the way, i really enjoyed the nlp example. i've been interested in inferring semantic features since i used to work with micha and eugene.

    oh, and if you liked the nips paper about the importance of the correlation between parts for visual feature inference, tom and i have a cogsci proceedings paper showing that people use distributional information to infer visual features.

  4. This is irrelevant to your point, but a hashing trick would allow you to scale to many more bits.
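
    Something like this (just a sketch; the md5-based hash and the bucket count are arbitrary choices) keeps memory bounded no matter how many n-grams you generate:

    import hashlib

    def hashed_features(tokens, n, num_buckets=2**20):
        # hash each n-gram into a fixed number of buckets instead of keeping
        # an explicit (and huge) n-gram-to-index vocabulary
        vec = {}
        for i in range(len(tokens) - n + 1):
            key = "\x00".join(map(str, tokens[i:i + n]))
            h = int(hashlib.md5(key.encode()).hexdigest(), 16) % num_buckets
            vec[h] = vec.get(h, 0) + 1
        return vec  # sparse {bucket index: count} feature vector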

  5. This is ultimate. I can say that you know the importance of deep study and your plans and style are great.

  6. Very nice information. Thanks for this. Please come visit my site Tennessee TN Phone Directory when you got time.

  7. Very nice information. Thanks for this. Please come visit my site Nashville Business Search Engine when you got time.

  8. I enjoyed reading your work! GREAT post! I looked around for this… but I found you! Anyway, would you mind if I threw up a backlink from my site? Please come visit my site Columbus Yellow Page Business Directory when you got time.

  9. How can I contact you? Please help me with my PhD work. I'm searching for a problem in NLP, so please guide me. I am currently working as a lecturer.
    mcasanthiya@yahoo.com

  10. I agree with you so much. This is so important.
    palm beach cosmetic dentist
