08 June 2009

The importance of input representations

As some of you know, I run a (machine learning) reading group every semester. This summer we're doing "assorted" topics, which basically means students pick a few related papers from the past 24 months and present on them. The week before I went out of town, we read two papers about inferring features from raw data; one was a deep learning approach, the other was more Bayesian. (As a total aside, I found it funny that the latter paper talks a lot about trying to find independent features, but in all the cog sci papers I've seen where humans list features of objects/categories, the features are highly dependent: e.g., "has fur" and "barks" are reasonable features that humans like to produce, and they are very much not independent. In general, I tend to think that modeling things as explicitly dependent is a good idea.)

Papers like this love to use vision examples, I guess because we actually have some understanding of how the visual cortex works (from a neuroscience perspective), which we sorely lack for language (it seems much more complicated). They also love to start with pixel representations; perhaps this is neurologically motivated, I don't really know. But I find it kind of funny, primarily because there's a ton of information hard-wired into the pixel representation. Why not feed .jpg and .png files directly into your system?

On the language side, an analogy is the bag of words representation. Yes, it's simple. But it's only simple if you know the language. If I handed you a bunch of text files in Arabic (suppose you'd never done any Arabic NLP) and asked you to make a bag of words, what would you do? What about Chinese, where it's well known that word segmentation is hard? There's already a huge amount of information baked into a bag of words representation.
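To see how much knowledge gets smuggled in, here is a minimal sketch (mine, not the post's) of what "make a bag of words" usually means in practice. The very first step, splitting on whitespace, already assumes the language marks word boundaries that way, which is exactly what unsegmented Chinese text doesn't give you.

```python
from collections import Counter

def bag_of_words(text):
    # the split() call is the buried linguistic knowledge: it only works for
    # languages whose orthography marks word boundaries with whitespace
    return Counter(text.lower().split())

print(bag_of_words("the dog barks at the dog"))
# Counter({'the': 2, 'dog': 2, 'barks': 1, 'at': 1})
```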

The question is: does it matter?

Here's an experiment I did. I took the twenty newsgroups data (standard train/test split) and made classification data. To make the classification data, I took a posting and fed it through a module "X", which produced a sequence of tokens. I then extracted n-gram features over these tokens, threw out anything that appeared fewer than ten times, and trained a multiclass SVM on the result (using libsvm). The only thing that varies in this setup is what "X" does; a rough sketch of the pipeline appears after the list. Here are four "X"s that I tried:
  1. Extract words. When composed with extracting n-gram tokens, this leads to a bag of words, bag of bigrams, bag of trigrams, etc., representation.
  2. Extract characters. This leads to character unigrams, character bigrams, etc.
  3. Extract bits from characters. That is, represent each character in its 8-bit ASCII form and extract a sequence of zeros and ones.
  4. Extract bits from a gzipped version of the posting. This is the same as (3), but before extracting the data, we gzip the file.
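To make the setup concrete, here is a rough sketch of that pipeline. This is my reconstruction, not the original code: the function names, the latin-1 byte handling, and the use of pure-Python counting are all assumptions; the post only says that n-gram features seen at least ten times were fed to a multiclass SVM via libsvm.

```python
import gzip
from collections import Counter

# --- the four "X" modules: posting -> token sequence ----------------------

def x_words(text):
    return text.split()                          # (1) word tokens

def x_chars(text):
    return list(text)                            # (2) character tokens

def x_bits(text):
    # (3) each character in its 8-bit form, yielding a sequence of '0'/'1'
    return [b for byte in text.encode("latin-1", "replace")
              for b in format(byte, "08b")]

def x_gzip_bits(text):
    # (4) same as (3), but gzip the posting first
    data = gzip.compress(text.encode("latin-1", "replace"))
    return [b for byte in data for b in format(byte, "08b")]

# --- shared feature extraction ---------------------------------------------

def ngram_counts(tokens, n):
    # sliding-window n-grams over whatever token sequence "X" produced
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def build_features(postings, x, n, min_count=10):
    per_doc = [ngram_counts(x(p), n) for p in postings]
    totals = Counter()
    for doc in per_doc:
        totals.update(doc)
    # throw out any n-gram seen fewer than ten times, as in the post
    vocab = {g for g, c in totals.items() if c >= min_count}
    return [{g: c for g, c in doc.items() if g in vocab} for doc in per_doc]
```

The resulting sparse count vectors would then go to a multiclass SVM; the post used libsvm, for which something like sklearn.svm.SVC would be the obvious modern stand-in.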
The average word length for me is 3.55 characters, so a character n-gram of length about 4.5 is approximately equivalent to a bag of words model. I've plotted the results below for everything except words (words were boring: BOW got 79% accuracy, and going to higher n-gram lengths hurt by 2-3%). The x-axis is the number of bits, so the unigram character model starts out at eight bits; the y-axis is accuracy:
As we can see, characters do well, even at the same bit sizes. Basically, you get a ton of binary sequence features from raw bits that just confuse the classifier. Zipped bits do markedly worse than raw bits. The reason the bit-based models don't extend further is that it started taking gigantic amounts of memory (more than my poor 32gb machine could handle) to process and train on those files. But 40 bits is about five characters, which is just over a word, so in theory the 40-bit models have the same information as the bag of words model (at 79% accuracy).

So yes, it does seem that the input representation matters. This isn't shocking, but I've never seen anyone actually try something like this before.

Comments:

  1. I really don't think that segmentation brings all that much to the party. I say this because pretty good segmentation can be had by simple methods. What is happening in your results is, as you say, that the sliding-window n-gram approach combined with poor statistical segmentation is overwhelming your SVM classifier.

    That doesn't imply that a better unsupervised segmenter wouldn't make the difference between character-level and word-level retrieval essentially nil. My contention is just that: an unsupervised segmenter would do essentially as well as, if not a bit better than, a human segmentation.

    Here are some useful references on character level retrieval, unsupervised segmentation and retrieval or classification using n-grams.

    Carl de Marcken did some very nice early work on unsupervised word segmentation:

    http://www.demarcken.org/carl/papers/PhD.pdf

    Sproat worked on Chinese segmentation using related statistics:

    http://www.cslu.ogi.edu/~sproatr/newindex/publications.html

    One of the best current approaches to Chinese segmentation can work in an unsupervised setting and is essentially equivalent to Sproat's methods (but is simpler to implement if you have a dictionary):

    http://technology.chtsai.org/mmseg/

    You should take a look at Leong et al.'s work on Chinese retrieval using bigrams. They compared sliding n-gram and segmented retrieval in Chinese:

    http://trec.nist.gov/pubs%2Ftrec6/papers/iss.ps.gz

    Damashek did n-gram retrieval work in the dark ages:

    http://www.sciencemag.org/cgi/content/abstract/267/5199/843

    My language ID work was also similar to what you are talking about. I did comparisons of different size n-grams and training data:

    https://eprints.kfupm.edu.sa/66788/

  2. Oops -- I think I was ambiguous. I know lots of people do character based stuff. What I meant by "never seen this before" was specifically the bit stuff and the gzip stuff, which really get to the heart of not knowing anything about the representation.

  3. i completely agree that allowing for dependence between features is very important; however, ideally, i think it'd be better for the nature of the dependence to be inferred as well if possible.

    i've played around a bit with inferring conceptual features using the same techniques and interesting dependency problems arise.

    i used an animal data set from pat shafto ( paper ). so the "raw" input is lists of primitives like "has wings" that are owned by a set of animals, and the features we are inferring are groups of primitives that go together. the ibp (indian buffet process) w/ a noisy-or likelihood doesn't do too bad, but it has some trouble.
    decent feature:
    Contains parts:
    has tough skin
    has antennae
    has wings
    is slender
    lives in hot climates
    flies
    is an insect
    lays eggs
    lives on land
    Owned by Objects:
    Grasshopper
    Ant
    Bee
    Dragonfly

    bad feature:
    Contains parts:
    lives in groups
    travels in groups
    is black
    is colorful
    Owned by Objects:
    Ant
    Bee
    Sheep
    Finch
    Penguin
    Dragonfly

    to fix it, we used the phylogenetic ibp (features are only cond. indep. given a tree describing the dependencies between objects) to model the dependency between animals via the natural taxonomy (the tree was inferred using some hierarchical clustering techniques on the raw data).

    better features using dependencies:
    -------------------------
    Inferred Feature: 14
    ------------------------
    Contains parts:
    lives in groups
    travels in groups
    Owned by Objects:
    Ant
    Bee
    Sheep
    Seal
    Dolphin
    Monkey
    Jellyfish
    Finch
    Penguin
    Dragonfly
    Bat

    (it knows to break the larger feature apart, even though the primitives are correlated, because that correlation is accounted for by the dependency given by the tree).

    this still isn't an amazingly interesting example, but i am thinking about it :)
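    For readers who haven't seen it, here is a minimal sketch of the noisy-or likelihood mentioned above; the parameter names (lam for the per-feature activation probability, eps for the background leak) are my own, and the commenter's exact parameterization may differ.

```python
import numpy as np

def p_primitive_on(z, Y, lam=0.9, eps=0.05):
    """P(x_d = 1 | z, Y) under a noisy-or: each active feature k with
    Y[k, d] = 1 independently turns primitive d on with probability lam,
    and eps is a background 'leak' probability."""
    counts = z @ Y    # how many of the object's active features contain each primitive
    return 1.0 - (1.0 - lam) ** counts * (1.0 - eps)

# toy example: 2 features over 3 primitives, object owns only feature 0
Y = np.array([[1, 1, 0],
              [0, 1, 1]])
z = np.array([1, 0])
print(p_primitive_on(z, Y))   # high probability for primitives 0 and 1, only eps for 2
```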

    ... and by the way, i really enjoyed the nlp example. i've been interested in inferring semantic features since when i used to work with micha and eugene.

    oh and if you liked the nips paper about the importance of the correlation between parts for visual feature inference, tom and i have a cogsci proceeding showing people use distributional information to infer visual features.

  4. This is irrelevant to your point, but a hashing trick would allow you to scale to many more bits.
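    A minimal sketch of the trick the commenter is pointing at (my illustration, not code from the post): instead of growing a dictionary of every distinct bit n-gram, hash each one into a fixed number of buckets, so memory stays bounded no matter how many n-grams you see.

```python
import numpy as np

def hashed_vector(ngrams, n_buckets=2**20):
    # collisions are allowed; they trade a little accuracy for bounded memory
    # (for anything persistent, use a stable hash such as hashlib rather than
    # Python's per-process hash())
    x = np.zeros(n_buckets)
    for g in ngrams:
        x[hash(g) % n_buckets] += 1.0
    return x
```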

  5. How can I contact you? Please help me with my PhD work; I'm searching for a problem in NLP, so please guide me. I am currently working as a lecturer.
    mcasanthiya@yahoo.com
