08 June 2009

The importance of input representations

As some of you know, I run a (machine learning) reading group every semester. This summer we're doing "assorted" topics, which basically means students pick a few papers from the past 24 months that are related and present on them. The week before I went out of town, we read two papers about inferring features from raw data; one was a deep learning approach; the other was more Bayesian. (As a total aside, I found it funny that in the latter paper they talk a lot about trying to find independent features, but in all cog sci papers I've seen where humans list features of objects/categories, they're highly dependent: eg., "has fur" and "barks" are reasonable features that humans like to produce that are very much not independent. In general, I tend to think that modeling things as explicitly dependent is a good idea.)

Papers like this love to use vision examples, I guess because we actually have some understanding of how the visual cortex words (from a neuroscience perspective), which we sorely lack for language (it seems much more complicated). They also love to start with pixel representations; perhaps this is neurologically motivated: I don't really know. But I find it kind of funny, primarily because there's a ton of information hard wired into the pixel representation. Why not feed .jpg and .png files directly into your system?

On the language side, an analogy is the bag of words representation. Yes, it's simple. But only simple if you know the language. If I handed you a bunch of text files in Arabic (suppose you'd never done any Arabic NLP) and asked you to make a bag of words, what would you do? What about Chinese? There, it's well known that word segmentation is hard. There's already a huge amount of information in a bag of words format.

The question is: does it matter?

Here's an experiment I did. I took the twenty newsgroups data (standard train/test split) and made classification data. To make the classification data, I took a posting, fed it through a module "X". "X" produced a sequence of tokens. I then extract n-gram features over these tokens and throw out anything that appears less than ten times. I then train a multiclass SVM on these (using libsvm). The only thing that varies in this setup is what "X" does. Here are four "X"s that I tried:

  1. Extract words. When composed with extracting n-gram tokens, this leads to a bag of words, bag of bigrams, bag of trigrams, etc., representation.
  2. Extract characters. This leads to character unigrams, character bigrams, etc.
  3. Extract bits from characters. That is, represent each character in its 8 bit ascii form and extract a sequence of zeros and ones.
  4. Extract bits from a gzipped version of the posting. This is the same as (3), but before extracting the data, we gzip the file.
The average word length for me is 3.55 characters, so a character ngram with length 4.5 is approximately equivalent to a bag of words model. I've plotted the results below for everything except words (words were boring: BOW got 79% accuracy, going to higher ngram length hurt by 2-3%). The x-axis is number of bits, so the unigram character model starts out at eight bits. The y-axis is accuracy:
As we can see, characters do well, even at the same bit sizes. Basically you get a ton of binary sequence features from raw bits that are just confusing the classifier. Zipped bits do markedly worse than raw bits. The reason the bit-based models don't extend further is because it started taking gigantic amounts of memory (more than my poor 32gb machine could handle) to process and train on those files. But 40 bits is about five characters, which is just over a word, so in theory the 40 bit models have the same information that the bag of words model (at 79% accuracy) has.

So yes, it does seem that the input representation matters. This isn't shocking, but I've never seen anyone actually try something like this before.

27 comments:

Ted Dunning ... apparently Bayesian said...

I really don't think that segmentation brings all that much to the party. I say this because pretty good segmentation can be had by simple methods. What is happening in your results is, as you say, the sliding window n-gram approach combined with poor statistical segmentation is over-whelming your SVM classifier.

That doesn't imply that a better unsupervised segmenter wouldn't make the difference between character level and word level retrieval essentially nil. My contention is just that; an unsupervised segmenter would do essentially identically if not a bit better than a human segmentation.

Here are some useful references on character level retrieval, unsupervised segmentation and retrieval or classification using n-grams.

Carl de Marcken's did some very nice early work in unsupervised word segmentation:

http://www.demarcken.org/carl/papers/PhD.pdf

Sproat worked on Chinese segmentation using related statistics:

http://www.cslu.ogi.edu/~sproatr/newindex/publications.html

One of the best current approaches to Chinese segmentation can work in an unsupervised setting and is essentially equivalent to Sproat's methods (but is simpler to implement if you have a dictionary):

http://technology.chtsai.org/mmseg/

You should take a look at Leong et al's work on Chinese retrieval using bigrams. They compared sliding n-gram and segmented retrieval in Chinese:

http://trec.nist.gov/pubs%2Ftrec6/papers/iss.ps.gz

Damashek did n-gram retrieval work in the dark ages:

http://www.sciencemag.org/cgi/content/abstract/267/5199/843

My language ID work was also similar to what you are talking about. I did comparisons of different size n-grams and training data:

https://eprints.kfupm.edu.sa/66788/

hal said...

Oops -- I think I was ambiguous. I know lots of people do character based stuff. What I meant by "never seen this before" was specifically the bit stuff and the gzip stuff, which really get to the heart of not knowing anything about the representation.

Joseph Austerweil said...

i completely agree that allowing for dependence between features is very important; however, ideally, i think it'd be better for the nature of the dependence to be inferred as well if possible.

i've played around a bit with inferring conceptual features using the same techniques and interesting dependency problems arise.

i used an animal data set from pat shafto ( paper ). so the "raw" input is lists of primitives like "has wings" that are owned by a set of animals and the features we are inferring are groups of the primitives that go together. the ibp w/ noisy-or likelihood doesn't do too bad, but it has some trouble.
decent feature:
Contains parts:
has tough skin
has antennae
has wings
is slender
lives in hot climates
flies
is an insect
lays eggs
lives on land
Owned by Objects:
Grasshopper
Ant
Bee
Dragonfly

bad feature:
Contains parts:
lives in groups
travels in groups
is black
is colorful
Owned by Objects:
Ant
Bee
Sheep
Finch
Penguin
Dragonfly

to fix it, we used the phylogenetic ibp (features only cond. indep. given a tree describing the dependencies between objects) to model the dependency between animal via the natural taxonomy (the tree was inferred through using some hierarchical clustering techniques on the raw data).

better features using dependencies:
-------------------------
Inferred Feature: 14
------------------------
Contains parts:
lives in groups
travels in groups
Owned by Objects:
Ant
Bee
Sheep
Seal
Dolphin
Monkey
Jellyfish
Finch
Penguin
Dragonfly
Bat

(it knows to break apart the larger one even though they are correlated together because it breaks the dependency given by the tree).

this still isn't an amazingly interesting example, but i am thinking about it :)

... and by the way, i really enjoyed the nlp example. i've been interested in inferring semantic features since when i used to work with micha and eugene.

oh and if you liked the nips paper about the importance of the correlation between parts for visual feature inference, tom and i have a cogsci proceeding showing people use distributional information to infer visual features.

John Langford said...

This is irrelevant to your point, but a hashing trick would allow you to scale to many more bits.

Sprachreise Brighton said...

This is ultimate. I can say that you know the importance of deep study and your plans and style are great.

rr8004 said...

Very nice information. Thanks for this. Please come visit my site Tennessee TN Phone Directory when you got time.

rr8004 said...

Very nice information. Thanks for this. Please come visit my site Nashville Business Search Engine when you got time.

rr8004 said...

I enjoyed reading your work! GREAT post! I looked around for this… but I found you! Anyway, would you mind if I threw up a backlink from my site? Please come visit my site Columbus Business Directory when you got time.

rr8004 said...

I enjoyed reading your work! GREAT post! I looked around for this… but I found you! Anyway, would you mind if I threw up a backlink from my site? Please come visit my site Columbus Yellow Page Business Directory when you got time.

goodeda1122 said...

自慰套,真愛密碼,
自慰套,自慰器,自慰套,情趣,充氣娃娃,
性感丁字褲,AV,按摩棒,電動按摩棒,情趣按摩棒,
角色扮演,角色扮演服,吊帶襪,丁字褲,飛機杯,

按摩棒,變頻跳蛋,跳蛋,無線跳蛋,G點,
潤滑液,SM,情趣內衣,內衣,性感內衣,情趣用品,情趣,

Anonymous said...

How to contact you. Please help me for my Phd work. I'm searching for a problem in NLP so please guide me. Now I am working as a lecturer.
mcasanthiya@yahoo.com

tariely said...

Классные мультики мультфильм на кинозоуне.
электронная почта без регистрации

tariely said...

электронная почта без регистрации

gamefan12 said...

I agree with your so much. This is so important.
palm beach cosmetic dentist

seldamuratim said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..
sesli sohbetsesli chatkamerali sohbetseslisohbetsesli sohbet sitelerisesli chat siteleriseslichatsesli sohpetseslisohbet.comsesli chatsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet
seslisohbetsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet

cilemsin42 said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..
sesli sohbetsesli chat
sesli sohbet siteleri

sesli chat siteleri sesli sohbetsesli chat
sesli sohbet siteleri
sesli chat siteleri
SesliChat
cılgın sohbet
güzel kızlar
bekar kızlar
dul bayanlar
seviyeli insanlar
yarışma
canlı müzik
izdivac
en güzel evlilik
hersey burada
sesliparti
seslisohbet odalari
Sesli adresi
Sesli Chat
SesliChat Siteleri
Sesli Chat sitesi
SesliChat sitesi
SesliSohbet
Sesli Sohbet
Sesli Sohbet Sitesi
SesliSohbet Sitesi
SesliSohbet Siteleri
Muhabbet Sitesi
kamerali chat
Görüntülü Sohbet
Hasret gülleri
Çet sitesi
SesliSohbet
Sesli Sohbet
Canli sohbet
Turkce sohbet
Kurtce Sohbet
Kurtce Chat
Kurtce Muhabbet
Kurtce Sohbet
Kurdish Chat
SesliChat
Sesli Chat
SesliSanal
Guncel Haber
sohbet Sitesi
Chat sitesi..

cilemsin42 said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..
sesli sohbetsesli chat
sesli sohbet siteleri

sesli chat siteleri sesli sohbetsesli chat
sesli sohbet siteleri
sesli chat siteleri
SesliChat
cılgın sohbet
güzel kızlar
bekar kızlar
dul bayanlar
seviyeli insanlar
yarışma
canlı müzik
izdivac
en güzel evlilik
hersey burada
sesliparti
seslisohbet odalari
Sesli adresi
Sesli Chat
SesliChat Siteleri
Sesli Chat sitesi
SesliChat sitesi
SesliSohbet
Sesli Sohbet
Sesli Sohbet Sitesi
SesliSohbet Sitesi
SesliSohbet Siteleri
Muhabbet Sitesi
kamerali chat
Görüntülü Sohbet
Hasret gülleri
Çet sitesi
SesliSohbet
Sesli Sohbet
Canli sohbet
Turkce sohbet
Kurtce Sohbet
Kurtce Chat
Kurtce Muhabbet
Kurtce Sohbet
Kurdish Chat
SesliChat
Sesli Chat
SesliSanal
Guncel Haber
sohbet Sitesi
Chat sitesi..

seldamuratim said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..

sesli sohbet
seslisohbet
sesli chat
seslichat
sesli sohbet sitesi
sesli chat sitesi
sesli sohpet
kamerali sohbet
kamerali chat
webcam sohbet

combattery84 said...

Laptop Battery
ACER Laptop Battery
ASUS Laptop Battery
COMPAQ Laptop Battery
Dell Laptop Battery
FUJITSU Laptop Battery
HP Laptop Battery
IBM Laptop Battery
SONY Laptop Battery
TOSHIBA Laptop Battery
APPLE M8403 battery
APPLE A1078 Battery
APPLE A1079 battery
APPLE A1175 battery 1
APPLE a1185 battery
APPLE A1189 battery
Acer aspire 5920 battery
Acer btp-arj1 battery
Acer LC.BTP01.013 battery
Acer ASPIRE 1300 battery
Acer ASPIRE 1310 battery
Acer Aspire 1410 battery
Acer ASPIRE 1680 battery
ACER BTP-63D1 battery
ACER BTP-43D1 battery
Acer lc.btp05.001 battery
Acer aspire 3000 battery
Acer Travelmate 4000 battery
ACER aspire 5560 battery
ACER BATBL50L6 battery
ACER TravelMate 240 Battery

combattery84 said...

Dell inspiron 8500 battery
Dell Inspiron 4100 battery
Dell Inspiron 4000 battery
Dell Inspiron 8200 battery
Dell FK890 battery
Dell Inspiron 1721 battery
Dell Inspiron 1300 Battery
Dell Inspiron 1520 Battery
Dell Latitude D600 Battery
Dell XPS M1330 battery
Dell Latitude D531N Battery
Dell INSPIRON 6000 battery
Dell INSPIRON 6400 Battery
Dell Inspiron 9300 battery
Dell INSPIRON 9400 Battery
Dell INSPIRON e1505 battery
Dell INSPIRON 2500 battery
Dell INSPIRON 630m battery
Dell Latitude D820 battery
Dell Latitude D610 Battery
Dell Latitude D620 battery
Dell Latitude D630 battery
Dell xps m1210 battery
Dell e1705 battery
Dell d830 battery
Dell inspiron 2200 battery
Dell inspiron 640m battery
Dell inspiron b120 battery
Dell xps m1210 battery

combattery84 said...

Dell inspiron xps m1710 battery
Dell inspiron 1100 battery
Dell 310-6321 battery
Dell 1691p battery
Dell Inspiron 500m battery
Dell 6Y270 battery
Dell inspiron 8600 battery
Latitude x300 series battery
Dell latitude cpi battery
Dell 1x793 battery
dell Inspiron 1501 battery
Dell 75UYF Battery
Dell Inspiron 1720 battery
dell Latitude C640 battery
Dell XPS M140 battery
Dell Inspiron E1405 battery
dell 700m battery
dell C1295 battery
Dell U4873 Battery
Dell Latitude C600 battery
Armada E700 Series battery
Compaq 116314-001 battery
Compaq 319411-001 battery
Compaq nc4200 battery
Compaq Presario R3000 Battery
Compaq Presario 2100 battery
Compaq Presario r3000 Battery
Compaq Business Notebook NX9000 series battery
HP 395789-001 battery
HP 446506-001 Battery

DiSCo said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
kameralı sohbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
seslidunya
seslisehir
sesli sex

Sesli Chat said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
sesli muhabbet siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
görüntülü muhabbet siteleri
kameralı sohbet siteleri
kameralı chat siteleri
kameralı muhabbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
birsesver
birses
seslidunya
seslisehir
sesli sex

Anonymous said...

коттедж
восстановление зрения
зеленый лазер
электрошокер

Anonymous said...

коттедж
восстановление зрения
зеленый лазер
электрошокер

Anonymous said...

коттедж
восстановление зрения
зеленый лазер
электрошокер

Anonymous said...

коттедж
восстановление зрения
зеленый лазер
электрошокер