08 June 2009

The importance of input representations

As some of you know, I run a (machine learning) reading group every semester. This summer we're doing "assorted" topics, which basically means students pick a few papers from the past 24 months that are related and present on them. The week before I went out of town, we read two papers about inferring features from raw data; one was a deep learning approach; the other was more Bayesian. (As a total aside, I found it funny that in the latter paper they talk a lot about trying to find independent features, but in all cog sci papers I've seen where humans list features of objects/categories, they're highly dependent: e.g., "has fur" and "barks" are reasonable features that humans like to produce that are very much not independent. In general, I tend to think that modeling things as explicitly dependent is a good idea.)

Papers like this love to use vision examples, I guess because we actually have some understanding of how the visual cortex works (from a neuroscience perspective), which we sorely lack for language (it seems much more complicated). They also love to start with pixel representations; perhaps this is neurologically motivated: I don't really know. But I find it kind of funny, primarily because there's a ton of information hard-wired into the pixel representation. Why not feed .jpg and .png files directly into your system?

On the language side, an analogy is the bag of words representation. Yes, it's simple. But it's only simple if you already know the language. If I handed you a bunch of text files in Arabic (suppose you'd never done any Arabic NLP) and asked you to make a bag of words, what would you do? What about Chinese, where it's well known that word segmentation is hard? There's already a huge amount of information encoded in a bag of words representation.

The question is: does it matter?

Here's an experiment I did. I took the twenty newsgroups data (standard train/test split) and made classification data. To make the classification data, I took a posting and fed it through a module "X", which produced a sequence of tokens. I then extracted n-gram features over these tokens, threw out anything that appeared fewer than ten times, and trained a multiclass SVM on the result (using libsvm). The only thing that varies in this setup is what "X" does. Here are four "X"s that I tried:

  1. Extract words. When composed with extracting n-gram tokens, this leads to a bag of words, bag of bigrams, bag of trigrams, etc., representation.
  2. Extract characters. This leads to character unigrams, character bigrams, etc.
  3. Extract bits from characters. That is, represent each character in its 8-bit ASCII form and extract a sequence of zeros and ones.
  4. Extract bits from a gzipped version of the posting. This is the same as (3), but before extracting the data, we gzip the file.
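The four extractors can be sketched in a few lines. This is my own reconstruction, not the original code (the function names and whitespace tokenization are my assumptions), but it shows how each "X" produces a token sequence that the same n-gram extractor then consumes:

```python
# Sketch of the four "X" token extractors described above.
# Each takes the raw posting text and returns a sequence of tokens.
import gzip

def x_words(text):
    """1. Extract words (simple whitespace tokenization)."""
    return text.split()

def x_chars(text):
    """2. Extract individual characters."""
    return list(text)

def x_bits(text):
    """3. Represent each character as 8 ASCII bits."""
    return [bit for byte in text.encode("ascii", "replace")
            for bit in format(byte, "08b")]

def x_gzip_bits(text):
    """4. Same as (3), but gzip the posting first."""
    compressed = gzip.compress(text.encode("ascii", "replace"))
    return [bit for byte in compressed for bit in format(byte, "08b")]

def ngrams(tokens, n):
    """Extract n-gram features over any of the token sequences above."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

Composing any `x_*` with `ngrams` gives the corresponding feature set, e.g. `ngrams(x_chars(post), 3)` for character trigrams.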
The average word length for me is 3.55 characters, so a character n-gram of length 4.5 is approximately equivalent to a bag of words model. I've plotted the results below for everything except words (words were boring: bag of words got 79% accuracy, and going to higher n-gram lengths hurt by 2-3%). The x-axis is the number of bits, so the unigram character model starts out at eight bits. The y-axis is accuracy:
As we can see, characters do well, even at the same bit sizes. Basically, you get a ton of binary sequence features from raw bits that just confuse the classifier. Zipped bits do markedly worse than raw bits. The reason the bit-based models don't extend further is that it started taking gigantic amounts of memory (more than my poor 32gb machine could handle) to process and train on those files. But 40 bits is about five characters, which is just over a word, so in theory the 40-bit models have the same information as the bag of words model (at 79% accuracy).

So yes, it does seem that the input representation matters. This isn't shocking, but I've never seen anyone actually try something like this before.


Ted Dunning ... apparently Bayesian said...

I really don't think that segmentation brings all that much to the party. I say this because pretty good segmentation can be had by simple methods. What is happening in your results is, as you say, that the sliding-window n-gram approach combined with poor statistical segmentation is overwhelming your SVM classifier.

That doesn't imply that a better unsupervised segmenter wouldn't make the difference between character-level and word-level retrieval essentially nil. My contention is just that an unsupervised segmenter would do essentially as well as, if not a bit better than, a human segmentation.

Here are some useful references on character level retrieval, unsupervised segmentation and retrieval or classification using n-grams.

Carl de Marcken did some very nice early work on unsupervised word segmentation:


Sproat worked on Chinese segmentation using related statistics:


One of the best current approaches to Chinese segmentation can work in an unsupervised setting and is essentially equivalent to Sproat's methods (but is simpler to implement if you have a dictionary):


You should take a look at Leong et al.'s work on Chinese retrieval using bigrams. They compared sliding n-gram and segmented retrieval in Chinese:


Damashek did n-gram retrieval work in the dark ages:


My language ID work was also similar to what you are talking about. I did comparisons of different size n-grams and training data:


hal said...

Oops -- I think I was ambiguous. I know lots of people do character based stuff. What I meant by "never seen this before" was specifically the bit stuff and the gzip stuff, which really get to the heart of not knowing anything about the representation.

Joseph Austerweil said...

i completely agree that allowing for dependence between features is very important; however, ideally, i think it'd be better for the nature of the dependence to be inferred as well if possible.

i've played around a bit with inferring conceptual features using the same techniques and interesting dependency problems arise.

i used an animal data set from pat shafto ( paper ). so the "raw" input is lists of primitives like "has wings" that are owned by a set of animals and the features we are inferring are groups of the primitives that go together. the ibp w/ noisy-or likelihood doesn't do too bad, but it has some trouble.
decent feature:
Contains parts:
has tough skin
has antennae
has wings
is slender
lives in hot climates
is an insect
lays eggs
lives on land
Owned by Objects:

bad feature:
Contains parts:
lives in groups
travels in groups
is black
is colorful
Owned by Objects:

to fix it, we used the phylogenetic ibp (features are only conditionally independent given a tree describing the dependencies between objects) to model the dependency between animals via the natural taxonomy (the tree was inferred using some hierarchical clustering techniques on the raw data).

better features using dependencies:
Inferred Feature: 14
Contains parts:
lives in groups
travels in groups
Owned by Objects:

(it knows to break apart the larger one even though they are correlated together because it breaks the dependency given by the tree).

this still isn't an amazingly interesting example, but i am thinking about it :)

... and by the way, i really enjoyed the nlp example. i've been interested in inferring semantic features since when i used to work with micha and eugene.

oh and if you liked the nips paper about the importance of the correlation between parts for visual feature inference, tom and i have a cogsci proceeding showing people use distributional information to infer visual features.

John Langford said...

This is irrelevant to your point, but a hashing trick would allow you to scale to many more bits.
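For context, the hashing trick replaces the explicit feature dictionary (which is what blew up memory at 40 bits) with a fixed-size array indexed by a hash of each n-gram. A minimal sketch, with the function name and bucket count as my own assumptions:

```python
# Minimal feature-hashing sketch: map n-grams into a fixed number of
# buckets instead of keeping an ever-growing feature dictionary.
def hashed_features(tokens, n, num_buckets=2**20):
    """Return a sparse dict mapping bucket index -> n-gram count."""
    counts = {}
    for i in range(len(tokens) - n + 1):
        # Hash the n-gram and fold it into a bounded index range;
        # distinct n-grams may collide, which is the accepted trade-off.
        bucket = hash(tuple(tokens[i:i + n])) % num_buckets
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts
```

Memory is then bounded by `num_buckets` regardless of how long the bit sequences get, at the cost of occasional hash collisions between features.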

Anonymous said...

How can I contact you? Please help me with my PhD work. I'm searching for a research problem in NLP, so please guide me. I am currently working as a lecturer.
