10 February 2008

Kernels, distances and strings

Kernels and distances are closely related. Given a kernel K(x,z), this induces an RKHS H such that K(x,z)=, where f is the mapping from the input space to H, and <.,.> is the dot product in H. Dot products induce norms in the obvious way: ||x||^2 = K(x,x). This, in turn, induces an obvious distance metric: d(x,z)^2 = K(x,x)+K(z,z)-2K(x,z).

On the other hand, we also know how to turn distance metrics into kernels. If (X,d) is a metric space (i.e., X is a space and d is a metric on X), then K(x,z)=exp(-d(x,z)) is positive semi-definite when x,z are drawn from X.

Now, there are three questions that arise. The first is: in the distance->kernel setting, what does the RKHS "look like." I think the sense here is that it's basically the same as the RKHS induced by the Gaussian kernel: it's an infinite-dimensional feature space where different dimensions look like distances to some point. In practice, this "some point" is going to be one of the training points.

The second question follows the fact that we can iterate this construction. One way to ask this is: if we use a kernel to induce a distance, then use this distance to introduce a new kernel, what does this new kernel look like. Or, alternatively, if we have a distance, kernelize it to induce a new distance, then what does this distance look like. I don't have a good intuitive sense for the answer to any of these. Obviously it's straightforward to write out what these things look like, but that doesn't directly give me a sense of what's going on.

The final question is a bit "off topic" from the above two, but still relevant. There's been a lot of work in kernels for discrete data, like strings. The most popular string subsequence kernel is based on counting how many subsequences match between two strings, down-weighted by the length of the subsequences. It's well known, and recognized in string kernel papers, that a kernel of the form K(x,z) = 1-ned(x,z), where ned is a normalized string edit distance is not a valid kernel. However, from what we know about distance metrics, K(x,z) = exp(-sed(x,z)) should be a valid kernel. Yet I'm not aware of this being used anywhere. Is there any reason why? The reason I ask is because it's a much more intuitive than the subsequence kernel, which I only have a vague sense about.

17 comments:

Fernando Pereira said...

Quick comment on your 3rd point: K(x,z) is proportional to the probability x -> z in a probabilistic mutation model with the log-odds of mutations given by the edit costs. Hum...

Suresh said...

don't you mean exp(-d^2), rather than exp(-d) ?

also, what about the Haussler convolution kernel for strings ?

Kathy said...
This comment has been removed by the author.
hal said...

suresh: right, d^2.

fernando: so let's say we use edit distance to induce a kernel. based on your comment, we can think of the kernel value to be the probability of an automata mapping one string to the other. the kernel-induced distance then looks like some distance between (probabilistic) automata that do the mapping. that actually sounds kind of interesting :). i have no idea what happens if you iterate again, though, and create a new kernel based on this.

Bob Carpenter said...

For applications in which token reorderings are likely, basic subsequence comparison works better than simple edit distance. You get good character n-gram subsequence relations between "Smith, John" and "John Smith" even though they're miles apart in terms of character-level edit distances.

There are richer probabilistic edit distances like the ones introduced by Brill and Moore for spelling and by McCallum, Bellare and Pereira for word skipping and other general edits. These don't, in general, don't have negative logs that (when offset from match cost) form a proper metric like Levenshtein distance.

I don't know much about kernels, but if K(x,y) = exp(-d(x,y)**2) always produces a kernel if d is a proper metric, then the question arises of when a probabilistic string transducer defining p(s1|s2) defines a metric. I think that reduces to when:

d(s1,s2) = - log p(s1|s2) + log p(s2|s2)

forms a metric (the second term is so that d(s,s) = 0.0).

Plain Levenshtein distance with uniform edit costs defines a distance metric, but needs some fiddling to turn into a probability distribution (sum of all operations, including matching, must have probabilty 1.0).

hal said...

bob:

"For applications in which token reorderings are likely, basic subsequence comparison works better than simple edit distance." -- is this true or is it just plausible? I.e., has this effect actually been verified? I definitely find it plausible, but are there cases where it actually works out that way? What about when you're talking about words instead of just characters?

There are also a ton of edit distances that William Cohen has proposed and even more that he's compared. If they're actually metrics, then these could also be easily kernelized.

Bob Carpenter said...

Yes, if you mean do character n-grams work better than edit distance for matching.

Last year, we worked on a database linkage and deduplication problem for film names and actor names, and indeed found character n-grams with TF/IDF weighting a reasonable comparison metric. It put almost all the string matching true positives above an easily identified threshold, with only a few residuals where you had things like names transliterated from the original language versus translated.

We've also used this technique for entity transliteration detection, as in finding variants of "Al Jazeera". These probably would've worked OK with edit distance, too.

Substring character n-grams neatly deal with issues such as diacritics (only a small penalty for mismatch), minor case variation (e.g. "University Of Michigan" vs. "Univ. of Michigan") for varying spellings of titles (e.g. "Star Wars IV" vs. "Star Wars Four"), and for various token orders (e.g. "La Traviata" vs. "Traviata, La").

I've also used them for word-sense disambiguation in our tutorial, using both a TF/IDF form of classification and a k-nearest neighbors built on character n-gram dimensions. Again, you get significant robustness boosts over whole word matchers.

Note that we extract character n-grams across word boundaries, so you get some higher-order token-like effects for free. The bag of words assumption is particularly bad for text classifiers.

Character n-grams also work very well for general robust search over text. I'd like to see them compared to character n-gram language models for search. They're actually the norm for languages like Chinese that are impossible to tokenize reliably (i.e. state of the art is 97-98%). And they're also common for transcribed speech at a phonemic or syllabic lattice level.

There'd obviously be rule-based ways to handle all the things mentioned above, as well as variations due to pronunciation, whole word re-orderings, deletions (e.g. the affine edit distances used for genomic/proteomic matching).

I like the idea behind Cohen et al.'s soft TF/IDF:

http://www.cs.cmu.edu/~wcohen/postscript/kdd-2003-match-ws.pdf

But I can't understand either where the IDF is computed or whether the resulting "distance" is even symmetric.

The Jaro-Winkler string comparison is a custom model designed by Jaro and modified by Winkler for matching individual first or last names.

Huzefa said...

String Kernels are highly popular for protein sequence classification problems (as an example). Here are some references below. The second is some of my doctoral work involving use of string kernels with a biological similarity. Using such a biological measure -makes the kernel not positive semi-definite. We work around this using an eigen value transformation.

Christina S. Leslie , Eleazar Eskin , Adiel Cohen , Jason Weston , and William Stafford Noble - Mismatch string kernels for discriminative protein classification Bioinformatics 20: 467-476.

Huzefa Rangwala, George Karypis. Profile-based Direct Kernels for Remote Homology Detection and Fold Recognition in BIOINFORMATICS, 21(23):4239-4247 (2005)

Andre Martins said...

It is not true that "if (X,d) is a metric space then K(x,z)=exp(-t * d^2(x,z)) is positive definite". d must be a "Hilbertian distance", that is, a distance arising from a inner product in a RKHS; not any metric (under the axioms of nonnegativity, symmetry and triangle inequality) is allowed. In particular the string edit distance is NOT Hilbertian, therefore, K(x,z)=exp(-t * sed^2(x,z)) is not pd. See for example:

Corinna Cortes, Patrick Haffner and Mehryar Mohri, "Positive Definite Rational Kernels", Proceedings of The 16th Annual Conference on Computational Learning Theory (COLT 2003)

Anonymous said...

Ultima Online Gold, UO Gold, crestingwait
buy uo gold
buy uo gold
buy uo gold
buy uo gold
buy uo gold
buy uo gold
buy uo gold
buy uo gold
buy uo gold
buy uo gold
lotro gold
wow gold
warhammer gold
buy aoc gold
buy aoc gold
buy aoc gold
buy aoc gold
buy aoc gold
buy aoc gold
buy aoc gold
Age of Conan Gold, AOC Gold

Anonymous said...

I always heard something from my neighbor that he sometimes goes to the internet bar to play the game which will use him some gw gold,he usually can win a lot of GuildWars Gold,then he let his friends all have some Guild Wars Gold,his friends thank him very much for introducing them the GuildWars money,they usually cheap gw gold together.

. said...

酒店經紀PRETTY GIRL 台北酒店經紀人 ,禮服店 酒店兼差PRETTY GIRL酒店公關 酒店小姐 彩色爆米花酒店兼職,酒店工作 彩色爆米花酒店經紀, 酒店上班,酒店工作 PRETTY GIRL酒店喝酒酒店上班 彩色爆米花台北酒店酒店小姐 PRETTY GIRL酒店上班酒店打工PRETTY GIRL酒店打工酒店經紀 彩色爆米花

酒店上班請找艾葳 said...

艾葳酒店經紀公司提供專業的酒店經紀, 酒店上班小姐,八大行業,酒店兼職,傳播妹,或者想要打工兼差打工,兼差,八大行業,酒店兼職,想去酒店上班, 日式酒店,制服酒店,ktv酒店,禮服店,整天穿得水水漂漂的,還是想去制服店日領上班小姐,水水們如果想要擁有打工工作、晚上兼差工作兼差打工假日兼職兼職工作酒店兼差兼差打工兼差日領工作晚上兼差工作酒店工作酒店上班酒店打工兼職兼差兼差工作酒店上班等,想了解酒店相關工作特種行業內容,想兼職工作日領假日兼職兼差打工、或晚班兼職想擁有鋼琴酒吧又有保障的工作嗎???又可以現領請找專業又有保障的艾葳酒店經紀公司!

艾葳酒店經紀是合法的公司工作環境高雅時尚,無業績壓力,無脫秀無喝酒壓力,高層次會員制客源,工作輕鬆,可日領現領
一般的酒店經紀只會在水水們第一次上班和領薪水時出現而已,對水水們的上班安全一點保障都沒有!艾葳酒店經紀公司的水水們上班時全程媽咪作陪,不需擔心!只提供最優質的酒店上班,酒店上班,酒店打工環境、上班條件給水水們。心動嗎!? 趕快來填寫你的酒店上班履歷表

水水們妳有缺現領、有兼職缺錢便服店的煩腦嗎?想到日本留學缺錢嗎?妳是傳播妹??想要擁有高時薪又輕鬆的賺錢,酒店和,假日打工,假日兼職賺錢的機會嗎??想實現夢想卻又缺錢沒錢嗎!??
艾葳酒店台北酒店經紀招兵買馬!!徵專業的酒店打工,想要去酒店的水水,想要短期日領,酒店日領,禮服酒店,制服店,酒店經紀,ktv酒店,便服店,酒店工作,禮服店,酒店小姐,酒店經紀人,
等相關服務 幫您快速的實現您的夢想~!!

seldamuratim said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..
sesli sohbetsesli chatkamerali sohbetseslisohbetsesli sohbet sitelerisesli chat siteleriseslichatsesli sohpetseslisohbet.comsesli chatsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet
seslisohbetsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet

seldamuratim said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..

sesli sohbet
seslisohbet
sesli chat
seslichat
sesli sohbet sitesi
sesli chat sitesi
sesli sohpet
kamerali sohbet
kamerali chat
webcam sohbet

DiSCo said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
kameralı sohbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
seslidunya
seslisehir
sesli sex

Sesli Chat said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
sesli muhabbet siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
görüntülü muhabbet siteleri
kameralı sohbet siteleri
kameralı chat siteleri
kameralı muhabbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
birsesver
birses
seslidunya
seslisehir
sesli sex