Kernels and distances are closely related. Given a kernel K(x,z), this induces an RKHS H such that K(x,z)=

On the other hand, we also know how to turn distance metrics into kernels. If (X,d) is a metric space (i.e., X is a space and d is a metric on X), then K(x,z)=exp(-d(x,z)) is positive semi-definite when x,z are drawn from X.

Now, there are three questions that arise. The first is: in the distance->kernel setting, what does the RKHS "look like." I think the sense here is that it's basically the same as the RKHS induced by the Gaussian kernel: it's an infinite-dimensional feature space where different dimensions look like distances to some point. In practice, this "some point" is going to be one of the training points.

The second question follows the fact that we can iterate this construction. One way to ask this is: if we use a kernel to induce a distance, then use this distance to introduce a new kernel, what does this new kernel look like. Or, alternatively, if we have a distance, kernelize it to induce a new distance, then what does this distance look like. I don't have a good intuitive sense for the answer to any of these. Obviously it's straightforward to write out what these things look like, but that doesn't directly give me a sense of what's going on.

The final question is a bit "off topic" from the above two, but still relevant. There's been a lot of work in kernels for discrete data, like strings. The most popular string subsequence kernel is based on counting how many subsequences match between two strings, down-weighted by the length of the subsequences. It's well known, and recognized in string kernel papers, that a kernel of the form K(x,z) = 1-ned(x,z), where ned is a normalized string edit distance is not a valid kernel. However, from what we know about distance metrics, K(x,z) = exp(-sed(x,z)) should be a valid kernel. Yet I'm not aware of this being used anywhere. Is there any reason why? The reason I ask is because it's a much more intuitive than the subsequence kernel, which I only have a vague sense about.

## 10 February 2008

### Kernels, distances and strings

Posted by hal at 2/10/2008 10:23:00 AM

Labels: machine learning

Subscribe to:
Post Comments (Atom)

## 17 comments:

Quick comment on your 3rd point: K(x,z) is proportional to the probability x -> z in a probabilistic mutation model with the log-odds of mutations given by the edit costs. Hum...

don't you mean exp(-d^2), rather than exp(-d) ?

also, what about the Haussler convolution kernel for strings ?

suresh: right, d^2.

fernando: so let's say we use edit distance to induce a kernel. based on your comment, we can think of the kernel value to be the probability of an automata mapping one string to the other. the kernel-induced distance then looks like some distance between (probabilistic) automata that do the mapping. that actually sounds kind of interesting :). i have no idea what happens if you iterate again, though, and create a new kernel based on this.

For applications in which token reorderings are likely, basic subsequence comparison works better than simple edit distance. You get good character n-gram subsequence relations between "Smith, John" and "John Smith" even though they're miles apart in terms of character-level edit distances.

There are richer probabilistic edit distances like the ones introduced by Brill and Moore for spelling and by McCallum, Bellare and Pereira for word skipping and other general edits. These don't, in general, don't have negative logs that (when offset from match cost) form a proper metric like Levenshtein distance.

I don't know much about kernels, but if K(x,y) = exp(-d(x,y)**2) always produces a kernel if d is a proper metric, then the question arises of when a probabilistic string transducer defining p(s1|s2) defines a metric. I think that reduces to when:

d(s1,s2) = - log p(s1|s2) + log p(s2|s2)

forms a metric (the second term is so that d(s,s) = 0.0).

Plain Levenshtein distance with uniform edit costs defines a distance metric, but needs some fiddling to turn into a probability distribution (sum of all operations, including matching, must have probabilty 1.0).

bob:

"For applications in which token reorderings are likely, basic subsequence comparison works better than simple edit distance." -- is this

trueor is it justplausible? I.e., has this effect actually been verified? I definitely find it plausible, but are there cases where it actually works out that way? What about when you're talking about words instead of just characters?There are also a ton of edit distances that William Cohen has proposed and even more that he's compared. If they're actually metrics, then these could also be easily kernelized.

Yes, if you mean do character n-grams work better than edit distance for matching.

Last year, we worked on a database linkage and deduplication problem for film names and actor names, and indeed found character n-grams with TF/IDF weighting a reasonable comparison metric. It put almost all the string matching true positives above an easily identified threshold, with only a few residuals where you had things like names transliterated from the original language versus translated.

We've also used this technique for entity transliteration detection, as in finding variants of "Al Jazeera". These probably would've worked OK with edit distance, too.

Substring character n-grams neatly deal with issues such as diacritics (only a small penalty for mismatch), minor case variation (e.g. "University Of Michigan" vs. "Univ. of Michigan") for varying spellings of titles (e.g. "Star Wars IV" vs. "Star Wars Four"), and for various token orders (e.g. "La Traviata" vs. "Traviata, La").

I've also used them for word-sense disambiguation in our tutorial, using both a TF/IDF form of classification and a k-nearest neighbors built on character n-gram dimensions. Again, you get significant robustness boosts over whole word matchers.

Note that we extract character n-grams across word boundaries, so you get some higher-order token-like effects for free. The bag of words assumption is particularly bad for text classifiers.

Character n-grams also work very well for general robust search over text. I'd like to see them compared to character n-gram language models for search. They're actually the norm for languages like Chinese that are impossible to tokenize reliably (i.e. state of the art is 97-98%). And they're also common for transcribed speech at a phonemic or syllabic lattice level.

There'd obviously be rule-based ways to handle all the things mentioned above, as well as variations due to pronunciation, whole word re-orderings, deletions (e.g. the affine edit distances used for genomic/proteomic matching).

I like the idea behind Cohen et al.'s soft TF/IDF:

http://www.cs.cmu.edu/~wcohen/postscript/kdd-2003-match-ws.pdf

But I can't understand either where the IDF is computed or whether the resulting "distance" is even symmetric.

The Jaro-Winkler string comparison is a custom model designed by Jaro and modified by Winkler for matching individual first or last names.

String Kernels are highly popular for protein sequence classification problems (as an example). Here are some references below. The second is some of my doctoral work involving use of string kernels with a biological similarity. Using such a biological measure -makes the kernel not positive semi-definite. We work around this using an eigen value transformation.

Christina S. Leslie , Eleazar Eskin , Adiel Cohen , Jason Weston , and William Stafford Noble - Mismatch string kernels for discriminative protein classification Bioinformatics 20: 467-476.

Huzefa Rangwala, George Karypis. Profile-based Direct Kernels for Remote Homology Detection and Fold Recognition in BIOINFORMATICS, 21(23):4239-4247 (2005)

It is not true that "if (X,d) is a metric space then K(x,z)=exp(-t * d^2(x,z)) is positive definite". d must be a "Hilbertian distance", that is, a distance arising from a inner product in a RKHS; not any metric (under the axioms of nonnegativity, symmetry and triangle inequality) is allowed. In particular the string edit distance is NOT Hilbertian, therefore, K(x,z)=exp(-t * sed^2(x,z)) is not pd. See for example:

Corinna Cortes, Patrick Haffner and Mehryar Mohri, "Positive Definite Rational Kernels", Proceedings of The 16th Annual Conference on Computational Learning Theory (COLT 2003)

Ultima Online Gold, UO Gold, crestingwait

buy uo gold

buy uo gold

buy uo gold

buy uo gold

buy uo gold

buy uo gold

buy uo gold

buy uo gold

buy uo gold

buy uo gold

lotro gold

wow gold

warhammer gold

buy aoc gold

buy aoc gold

buy aoc gold

buy aoc gold

buy aoc gold

buy aoc gold

buy aoc gold

Age of Conan Gold, AOC Gold

I always heard something from my neighbor that he sometimes goes to the internet bar to play the game which will use him some gw gold，he usually can win a lot of GuildWars Gold，then he let his friends all have some Guild Wars Gold，his friends thank him very much for introducing them the GuildWars money，they usually cheap gw gold together.

酒店經紀PRETTY GIRL 台北酒店經紀人 ,禮服店 酒店兼差PRETTY GIRL酒店公關 酒店小姐 彩色爆米花酒店兼職,酒店工作 彩色爆米花酒店經紀, 酒店上班,酒店工作 PRETTY GIRL酒店喝酒酒店上班 彩色爆米花台北酒店酒店小姐 PRETTY GIRL酒店上班酒店打工PRETTY GIRL酒店打工酒店經紀 彩色爆米花

艾葳酒店經紀公司提供專業的酒店經紀, 酒店上班小姐,八大行業,酒店兼職,傳播妹,或者想要打工兼差、打工,兼差,八大行業,酒店兼職,想去酒店上班, 日式酒店,制服酒店,ktv酒店,禮服店,整天穿得水水漂漂的,還是想去制服店當日領上班小姐,水水們如果想要擁有打工工作、晚上兼差工作、兼差打工、假日兼職、兼職工作、酒店兼差、兼差、打工兼差、日領工作、晚上兼差工作、酒店工作、酒店上班、酒店打工、兼職、兼差、兼差工作、酒店上班等,想了解酒店相關工作和特種行業內容,想兼職工作日領、假日兼職、兼差打工、或晚班兼職想擁有鋼琴酒吧又有保障的工作嗎???又可以現領請找專業又有保障的艾葳酒店經紀公司!

艾葳酒店經紀是合法的公司工作環境高雅時尚，無業績壓力，無脫秀無喝酒壓力，高層次會員制客源，工作輕鬆，可日領、現領。

一般的酒店經紀只會在水水們第一次上班和領薪水時出現而已，對水水們的上班安全一點保障都沒有！艾葳酒店經紀公司的水水們上班時全程媽咪作陪，不需擔心！只提供最優質的酒店上班,酒店上班,酒店打工環境、上班條件給水水們。心動嗎!? 趕快來填寫你的酒店上班履歷表

水水們妳有缺現領、有兼職、缺錢便服店的煩腦嗎?想到日本留學缺錢嗎?妳是傳播妹??想要擁有高時薪又輕鬆的賺錢,酒店和,假日打工,假日兼職賺錢的機會嗎??想實現夢想卻又缺錢沒錢嗎!??

艾葳酒店台北酒店經紀招兵買馬!!徵專業的酒店打工,想要去酒店的水水,想要短期日領,酒店日領,禮服酒店,制服店,酒店經紀,ktv酒店,便服店,酒店工作,禮服店,酒店小姐,酒店經紀人,

等相關服務 幫您快速的實現您的夢想~!!

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..

sesli sohbetsesli chatkamerali sohbetseslisohbetsesli sohbet sitelerisesli chat siteleriseslichatsesli sohpetseslisohbet.comsesli chatsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet

seslisohbetsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..

sesli sohbet

seslisohbet

sesli chat

seslichat

sesli sohbet sitesi

sesli chat sitesi

sesli sohpet

kamerali sohbet

kamerali chat

webcam sohbet

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..

seslisohbet

seslichat

sesli sohbet

sesli chat

sesli

sesli site

görünlütü sohbet

görüntülü chat

kameralı sohbet

kameralı chat

sesli sohbet siteleri

sesli chat siteleri

görüntülü sohbet siteleri

görüntülü chat siteleri

kameralı sohbet siteleri

canlı sohbet

sesli muhabbet

görüntülü muhabbet

kameralı muhabbet

seslidunya

seslisehir

sesli sex

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..

seslisohbet

seslichat

sesli sohbet

sesli chat

sesli

sesli site

görünlütü sohbet

görüntülü chat

kameralı sohbet

kameralı chat

sesli sohbet siteleri

sesli chat siteleri

sesli muhabbet siteleri

görüntülü sohbet siteleri

görüntülü chat siteleri

görüntülü muhabbet siteleri

kameralı sohbet siteleri

kameralı chat siteleri

kameralı muhabbet siteleri

canlı sohbet

sesli muhabbet

görüntülü muhabbet

kameralı muhabbet

birsesver

birses

seslidunya

seslisehir

sesli sex

Post a Comment