## 10 February 2008

### Kernels, distances and strings

Kernels and distances are closely related. Given a kernel K(x,z), this induces an RKHS H such that K(x,z)=, where f is the mapping from the input space to H, and <.,.> is the dot product in H. Dot products induce norms in the obvious way: ||x||^2 = K(x,x). This, in turn, induces an obvious distance metric: d(x,z)^2 = K(x,x)+K(z,z)-2K(x,z).

On the other hand, we also know how to turn distance metrics into kernels. If (X,d) is a metric space (i.e., X is a space and d is a metric on X), then K(x,z)=exp(-d(x,z)) is positive semi-definite when x,z are drawn from X.

Now, there are three questions that arise. The first is: in the distance->kernel setting, what does the RKHS "look like." I think the sense here is that it's basically the same as the RKHS induced by the Gaussian kernel: it's an infinite-dimensional feature space where different dimensions look like distances to some point. In practice, this "some point" is going to be one of the training points.

The second question follows the fact that we can iterate this construction. One way to ask this is: if we use a kernel to induce a distance, then use this distance to introduce a new kernel, what does this new kernel look like. Or, alternatively, if we have a distance, kernelize it to induce a new distance, then what does this distance look like. I don't have a good intuitive sense for the answer to any of these. Obviously it's straightforward to write out what these things look like, but that doesn't directly give me a sense of what's going on.

The final question is a bit "off topic" from the above two, but still relevant. There's been a lot of work in kernels for discrete data, like strings. The most popular string subsequence kernel is based on counting how many subsequences match between two strings, down-weighted by the length of the subsequences. It's well known, and recognized in string kernel papers, that a kernel of the form K(x,z) = 1-ned(x,z), where ned is a normalized string edit distance is not a valid kernel. However, from what we know about distance metrics, K(x,z) = exp(-sed(x,z)) should be a valid kernel. Yet I'm not aware of this being used anywhere. Is there any reason why? The reason I ask is because it's a much more intuitive than the subsequence kernel, which I only have a vague sense about.

Fernando Pereira said...

Quick comment on your 3rd point: K(x,z) is proportional to the probability x -> z in a probabilistic mutation model with the log-odds of mutations given by the edit costs. Hum...

Suresh said...

don't you mean exp(-d^2), rather than exp(-d) ?

also, what about the Haussler convolution kernel for strings ?

Kathy said...
This comment has been removed by the author.
hal said...

suresh: right, d^2.

fernando: so let's say we use edit distance to induce a kernel. based on your comment, we can think of the kernel value to be the probability of an automata mapping one string to the other. the kernel-induced distance then looks like some distance between (probabilistic) automata that do the mapping. that actually sounds kind of interesting :). i have no idea what happens if you iterate again, though, and create a new kernel based on this.

Bob Carpenter said...

For applications in which token reorderings are likely, basic subsequence comparison works better than simple edit distance. You get good character n-gram subsequence relations between "Smith, John" and "John Smith" even though they're miles apart in terms of character-level edit distances.

There are richer probabilistic edit distances like the ones introduced by Brill and Moore for spelling and by McCallum, Bellare and Pereira for word skipping and other general edits. These don't, in general, don't have negative logs that (when offset from match cost) form a proper metric like Levenshtein distance.

I don't know much about kernels, but if K(x,y) = exp(-d(x,y)**2) always produces a kernel if d is a proper metric, then the question arises of when a probabilistic string transducer defining p(s1|s2) defines a metric. I think that reduces to when:

d(s1,s2) = - log p(s1|s2) + log p(s2|s2)

forms a metric (the second term is so that d(s,s) = 0.0).

Plain Levenshtein distance with uniform edit costs defines a distance metric, but needs some fiddling to turn into a probability distribution (sum of all operations, including matching, must have probabilty 1.0).

hal said...

bob:

"For applications in which token reorderings are likely, basic subsequence comparison works better than simple edit distance." -- is this true or is it just plausible? I.e., has this effect actually been verified? I definitely find it plausible, but are there cases where it actually works out that way? What about when you're talking about words instead of just characters?

There are also a ton of edit distances that William Cohen has proposed and even more that he's compared. If they're actually metrics, then these could also be easily kernelized.

Bob Carpenter said...

Yes, if you mean do character n-grams work better than edit distance for matching.

Last year, we worked on a database linkage and deduplication problem for film names and actor names, and indeed found character n-grams with TF/IDF weighting a reasonable comparison metric. It put almost all the string matching true positives above an easily identified threshold, with only a few residuals where you had things like names transliterated from the original language versus translated.

We've also used this technique for entity transliteration detection, as in finding variants of "Al Jazeera". These probably would've worked OK with edit distance, too.

Substring character n-grams neatly deal with issues such as diacritics (only a small penalty for mismatch), minor case variation (e.g. "University Of Michigan" vs. "Univ. of Michigan") for varying spellings of titles (e.g. "Star Wars IV" vs. "Star Wars Four"), and for various token orders (e.g. "La Traviata" vs. "Traviata, La").

I've also used them for word-sense disambiguation in our tutorial, using both a TF/IDF form of classification and a k-nearest neighbors built on character n-gram dimensions. Again, you get significant robustness boosts over whole word matchers.

Note that we extract character n-grams across word boundaries, so you get some higher-order token-like effects for free. The bag of words assumption is particularly bad for text classifiers.

Character n-grams also work very well for general robust search over text. I'd like to see them compared to character n-gram language models for search. They're actually the norm for languages like Chinese that are impossible to tokenize reliably (i.e. state of the art is 97-98%). And they're also common for transcribed speech at a phonemic or syllabic lattice level.

There'd obviously be rule-based ways to handle all the things mentioned above, as well as variations due to pronunciation, whole word re-orderings, deletions (e.g. the affine edit distances used for genomic/proteomic matching).

I like the idea behind Cohen et al.'s soft TF/IDF:

http://www.cs.cmu.edu/~wcohen/postscript/kdd-2003-match-ws.pdf

But I can't understand either where the IDF is computed or whether the resulting "distance" is even symmetric.

The Jaro-Winkler string comparison is a custom model designed by Jaro and modified by Winkler for matching individual first or last names.

Huzefa said...

String Kernels are highly popular for protein sequence classification problems (as an example). Here are some references below. The second is some of my doctoral work involving use of string kernels with a biological similarity. Using such a biological measure -makes the kernel not positive semi-definite. We work around this using an eigen value transformation.

Christina S. Leslie , Eleazar Eskin , Adiel Cohen , Jason Weston , and William Stafford Noble - Mismatch string kernels for discriminative protein classification Bioinformatics 20: 467-476.

Huzefa Rangwala, George Karypis. Profile-based Direct Kernels for Remote Homology Detection and Fold Recognition in BIOINFORMATICS, 21(23):4239-4247 (2005)

Andre Martins said...

It is not true that "if (X,d) is a metric space then K(x,z)=exp(-t * d^2(x,z)) is positive definite". d must be a "Hilbertian distance", that is, a distance arising from a inner product in a RKHS; not any metric (under the axioms of nonnegativity, symmetry and triangle inequality) is allowed. In particular the string edit distance is NOT Hilbertian, therefore, K(x,z)=exp(-t * sed^2(x,z)) is not pd. See for example:

Corinna Cortes, Patrick Haffner and Mehryar Mohri, "Positive Definite Rational Kernels", Proceedings of The 16th Annual Conference on Computational Learning Theory (COLT 2003)

Anonymous said...

Ultima Online Gold, UO Gold, crestingwait
lotro gold
wow gold
warhammer gold
Age of Conan Gold, AOC Gold

Anonymous said...

I always heard something from my neighbor that he sometimes goes to the internet bar to play the game which will use him some gw gold，he usually can win a lot of GuildWars Gold，then he let his friends all have some Guild Wars Gold，his friends thank him very much for introducing them the GuildWars money，they usually cheap gw gold together.

. said...

seldamuratim said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..
sesli sohbetsesli chatkamerali sohbetseslisohbetsesli sohbet sitelerisesli chat siteleriseslichatsesli sohpetseslisohbet.comsesli chatsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet
seslisohbetsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet

seldamuratim said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..

sesli sohbet
seslisohbet
sesli chat
seslichat
sesli sohbet sitesi
sesli chat sitesi
sesli sohpet
kamerali sohbet
kamerali chat
webcam sohbet

DiSCo said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
kameralı sohbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
seslidunya
seslisehir
sesli sex

Sesli Chat said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
sesli muhabbet siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
görüntülü muhabbet siteleri
kameralı sohbet siteleri
kameralı chat siteleri
kameralı muhabbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
birsesver
birses
seslidunya
seslisehir
sesli sex