## 10 February 2008

### Kernels, distances and strings

Kernels and distances are closely related. Given a kernel K(x,z), this induces an RKHS H such that K(x,z)=, where f is the mapping from the input space to H, and <.,.> is the dot product in H. Dot products induce norms in the obvious way: ||x||^2 = K(x,x). This, in turn, induces an obvious distance metric: d(x,z)^2 = K(x,x)+K(z,z)-2K(x,z).

On the other hand, we also know how to turn distance metrics into kernels. If (X,d) is a metric space (i.e., X is a space and d is a metric on X), then K(x,z)=exp(-d(x,z)) is positive semi-definite when x,z are drawn from X.

Now, there are three questions that arise. The first is: in the distance->kernel setting, what does the RKHS "look like." I think the sense here is that it's basically the same as the RKHS induced by the Gaussian kernel: it's an infinite-dimensional feature space where different dimensions look like distances to some point. In practice, this "some point" is going to be one of the training points.

The second question follows the fact that we can iterate this construction. One way to ask this is: if we use a kernel to induce a distance, then use this distance to introduce a new kernel, what does this new kernel look like. Or, alternatively, if we have a distance, kernelize it to induce a new distance, then what does this distance look like. I don't have a good intuitive sense for the answer to any of these. Obviously it's straightforward to write out what these things look like, but that doesn't directly give me a sense of what's going on.

The final question is a bit "off topic" from the above two, but still relevant. There's been a lot of work in kernels for discrete data, like strings. The most popular string subsequence kernel is based on counting how many subsequences match between two strings, down-weighted by the length of the subsequences. It's well known, and recognized in string kernel papers, that a kernel of the form K(x,z) = 1-ned(x,z), where ned is a normalized string edit distance is not a valid kernel. However, from what we know about distance metrics, K(x,z) = exp(-sed(x,z)) should be a valid kernel. Yet I'm not aware of this being used anywhere. Is there any reason why? The reason I ask is because it's a much more intuitive than the subsequence kernel, which I only have a vague sense about.

1. Quick comment on your 3rd point: K(x,z) is proportional to the probability x -> z in a probabilistic mutation model with the log-odds of mutations given by the edit costs. Hum...

2. don't you mean exp(-d^2), rather than exp(-d) ?

also, what about the Haussler convolution kernel for strings ?

3. This comment has been removed by the author.

4. suresh: right, d^2.

fernando: so let's say we use edit distance to induce a kernel. based on your comment, we can think of the kernel value to be the probability of an automata mapping one string to the other. the kernel-induced distance then looks like some distance between (probabilistic) automata that do the mapping. that actually sounds kind of interesting :). i have no idea what happens if you iterate again, though, and create a new kernel based on this.

5. For applications in which token reorderings are likely, basic subsequence comparison works better than simple edit distance. You get good character n-gram subsequence relations between "Smith, John" and "John Smith" even though they're miles apart in terms of character-level edit distances.

There are richer probabilistic edit distances like the ones introduced by Brill and Moore for spelling and by McCallum, Bellare and Pereira for word skipping and other general edits. These don't, in general, don't have negative logs that (when offset from match cost) form a proper metric like Levenshtein distance.

I don't know much about kernels, but if K(x,y) = exp(-d(x,y)**2) always produces a kernel if d is a proper metric, then the question arises of when a probabilistic string transducer defining p(s1|s2) defines a metric. I think that reduces to when:

d(s1,s2) = - log p(s1|s2) + log p(s2|s2)

forms a metric (the second term is so that d(s,s) = 0.0).

Plain Levenshtein distance with uniform edit costs defines a distance metric, but needs some fiddling to turn into a probability distribution (sum of all operations, including matching, must have probabilty 1.0).

6. bob:

"For applications in which token reorderings are likely, basic subsequence comparison works better than simple edit distance." -- is this true or is it just plausible? I.e., has this effect actually been verified? I definitely find it plausible, but are there cases where it actually works out that way? What about when you're talking about words instead of just characters?

There are also a ton of edit distances that William Cohen has proposed and even more that he's compared. If they're actually metrics, then these could also be easily kernelized.

7. Yes, if you mean do character n-grams work better than edit distance for matching.

Last year, we worked on a database linkage and deduplication problem for film names and actor names, and indeed found character n-grams with TF/IDF weighting a reasonable comparison metric. It put almost all the string matching true positives above an easily identified threshold, with only a few residuals where you had things like names transliterated from the original language versus translated.

We've also used this technique for entity transliteration detection, as in finding variants of "Al Jazeera". These probably would've worked OK with edit distance, too.

Substring character n-grams neatly deal with issues such as diacritics (only a small penalty for mismatch), minor case variation (e.g. "University Of Michigan" vs. "Univ. of Michigan") for varying spellings of titles (e.g. "Star Wars IV" vs. "Star Wars Four"), and for various token orders (e.g. "La Traviata" vs. "Traviata, La").

I've also used them for word-sense disambiguation in our tutorial, using both a TF/IDF form of classification and a k-nearest neighbors built on character n-gram dimensions. Again, you get significant robustness boosts over whole word matchers.

Note that we extract character n-grams across word boundaries, so you get some higher-order token-like effects for free. The bag of words assumption is particularly bad for text classifiers.

Character n-grams also work very well for general robust search over text. I'd like to see them compared to character n-gram language models for search. They're actually the norm for languages like Chinese that are impossible to tokenize reliably (i.e. state of the art is 97-98%). And they're also common for transcribed speech at a phonemic or syllabic lattice level.

There'd obviously be rule-based ways to handle all the things mentioned above, as well as variations due to pronunciation, whole word re-orderings, deletions (e.g. the affine edit distances used for genomic/proteomic matching).

I like the idea behind Cohen et al.'s soft TF/IDF:

http://www.cs.cmu.edu/~wcohen/postscript/kdd-2003-match-ws.pdf

But I can't understand either where the IDF is computed or whether the resulting "distance" is even symmetric.

The Jaro-Winkler string comparison is a custom model designed by Jaro and modified by Winkler for matching individual first or last names.

8. String Kernels are highly popular for protein sequence classification problems (as an example). Here are some references below. The second is some of my doctoral work involving use of string kernels with a biological similarity. Using such a biological measure -makes the kernel not positive semi-definite. We work around this using an eigen value transformation.

Christina S. Leslie , Eleazar Eskin , Adiel Cohen , Jason Weston , and William Stafford Noble - Mismatch string kernels for discriminative protein classification Bioinformatics 20: 467-476.

Huzefa Rangwala, George Karypis. Profile-based Direct Kernels for Remote Homology Detection and Fold Recognition in BIOINFORMATICS, 21(23):4239-4247 (2005)

9. It is not true that "if (X,d) is a metric space then K(x,z)=exp(-t * d^2(x,z)) is positive definite". d must be a "Hilbertian distance", that is, a distance arising from a inner product in a RKHS; not any metric (under the axioms of nonnegativity, symmetry and triangle inequality) is allowed. In particular the string edit distance is NOT Hilbertian, therefore, K(x,z)=exp(-t * sed^2(x,z)) is not pd. See for example:

Corinna Cortes, Patrick Haffner and Mehryar Mohri, "Positive Definite Rational Kernels", Proceedings of The 16th Annual Conference on Computational Learning Theory (COLT 2003)

10. 酒店經紀PRETTY GIRL 台北酒店經紀人 ,禮服店 酒店兼差PRETTY GIRL酒店公關 酒店小姐 彩色爆米花酒店兼職,酒店工作 彩色爆米花酒店經紀, 酒店上班,酒店工作 PRETTY GIRL酒店喝酒酒店上班 彩色爆米花台北酒店酒店小姐 PRETTY GIRL酒店上班酒店打工PRETTY GIRL酒店打工酒店經紀 彩色爆米花