Comments on natural language processing blog: Kernels, distances and strings

Anonymous (2008-02-12 18:54):

It is not true that "if (X, d) is a metric space then K(x, z) = exp(-t * d^2(x, z)) is positive definite". d must be a "Hilbertian distance", that is, a distance arising from an inner product in an RKHS; not every metric (satisfying the axioms of nonnegativity, symmetry, and the triangle inequality) qualifies. In particular, string edit distance is NOT Hilbertian; therefore K(x, z) = exp(-t * sed^2(x, z)) is not pd. See for example:

Corinna Cortes, Patrick Haffner, and Mehryar Mohri. "Positive Definite Rational Kernels." Proceedings of the 16th Annual Conference on Computational Learning Theory (COLT 2003).

Anonymous (2008-02-12 14:30):

String kernels are highly popular for protein sequence classification problems, for example. Here are some references; the second is some of my doctoral work involving string kernels with a biological similarity. Using such a biological measure makes the kernel not positive semi-definite. We work around this using an eigenvalue transformation.

Christina S. Leslie, Eleazar Eskin, Adiel Cohen, Jason Weston, and William Stafford Noble. "Mismatch String Kernels for Discriminative Protein Classification." Bioinformatics 20:467-476 (2004).

Huzefa Rangwala and George Karypis. "Profile-based Direct Kernels for Remote Homology Detection and Fold Recognition." Bioinformatics 21(23):4239-4247 (2005).
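That eigenvalue work-around can be sketched in a few lines. This is a minimal illustration of one common variant (clipping negative eigenvalues to zero), assuming the pairwise similarities have already been collected into a matrix S; the transformation used in the paper may differ in detail.

```python
# Sketch: make an indefinite similarity matrix usable as a kernel matrix
# by symmetrizing it and clipping its negative eigenvalues to zero.
import numpy as np

def clip_to_psd(S):
    S = 0.5 * (S + S.T)                     # enforce symmetry
    vals, vecs = np.linalg.eigh(S)          # real eigendecomposition
    vals = np.clip(vals, 0.0, None)         # drop the negative spectrum
    return vecs @ np.diag(vals) @ vecs.T

# Toy similarity matrix with one negative eigenvalue.
S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.9],
              [0.1, 0.9, 1.0]])
print(np.linalg.eigvalsh(S))                # smallest eigenvalue is negative
print(np.linalg.eigvalsh(clip_to_psd(S)))   # all eigenvalues now >= 0
```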
Anonymous (2008-02-12 14:00):

Yes, if you mean: do character n-grams work better than edit distance for matching?

Last year, we worked on a database linkage and deduplication problem for film names and actor names, and indeed found character n-grams with TF/IDF weighting to be a reasonable comparison metric. It put almost all of the true string-matching positives above an easily identified threshold, with only a few residuals where you had things like names transliterated from the original language versus translated.

We've also used this technique for entity transliteration detection, as in finding variants of "Al Jazeera". These probably would've worked OK with edit distance, too.

Substring character n-grams neatly deal with issues such as diacritics (only a small penalty for mismatch), minor case and abbreviation variation (e.g., "University Of Michigan" vs. "Univ. of Michigan"), varying spellings of titles (e.g., "Star Wars IV" vs. "Star Wars Four"), and different token orders (e.g., "La Traviata" vs. "Traviata, La").

I've also used them for word-sense disambiguation in our tutorial, using both a TF/IDF form of classification and a k-nearest-neighbors classifier built on character n-gram dimensions. Again, you get significant robustness boosts over whole-word matchers.

Note that we extract character n-grams across word boundaries, so you get some higher-order token-like effects for free. The bag-of-words assumption is particularly bad for text classifiers.

Character n-grams also work very well for general robust search over text. I'd like to see them compared to character n-gram language models for search. They're actually the norm for languages like Chinese that are impossible to tokenize reliably (state of the art is 97-98%). And they're also common for transcribed speech at the phonemic or syllabic lattice level.

There'd obviously be rule-based ways to handle all the things mentioned above, as well as variation due to pronunciation, whole-word reorderings, and deletions (e.g., the affine edit distances used for genomic/proteomic matching).

I like the idea behind Cohen et al.'s soft TF/IDF: http://www.cs.cmu.edu/~wcohen/postscript/kdd-2003-match-ws.pdf But I can't work out where the IDF is computed, or whether the resulting "distance" is even symmetric.

The Jaro-Winkler string comparison is a custom model designed by Jaro, and modified by Winkler, for matching individual first or last names.
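The character n-gram TF/IDF comparison described above is easy to prototype. The following is a rough sketch, not anyone's production code; the padding, n-gram order, and IDF smoothing are all illustrative choices.

```python
# Sketch: TF-IDF cosine similarity over character n-grams extracted
# across word boundaries, for fuzzy string matching / deduplication.
import math
from collections import Counter

def char_ngrams(s, n=3):
    s = " " + s.lower() + " "               # pad so edges yield n-grams too
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def tfidf_cosine(a, b, corpus, n=3):
    docs = [char_ngrams(s, n) for s in corpus]
    def idf(g):                             # smoothed inverse document frequency
        df = sum(1 for d in docs if g in d)
        return math.log((1 + len(docs)) / (1 + df))
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(va[g] * vb[g] * idf(g) ** 2 for g in va.keys() & vb.keys())
    norm = lambda v: math.sqrt(sum((tf * idf(g)) ** 2 for g, tf in v.items()))
    na, nb = norm(va), norm(vb)
    return dot / (na * nb) if na and nb else 0.0

titles = ["Star Wars IV", "Star Wars Four", "La Traviata", "Traviata, La"]
print(tfidf_cosine("La Traviata", "Traviata, La", titles))  # high: token reorder
print(tfidf_cosine("La Traviata", "Star Wars IV", titles))  # low: unrelated
```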
hal (2008-02-11 13:48):

bob:

"For applications in which token reorderings are likely, basic subsequence comparison works better than simple edit distance." -- is this *true* or is it just *plausible*? I.e., has this effect actually been verified? I definitely find it plausible, but are there cases where it actually works out that way? What about when you're talking about words instead of just characters?

There are also a ton of edit distances that William Cohen has proposed, and even more that he's compared. If they're actually metrics, then these could also be easily kernelized.

Anonymous (2008-02-11 13:18):

For applications in which token reorderings are likely, basic subsequence comparison works better than simple edit distance. You get good character n-gram subsequence relations between "Smith, John" and "John Smith" even though they're miles apart in terms of character-level edit distance.

There are richer probabilistic edit distances, like the ones introduced by Brill and Moore for spelling and by McCallum, Bellare, and Pereira for word skipping and other general edits. These, in general, don't have negative logs that (when offset by the match cost) form a proper metric the way Levenshtein distance does.

I don't know much about kernels, but if K(x, y) = exp(-d(x, y)^2) always produces a kernel when d is a proper metric, then the question arises of when a probabilistic string transducer defining p(s1|s2) defines a metric. I think that reduces to when

    d(s1, s2) = -log p(s1|s2) + log p(s2|s2)

forms a metric (the second term is there so that d(s, s) = 0).

Plain Levenshtein distance with uniform edit costs defines a distance metric, but needs some fiddling to turn into a probability distribution (the sum over all operations, including matching, must have probability 1.0).

hal (2008-02-10 15:42):

suresh: right, d^2.

fernando: so let's say we use edit distance to induce a kernel. based on your comment, we can think of the kernel value as the probability of an automaton mapping one string to the other. the kernel-induced distance then looks like some distance between the (probabilistic) automata that do the mapping. that actually sounds kind of interesting :). i have no idea what happens if you iterate again, though, and create a new kernel based on this.

Kathy K (2008-02-10 15:41):

This comment has been removed by the author.

Suresh Venkatasubramanian (2008-02-10 14:59):

don't you mean exp(-d^2), rather than exp(-d)? also, what about the Haussler convolution kernel for strings?

Fernando Pereira (2008-02-10 11:18):

Quick comment on your 3rd point: K(x, z) is proportional to the probability of x -> z in a probabilistic mutation model with the log-odds of mutations given by the edit costs. Hum...
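The disagreement running through this thread (whether any metric can be kernelized this way, or only Hilbertian ones, as the top comment argues) can be probed empirically. Here is a minimal sketch, with an arbitrary toy word list and arbitrary values of t, that builds K(x, z) = exp(-t * d^2(x, z)) from plain Levenshtein distance and inspects the smallest eigenvalue of the resulting Gram matrix; a negative value for any string set and any t certifies that the construction is not positive semi-definite in general.

```python
# Sketch: build K = exp(-t * d^2) from plain Levenshtein distance and
# check the Gram matrix's spectrum. A negative eigenvalue on any string
# set shows the construction is not positive semi-definite in general.
import numpy as np

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

words = ["kitten", "sitting", "mitten", "smitten", "sit", "kit"]
D = np.array([[levenshtein(a, b) for b in words] for a in words], float)
for t in (0.1, 1.0, 10.0):
    K = np.exp(-t * D ** 2)
    print(t, np.linalg.eigvalsh(K).min())  # negative => K is not PSD for this t
```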