IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 4, APRIL 2011

Information Distance in Multiples

Paul M. B. Vitányi

Abstract: Information distance is a parameter-free similarity measure based on compression, used in pattern recognition, data mining, phylogeny, clustering, and classification. The notion of information distance is extended from pairs to multiples (finite lists). We study maximal overlap, metricity, universality, minimal overlap, additivity, and normalized information distance in multiples. We use the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program.

Index Terms: Data mining, information distance, Kolmogorov complexity, multiples, pattern recognition, similarity.

Manuscript received May 20, 2009; revised November 15, 2010; accepted November 19, 2010. Date of current version March 16, 2011. This work was supported in part by the ESF QiT Programme, by the EU NoE PASCAL II, and in part by the Netherlands BSIK/BRICKS project. The author is with the National Research Center for Mathematics and Computer Science in the Netherlands (CWI) and is also affiliated with the University of Amsterdam; CWI, Science Park 123, 1098 XG Amsterdam, The Netherlands. Communicated by A. Krzyzak, Associate Editor for Pattern Recognition, Statistical Learning and Inference.

I. INTRODUCTION

In pattern recognition, learning, and data mining one obtains information from objects containing information. This involves an objective definition of the information in a single object, the information to go from one object to another object in a pair of objects, the information to go from one object to any other object in a multiple of objects, and the shared information between objects [34].

The classical notion of Kolmogorov complexity [21] is an objective measure for the information in a single object, and information distance measures the information between a pair of objects [3]. This last notion has spawned research in the theoretical direction, among others [6], [30], [35], [37]-[39]. Research in the practical direction has focused on the normalized information distance, also called the similarity metric, which arises by normalizing the information distance in a proper manner and approximating the Kolmogorov complexity through real-world compressors [7]-[9], [26]. This normalized information distance is a parameter-free, feature-free, alignment-free similarity measure that has had great impact in applications.
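As a concrete illustration of how the Kolmogorov complexities in this measure are approximated in practice, the following minimal sketch computes the normalized compression distance of [9] in Python. The choice of bz2 as the stand-in compressor and the helper names are assumptions of this sketch, not prescriptions of the paper.

    import bz2

    def C(data: bytes) -> int:
        # Approximate the Kolmogorov complexity K(data) by the length
        # of the compressed version of the data (here using bz2).
        return len(bz2.compress(data))

    def ncd(x: bytes, y: bytes) -> float:
        # Normalized compression distance of [9]:
        # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
        cx, cy, cxy = C(x), C(y), C(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    if __name__ == "__main__":
        s1 = b"the quick brown fox jumps over the lazy dog " * 50
        s2 = b"the quick brown fox jumps over the lazy cat " * 50
        s3 = bytes(range(256)) * 10
        print(ncd(s1, s2))  # smaller: s1 and s2 share most of their information
        print(ncd(s1, s3))  # larger: s1 and s3 share little information

Any lossless compressor can be substituted for bz2; better compressors generally give closer approximations of the underlying Kolmogorov complexities.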
A variant of this compression distance has been tested on all time-sequence databases used in the last decade in the major data mining conferences (SIGKDD, SIGMOD, ICDM, ICDE, SSDB, VLDB, PKDD, PAKDD) [18]. The conclusion is that the method is competitive with all 51 other methods used, and superior in heterogeneous data clustering and anomaly detection. In [4] it was shown that the method is resistant to noise. The theory has found many applications in pattern recognition, phylogeny, clustering, and classification. For objects that are represented as computer files such applications range from weather forecasting, software, earthquake prediction, music, literature, OCR, and bioinformatics to the internet [1], [2], [5], [8]-[10], [12], [19], [20], [22], [23], [25], [31]-[33], [40]. For objects that are only represented by name, or objects that are abstract like "red," "Einstein," or "three," the normalized information distance uses background information provided by Google, or any other search engine that produces aggregate page counts. It discovers the meaning of words and phrases in the sense of producing a relative semantics. Applications run from ontology, semantics, tourism on the web, taxonomy, and multilingual questions to question-answer systems [13]-[15], [17], [36], [41]-[43]. For more references on either subject see the textbook [28], or Google Scholar for references to [8], [9], [26].

However, in many applications we are interested in shared information between many objects instead of just a pair of objects. For example, in customer reviews of gadgets, in blogs about public happenings, or in newspaper articles about the same occurrence, we are interested in the most comprehensive one or the most specialized one. Thus, we want to extend the information distance measure from pairs to multiples.

A. Related Work

In [27] the notion is introduced of the information required to go from any object in a multiple of objects to any other object in the multiple. This is applied to extracting the essence from, for example, a finite list of internet news items, reviews of electronic cameras, TVs, and so on, in a way that works better than other methods. Let X denote a finite list of n finite binary strings defined by X = (x_1, ..., x_n), the constituting strings ordered length-increasing lexicographic. We use lists and not sets, since for a set we cannot simply express the distance from a string to itself, or between strings that are all equal. Let U be the reference universal Turing machine, for convenience the prefix machine as in Section II. We define the information distance in X by E_max(X) = min{|p| : U(x, p) = X for all x in X}. It is shown in [27, Theorem 2] that

    E_max(X) = max_{x in X} K(X|x),    (I.1)

up to a logarithmic additive term. A counterpart quantity is defined as well, and in [27, Theorem 3] they state that for every list X we have (I.2), again up to a logarithmic additive term. This is not a corollary of (I.1) as stated in [27], but both inequalities follow from the definitions. The left-hand side (LHS) of (I.2) is interpreted as the program length of the "most comprehensive object that contains the most information about all the others" (all elements of X), and the right-hand side (RHS) is interpreted as the program length of the "most specialized object that is similar to all the others." The paper [27] develops the stated results and applications. It does not develop the theory in any detail. That is the purpose of the present paper.

B. Results

Information distance for multiples, that is, finite lists, appears both practically and theoretically promising. In nearly all cases below the results imply the corresponding ones for the pairwise information distance, defined as follows. The information distance in [3] between strings x and y is E_max(x, y) = max{K(x|y), K(y|x)}; in the current paper it is E_max(X) with X = (x, y). These two definitions coincide for n = 2, since max{K(x, y|x), K(x, y|y)} = max{K(y|x), K(x|y)} up to an additive constant term.

We investigate the maximal overlap of information (Theorem 3.1), which for n = 2 specializes to Theorem 3.4 in [3]; Corollary 3.2 shows (I.1), and Corollary 3.3 shows that the LHS of (I.2) can be taken to correspond to a single program embodying the "most comprehensive object that contains the most information about all the others," as stated but not argued or proved in [27]. We further investigate metricity (Theorem 4.1) and universality (Theorem 5.2), which for n = 2 specialize to Theorem 4.2 in [3]; additivity (Theorem 6.1); minimal overlap of information (Theorem 7.1), which for n = 2 specializes to Muchnik's theorem [29] (see also [28, Theorem 8.3.7]); and the nonmetricity of normalized information distance for lists of more than two elements under certain proposals for the normalizing factor (Section VIII). In contrast, for lists of two elements we can normalize the information distance as in Lemma V.4 and Theorem V.7 of [26]. The definitions are of necessity new, as are the proof ideas. Remarkably, the new notation and proofs for the general case are simpler than the mentioned existing proofs for the particular case of pairwise information distance.
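To make the compression-based approximation of (I.1) concrete for lists rather than pairs, the sketch below approximates E_max(X) for a finite list of byte strings. The heuristic K(X|x) ~ C(X) - C(x), the use of bz2, and the helper names are assumptions of this sketch, not constructions from the paper.

    import bz2

    def C(data: bytes) -> int:
        # Compressed length as a practical stand-in for Kolmogorov complexity.
        return len(bz2.compress(data))

    def e_max(X: list[bytes]) -> int:
        # Approximates E_max(X) = max_{x in X} K(X|x) of (I.1) via the
        # heuristic K(X|x) ~ C(X) - C(x), which gives
        # E_max(X) ~ C(X) - min_{x in X} C(x).
        canonical = sorted(X, key=lambda s: (len(s), s))  # length-increasing lexicographic
        cX = C(b"".join(canonical))
        return cX - min(C(x) for x in X)

    reviews = [
        b"camera A: battery life is excellent, menus are slow " * 20,
        b"camera A: excellent battery life, sharp lens, slow menus " * 20,
        b"camera A: the lens is sharp but the menus feel slow " * 20,
    ]
    print(e_max(reviews))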
II. PRELIMINARIES

Kolmogorov Complexity: This is the information in a single object [21]. The notion has been the subject of a plethora of papers. Informally, the Kolmogorov complexity of a finite binary string is the length of the shortest string from which the original can be losslessly reconstructed by an effective general-purpose computer, such as a particular universal Turing machine. Hence it constitutes a lower bound on how far a lossless compression program can compress. For technical reasons we choose Turing machines with a separate read-only input tape that is scanned from left to right without backing up, a separate work tape on which the computation takes place, and a separate output tape. Upon halting, the initial segment of the input that has been scanned is called the input program, and the contents of the output tape is called the output. By construction, the set of halting programs is prefix free. We call U the reference universal prefix Turing machine. This leads to the definition of prefix Kolmogorov complexity, which we shall designate simply as Kolmogorov complexity. Formally, the conditional Kolmogorov complexity K(x|y) is the length of a shortest input p such that the reference universal prefix Turing machine on input p with auxiliary information y outputs x. The unconditional Kolmogorov complexity is defined by K(x) = K(x|ε), where ε is the empty string (of length 0). In these definitions both x and y can consist of nonempty finite lists of finite binary strings. For more details and the theorems that are used in the present work see Appendix I.

Lists: A list X = (x_1, ..., x_n) is a multiple of finite binary strings in length-increasing lexicographic order. If X is a list, then some or all of its elements may be equal. Thus, a list is not a set but an ordered bag of elements. With some abuse of the common set-membership notation we write x in X to mean that x is an element of list X. The conditional prefix Kolmogorov complexity K(X|x) of a list X given an element x is the length of a shortest program for the reference universal Turing machine that with input x outputs the list X. The prefix Kolmogorov complexity of a list is defined by K(X) = K(X|ε). One can also put lists in the conditional, such as K(x|X) or K(X|Y). We will use straightforward laws such as K(X|x) = K(X \ {x} | x) up to an additive constant term for x in X, where X \ {x} equals the list X with one occurrence of the element x deleted.

Information Distance: To obtain the pairwise information distance of [3] we take n = 2 in (I.1). Then (I.1) is equivalent to E_max(x, y) = max{K(x|y), K(y|x)}.
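The prefix-free (self-delimiting) property of programs is what later allows several encoded strings to be concatenated without separation markers and still be parsed unambiguously. A minimal sketch of such a code is given below; the particular construction (length of the length field in unary, a terminating 0, the length in binary, then the string itself) is a standard textbook device chosen here for illustration, not the specific encoding of the reference machine.

    def encode(x: str) -> str:
        # Self-delimiting code for a binary string x:
        # unary length of the length field + "0" + len(x) in binary + x.
        length_bits = format(len(x), "b")
        return "1" * len(length_bits) + "0" + length_bits + x

    def decode_one(stream: str) -> tuple[str, str]:
        # Parse one codeword from the front of the stream; return (x, rest).
        i = stream.index("0")                 # size of the length field
        length_bits = stream[i + 1 : 2 * i + 1]
        n = int(length_bits, 2)
        return stream[2 * i + 1 : 2 * i + 1 + n], stream[2 * i + 1 + n :]

    # Two codewords concatenated without separation markers are still parsed
    # unambiguously, because no codeword is a proper prefix of another.
    stream = encode("10110") + encode("001")
    first, rest = decode_one(stream)
    second, _ = decode_one(rest)
    print(first, second)                      # prints: 10110 001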
III. MAXIMAL OVERLAP

We use the notation and terminology of Section I-A. Define k_min = min_{x in X} K(X|x) and k_max = max_{x in X} K(X|x). We prove a maximal overlap theorem: the information needed to go from any x to any y in X can be divided into two parts, a single string of length k_min and a string of length k_max - k_min (possibly depending on x), everything up to an additive logarithmic term.

Theorem 3.1: A single program of k_min bits concatenated with a string of k_max - k_min bits, possibly depending on x, suffices to find X from x for every x in X. To find an arbitrary element of X from x it suffices to concatenate at most a further string of bits, possibly depending on x and on the element sought.

Proof: Enumerate the finite binary strings length-increasing lexicographic. Define a graph whose nodes are the finite binary strings together with the lists of such strings under consideration; given the relevant complexity bound, the set of these lists can be enumerated. Enumerating the lists length-increasing lexicographic, we put an edge between a string and a list containing it whenever the defining condition holds, and we label the edge with a string chosen as follows: it is the i-th string of the appropriate length, where i is the number of times that label length has been used so far. So the first time we choose an edge we use the first such string, the next time the next one, and so on. In this way the degrees stay under control: by adding a prefix to the label we take care that the degree of a list node is at most what is claimed, and not larger as it could be without the prefix. The degree of a string node is trivially bounded as well.

In addition, we enumerate the lists length-increasing lexicographic and color every one of the edges incident with an enumerated list with the same binary string of the prescribed length. If the list is connected by edges to certain string nodes, then we choose as its color the minimum color not yet appearing on any edge incident with any of those nodes. Since the degree of every node is bounded, and hence the colors already used for edges incident on those nodes are fewer than the number of available colors, a color is always available. Knowing the relevant parameters, one can reconstruct the graph and color its edges. Given an element x from the list X, and knowing the appropriate label string and the color of the edge, we can find X. Hence a single program, say q, of k_min bits suffices to find X from x for every x in X. An additional string of bits suffices to select any element of X. Taking these bits so that they encode the difference from x to the element sought, we can compute from every element of X to every other element and vice versa with the same program q concatenated with a string of length k_max - k_min and a further selection string, both possibly depending on x. Since we know the lengths involved from the fixed program, where they are encoded as a self-delimiting prefix of logarithmic length, we can concatenate these strings without separation markers and reconstruct them.

Corollary 3.2: Since any program that computes X from every x in X has length at least max_{x in X} K(X|x), the theorem implies (I.1), that is, [27, Theorem 2].

It is not a priori clear that the quantity in the LHS of (I.2) corresponds to a single program that represents the information overlap of every shortest program going from any x in X to the list X. This seems in fact assumed in [27], where it is interpreted as the Kolmogorov complexity of the "most comprehensive object that contains the most information about all the others." In fact, for every x in X we can choose a shortest program going from x to the list X so that these programs have pairwise no information overlap at all (Theorem 7.1). But here we have proved the following.

Corollary 3.3: The quantity max_{x in X} K(X|x) corresponds to a single shortest program that represents the maximum overlap of information of all programs going from x to the list X, for any x in X.

IV. METRICITY

We consider nonempty finite lists of finite binary strings, each list ordered length-increasing lexicographic. A distance function d on these lists maps every list to a nonnegative real number. Define XY to be the list consisting of the elements of the lists X and Y, the elements of XY ordered length-increasing lexicographic. A distance function d is a metric if

1) Positive definiteness: d(X) = 0 if all elements of X are equal and d(X) > 0 otherwise.
2) Symmetry: d(X) is invariant under all permutations of X.
3) Triangle inequality: d(XY) <= d(XZ) + d(ZY).

Theorem 4.1: The information distance for lists, E_max(X), is a metric, where the metric (in)equalities hold up to an additive term that is logarithmic in the largest quantity involved in each (in)equality.

Proof: It is clear that E_max satisfies positive definiteness and symmetry up to the stated additive term. It remains to show the triangle inequality.

Claim 4.2: Let X, Y, Z be three nonempty finite lists of finite binary strings. Then E_max(XY) <= E_max(XZ) + E_max(ZY), up to an additive logarithmic term.

Proof: By Theorem 3.1 the quantities E_max(XY), E_max(XZ), and E_max(ZY) equal max_{u in XY} K(XY|u), max_{v in XZ} K(XZ|v), and max_{w in ZY} K(ZY|w), respectively, with equalities up to a logarithmic additive term; here u, v, w are the elements for which the respective maxima are reached. Assume that u is an element of X, the case of an element of Y being symmetrical, and let z be some element of Z. Then K(XY|u) is bounded by K(XZ|v) + K(ZY|w) through a chain of four (in)equalities: the first inequality follows from general properties of prefix complexity; the second by the obvious subadditive property of K; the third since in the first term the maximum is reached for v, while in the second term we may take any element of Z in the conditional; and the fourth by dropping an item from the conditional in the second term, moving z from the conditional to the main argument, and observing that both maxima are then attained. The theorem follows, with the (in)equalities holding up to an additive logarithmic term.
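The metric (in)equalities of Theorem 4.1 concern the ideal quantity E_max and hold up to logarithmic additive terms; with a real compressor standing in for K they can only be spot-checked approximately. The sketch below reuses the compression-based stand-in for E_max from the earlier sketch (again an assumption of this note, not a construction from the paper) to check symmetry and the triangle inequality on toy lists.

    import bz2
    import itertools

    def C(data: bytes) -> int:
        return len(bz2.compress(data))

    def e_max(X: list[bytes]) -> int:
        # Compression-based stand-in for E_max(X): C(X) - min_{x in X} C(x),
        # with the list compressed in canonical (length-increasing) order.
        cX = C(b"".join(sorted(X, key=lambda s: (len(s), s))))
        return cX - min(C(x) for x in X)

    X = [b"alpha " * 40, b"alpha beta " * 40]
    Y = [b"beta gamma " * 40]
    Z = [b"beta " * 40]

    # Symmetry: the value depends only on which elements occur, because the
    # elements are put into a canonical order before compression.
    assert all(e_max(list(p)) == e_max(X) for p in itertools.permutations(X))

    # Triangle inequality E_max(XY) <= E_max(XZ) + E_max(ZY); for real
    # compressors this is expected to hold only approximately.
    print(e_max(X + Y), "<=", e_max(X + Z) + e_max(Z + Y))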
V. UNIVERSALITY

Let D be a distance function on nonempty finite lists of finite binary strings. A priori we allow asymmetric distances. We would like to exclude degenerate distance measures, such as a distance that takes the same small value for every list. For each x, we want only finitely many lists X containing x within any given distance. Exactly how fast we want the number of admitted lists to go to infinity is not important; it is only a matter of scaling. For every distance D we require the following density condition for every x: the sum of 2^{-D(X)}, taken over the lists X that contain x and whose elements are not all equal, is at most 1 (V.1). Thus, for the density condition on D we consider only lists X in which not all elements are equal. Moreover, we consider only distances that are computable in some broad sense.

Definition 5.1: An admissible list distance D(X) is a total, possibly asymmetric, function from the nonempty finite lists of finite binary strings to the nonnegative real numbers that is 0 if all elements of X are equal and greater than 0 otherwise (up to an additive logarithmic term), is upper semicomputable, and satisfies the density requirement (V.1).

Theorem 5.2: The list information distance E_max(X) is admissible, and it is minimal in the sense that for every admissible list distance function D we have E_max(X) <= D(X) up to an additive constant term.

Proof: It is straightforward that E_max(X) is a total real-valued function, is 0 if all elements of X are equal and greater than 0 otherwise (up to the stated additive term), and is upper semicomputable. We verify the density requirement (V.1). For every x, consider lists X of at least two elements, not all equal. For such lists, 2^{-E_max(X)} is at most 2^{-K(X|x)}, and the sum of the latter quantities is bounded by the sum of 2^{-|p|} taken over all programs p for which the reference prefix machine, given x, computes a finite list of at least two elements that are not all equal. This sum is the probability that the reference machine, given x, computes such a list from a program generated bit by bit uniformly at random. Therefore, the sum is at most 1, and E_max satisfies the density requirement (V.1).

We prove minimality. Fix any x. Since D is upper semicomputable, the function f defined by f(X) = 2^{-D(X)} for lists X that contain x and whose elements are not all equal, and f(X) = 0 otherwise, is lower semicomputable, and by the density condition it sums to at most 1. By the conditional version of Eq. (A2) in the Appendix, that is, [28, Theorem 4.3.2], f is dominated up to a multiplicative constant, positive and depending on D only, by the universal conditional semimeasure. By the conditional version of Eq. (A3) in the Appendix, that is, [28, Theorem 4.3.4], this gives K(X|x) <= D(X) up to an additive constant for every such list X. Altogether, for every admissible distance D, every x, and every list X containing x whose elements are not all equal, there is a constant c depending only on D such that K(X|x) <= D(X) + c. Hence E_max(X) <= D(X) up to an additive constant term.

VI. ADDITIVITY

Theorem 6.1: E_max is not subadditive: neither of the two candidate (in)equalities relating E_max(XY) to E_max(X) and E_max(Y), taken up to logarithmic additive terms, holds for all lists.

Proof: Below, all (in)equalities are taken up to logarithmic additive terms. For the one direction, let x and y be suitably chosen strings of length n, with ε denoting the empty word; computing the quantities concerned shows that the inequality fails. For the other direction, let x and y be strings of length n whose conditional complexities with respect to each other are maximal; again the inequality fails. Note that a form of subadditivity does hold for lists of singleton elements, where the equality involved holds up to a logarithmic additive term and the inequality up to an additive constant term.

VII. MINIMAL OVERLAP

Let p be a shortest program converting x to y. Naively we expect that a shortest program that maps x to y contains the information about y that is lacking in x. However, this is too simple, because different short programs mapping x to y may have different properties. For example, suppose x and y are strings of length n that are random relative to one another. Let one program ignore its input and print y. Let a second program map x to y by adding the fixed string x XOR y bitwise to its input, where XOR denotes bitwise addition modulo 2. Then the two programs have nothing in common.
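The two conversion programs just described are easy to make concrete. In the sketch below, representing the "programs" as Python functions with stored data is an illustrative device of this note, not the formalism of the paper: one converter stores y outright, the other stores only the bitwise difference of x and y, both map x to y, and for x random relative to y the two stored strings share essentially no information.

    import os

    n = 32                 # length in bytes for this example
    x = os.urandom(n)      # stand-ins for strings that are random relative to each other
    y = os.urandom(n)

    # Converter 1: ignores its input and prints y.
    stored_1 = y
    def convert_1(inp: bytes) -> bytes:
        return stored_1

    # Converter 2: adds the fixed difference x XOR y to its input, bitwise modulo 2.
    stored_2 = bytes(a ^ b for a, b in zip(x, y))
    def convert_2(inp: bytes) -> bytes:
        return bytes(a ^ b for a, b in zip(inp, stored_2))

    assert convert_1(x) == y and convert_2(x) == y
    # stored_1 = y and stored_2 = x XOR y: although both converters map x to y,
    # these two stored strings are independent when x is random relative to y.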
Now let x and y be arbitrary strings of length at most n. Muchnik [29], see also the textbook [28, Theorem 8.3.7], shows that there exists a shortest program p that converts x to y, such that p is simple with respect to y and therefore depends little on the origin x, that is, K(p|y) is of logarithmic order. This is a fundamental coding property for individual strings that parallels related results about random variables known as the Slepian-Wolf and Csiszár-Körner-Marton theorems [11].

Theorem 7.1: Let X be a list of binary strings of length at most n. For every x in X there exists a string p of length K(X|x) such that U(x, p) = X and K(p|X) is of logarithmic order.

Proof: Muchnik's theorem as stated before gives a code for the target string when the origin string is known. There, we assumed that both strings have length at most n. The proof in [29] does not use any assumption about the origin string. Hence we can extend the result to information distance in finite lists as follows. Suppose we encode the constituent list elements of X self-delimitingly in altogether a certain number of bits; now X takes the position of the single target string, and we consider strings of length at most that number. Substituting an element x in X for the origin string, the theorem above follows straightforwardly from Muchnik's original theorem about two strings of bounded length. The code p is not uniquely determined: there are cases in which two different strings can each serve as the code, while these two strings have no mutual information at all.

Corollary 7.2: For every string x in X there is a program p converting x to X that is both of minimal length and simple with respect to X, with the equalities involved holding up to an additive logarithmic term.

VIII. NORMALIZED LIST INFORMATION DISTANCE

The quantitative difference in a certain feature between many objects can be considered as an admissible distance, provided it is upper semicomputable and satisfies the density condition (V.1). Theorem 5.2 shows that E_max is universal in that among all admissible list distances it is always least. That is, it accounts for the dominant feature in which the elements of the given list are alike. Many admissible distances are absolute, but if we want to express similarity, then we are more interested in relative ones. For example, if two very long strings have information distance 1000 bits, then we are inclined to think that those strings are relatively similar. But if two strings of 1000 bits have information distance 1000 bits, then we consider them very different.
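For a concrete sense of what a normalized, compression-based list distance can look like, the sketch below normalizes the earlier compression-based stand-in for E_max(X). The particular denominator used here (the largest compressed length of the list with one element removed) is an illustrative choice of this note and should not be read as the normalizing factor analyzed in Section VIII.

    import bz2

    def C(data: bytes) -> int:
        return len(bz2.compress(data))

    def c_list(X: list[bytes]) -> int:
        # Compressed length of the list in canonical (length-increasing
        # lexicographic) order, as a stand-in for K(X).
        return C(b"".join(sorted(X, key=lambda s: (len(s), s))))

    def ncd_list(X: list[bytes]) -> float:
        # Numerator: compression-based stand-in for E_max(X).
        # Denominator: one possible normalizer, the maximum over x of the
        # compressed length of X with x removed; an illustrative choice only.
        numerator = c_list(X) - min(C(x) for x in X)
        denominator = max(c_list(X[:i] + X[i + 1:]) for i in range(len(X)))
        return numerator / denominator

    docs = [b"breaking news: the river flooded the old town " * 30,
            b"news report: flooding hit the old town by the river " * 30,
            b"sports: the home team won the cup final " * 30]
    print(ncd_list(docs))   # closer to 0 for lists of highly similar documents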