ESAIM: Probability and Statistics

THEORY OF CLASSIFICATION: A SURVEY OF SOME RECENT ADVANCES

Stéphane Boucheron 1, Olivier Bousquet 2 and Gábor Lugosi 3

Abstract. The last few years have witnessed important new developments in the theory and practice of pattern classification. We intend to survey some of the main new ideas that have led to these recent results.

Résumé. The practice and theory of pattern recognition have seen important developments in recent years. This survey aims to present some of the new ideas that have led to these developments.

1991 Mathematics Subject Classification. 62G08, 60E15, 68Q32.

Keywords and phrases: Pattern Recognition, Statistical Learning Theory, Concentration Inequalities, Empirical Processes, Model Selection.

September 23, 2005.

The authors acknowledge support by the PASCAL Network of Excellence under EC grant no. The work of the third author was supported by the Spanish Ministry of Science and Technology and FEDER, grant BMF.

1 Laboratoire Probabilités et Modèles Aléatoires, CNRS & Université Paris VII, Paris, France, www.proba.jussieu.fr/~boucheron
2 Pertinence SA, 32 rue des Jeûneurs, Paris, France
3 Department of Economics, Pompeu Fabra University, Ramon Trias Fargas 25-27, Barcelona, Spain

© EDP Sciences, SMAI 1999

Contents

1. Introduction
2. Basic model
3. Empirical risk minimization and Rademacher averages
4. Minimizing cost functions: some basic ideas behind boosting and support vector machines
4.1. Margin-based performance bounds
4.2. Convex cost functionals
5. Tighter bounds for empirical risk minimization
5.1. Relative deviations
5.2. Noise and fast rates
5.3. Localization
5.4. Cost functions
5.5. Minimax lower bounds
6. PAC-Bayesian bounds
7. Stability
8. Model selection
8.1. Oracle inequalities
8.2. A glimpse at model selection methods
8.3. Naive penalization
8.4. Ideal penalties
8.5. Localized Rademacher complexities
8.6. Pre-testing
8.7. Revisiting hold-out estimates
References

1. Introduction

The last few years have witnessed important new developments in the theory and practice of pattern classification. The introduction of new and effective techniques for handling high-dimensional problems, such as boosting and support vector machines, has revolutionized the practice of pattern recognition. At the same time, a better understanding of the application of empirical process theory and concentration inequalities has led to effective new ways of studying these methods and provided a statistical explanation for their success. These new tools have also helped develop new model selection methods that are at the heart of many classification algorithms.

The purpose of this survey is to offer an overview of some of these theoretical tools and give the main ideas of the analysis of some of the important algorithms. This survey does not attempt to be exhaustive. The selection of the topics is largely biased by the personal taste of the authors. We also limit ourselves to describing the key ideas in a simple way, often sacrificing generality. In these cases the reader is pointed to the references for the sharpest and more general results available. References and bibliographical remarks are given at the end of each section, in an attempt to avoid interruptions in the arguments.

2. Basic model

The problem of pattern classification is about guessing or predicting the unknown class of an observation. An observation is often a collection of numerical and/or categorical measurements represented by a $d$-dimensional vector $x$, but in some cases it may even be a curve or an image. In our model we simply assume that $x \in \mathcal{X}$ where $\mathcal{X}$ is some abstract measurable space equipped with a σ-algebra. The unknown nature of the observation is called a class. It is denoted by $y$ and in the simplest case takes values in the binary set $\{-1, 1\}$. In these notes we restrict our attention to binary classification. The reason is simplicity and that the binary problem already captures many of the main features of more general problems. Even though there is much to say about multiclass classification, this survey does not cover this growing field of research.

In classification, one creates a function $g : \mathcal{X} \to \{-1, 1\}$ which represents one's guess of $y$ given $x$. The mapping $g$ is called a classifier. The classifier errs on $x$ if $g(x) \neq y$.

To formalize the learning problem, we introduce a probabilistic setting, and let $(X, Y)$ be an $\mathcal{X} \times \{-1, 1\}$-valued random pair, modeling an observation and its corresponding class. The distribution of the random pair $(X, Y)$ may be described by the probability distribution of $X$ (given by the probabilities $\mathbb{P}\{X \in A\}$ for all measurable subsets $A$ of $\mathcal{X}$) and $\eta(x) = \mathbb{P}\{Y = 1 \mid X = x\}$. The function $\eta$ is called the a posteriori probability. We measure the performance of classifier $g$ by its probability of error
$$ L(g) = \mathbb{P}\{g(X) \neq Y\} . $$
Given $\eta$, one may easily construct a classifier with minimal probability of error. In particular, it is easy to see that if we define
$$ g^*(x) = \begin{cases} 1 & \text{if } \eta(x) \geq 1/2 \\ -1 & \text{otherwise,} \end{cases} $$
then $L(g^*) \leq L(g)$ for any classifier $g$. The minimal risk $L^* \stackrel{\mathrm{def}}{=} L(g^*)$ is called the Bayes risk (or Bayes error). More precisely, it is immediate to see that
$$ L(g) - L^* = \mathbb{E}\left[ \mathbf{1}_{\{g(X) \neq g^*(X)\}} \, |2\eta(X) - 1| \right] \geq 0 \qquad (1) $$
(see, e.g., [72]). The optimal classifier $g^*$ is often called the Bayes classifier.
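To make the identity (1) concrete, the following minimal sketch (Python; the regression function $\eta$, the second classifier, and the sample size are arbitrary illustrative choices, not taken from the survey) compares a Monte Carlo estimate of the excess risk $L(g) - L^*$ with a Monte Carlo estimate of the right-hand side of (1).

```python
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    # an arbitrary a posteriori probability P{Y = 1 | X = x} on X = [0, 1]
    return 0.5 + 0.4 * np.sin(2 * np.pi * x)

n = 1_000_000
X = rng.uniform(0.0, 1.0, size=n)
Y = np.where(rng.uniform(size=n) < eta(X), 1, -1)

g_star = np.where(eta(X) >= 0.5, 1, -1)   # Bayes classifier g*
g = np.where(X >= 0.5, 1, -1)             # some other (suboptimal) classifier g

L_star = np.mean(g_star != Y)             # Monte Carlo estimate of L* = L(g*)
L_g = np.mean(g != Y)                     # Monte Carlo estimate of L(g)

# Right-hand side of (1): E[ 1{g(X) != g*(X)} |2 eta(X) - 1| ]
rhs = np.mean((g != g_star) * np.abs(2 * eta(X) - 1))

print(f"L(g) - L* ≈ {L_g - L_star:.4f}   E[1{{g≠g*}}|2η-1|] ≈ {rhs:.4f}")
```

With a large sample the two printed numbers agree up to Monte Carlo error, which is exactly what identity (1) claims.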
In the statistical model we focus on, one has access to a collection of data $(X_i, Y_i)$, $1 \leq i \leq n$. We assume that the data $D_n$ consist of a sequence of independent, identically distributed (i.i.d.) random pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ with the same distribution as that of $(X, Y)$. A classifier is constructed on the basis of $D_n = (X_1, Y_1, \ldots, X_n, Y_n)$ and is denoted by $g_n$. Thus, the value of $Y$ is guessed by $g_n(X) = g_n(X; X_1, Y_1, \ldots, X_n, Y_n)$. The performance of $g_n$ is measured by its (conditional) probability of error
$$ L(g_n) = \mathbb{P}\{g_n(X) \neq Y \mid D_n\} . $$
The focus of the theory (and practice) of classification is to construct classifiers $g_n$ whose probability of error is as close to $L^*$ as possible.

Obviously, the whole arsenal of traditional parametric and nonparametric statistics may be used to attack this problem. However, the high-dimensional nature of many of the new applications (such as image recognition, text classification, micro-biological applications, etc.) leads to territories beyond the reach of traditional methods. Most new advances of statistical learning theory aim to face these new challenges.

Bibliographical remarks. Several textbooks, surveys, and research monographs have been written on pattern classification and statistical learning theory. A partial list includes Fukunaga [97], Duda and Hart [77], Vapnik and Chervonenkis [233], Devijver and Kittler [70], Vapnik [229, 230], Breiman, Friedman, Olshen, and Stone [53], Natarajan [175], McLachlan [169], Anthony and Biggs [10], Kearns and Vazirani [117], Devroye, Györfi, and Lugosi [72], Ripley [185], Vidyasagar [235], Kulkarni, Lugosi, and Venkatesh [128], Anthony and Bartlett [9], Duda, Hart, and Stork [78], Lugosi [144], and Mendelson [171].

3. Empirical risk minimization and Rademacher averages

A simple and natural approach to the classification problem is to consider a class $\mathcal{C}$ of classifiers $g : \mathcal{X} \to \{-1, 1\}$ and use data-based estimates of the probabilities of error $L(g)$ to select a classifier from the class. The most natural choice to estimate the probability of error $L(g) = \mathbb{P}\{g(X) \neq Y\}$ is the error count
$$ L_n(g) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{g(X_i) \neq Y_i\}} . $$
$L_n(g)$ is called the empirical error of the classifier $g$.

First we outline the basics of the theory of empirical risk minimization (i.e., the classification analog of M-estimation). Denote by $g_n^*$ the classifier that minimizes the estimated probability of error over the class:
$$ L_n(g_n^*) \leq L_n(g) \quad \text{for all } g \in \mathcal{C} . $$
Then the probability of error $L(g_n^*) = \mathbb{P}\{g_n^*(X) \neq Y \mid D_n\}$ of the selected rule is easily seen to satisfy the elementary inequalities
$$ L(g_n^*) - \inf_{g \in \mathcal{C}} L(g) \leq 2 \sup_{g \in \mathcal{C}} |L_n(g) - L(g)| , \qquad (2) $$
$$ L(g_n^*) \leq L_n(g_n^*) + \sup_{g \in \mathcal{C}} |L_n(g) - L(g)| . $$
We see that by guaranteeing that the uniform deviation $\sup_{g \in \mathcal{C}} |L_n(g) - L(g)|$ of estimated probabilities from their true values is small, we make sure that the probability of error of the selected classifier $g_n^*$ is not much larger than the best probability of error in the class $\mathcal{C}$, and at the same time the empirical estimate $L_n(g_n^*)$ is also good. It is important to note at this point that bounding the excess risk by the maximal deviation as in (2) is quite loose in many situations. In Section 5 we survey some ways of obtaining improved bounds. On the other hand, the simple inequality above offers a convenient way of understanding some of the basic principles, and it is even sharp in a certain minimax sense, see Section 5.5.
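As a small illustration of empirical risk minimization and of the role of the uniform deviation in (2), the sketch below (Python; the finite class of threshold classifiers and the data distribution are hypothetical choices of ours) selects the empirical risk minimizer $g_n^*$ from a finite class and compares its excess risk within the class to twice the uniform deviation, with the true risks approximated on a large independent sample.

```python
import numpy as np

rng = np.random.default_rng(1)

def eta(x):
    return 1.0 / (1.0 + np.exp(-4.0 * (x - 0.3)))   # illustrative P{Y=1 | X=x}

def sample(n):
    X = rng.uniform(0.0, 1.0, size=n)
    Y = np.where(rng.uniform(size=n) < eta(X), 1, -1)
    return X, Y

def predict(t, X):
    # "stump" classifier g_t(x) = sign(x - t)
    return np.where(X >= t, 1, -1)

# A finite class C of stump classifiers over a grid of thresholds.
thresholds = np.linspace(0.0, 1.0, 21)

n = 500
X, Y = sample(n)
emp_risk = np.array([np.mean(predict(t, X) != Y) for t in thresholds])     # L_n(g)

# True risks approximated on a large independent sample (standing in for L(g)).
Xt, Yt = sample(200_000)
true_risk = np.array([np.mean(predict(t, Xt) != Yt) for t in thresholds])  # ~ L(g)

g_hat = np.argmin(emp_risk)                            # empirical risk minimizer g_n*
excess = true_risk[g_hat] - true_risk.min()            # L(g_n*) - inf_C L(g)
uniform_dev = np.max(np.abs(emp_risk - true_risk))     # sup_C |L_n(g) - L(g)|

print(f"excess risk ≈ {excess:.4f}  <=  2 * uniform deviation ≈ {2 * uniform_dev:.4f}")
```

Because $g_n^*$ minimizes the empirical risk, the printed comparison is exactly the algebra behind (2), applied here with the true risks replaced by their hold-out estimates.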
Clearly, the random variable $n L_n(g)$ is binomially distributed with parameters $n$ and $L(g)$. Thus, to obtain bounds for the success of empirical error minimization, we need to study uniform deviations of binomial random variables from their means. We formulate the problem in a somewhat more general way as follows. Let $X_1, \ldots, X_n$ be independent, identically distributed random variables taking values in some set $\mathcal{X}$ and let $\mathcal{F}$ be a class of bounded functions $\mathcal{X} \to [-1, 1]$. Denoting expectation and empirical averages by $Pf = \mathbb{E} f(X_1)$ and $P_n f = (1/n) \sum_{i=1}^n f(X_i)$, we are interested in upper bounds for the maximal deviation
$$ \sup_{f \in \mathcal{F}} (Pf - P_n f) . $$
Concentration inequalities are among the basic tools in studying such deviations. The simplest, yet quite powerful, exponential concentration inequality is the bounded differences inequality.

Theorem 3.1 (bounded differences inequality). Let $g : \mathcal{X}^n \to \mathbb{R}$ be a function of $n$ variables such that for some nonnegative constants $c_1, \ldots, c_n$,
$$ \sup_{x_1, \ldots, x_n, \, x_i' \in \mathcal{X}} \left| g(x_1, \ldots, x_n) - g(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n) \right| \leq c_i , \quad 1 \leq i \leq n . $$
Let $X_1, \ldots, X_n$ be independent random variables. Then the random variable $Z = g(X_1, \ldots, X_n)$ satisfies
$$ \mathbb{P}\{ |Z - \mathbb{E} Z| > t \} \leq 2 e^{-2t^2 / C} $$
where $C = \sum_{i=1}^n c_i^2$.

The bounded differences assumption means that if the $i$-th variable of $g$ is changed while keeping all the others fixed, the value of the function cannot change by more than $c_i$. Our main example for such a function is
$$ Z = \sup_{f \in \mathcal{F}} (Pf - P_n f) . $$
Obviously, $Z$ satisfies the bounded differences assumption with $c_i = 2/n$ and therefore, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$,
$$ \sup_{f \in \mathcal{F}} (Pf - P_n f) \leq \mathbb{E} \sup_{f \in \mathcal{F}} (Pf - P_n f) + \sqrt{\frac{2 \log \frac{1}{\delta}}{n}} . \qquad (3) $$
This concentration result allows us to focus on the expected value, which can be bounded conveniently by a simple symmetrization device. Introduce a ghost sample $X_1', \ldots, X_n'$, independent of the $X_i$ and distributed identically. If $P_n' f = (1/n) \sum_{i=1}^n f(X_i')$ denotes the empirical averages measured on the ghost sample, then by Jensen's inequality,
$$ \mathbb{E} \sup_{f \in \mathcal{F}} (Pf - P_n f) = \mathbb{E} \sup_{f \in \mathcal{F}} \mathbb{E}\left[ P_n' f - P_n f \mid X_1, \ldots, X_n \right] \leq \mathbb{E} \sup_{f \in \mathcal{F}} (P_n' f - P_n f) . $$
Let now $\sigma_1, \ldots, \sigma_n$ be independent (Rademacher) random variables with $\mathbb{P}\{\sigma_i = 1\} = \mathbb{P}\{\sigma_i = -1\} = 1/2$, independent of the $X_i$ and $X_i'$. Then
$$ \mathbb{E} \sup_{f \in \mathcal{F}} (P_n' f - P_n f) = \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \left( f(X_i') - f(X_i) \right) = \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i \left( f(X_i') - f(X_i) \right) \leq 2 \, \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i) . $$
Let $A \subset \mathbb{R}^n$ be a bounded set of vectors $a = (a_1, \ldots, a_n)$, and introduce the quantity
$$ R_n(A) = \mathbb{E} \sup_{a \in A} \frac{1}{n} \sum_{i=1}^n \sigma_i a_i . $$
$R_n(A)$ is called the Rademacher average associated with $A$. For a given sequence $x_1, \ldots, x_n \in \mathcal{X}$, we write $\mathcal{F}(x_1^n)$ for the class of $n$-vectors $(f(x_1), \ldots, f(x_n))$ with $f \in \mathcal{F}$. Thus, using this notation, we have deduced the following.

Theorem 3.2. With probability at least $1 - \delta$,
$$ \sup_{f \in \mathcal{F}} (Pf - P_n f) \leq 2 \, \mathbb{E} R_n(\mathcal{F}(X_1^n)) + \sqrt{\frac{2 \log \frac{1}{\delta}}{n}} . $$
We also have
$$ \sup_{f \in \mathcal{F}} (Pf - P_n f) \leq 2 \, R_n(\mathcal{F}(X_1^n)) + \sqrt{\frac{2 \log \frac{2}{\delta}}{n}} . $$

The second statement follows simply by noticing that the random variable $R_n(\mathcal{F}(X_1^n))$ satisfies the conditions of the bounded differences inequality. The second inequality is our first data-dependent performance bound. It involves the Rademacher average of the coordinate projection of $\mathcal{F}$ given by the data $X_1, \ldots, X_n$. Given the data, one may compute the Rademacher average, for example, by Monte Carlo integration. Note that for a given choice of the random signs $\sigma_1, \ldots, \sigma_n$, the computation of $\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i)$ is equivalent to minimizing $-\sum_{i=1}^n \sigma_i f(X_i)$ over $f \in \mathcal{F}$ and therefore it is computationally equivalent to empirical risk minimization.
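Since the empirical Rademacher average can be computed by Monte Carlo integration, here is a minimal sketch (Python; the finite class of threshold classifiers, the random labels, and the number of sign draws are illustrative assumptions of ours) that estimates $R_n(\mathcal{F}(X_1^n))$ for the loss class $f(x, y) = \mathbf{1}_{\{g(x) \neq y\}}$ associated with a small family of classifiers, and evaluates the data-dependent bound of Theorem 3.2.

```python
import numpy as np

rng = np.random.default_rng(2)

# Data and a finite class of threshold classifiers g_t(x) = sign(x - t).
n = 200
X = rng.uniform(0.0, 1.0, size=n)
Y = rng.choice([-1, 1], size=n)
thresholds = np.linspace(0.0, 1.0, 41)

# Coordinate projection F(X_1^n): each row is (f(X_1), ..., f(X_n)) for one f in F,
# where f(x, y) = 1{g_t(x) != y} is the loss function associated with g_t.
F_proj = np.array([(np.where(X >= t, 1, -1) != Y).astype(float) for t in thresholds])

def empirical_rademacher(F_proj, n_mc=2000):
    """Monte Carlo estimate of R_n(A) = E sup_{a in A} (1/n) sum_i sigma_i a_i."""
    n = F_proj.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))      # random Rademacher signs
    # For each draw of signs, take the supremum over the (finite) class.
    sups = np.max(sigma @ F_proj.T / n, axis=1)
    return sups.mean()

R_hat = empirical_rademacher(F_proj)
delta = 0.05
bound = 2 * R_hat + np.sqrt(2 * np.log(2 / delta) / n)   # second bound of Theorem 3.2
print(f"estimated R_n ≈ {R_hat:.4f}, bound on sup_f (Pf - P_n f) ≈ {bound:.4f}")
```

For a class richer than a finite grid of thresholds, the supremum over $f \in \mathcal{F}$ inside the Monte Carlo loop would itself have to be computed by an empirical-risk-minimization type search, as the text points out.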
$R_n(\mathcal{F}(X_1^n))$ measures the richness of the class $\mathcal{F}$ and provides a sharp estimate for the maximal deviations. In fact, one may prove that
$$ \frac{1}{2} \, \mathbb{E} R_n(\mathcal{F}(X_1^n)) - \frac{1}{2\sqrt{n}} \leq \mathbb{E} \sup_{f \in \mathcal{F}} |Pf - P_n f| \leq 2 \, \mathbb{E} R_n(\mathcal{F}(X_1^n)) $$
(see, e.g., van der Vaart and Wellner [227]). Next we recall some of the simple structural properties of Rademacher averages.

Theorem 3.3 (properties of Rademacher averages). Let $A, B$ be bounded subsets of $\mathbb{R}^n$ and let $c \in \mathbb{R}$ be a constant. Then
$$ R_n(A \cup B) \leq R_n(A) + R_n(B) , \quad R_n(c \cdot A) = |c| R_n(A) , \quad R_n(A \oplus B) \leq R_n(A) + R_n(B) , $$
where $c \cdot A = \{ ca : a \in A \}$ and $A \oplus B = \{ a + b : a \in A, b \in B \}$. Moreover, if $A = \{ a^{(1)}, \ldots, a^{(N)} \} \subset \mathbb{R}^n$ is a finite set, then
$$ R_n(A) \leq \frac{\max_{j=1,\ldots,N} \| a^{(j)} \| \, \sqrt{2 \log N}}{n} , \qquad (4) $$
where $\|\cdot\|$ denotes the Euclidean norm. If $\mathrm{absconv}(A) = \left\{ \sum_{j=1}^N c_j a^{(j)} : N \in \mathbb{N}, \ \sum_{j=1}^N |c_j| \leq 1, \ a^{(j)} \in A \right\}$ is the absolute convex hull of $A$, then
$$ R_n(A) = R_n(\mathrm{absconv}(A)) . \qquad (5) $$
Finally, the contraction principle states that if $\phi : \mathbb{R} \to \mathbb{R}$ is a function with $\phi(0) = 0$ and Lipschitz constant $L_\phi$, and $\phi \circ A$ is the set of vectors of the form $(\phi(a_1), \ldots, \phi(a_n)) \in \mathbb{R}^n$ with $a \in A$, then $R_n(\phi \circ A) \leq L_\phi R_n(A)$.

Proof. The first three properties are immediate from the definition. Inequality (4) follows by Hoeffding's inequality, which states that if $X$ is a bounded zero-mean random variable taking values in an interval $[\alpha, \beta]$, then for any $s > 0$, $\mathbb{E} \exp(sX) \leq \exp\left( s^2 (\beta - \alpha)^2 / 8 \right)$. In particular, by independence,
$$ \mathbb{E} \exp\left( s \frac{1}{n} \sum_{i=1}^n \sigma_i a_i \right) = \prod_{i=1}^n \mathbb{E} \exp\left( \frac{s}{n} \sigma_i a_i \right) \leq \prod_{i=1}^n \exp\left( \frac{s^2 a_i^2}{2 n^2} \right) = \exp\left( \frac{s^2 \|a\|^2}{2 n^2} \right) . $$
This implies that
$$ e^{s R_n(A)} \leq \mathbb{E} \exp\left( s \max_{j=1,\ldots,N} \frac{1}{n} \sum_{i=1}^n \sigma_i a_i^{(j)} \right) \leq \sum_{j=1}^N \mathbb{E} \exp\left( s \frac{1}{n} \sum_{i=1}^n \sigma_i a_i^{(j)} \right) \leq N \max_{j=1,\ldots,N} \exp\left( \frac{s^2 \| a^{(j)} \|^2}{2 n^2} \right) . $$
Taking the logarithm of both sides, dividing by $s$, and choosing $s$ to minimize the obtained upper bound for $R_n(A)$, we arrive at (4). The identity (5) is easily seen from the definition. For a proof of the contraction principle, see Ledoux and Talagrand [133].

Often it is useful to derive further upper bounds on Rademacher averages. As an illustration, we consider the case when $\mathcal{F}$ is a class of indicator functions. Recall that this is the case in our motivating example in the classification problem described above, when each $f \in \mathcal{F}$ is the indicator function of a set of the form $\{(x, y) : g(x) \neq y\}$. In such a case, for any collection of points $x_1^n = (x_1, \ldots, x_n)$, $\mathcal{F}(x_1^n)$ is a finite subset of $\mathbb{R}^n$ whose cardinality is denoted by $S_{\mathcal{F}}(x_1^n)$ and is called the VC shatter coefficient (where VC stands for Vapnik-Chervonenkis). Obviously, $S_{\mathcal{F}}(x_1^n) \leq 2^n$. By inequality (4), we have, for all $x_1^n$,
$$ R_n(\mathcal{F}(x_1^n)) \leq \sqrt{\frac{2 \log S_{\mathcal{F}}(x_1^n)}{n}} , \qquad (6) $$
where we used the fact that for each $f \in \mathcal{F}$, $\sum_{i=1}^n f(x_i)^2 \leq n$. In particular,
$$ \mathbb{E} \sup_{f \in \mathcal{F}} (Pf - P_n f) \leq 2 \, \mathbb{E} \sqrt{\frac{2 \log S_{\mathcal{F}}(X_1^n)}{n}} . $$
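To see how the shatter-coefficient bound (6) behaves in a simple case, one can compare it with a direct Monte Carlo estimate of the Rademacher average. The sketch below (a Python illustration of ours, not taken from the survey) does this for the class of half-line indicators $f_t(x) = \mathbf{1}_{\{x \geq t\}}$, whose coordinate projection on $n$ distinct points contains at most $n + 1$ vectors.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Coordinate projection F(x_1^n) of {f_t(x) = 1{x >= t} : t in R} on n distinct sorted
# points: only the n + 1 "step" vectors (1,...,1), (0,1,...,1), ..., (0,...,0) appear.
F_proj = np.array([np.concatenate([np.zeros(k), np.ones(n - k)]) for k in range(n + 1)])
S = len(np.unique(F_proj, axis=0))            # shatter coefficient on this sample (= n + 1)

# Monte Carlo estimate of R_n(F(x_1^n)).
n_mc = 2000
sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))
R_hat = np.max(sigma @ F_proj.T / n, axis=1).mean()

bound = np.sqrt(2 * np.log(S) / n)            # inequality (6)
print(f"S_F = {S}, Monte Carlo R_n ≈ {R_hat:.4f}, bound (6) ≈ {bound:.4f}")
```

The gap between the two printed numbers reflects the logarithmic factor that the more refined chaining analysis mentioned below is able to remove.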
The logarithm of the VC shatter coefficient may be upper bounded in terms of a combinatorial quantity, called the VC dimension. If $A \subset \{-1, 1\}^n$, then the VC dimension of $A$ is the size $V$ of the largest set of indices $\{i_1, \ldots, i_V\} \subset \{1, \ldots, n\}$ such that for each binary $V$-vector $b = (b_1, \ldots, b_V) \in \{-1, 1\}^V$ there exists an $a = (a_1, \ldots, a_n) \in A$ such that $(a_{i_1}, \ldots, a_{i_V}) = b$. The key inequality establishing a relationship between shatter coefficients and VC dimension is known as Sauer's lemma. It states that the cardinality of any set $A \subset \{-1, 1\}^n$ may be upper bounded as
$$ |A| \leq \sum_{i=0}^{V} \binom{n}{i} \leq (n + 1)^V , $$
where $V$ is the VC dimension of $A$. In particular,
$$ \log S_{\mathcal{F}}(x_1^n) \leq V(x_1^n) \log(n + 1) , $$
where we denote by $V(x_1^n)$ the VC dimension of $\mathcal{F}(x_1^n)$. Thus, the expected maximal deviation $\mathbb{E} \sup_{f \in \mathcal{F}} (Pf - P_n f)$ may be upper bounded by $2 \, \mathbb{E} \sqrt{2 V(X_1^n) \log(n + 1) / n}$. To obtain distribution-free upper bounds, introduce the VC dimension of a class of binary functions $\mathcal{F}$, defined by
$$ V = \sup_{n, \, x_1^n} V(x_1^n) . $$
Then we obtain the following version of what has been known as the Vapnik-Chervonenkis inequality:

Theorem 3.4 (Vapnik-Chervonenkis inequality). For all distributions one has
$$ \mathbb{E} \sup_{f \in \mathcal{F}} (Pf - P_n f) \leq 2 \sqrt{\frac{2 V \log(n + 1)}{n}} . $$
Also, for a universal constant $C$,
$$ \mathbb{E} \sup_{f \in \mathcal{F}} (Pf - P_n f) \leq C \sqrt{\frac{V}{n}} . $$

The second inequality, which allows one to remove the logarithmic factor, follows from a somewhat refined analysis (called chaining). The VC dimension is an important combinatorial parameter of the class and many of its properties are well known. Here we just recall one useful result and refer the reader to the references for further study: let $\mathcal{G}$ be an $m$-dimensional vector space of real-valued functions defined on $\mathcal{X}$. Then the class of indicator functions
$$ \mathcal{F} = \left\{ f(x) = \mathbf{1}_{\{g(x) \geq 0\}} : g \in \mathcal{G} \right\} $$
has VC dimension $V \leq m$.
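The next sketch (again a Python illustration of ours, not part of the survey) checks Sauer's lemma numerically for the class of indicators of intervals of the real line, which has VC dimension $V = 2$, and evaluates the first bound of Theorem 3.4 for this class.

```python
import numpy as np
from math import comb

# Class F: indicators of intervals [a, b] of the real line; its VC dimension is V = 2.
# On n distinct points x_1 < ... < x_n, the coordinate projections of F are exactly the
# 0/1 vectors with a contiguous block of ones (plus the all-zeros vector).
n = 100
idx = np.arange(n)
proj = {tuple(((idx >= i) & (idx < j)).tolist()) for i in range(n + 1) for j in range(i, n + 1)}

S = len(proj)                                    # shatter coefficient S_F(x_1^n)
V = 2
sauer = sum(comb(n, i) for i in range(V + 1))    # Sauer's lemma: S <= sum_{i <= V} C(n, i)
crude = (n + 1) ** V                             # ... <= (n + 1)^V

vc_bound = 2 * np.sqrt(2 * V * np.log(n + 1) / n)   # first bound of Theorem 3.4
print(f"S_F = {S} <= {sauer} <= {crude}")
print(f"VC bound on E sup_f (Pf - P_n f): {vc_bound:.3f}")
```

For intervals Sauer's bound happens to hold with equality, while the cruder $(n + 1)^V$ estimate used to obtain Theorem 3.4 gives away a little more.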
Bibliographical remarks. Uniform deviations of averages from their expectations is one of the central problems of empirical process theory. Here we merely refer to some of the comprehensive coverages, such as Shorack and Wellner [199], Giné [98], van der Vaart and Wellner [227], Vapnik [231], Dudley [83]. The use of empirical processes in classification was pioneered by Vapnik and Chervonenkis [232, 233] and re-discovered 20 years later by Blumer, Ehrenfeucht, Haussler, and Warmuth [41], Ehrenfeucht, Haussler, Kearns, and Valiant [88]. For surveys see Natarajan [175], Devroye [71], Anthony and Biggs [10], Kearns and Vazirani [117], Vapnik [230, 231], Devroye, Györfi, and Lugosi [72], Ripley [185], Vidyasagar [235], Anthony and Bartlett [9].

The bounded differences inequality was first formulated explicitly by McDiarmid [166] (see also the surveys [167]). The martingale methods used by McDiarmid had appeared in early work of Hoeffding [109], Azuma [18], Yurinskii [242, 243], Milman and Schechtman [174]. Closely related concentration results have been obtained in various ways, including information-theoretic methods (see Ahlswede, Gács, and Körner [1], Marton [154], [155], [156], Dembo [69], Massart [158] and Rio [183]), Talagrand's induction method [217], [213], [216] (see also McDiarmid [168], Luczak and McDiarmid [143], Panchenko [ ]), and the so-called entropy method, based on logarithmic Sobolev inequalities, developed by Ledoux [132], [131], see also Bobkov and Ledoux [42], Massart [159], Rio [183], Boucheron, Lugosi, and Massart [45, 46], Bousquet [47], and Boucheron, Bousquet, Lugosi, and Massart [44].

Symmetrization was at the basis of the original arguments of Vapnik and Chervonenkis [232, 233]. We learnt the simple symmetrization trick shown above from Giné and Zinn [99], but different forms of symmetrization have been at the core of obtaining related results of similar flavor, see also Anthony and Shawe-Taylor [11], Cannon, Ettinger, Hush, and Scovel [55], Herbrich and Williamson [108], Mendelson and Philips [172]. The use of Rademacher averages in classification was first promoted by Koltchinskii [124] and Bartlett, Boucheron, and Lugosi [24], see also Koltchinskii and Panchenko [126, 127], Bartlett and Mendelson [29], Bartlett, Bousquet, and Mendelson [25], Bousquet, Koltchinskii, and Panchenko [50], Kégl, Linder, and Lugosi [13], Mendelson [170].

Hoeffding's inequality appears in [109]. For a proof of the contraction principle we refer to Ledoux and Talagrand [133]. Sauer's lemma was proved independently by Sauer [189], Shelah [198], and Vapnik and Chervonenkis [232]. For related combinatorial results we refer to Frankl [90], Haussler [106], Alesker [7], Alon, Ben-David, Cesa-Bianchi, and Haussler [8], Szarek and Talagrand [210], Cesa-Bianchi and Haussler [60], Mendelson and Vershynin [173], [188].

The second inequality of Theorem 3.4 is based on the method of chaining, and was first proved by Dudley [81]. The question of how $\sup_{f \in \mathcal{F}} |P_n f - Pf|$ behaves has been known as the Glivenko-Cantelli problem and much has been said about it. A few key references include Vapnik and Chervonenkis [232, 234], Dudley [79, 81, 82], Talagrand [211, 212, 214, 218], Dudley, Giné, and Zinn [84], Alon, Ben-David, Cesa-Bianchi, and Haussler [8], Li, Long, and Srinivasan [138], Mendelson and Vershynin [173]. The VC dimension has been widely studied and many of its properties are known. We refer to Cover [63], Dudley [80, 83], Steele [204], Wenocur and Dudley [238], Assouad [15], Khovanskii [118], Macintyre and Sontag [149], Goldberg and Jerrum [101], Karpinski and A. Macintyre [114], Koiran and Sontag [121], Anthony and Bartlett [9], and Bartlett and Maass [28].

4. Minimizing cost functions: some basic ideas behind boosting and support vector machines

The results summarized in the previous section reveal that minimizing the empirical risk $L_n(g)$ over a class $\mathcal{C}$ of classifiers with a VC dimension much smaller than the sample size $n$ is guaranteed to work well. This result has two fundamental problems. First, by requiring that the VC dimension be small, one imposes serious limitations on the approximation properties of the class. In particular, even though the difference between the probability of error $L(g_n^*)$ of the empirical risk minimizer and the smallest probability of error $\inf_{g \in \mathcal{C}} L(g)$ in the class is small, $\inf_{g \in \mathcal{C}} L(g) - L^*$ may be very large. The other problem is algorithmic: minimizing the empirical probability of misclassification $L_n(g)$ is very often a computationally difficult problem. Even in seemingly simple cases, for example when $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{C}$ is the class of …