The Annals of Statistics

HIGH DIMENSIONAL GRAPHS AND VARIABLE SELECTION WITH THE LASSO

By Nicolai Meinshausen and Peter Bühlmann

ETH Zürich

The pattern of zero entries in the inverse covariance matrix of a multivariate normal distribution corresponds to conditional independence restrictions between variables. Covariance selection aims at estimating those structural zeros from data. We show that neighborhood selection with the Lasso is a computationally attractive alternative to standard covariance selection for sparse high-dimensional graphs. Neighborhood selection estimates the conditional independence restrictions separately for each node in the graph and is hence equivalent to variable selection for Gaussian linear models. We show that the proposed neighborhood selection scheme is consistent for sparse high-dimensional graphs. Consistency hinges on the choice of the penalty parameter. The oracle value for optimal prediction does not lead to a consistent neighborhood estimate. Controlling instead the probability of falsely joining some distinct connectivity components of the graph, consistent estimation for sparse graphs is achieved (with exponential rates), even when the number of variables grows like the number of observations raised to an arbitrary power.

AMS 2000 subject classifications: Primary 62J07; secondary 62H20, 62F12.
Keywords and phrases: Linear regression, Covariance selection, Gaussian graphical models, Penalized regression.

1. Introduction. Consider the $p$-dimensional multivariate normally distributed random variable $X = (X_1, \ldots, X_p) \sim \mathcal{N}(\mu, \Sigma)$. This includes Gaussian linear models where, for example, $X_1$ is the response variable and $\{X_k;\ 2 \le k \le p\}$ are the predictor variables. Assuming that the covariance matrix $\Sigma$ is non-singular, the conditional independence structure of the distribution can be conveniently represented by a graphical model $G = (\Gamma, E)$, where $\Gamma = \{1, \ldots, p\}$ is the set of nodes and $E$ the set of edges in $\Gamma \times \Gamma$. A pair $(a, b)$ is contained in the edge set $E$ if and only if $X_a$ is conditionally dependent on $X_b$, given all remaining variables $X_{\Gamma \setminus \{a,b\}} = \{X_k;\ k \in \Gamma \setminus \{a, b\}\}$. Every pair of variables not contained in the edge set is conditionally independent, given all remaining variables, and corresponds to a zero entry in the inverse covariance matrix (Lauritzen, 1996).

Covariance selection was introduced by Dempster (1972) and aims at discovering the conditional independence restrictions (the graph) from a set of i.i.d. observations. Covariance selection traditionally relies on the discrete optimization of an objective function (Lauritzen, 1996; Edwards, 2000). Exhaustive search is computationally infeasible for all but very low-dimensional models. Usually, greedy forward or backward search is employed. In forward search, the initial estimate of the edge set is the empty set and edges are then added iteratively until a suitable stopping criterion is fulfilled. The selection (deletion) of a single edge in this search strategy requires an MLE fit (Speed and Kiiveri, 1986) for $O(p^2)$ different models. The procedure is not well suited for high-dimensional graphs. The existence of the MLE is not guaranteed in general if the number of observations is smaller than the number of nodes (Buhl, 1993). More disturbingly, the complexity of the procedure renders even greedy search strategies impractical for modestly sized graphs.
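Before turning to the alternative proposed below, the zero-pattern correspondence from the opening paragraph can be made concrete with a small numerical sketch (ours, not part of the paper). It uses a hypothetical chain graph $1 - 2 - 3 - 4$: the covariance matrix $\Sigma$ itself is dense, yet the non-zero pattern of $\Sigma^{-1}$ recovers exactly the edges.

```python
import numpy as np

# Hypothetical example: precision matrix K of a chain graph 1 - 2 - 3 - 4.
# Only adjacent nodes are conditionally dependent, so K is tridiagonal.
p = 4
K = np.eye(p)
for i in range(p - 1):
    K[i, i + 1] = K[i + 1, i] = 0.4   # one entry per chain edge

assert np.all(np.linalg.eigvalsh(K) > 0)  # K is a valid precision matrix

Sigma = np.linalg.inv(K)  # covariance of X ~ N(0, Sigma)

print(np.round(Sigma, 3))                    # dense: all entries non-zero
print(np.abs(np.linalg.inv(Sigma)) > 1e-9)   # True only on diagonal + chain edges
```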
In contrast, neighborhood selection with the Lasso, proposed in the following, relies on optimization of a convex function, applied consecutively to each node in the graph. The method is computationally very efficient and is consistent even for the high-dimensional setting, as will be shown.

Neighborhood selection is a subproblem of covariance selection. The neighborhood $\mathrm{ne}_a$ of a node $a \in \Gamma$ is the smallest subset of $\Gamma \setminus \{a\}$ so that, given all variables $X_{\mathrm{ne}_a}$ in the neighborhood, $X_a$ is conditionally independent of all remaining variables. The neighborhood of a node $a \in \Gamma$ consists of all nodes $b \in \Gamma \setminus \{a\}$ so that $(a, b) \in E$. Given $n$ i.i.d. observations of $X$, neighborhood selection aims at estimating (individually) the neighborhood of any given variable (or node). The neighborhood selection can be cast into a standard regression problem and can be solved efficiently with the Lasso (Tibshirani, 1996), as will be shown in this paper.

The consistency of the proposed neighborhood selection will be shown for sparse high-dimensional graphs, where the number of variables is potentially growing like any power of the number of observations (high-dimensionality), whereas the number of neighbors of any variable is growing at most slightly slower than the number of observations (sparsity).

A number of studies have examined the case of regression with a growing number of parameters as sample size increases. The closest to our setting is the recent work of Greenshtein and Ritov (2004), who study consistent prediction in a triangular setup very similar to ours (see also Juditsky and Nemirovski, 2000). However, the problem of consistent estimation of the model structure, which is the relevant concept for graphical models, is very different and not treated in these studies.

We study in Section 2 under which conditions, and at which rate, the neighborhood estimate with the Lasso converges to the true neighborhood. The choice of the penalty is crucial in the high-dimensional setting. The oracle penalty for optimal prediction turns out to be inconsistent for estimation of the true model. This solution might include an unbounded number of noise variables in the model. We motivate a different choice of the penalty such that the probability of falsely connecting two or more distinct connectivity components of the graph is controlled at very low levels. Asymptotically, the probability of estimating the correct neighborhood converges exponentially to 1, even when the number of nodes in the graph is growing rapidly like any power of the number of observations. As a consequence, consistent estimation of the full edge set in a sparse high-dimensional graph is possible (Section 3). Encouraging numerical results are provided in Section 4. The proposed estimate is shown to be both more accurate than the traditional forward selection MLE strategy and computationally much more efficient. The accuracy of the forward selection MLE fit is in particular poor if the number of nodes in the graph is comparable to the number of observations. In contrast, neighborhood selection with the Lasso is shown to be reasonably accurate for estimating graphs with several thousand nodes, using only a few hundred observations.

2. Neighborhood Selection. Instead of assuming a fixed true underlying model, we adopt a more flexible approach similar to the triangular setup in Greenshtein and Ritov (2004).
Both the number of nodes in the graphs (number of variables), denoted by $p(n) = |\Gamma(n)|$, and the distribution (the covariance matrix) depend in general on the number of observations, so that $\Gamma = \Gamma(n)$ and $\Sigma = \Sigma(n)$. The neighborhood $\mathrm{ne}_a$ of a node $a \in \Gamma(n)$ is the smallest subset of $\Gamma(n) \setminus \{a\}$ so that $X_a$ is conditionally independent of all remaining variables. Denote the closure of node $a \in \Gamma(n)$ by $\mathrm{cl}_a := \mathrm{ne}_a \cup \{a\}$. Then

$$X_a \perp \{X_k;\ k \in \Gamma(n) \setminus \mathrm{cl}_a\} \mid X_{\mathrm{ne}_a}.$$

For details see Lauritzen (1996). The neighborhood depends in general on $n$ as well. However, this dependence is notationally suppressed in the following.

It is instructive to give a slightly different definition of a neighborhood. For each node $a \in \Gamma(n)$, consider optimal prediction of $X_a$, given all remaining variables. Let $\theta^a \in \mathbb{R}^{p(n)}$ be the vector of coefficients for optimal prediction,

$$\theta^a = \arg\min_{\theta:\,\theta_a = 0} E\Bigl(X_a - \sum_{k \in \Gamma(n)} \theta_k X_k\Bigr)^2. \tag{1}$$

As a generalization of (1), which will be of use later, consider optimal prediction of $X_a$, given only a subset of variables $\{X_k;\ k \in \mathcal{A}\}$, where $\mathcal{A} \subseteq \Gamma(n) \setminus \{a\}$. The optimal prediction is characterized by the vector $\theta^{a,\mathcal{A}}$,

$$\theta^{a,\mathcal{A}} = \arg\min_{\theta:\,\theta_k = 0,\ \forall k \notin \mathcal{A}} E\Bigl(X_a - \sum_{k \in \Gamma(n)} \theta_k X_k\Bigr)^2. \tag{2}$$

The elements of $\theta^a$ are determined by the inverse covariance matrix (Lauritzen, 1996). For $b \in \Gamma(n) \setminus \{a\}$ and $K(n) = \Sigma^{-1}(n)$, it holds that $\theta^a_b = -K_{ab}(n)/K_{aa}(n)$. The set of non-zero coefficients of $\theta^a$ is identical to the set $\{b \in \Gamma(n) \setminus \{a\} : K_{ab}(n) \neq 0\}$ of non-zero entries in the corresponding row vector of the inverse covariance matrix and defines precisely the set of neighbors of node $a$. The best predictor for $X_a$ is thus a linear function of variables in the set of neighbors of the node $a$ only. The set of neighbors of a node $a \in \Gamma(n)$ can hence be written as

$$\mathrm{ne}_a = \{b \in \Gamma(n) : \theta^a_b \neq 0\}.$$

This set corresponds to the set of effective predictor variables in regression with response variable $X_a$ and predictor variables $\{X_k;\ k \in \Gamma(n) \setminus \{a\}\}$. Given $n$ independent observations of $X \sim \mathcal{N}(0, \Sigma(n))$, neighborhood selection tries to estimate the set of neighbors of a node $a \in \Gamma(n)$. As the optimal linear prediction of $X_a$ has non-zero coefficients precisely for variables in the set of neighbors of the node $a$, it seems reasonable to try to exploit this relation.
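The characterization $\theta^a_b = -K_{ab}(n)/K_{aa}(n)$ is easy to verify numerically. The sketch below (ours; it reuses the hypothetical chain-graph precision matrix from the Introduction) computes the population regression coefficients of $X_a$ on all remaining variables directly from $\Sigma$ and checks that they match the rescaled row of $K = \Sigma^{-1}$, vanishing exactly off $\mathrm{ne}_a$.

```python
import numpy as np

# Same hypothetical chain graph 1 - 2 - 3 - 4 as before.
p = 4
K = np.eye(p)
for i in range(p - 1):
    K[i, i + 1] = K[i + 1, i] = 0.4
Sigma = np.linalg.inv(K)

a = 1                                     # predict X_a from all other nodes
rest = [k for k in range(p) if k != a]

# Coefficients of the best linear predictor in (1):
# theta = Sigma_{-a,-a}^{-1} Sigma_{-a,a}
theta = np.linalg.solve(Sigma[np.ix_(rest, rest)], Sigma[rest, a])

# Characterization via the inverse covariance matrix
theta_from_K = -K[rest, a] / K[a, a]

print(np.allclose(theta, theta_from_K))   # True
print(np.round(theta, 3))                 # non-zero only for the neighbors 0 and 2
```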
Neighborhood selection with the Lasso. It is well known that the Lasso, introduced by Tibshirani (1996), and known as Basis Pursuit in the context of wavelet regression (Chen et al., 2001), has a parsimonious property (Knight and Fu, 2000). When predicting a variable $X_a$ with all remaining variables $\{X_k;\ k \in \Gamma(n) \setminus \{a\}\}$, the vanishing Lasso coefficient estimates identify asymptotically the neighborhood of node $a$ in the graph, as shown in the following. Let the $n \times p(n)$-dimensional matrix $\mathbf{X}$ contain $n$ independent observations of $X$, so that the columns $\mathbf{X}_a$ correspond for all $a \in \Gamma(n)$ to the vector of $n$ independent observations of $X_a$. Let $\langle \cdot, \cdot \rangle$ be the usual inner product on $\mathbb{R}^n$ and $\|\cdot\|_2$ the corresponding norm. The Lasso estimate $\hat\theta^{a,\lambda}$ of $\theta^a$ is given by

$$\hat\theta^{a,\lambda} = \arg\min_{\theta:\,\theta_a = 0} \bigl( n^{-1}\|\mathbf{X}_a - \mathbf{X}\theta\|_2^2 + \lambda\|\theta\|_1 \bigr), \tag{3}$$

where $\|\theta\|_1 = \sum_{b \in \Gamma(n)} |\theta_b|$ is the $\ell_1$-norm of the coefficient vector. Normalization of all variables to common empirical variance is recommended for the estimator in (3). The solution to (3) is not necessarily unique. However, if uniqueness fails, the set of solutions is still convex and all our results about neighborhoods (in particular Theorems 1 and 2) hold for any solution of (3).

Other regression estimates have been proposed, which are based on the $\ell_p$-norm, where $p$ is typically in the range $[0, 2]$ (Frank and Friedman, 1993). A value of $p = 2$ leads to the ridge estimate, while $p = 0$ corresponds to traditional model selection. It is well known that the estimates have a parsimonious property (with some components being exactly zero) for $p \le 1$ only, while the optimization problem in (3) is only convex for $p \ge 1$. Hence $\ell_1$-constrained empirical risk minimization occupies a unique position, as $p = 1$ is the only value of $p$ for which variable selection takes place while the optimization problem is still convex and hence feasible for high-dimensional problems.

The neighborhood estimate (parameterized by $\lambda$) is defined by the non-zero coefficient estimates of the $\ell_1$-penalized regression,

$$\hat{\mathrm{ne}}{}^\lambda_a = \{b \in \Gamma(n) : \hat\theta^{a,\lambda}_b \neq 0\}.$$

Each choice of a penalty parameter $\lambda$ thus specifies an estimate of the neighborhood $\mathrm{ne}_a$ of node $a \in \Gamma(n)$, and one is left with the choice of a suitable penalty parameter. Larger values of the penalty tend to shrink the size of the estimated set, while more variables are in general included into $\hat{\mathrm{ne}}{}^\lambda_a$ if the value of $\lambda$ is diminished.
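As a concrete, non-authoritative sketch of the resulting procedure, the following code runs one $\ell_1$-penalized regression per node on simulated data from the hypothetical chain graph, using scikit-learn's Lasso. The fixed value $\lambda = 0.1$ is an arbitrary assumption made only for illustration, since the appropriate choice of $\lambda$ is exactly what the next subsections analyze.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)

# Simulated data from the hypothetical chain graph 1 - 2 - 3 - 4.
p, n, lam = 4, 500, 0.1
K = np.eye(p)
for i in range(p - 1):
    K[i, i + 1] = K[i + 1, i] = 0.4
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(K), size=n)
X = (X - X.mean(axis=0)) / X.std(axis=0)  # common empirical variance, as recommended for (3)

neighborhoods = {}
for a in range(p):
    rest = [k for k in range(p) if k != a]
    # scikit-learn minimizes (2n)^{-1} ||y - Xw||_2^2 + alpha ||w||_1,
    # i.e. (3) up to a factor 2 in the quadratic term (alpha = lambda / 2).
    fit = Lasso(alpha=lam).fit(X[:, rest], X[:, a])
    neighborhoods[a] = [rest[j] for j in np.flatnonzero(fit.coef_)]

print(neighborhoods)  # ideally {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```

Note that the per-node estimates need not be symmetric in $a$ and $b$; combining them into an estimate of the full edge set is the subject of Section 3.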
The prediction-oracle solution. A seemingly useful choice of the penalty parameter is the (unavailable) prediction-oracle value,

$$\lambda_{\mathrm{oracle}} = \arg\min_{\lambda} E\Bigl(X_a - \sum_{k \in \Gamma(n)} \hat\theta^{a,\lambda}_k X_k\Bigr)^2.$$

The expectation is understood to be with respect to a new $X$, which is independent of the sample on which $\hat\theta^{a,\lambda}$ is estimated. The prediction-oracle penalty minimizes the predictive risk among all Lasso estimates. An estimate of $\lambda_{\mathrm{oracle}}$ is obtained by the cross-validated choice $\lambda_{\mathrm{cv}}$. For $\ell_0$-penalized regression it was shown by Shao (1993) that the cross-validated choice of the penalty parameter is consistent for model selection under certain conditions on the size of the validation set. The prediction-oracle solution does not lead to consistent model selection for the Lasso, as shown in the following for a simple example.

Proposition 1. Let the number of variables grow to infinity, $p(n) \to \infty$ for $n \to \infty$, with $p(n) = o(n^\gamma)$ for some $\gamma > 0$. Assume that the covariance matrices $\Sigma(n)$ are identical to the identity matrix except for some pair $(a, b) \in \Gamma(n) \times \Gamma(n)$, for which $\Sigma_{ab}(n) = \Sigma_{ba}(n) = s$, for some $0 < s < 1$ and all $n \in \mathbb{N}$. The probability of selecting the wrong neighborhood for node $a$ converges to 1 under the prediction-oracle penalty,

$$P\bigl(\hat{\mathrm{ne}}{}^{\lambda_{\mathrm{oracle}}}_a \neq \mathrm{ne}_a\bigr) \to 1 \quad \text{for } n \to \infty.$$

A proof is given in the appendix. It follows from the proof of Proposition 1 that many noise variables are included in the neighborhood estimate with the prediction-oracle solution. In fact, the probability of including noise variables with the prediction-oracle solution does not even vanish asymptotically for a fixed number of variables. If the penalty is chosen larger than the prediction-optimal value, consistent neighborhood selection is possible with the Lasso, as demonstrated in the following.

Assumptions. We make a few assumptions to prove consistency of neighborhood selection with the Lasso. We always assume availability of $n$ independent observations from $X \sim \mathcal{N}(0, \Sigma)$.

High-dimensionality. The number of variables is allowed to grow like the number of observations $n$ raised to an arbitrarily high power.

Assumption 1. There exists $\gamma > 0$, so that $p(n) = O(n^\gamma)$ for $n \to \infty$.

It is in particular allowed for the following analysis that the number of variables is very much larger than the number of observations, $p(n) \gg n$.

Non-singularity. We make two regularity assumptions for the covariance matrices.

Assumption 2. For all $a \in \Gamma(n)$ and $n \in \mathbb{N}$, $\mathrm{Var}(X_a) = 1$. There exists $v^2 > 0$, so that for all $n \in \mathbb{N}$ and $a \in \Gamma(n)$, $\mathrm{Var}(X_a \mid X_{\Gamma(n) \setminus \{a\}}) \ge v^2$.

Common variance can always be achieved by appropriate scaling of the variables. A scaling to common (empirical) variance of all variables is desirable, as the solutions would otherwise depend on the chosen units or dimensions in which they are represented. The second part of the assumption explicitly excludes singular or nearly singular covariance matrices. For singular covariance matrices, edges are not uniquely defined by the distribution and it is hence not surprising that nearly singular covariance matrices are not suitable for consistent variable selection. Note, however, that the empirical covariance matrix is a.s. singular if $p(n) > n$, which is allowed in our analysis.

Sparsity. The main assumption is the sparsity of the graph. This entails a restriction on the size of the neighborhoods of variables.

Assumption 3. There exists some $0 \le \kappa < 1$ so that

$$\max_{a \in \Gamma(n)} |\mathrm{ne}_a| = O(n^\kappa) \quad \text{for } n \to \infty.$$

This assumption limits the maximal possible rate of growth for the size of neighborhoods. For the next sparsity condition, consider again the definition in (2) of the optimal coefficient $\theta^{b,\mathcal{A}}$ for prediction of $X_b$, given variables in the set $\mathcal{A} \subseteq \Gamma(n)$.

Assumption 4. There exists some $\vartheta < \infty$ so that for all neighboring nodes $a, b \in \Gamma(n)$ and all $n \in \mathbb{N}$,

$$\|\theta^{a,\,\mathrm{ne}_b \setminus \{a\}}\|_1 \le \vartheta.$$

This assumption is, e.g., fulfilled if Assumption 2 holds and the size of the overlap of neighborhoods is bounded by an arbitrarily large number from above for neighboring nodes. That is, if there exists some $m < \infty$ so that for all $n \in \mathbb{N}$,

$$\max_{a, b \in \Gamma(n),\ b \in \mathrm{ne}_a} |\mathrm{ne}_a \cap \mathrm{ne}_b| \le m, \tag{4}$$

then Assumption 4 is fulfilled. To see this, note that Assumption 2 gives a finite bound for the $\ell_2$-norm of $\theta^{a,\,\mathrm{ne}_b \setminus \{a\}}$, while (4) gives a finite bound for the $\ell_0$-norm. Taken together, Assumption 4 is implied.

Magnitude of partial correlations. The next assumption bounds the magnitude of partial correlations from below. The partial correlation $\pi_{ab}$ between variables $X_a$ and $X_b$ is the correlation after having eliminated the linear effects from all remaining variables $\{X_k;\ k \in \Gamma(n) \setminus \{a, b\}\}$; for details see Lauritzen (1996).

Assumption 5. There exists a constant $\delta > 0$ and some $\xi > \kappa$, with $\kappa$ as in Assumption 3, so that for every $(a, b) \in E$,

$$|\pi_{ab}| \ge \delta n^{-(1-\xi)/2}.$$

It will be shown below that Assumption 5 cannot be relaxed in general. Note that neighborhood selection for node $a \in \Gamma(n)$ is equivalent to simultaneously testing the null hypothesis of zero partial correlation between variable $X_a$ and all remaining variables $X_b$, $b \in \Gamma(n) \setminus \{a\}$. The null hypothesis of zero partial correlation between two variables can be tested by using the corresponding entry in the normalized inverse empirical covariance matrix. A graph estimate based on such tests has been proposed by Drton and Perlman (2004). Such a test can only be applied, however, if the number of variables is smaller than the number of observations, $p(n) < n$, as the empirical covariance matrix is singular otherwise. Even if $p(n) = n - c$ for some constant $c > 0$, Assumption 5 would have to hold with $\xi = 1$ for such an estimate to have positive power of rejecting false null hypotheses; that is, partial correlations would have to be bounded from below by a positive value.
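For reference, the partial correlation appearing in Assumption 5 has a closed form in terms of the inverse covariance matrix $K(n) = \Sigma^{-1}(n)$. This standard identity (not spelled out in the text; see Lauritzen, 1996) is what makes the entries of the normalized inverse empirical covariance matrix usable as test statistics:

$$\pi_{ab} = -\frac{K_{ab}(n)}{\sqrt{K_{aa}(n)\,K_{bb}(n)}}.$$

In particular, $\pi_{ab} \neq 0$ exactly when $(a, b) \in E$, and Assumption 5 requires these non-zero partial correlations to decay no faster than $n^{-(1-\xi)/2}$.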
Neighborhood stability. The last assumption is referred to as neighborhood stability. Using the definition of $\theta^{a,\mathcal{A}}$ in equation (2), define for all $a, b \in \Gamma(n)$,

$$S_a(b) := \sum_{k \in \mathrm{ne}_a} \mathrm{sign}\bigl(\theta^{a,\,\mathrm{ne}_a}_k\bigr)\, \theta^{b,\,\mathrm{ne}_a}_k. \tag{5}$$

The assumption of neighborhood stability restricts the magnitude of the quantities $S_a(b)$ for non-neighboring nodes $a, b \in \Gamma(n)$.

Assumption 6. There exists some $\delta < 1$ so that for all $a, b \in \Gamma(n)$ with $b \notin \mathrm{ne}_a$,

$$|S_a(b)| < \delta.$$

It is shown in Proposition 3 that this assumption cannot be relaxed. We give in the following a more intuitive condition which essentially implies Assumption 6. This will justify the term neighborhood stability. Consider the definition in equation (1) of the optimal coefficients $\theta^a$ for prediction of $X_a$. For $\eta > 0$, define $\theta^a(\eta)$ as the optimal set of coefficients under an additional $\ell_1$-penalty,

$$\theta^a(\eta) := \arg\min_{\theta:\,\theta_a = 0} E\Bigl(X_a - \sum_{k \in \Gamma(n)} \theta_k X_k\Bigr)^2 + \eta\|\theta\|_1. \tag{6}$$

The neighborhood $\mathrm{ne}_a$ of node $a$ was defined as the set of non-zero coefficients of $\theta^a$, $\mathrm{ne}_a = \{k \in \Gamma(n) : \theta^a_k \neq 0\}$. Define the disturbed neighborhood $\mathrm{ne}_a(\eta)$ as

$$\mathrm{ne}_a(\eta) := \{k \in \Gamma(n) : \theta^a_k(\eta) \neq 0\}.$$

It clearly holds that $\mathrm{ne}_a = \mathrm{ne}_a(0)$. The assumption of neighborhood stability is fulfilled if there exists some infinitesimally small perturbation $\eta$, which may depend on $n$, so that the disturbed neighborhood $\mathrm{ne}_a(\eta)$ is identical to the undisturbed neighborhood $\mathrm{ne}_a(0)$.

Proposition 2. If there exists some $\eta > 0$ so that $\mathrm{ne}_a(\eta) = \mathrm{ne}_a(0)$, then $|S_a(b)| < 1$ for all $b \in \Gamma(n) \setminus \mathrm{ne}_a$.

A proof is given in the appendix. In light of Proposition 2 it seems that Assumption 6 is a very weak condition. To give one example, Assumption 6 is automatically fulfilled under the much stronger assumption that the graph does not contain cycles. We give a brief reasoning for this. Consider two non-neighboring nodes $a$ and $b$. If the nodes are in different connectivity components, there is nothing left to show, as $S_a(b) = 0$. If they are in the same connectivity component, then there exists one node $k \in \mathrm{ne}_a$ that separates $b$ from $\mathrm{ne}_a \setminus \{k\}$, as there is just one unique path between any two variables in the same connectivity component if the graph does not contain cycles. Using the global Markov property, the random variable $X_b$ is independent of $X_{\mathrm{ne}_a \setminus \{k\}}$, given $X_k$. The random variable $E(X_b \mid X_{\mathrm{ne}_a})$ is thus a function of $X_k$ only. As the distribution is Gaussian, $E(X_b \mid X_{\mathrm{ne}_a}) = \theta^{b,\,\mathrm{ne}_a}_k X_k$. By Assumption 2, $\mathrm{Var}(X_b \mid X_{\mathrm{ne}_a}) = v^2$ for some $v^2 > 0$. It follows that $\mathrm{Var}(X_b) = v^2 + (\theta^{b,\,\mathrm{ne}_a}_k)^2 = 1$ and hence $|\theta^{b,\,\mathrm{ne}_a}_k| = \sqrt{1 - v^2} < 1$, which implies that Assumption 6 is indeed fulfilled if the graph does not contain cycles. We mention that Assumption 6 is likewise fulfilled if the inverse covariance matrices $\Sigma^{-1}(n)$ are for each $n \in \mathbb{N}$ diagonally dominant. A matrix is said to be diagonally dominant if and only if, for each row, the sum of the absolute values of the non-diagonal elements is smaller than the absolute value of the diagonal element.