Description

Chapter 5 Least Squares The term least squares describes a frequently used approach to solving overdetermined or inexactly specified systems of equations in an approximate sense. Instead of solving the

Information

Category:
## Fan Fiction

Publish on:

Views: 15 | Pages: 21

Extension: PDF | Download: 0

Share

Transcript

Chapter 5 Least Squares The term least squares describes a frequently used approach to solving overdetermined or inexactly specified systems of equations in an approximate sense. Instead of solving the equations exactly, we seek only to minimize the sum of the squares of the residuals. The least squares criterion has important statistical interpretations. If appropriate probabilistic assumptions about underlying error distributions are made, least squares produces what is known as the maximum-likelihood estimate of the parameters. Even if the probabilistic assumptions are not satisfied, years of experience have shown that least squares produces useful results. The computational techniques for linear least squares problems make use of orthogonal matrix factorizations. 5.1 Models and Curve Fitting A very common source of least squares problems is curve fitting. Let t be the independent variable and let y(t) denote an unknown function of t that we want to approximate. Assume there are m observations, i.e., values of y measured at specified values of t: y i = y(t i ), i = 1,..., m. The idea is to model y(t) by a linear combination of n basis functions: y(t) β 1 ϕ 1 (t) + + β n ϕ n (t). The design matrix X is a rectangular matrix of order m by n with elements x i,j = ϕ j (t i ). The design matrix usually has more rows than columns. In matrix-vector notation, the model is y Xβ. September 17, 2 Chapter 5. Least Squares The symbol stands for is approximately equal to. We are more precise about this in the next section, but our emphasis is on least squares approximation. The basis functions ϕ j (t) can be nonlinear functions of t, but the unknown parameters, β j, appear in the model linearly. The system of linear equations Xβ y is overdetermined if there are more equations than unknowns. The Matlab backslash operator computes a least squares solution to such a system. beta = X\y The basis functions might also involve some nonlinear parameters, α 1,..., α p. The problem is separable if it involves both linear and nonlinear parameters: y(t) β 1 ϕ 1 (t, α) + + β n ϕ n (t, α). The elements of the design matrix depend upon both t and α: x i,j = ϕ j (t i, α). Separable problems can be solved by combining backslash with the Matlab function fminsearch or one of the nonlinear minimizers available in the Optimization Toolbox. The new Curve Fitting Toolbox provides a graphical interface for solving nonlinear fitting problems. Some common models include the following: Straight line: If the model is also linear in t, it is a straight line: y(t) β 1 t + β 2. Polynomials: The coefficients β j appear linearly. Matlab orders polynomials with the highest power first: ϕ j (t) = t n j, j = 1,..., n, y(t) β 1 t n β n 1 t + β n. The Matlab function polyfit computes least squares polynomial fits by setting up the design matrix and using backslash to find the coefficients. Rational functions: The coefficients in the numerator appear linearly; the coefficients in the denominator appear nonlinearly: ϕ j (t) = t n j α 1 t n α n 1 t + α n, y(t) β 1t n β n 1 t + β n α 1 t n α n 1 t + α n. 5.2. Norms 3 Exponentials: The decay rates, λ j, appear nonlinearly: ϕ j (t) = e λjt, y(t) β 1 e λ 1t + + β n e λ nt. Log-linear: If there is only one exponential, taking logs makes the model linear but changes the fit criterion: y(t) Ke λt, log y β 1 t + β 2, with β 1 = λ, β 2 = log K. Gaussians: The means and variances appear nonlinearly: ϕ j (t) = e ( t µ j σ j ) 2, ( y(t) β 1 e t µ1 σ 1 ) 2 t µn + + β n e ( σn ) Norms The residuals are the differences between the observations and the model: r i = y i Or, in matrix-vector notation, n β j ϕ j (t i, α), i = 1,..., m. 1 r = y X(α)β. We want to find the α s and β s that make the residuals as small as possible. What do we mean by small? In other words, what do we mean when we use the symbol? There are several possibilities. Interpolation: If the number of parameters is equal to the number of observations, we might be able to make the residuals zero. For linear problems, this will mean that m = n and that the design matrix X is square. If X is nonsingular, the β s are the solution to a square system of linear equations: β = X \y. Least squares: Minimize the sum of the squares of the residuals: r 2 = m ri 2. 1 4 Chapter 5. Least Squares Weighted least squares: If some observations are more important or more accurate than others, then we might associate different weights, w j, with different observations and minimize r 2 w = m w i ri 2. 1 For example, if the error in the ith observation is approximately e i, then choose w i = 1/e i. Any algorithm for solving an unweighted least squares problem can be used to solve a weighted problem by scaling the observations and design matrix. We simply multiply both y i and the ith row of X by w i. In Matlab, this can be accomplished with X = diag(w)*x y = diag(w)*y One-norm: Minimize the sum of the absolute values of the residuals: r 1 = m r i. 1 This problem can be reformulated as a linear programming problem, but it is computationally more difficult than least squares. The resulting parameters are less sensitive to the presence of spurious data points or outliers. Infinity-norm: Minimize the largest residual: r = max r i. i This is also known as a Chebyshev fit and can be reformulated as a linear programming problem. Chebyshev fits are frequently used in the design of digital filters and in the development of approximations for use in mathematical function libraries. The Matlab Optimization and Curve Fitting Toolboxes include functions for one-norm and infinity-norm problems. We will limit ourselves to least squares in this book. 5.3 censusgui The NCM program censusgui involves several different linear models. The data are the total population of the United States, as determined by the U.S. Census, for the years 1900 to The units are millions of people. 5.3. censusgui 5 t y The task is to model the population growth and predict the population when t = The default model in censusgui is a cubic polynomial in t: y β 1 t 3 + β 2 t 2 + β 3 t + β 4. There are four unknown coefficients, appearing linearly. 400 Predict U.S. Population in Millions Figure 5.1. censusgui. Numerically, it s a bad idea to use powers of t as basis functions when t is around 1900 or The design matrix is badly scaled and its columns are nearly linearly dependent. A much better basis is provided by powers of a translated and scaled t: s = (t 1955)/55. 6 Chapter 5. Least Squares This new variable is in the interval 1 s 1 and the model is y β 1 s 3 + β 2 s 2 + β 3 s + β 4. The resulting design matrix is well conditioned. Figure 5.1 shows the fit to the census data by the default cubic polynomial. The extrapolation to the year 2020 seems reasonable. The push buttons allow you to vary the degree of the polynomial. As the degree increases, the fit becomes more accurate in the sense that r decreases, but it also becomes less useful because the variation between and outside the observations increases. The censusgui menu also allows you to choose interpolation by spline and pchip and to see the log-linear fit y Ke λt. Nothing in the censusgui tool attempts to answer the all-important question, Which is the best model? That s up to you to decide. 5.4 Householder Reflections Householder reflections are matrix transformations that are the basis for some of the most powerful and flexible numerical algorithms known. We will use Householder reflections in this chapter for the solution of linear least squares problems and in a later chapter for the solution of matrix eigenvalue and singular value problems. Formally, a Householder reflection is a matrix of the form H = I ρuu T, where u is any nonzero vector and ρ = 2/ u 2. The quantity uu T is a matrix of rank one where every column is a multiple of u and every row is a multiple of u T. The resulting matrix H is both symmetric and orthogonal, that is, and H T = H H T H = H 2 = I. In practice, the matrix H is never formed. Instead, the application of H to a vector x is computed by τ = ρu T x, Hx = x τu. Geometrically, the vector x is projected onto u and then twice that projection is subtracted from x. Figure 5.2 shows a vector u and a line labeled u that is perpendicular to u. It also shows two vectors, x and y, and their images, Hx and Hy, under the 5.4. Householder Reflections 7 Householder reflection u x u Hx y Hy Figure 5.2. Householder reflection. transformation H. The matrix transforms any vector into its mirror image in the line u. For any vector x, the point halfway between x and Hx, that is, the vector x (τ/2)u, is actually on the line u. In more than two dimensions, u is the plane perpendicular to the defining vector u. Figure 5.2 also shows what happens if u bisects the angle between x and one of the axes. The resulting Hx then falls on that axis. In other words, all but one of the components of Hx are zero. Moreover, since H is orthogonal, it preserves length. Consequently, the nonzero component of Hx is ± x. For a given vector x, the Householder reflection that zeros all but the kth component of x is given by σ = ± x, u = x + σe k, ρ = 2/ u 2 = 1/(σu k ), H = I ρuu T. In the absence of roundoff error, either sign could be chosen for σ, and the resulting Hx would be on either the positive or the negative kth axis. In the presence of roundoff error, it is best to choose the sign so that sign σ = sign x k. Then the operation x k + σ is actually an addition, not a subtraction. 8 Chapter 5. Least Squares 5.5 The QR Factorization If all the parameters appear linearly and there are more observations than basis functions, we have a linear least squares problem. The design matrix X is m by n with m n. We want to solve Xβ y. But this system is overdetermined there are more equations than unknowns. So we cannot expect to solve the system exactly. Instead, we solve it in the least squares sense: min β Xβ y. A theoretical approach to solving the overdetermined system begins by multiplying both sides by X T. This reduces the system to a square, n-by-n system known as the normal equations: X T Xβ = X T y. If there are thousands of observations and only a few parameters, the design matrix X is quite large, but the matrix X T X is small. We have projected y onto the space spanned by the columns of X. Continuing with this theoretical approach, if the basis functions are independent, then X T X is nonsingular and β = (X T X ) 1 X T y. This formula for solving linear least squares problems appears in most textbooks on statistics and numerical methods. However, there are several undesirable aspects to this theoretical approach. We have already seen that using a matrix inverse to solve a system of equations is more work and less accurate than solving the system by Gaussian elimination. But, more importantly, the normal equations are always more badly conditioned than the original overdetermined system. In fact, the condition number is squared: κ(x T X ) = κ(x ) 2. With finite-precision computation, the normal equations can actually become singular, and (X T X ) 1 nonexistent, even though the columns of X are independent. As an extreme example, consider the design matrix X = 1 1 δ 0. 0 δ If δ is small, but nonzero, the two columns of X are nearly parallel but are still linearly independent. The normal equations make the situation worse: ( ) X T 1 + δ 2 1 X = δ 2. 5.5. The QR Factorization 9 If δ 10 8, the matrix X T X computed with double-precision floating-point arithmetic is exactly singular and the inverse required in the classic textbook formula does not exist. Matlab avoids the normal equations. The backslash operator not only solves square, nonsingular systems, but also computes the least squares solution to rectangular, overdetermined systems: β = X \y. Most of the computation is done by an orthogonalization algorithm known as the QR factorization. The factorization is computed by the built-in function qr. The NCM function qrsteps demonstrates the individual steps. The two versions of the QR factorization are illustrated in Figure 5.3. Both versions have X = QR. In the full version, R is the same size as X and Q is a square matrix with as many rows as X. In the economy-sized version, Q is the same size as X and R is a square matrix with as many columns as X. The letter Q is a substitute for the letter O in orthogonal and the letter R is for right triangular matrix. The Gram Schmidt process described in many linear algebra texts is a related, but numerically less satisfactory, algorithm that generates the same factorization. X = Q R X = Q R Figure 5.3. Full and economy QR factorizations. A sequence of Householder reflections is applied to the columns of X to produce the matrix R: H n H 2 H 1 X = R. 10 Chapter 5. Least Squares The jth column of R is a linear combination of the first j columns of X. Consequently, the elements of R below the main diagonal are zero. If the same sequence of reflections is applied to the right-hand side, the equations Xβ y become where Rβ z, H n H 2 H 1 y = z. The first n of these equations is a small, square, triangular system that can be solved for β by back substitution with the subfunction backsubs in bslashtx. The coefficients in the remaining m n equations are all zero, so these equations are independent of β and the corresponding components of z constitute the transformed residual. This approach is preferable to the normal equations because Householder reflections have impeccable numerical credentials and because the resulting triangular system is ready for back substitution. The matrix Q in the QR factorization is Q = (H n H 2 H 1 ) T. To solve least squares problems, we do not have to actually compute Q. In other uses of the factorization, it may be convenient to have Q explicitly. If we compute just the first n columns, we have the economy-sized factorization. If we compute all m columns, we have the full factorization. In either case, Q T Q = I, so Q has columns that are perpendicular to each other and have unit length. Such a matrix is said to have orthonormal columns. For the full Q, it is also true that QQ T = I, so the full Q is an orthogonal matrix. Let s illustrate this with a small version of the census example. We will fit the six observations from 1950 to 2000 with a quadratic: y(s) β 1 s 2 + β 2 s + β 3. The scaled time s = ((1950:10:2000) )/50 and the observations y are s y 5.5. The QR Factorization 11 The design matrix is X = [s.*s s ones(size(s))] The M-file qrsteps shows the steps in the QR factorization. qrsteps(x,y) The first step introduces zeros below the diagonal in the first column of X The same Householder reflection is applied to y Zeros are introduced in the second column The second Householder reflection is also applied to y Finally, zeros are introduced in the third column and the reflection applied to y. This produces the triangular matrix R and a modified right-hand side z. 12 Chapter 5. Least Squares R = z = The system of equations Rβ = z is the same size as the original, 6 by 3. We can solve the first three equations exactly (because R(1 : 3, 1 : 3) is nonsingular). beta = R(1:3,1:3)\z(1:3) beta = This is the same solution beta that the backslash operator computes with or beta = R\z beta = X\y The last three equations in Rβ = z cannot be satisfied by any choice of β, so the last three components of z represent the residual. In fact, the two quantities norm(z(4:6)) norm(x*beta - y) are both equal to Notice that even though we used the QR factorization, we never actually computed Q. The population in the year 2010 can be predicted by evaluating β 1 s 2 + β 2 s + β 3 at s = ( )/50 = 1.2. This can be done with polyval. p2010 = polyval(beta,1.2) p2010 = The actual 2010 census found the population to be million. 5.6. Pseudoinverse Pseudoinverse The definition of the pseudoinverse makes use of the Frobenius norm of a matrix: ( 1/2 A F = ai,j) 2. i j The Matlab expression norm(x, fro ) computes the Frobenius norm. A F the same as the 2-norm of the long vector formed from all the elements of A. is norm(a, fro ) == norm(a(:)) The Moore Penrose pseudoinverse generalizes and extends the usual matrix inverse. The pseudoinverse is denoted by a dagger superscript, and computed by the Matlab pinv. Z = pinv(x) Z = X, If X is square and nonsingular, then the pseudoinverse and the inverse are the same: X = X 1. If X is m by n with m n and X has full rank, then its pseudoinverse is the matrix involved in the normal equations: X = (X T X ) 1 X T. The pseudoinverse has some, but not all, of the properties of the ordinary inverse. X is a left inverse because X X = (X T X ) 1 X T X = I is the n-by-n identity. But X is not a right inverse because XX = X(X T X ) 1 X T only has rank n and so cannot be the m-by-m identity. The pseudoinverse does get as close to a right inverse as possible in the sense that, out of all the matrices Z that minimize Z = X also minimizes XZ I F, Z F. It turns out that these minimization properties also define a unique pseudoinverse even if X is rank deficient. 14 Chapter 5. Least Squares Consider the 1-by-1 case. What is the inverse of a real (or complex) number x? If x is not zero, then clearly x 1 = 1/x. But if x is zero, x 1 does not exist. The pseudoinverse takes care of that because, in the scalar case, the unique number that minimizes both xz 1 and z is x = { 1/x : x 0, 0 : x = 0. The actual computation of the pseudoinverse involves the singular value decomposition, which is described in a later chapter. You can edit pinv or type pinv to see the code. 5.7 Rank Deficiency If X is rank deficient, or has more columns than rows, the square matrix X T X is singular and (X T X ) 1 does not exist. The formula β = (X T X ) 1 X T y obtained from the normal equations breaks down completely. In these degenerate situations, the least squares solution to the linear system Xβ y is not unique. A null vector of X is a nonzero solution to Xη = 0. Any multiple of any null vector can be added to β without changing how well Xβ approximates y. In Matlab, the solution to Xβ y can be computed with either backslash or the pseudoinverse, that is, or beta = X\y beta = pinv(x)*y In the full rank case, these two solutions are the same, although pinv does considerably more computation to obtain it. But in degenerate situations these two solutions are not the same. The solution computed by backslash is called a basic solution. If r is the rank of X, then at most r of the components of beta = X\y 5.7. Rank Deficiency 15 are nonzero. Even the requirement of a basic solution does not guarantee uniqueness. The particular basic computation obtained with backslash is determined by details of the QR factorization. The solution computed by pinv is the minimum norm solution. Out of all the vectors β that minimize Xβ y, the vector beta = pinv(x)*y also minimizes β. This minimum norm solution is unique. For example, let X = and y = The matrix X is rank deficient. The middle column is the average of the first and last columns. The vector is a null vector. The statement beta = X\y produces a warning, η = Warning: Rank deficient, rank = 2 tol = e-14 and the solution beta = As promised, this solution is basic; it has only two nonzero components. However, the vectors beta = 16 Chapter 5. Least Squares and beta = are also basic solutions. The statement beta = pinv(x)*y produces the solution beta = without giving a warning about rank deficiency. The norm of the pseudoinverse solution norm(pinv(x)*y) = is slightly less than the norm of the backslash solution norm(x\y) = Out of all the vectors β that minimize Xβ y, the pseudoinverse has found the shortest. Notice that the difference between the two solutions, X\y - pinv(x)*y = is a multiple of the null vector η. If handled with care, rank deficient least squares problems c

Related Search

Similar documents

We Need Your Support

Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks