

2.1.2 The Right Iterative Method: Conjugate Gradient

The most important difference between CG and Steepest Descent is that CG will find the minimum of an $n$-dimensional quadratic form in $n$ or fewer steps. Because a quadratic form's Hessian is constant, it is possible to plan a set of search directions that avoids redundant calculations. CG exploits this property, making it better suited than Steepest Descent for minimizing quadratic forms. We explain below how the CG search directions and step lengths are chosen.

We would like to have a set of $n$ linearly independent direction vectors $\mathbf{d}_0, \ldots, \mathbf{d}_{n-1}$ and scalars $\alpha_0, \ldots, \alpha_{n-1}$ for which

$\displaystyle \mathbf{e}_0 = \sum_{j=0}^{n-1} \alpha_j \mathbf{d}_j$ (2.7)

This would allow us to travel exactly the right distance in each direction, reducing the error to zero and arriving at a minimizer of the quadratic form $f$ in $n$ or fewer steps. However, the existence of such directions and scalars is not enough. Suppose we knew $\mathbf{e}_0$ and could solve Equation 2.7. The solution would require O$\left( n^3 \right)$ operations to compute, a time complexity we dismissed as infeasible in Section 2.1. If, on the other hand, the vectors $\mathbf{d}_0, \ldots, \mathbf{d}_{n-1}$ were mutually orthogonal, the solution to this equation could be found in O$\left( n \right)$ time per $\alpha_j$.
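
To see where the O$\left( n \right)$ figure comes from (a restatement of the claim above rather than new material), left-multiply Equation 2.7 by $\mathbf{d}_k^T$ and use the assumed orthogonality $\mathbf{d}_k^T \mathbf{d}_j = 0$ for $j \neq k$:

$\displaystyle \mathbf{d}_k^T \mathbf{e}_0 = \sum_{j=0}^{n-1} \alpha_j \mathbf{d}_k^T \mathbf{d}_j = \alpha_k \mathbf{d}_k^T \mathbf{d}_k \quad \Longrightarrow \quad \alpha_k = \frac{\mathbf{d}_k^T \mathbf{e}_0}{\mathbf{d}_k^T \mathbf{d}_k},$

which costs two inner products, hence O$\left( n \right)$ operations, per $\alpha_k$.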

Because we cannot know $\mathbf{e}_0$ without knowing the solution, we cannot solve Equation 2.7 in any amount of time. However, we can compute $\mathbf{A}\mathbf{e}_0$, as shown in Equation 2.6. Left-multiplying Equation 2.7 by $\mathbf{A}$ produces

$\displaystyle \mathbf{A}\mathbf{e}_0 = \sum_{j=0}^{n-1} \alpha_j \mathbf{A} \mathbf{d}_j$ (2.8)

We have returned to our O$\left( n^3 \right)$ problem, and this time orthogonality of $\mathbf{d}_0, \ldots, \mathbf{d}_{n-1}$ will not help because each $\mathbf{d}_j$ has been left-multiplied by $\mathbf{A}$. Instead we want mutual $\mathbf{A}$-orthogonality, wherein we require that $\mathbf{d}_i^T \mathbf{A} \mathbf{d}_j = 0$ for $i \neq j$.

If we assume we can find a set of linearly independent directions $\mathbf{d}_j$ which are mutually $\mathbf{A}$-orthogonal, we may solve for $\alpha_k$ by left-multiplying Equation 2.8 by $\mathbf{d}_k^T$, since

$\displaystyle \mathbf{d}_k^T \mathbf{A} \mathbf{e}_0 = \sum_{j=0}^{n-1} \alpha_j \mathbf{d}_k^T \mathbf{A} \mathbf{d}_j$ (2.9)
$\displaystyle \phantom{\mathbf{d}_k^T \mathbf{A} \mathbf{e}_0} = \alpha_k \mathbf{d}_k^T \mathbf{A} \mathbf{d}_k$ (2.10)

by $\mathbf{A}$-orthogonality, and hence

$\displaystyle \alpha_k = \frac{\mathbf{d}_k^T \mathbf{A} \mathbf{e}_0}{\mathbf{d}_k^T \mathbf{A} \mathbf{d}_k}$ (2.11)

Note that we further require $\mathbf{A} \mathbf{d}_k \neq 0$ for $\alpha_k$ to be well-defined. Our strategy, if a set of linearly independent $\mathbf{A}$-orthogonal vectors $\mathbf{d}_j$ not in the null space of $\mathbf{A}$ exists, is to
  1. Choose $\mathbf{x}_0$
  2. Compute $\mathbf{r}_i = \mathbf{b} - \mathbf{A}\mathbf{x}_i$
  3. Compute $\alpha_i = -\mathbf{d}_i^T \mathbf{r}_0 / (\mathbf{d}_i^T \mathbf{A} \mathbf{d}_i)$
  4. Compute $\mathbf{x}_{i+1} = \mathbf{x}_i - \alpha_i \mathbf{d}_i$
  5. Increment $i$ and repeat from step 2 until $\Vert\mathbf{r}_i\Vert = 0$
This ensures we travel $-\sum_{j=0}^{n-1} \alpha_j \mathbf{d}_j$ in total, eliminating all of $\mathbf{e}_0$ and arriving at the solution $\mathbf{x}^*$ in no more than $n$ iterations, as the short calculation below confirms.
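
Assuming the convention $\mathbf{e}_0 = \mathbf{x}_0 - \mathbf{x}^*$ used with Equation 2.6, the claim amounts to

$\displaystyle \mathbf{x}_n = \mathbf{x}_0 - \sum_{j=0}^{n-1} \alpha_j \mathbf{d}_j = \mathbf{x}_0 - \mathbf{e}_0 = \mathbf{x}^*,$

where the middle equality is Equation 2.7.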

Not only does there exist a set of linearly independent, mutually $\mathbf{A}$-orthogonal direction vectors $\mathbf{d}_j$ not in the null space of $\mathbf{A}$, but each vector can be computed incrementally. Let $\mathbf{d}_0 = \mathbf{r}_0$. This allows computation of $\mathbf{x}_1$ and $\mathbf{r}_1$, and all remaining direction vectors can be computed as needed via

$\displaystyle \mathbf{d}_i = \mathbf{r}_i + \beta_i \mathbf{d}_{i-1}$ (2.12)

where

$\displaystyle \beta_i = \frac{\mathbf{r}_i^T \mathbf{r}_i}{\mathbf{r}_{i-1}^T \mathbf{r}_{i-1}}$ (2.13)

[41,31]. Another name for $\mathbf{A}$-orthogonality is conjugacy, and it is from the conjugacy of its search directions that CG gets its name. Our final linear CG algorithm is shown in Algorithm 1. Note that the step-length computation in line 1.1 of this algorithm can be equivalently written as $\alpha_i := -\mathbf{r}_i^T \mathbf{r}_i / (\mathbf{d}_i^T \mathbf{A} \mathbf{d}_i)$ due to properties described below. This formulation eliminates the need to store $\mathbf{r}_0$. See Shewchuk [41] for details.

[Algorithm 1: linear conjugate gradient. The pseudocode environment did not survive conversion; a code sketch of the same iteration appears below.]
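
For concreteness, here is a minimal sketch of the linear CG iteration just described, written in Python with NumPy. It is not the thesis's implementation: the function name and tolerance are illustrative, and it uses the conventional positive-step form $\mathbf{x}_{i+1} = \mathbf{x}_i + \alpha_i \mathbf{d}_i$ with $\alpha_i = \mathbf{r}_i^T \mathbf{r}_i / (\mathbf{d}_i^T \mathbf{A} \mathbf{d}_i)$, which is the negation of the sign convention used in the text.

    import numpy as np

    def linear_cg(A, b, x0=None, tol=1e-10, max_iter=None):
        """Minimize f(x) = 1/2 x^T A x - b^T x for symmetric positive definite A."""
        n = b.shape[0]
        x = np.zeros(n) if x0 is None else x0.astype(float).copy()
        r = b - A @ x                  # residual r_0 = b - A x_0
        d = r.copy()                   # first direction d_0 = r_0
        rs_old = r @ r
        for _ in range(n if max_iter is None else max_iter):
            if np.sqrt(rs_old) <= tol:
                break                  # residual is (approximately) zero
            Ad = A @ d
            alpha = rs_old / (d @ Ad)  # step length along d_i
            x += alpha * d             # x_{i+1} = x_i + alpha_i d_i
            r -= alpha * Ad            # r_{i+1} = r_i - alpha_i A d_i
            rs_new = r @ r
            d = r + (rs_new / rs_old) * d   # Equations 2.12 and 2.13
            rs_old = rs_new
        return x

In exact arithmetic this loop terminates after at most $n$ passes; the tolerance test exists for the rounded-arithmetic behavior discussed later in this section.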

It is worth emphasizing that each direction vector is a linear combination of the current residual and the previous direction vector. Because the direction vectors are linearly independent, a simple induction implies that $\mathbf{d}_i$ is a linear combination of $\mathbf{r}_0, \ldots, \mathbf{r}_i$ and

$\displaystyle \mathcal{S}_i = \mathrm{span}\left( \mathbf{d}_0, \ldots, \mathbf{d}_i \right) = \mathrm{span}\left( \mathbf{r}_0, \ldots, \mathbf{r}_i \right)$ (2.14)

The space searched so far, $\mathcal{S}_i$, may also be written as a Krylov subspace:

$\displaystyle \mathcal{S}_i = \mathrm{span}\left( \mathbf{d}_0, \mathbf{A}\mathbf{d}_0, \ldots, \mathbf{A}^i \mathbf{d}_0 \right) = \mathrm{span}\left( \mathbf{r}_0, \mathbf{A}\mathbf{r}_0, \ldots, \mathbf{A}^i \mathbf{r}_0 \right)$ (2.15)

See Shewchuk [41] and Nash and Sofer [31] for proofs of these properties.

Figure 2.2 shows the conjugate directions used by CG when started from the same place as Steepest Descent in Figure 2.1. Because CG converges within $n$ iterations, this starting point, which requires $n = 2$ steps, is a worst case for CG. CG does so much better than Steepest Descent in our example because it assumes, correctly, that $f$ is a quadratic form. If $f$ is not a quadratic form then adjustments must be made. The resulting algorithm is often referred to as nonlinear CG, and is discussed briefly in Section 2.2. Predictably, the CG algorithm discussed so far is called linear CG.

We have omitted a demonstration of the existence of a linearly independent set of conjugate directions which are not in the null space of $\mathbf{A}$, and we have also omitted many interesting properties of CG. The value of $\alpha_k$ in Equation 2.11 would be the same had we defined it as the step length for a minimizing line search in direction $\mathbf{d}_k$ starting at $\mathbf{x}_k$. The minimizer is $\mathbf{x}_{k+1}$, the point at which the gradient, and hence the residual, is perpendicular to $\mathbf{d}_k$. Thus $\mathbf{r}_{k+1}$ is perpendicular to $\mathbf{d}_k$. Equation 2.14 then implies that each new residual is perpendicular to the space already searched. It also follows that the residuals are mutually orthogonal and that $\Vert\mathbf{e}_i\Vert_\mathbf{A}^2 = \mathbf{e}_i^T \mathbf{A} \mathbf{e}_i$ is minimized over $\mathcal{S}_i$. These properties help explain the efficiency per iteration of CG.
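
These properties are easy to observe numerically. The following script (illustrative, not part of the thesis) runs the CG recurrences on a small random symmetric positive definite system, then checks that the residuals are mutually orthogonal, that the directions are mutually $\mathbf{A}$-orthogonal, and that the system is solved after $n$ iterations.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 6
    M = rng.standard_normal((n, n))
    A = M @ M.T + n * np.eye(n)          # small symmetric positive definite matrix
    b = rng.standard_normal(n)

    x = np.zeros(n)
    r = b - A @ x
    d = r.copy()
    residuals, directions = [r.copy()], [d.copy()]
    for _ in range(n):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)
        x += alpha * d
        r_new = r - alpha * Ad
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
        residuals.append(r.copy())
        directions.append(d.copy())

    R = np.array(residuals[:n])          # rows r_0, ..., r_{n-1}
    D = np.array(directions[:n])         # rows d_0, ..., d_{n-1}
    print(np.abs(np.triu(R @ R.T, k=1)).max())      # ~0: residuals mutually orthogonal
    print(np.abs(np.triu(D @ A @ D.T, k=1)).max())  # ~0: directions A-orthogonal
    print(np.linalg.norm(b - A @ x))                # ~0: exact solution after n steps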

One may also observe that the computations needed in each CG iteration depend only on $\mathbf{A}$ and the previous iteration, with the exception of $\alpha_k$, which as written in Equation 2.11 depends on $\mathbf{e}_0$ (equivalently $\mathbf{r}_0$) rather than on quantities from the previous iteration. In fact $\alpha_k$ can also be written in terms of the previous iteration [41,31]. Therefore we only need to store a few vectors and $\mathbf{A}$. Because only matrix-vector computations are needed, storage and computation may exploit the structure or sparsity of $\mathbf{A}$.
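
To illustrate how CG can exploit structure, here is a hypothetical matrix-free variant of the earlier sketch: it never forms $\mathbf{A}$, only a function that applies it to a vector, which is all the iteration requires. The example operator (a tridiagonal second-difference stencil, applied in O$\left( n \right)$ time with O$\left( 1 \right)$ extra storage) and all names are illustrative.

    import numpy as np

    def linear_cg_matvec(matvec, b, tol=1e-10, max_iter=None):
        """Linear CG that only needs a function computing the product A @ v."""
        n = b.shape[0]
        x = np.zeros(n)
        r = b - matvec(x)
        d = r.copy()
        rs_old = r @ r
        for _ in range(n if max_iter is None else max_iter):
            if np.sqrt(rs_old) <= tol:
                break
            Ad = matvec(d)
            alpha = rs_old / (d @ Ad)
            x += alpha * d
            r -= alpha * Ad
            rs_new = r @ r
            d = r + (rs_new / rs_old) * d
            rs_old = rs_new
        return x

    def second_difference(v):
        # Applies the symmetric positive definite tridiagonal operator
        # (A v)[i] = 2 v[i] - v[i-1] - v[i+1] without ever storing A.
        out = 2.0 * v
        out[1:] -= v[:-1]
        out[:-1] -= v[1:]
        return out

    x = linear_cg_matvec(second_difference, np.ones(1000))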

It was demonstrated above that CG should arrive at the exact solution $\mathbf{x}^*$ in no more than $n$ iterations when $\mathbf{A}$ is $n \times n$. This property does not hold when rounded arithmetic is used, for instance when CG is performed on a computer. Errors during computation can reduce the conjugacy of the direction vectors and decrease efficiency, so that many more than $n$ iterations are required [31].

The convergence rate for CG is not entirely straightforward. In Shewchuk [41] this rate is described as

$\displaystyle \frac{\Vert \mathbf{e}_i \Vert_\mathbf{A}}{\Vert \mathbf{e}_0 \Vert_\mathbf{A}} \le 2 \left( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \right)^i$ (2.16)

where $\kappa$ is the condition number $\lambda_{\max} / \lambda_{\min}$, and $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalues of $\mathbf{A}$ in absolute value. If CG is terminated when the error under $\mathbf{A}$ is sufficiently small relative to the initial error, that is $\Vert \mathbf{e}_i \Vert_\mathbf{A} / \Vert \mathbf{e}_0 \Vert_\mathbf{A} < \epsilon$, then CG has a worst-case complexity of O$\left( m \sqrt{\kappa} \right)$, where $m$ is the number of nonzero entries in $\mathbf{A}$. The convergence rate and worst-case complexity are attained when the eigenvalues of $\mathbf{A}$ are evenly distributed between $\lambda_{\min}$ and $\lambda_{\max}$ and are distinct. Significantly better convergence is possible when the eigenvalues are clustered or duplicated [41].

There are many possibilities for deciding when to terminate CG besides the relative size of the error. Furthermore, the relation between the true error and the computed error is complex, and is the subject of ongoing numerical analysis research. Termination alternatives include the absolute size of the error, the relative or absolute size of the residual, or context-dependent quantities such as the relative change of the likelihood when computing a maximum likelihood estimate.
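
As a concrete illustration of two of these alternatives (hypothetical code, not from the thesis), a CG driver could accept a user-supplied stopping predicate such as the following:

    import numpy as np

    def relative_residual_test(r, b, eps=1e-6):
        # Stop when ||r_i|| / ||b|| < eps, an observable proxy for the error.
        return np.linalg.norm(r) < eps * np.linalg.norm(b)

    def relative_change_test(value, previous_value, eps=1e-6):
        # Stop when a context-dependent objective, such as a log-likelihood,
        # changes by less than a relative eps between iterations.
        return abs(value - previous_value) <= eps * max(abs(previous_value), 1.0)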

Finding a good termination criterion is very important for minimizing wasted time while ensuring sufficient optimality. When we discuss our use of CG later in this thesis, the most important issue will be proper termination. Because the worst-case time complexity of CG depends on the spectrum of $\mathbf{A}$, our CG termination methods should adapt to each dataset. We will also examine how the starting point $\mathbf{x}_0$ affects our methods in the one case where an informed choice is possible. Overall, we aim to hide all of these issues from the users of our algorithms.

