2.1.2 The Right Iterative Method: Conjugate Gradient

We would like to have a set of linearly independent direction vectors $d_0, d_1, \ldots, d_{n-1}$ and scalars $\alpha_0, \alpha_1, \ldots, \alpha_{n-1}$ for which

$$e_0 = -\sum_{i=0}^{n-1} \alpha_i d_i \qquad (2.7)$$

This would allow us to travel exactly the right distance in each direction, reducing the error to zero and arriving at a minimizer of the quadratic form in $n$ or fewer steps. However, this alone is not enough. Suppose we knew $e_0$ and could solve Equation 2.7. The solution would require $O(n^3)$ operations to compute, a time complexity we dismissed as infeasible in Section 2.1. However, if the vectors $d_i$ were mutually orthogonal, the solution to this equation could be found in $O(n)$ time per $\alpha_i$.
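To make the cost difference concrete, the following sketch (our own illustrative values, not from the text) solves for the coefficients of an error vector over an orthonormal set of directions. Each coefficient is a single inner product, $O(n)$ work, rather than part of an $O(n^3)$ dense solve.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary "error" vector and an orthonormal set of directions
# (illustrative values; any orthonormal basis behaves the same way).
e0 = rng.normal(size=4)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # columns are d_0, ..., d_3

# With orthonormal d_i, the coefficient for each direction is just
# d_i^T e_0 -- one O(n) inner product per coefficient.
coeffs = np.array([Q[:, i] @ e0 for i in range(4)])

# The coefficients exactly reconstruct e_0.
reconstructed = Q @ coeffs
```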

Because we cannot know $e_0$ without knowing the solution, we cannot solve Equation 2.7 in any amount of time. However, we can compute $A e_0 = -r_0$, as shown in Equation 2.6. Left-multiplying Equation 2.7 by $A$ produces

$$A e_0 = -\sum_{i=0}^{n-1} \alpha_i A d_i \qquad (2.8)$$

We have returned to our $O(n^3)$ problem, and this time orthogonality of the $d_i$ will not help because we have left-multiplied by $A$. Instead we want mutual $A$-orthogonality: vectors $d_i$ and $d_j$ are $A$-orthogonal if $d_i^T A d_j = 0$ whenever $i \neq j$.

If we assume we can find a set of linearly independent directions $d_0, \ldots, d_{n-1}$ which are mutually $A$-orthogonal, we may solve for $\alpha_i$ by left-multiplying Equation 2.8 by $d_i^T$, since

$$-d_i^T r_0 = d_i^T A e_0 = -\sum_{j=0}^{n-1} \alpha_j\, d_i^T A d_j \qquad (2.9)$$

$$= -\alpha_i\, d_i^T A d_i \qquad (2.10)$$

by $A$-orthogonality, and hence

$$\alpha_i = \frac{d_i^T r_0}{d_i^T A d_i} \qquad (2.11)$$

Note that we further require $d_i^T A d_i \neq 0$ for $\alpha_i$ to be well-defined, which holds whenever $d_i$ is not in the null space of $A$. Our strategy, if a set of linearly independent $A$-orthogonal vectors not in the null space of $A$ exists, is to

1. Choose $x_0$, compute $r_0 = b - A x_0$, and set $i = 0$
2. Compute the direction $d_i$
3. Compute $\alpha_i = d_i^T r_0 / (d_i^T A d_i)$, as in Equation 2.11
4. Compute $x_{i+1} = x_i + \alpha_i d_i$
5. Increment $i$ and repeat from step 2 until $i = n$
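The steps above can be sketched numerically. Eigenvectors of a symmetric matrix form one concrete set of mutually $A$-orthogonal directions, so the sketch below (a toy system of our own construction, not from the text) uses them to reach the exact solution in $n = 2$ steps.

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

# Eigenvectors of a symmetric matrix are orthonormal, which makes them
# mutually A-orthogonal: d_i^T A d_j = lambda_j d_i^T d_j = 0 for i != j.
_, directions = np.linalg.eigh(A)

x = np.zeros(2)                      # step 1: choose x_0 ...
r0 = b - A @ x                       # ... and compute r_0
for i in range(2):
    d = directions[:, i]             # step 2: the next conjugate direction
    alpha = (d @ r0) / (d @ A @ d)   # step 3: Equation 2.11
    x = x + alpha * d                # step 4: x_{i+1} = x_i + alpha_i d_i
```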

Not only does there exist a set of linearly independent, mutually $A$-orthogonal direction vectors not in the null space of $A$, but each vector can be computed incrementally. Let $d_0 = r_0$. This allows computation of $\alpha_0$ and $x_1$, and all remaining direction vectors can be computed as needed via

$$d_{i+1} = r_{i+1} + \beta_{i+1} d_i \qquad (2.12)$$

where

$$\beta_{i+1} = \frac{r_{i+1}^T r_{i+1}}{r_i^T r_i} \qquad (2.13)$$

[41,31]. Another name for $A$-orthogonality is conjugacy, and it is from the conjugacy of its search directions that CG gets its name. Our final linear CG algorithm is shown in Algorithm 1. Note that line 1.1 in this algorithm can be equivalently written as $r_{i+1} = r_i - \alpha_i A d_i$ due to properties described below. This formulation eliminates the need to store $b$ once the initial residual has been computed. See Shewchuk [41] for details.
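Putting Equations 2.11 through 2.13 together gives a compact implementation. The function below is a sketch in the spirit of Algorithm 1, not a reproduction of it; the variable names, stopping rule, and the use of $r_i^T r_i$ in place of $d_i^T r_i$ (the two are equal, by the orthogonality properties discussed later) are our own choices.

```python
import numpy as np

def linear_cg(A, b, x0=None, tol=1e-10, max_iter=None):
    """Linear CG sketch: d_0 = r_0, then Equations 2.11-2.13."""
    n = len(b)
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x                      # r_0
    d = r.copy()                       # d_0 = r_0
    rs = r @ r
    for _ in range(max_iter if max_iter is not None else n):
        Ad = A @ d
        alpha = rs / (d @ Ad)          # Equation 2.11 (equivalent form)
        x = x + alpha * d
        r = r - alpha * Ad             # residual recurrence; b is not needed
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol:
            break
        d = r + (rs_new / rs) * d      # Equations 2.12 and 2.13
        rs = rs_new
    return x

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = linear_cg(A, b)
```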

It is worth emphasizing that each direction vector $d_{i+1}$ is a linear combination of the current residual $r_{i+1}$ and the previous direction vector $d_i$. Because the direction vectors are linearly independent, a simple induction implies that $d_i$ is a linear combination of $r_0, r_1, \ldots, r_i$, and hence that

$$\mathrm{span}\{d_0, \ldots, d_i\} = \mathrm{span}\{r_0, \ldots, r_i\} \qquad (2.14)$$

The space searched so far, $\mathrm{span}\{d_0, \ldots, d_i\}$, may also be written as the Krylov subspace

$$\mathrm{span}\{d_0, \ldots, d_i\} = \mathrm{span}\{r_0, A r_0, A^2 r_0, \ldots, A^i r_0\} \qquad (2.15)$$

See Shewchuk [41] and Nash and Sofer [31] for proofs of these properties.

Figure 2.2 shows the conjugate directions used by CG when started from the same place as Steepest Descent was in Figure 2.1. Because CG will converge within $n$ iterations, this is a worst-case starting point for CG, requiring all $n$ steps. CG does so much better than Steepest Descent in our example because it assumes, correctly, that $f$ is a quadratic form. If $f$ is not a quadratic form then adjustments must be made. The resulting algorithm is often referred to as *nonlinear CG*, and is discussed briefly in Section 2.2. Predictably, the CG algorithm discussed so far is called *linear CG*.

We have omitted a demonstration of the existence of a linearly independent set of conjugate directions which are not in the null space of $A$, and have also omitted many interesting properties of CG. The value of $\alpha_i$ in Equation 2.11 would be the same had we defined it as the step length for a minimizing line-search in direction $d_i$ starting at $x_i$. The minimizer is $x_{i+1} = x_i + \alpha_i d_i$. This is the point at which the gradient, and hence the residual, is perpendicular to $d_i$; thus $r_{i+1}$ is perpendicular to $d_i$. Equation 2.14 then implies that each new residual is perpendicular to the space already searched. It also follows that the residuals are mutually orthogonal and that $f$ is minimized over $x_0 + \mathrm{span}\{d_0, \ldots, d_i\}$ at $x_{i+1}$. These properties help explain the efficiency per iteration of CG.
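These orthogonality properties are easy to check numerically. The sketch below (a random SPD system of our own construction) records the residuals from a CG run and inspects their Gram matrix, whose off-diagonal entries should vanish up to rounding error.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 6))
A = M @ M.T + 6 * np.eye(6)     # a random SPD matrix (illustrative)
b = rng.normal(size=6)

# A plain CG run, keeping every residual.
x = np.zeros(6)
r = b - A @ x
d = r.copy()
residuals = [r.copy()]
for _ in range(6):
    Ad = A @ d
    alpha = (r @ r) / (d @ Ad)
    x = x + alpha * d
    r_new = r - alpha * Ad
    d = r_new + ((r_new @ r_new) / (r @ r)) * d
    r = r_new
    residuals.append(r.copy())

# Gram matrix of the nonzero residuals r_0, ..., r_5: off-diagonal
# entries are zero up to rounding, i.e. the residuals are orthogonal.
R = np.array(residuals[:6])
G = R @ R.T
off = G - np.diag(np.diag(G))
max_rel_off = np.max(np.abs(off)) / np.max(np.abs(G))
```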

One may also observe that the computations needed in each CG iteration depend only on $A$ and the previous iteration, with the exception of $\alpha_i$, which as written in Equation 2.11 depends on $r_0$. In fact $\alpha_i$ can also be written in terms of the previous iteration [41,31]. Therefore we only need to store a few vectors, plus $A$. Because only matrix-vector computations are needed, storage and computation may exploit the structure or sparsity of $A$.
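Because $A$ enters only through products $A v$, an implementation can accept a matrix-vector routine instead of an explicit matrix. The sketch below (our own example, not from the text) applies a tridiagonal operator, the 1-D discrete Laplacian, in $O(n)$ time and $O(n)$ storage rather than the $O(n^2)$ a dense matrix would cost.

```python
import numpy as np

n = 200

def matvec(v):
    """Apply the tridiagonal 1-D Laplacian without forming the matrix."""
    Av = 2.0 * v
    Av[:-1] -= v[1:]
    Av[1:] -= v[:-1]
    return Av

b = np.ones(n)
x = np.zeros(n)
r = b - matvec(x)
d = r.copy()
rs = r @ r
for _ in range(20 * n):            # rounding may demand more than n steps
    Ad = matvec(d)
    alpha = rs / (d @ Ad)
    x = x + alpha * d
    r = r - alpha * Ad
    rs_new = r @ r
    if np.sqrt(rs_new) < 1e-10:
        break
    d = r + (rs_new / rs) * d
    rs = rs_new
```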

It was demonstrated above that CG should arrive at the exact solution in no more than $n$ iterations when $A$ is $n \times n$. This property does not hold when rounded arithmetic is used, for instance when CG is performed on a computer. Errors during computation can reduce the conjugacy of the direction vectors and decrease efficiency, so that many more than $n$ iterations may be required [31].
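A quick experiment illustrates this loss of finite termination. The diagonal system below (condition number $10^6$, our own construction) would terminate in exactly $n = 20$ steps in exact arithmetic, but in floating point the iteration count needed to reach a modest residual tolerance exceeds $n$.

```python
import numpy as np

n = 20
lam = np.logspace(0.0, 6.0, n)     # eigenvalues from 1 to 1e6, kappa = 1e6
A = np.diag(lam)
b = np.ones(n)

x = np.zeros(n)
r = b - A @ x
d = r.copy()
rs = r @ r
iters = 0
# Iterate until the relative residual is small (cap guards the loop).
while np.sqrt(rs) > 1e-8 * np.linalg.norm(b) and iters < 50000:
    Ad = A @ d
    alpha = rs / (d @ Ad)
    x = x + alpha * d
    r = r - alpha * Ad
    rs_new = r @ r
    d = r + (rs_new / rs) * d
    rs = rs_new
    iters += 1

# In exact arithmetic, iters would be at most n = 20.
```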

The convergence rate of CG is not entirely straightforward. In Shewchuk [41] this rate is described as

$$\frac{\|e_i\|_A}{\|e_0\|_A} \leq 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^i \qquad (2.16)$$

where $\|e\|_A = \sqrt{e^T A e}$ is the energy norm of the error and $\kappa = \lambda_{\max}/\lambda_{\min}$ is the condition number of $A$.
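The bound in Equation 2.16 is simple to evaluate. The sketch below (a random SPD system of our own construction) compares the energy norm of the error after each iteration against the right-hand side of the bound.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(30, 30))
A = M @ M.T + np.eye(30)            # a random SPD matrix (illustrative)
b = rng.normal(size=30)
x_star = np.linalg.solve(A, b)      # reference solution

eigs = np.linalg.eigvalsh(A)
kappa = eigs[-1] / eigs[0]          # condition number of A
rho = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)

def energy(e):
    """Energy norm ||e||_A = sqrt(e^T A e)."""
    return np.sqrt(e @ A @ e)

x = np.zeros(30)
r = b - A @ x
d = r.copy()
e0 = energy(x - x_star)
bound_holds = True
for i in range(1, 16):
    Ad = A @ d
    alpha = (r @ r) / (d @ Ad)
    x = x + alpha * d
    r_new = r - alpha * Ad
    d = r_new + ((r_new @ r_new) / (r @ r)) * d
    r = r_new
    # Equation 2.16, with a tiny slack for rounding error.
    bound_holds = bound_holds and energy(x - x_star) <= 2 * rho**i * e0 + 1e-9
```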
There are many possibilities for deciding when to terminate CG besides the relative size of the error. Furthermore, the relation between the true error and the computed error is complex, and is the subject of ongoing research in numerical analysis. Termination alternatives include thresholds on the absolute size of the error, on the relative or absolute size of the residual, or on context-dependent quantities such as the relative change of the likelihood when computing a maximum likelihood estimate.

Finding a good termination criterion is very important to minimize wasted time while ensuring sufficient optimality. When we discuss our use of CG later in this thesis, the most important issue will be proper termination. Because the worst-case time complexity of CG depends on the spectrum of $A$, our CG termination methods should adapt to each dataset. We will also examine how the starting point affects our methods in the one case where an informed choice is possible. Overall, we aim to hide all of these issues from the users of our algorithms.