5.2.4 Indirect (IRLS) Summary

Throughout this section we have evaluated many IRLS parameters for
stability, termination, and speed. Most of the variations on IRLS were
rejected. We have retained `rrlambda` for regularization and `cgwindow`
to ensure continued improvement, and we kept `cgbinit` for `cgdeveps`
experiments because of its speed advantage. Because the `cgeps` and
`cgdeveps` termination methods behave so differently, we will treat
each as a distinct classifier for the remainder of this thesis. In the
following chapter we will compare IRLS performance to that of other
classification algorithms and see whether our fixed parameters are
adequate.

Algorithm 5 shows our final version of IRLS.
The `lrmax`, `rrlambda`, and `lreps` parameters are embedded in the
pseudocode to emphasize that they are fixed at their default values.
There are two important changes from Algorithm 4. The
first is at line 5.1, where we call CG
to find an approximate solution to the linear weighted least squares
subproblem. We have written this as a function call to
**CUSTOMCG**, since the details vary according to whether we are
using `cgeps` or `cgdeveps`. The second change is at
line 5.2. Instead of returning the value of β at the final iteration,
we return the value of β which had the minimum deviance among IRLS
iterations.
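The structure of this final IRLS, with the subproblem delegated to an approximate CG solver and the minimum-deviance β retained, can be sketched as follows. This is an illustrative NumPy sketch, not the thesis pseudocode: the function names (`irls`, `solve_wls`, `deviance`), the ridge form of the regularization, and the relative-deviance use of `lreps` are assumptions, and the default values are only placeholders for the fixed defaults discussed in this section.

```python
import numpy as np

def deviance(X, y, beta):
    """Binomial deviance: -2 times the log-likelihood of the logistic model."""
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    eps = 1e-12  # guard against log(0)
    return -2.0 * np.sum(y * np.log(mu + eps) + (1 - y) * np.log(1 - mu + eps))

def irls(X, y, solve_wls, lrmax=30, rrlambda=10.0, lreps=0.05):
    """Sketch of IRLS in which the weighted least squares subproblem is
    solved approximately by a caller-supplied CG routine (solve_wls),
    and the beta with minimum deviance over all iterations is returned.
    Parameter defaults here are illustrative stand-ins."""
    n, d = X.shape
    beta = np.zeros(d)
    best_beta, best_dev = beta.copy(), deviance(X, y, beta)
    prev_dev = best_dev
    for _ in range(lrmax):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        w = mu * (1.0 - mu)                                # IRLS weights
        z = X @ beta + (y - mu) / np.maximum(w, 1e-12)     # working response
        A = X.T @ (w[:, None] * X) + rrlambda * np.eye(d)  # regularized normal matrix
        b = X.T @ (w * z)
        beta = solve_wls(A, b, beta)       # approximate CG solve (CUSTOMCG stand-in)
        dev = deviance(X, y, beta)
        if dev < best_dev:                 # keep the minimum-deviance iterate
            best_dev, best_beta = dev, beta.copy()
        if abs(prev_dev - dev) / (abs(dev) + 1e-12) < lreps:
            break                          # relative deviance change is small
        prev_dev = dev
    return best_beta
```

Passing an exact solver such as `np.linalg.solve` for `solve_wls` recovers ordinary (regularized) IRLS; the CG variants below slot in the same way.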

When using `cgeps`, we modify CG as shown in
Algorithm 6. To prevent confusion, we
have replaced the scalar ε used in the original CG of
Algorithm 1 with `cgeps`. The primary difference
is the addition of `cgwindow` termination. This may be seen at
line 6.1 and the lines beginning
at 6.2. Note that this version of
CG ignores the β parameter sent by IRLS.
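A minimal sketch of this `cgeps` variant follows. The names and defaults are assumptions for illustration, as is the exact `cgwindow` semantics: here it is read as "stop after `cgwindow` consecutive iterations without a new best residual norm," which is one plausible reading of the continued-improvement test.

```python
import numpy as np

def cg_cgeps(A, b, cgeps=1e-3, cgwindow=3, cgmax=200):
    """CG for SPD A, terminating when ||r|| < cgeps, with a cgwindow
    guard: stop if the residual norm fails to improve on its best
    value for cgwindow consecutive iterations. Ignores any starting
    point from IRLS (the cgeps variant always starts at zero)."""
    x = np.zeros_like(b)
    r = b - A @ x          # equals b for the zero start
    d = r.copy()
    best = np.linalg.norm(r)
    stall = 0
    for _ in range(cgmax):
        if np.linalg.norm(r) < cgeps:
            break                      # cgeps termination
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)
        x = x + alpha * d
        r_new = r - alpha * Ad
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d
        r = r_new
        rn = np.linalg.norm(r)
        if rn < best:
            best, stall = rn, 0
        else:
            stall += 1
            if stall >= cgwindow:      # cgwindow: no continued improvement
                break
    return x
```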

Algorithm 7 shows our modification to
CG when using `cgdeveps`. Again we use `cgdeveps` in place of the
original scalar ε of Algorithm 1. Another
difference is the addition of the `cgwindow` technique, seen at
lines 7.2 and 7.3. The final difference
is at line 7.4, where we choose the best *x*
according to deviance. When compared to the CG
algorithm used with `cgeps`, described in the previous paragraph, there
are two important differences. The first is the nonzero
initialization of *x* at
line 7.1, which makes use of the β
parameter passed by our IRLS algorithm. This
is the result of the `cgbinit` parameter we have chosen to use with
`cgdeveps`. A side-effect of the nonzero *x* is a change in the
computation of the initial residual *r*, shown in the following line.
The second important difference from the `cgeps` version of CG is the
selection of the best *x* at line 7.4.
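The `cgdeveps` variant can be sketched in the same style. Again the names and defaults are illustrative assumptions: the deviance is supplied as a callback `dev`, `cgdeveps` is read as a relative-deviance-change threshold, `cgwindow` counts iterations without a new minimum deviance, the starting point `x0` stands in for the β passed from IRLS under `cgbinit`, and the residual is computed as r = b − A·x0 to account for the nonzero start.

```python
import numpy as np

def cg_cgdeveps(A, b, x0, dev, cgdeveps=0.005, cgwindow=3, cgmax=200):
    """CG started from the IRLS beta (cgbinit), so the initial residual
    is r = b - A @ x0 rather than simply b. Terminates on a small
    relative deviance change or a cgwindow of iterations without a new
    minimum deviance; returns the iterate with the smallest deviance."""
    x = x0.copy()
    r = b - A @ x                      # nonzero x0 changes the initial residual
    d = r.copy()
    best_x, best_dev = x.copy(), dev(x)
    prev_dev, stall = best_dev, 0
    for _ in range(cgmax):
        if np.linalg.norm(r) < 1e-12:  # already converged in the CG sense
            break
        Ad = A @ d
        denom = d @ Ad
        if denom <= 0:                 # guard against breakdown
            break
        alpha = (r @ r) / denom
        x = x + alpha * d
        r_new = r - alpha * Ad
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d
        r = r_new
        cur = dev(x)
        if cur < best_dev:             # track the best x by deviance
            best_dev, best_x, stall = cur, x.copy(), 0
        else:
            stall += 1
            if stall >= cgwindow:      # cgwindow on deviance
                break
        if abs(prev_dev - cur) / (abs(cur) + 1e-12) < cgdeveps:
            break                      # cgdeveps termination
        prev_dev = cur
    return best_x
```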

The lack of CG position history in
Algorithm 6 is regrettable, since it
complicates comparison of `cgeps` and `cgdeveps`. However, this
difference is unlikely to have had any material effect. Had CG
iterations caused the final iterate to be much worse than the best
iterate, then `cgdecay` would have had greater positive effect in the
experiments of Section 5.2.1.13. Instead, those
experiments suggested that the deviance grew slowly after reaching a
minimum. Since optimizing the likelihood does not necessarily
correspond to maximizing the AUC score [46], it is not
clear that choosing a slightly non-optimal iterate should have any
negative effect at all.

We have neglected many interesting possibilities for IRLS with CG. It
is possible to use CG with a *preconditioning matrix*, which may
contribute to stability and speed if some or all of the covariance
matrix can be estimated [41,31,7]. We made one experiment with a simple diagonal
preconditioner [41]. The result was
unsatisfactory but the failure was not investigated. Alternate
techniques for IRLS or CG termination could be tried, including the
use of *validation sets* [10]. Again we made an
unsuccessful foray, and did not pursue a remedy. Though CG is
designed for symmetric positive definite matrices, it has performed
well on a variety of ill-conditioned data matrices arising from
datasets such as `ds1` with linearly dependent or duplicate columns.
A related iterative method known as *biconjugate gradient* relaxes
the symmetry and definiteness requirements, and this might further
accelerate IRLS.
Besides biconjugate gradient, Greenbaum [7] lists several
more iterative methods for solving linear systems. Truncated-Newton
methods should be investigated for finding the zeros of the LR score
equations, and the result compared to our combination of IRLS and CG.
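For concreteness, the simple diagonal (Jacobi) preconditioner mentioned above can be sketched as follows. This is a generic textbook construction, not a reconstruction of our failed experiment: the function name and tolerances are assumptions, and M = diag(A) is the standard Jacobi choice.

```python
import numpy as np

def pcg_jacobi(A, b, tol=1e-8, maxiter=200):
    """Preconditioned CG with a diagonal (Jacobi) preconditioner
    M = diag(A): each residual is scaled by 1/diag(A) before use,
    which can help when A has widely varying diagonal entries."""
    Minv = 1.0 / np.diag(A)
    x = np.zeros_like(b)
    r = b - A @ x
    z = Minv * r               # preconditioned residual
    d = z.copy()
    rz = r @ z
    for _ in range(maxiter):
        if np.linalg.norm(r) < tol:
            break
        Ad = A @ d
        alpha = rz / (d @ Ad)
        x = x + alpha * d
        r = r - alpha * Ad
        z = Minv * r
        rz_new = r @ z
        d = z + (rz_new / rz) * d
        rz = rz_new
    return x
```

With M equal to the identity this reduces to ordinary CG, so it slots into the IRLS subproblem in exactly the same way as the unpreconditioned variants.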