5.2.3 Indirect (IRLS) Speed

Most of the speed-enhancing techniques we have tried also belong to
the stability or termination sections above. One technique not yet
discussed aims at reducing the number of CG iterations by selecting
better CG starting points. In all previous experiments the
parameter vector is reset to zero at the start of each IRLS iteration,
excluding the constant term if `binitmean` is used. If we enable the speed
parameter `cgbinit`, then the parameter vector is only set to zero for the
first IRLS iteration; subsequent iterations start CG from the
previous IRLS iteration's stopping point. This parameter may be found
at the bottom of Table 5.3.
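The warm-start idea can be sketched as a minimal IRLS loop whose weighted least squares subproblems are solved with linear CG, where a `cgbinit` flag controls the CG starting point. This is our own illustrative sketch, not the thesis implementation; function names and the weight clipping are assumptions.

```python
import numpy as np

def cg_solve(A, b, x0, tol=1e-8, max_iter=200):
    """Plain linear CG for A x = b, started from x0."""
    x = x0.astype(float).copy()
    r = b - A @ x
    d = r.copy()
    iters = 0
    for _ in range(max_iter):
        rr = r @ r
        if np.sqrt(rr) < tol:
            break
        Ad = A @ d
        alpha = rr / (d @ Ad)
        x = x + alpha * d
        r = r - alpha * Ad
        d = r + (r @ r / rr) * d
        iters += 1
    return x, iters

def irls_lr(X, y, n_irls=8, cgbinit=True):
    """IRLS for logistic regression with CG-solved subproblems.
    With cgbinit=True, CG starts from the previous IRLS iterate
    instead of the zero vector (sketch; names are our own)."""
    beta = np.zeros(X.shape[1])
    total_cg = 0
    for _ in range(n_irls):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = np.maximum(mu * (1.0 - mu), 1e-6)   # IRLS weights, clipped for stability
        A = X.T @ (w[:, None] * X)              # weighted normal matrix
        z = X @ beta + (y - mu) / w             # working response
        b = X.T @ (w * z)
        x0 = beta if cgbinit else np.zeros_like(beta)
        beta, it = cg_solve(A, b, x0)
        total_cg += it
    return beta, total_cg
```

Because each subproblem here is solved to a tight residual tolerance, both starting strategies reach essentially the same iterates; the difference lies only in how many CG iterations each subproblem consumes.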

| eps-type | cgb | ds1 | imdb | citeseer | ds2 | ds1.100pca | ds1.10pca |
|----------|-----|-----|------|----------|-----|------------|-----------|
| cgeps    | -   | 0.949 / 86 | 0.983 / 370 | 0.946 / 44 | 0.720 / 3417 | 0.918 / 44 | 0.846 / 7 |
| cgeps    | x   | 0.949 / 87 | 0.983 / 407 | 0.945 / 50 | 0.720 / 3264 | 0.918 / 48 | 0.845 / 8 |
| cgdeveps | -   | 0.946 / 78 | 0.983 / 514 | 0.946 / 74 | 0.718 / 2923 | 0.916 / 43 | 0.846 / 12 |
| cgdeveps | x   | 0.948 / 63 | 0.983 / 402 | 0.945 / 70 | 0.722 / 1978 | 0.913 / 36 | 0.842 / 9 |

Each cell shows AUC / time in seconds (Table 5.22).

Table 5.22 shows the AUC and times achieved before
and after `cgbinit` was added to the default parameters of the previous
section. The `cgb` column indicates whether or not `cgbinit` was enabled.
We observe little change in `cgeps` experiments, and it is not always
to our benefit to enable `cgbinit`. For experiments using `cgdeveps` it is
clear that `cgbinit` causes quicker termination with nearly identical
scores.

Table 5.23 shows the mean IRLS and CG iteration counts over
the ten folds of the experiments of Table 5.22. When
`cgeps` is used, the numbers of IRLS and CG iterations are similar
whether or not `cgbinit` is enabled. Comparison of the point-by-point
progress of CG for the `ds1` experiments having `cgbinit` disabled or
enabled revealed that IRLS iterations were finishing in nearly the
same place. This is despite the difference in starting points for the
second and later IRLS iterations when `cgbinit` is enabled. It appears to
take as long from either starting point to reach the termination
region for the objective function, both in time and the number of
iterations. The final deviances are similar, as are the AUC scores.
In summary, the `cgeps` experiments with `cgbinit` enabled run anywhere
from five percent faster to thirteen percent slower, and arrive at
nearly the same place. We see no reason to enable `cgbinit` for
`cgeps` experiments.

For `cgdeveps` experiments, Table 5.23 suggests
something very different is happening when `cgbinit` is enabled. The
number of CG iterations per fold is different. Since our CG iteration
counts represent the number of directions searched, we can conclude
that different paths are taken toward the global optimum. Because the
relative difference of the deviances is used to terminate CG
iterations, there is no reason to believe that these paths will end in
the same place; we only know that the relative improvement per CG
iteration had decreased below a threshold. Point-by-point comparisons
of CG and IRLS progress for the `cgdeveps` `ds1` experiments confirm
that the distance between the terminating parameter estimates is as
much as one-third the mean of the two parameter vectors' norms.
Regardless of this discrepancy, the final parameter estimates with
`cgbinit` produce similar AUC scores to those of previous
experiments, and they arrive at those scores sooner. Therefore we
will use `cgbinit` with future `cgdeveps` experiments.
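The distance measure used in the comparison above can be sketched as follows; `relative_gap` is a hypothetical helper of our own, not the thesis code.

```python
import numpy as np

def relative_gap(beta_a, beta_b):
    """Gap between two terminating parameter vectors, relative to the
    mean of their norms (hypothetical helper, not the thesis code)."""
    gap = np.linalg.norm(beta_a - beta_b)
    return gap / (0.5 * (np.linalg.norm(beta_a) + np.linalg.norm(beta_b)))
```

A value near one-third would correspond to the largest discrepancy observed in the `cgdeveps` `ds1` experiments.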

Figure 5.11 shows an updated version of our LR
description tree. Note that `cgbinit` only appears on the `cgdeveps` branch.

We are still faced with the question of why the number of CG
iterations does not change when `cgbinit` is used with `cgeps`, but does
change when `cgbinit` is used with `cgdeveps`.
One explanation is that without `cgbinit`, terminating on `cgeps` and
`cgdeveps` with our default parameters forces too many CG iterations
for some datasets. When `cgeps` is used, linear CG will not stop
until the gradient of the quadratic objective function is sufficiently
small. If this termination region is particularly tiny, then the
starting point might not materially affect how many iterations are
needed to reach it, regardless of the path taken. This would prevent
`cgbinit` from reducing the number of CG iterations needed to reach
termination.
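For intuition, the gradient test described above is exactly the CG residual test: for the quadratic objective q(x) = ½xᵀAx − bᵀx, the gradient is Ax − b, which is the negative of the CG residual. A minimal sketch of such a stopping predicate (the parameter name is our assumption, not the thesis code):

```python
import numpy as np

def cgeps_done(A, b, x, cgeps=1e-6):
    """Stop linear CG once the gradient of the quadratic objective,
    A x - b (the negative CG residual), has a small norm.
    The cgeps name mirrors the text; the threshold is illustrative."""
    return np.linalg.norm(A @ x - b) < cgeps
```

The smaller `cgeps` is, the tinier the termination region around the subproblem optimum, and the less the starting point can matter to the iteration count.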

On the other hand, with `cgdeveps` the position of termination can
change. Adding `cgbinit` to `cgdeveps` experiments changes the path and
the termination point, allowing a change in the number of CG
iterations. In particular, consider what happens once IRLS iterations
are within a neighborhood of the optimum in which the Hessian changes
relatively little. The optimum of the weighted least squares
subproblem will be near the LR MLE. Continuing CG iterations from
where the previous set left off may be similar to allowing CG to
run longer in the previous iteration, since the new Hessian is
similar to the old Hessian. Combining this observation with the
properties of linear CG discussed in Section 2.1.2, it
is unsurprising to find that the total number of CG iterations needed
to reach the optimum is smaller when `cgbinit` is used.
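A deviance-based rule of this kind can be sketched as linear CG on the WLS subproblem that evaluates the LR deviance after each direction and stops when the relative improvement falls below a threshold. This is a hedged reconstruction of what such a rule might look like; the function names and threshold are our assumptions, not the thesis code.

```python
import numpy as np

def deviance(X, y, beta):
    """LR deviance, -2 * log-likelihood, computed stably."""
    eta = X @ beta
    return 2.0 * np.sum(np.log1p(np.exp(-np.abs(eta)))
                        + np.maximum(eta, 0.0) - y * eta)

def cg_cgdeveps(A, b, X, y, x0, cgdeveps=1e-3, max_iter=200):
    """Linear CG on the WLS subproblem A x = b, terminated when the
    relative per-direction improvement in LR deviance drops below
    cgdeveps (an illustrative sketch of such a rule)."""
    x = x0.astype(float).copy()
    r = b - A @ x
    d = r.copy()
    dev = deviance(X, y, x)
    n_dirs = 0
    for _ in range(max_iter):
        rr = r @ r
        if rr == 0.0:
            break
        Ad = A @ d
        alpha = rr / (d @ Ad)
        x = x + alpha * d
        r = r - alpha * Ad
        d = r + (r @ r / rr) * d
        n_dirs += 1
        new_dev = deviance(X, y, x)
        if (dev - new_dev) / abs(dev) < cgdeveps:
            break
        dev = new_dev
    return x, n_dirs
```

Because this test watches relative improvement rather than distance from the subproblem optimum, different CG paths can legitimately stop at different points, consistent with the discrepancy discussed above.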

| eps-type | cgb | ds1 | imdb | citeseer | ds2 | ds1.100pca | ds1.10pca |
|----------|-----|-----|------|----------|-----|------------|-----------|
| cgeps    | -   | 6.0 / 29.3 | 7.0 / 5.3 | 6.0 / 2.2 | 8.0 / 32.9 | 5.0 / 15.5 | 5.0 / 7.6 |
| cgeps    | x   | 6.0 / 28.9 | 7.0 / 5.3 | 6.0 / 2.0 | 8.0 / 30.4 | 5.0 / 16.3 | 5.0 / 8.6 |
| cgdeveps | -   | 5.8 / 15.5 | 6.8 / 5.9 | 6.0 / 3.5 | 7.9 / 19.3 | 5.0 / 8.8 | 5.0 / 8.2 |
| cgdeveps | x   | 6.0 / 10.2 | 6.8 / 3.6 | 6.0 / 2.5 | 7.0 / 13.6 | 5.0 / 6.2 | 5.0 / 5.2 |

Each cell shows mean IRLS (LR) iterations / mean CG iterations (Table 5.23).