Most of the speed-enhancing techniques we have tried also belong to the stability or termination sections above. One technique not yet discussed aims to reduce the number of CG iterations by selecting better CG starting points. In all previous experiments the parameter vector is set to zero at the start of each IRLS iteration, excluding the constant term if binitmean is used. If we enable the speed parameter cgbinit, then the parameter vector is set to zero only for the first IRLS iteration, and subsequent IRLS iterations start CG from the point at which the previous iteration stopped. This parameter may be found at the bottom of Table 5.3.
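The effect of cgbinit on the inner linear solve can be sketched as follows. This is an illustrative reconstruction, not the thesis implementation: the function names, the clipping of the linear predictor, and the weight floor are our own assumptions, and regularization and the binitmean option are omitted for brevity.

```python
import numpy as np

def linear_cg(A, b, x, max_iter=200, tol=1e-8):
    """Plain linear CG for Ax = b, started from x (sketch)."""
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:          # gradient of the quadratic is small
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def irls(X, y, max_irls=10, cgbinit=False):
    """IRLS for logistic regression with an optional warm-started
    inner CG solve; structure and names are illustrative."""
    beta = np.zeros(X.shape[1])
    for it in range(max_irls):
        eta = np.clip(X @ beta, -30, 30)          # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))           # fitted probabilities
        w = np.maximum(mu * (1.0 - mu), 1e-10)    # IRLS weights (floored)
        z = eta + (y - mu) / w                    # working response
        A = X.T @ (w[:, None] * X)                # weighted least squares system
        b = X.T @ (w * z)
        # cgbinit: start CG at the previous estimate rather than at zero
        x0 = beta if cgbinit else np.zeros_like(beta)
        beta = linear_cg(A, b, x0)
    return beta
```

When each inner solve is run to tight convergence, as here, both settings reach essentially the same estimate; the interesting behavior in the experiments below arises when CG is terminated early.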
Table 5.22 shows the AUC and times achieved before and after cgbinit was added to the default parameters of the previous section. The cgb column indicates whether or not cgbinit was enabled. We observe little change in cgeps experiments, and it is not always to our benefit to enable cgbinit. For experiments using cgdeveps it is clear that cgbinit causes quicker termination with nearly identical scores.
Table 5.23 shows the mean IRLS and CG iteration counts over the ten folds of the experiments of Table 5.22. When cgeps is used, the numbers of IRLS and CG iterations are similar whether or not cgbinit is enabled. A point-by-point comparison of CG progress for the ds1 experiments with cgbinit disabled and enabled revealed that IRLS iterations were finishing in nearly the same place, despite the different starting points for the second and later IRLS iterations when cgbinit is enabled. It appears to take as long from either starting point to reach the termination region of the objective function, both in time and in the number of iterations. The final deviances are similar, as are the AUC scores. In summary, the cgeps experiments with cgbinit run at most five percent faster, run up to thirteen percent slower, and arrive at nearly the same place. We see no reason to enable cgbinit for cgeps experiments.
For cgdeveps experiments, Table 5.23 suggests that something very different happens when cgbinit is enabled. The number of CG iterations per fold differs. Since our CG iteration counts represent the number of directions searched, we can conclude that different paths are taken toward the global optimum. Because the relative difference of the deviances is used to terminate CG iterations, there is no reason to believe that these paths will end in the same place; we only know that the relative improvement per CG iteration has fallen below a threshold. Point-by-point comparisons of CG and IRLS progress for the cgdeveps ds1 experiments confirm that the distance between the terminating parameter estimates can be as much as one-third of the mean of the two parameter vectors' norms. Despite this discrepancy, the final parameter estimates with cgbinit produce AUC scores similar to those of previous experiments, and they arrive at those scores sooner. Therefore we will use cgbinit in future cgdeveps experiments.
Figure 5.11 shows an updated version of our LR description tree. Note that cgbinit only appears on the cgdeveps branch.
We are still faced with the question of why the number of CG iterations does not change when cgbinit is used with cgeps, but does change when cgbinit is used with cgdeveps. One explanation is that without cgbinit, terminating on cgeps and cgdeveps with our default parameters forces too many CG iterations for some datasets. When cgeps is used, linear CG will not stop until the gradient of the quadratic objective function is sufficiently small. If this termination region is particularly tiny, then the starting point might not materially affect how many iterations are needed to reach it, regardless of the path taken. This would prevent cgbinit from reducing the number of CG iterations needed to reach termination.
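The gradient-based stopping rule just described can be sketched as a predicate on the CG residual; the function name and interface are illustrative assumptions, not the thesis code.

```python
import numpy as np

def cgeps_terminate(A, b, x, cgeps):
    """Illustrative cgeps-style test: for the quadratic objective
    f(x) = x'Ax/2 - b'x, the gradient is Ax - b, so linear CG may stop
    once the residual r = b - Ax is sufficiently small in norm."""
    r = b - A @ x
    return np.linalg.norm(r) < cgeps
```

Because this test depends only on how close x is to the subproblem's solution, a warm start that changes the path need not change how many iterations are spent reaching the termination region.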
On the other hand, with cgdeveps the termination point can change. Adding cgbinit to cgdeveps experiments changes the path and the termination point, allowing a change in the number of CG iterations. In particular, consider what happens once IRLS iterations are within a neighborhood of the optimum in which the Hessian changes relatively little. The optimum of the weighted least squares subproblem will be near the LR MLE. Continuing CG iterations where the previous set left off is then much like allowing CG to run longer in the previous iteration, since the new Hessian is similar to the old one. Combining this observation with the properties of linear CG discussed in Section 2.1.2, it is unsurprising that the total number of CG iterations needed to reach the optimum is smaller when cgbinit is used.
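The deviance-based rule, by contrast, watches only the relative improvement between successive CG iterations. A minimal sketch follows; the function names, the clipping constant, and the exact form of the comparison are our own assumptions rather than the thesis implementation.

```python
import numpy as np

def deviance(X, y, beta):
    """Binomial deviance (minus twice the log-likelihood); illustrative,
    with the linear predictor clipped to avoid overflow."""
    eta = np.clip(X @ beta, -30, 30)
    mu = 1.0 / (1.0 + np.exp(-eta))
    return -2.0 * np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

def cgdeveps_terminate(dev_prev, dev_curr, cgdeveps):
    """Illustrative cgdeveps-style test: stop CG when the relative change
    in deviance between successive iterations drops below the threshold.
    Note this says nothing about where in parameter space CG stops, which
    is why a warm start can move the termination point."""
    return abs(dev_prev - dev_curr) <= cgdeveps * abs(dev_prev)
```

Since the test fires wherever progress flattens out, two runs that take different paths can satisfy it at quite different parameter vectors, consistent with the one-third-of-norm discrepancy observed above.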