Throughout this section we have evaluated many parameters for stability, termination and speed. Most of the variations on IRLS were rejected. We have retained rrlambda for regularization, cgwindow to assure continued improvement, and cgbinit was kept for cgdeveps experiments due to a speed improvement. Due to the interesting and different behaviors of the cgeps and cgdeveps termination methods, we will treat each as a different classifier for the remainder of this thesis. In the following chapter we will compare IRLS performance to that of other classification algorithms and see if our fixed parameters are adequate.
Algorithm 5 shows our final version of IRLS. The lrmax, rrlambda, and lreps parameters are embedded in the pseudocode to emphasize that they are fixed at their default values. There are two important changes from Algorithm 4. The first is at line 5.1, where we call CG to find an approximate solution to the linear weighted least squares subproblem. We have have written this as a function call to CUSTOMCG, since the details vary according to whether we are using cgeps or cgdeveps. The second change is at line 5.2. Instead of returning the value of at the final iteration, we choose the value of which had the minimum deviance among IRLS iterations.
When using cgeps, we modify CG as shown in Algorithm 6. To prevent confusion, we have replaced the scalar used in the original CG of Algorithm 1 with . The primary difference is the addition of cgwindow termination. This may be seen at line 6.1 and the lines beginning at 6.2. Note that this version of CG ignores the parameter sent by IRLS.
Algorithm 7 shows our modification to CG when using cgdeveps. Again we use in place of the original scalar of Algorithm 1. Another difference is the addition of the cgwindow technique, seen at or around lines 7.2 and 7.3. The final difference is at line 7.4, where we choose the best according to deviance. When compared to the CG algorithm used with cgeps, described in the previous paragraph, there are two important differences. The first is the nonzero initialization of at line 7.1, which makes use of the parameter passed by our IRLS algorithm. This is the result of the cgbinit parameter we have chosen to use with cgdeveps. A side-effect of the nonzero is a change in the computation of , shown in the following line. The second important difference with the cgeps version of CG is the selection of the best at line 7.4.
The lack of CG position history in Algorithm 6 is regrettable, since it complicates comparison of cgeps and cgdeveps. However, this difference is unlikely to have had any material effect. Had CG iterations caused the final iterate to be much worse than the best iterate, then cgdecay would have had greater positive effect in the experiments of Section 220.127.116.11. Instead, those experiments suggested that the deviance grew slowly after reaching a minimum. Since optimizing the likelihood does not necessarily correspond to maximizing the AUC score , it is not clear that choosing a slightly non-optimal iterate should have any negative effect at all.
We have neglected many interesting possibilities for IRLS with CG. It is possible to use CG with a preconditioning matrix, which may contribute to stability and speed if some or all of the covariance matrix can be estimated [41,31,7]. We made one experiment with a simple diagonal preconditioner . The result was unsatisfactory but the failure was not investigated. Alternate techniques for IRLS or CG termination could be tried, including the use of validation sets . Again we made an unsuccessful foray, and did not pursue a remedy. Though CG is designed for positive definite matrices, it has performed well on a variety of ill-conditioned data matrices arising from datasets such as ds1 with linearly dependent or duplicate columns. A version of CG known as biconjugate gradient is designed for positive semidefinite matrices, and this might further accelerate IRLS. Besides biconjugate gradient, Greenbaum  lists several more iterative methods for solving linear systems. Truncated-Newton methods should be investigated for finding the zeros of the LR score equations, and the result compared to our combination of IRLS and CG.