Many of the parameters in Section 5.2.1 also affect termination and optimality, but they were aimed at enhancing stability and providing a robust foundation for the remaining parameters. The parameters discussed in this section are primarily concerned with achieving optimal scores without wasted computation. These include the parameters lreps, cgeps and cgdeveps. Two additional parameters, lrmax and cgmax, set upper limits on the number of IRLS and CG iterations, respectively. These parameters are summarized in Table 5.3.
IRLS is terminated when the relative difference of the deviance falls below lreps or the number of IRLS iterations reaches lrmax. See Section 4.4 for details. Deviance-based termination is common, with the relative difference suggested by McIntosh [26]. Because relative deviances are scale-independent, it is not difficult to find a value for lreps which works well for most datasets.
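As a concrete illustration, the stopping rule above can be sketched as follows. The function name, signature and defaults are hypothetical, chosen only to mirror the lreps/lrmax semantics described here:

```python
def irls_should_stop(dev_prev, dev_curr, iteration, lreps=0.05, lrmax=30):
    """Hypothetical IRLS stopping test (a sketch, not the thesis code).

    Terminate when the relative difference of the deviance falls
    below lreps, or when the iteration count reaches lrmax.
    """
    if iteration >= lrmax:
        return True  # runaway cap
    # The relative difference is scale-independent, which is why a
    # single lreps value tends to work across datasets.
    rel_diff = abs(dev_prev - dev_curr) / abs(dev_prev)
    return rel_diff < lreps
```

For example, a step that moves the deviance from 100.0 to 99.9 gives a relative difference of 0.001, which is below the default lreps and would stop the outer loop.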
Several options for CG termination were discussed in Section 2.1.2. Our focus is understanding whether and when LR is suitable for classification, especially in data mining applications, and we have examined only a few termination criteria. The cgeps parameter is used for the traditional residual termination criterion. For linear CG inside IRLS, which solves the weighted system (X^T W X) beta = X^T W z, the residual vector is r = X^T W z - (X^T W X) beta (Equation 5.1).
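The residual computation can be sketched as follows, assuming a dense design matrix X, the diagonal IRLS weight matrix W stored as a vector, and the working responses z; all names here are illustrative, not the thesis implementation:

```python
import numpy as np

def cg_residual(X, W_diag, z, beta):
    """Residual r = b - A*beta for the IRLS linear system, where
    A = X^T W X and b = X^T W z (W diagonal). A sketch only.

    Computed as matrix-vector products so A is never formed
    explicitly, matching how linear CG is typically applied.
    """
    b = X.T @ (W_diag * z)             # right-hand side X^T W z
    A_beta = X.T @ (W_diag * (X @ beta))  # X^T W X beta without forming A
    return b - A_beta
```

With beta initialized to zero, this returns b = X^T W z, which is the simplification of the initial residual noted below.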
[Figure 5.5: AUC (top) and time (bottom) as functions of lreps and cgeps, for datasets ds2 and ds1.10pca.]

The top half of Figure 5.5 shows how the AUC score varies as a function of lreps and cgeps for datasets ds2 and ds1.10pca. Each line on these plots represents a different lreps value, and the horizontal axis represents cgeps. For ds2 there is little difference between lreps lines so long as cgeps is not made too large. For ds1.10pca there is again little difference between lreps values, and no change across cgeps values. The ds1.10pca plot demonstrates that an overly loose value of lreps, such as 0.5, can be somewhat harmful. This lreps sensitivity is dwarfed by the effect of too large a cgeps for ds2, and so we will be very careful when choosing a default cgeps value. The bottom half of Figure 5.5 illustrates how speed changes with lreps and cgeps. As one would expect, the shortest times occur for the largest lreps and cgeps values. These results are representative of experiments on all six datasets used in our previous experiments.
[Figure 5.6: AUC and time versus cgeps for all six datasets, with lreps=0.05.]

The largest, and hence fastest, ``safe'' value of lreps appears to be 0.1. To be somewhat conservative, we will choose lreps=0.05 when using cgeps to terminate CG. Choosing a value for cgeps is more difficult. Figure 5.6 shows how AUC and speed vary with cgeps on all six datasets used in previous experiments, with lreps set to 0.05. Note that the vertical ``Time'' axis is logarithmic. It appears that any small value of cgeps will produce the same AUC score, but we incur a time penalty if cgeps is too small. This time penalty is severe for ds1, imdb and ds2. The largest sensible cgeps values for ds1, with respect to AUC, are quite different from those for imdb and ds2. Any conservative choice of cgeps will cause unduly slow results for ds1, which leads us to reconsider how cgeps is scaled.
The scaling used in every previous experiment multiplied the reported cgeps by the number of attributes in the dataset. This choice was made using the naive argument that the bound on the CG residual should depend on its dimensionality. Perhaps the most obvious change is to multiply by the square root of the dimensionality, since a Euclidean norm is used. Another option is to multiply cgeps by the norm of the initial CG residual before any IRLS iterations, so that problems which start with a large residual terminate CG iterations sooner. A similar termination criterion is suggested in Shewchuk [41] and discussed briefly in Section 2.1.2. When the binitmean parameter is disabled, the LR parameter vector is initialized to zero and the initial CG residual of Equation 5.1 simplifies to X^T W z.
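The three candidate scalings can be written down in a few lines. This is a sketch under the assumptions above (the function name and mode strings are invented for illustration):

```python
import numpy as np

def scaled_cgeps(cgeps, n_attrs, r0=None, mode="attrs"):
    """Candidate scalings for the CG residual threshold (a sketch).

    mode="attrs"    multiplies by the number of attributes, the
                    scheme used in the earlier experiments;
    mode="sqrt"     multiplies by its square root, matching the
                    Euclidean norm of the residual;
    mode="residual" multiplies by ||r0||, the norm of the initial
                    CG residual, so large-residual problems stop sooner.
    """
    if mode == "attrs":
        return cgeps * n_attrs
    if mode == "sqrt":
        return cgeps * np.sqrt(n_attrs)
    if mode == "residual":
        return cgeps * np.linalg.norm(r0)
    raise ValueError(f"unknown scaling mode: {mode}")
```

CG would then terminate once the residual norm drops below the returned threshold, whichever scaling is in effect.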
[Figure 5.7: time versus cgeps under initial-residual scaling, for all datasets.]

Figure 5.6 suggests that the most flattering value of the scaled cgeps is 0.01 for ds1, 0.0005 for imdb, and somewhere near 0.0001 for ds2. The table below shows how these values translate when cgeps is scaled by the number of attributes, by the square root of the number of attributes, and by the Euclidean norm of the initial residual vector.
Dataset    #attributes   cgeps (attr)   sqrt(#attributes)   cgeps (sqrt)   ||r0||    cgeps (residual)
ds1        6,348         0.01           79.67               0.79778        83,227    0.00076
imdb       685,569       0.0005         827.99              0.41399        92,503    0.00370
ds2        1,143,054     0.0001         1069.14             0.10691        171,694   0.00066
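The conversions in the table follow from holding the absolute residual bound fixed while changing the scale factor. A minimal sketch of that arithmetic (the helper name is ours, not the thesis code):

```python
import math

def equivalent_cgeps(cgeps_attr, n_attrs, r0_norm):
    """Convert an attribute-scaled cgeps into its equivalents under
    square-root and initial-residual scaling (a sketch).

    All three scalings describe the same absolute bound on the CG
    residual norm: threshold = cgeps * scale_factor.
    """
    threshold = cgeps_attr * n_attrs           # absolute residual bound
    cgeps_sqrt = threshold / math.sqrt(n_attrs)
    cgeps_residual = threshold / r0_norm
    return cgeps_sqrt, cgeps_residual
```

For instance, an attribute-scaled cgeps of 0.01 on a dataset with 100 attributes and an initial residual norm of 50 corresponds to 0.1 under square-root scaling and 0.02 under residual scaling.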
We conclude that a cgeps of 0.005 is adequate for all datasets, while 0.001 is safe for all datasets. Choosing cgeps=0.001 increases the time significantly. Nonetheless we prefer the safer cgeps=0.001 for autonomous applications, and will use this setting for later experiments in this thesis.
cgeps      ds1           imdb          citeseer      ds2           ds1.100pca    ds1.10pca
           AUC    Time   AUC    Time   AUC    Time   AUC    Time   AUC    Time   AUC    Time
0.05       0.925  23     0.965  139    0.928  36     0.673  843    0.889  22     0.788  4
0.01       0.947  42     0.970  193    0.947  41     0.717  1590   0.912  26     0.846  6
0.005      0.948  47     0.983  279    0.947  37     0.720  2372   0.917  31     0.846  5
0.001      0.949  86     0.983  370    0.946  44     0.720  3417   0.918  44     0.846  7
0.0005     0.949  93     0.983  429    0.946  45     0.720  3548   0.918  46     0.846  7
0.0001     0.949  106    0.983  589    0.945  58     0.720  3838   0.918  50     0.846  7
0.00005    0.949  112    0.983  638    0.945  61     0.720  3936   0.918  52     0.846  7
0.00001    0.949  117    0.983  752    0.945  70     0.720  4211   0.918  56     0.846  7
The size of the CG residual is influenced by many factors, including the number of positive dataset rows and the sparsity of the data, and hence there are many reasonable scalings for cgeps. For completeness we have performed experiments with other scale factors, including the number of nonzero elements, the average number of nonzero elements per dataset row, the dataset sparsity, the ``dense'' size, and the number of positive dataset rows. While scaling by the initial residual is both logically and empirically attractive, the need to scale cgeps at all motivates exploration of CG termination criteria which do not depend on the CG residual.
One alternative to residual-based termination methods for CG is to use a context-sensitive measure such as the deviance. The cgdeveps parameter tests the relative difference of the deviance between CG iterations, just as lreps tests this quantity for IRLS iterations. While the CG residual is a natural part of CG computations, the LR deviance is not, and it requires an additional pass over the data each time it is evaluated. However, like lreps, the value of cgdeveps adapts to the scale of the problem, and may be more suitable for an autonomous classifier. We do not allow cgeps and cgdeveps to be used simultaneously, though nothing precludes this possibility.
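The deviance-based test mirrors the lreps test one level down. A minimal sketch, assuming the standard binomial deviance for LR and hypothetical helper names:

```python
import numpy as np

def lr_deviance(y, mu, eps=1e-12):
    """Binomial deviance D = -2 * sum(y*log(mu) + (1-y)*log(1-mu)).

    y holds 0/1 labels and mu the predicted probabilities; each
    evaluation costs a full pass over the rows, which is the extra
    expense of deviance-based CG termination. A sketch only.
    """
    mu = np.clip(mu, eps, 1.0 - eps)  # guard against log(0)
    return -2.0 * np.sum(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))

def cg_dev_should_stop(dev_prev, dev_curr, cgdeveps=0.005):
    """Stop CG when the relative deviance change is below cgdeveps."""
    return abs(dev_prev - dev_curr) / abs(dev_prev) < cgdeveps
```

The same relative-difference form used for lreps makes cgdeveps scale-independent, which is what makes a single default plausible across datasets.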
[Figure 5.8: AUC (top) and time (bottom) as functions of lreps and cgdeveps, for datasets ds2 and citeseer.]

The top half of Figure 5.8 illustrates AUC versus cgdeveps for ds2 and citeseer. The imdb dataset behaves something like citeseer on a smaller scale, while the others mimic ds2. The improvement in AUC for citeseer with looser cgdeveps comes from avoiding the overfitting we witnessed in earlier experiments with this dataset. Examining the deviances, we see that the overfit experiments reduced deviance by an extra 1.5%, unlike the 500+% reductions seen in Table 5.7, and we are not concerned with the small AUC penalty in these experiments. It is possible that different stability parameter defaults would be appropriate for cgdeveps, but for this work we will keep our current stability parameters and avoid overly tight lreps and cgdeveps values.
Figure 5.8 suggests that an lreps of 0.1 is adequate and 0.05 is safe so long as cgdeveps is sufficiently small. As with cgeps, larger lreps and cgdeveps cause earlier termination and hence better speed. The bottom half of Figure 5.8 shows the relation between lreps, cgdeveps and time for datasets ds2 and citeseer. For these datasets and the others, the cost of lreps=0.05 is always very similar to the cost of lreps=0.1. Preferring to err on the side of safety, we will use lreps=0.05 for all cgdeveps experiments.
[Figure 5.9: AUC and time versus cgdeveps for all six datasets, with lreps=0.05.]

The relation between AUC, time and cgdeveps when lreps=0.05 is shown for all six datasets in Figure 5.9. Note that the time axis is logarithmic. The most interesting cgdeveps values are 0.01, 0.005 and 0.001. Table 5.20 shows AUC and times for these values. If not for dataset ds1.10pca, setting cgdeveps=0.01 would be adequate, perhaps even safe. While cgdeveps=0.001 appears safe for every dataset, it comes with a 10% to 30% speed penalty over cgdeveps=0.005. Though 0.005 produces only adequate results for ds1.10pca, the corresponding decrease in AUC is only 3%. Given these tradeoffs, and since this only seems to affect one of our datasets, we accept the risk of data mining on-the-edge and choose cgdeveps=0.005. We reran each experiment from the cgdeveps=0.005 row of Table 5.20, just as we did for the cgeps=0.001 row of Table 5.19. There was little variation, and the original experiments were representative of the mean.
The paragraphs above describe two very different strategies for terminating CG iterations. Because of the sizeable difference, and because of the challenges and tradeoffs made when selecting default values for cgeps and cgdeveps, we will show results for both cgeps and cgdeveps throughout most of our later experiments.
Our experiments agree with the assertion in McCullagh and Nelder [25] that few IRLS iterations are needed to achieve optimality. Table 5.21 shows the maximum number of IRLS iterations across the folds of the experiments reported above using default parameter values. For these experiments the maximums and minimums were always similar. In both the cgeps and cgdeveps experiments, no more than thirteen IRLS iterations were ever used. Therefore it is reasonable to set lrmax to thirty to prevent runaway experiments.
Table 5.21 also shows CG iteration maximums, taken first within and then over the folds. From these counts we see that the number of CG iterations never exceeded fifty-five in our experiments with the default parameters. Because of the variability between CG iteration counts, we feel that setting cgmax to two hundred is a reasonable restraint. While the counts in Table 5.21 are the maximum counts over the folds of each experiment, the minimum counts are not much different. We have added lreps, both CG termination parameters, lrmax and cgmax to our LR description tree, shown in Figure 5.10.
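The role of lrmax as a runaway cap around the lreps test can be sketched as a driver loop. The callables and their signatures are hypothetical; the inner CG solve is assumed to enforce its own cgmax cap:

```python
def run_irls(update_beta, deviance_of, beta, lreps=0.05, lrmax=30):
    """Outer IRLS loop with the lrmax runaway cap (a sketch).

    update_beta(beta) is assumed to perform one IRLS step, running
    CG internally under its own cgmax/cgeps (or cgdeveps) limits.
    deviance_of(beta) evaluates the LR deviance.
    Returns the final beta and the number of IRLS iterations used.
    """
    dev = deviance_of(beta)
    iters = 0
    for iters in range(1, lrmax + 1):
        beta = update_beta(beta)
        new_dev = deviance_of(beta)
        # Same relative-difference test that lreps applies.
        if abs(dev - new_dev) / abs(dev) < lreps:
            break
        dev = new_dev
    return beta, iters
```

With the defaults above, well-behaved problems stop via the lreps test in far fewer than thirty iterations, and lrmax only fires for pathological runs.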
