next up previous contents
Next: 5.2.3 Indirect (IRLS) Speed Up: 5.2 IRLS Parameter Evaluation Previous: Stability Tests: Default   Contents

5.2.2 Indirect (IRLS) Termination And Optimality

Many of the parameters in Section 5.2.1 affect termination and optimality, but were aimed at enhancing stability and providing a robust foundation for the remaining parameters. The parameters discussed in this section are primarily concerned with achieving optimal scores without wasted computation. These include the parameters lreps, cgeps and cgdeveps. There are an additional two parameters, lrmax and cgmax, used to set upper limits on the number of CG and IRLS iterations. These parameters are summarized in Table 5.3.

IRLS is terminated when the relative difference of the deviance falls below lreps or the number of IRLS iterations reaches lrmax. See Section 4.4 for details. Deviance-based termination is common, with the relative difference suggested by McIntosh [26]. Because relative deviances are scale-independent, it is not difficult to find a value for lreps which works well for most datasets.

Several options for CG termination were discussed in  2.1.2. Our focus is understanding whether and when LR is suitable for classification, especially in data mining applications, and we have examined very few termination criteria. The cgeps parameter is used for the traditional residual termination criteria. For linear CG inside IRLS, the residual vector is

$\displaystyle \ensuremath{\mathbf{r}} = \ensuremath{\mathbf{b}} - \ensuremath{\...
...}^T \ensuremath{\mathbf{W}} \ensuremath{\mathbf{X}} \ensuremath{\mathbf{\beta}}$ (5.1)

and, cgwindow aside, we terminate when the Euclidean norm $ (\ensuremath{\mathbf{r}}^T
\ensuremath{\mathbf{r}})^2$ of this vector is smaller than cgeps. It is difficult to find a problem-independent value for cgeps which provides optimal AUC scores and speed. Early in our research we choose to scale cgeps by the number of attributes in the dataset, such that the residual would be compared to $ M \times$   cgeps. In this thesis we have questioned whether that scaling was appropriate, and we examine this question below.

Figure: AUC and time over several lreps and cgeps values on datasets ds2 and ds1.10pca.

x2html_comment_mark>511 figures/ figures/ figures/

The top half of Figure 5.5 shows how the AUC score varies as a function of lreps and cgeps for datasets ds2 and ds1.10pca. Each line on these plots represents a different lreps value, and the horizontal axis represents cgeps. For ds2 there is little difference between lreps lines so long as cgeps is not made too large. For ds1.10pca there is again little difference between lreps values, and no change for cgeps values. The ds1.10pca plot demonstrates that an absurdly low value of lreps, such as 0.5, can be somewhat harmful. This lreps sensitivity is dwarfed by the effect of too large a cgeps for ds2, and we will be very careful choosing a default cgeps value. The bottom half of Figure 5.5 illustrates how speed changes with lreps and cgeps. As one would expect, the shortest times occurring for the largest lreps and cgeps values. These results are representative of experiments on all six datasets used in our previous experiments.

Figure: AUC and time versus cgeps on several datasets with lreps=0.05. Note that the time axis is logarithmic.
tex2html_comment_mark>514 figures/

The largest, and hence fastest, ``safe'' value of lreps appears to be 0.1. To be somewhat conservative, when using cgeps to terminate CG we will choose lreps=0.05. Choosing a value for cgeps is more difficult. Figure 5.6 shows how AUC and speed vary with cgeps in all six datasets used in previous experiments with lreps set to 0.05. Note that the vertical ``Time'' axis is logarithmic. It appears that any small value for cgeps will produce the same AUC score, but we incur a time penalty if cgeps is too small. This time penalty is severe for ds1, imdb and ds2. The largest sensible cgeps values for ds1, with respect to AUC , are quite different than those for imdb and ds2. Any conservative choice of cgeps will cause unduly slow results for ds1, which leads to a reconsideration of cgeps scaling.

The scaling used in every previous experiment multiplied the reported cgeps by $ M$, the number of attributes in the dataset. This choice was made using the naive argument that the CG residual bound should depend on its dimensionality. Perhaps the most obvious change is to multiply by the square root of the dimensionality, since a Euclidean norm is used. Another option is multiplying cgeps by the value of the initial CG residual before any IRLS iterations, so that problems which start with a large residual terminate CG iterations sooner. A similar termination criterion is suggested in Shewchuk [41] and discussed briefly in Section 2.1.2. When the binitmean parameter is disabled, the LR parameter vector $ \ensuremath{\mathbf{\beta}}$ is initialized to zero and the initial CG residual $ \ensuremath{\mathbf{r}}_0$ of Equation 5.1 simplifies to $ \ensuremath{\mathbf{X}}^T \ensuremath{\mathbf{W}}
\ensuremath{\mathbf{z}} = \ensuremath{\mathbf{X}}^T (\ensuremath{\mathbf{y}} - [0.5,\ldots,0.5]^T)$.

Figure: AUC and time versus $ \ensuremath{\mathbf{r}}_0$-cgeps on several datasets with lreps=0.05. Note that the time axis is logarithmic.
tex2html_comment_mark>515 figures/

Figure 5.6 suggests that the most flattering value of the $ M$-scaled cgeps is 0.01 for ds1, 0.0005 for imdb, and somewhere near 0.0001 for ds2. The table below shows how these values scale under the number of attributes $ M$, the square root of the number of attributes $ \sqrt{M}$, and the Euclidean norm of the initial residual vector $ \ensuremath{\mathbf{r}}_0$.

Dataset $ M$ $ M$-cgeps $ \sqrt{M}$ $ \sqrt{M}$-cgeps $ \ensuremath{\mathbf{r}}_0$ $ \ensuremath{\mathbf{r}}_0$-cgeps
ds1 6,348 0.01 79.67 0.79778 83,227 0.00076
imdb 685,569 0.0005 827.99 0.41399 92,503 0.00370
ds2 1,143,054 0.0001 1069.14 0.10691 171,694 0.00066
Both $ \sqrt{M}$-scaling and $ \ensuremath{\mathbf{r}}_0$-scaling reduce the ratio of the high and low cgeps values from 100 to 7.5 and 5.6, respectively. Since the $ \ensuremath{\mathbf{r_0}}$-scaling incorporates both the dimensionality-awareness of the $ \sqrt{M}$ with dataset-specific information, we favor this scaling for residual-based CG termination. Figure 5.7 transforms the $ M$-scaled cgeps values of Figure 5.6 into $ \ensuremath{\mathbf{r}}_0$-scaled values. Table 5.19 shows AUC scores and times for several $ \ensuremath{\mathbf{r}}_0$-cgeps values with lreps=0.05.

We conclude that a $ \ensuremath{\mathbf{r}}_0$-cgeps of 0.005 is adequate for all datasets, while 0.001 is safe for all datasets. Choosing $ \ensuremath{\mathbf{r}}_0$-cgeps=0.001 increases the time significantly. Nonetheless we prefer $ \ensuremath{\mathbf{r}}_0$   -cgeps$ = 0.001$ for autonomous applications and will use this setting for later experiments in this thesis.

Table: AUC and Times For Several $ \ensuremath{\mathbf{r}}_0$-cgeps Values
$ \ensuremath{\mathbf{r}}_0$-cgeps ds1 imdb citeseer ds2 ds1.100pca ds1.10pca
0.05 0.925 23 0.965 139 0.928 36 0.673 843 0.889 22 0.788 4
0.01 0.947 42 0.970 193 0.947 41 0.717 1590 0.912 26 0.846 6
0.005 0.948 47 0.983 279 0.947 37 0.720 2372 0.917 31 0.846 5
0.001 0.949 86 0.983 370 0.946 44 0.720 3417 0.918 44 0.846 7
0.0005 0.949 93 0.983 429 0.946 45 0.720 3548 0.918 46 0.846 7
0.0001 0.949 106 0.983 589 0.945 58 0.720 3838 0.918 50 0.846 7
0.00005 0.949 112 0.983 638 0.945 61 0.720 3936 0.918 52 0.846 7
0.00001 0.949 117 0.983 752 0.945 70 0.720 4211 0.918 56 0.846 7

The size of the CG residual $ \ensuremath{\mathbf{r}}$ is influenced by many factors including the number of positive dataset rows and the sparsity of the data, and hence there are many reasonable scalings for cgeps. For completeness we have performed experiments for other scale factors including the number of nonzero elements $ MRF$, the average number of nonzero elements per dataset row, the dataset sparsity $ F$, the ``dense'' size $ MR$, and the number of positive dataset rows. While $ \ensuremath{\mathbf{r}}_0$-scaling is both logically and empirically attractive, the need to scale cgeps motivates exploration of CG termination criteria which do not depend on the CG residual.

One alternative to residual-based termination methods for CG is to use a context-sensitive measure such as the deviance. The cgdeveps parameter tests the relative difference of the deviance between CG iterations, just as lreps tests this quantity for IRLS iterations. While the CG residual is a natural part of CG computations, the LR deviance is not. The deviance requires O$ \left( MRF \right)$ time to compute. However, as seen with lreps the value of cgdeveps is adaptive, and may be more suitable for an autonomous classifier. We do not allow cgeps and cgdeveps to be used simultaneously, though nothing precludes this possibility.

Figure: AUC and time over several lreps and cgdeveps values on datasets ds2 and ds1.10pca.

x2html_comment_mark>544 figures/ figures/ figures/

The top half of Figure 5.8 illustrates AUC versus cgdeveps for ds2 and citeseer. The imdb dataset behaves something like citeseer on a smaller scale, while the others mimic ds2. The improvement in AUC for citeseer with looser cgdeveps comes from avoiding the overfitting we witnessed in earlier experiments with this dataset. Examining the deviances we see that the overfit experiments have reduced deviance an extra 1.5%, unlike the 500+% reductions seen in Table 5.7, and we are not concerned with the small AUC penalty in these experiments. It is possible that different stability parameter defaults would be appropriate for cgdeveps, but for this work we will keep our current stability parameters and avoid over-tight lreps and cgdeveps values.

Figure 5.8 suggests that an lreps of 0.1 is adequate and 0.05 is safe so long as cgdeveps is sufficiently small. As with cgeps, larger lreps and cgdeveps cause earlier termination and hence better speed. The bottom half of Figure 5.8 shows the relation between lreps, cgdeveps and time for datasets ds2 and citeseer. For these datasets and the others, the cost of lreps=0.05 is always very similar to the cost of lreps=0.1. Preferring to error on the side of safety, we will use lreps=0.05 for all cgdeveps experiments.

Table: AUC and Times For Several cgdeveps Values
cgdeveps ds1 imdb citeseer ds2 ds1.100pca ds1.10pca
0.01 0.945 65 0.983 440 0.946 67 0.721 2520 0.914 38 0.821 10
0.005 0.946 78 0.983 514 0.946 74 0.718 2923 0.916 43 0.846 12
0.001 0.948 111 0.983 622 0.945 84 0.722 3570 0.916 63 0.846 13

Figure: AUC and time versus cgdeveps on several datasets with lreps=0.05. Note that the time axis is logarithmic.
tex2html_comment_mark>556 figures/

The relation between AUC , time and cgdeveps when lreps=0.05 is shown for all six datasets in Figure 5.9. Note that the time axis is logarithmic. The most interesting cgdeveps values are 0.01, 0.005 and 0.001. Table 5.20 shows AUC and times for these values. If not for dataset ds1.10pca, setting cgdeveps=0.01 would be adequate, perhaps even safe. While cgdeveps=0.001 appears safe for everyone, it comes with a 10% to 30% speed penalty over cgdeveps=0.005. Though 0.005 produces only adequate results for ds1.10pca, the neighboring decrease in AUC is only 3%. Given these trade-offs and since this only seems to affect one of our datasets, we accept the risk of data mining on-the-edge and choose cgdeveps=0.005. We ran each experiment from the cgdeveps=0.005 row of Table 5.20, just as we did in the $ \ensuremath{\mathbf{r}}_0$-cgeps=0.001 row of Table 5.19. There was little deviance and the original experiments were representative of the mean.

The paragraphs above describe two very different strategies for terminating CG iterations. Because of the sizeable difference, and because of the challenges and trade-offs made when selecting default values for cgeps and cgdeveps, we will show results for both cgeps and cgdeveps throughout most of our later experiments.

Our experiments agree with the assertion in McCullagh and Nelder [25] that few IRLS iterations are needed to achieve optimality. Table 5.21 shows the maximum number of LR iterations for folds of the experiments reported above using default parameter values. For these experiments the maximums and minimums were always similar. In both the cgeps and cgdeveps experiments, no more than thirteen LR iterations were ever used. Therefore it is reasonable to set lrmax at thirty to prevent runaway experiments.

Table 5.21 also shows CG iteration maximums, taken within and then over the folds. From these counts we see that the number of CG iterations never exceeded fifty-five for our experiments with the default parameters. Because of the variability between CG iteration counts, we feel that setting cgmax to two-hundred is a reasonable restraint. While the counts in Table 5.21 are the maximum counts over the folds of each experiment, the minimum counts are not much different. We have added lreps, both CG termination parameters, lrmax and cgmax to our LR description tree, shown in Figure 5.10.

Table 5.21: Maximum LR and CG iteration counts across all folds of cgeps and cgdeveps experiments.
  ds1 imdb citeseer ds2 ds1.100pca ds1.10pca
cgeps 6 34 7 7 6 3 8 55 5 20 5 9
cgdeveps 7 22 7 8 6 4 9 31 5 10 5 9

Figure 5.10: LR tree showing all termination parameters used for IRLS, including the safeguard parameters lrmax and cgmax.

next up previous contents
Next: 5.2.3 Indirect (IRLS) Speed Up: 5.2 IRLS Parameter Evaluation Previous: Stability Tests: Default   Contents
Copyright 2004 Paul Komarek,