AD ALTA
JOURNAL OF INTERDISCIPLINARY RESEARCH
performance of the classification learner (see Tab. 3), although
the Consistency method does exhibit different behavior.
To outline the prediction performance of the individual
classification learners, mean point estimates across algorithms
and selection/no-selection schemes are displayed in Tab. 4; the
best indicators are depicted in bold, and 95 % confidence
intervals for the underlying distributions are reported.
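The aggregation described above can be sketched as follows. This is a minimal illustration, not the author's procedure: the text does not state how the intervals were constructed, so the normal approximation, the helper name `mean_with_interval`, and the sample values are all assumptions.

```python
import math
import statistics


def mean_with_interval(values, z=1.96, of_mean=False):
    """Return (mean, lower, upper) under a normal approximation.

    Assumption: the interval is symmetric around the mean.
    of_mean=False -> mean +/- z * sd          (spread of the values)
    of_mean=True  -> mean +/- z * sd / sqrt(n) (uncertainty of the mean)
    Needs at least two values; statistics.stdev raises otherwise.
    """
    m = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation
    half_width = z * s / math.sqrt(len(values)) if of_mean else z * s
    return m, m - half_width, m + half_width


# Hypothetical test-AUC values of one learner across pipeline runs.
runs = [0.90, 0.92, 0.94]
mean, lower, upper = mean_with_interval(runs)
print(f"{mean:.3f} ({lower:.3f}, {upper:.3f})")
```

A symmetric interval of this kind is not bounded by the natural range of the metric, so an upper limit above 1 can occur for ACC or AUC values close to 1.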
The RF algorithm presents superior performance, with very low
bias and acceptable variance across all performance measures;
however, the drop in test performance might be a sign of
overfitting. The LOGIT method, on the other hand, displays higher
bias and very low variance as a consequence of regularization.
The CIT and SVM learners exhibit similar performance, with low
bias and moderate variance.
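The drop between train and test performance can be read directly from the mean point estimates in Tab. 4. The sketch below only subtracts the reported test means from the train means; the dictionary layout and the function name `train_test_gaps` are illustrative assumptions, and the gap is a rough indicator of overfitting rather than a formal test.

```python
# Mean point estimates copied from Tab. 4 as (train, test) pairs.
TABLE_4 = {
    "LOGIT": {"ACC": (0.893, 0.890), "AUC": (0.874, 0.868), "TDL": (4.891, 4.782)},
    "CIT": {"ACC": (0.941, 0.925), "AUC": (0.917, 0.879), "TDL": (6.335, 5.832)},
    "SVM": {"ACC": (0.939, 0.922), "AUC": (0.919, 0.897), "TDL": (6.315, 5.806)},
    "RF": {"ACC": (0.980, 0.940), "AUC": (0.996, 0.906), "TDL": (6.965, 6.318)},
}


def train_test_gaps(table):
    """Return train minus test for every learner and metric, rounded to 3 decimals."""
    return {
        learner: {
            metric: round(train - test, 3)
            for metric, (train, test) in metrics.items()
        }
        for learner, metrics in table.items()
    }


gaps = train_test_gaps(TABLE_4)
for learner, metrics in gaps.items():
    print(learner, metrics)
```

On these figures RF shows the largest gap for every metric and LOGIT the smallest, which matches the reading given in the text.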
To examine the behavior of the classification learners further,
another CIT model is built on top of the pipeline results; the
response variable is the top-decile lift (TDL) measured on the
test set, and the explanatory variables are the feature selection
scheme and the classification method. The motivation for
analyzing test TDL comes from its link to retention campaign
profit dynamics (see Verbeke et al., 2012). The tree structure is
charted in Appendix 1; it becomes apparent that a feature
selection procedure does not lead to a significant improvement of
the performance metric when combined with classification learners
with embedded feature selection. This observation is supported by
terminal nodes 12 (CIT), 18 (RF) and 23 (LOGIT), which blend the
learner's performance with and without feature selection. The SVM
learner, however, displays a leap in performance when coupled
with a feature selection scheme. This conclusion is backed by a
comparison of the boxplot charts in terminal node 10 (Consistency,
EBMs, OneR) or 12 (RFE) with terminal node 15 (no feature
selection scheme).
Figure 5.
Scaled co-occurrence matrix for selection scheme-feature pairs, Source: author
The next dimension of the analysis is the time complexity of the
classification learners as a function of the number of
explanatory variables ($n$). The empirical relationships are
revealed by locally estimated scatterplot smoothing (LOESS)
and depicted in Fig. 6.
Table 4.
Classification performance indicators aggregated by the classification method
| Classification method | Classification runtime [s] | Train ACC (95 % CI) | Train AUC (95 % CI) | Train TDL (95 % CI) | Test ACC (95 % CI) | Test AUC (95 % CI) | Test TDL (95 % CI) |
|---|---|---|---|---|---|---|---|
| LOGIT | 21.8 | 0.893 (0.865, 0.920) | 0.874 (0.814, 0.934) | 4.891 (3.727, 6.054) | 0.890 (0.864, 0.917) | 0.868 (0.808, 0.928) | 4.782 (3.647, 5.917) |
| CIT | 52.2 | 0.941 (0.908, 0.974) | 0.917 (0.865, 0.970) | 6.335 (5.228, 7.442) | 0.925 (0.891, 0.959) | 0.879 (0.824, 0.933) | 5.832 (4.596, 7.067) |
| SVM | 244.3 | 0.939 (0.898, 0.981) | 0.919 (0.865, 0.973) | 6.315 (5.064, 7.565) | 0.922 (0.895, 0.950) | 0.897 (0.851, 0.943) | 5.806 (4.855, 6.758) |
| RF | 361.7 | 0.980 (0.941, 1.020) | 0.996 (0.975, 1.017) | 6.965 (6.452, 7.478) | 0.940 (0.899, 0.981) | 0.906 (0.861, 0.951) | 6.318 (5.013, 7.623) |
Source: author
From an asymptotic perspective, there appear to be three classes
of behavior: (1) there is no clear relationship between the
number of features and the classification runtime, suggesting a
complexity of $O(1)$; the flat LOGIT line indicates such a
nature; (2) there seems to be a linear relationship between the
number of explanatory variables and the classification runtime,
indicating a complexity of $O(n)$; this appears to be valid for
the SVM and RF models; (3) there is a quadratic relationship
between the number of included variables and the classification
runtime, implying a complexity of $O(n^2)$; this behavior fits
the shallow convex curvature of the CIT arc. The RF LOESS curve,
however, shows a systematic residual pattern in the middle and
right sections of the figure; the observed phenomenon is induced
by the hyperparameter search step (the sensitivity of the weak
learner to the number of predictors and to its depth).
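A simple numerical complement to the visual reading of the LOESS curves is to fit a straight line to log(runtime) against log(n): the slope approximates the exponent $k$ in $O(n^k)$. This is a different technique from the LOESS smoothing used in the paper and only a rough check; the function name `loglog_slope` and the synthetic runtimes below are assumptions made for illustration.

```python
import math


def loglog_slope(n_features, runtimes):
    """Least-squares slope of log(runtime) on log(n).

    A slope near 0 suggests O(1), near 1 suggests O(n), near 2 suggests O(n^2).
    All inputs must be positive; at least two distinct n values are needed.
    """
    xs = [math.log(n) for n in n_features]
    ys = [math.log(t) for t in runtimes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic runtimes (not measurements from the study).
sizes = [5, 10, 20, 40, 80]
constant = [21.8 for _ in sizes]          # flat runtime
linear = [3.0 * n for n in sizes]         # runtime proportional to n
quadratic = [0.5 * n ** 2 for n in sizes]  # runtime proportional to n^2

for name, series in [("constant", constant), ("linear", linear), ("quadratic", quadratic)]:
    print(name, round(loglog_slope(sizes, series), 3))
```

With real measurements a fixed overhead or a hyperparameter search step distorts the slope, so the estimate is best read alongside the smoothed curves rather than in place of them.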