AD ALTA
JOURNAL OF INTERDISCIPLINARY RESEARCH
performance; on the other hand, the previously mentioned
procedures allow us to reduce the number of explanatory variables
by ~40 % while retaining the same level of classification
performance as with the original dataset. Other feature selection
methods appear to lead to inferior results.
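The ~40 % figure can be checked against the mean feature counts reported in Table 3. The sketch below assumes that the "previously mentioned procedures" are the three RFE variants and that the "none" row (158.0 features) is the baseline; the helper name `reduction_pct` is introduced here for illustration only.

```python
# Mean number of retained features, copied from Table 3.
# Assumption: the procedures referred to in the text are the RFE variants.
FEATURES = {"SVM-RFE": 87.9, "LR-RFE": 91.4, "RF-RFE": 96.8}
BASELINE = 158.0  # "none" row: no feature selection applied


def reduction_pct(kept, total=BASELINE):
    """Percentage of explanatory variables removed relative to the full set."""
    return 100.0 * (1.0 - kept / total)


for name, kept in FEATURES.items():
    print(f"{name}: {reduction_pct(kept):.1f} % fewer features")
```

Under these assumptions the three reductions come out between roughly 39 % and 44 %, which is in line with the approximate value quoted in the text.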
Table 2.
Classification methods with respective parameters
| Classification method | Optimized parameters | Implementation |
|---|---|---|
| LOGIT | regularization forms: {L1, L2 dual, L2 primal}, cost | Fan et al., 2008 |
| CIT | max tree depth, p-value threshold | Hothorn, Zeileis, 2015 |
| RF | number of selected predictors, splitting rule, minimal node size | Wright, Ziegler, 2017 |
| SVM | kernel: {RBF}, cost, sigma | Karatzoglou et al., 2004 |
Source: author
To inspect explanatory variable importance in the original feature
space (Tab. 1), a co-occurrence matrix of selection scheme-feature
pairs is constructed; the numbers of feature occurrences in both
individual and interaction terms are included. Moreover, the
co-occurrence matrix is scaled by the maximum possible incidence
of a feature (the scheme-feature pair for the procedure without
feature selection). The result of the outlined steps is depicted
using a heatmap and dendrograms in Fig. 5; the explanatory
variable state is not present, as it is eliminated in the data
preprocessing step due to near-zero variance.
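The construction and scaling of the co-occurrence matrix can be sketched as follows. This is a minimal illustration, not the author's implementation: the scheme names `scheme_a` and `scheme_b`, the toy term lists, and the function names are hypothetical. Each selected term is represented as a tuple of original feature names, with one name for an individual term and several for an interaction term.

```python
from collections import Counter


def feature_counts(terms):
    """Count how often each original feature appears across the given terms.

    A term is a tuple of feature names: length 1 for an individual term,
    length 2 or more for an interaction term.
    """
    counts = Counter()
    for term in terms:
        counts.update(term)
    return counts


def scaled_cooccurrence(selected, baseline="none"):
    """Build a scheme-by-feature matrix scaled by the baseline incidence.

    `selected` maps a scheme name to the list of terms it retained. The
    baseline scheme (no feature selection) gives the maximum possible
    incidence of each feature, so every scaled entry lies in [0, 1].
    """
    max_counts = feature_counts(selected[baseline])
    features = sorted(max_counts)
    matrix = {}
    for scheme, terms in selected.items():
        counts = feature_counts(terms)
        matrix[scheme] = {f: counts[f] / max_counts[f] for f in features}
    return matrix


# Hypothetical toy data: three features and their pairwise interactions.
full = [
    ("international_plan",), ("total_day_charge",), ("area_code",),
    ("international_plan", "total_day_charge"),
    ("international_plan", "area_code"),
    ("total_day_charge", "area_code"),
]
selected = {
    "none": full,
    "scheme_a": [("international_plan",),
                 ("international_plan", "total_day_charge")],
    "scheme_b": [("total_day_charge",),
                 ("international_plan", "area_code"),
                 ("total_day_charge", "area_code")],
}
matrix = scaled_cooccurrence(selected)
```

In this toy setting every feature occurs three times in the full term set, so `scheme_a` scores 2/3 for international_plan and 0 for area_code, while the baseline row is 1 throughout.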
Table 3.
Classification performance indicators aggregated by the feature selection method
| Feature selection method | Number of features | Feature selection runtime [s] | Train ACC (95 % CI) | Train AUC (95 % CI) | Train TDL (95 % CI) | Test ACC (95 % CI) | Test AUC (95 % CI) | Test TDL (95 % CI) |
|---|---|---|---|---|---|---|---|---|
| CFS | 10.3 | 6.1 | 0.923 (0.852, 0.994) | 0.911 (0.801, 1.022) | 5.678 (3.441, 7.914) | 0.905 (0.860, 0.950) | 0.874 (0.813, 0.935) | 5.171 (3.462, 6.880) |
| Consistency | 18.2 | 139.6 | 0.939 (0.870, 1.007) | 0.929 (0.836, 1.021) | 6.165 (4.432, 7.898) | 0.919 (0.877, 0.960) | 0.890 (0.848, 0.931) | 5.696 (4.381, 7.011) |
| FS | 24.8 | 0.1 | 0.920 (0.860, 0.979) | 0.907 (0.797, 1.017) | 5.632 (3.822, 7.442) | 0.904 (0.867, 0.941) | 0.868 (0.786, 0.950) | 5.123 (3.809, 6.437) |
| Relief | 25.4 | 179.9 | 0.915 (0.850, 0.980) | 0.896 (0.759, 1.032) | 5.444 (3.309, 7.579) | 0.898 (0.860, 0.935) | 0.852 (0.765, 0.940) | 4.868 (3.430, 6.305) |
| IGR | 44.4 | 0.6 | 0.937 (0.872, 1.002) | 0.927 (0.831, 1.022) | 6.124 (4.429, 7.819) | 0.918 (0.878, 0.958) | 0.889 (0.836, 0.941) | 5.652 (4.281, 7.024) |
| IG | 45.0 | 0.7 | 0.939 (0.876, 1.003) | 0.930 (0.844, 1.016) | 6.196 (4.603, 7.789) | 0.920 (0.883, 0.957) | 0.892 (0.855, 0.929) | 5.711 (4.463, 6.959) |
| SU | 47.5 | 0.5 | 0.940 (0.875, 1.005) | 0.929 (0.839, 1.020) | 6.196 (4.642, 7.750) | 0.920 (0.883, 0.958) | 0.892 (0.845, 0.939) | 5.751 (4.517, 6.984) |
| OneR | 51.4 | 0.5 | 0.943 (0.881, 1.005) | 0.933 (0.849, 1.016) | 6.286 (4.883, 7.688) | 0.923 (0.889, 0.958) | 0.895 (0.862, 0.928) | 5.856 (4.777, 6.935) |
| SVM-RFE | 87.9 | 2190.3 | 0.952 (0.884, 1.020) | 0.940 (0.860, 1.020) | 6.465 (4.965, 7.965) | 0.932 (0.883, 0.980) | 0.900 (0.865, 0.935) | 6.108 (4.692, 7.524) |
| LR-RFE | 91.4 | 2190.4 | 0.951 (0.879, 1.023) | 0.940 (0.858, 1.022) | 6.437 (4.854, 8.019) | 0.931 (0.880, 0.982) | 0.899 (0.859, 0.939) | 6.088 (4.614, 7.562) |
| RF-RFE | 96.8 | 2190.2 | 0.952 (0.881, 1.022) | 0.940 (0.859, 1.020) | 6.454 (4.916, 7.991) | 0.932 (0.881, 0.983) | 0.900 (0.860, 0.939) | 6.114 (4.626, 7.603) |
| none | 158.0 | 0.0 | 0.950 (0.882, 1.018) | 0.940 (0.863, 1.016) | 6.441 (5.030, 7.853) | 0.931 (0.880, 0.982) | 0.899 (0.861, 0.936) | 6.075 (4.562, 7.588) |
Source: author
There are two evident analytic perspectives arising from the
co-occurrence matrix: (1) feature importance across different
selection procedures and (2) underlying similarity amongst the
results of feature selection schemes.
Considering the former perspective (1), three diverse groups of
impact on the target variable are identified by the row-wise
dendrogram. The bottom cluster consists of just one element,
international_plan, which is recognized as very important by
all selection schemes; the middle cluster contains three elements,
total_day_charge, number_customer_service_calls and
total_day_minutes, which are also observed to be important
indicators of a customer's propensity to churn; the structure of the
upper cluster is rather ambiguous, except for the area_code element,
which is generally omitted.
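The way a row-wise dendrogram groups features can be illustrated with a small agglomerative clustering sketch. The source does not state which distance measure or linkage underlies Fig. 5; Euclidean distance and average linkage are assumptions here. The incidence vectors are invented for illustration and are not values from the paper, and `other_feature` is a hypothetical placeholder.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(rows, k):
    """Merge the two closest clusters (average linkage) until k remain.

    `rows` maps a feature name to its vector of scaled incidences across
    selection schemes. Returns a list of clusters, each a sorted name list.
    """
    clusters = [[label] for label in rows]

    def linkage(c1, c2):
        dists = [euclidean(rows[a], rows[b]) for a in c1 for b in c2]
        return sum(dists) / len(dists)

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical scaled incidences over four selection schemes.
rows = {
    "international_plan": [1.0, 1.0, 1.0, 1.0],
    "total_day_charge": [0.7, 0.8, 0.6, 0.8],
    "total_day_minutes": [0.6, 0.8, 0.7, 0.7],
    "number_customer_service_calls": [0.7, 0.7, 0.6, 0.8],
    "area_code": [0.0, 0.1, 0.0, 0.0],
    "other_feature": [0.1, 0.0, 0.2, 0.1],
}
groups = agglomerate(rows, k=3)
```

With these made-up values, cutting at three clusters isolates international_plan, groups the three usage-related features together, and leaves the rarely selected features in a third cluster, mirroring the kind of structure described above.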
From the latter perspective (2), three distinct groups of feature
structures are identified by the column-wise dendrogram. The
left cluster contains multivariate filter selection methods and
Fisher's score; the middle cluster consists of EBM schemes and
OneR; the right cluster is reserved for RFE procedures. The
underlying similarity amongst selection schemes appears to be
driven by both the number and the structure of included features;
this is supported by the internal coherence of clusters considering the