AD ALTA
JOURNAL OF INTERDISCIPLINARY RESEARCH
6 Conclusions and future work
In an environment of steep data growth, it becomes increasingly
difficult to identify useful patterns and extract relevant knowledge.
Thus, the goal of this paper is to examine the explanatory
variable selection procedure in the customer churn domain,
specifically (1) its effect on the prediction performance of a
classification learner; (2) its behavior across explanatory
variables; and (3) the link between the number of included variables
and classification runtime. The topic is examined using
an original experimental setup and a publicly available
dataset.
We observe a slight improvement in the learner's prediction
performance when RFE selection is applied, although the
difference is not statistically significant. From another
viewpoint, RFE schemes allow us to reduce the number of
features by ~40 % while retaining the same level of
classification performance as the full-featured dataset.
The Consistency, EBM and OneR methods exhibit notable behavior
when heavily reducing the number of features: they are (1)
almost on par with RFE schemes across all performance
measures and (2) computationally less demanding.
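The RFE-based reduction described above can be sketched in a few lines of scikit-learn code. This is an illustrative sketch, not the paper's exact experimental setup: the synthetic dataset, the random forest estimator and the target of retaining 60 % of features (a ~40 % reduction) are all assumptions chosen for demonstration.

```python
# Minimal RFE sketch: reduce feature count while checking that
# cross-validated performance stays comparable. Illustrative only;
# dataset and hyperparameters are placeholders, not the paper's setup.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Retain 12 of 20 features, i.e. a ~40 % reduction as reported above.
rfe = RFE(estimator=clf, n_features_to_select=12)
X_reduced = rfe.fit_transform(X, y)

score_full = cross_val_score(clf, X, y, cv=5).mean()
score_reduced = cross_val_score(clf, X_reduced, y, cv=5).mean()
print(X_reduced.shape)  # (500, 12)
```

RFE here wraps the same learner used for classification; a cheaper filter method could be substituted when the wrapper's repeated refitting is too costly.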
When examining feature importance across the different
selection schemes (see Fig. 5), international_plan,
total_day_charge, number_customer_service_calls and
total_day_minutes are recognized as important to the churn
event; the relevance of other features is inconclusive, except for
area_code, which is generally disregarded. From a business
perspective, these findings may represent invaluable insight
into customer behavior. The latent similarity among the results
of the feature selection procedures seems to be induced by the
number and structure of retained variables (see Fig. 5); this
observation is supported by the internal coherence of the
clusters with respect to the performance of the classification
learner (see Tab. 3), although the Consistency method behaves
adversely.
Figure 6. LOESS approximation of the classification learner's runtime as a function of the number of included variables. Source: author
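A LOESS curve of runtime against feature count, as in Figure 6, can be reproduced with the lowess smoother from statsmodels. The timings below are synthetic placeholders, not the paper's measurements; the linear-plus-noise relationship is assumed purely for illustration.

```python
# Sketch of a LOESS (lowess) fit of runtime vs. number of features.
# Runtimes are synthetic placeholder data, not measured values.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
n_features = np.arange(1, 51)
# Assume runtime grows roughly linearly with feature count, plus noise.
runtime = 0.2 * n_features + rng.normal(0, 0.5, n_features.size)

# frac controls the smoothing window; 0.5 uses half the data per fit.
smoothed = lowess(runtime, n_features, frac=0.5, return_sorted=True)
print(smoothed.shape)  # (50, 2): x values and smoothed runtimes
```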
Considering the overall performance of the classification learners,
RFs exhibit superior behavior across all metrics. LOGIT learners
are distinguished by higher bias and very low variance, both
induced by regularization. CIT and SVM algorithms show
comparable performance with low bias and moderate variance
(see Tab. 4). We explore the link between a classifier's ability to
generalize and the feature selection procedure through the CIT
model (see Appendix 1). It becomes evident that incorporating
a selection scheme does not improve the performance metric when
combined with classification learners that embed feature
selection. On the other hand, practitioners and researchers can
tackle the performance vs. runtime trade-off by explicitly including
a selection scheme in the machine learning pipeline; more
specifically, by combining a classifier whose runtime is sensitive
to the number of features (CIT, SVM, RF) with an efficient and
computationally cheap univariate filter procedure (EBM, OneR).
Comparable benefits can be observed in the EBM + SVM setup, which
reduces computational runtime by ~15 % and improves test set
TDL by ~5 % compared to the none + SVM setup (see
Appendix 2).
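The filter-then-classify setup described above can be sketched as a scikit-learn pipeline. Since EBM and OneR have no direct scikit-learn equivalents, SelectKBest with a mutual-information score serves here as a stand-in univariate filter; the dataset and all parameter values are illustrative assumptions.

```python
# Sketch of a cheap univariate filter feeding a feature-count-sensitive
# classifier (RBF-kernel SVM). SelectKBest stands in for EBM/OneR.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

pipe = Pipeline([
    # Univariate filter runs once and is cheap relative to the SVM.
    ("filter", SelectKBest(score_func=mutual_info_classif, k=10)),
    # SVM training cost then benefits from the halved feature count.
    ("svm", SVC(kernel="rbf", gamma="scale")),
])

score = cross_val_score(pipe, X, y, cv=5).mean()
print(round(score, 3))
```

Wrapping both steps in a Pipeline ensures the filter is refit inside each cross-validation fold, avoiding selection-induced information leakage into the test folds.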
To illustrate the relevance of other parts of a machine learning
solution, we compare the obtained results with selected research
papers that utilize the same dataset, although their primary
goals do not involve feature selection. We achieved performance
comparable with Verbeke et al. (2012); the main discrepancy
appears among the LOGIT models, where our incorporation of
interaction features in the data processing step leads to an
increase in test TDL by a factor of ~1.5. On the other hand, the
works of Vafeiadis et al. (2015) and Mehreen et al. (2017) exploit
concepts of meta-learning, which lead to an increase in test ACC
of ~5-10 % compared to our endeavors.
As for future research on selection procedures in the customer churn
domain, we suggest considering more datasets and conceptually
diverse classification learners. To explicitly address the trade-off
between the number of features and the information retained, multi-
objective optimization might be leveraged in novel types of
selection procedures. Another possible direction for research
involves feature selection ensembles, i.e., meta-learning selection
based on the votes of multiple selection methods. From the
enterprise perspective, adjusting feature selection
procedures to business objectives in order to analyze retention
drivers from a profit perspective might also be a topic of interest.
Literature:
1. Aggarwal, C.C., 2014. Data classification: algorithms and
applications, Boca Raton: Taylor & Francis.
2. Arauzo-Azofra, A., Benitez, J.M. & Castro, J.L., 2008.
Consistency measures for feature selection. Journal of Intelligent
Information Systems, 30(3), pp. 273-292. Available at:
http://link.springer.com/10.1007/s10844-007-0037-0.
3. Bagherzadeh-Khiabani, F. et al., 2016. A tutorial on variable
selection for clinical prediction models: feature selection
methods in data mining could improve the results. Journal of
Clinical Epidemiology, 71, pp. 76-85. Available at:
https://linkinghub.elsevier.com/retrieve/pii/S0895435615004667.
4. Bolón-Canedo, V., Sánchez-Maroño, N. & Alonso-Betanzos,
A., 2013. A review of feature selection methods on synthetic