AD ALTA
JOURNAL OF INTERDISCIPLINARY RESEARCH
EXPLANATORY VARIABLE SELECTION WITH BALANCED CLUSTERING IN CUSTOMER
CHURN PREDICTION
ᵃMARTIN FRIDRICH
Department of Informatics, Faculty of Business and Management, Brno University of Technology, Kolejní 2906/4, 612 00 Brno, Czech Republic
email: ᵃfridrichmartin@yahoo.com
Abstract: The interest in customer relationship management has been fueled by the broad adoption of the customer-centric paradigm, rapid growth in data collection, and technological advances for more than the past 15 years. It has become hard to identify and interpret meaningful patterns in customer behavior; thus, the goal of the paper is to compare multiple explanatory variable selection procedures and their effect on a customer churn prediction model. Filter and wrapper concepts of variable selection are examined; moreover, the runtime of the machine learning pipeline is improved by the novel idea of balanced clustering. Classification learners are incorporated with regard to simplicity and interpretability (LOGIT, CIT) and complexity and proven performance on the given dataset (RF, RBF-SVM). In addition, we show that when combined with a learner capable of embedded feature selection, an explicit variable selection scheme does not necessarily lead to performance improvement. On the other hand, the RBF-SVM learner, which has no such ability, benefits from a relevant selection procedure in all expected aspects, including classification performance, runtime, problem comprehensibility, and data storage.
Keywords: customer churn prediction, customer relationship management, feature
selection, machine learning, variable importance
1 Introduction
Customer relationship management (CRM) became a topic of interest with the shift to the customer-centric paradigm. It aims to create, retain, and strengthen relationships with customers while maintaining profits and revenue. Over the past 15 years, CRM has been augmented by progress in data collection and technology, enabling its challenges to be tackled with a new set of tools, i.e., machine learning. An important objective of customer relationship management is to minimize customer churn, where the term customer churn refers to a customer's propensity to cease doing business with the company in a given period. Churn reduction is usually motivated by the difference between the underlying unit costs of customer acquisition and customer retention, even though there are more benefits to it (Gronwald, 2017; Gupta et al., 2004; Torkzadeh et al., 2006).
To retain customers, prediction models are required to identify early churn signals and flag customers at high risk of leaving. In an environment with rapid growth in data generation and collection, it becomes increasingly challenging to detect meaningful patterns and extract useful knowledge. Hence, the aim of the paper is to examine the explanatory variable selection procedure and its effect on the performance of the churn prediction model. It is generally assumed that explanatory variable selection improves learner prediction performance, the ability to generalize the problem, and comprehensibility, while reducing computational runtime and storage requirements (acc. Aggarwal, 2014; Bagherzadeh-Khiabani et al., 2016).
2 Explanatory variable selection
The merit of explanatory variable selection is to find a subset of explanatory variables that discriminates the response variable well. One can distinguish three procedure types – filter, wrapper, and others (embedded, hybrid); however, opinions on the matter might differ (Aggarwal, 2014; Bolón-Canedo et al., 2013; Bagherzadeh-Khiabani et al., 2016; Duda et al., 2012). We focus solely on filter and wrapper selection procedures. The task of dimensionality reduction is also tackled with feature extraction methods (PCA, LDA, CCA, Isomap, Autoencoder, etc.); since they project the original features into a new feature space and thereby lose the original comprehensibility (Aggarwal, 2014), they are not included.
Filter selection – FS relies on data properties without utilizing any classification learner. The procedure consists of two steps: (1) features are ranked according to a chosen criterion, (2) highly ranked features are selected. Univariate filters account only for the feature-class relationship, whereas multivariate filters explore the feature set-class relationship; hence, the former are inferior to the latter in handling redundant features.
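A minimal Python sketch of the two filter steps follows; the inputs X, y and the cutoff k are hypothetical placeholders, and mutual information merely stands in for any univariate ranking criterion.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def univariate_filter(X, y, k):
    """Step (1): rank features by a univariate criterion;
    step (2): keep the k highest-ranked features."""
    scores = mutual_info_classif(X, y)      # feature-class relevance scores
    top_idx = np.argsort(scores)[::-1][:k]  # indices of the k best features
    return X[:, top_idx], top_idx
```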
Wrapper selection – As opposed to FS, WS adopts a classification learner to estimate the quality of the feature set. Given a specific classification learner, wrapper selection consists of three steps: (1) searching for a subset of features, (2) evaluating the selected subset of features with the learner, (3) repeating (1) and (2) until a stopping criterion is met. WS outperforms FS in terms of the prediction quality of the final learner, although the procedure can be computationally very expensive.
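A greedy forward variant of the three-step loop can be sketched as follows, assuming a scikit-learn style learner; learner and max_features are hypothetical arguments, and cross-validated AUC is one possible evaluation criterion.

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def forward_wrapper(X, y, learner, max_features):
    """(1) propose candidate feature subsets, (2) score each subset with
    the learner via cross-validation, (3) stop when nothing improves."""
    selected, best = [], -np.inf
    while len(selected) < max_features:
        candidates = [f for f in range(X.shape[1]) if f not in selected]
        if not candidates:
            break
        scores = {f: cross_val_score(learner, X[:, selected + [f]], y,
                                     cv=5, scoring='roc_auc').mean()
                  for f in candidates}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best:  # stopping criterion: no improvement
            break
        selected.append(f_best)
        best = scores[f_best]
    return selected
```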
Others – In addition to FS and WS procedures, the scientific literature describes two more categories of selection methods: (1) embedded procedures – feature selection is included in the phase of learner fitting (e.g., logistic regression with L1 regularization, tree-based methods), which might reduce computational time; (2) hybrid procedures – usually a sequential combination of FS and WS methods.
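As a brief illustration of the embedded idea, an L1-penalized logistic regression zeroes out uninformative coefficients during fitting; the function name and the regularization strength C are hypothetical choices, not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def embedded_l1_selection(X, y, C=0.1):
    """The L1 penalty drives uninformative coefficients to exactly zero,
    so feature selection happens as a by-product of learner fitting."""
    learner = LogisticRegression(penalty='l1', solver='liblinear', C=C)
    learner.fit(X, y)
    return np.flatnonzero(learner.coef_[0])  # indices of surviving features
```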
The explanatory variable selection domain broadly intersects with the fields of machine learning (see Aggarwal, 2014; Arauzo-Azofra et al., 2008; Bolón-Canedo et al., 2013; Dash, Liu, 2003; Duda et al., 2012; Hall, 1999; Kononenko, 1994; Shakil Pervez, Farid, 2015) and biostatistics and high-throughput biology (see Bagherzadeh-Khiabani et al., 2016; Guyon et al., 2002; Gilhan et al., 2010; Zhu et al., 2010; Chu et al., 2011). In the customer churn domain, applications are limited and default to an evaluation of only a few feature selection/extraction methods (see Verbeke et al., 2012; Xiao et al., 2015; Spanoudes, Nguyen, 2017; Subramanya, Somani, 2017; Vijaya, Sivasankar, 2018). Hence, our goal is to examine the performance of multiple approaches to explanatory variable selection and to compare the results with literature utilizing the same customer churn dataset.
2.1 Filter selection
Fisher score – The Fisher score (FS) is a univariate selection method that returns feature ranks. Important features are expected to exhibit similar observed values within one class and different observed values across different classes. This intuition is denoted in formula (1), where $FS_i$ stands for the Fisher score of the i-th feature, $\mu_{ij}$ and $\rho_{ij}^2$ are the mean and variance of the i-th feature in the j-th class respectively, $n_j$ is the number of instances in the j-th class, and $\mu_i$ is the mean of the i-th feature (acc. Aggarwal, 2014; Bagherzadeh-Khiabani et al., 2016).
$FS_i = \frac{\sum_{j=1}^{K} n_j (\mu_{ij} - \mu_i)^2}{\sum_{j=1}^{K} n_j \rho_{ij}^2}$  (1)
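A direct transcription of formula (1) into Python, vectorized over features; X is an instances-by-features matrix and y the class vector, both hypothetical inputs.

```python
import numpy as np

def fisher_scores(X, y):
    """Formula (1): class-size-weighted scatter of per-class feature means
    around the overall mean, over the weighted within-class variance."""
    mu = X.mean(axis=0)                           # overall mean of each feature
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        n_j = Xc.shape[0]                         # instances in class j
        num += n_j * (Xc.mean(axis=0) - mu) ** 2  # n_j (mu_ij - mu_i)^2
        den += n_j * Xc.var(axis=0)               # n_j rho_ij^2
    return num / den
```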
Entropy-based measures – EBMs are based on the idea of measuring uncertainty, i.e., the unpredictability of a variable. In the paper, three types of information measures are examined: (1) information gain, (2) information gain ratio, and (3) the symmetrical uncertainty criterion.
Information gain (IG) is denoted in formula (2), where $H(f_i)$ represents the entropy of the i-th feature, $H(C)$ stands for the class entropy, and $H(f_i, C)$ amounts to the joint entropy of $f_i$ and $C$. Features with high IG are considered important; this also holds for IGR and SU (acc. Aggarwal, 2014; Bagherzadeh-Khiabani et al., 2016; Duda et al., 2012).
$IG(f_i, C) = H(f_i) + H(C) - H(f_i, C)$  (2)
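A minimal sketch of formula (2) for discrete data, assuming the feature has already been discretized; f is one integer-coded feature column and c the class vector, both hypothetical inputs.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (in bits) from occurrence counts."""
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(f, c):
    """Formula (2): IG(f, C) = H(f) + H(C) - H(f, C), where the last
    term is the joint entropy of the (feature value, class) pairs."""
    _, n_f = np.unique(f, return_counts=True)
    _, n_c = np.unique(c, return_counts=True)
    _, n_fc = np.unique(np.stack([f, c], axis=1), axis=0, return_counts=True)
    return entropy(n_f) + entropy(n_c) - entropy(n_fc)
```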
IG suffers from a bias towards multi-valued features; to correct for this, a different metric was proposed – the information gain ratio (IGR). IGR is denoted in formula (3) (acc. Aggarwal, 2014; Bagherzadeh-Khiabani et al., 2016; Duda et al., 2012).