IGR(X, Y) = IG(X, Y) / H(X)                                             (3)
IGR is limited by its asymmetry. To deal with both the lack of symmetry and the bias towards multi-valued features, the symmetrical uncertainty criterion (SU) was suggested. SU is denoted in formula (4) (acc. Aggarwal, 2014; Bagherzadeh-Khiabani et al., 2016; Duda et al., 2012).
SU(X, Y) = 2 · IG(X, Y) / (H(X) + H(Y))                                 (4)
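To make formula (4) concrete, the following minimal Python sketch (ours, not part of the referenced implementations; function names are illustrative) computes SU for two discrete variables from their empirical entropies.

# Illustrative sketch: symmetrical uncertainty for two discrete variables.
import numpy as np
from collections import Counter

def entropy(values):
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x, y):
    # IG(X, Y) = H(Y) - H(Y | X)
    h_y = entropy(y)
    h_y_given_x = 0.0
    for v in set(x):
        y_sub = [yi for xi, yi in zip(x, y) if xi == v]
        h_y_given_x += len(y_sub) / len(y) * entropy(y_sub)
    return h_y - h_y_given_x

def symmetrical_uncertainty(x, y):
    # Formula (4): SU(X, Y) = 2 * IG(X, Y) / (H(X) + H(Y))
    return 2.0 * information_gain(x, y) / (entropy(x) + entropy(y))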
OneR – OneR is a univariate selection method that returns feature ranks. It creates a root-level decision tree for each feature and the target class. For each such tree, the error rate is calculated. Features with a low error rate are considered important (acc. Bagherzadeh-Khiabani et al., 2016).
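A minimal sketch of the OneR ranking idea, assuming discretized feature values (the code and its names are illustrative, not the implementation used in the study):

# Illustrative sketch: rank features by the error rate of a one-level rule.
from collections import Counter, defaultdict

def oner_error_rate(feature_values, target):
    # For every feature value, predict the majority target class.
    by_value = defaultdict(list)
    for v, t in zip(feature_values, target):
        by_value[v].append(t)
    errors = sum(len(ts) - Counter(ts).most_common(1)[0][1]
                 for ts in by_value.values())
    return errors / len(target)

def oner_ranking(features, target):
    # features: dict mapping feature name -> list of (discretized) values.
    # A lower error rate means a more important feature.
    return sorted(features, key=lambda name: oner_error_rate(features[name], target))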
Relief – Relief is a multivariate selection method that returns feature ranks. It randomly samples observations and locates, for each, the nearest neighbor in the same and in a different target class; feature importance is adjusted accordingly. A significant feature set is assumed to have homogeneous values within each class and heterogeneous values across classes (Kononenko, 1994).
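A simplified sketch of the Relief weighting scheme, assuming numeric features scaled to [0, 1], a binary target stored as a NumPy array, and Manhattan distance; practical variants (e.g. ReliefF) add refinements such as multiple neighbors:

# Illustrative sketch: basic Relief feature weights.
import numpy as np

def relief_weights(X, y, n_samples=100, rng=None):
    rng = np.random.default_rng(rng)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_samples):
        i = rng.integers(n)
        same = np.where(y == y[i])[0]
        diff = np.where(y != y[i])[0]
        same = same[same != i]
        # Nearest hit (same class) and nearest miss (different class).
        hit = same[np.argmin(np.abs(X[same] - X[i]).sum(axis=1))]
        miss = diff[np.argmin(np.abs(X[diff] - X[i]).sum(axis=1))]
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / n_samples
    return w  # a larger weight means a more relevant feature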
Figure 1. Categorization of explanatory variable selection methods. Source: 3, 4
Correlation-based feature selection – CFS is a multivariate selection method that returns a feature subset. CFS measures how the features in a feature set are correlated with each other and with the target class. A feature set with high correlation with the class and low correlation amongst features is preferred. This intuition is denoted in formula (5), where Merit_S stands for the heuristic "merit" of a feature subset S consisting of k features, r̄_cf is the mean feature-class correlation for f ∈ S, and r̄_ff is the average feature-feature inter-correlation (Dash, Liu, 2003; Hall, 1999). The search through the feature subset space is done with the best-first forward search.
Merit_S = (k · r̄_cf) / √(k + k(k − 1) · r̄_ff)                           (5)
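Formula (5) itself is easy to evaluate once the two mean correlations are known; a small illustrative sketch (ours, with made-up example values):

# Illustrative sketch: CFS merit of a feature subset, formula (5).
import math

def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    return (k * mean_feature_class_corr) / math.sqrt(
        k + k * (k - 1) * mean_feature_feature_corr)

# Example: 5 features, r_cf = 0.4, r_ff = 0.2 -> merit of about 0.67
print(cfs_merit(5, 0.4, 0.2))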
Consistency-based filter – CBF is a multivariate selection method that returns a feature subset. CBF evaluates how consistently observations with the same set of feature values belong to the same target class (continuous feature values must be discretized). The algorithm finds a feature subset relying on Liu's consistency measure, denoted in formula (6), where the sum runs over the distinct patterns of feature values in the subset, |D_i| is the number of observations showing the i-th pattern, |M_i| is the number of those belonging to the pattern's majority class, and N is the total number of observations (acc. Arauzo-Azofra, 2008). The search through the feature subset space is again maintained with the best-first forward search.
Consistency(S) = 1 − (Σ_i (|D_i| − |M_i|)) / N                          (6)
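A minimal sketch of the consistency measure in formula (6), assuming already discretized features (illustrative code, not the filter implementation used in the study):

# Illustrative sketch: Liu's consistency measure for a feature subset.
from collections import Counter, defaultdict

def consistency(rows, target):
    # rows: list of tuples with the (discretized) values of the selected features.
    # target: list of class labels of the same length.
    groups = defaultdict(list)
    for pattern, cls in zip(rows, target):
        groups[pattern].append(cls)
    # Inconsistency of a pattern = its count minus the count of its majority class.
    inconsistencies = sum(len(cls_list) - Counter(cls_list).most_common(1)[0][1]
                          for cls_list in groups.values())
    return 1.0 - inconsistencies / len(target)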
2.2 Wrapper selection
Recursive feature elimination – RFE is a popular multivariate selection method that returns feature ranks. The algorithm fits a classification learner to the full feature set. Each feature is ranked using the classification learner (its coefficients / importance). At each iteration of the algorithm, top-ranked features are retained (low-ranked features are eliminated), and the classification learner is refit and scored. The feature set with the learner's best performance is chosen. RFE was originally proposed with linear SVM (see Guyon et al., 2002); the procedure, however, can be utilized with different classification learners. We combine RFE with the classification methods LOGIT, RF and RBF-SVM. The author considers RFE to be a wrapper selection procedure based on the implementation used (see Kuhn, 2008); however, opinions on the matter differ (see Aggarwal, 2014; Guyon et al., 2002).
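As an illustration of the RFE procedure (the study itself relies on the R implementation of Kuhn, 2008), a sketch using scikit-learn's RFE wrapped around L2-regularized logistic regression on synthetic data:

# Illustrative sketch: recursive feature elimination with a linear learner.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
selector = RFE(estimator=LogisticRegression(max_iter=1000),
               n_features_to_select=5, step=1)
selector.fit(X, y)
print(selector.ranking_)  # rank 1 = retained feature, higher = eliminated earlier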
3 Classification methods
From a machine learning viewpoint, customer churn prediction is perceived as a binary classification problem with the purpose of assigning observations (customers) into one of two classes (churners, non-churners). There is a vast amount of research dedicated to classification method selection; however, we have decided to apply (1) simple and interpretable classification methods (LOGIT, CIT) and (2) more complex classification methods with proven performance considering the given dataset (RF, RBF-SVM), acc. Verbeke et al. (2012).
Logistic regression – LOGIT is a parametric statistical method which estimates the probability of an event (discrete response variable) based on known circumstances (explanatory variables). LOGIT models tend to suffer from the influence of confounding factors and from overfitting; to prevent that, we used LOGIT with the L1 and L2 regularization forms (Fan et al., 2008). LOGIT is straightforward to understand and interpret; it is also broadly used as a classification baseline.
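A brief sketch of L1- and L2-regularized logistic regression via the LIBLINEAR solver of Fan et al. (2008) as exposed in scikit-learn; the synthetic data and parameter values are placeholders, not those of the study:

# Illustrative sketch: regularized logistic regression baselines.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
# L1 regularization encourages sparse coefficients; L2 shrinks them towards zero.
logit_l1 = make_pipeline(StandardScaler(),
                         LogisticRegression(penalty="l1", C=1.0, solver="liblinear"))
logit_l2 = make_pipeline(StandardScaler(),
                         LogisticRegression(penalty="l2", C=1.0, solver="liblinear"))
for model in (logit_l1, logit_l2):
    model.fit(X, y)
    print(model.score(X, y))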
Conditional inference tree – CIT is a non-parametric decision tree (DT) method. Common implementations of DT tend to overfit and endure bias towards selected features. To address that, Hothorn et al. (2006) propose to base the splitting criterion on resampling and multiple inference tests, resulting in CIT. Its prediction ability is proven to be on par with pruned DT, with no bias towards selected explanatory variables (see Hothorn et al., 2006; Hothorn, Zeileis, 2015).
Random forest – RF is a non-parametric ensemble method which combines DTs such that each tree is built on a bootstrap sample of observations (drawn with replacement) and considers a randomly sampled subset of explanatory variables at each split; votes of the individual DTs are aggregated to form the prediction (Breiman, 2001). RF models are relatively resistant to overfitting and often produce satisfying prediction results without extensive hyperparameter search.
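A short illustrative sketch of a random forest classifier on a churn-like, imbalanced synthetic dataset (placeholders only, not the study's data or settings):

# Illustrative sketch: random forest churn classifier with default-ish settings.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.8, 0.2],
                           random_state=0)  # imbalanced, churn-like target
rf = RandomForestClassifier(n_estimators=500, random_state=0)
print(cross_val_score(rf, X, y, cv=5, scoring="roc_auc").mean())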