AD ALTA
JOURNAL OF INTERDISCIPLINARY RESEARCH
Support vector machine – A Gaussian radial basis function SVM
(RBF-SVM) is a non-parametric method that constructs a
hyperplane in a high-dimensional space with the largest
distance (maximum margin) to the borderline observations
(support vectors) while separating the classes. The RBF kernel
trick enables more complex boundaries in the original feature
space, which may lead to overfitting when there are not enough
observations (acc. Jin, Wang, 2012).
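The Gaussian RBF kernel behind this method can be written as $K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$. A minimal sketch of the kernel computation follows (in Python rather than the R implementation used later in the paper; the gamma values are illustrative):

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Identical points have similarity 1; similarity decays with distance.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))             # 1.0
print(rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.1))  # exp(-2.5) ~ 0.082
```

Small gamma widens the kernel (smoother boundaries); large gamma narrows it, which is one route to the overfitting noted above.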
4 Research methodology
4.1 Dataset
We utilize public telecommunication dataset, originally
published on UCI Machine learning repository, which is now
part of the C50 package in CRAN. The dataset is popular in
customer churn prediction research (see Verbeke et al., 2012;
Vafeiadis et al., 2015; Mehreen et al., 2017) enabling broader
discussion of results. It consists of 5000 observations, 19
explanatory variables (features), and 1 response variable (churn).
The features are largely based on transactional data. Observed
churn rate is 14.14 %.
4.2 Performance metrics
Accuracy – The performance of classification methods is routinely
evaluated with a confusion matrix and related measures. One of
the popular metrics is accuracy. It is defined as follows (Powers,
2011):

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$,    (7)

where the numerator is the number of correctly classified
positive ($TP$) and negative examples ($TN$), and the denominator
is the sum of correctly ($TP + TN$) and incorrectly
classified examples ($FP + FN$). Accuracy is used for its clear
interpretability; however, it is threshold dependent and is not
reliable when dealing with imbalanced classes.
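Eq. 7 can be computed directly. The numbers below assume the dataset's 14.14 % churn rate (707 of 5000 customers) and a hypothetical classifier that predicts "no churn" for everyone, illustrating why accuracy misleads on imbalanced classes:

```python
def accuracy(tp, tn, fp, fn):
    """Eq. 7: share of correctly classified examples."""
    return (tp + tn) / (tp + tn + fp + fn)

# A majority-class classifier on 5000 customers with 707 churners:
# it never flags churn (TP = 0, FN = 707), yet accuracy still looks high.
print(accuracy(tp=0, tn=4293, fp=0, fn=707))  # 0.8586
```

An accuracy of about 86 % here identifies zero churners, which is exactly the imbalance caveat stated above.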
Table 1. Churn dataset – variable names and data types

| Variable name                 | Description                              | R dtype |
|-------------------------------|------------------------------------------|---------|
| state                         |                                          | factor  |
| account_length                | number of months as an active user       | int     |
| area_code                     |                                          | factor  |
| international_plan            | has an international plan (yes/no)       | factor  |
| voice_mail_plan               | has a voicemail plan (yes/no)            | factor  |
| number_vmail_messages         | number of voice mail messages            | int     |
| total_day_minutes             | total sum of day call minutes            | num     |
| total_day_calls               | total number of day calls                | int     |
| total_day_charge              | total sum of day charge                  | num     |
| total_eve_minutes             | total sum of evening call minutes        | num     |
| total_eve_calls               | total number of evening calls            | int     |
| total_eve_charge              | total sum of evening charge              | num     |
| total_night_minutes           | total sum of night call minutes          | num     |
| total_night_calls             | total number of night calls              | int     |
| total_night_charge            | total sum of night charge                | num     |
| total_intl_minutes            | total sum of international call minutes  | num     |
| total_intl_calls              | total number of international calls      | int     |
| total_intl_charge             | total sum of international charge        | num     |
| number_customer_service_calls | number of calls to customer service      | int     |
| churn                         | response variable                        | logi    |

Source: author
Top-decile lift – In a retention campaign, only a fraction of
customers can be contacted and offered a discount or premium
service. To address that, top-decile lift, an extension of the lift
measure, is often applied. It is calculated as a ratio with the
churn rate among the customers in the top decile of churn
propensity (churn score) in the numerator and the churn rate of
the whole customer base in the denominator. Top-decile lift
is popular for its practical implications; however, it is
threshold dependent and ignores variations in fraction
selection (Verbeke et al., 2012).
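The ratio above can be sketched as follows (a hypothetical toy scoring, not the paper's data):

```python
def top_decile_lift(scores, labels):
    """Churn rate in the top 10% of scores divided by the overall churn rate."""
    ranked = sorted(zip(scores, labels), reverse=True)   # highest scores first
    k = max(1, len(ranked) // 10)                        # size of the top decile
    top_rate = sum(label for _, label in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate

# 20 customers, 4 churners (label == 1); the two highest-scored ones churn.
scores = [float(s) for s in range(20, 0, -1)]
labels = [1, 1] + [0] * 8 + [1, 1] + [0] * 8
print(top_decile_lift(scores, labels))  # 1.0 / 0.2 = 5.0
```

A lift of 5 means the contacted decile churns five times more often than the base; a model no better than random scoring yields a lift of about 1.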
Area under the receiver operating curve – A classification model
is expected to produce a churn score $s = s(x)$, which is a function
of the feature vector $x$; the probability density function of the
corresponding scores is described as $f_k(s)$, with cumulative
distribution function $F_k(s)$ and two classes $k \in \{0,1\}$. AUC is
then outlined in Eq. 8 (Hand, 2009).

$AUC = \int_{-\infty}^{\infty} F_0(s)\, f_1(s)\, ds$    (8)

AUC can be interpreted as the probability that a
randomly drawn member of class 0 will produce a lower churn
score than a randomly drawn member of class 1. AUC is the
most popular measure of classification performance due to its
threshold independence (acc. Bradley, 1997), albeit it suffers
from several conceptual issues (see Hand, 2009).
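The probabilistic interpretation above lends itself to a direct, if quadratic, sketch that compares every class-0 score with every class-1 score, counting ties as one half:

```python
def auc(scores_class0, scores_class1):
    """P(random class-0 score < random class-1 score); ties count as 1/2."""
    wins = sum(
        1.0 if s0 < s1 else 0.5 if s0 == s1 else 0.0
        for s0 in scores_class0
        for s1 in scores_class1
    )
    return wins / (len(scores_class0) * len(scores_class1))

print(auc([0.1, 0.2], [0.3, 0.4]))  # 1.0  (perfect separation)
print(auc([0.2, 0.4], [0.1, 0.3]))  # 0.25 (worse than chance)
```

This pairwise-comparison form makes the threshold independence visible: only the ordering of scores matters, never a cut-off.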
4.3 Experimental design and implementation
The performance of different feature selection techniques is
examined through a machine learning pipeline consisting of four
main steps – (1) data processing, (2) feature selection, (3) model
training and (4) model evaluation; their linkage is characterized
in Fig. 2. To ensure the stability of the outcomes, the process is
repeated 50 times. The pipeline is implemented in the R
language for statistical programming, specifically in
Microsoft R 3.5.1.
Data processing – The original churn dataset is randomly stratified
into a train set (60 % of examples) and a test set (40 % of
examples). Data transformations are fitted on the train set
and projected onto the test set to prevent data leakage. Non-binary
factor columns are encoded with the one-hot encoding scheme.
Numerical/integer features are expanded to 2nd-degree
interaction terms, which results in a total of 158 explanatory
variables. Consequently, all numerical/integer features are
centered and scaled. Features with near-zero variance are
removed.
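The fit-on-train, project-on-test convention for centering and scaling can be sketched as follows (pure Python with illustrative values; the paper's pipeline does this in R):

```python
def fit_scaler(train_col):
    """Estimate mean and sample standard deviation on the train set only."""
    mean = sum(train_col) / len(train_col)
    var = sum((v - mean) ** 2 for v in train_col) / (len(train_col) - 1)
    return mean, var ** 0.5

def scale(col, mean, sd):
    """Project train-set statistics onto any column (train or test)."""
    return [(v - mean) / sd for v in col]

train_col = [10.0, 20.0, 30.0]
test_col = [40.0]
mu, sd = fit_scaler(train_col)   # statistics never touch the test set
print(scale(test_col, mu, sd))   # [2.0]
```

Estimating the statistics on the full dataset instead would leak information about the test distribution into training, which is exactly what the split-then-transform ordering prevents.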
Feature selection – The processed train set serves as the only input
to the feature selection block. To address computational complexity
and class imbalance in the feature selection procedure, we
propose a balanced clustering method to reduce the number of
observations. The algorithm is described with pseudo-code in
Fig. 3. It is worth noting that the upper boundary for the
expected number of examples per class is limited by the properties
of the train set. The procedure is implemented with clustering