AD ALTA
JOURNAL OF INTERDISCIPLINARY RESEARCH
Support vector machine – A Gaussian radial basis function SVM
(RBF-SVM) is a non-parametric method that constructs a
hyperplane in a high-dimensional space with the largest
distance (maximum margin) to the borderline observations
(support vectors) while separating the classes. The RBF kernel
trick enables more complex boundaries in the original feature
space, which may lead to overfitting when there are not enough
observations (acc. Jin, Wang, 2012).
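The Gaussian RBF kernel behind this method can be written as $K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$. A minimal sketch of the kernel computation follows (in Python rather than the R implementation used later in the paper; the gamma values are illustrative):

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Identical points have similarity 1; similarity decays with distance.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))             # 1.0
print(rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.1))  # exp(-2.5) ~ 0.082
```

Small gamma widens the kernel (smoother boundaries); large gamma narrows it, which is one route to the overfitting noted above.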
4 Research methodology
4.1 Dataset
We utilize public telecommunication dataset, originally
published on UCI Machine learning repository, which is now
part of the C50 package in CRAN. The dataset is popular in
customer churn prediction research (see Verbeke et al., 2012;
Vafeiadis et al., 2015; Mehreen et al., 2017) enabling broader
discussion of results. It consists of 5000 observations, 19
explanatory variables (features), and 1 response variable (churn).
The features are largely based on transactional data. Observed
churn rate is 14.14 %.
4.2 Performance metrics
Accuracy – The performance of classification methods is routinely
evaluated with a confusion matrix and related measures. One of
the popular metrics is accuracy. It is defined as follows (Powers,
2011):

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$,    (7)

where the numerator is the number of correctly classified
positive ($TP$) and negative examples ($TN$), and the denominator
is the sum of correctly ($TP + TN$) and incorrectly
classified examples ($FP + FN$). Accuracy is used for its clear
interpretability; however, it is threshold dependent and is not
reliable when dealing with imbalanced classes.
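Eq. 7 can be computed directly. The numbers below assume the dataset's 14.14 % churn rate (707 of 5000 customers) and a hypothetical classifier that predicts "no churn" for everyone, illustrating why accuracy misleads on imbalanced classes:

```python
def accuracy(tp, tn, fp, fn):
    """Eq. 7: share of correctly classified examples."""
    return (tp + tn) / (tp + tn + fp + fn)

# A majority-class classifier on 5000 customers with 707 churners:
# it never flags churn (TP = 0, FN = 707), yet accuracy still looks high.
print(accuracy(tp=0, tn=4293, fp=0, fn=707))  # 0.8586
```

An accuracy of about 86 % here identifies zero churners, which is exactly the imbalance caveat stated above.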
Table 1. Churn dataset – variable names and data types

| Variable name                 | Description                              | R dtype |
|-------------------------------|------------------------------------------|---------|
| state                         |                                          | factor  |
| account_length                | number of months as an active user       | int     |
| area_code                     |                                          | factor  |
| international_plan            | has an international plan (yes/no)       | factor  |
| voice_mail_plan               | has a voicemail plan (yes/no)            | factor  |
| number_vmail_messages         | number of voice mail messages            | int     |
| total_day_minutes             | total sum of day call minutes            | num     |
| total_day_calls               | total number of day calls                | int     |
| total_day_charge              | total sum of day charge                  | num     |
| total_eve_minutes             | total sum of evening call minutes        | num     |
| total_eve_calls               | total number of evening calls            | int     |
| total_eve_charge              | total sum of evening charge              | num     |
| total_night_minutes           | total sum of night call minutes          | num     |
| total_night_calls             | total number of night calls              | int     |
| total_night_charge            | total sum of night charge                | num     |
| total_intl_minutes            | total sum of international call minutes  | num     |
| total_intl_calls              | total number of international calls      | int     |
| total_intl_charge             | total sum of international charge        | num     |
| number_customer_service_calls | number of calls to customer service      | int     |
| churn                         | response variable                        | logi    |

Source: author
Top-decile lift – In a retention campaign, only a fraction of
customers can be contacted and offered a discount or premium
service. To address that, top-decile lift, an extension of the lift
measure, is often applied. It is calculated as a ratio with the
churn rate among the customers in the top decile of churn
propensity (churn score) in the numerator and the churn rate of
the whole customer base in the denominator. Top-decile lift
is popular for its practical implications; however, it is
threshold dependent and ignores variations in fraction
selection (Verbeke et al., 2012).
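The ratio above can be sketched as follows (a hypothetical toy scoring, not the paper's data):

```python
def top_decile_lift(scores, labels):
    """Churn rate in the top 10% of scores divided by the overall churn rate."""
    ranked = sorted(zip(scores, labels), reverse=True)   # highest scores first
    k = max(1, len(ranked) // 10)                        # size of the top decile
    top_rate = sum(label for _, label in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate

# 20 customers, 4 churners (label == 1); the two highest-scored ones churn.
scores = [float(s) for s in range(20, 0, -1)]
labels = [1, 1] + [0] * 8 + [1, 1] + [0] * 8
print(top_decile_lift(scores, labels))  # 1.0 / 0.2 = 5.0
```

A lift of 5 means the contacted decile churns five times more often than the base; a model no better than random scoring yields a lift of about 1.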
Area under the receiver operating curve – A classification model
is expected to produce a churn score $s = s(x)$, which is a function
of the feature vector $x$; the probability density function of the
corresponding scores is described as $f_k(s)$, with cumulative
distribution function $F_k(s)$ and two classes $k \in \{0,1\}$. AUC is
then outlined in Eq. 8 (Hand, 2009).

$AUC = \int_{-\infty}^{\infty} F_0(s)\, f_1(s)\, ds$    (8)

AUC can be interpreted as the probability that a
randomly drawn member of class 0 will produce a lower churn
score than a randomly drawn member of class 1. AUC is the
most popular measure of classification performance due to its
threshold independence (acc. Bradley, 1997), albeit it suffers
from several conceptual issues (see Hand, 2009).
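The probabilistic interpretation above lends itself to a direct, if quadratic, sketch that compares every class-0 score with every class-1 score, counting ties as one half:

```python
def auc(scores_class0, scores_class1):
    """P(random class-0 score < random class-1 score); ties count as 1/2."""
    wins = sum(
        1.0 if s0 < s1 else 0.5 if s0 == s1 else 0.0
        for s0 in scores_class0
        for s1 in scores_class1
    )
    return wins / (len(scores_class0) * len(scores_class1))

print(auc([0.1, 0.2], [0.3, 0.4]))  # 1.0  (perfect separation)
print(auc([0.2, 0.4], [0.1, 0.3]))  # 0.25 (worse than chance)
```

This pairwise-comparison form makes the threshold independence visible: only the ordering of scores matters, never a cut-off.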
4.3 Experimental design and implementation
The performance of different feature selection techniques is
examined through a machine learning pipeline consisting of four
main steps – (1) data processing, (2) feature selection, (3) model
training and (4) model evaluation; their linkage is characterized
in Fig. 2. To ensure the stability of the outcomes, the process is
repeated 50 times. The pipeline is implemented in the R
language for statistical programming, specifically in
Microsoft R 3.5.1.
Data processing – The original churn dataset is randomly stratified
into a train set (60 % of examples) and a test set (40 % of
examples). Data transformations are fitted on the train set
and projected onto the test set to prevent data leakage. Non-binary
factor columns are encoded with the one-hot encoding scheme.
Numerical/integer features are expanded to 2nd-degree
interaction terms, which results in a total of 158 explanatory
variables. Consequently, all numerical/integer features are
centered and scaled. Features with near-zero variance are
removed.
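The fit-on-train, project-on-test convention for centering and scaling can be sketched as follows (pure Python with illustrative values; the paper's pipeline does this in R):

```python
def fit_scaler(train_col):
    """Estimate mean and sample standard deviation on the train set only."""
    mean = sum(train_col) / len(train_col)
    var = sum((v - mean) ** 2 for v in train_col) / (len(train_col) - 1)
    return mean, var ** 0.5

def scale(col, mean, sd):
    """Project train-set statistics onto any column (train or test)."""
    return [(v - mean) / sd for v in col]

train_col = [10.0, 20.0, 30.0]
test_col = [40.0]
mu, sd = fit_scaler(train_col)   # statistics never touch the test set
print(scale(test_col, mu, sd))   # [2.0]
```

Estimating the statistics on the full dataset instead would leak information about the test distribution into training, which is exactly what the split-then-transform ordering prevents.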
Feature selection – The processed train set serves as the only input
to the feature selection block. To address computational complexity
and class imbalance in the feature selection procedure, we
propose a balanced clustering method to reduce the number of
observations. The algorithm is described with pseudo-code in
Fig. 3. It is worth noting that the upper boundary for the
expected number of examples per class is limited by the properties
of the train set. The procedure is implemented with clustering