Photo by Mantas Hesthaven on Unsplash

Customer churn, also known as customer attrition, occurs when customers stop doing business with a company. Companies are interested in identifying segments of these customers because the cost of acquiring a new customer is usually higher than that of retaining an existing one. For example, if Netflix knew which segment of its customers was at risk of churning, it could proactively engage them with special offers instead of simply losing them.

In this blog post, we will create a simple customer churn prediction model using the Telco Customer Churn dataset. We choose a decision tree to model churned customers, pandas for data crunching and matplotlib for visualizations. All of this is done in Python. The code can be reused with another dataset, with a few minor adjustments, to train the baseline model. We also provide a few references and ideas for new features and improvements.

You can run this code by downloading this Jupyter notebook.

Follow me on Twitter to get the latest updates.

Let’s get started.

Requirements

import platform
import pandas as pd
import sklearn
import numpy as np
import graphviz
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline
print('python version', platform.python_version())
print('pandas version', pd.__version__)
print('sklearn version', sklearn.__version__)
print('numpy version', np.__version__)
print('graphviz version', graphviz.__version__)
print('matplotlib version', matplotlib.__version__)
python version 3.7.0
pandas version 0.23.4
sklearn version 0.19.2
numpy version 1.15.1
graphviz version 0.10.1
matplotlib version 2.2.3

Data Preprocessing

We use pandas to read and preprocess the dataset. The Telco dataset has one customer per row with many columns (features). There are no rows with all values missing and no duplicates (which is rare in real-world datasets). There are 11 samples that have TotalCharges set to " ", which looks like a mistake in the data. We remove those samples and convert the column to a numeric type (float).

df = pd.read_csv('data/WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.shape
(7043, 21)
df.head()
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No ... No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes ... Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes ... No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes ... Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No ... No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes

5 rows × 21 columns

df = df.dropna(how="all")  # remove samples with all missing values
df.shape
(7043, 21)
df = df[~df.duplicated()] # remove duplicates
df.shape
(7043, 21)
total_charges_filter = df.TotalCharges == " "
df = df[~total_charges_filter]
df.shape
(7032, 21)
df.TotalCharges = pd.to_numeric(df.TotalCharges)

Exploratory Data Analysis

We have two types of features in the dataset: categorical (two or more values, without any intrinsic order) and numerical. Most of the feature names are self-explanatory, except for:

  • Partner: whether the customer has a partner or not (Yes, No),
  • Dependents: whether the customer has dependents or not (Yes, No),
  • OnlineBackup: whether the customer has online backup or not (Yes, No, No internet service),
  • tenure: number of months the customer has stayed with the company,
  • MonthlyCharges: the amount charged to the customer monthly,
  • TotalCharges: the total amount charged to the customer.

There are 7032 customers in the dataset and 19 features, excluding customerID (non-informative) and the Churn column (the target variable). Most of the categorical features have four or fewer unique values.

df.describe(include='all')
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
count 7032 7032 7032.000000 7032 7032 7032.000000 7032 7032 7032 7032 ... 7032 7032 7032 7032 7032 7032 7032 7032.000000 7032.000000 7032
unique 7032 2 NaN 2 2 NaN 2 3 3 3 ... 3 3 3 3 3 2 4 NaN NaN 2
top 7989-CHGTL Male NaN No No NaN Yes No Fiber optic No ... No No No No Month-to-month Yes Electronic check NaN NaN No
freq 1 3549 NaN 3639 4933 NaN 6352 3385 3096 3497 ... 3094 3472 2809 2781 3875 4168 2365 NaN NaN 5163
mean NaN NaN 0.162400 NaN NaN 32.421786 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 64.798208 2283.300441 NaN
std NaN NaN 0.368844 NaN NaN 24.545260 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 30.085974 2266.771362 NaN
min NaN NaN 0.000000 NaN NaN 1.000000 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 18.250000 18.800000 NaN
25% NaN NaN 0.000000 NaN NaN 9.000000 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 35.587500 401.450000 NaN
50% NaN NaN 0.000000 NaN NaN 29.000000 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 70.350000 1397.475000 NaN
75% NaN NaN 0.000000 NaN NaN 55.000000 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 89.862500 3794.737500 NaN
max NaN NaN 1.000000 NaN NaN 72.000000 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 118.750000 8684.800000 NaN

11 rows × 21 columns

We combine features into two lists so that we can analyze them jointly.

categorical_features = [
    "gender",
    "SeniorCitizen",
    "Partner",
    "Dependents",
    "PhoneService",
    "MultipleLines",
    "InternetService",
    "OnlineSecurity",
    "OnlineBackup",
    "DeviceProtection",
    "TechSupport",
    "StreamingTV",
    "StreamingMovies",
    "Contract",
    "PaperlessBilling",
    "PaymentMethod",
]
numerical_features = ["tenure", "MonthlyCharges", "TotalCharges"]
target = "Churn"

Feature distribution

We plot the distributions of numerical and categorical features to check for outliers and to compare feature distributions with the target variable.

Numerical features distribution

Numerical summary statistics (mean, standard deviation, etc.) don't show spikes or the shapes of distributions, and it is hard to spot outliers with them. That is why we use histograms.

df[numerical_features].describe()
tenure MonthlyCharges TotalCharges
count 7032.000000 7032.000000 7032.000000
mean 32.421786 64.798208 2283.300441
std 24.545260 30.085974 2266.771362
min 1.000000 18.250000 18.800000
25% 9.000000 35.587500 401.450000
50% 29.000000 70.350000 1397.475000
75% 55.000000 89.862500 3794.737500
max 72.000000 118.750000 8684.800000

At first glance, there aren't any outliers in the data: no data point is disconnected from the distribution or too far from the mean value. To confirm this, we would need to calculate the interquartile range (IQR) and show that the values of each numerical feature lie within 1.5 IQR of the first and third quartiles.
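
As a quick check (this snippet is an addition, not part of the original notebook), we can compute the IQR bounds directly and count how many values fall outside them:

q1 = df[numerical_features].quantile(0.25)
q3 = df[numerical_features].quantile(0.75)
iqr = q3 - q1
# count values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] per feature
outside = ((df[numerical_features] < q1 - 1.5 * iqr) |
           (df[numerical_features] > q3 + 1.5 * iqr)).sum()
print(outside)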

We could also convert numerical features to ordinal intervals. For example, tenure is numerical, but often we don't care about small numeric differences and would rather group customers into short, medium and long term tenure. One reason to convert a feature is to reduce noise; small fluctuations are often just noise.
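
For illustration, a minimal sketch with pd.cut (the bin edges here are arbitrary, chosen just for this example, and the result is not used later in the post):

# bucket tenure (in months) into three arbitrary ordinal intervals
tenure_group = pd.cut(df.tenure, bins=[0, 12, 48, 72], labels=['short', 'medium', 'long'])
print(tenure_group.value_counts())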

df[numerical_features].hist(bins=30, figsize=(10, 7))
Histograms of numerical features

We look at the distributions of numerical features in relation to the target variable. We can observe that the greater TotalCharges and tenure are, the lower the probability of churn.

fig, ax = plt.subplots(1, 3, figsize=(14, 4))
df[df.Churn == "No"][numerical_features].hist(bins=30, color="blue", alpha=0.5, ax=ax)
df[df.Churn == "Yes"][numerical_features].hist(bins=30, color="red", alpha=0.5, ax=ax)
Numerical features in relation to the target variable

Categorical feature distribution

To analyze categorical features, we use bar charts. We observe that senior citizens and customers without phone service are less represented in the data.

ROWS, COLS = 4, 4
fig, ax = plt.subplots(ROWS, COLS, figsize=(18, 18))
for i, categorical_feature in enumerate(categorical_features):
    row, col = i // COLS, i % COLS  # grid position of the current subplot
    df[categorical_feature].value_counts().plot('bar', ax=ax[row, col]).set_title(categorical_feature)
Distribution of categorical features

The next step is to look at categorical features in relation to the target variable. We do this only for the Contract feature. Users with a month-to-month contract are more likely to churn than users with long-term contracts.

feature = 'Contract'
fig, ax = plt.subplots(1, 2, figsize=(14, 4))
df[df.Churn == "No"][feature].value_counts().plot('bar', ax=ax[0]).set_title('not churned')
df[df.Churn == "Yes"][feature].value_counts().plot('bar', ax=ax[1]).set_title('churned')
Contract feature in relation to the target variable

Target variable distribution

The target variable distribution shows that we are dealing with an imbalanced problem: there are many more non-churned than churned users. A model could achieve high accuracy simply by mostly predicting the majority class, which in our example is the users who didn't churn.

A few things we can do to minimize the influence of an imbalanced dataset (the first option is sketched after the plot below):

  • resample the data (https://imbalanced-learn.readthedocs.io/en/stable/),
  • collect more samples,
  • use precision and recall as evaluation metrics.
df[target].value_counts().plot('bar').set_title('churned')
Target variable distribution
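
To make the imbalance concrete, and to sketch the resampling option from the list above (we don't use resampling in the rest of this post; the imbalanced-learn library offers richer strategies), we can upsample the minority class with sklearn's resample utility:

from sklearn.utils import resample

# the majority-class baseline: always predicting "No" is right ~73% of the time
print('majority class share: %.3f' % df[target].value_counts(normalize=True).max())

# upsample churned customers until both classes are equally represented
df_majority = df[df[target] == 'No']
df_minority = df[df[target] == 'Yes']
df_minority_upsampled = resample(df_minority, replace=True,
                                 n_samples=len(df_majority), random_state=42)
df_balanced = pd.concat([df_majority, df_minority_upsampled])
print(df_balanced[target].value_counts())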

Features

The Telco dataset is already grouped by customerID, so it is difficult to add new features. When working on churn prediction, we usually get a dataset that has one entry per customer session (customer activity in a certain period of time). Then we could add features like:

  • number of sessions before buying something,
  • average time per session,
  • time difference between sessions (frequent or less frequent customer),
  • whether a customer is active in only one country.

Sometimes we even have customer event data, which enables us to find patterns of customer behavior in relation to the outcome (churn).
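
As a hypothetical sketch: if we had a session-level DataFrame (here called df_sessions, with customerID, session_start, session_minutes and country columns, none of which exist in the Telco dataset), the features above could be derived with a groupby:

# df_sessions and all of its columns are hypothetical, for illustration only
session_features = df_sessions.groupby('customerID').agg({
    'session_start': 'count',     # number of sessions
    'session_minutes': 'mean',    # average time per session
    'country': 'nunique',         # 1 means the customer appears in one country only
})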

Encoding features

To prepare the dataset for modeling churn, we need to encode categorical features as numbers, e.g., mapping "Yes" and "No" to 1 and 0 so that the algorithm can work with the data. This step is called label encoding; in the modeling pipeline below, we additionally one-hot encode the label-encoded values.

from sklearn.preprocessing import LabelEncoder

categorical_feature_names = []
label_encoders = {}
for categorical in categorical_features + [target]:
    # fit a separate encoder per column and keep it for inverse lookups
    label_encoders[categorical] = LabelEncoder()
    df[categorical] = label_encoders[categorical].fit_transform(df[categorical])
    names = label_encoders[categorical].classes_.tolist()
    print('Label encoder %s - values: %s' % (categorical, names))
    if categorical == target:
        continue  # the target is not part of the feature names
    # build names like 'Contract_Month-to-month' for the one-hot encoded columns
    categorical_feature_names.extend([categorical + '_' + str(name) for name in names])
Label encoder gender - values: ['Female', 'Male']
Label encoder SeniorCitizen - values: [0, 1]
Label encoder Partner - values: ['No', 'Yes']
Label encoder Dependents - values: ['No', 'Yes']
Label encoder PhoneService - values: ['No', 'Yes']
Label encoder MultipleLines - values: ['No', 'No phone service', 'Yes']
Label encoder InternetService - values: ['DSL', 'Fiber optic', 'No']
Label encoder OnlineSecurity - values: ['No', 'No internet service', 'Yes']
Label encoder OnlineBackup - values: ['No', 'No internet service', 'Yes']
Label encoder DeviceProtection - values: ['No', 'No internet service', 'Yes']
Label encoder TechSupport - values: ['No', 'No internet service', 'Yes']
Label encoder StreamingTV - values: ['No', 'No internet service', 'Yes']
Label encoder StreamingMovies - values: ['No', 'No internet service', 'Yes']
Label encoder Contract - values: ['Month-to-month', 'One year', 'Two year']
Label encoder PaperlessBilling - values: ['No', 'Yes']
Label encoder PaymentMethod - values: ['Bank transfer (automatic)', 'Credit card (automatic)', 'Electronic check', 'Mailed check']
Label encoder Churn - values: ['No', 'Yes']
df.head()
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG 0 0 1 0 1 0 1 0 0 ... 0 0 0 0 0 1 2 29.85 29.85 0
1 5575-GNVDE 1 0 0 0 34 1 0 0 2 ... 2 0 0 0 1 0 3 56.95 1889.50 0
2 3668-QPYBK 1 0 0 0 2 1 0 0 2 ... 0 0 0 0 0 1 3 53.85 108.15 1
3 7795-CFOCW 1 0 0 0 45 0 1 0 2 ... 2 2 0 0 1 0 0 42.30 1840.75 0
4 9237-HQITU 0 0 0 0 2 1 0 1 0 ... 0 0 0 0 0 1 2 70.70 151.65 1

5 rows × 21 columns

Classifier

We use sklearn, a machine learning library in Python, to create the classifier. The sklearn way is to use a pipeline that defines the feature processing and the classifier. In our example, the pipeline takes the dataset as input, preprocesses the features and trains the classifier. When trained, it takes the same kind of input and returns predictions as output.

In the pipeline, we process categorical and numerical features separately. We one-hot encode categorical features and scale numerical features by removing the mean and scaling them to unit variance. We choose a decision tree model because of its interpretability and set its max depth to 3 (arbitrarily).

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn import tree
from sklearn.base import BaseEstimator, TransformerMixin


class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a subset of columns (key) from a DataFrame."""

    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, df):
        return df[self.key]

pipeline = Pipeline(
    [
        (
            "union",
            FeatureUnion(
                transformer_list=[
                    (
                        "categorical_features",
                        Pipeline(
                            [
                                ("selector", ItemSelector(key=categorical_features)),
                                ("onehot", OneHotEncoder()),
                            ]
                        ),
                    ),
                    (
                        "numerical_features",
                        Pipeline(
                            [
                                ("selector", ItemSelector(key=numerical_features)),
                                ("scaler", StandardScaler()),
                            ]
                        ),
                    ),
                ]
            ),
        ),
        ("classifier", tree.DecisionTreeClassifier(max_depth=3, random_state=42)),
    ]
)

Training the model

We split the dataset into train (75% of samples) and test (25% of samples) sets. We train (fit) the pipeline and make predictions.

from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.25, random_state=42)

pipeline.fit(df_train, df_train[target])
pred = pipeline.predict(df_test)

Testing the model

With classification_report we calculate precision and recall from the actual and predicted values. For class 1 (churned users) the model achieves 0.67 precision and 0.37 recall. Precision tells us what fraction of the users our classifier predicted as churned actually churned, while recall tells us what fraction of the actually churned users it found. In layman's terms, the classifier is not very good at catching churned users.

from sklearn.metrics import classification_report

print(classification_report(df_test[target], pred))
             precision    recall  f1-score   support

          0       0.81      0.94      0.87      1300
          1       0.67      0.37      0.48       458

avg / total       0.77      0.79      0.77      1758
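
To see the same trade-off in absolute numbers, we can print the confusion matrix (a small addition to the original notebook):

from sklearn.metrics import confusion_matrix

# rows: actual class (0 = not churned, 1 = churned), columns: predicted class
print(confusion_matrix(df_test[target], pred))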

Model interpretability

The decision tree model uses the Contract, MonthlyCharges, InternetService, TotalCharges, and tenure features to decide whether a customer will churn. These features separate churned customers from the others well, based on the split criteria in the decision tree.

Each customer sample traverses the tree and the final node gives the prediction. For example, if Contract_Month-to-month is:

  • equal to 0, we continue traversing the tree along the True branch,
  • equal to 1, we continue along the False branch,
  • not defined, the tree outputs class 0.

This is a great way to see how the model makes its decisions, and whether any features that shouldn't be there have sneaked into the model.

dot_data = tree.export_graphviz(
    pipeline.named_steps['classifier'],
    out_file=None,
    feature_names=categorical_feature_names + numerical_features,
    class_names=[str(el) for el in pipeline.named_steps['classifier'].classes_],
    filled=True,
    rounded=True,
    special_characters=True,
)
graph = graphviz.Source(dot_data)
graph
Interpretation of the Decision tree model
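
As a complementary check (another small addition to the original post), the tree's Gini-based feature importances confirm which features drive the splits:

classifier = pipeline.named_steps['classifier']
feature_names = categorical_feature_names + numerical_features
# sort features by how much they reduce impurity across the tree's splits
for name, importance in sorted(zip(feature_names, classifier.feature_importances_),
                               key=lambda pair: pair[1], reverse=True)[:5]:
    print('%s: %.3f' % (name, importance))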

Further reading