{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Churn prediction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Customer churn, also known as customer attrition, occurs when customers stop doing business with a company. Companies are interested in identifying segments of these customers because acquiring a new customer usually costs more than retaining an existing one. For example, if Netflix knew which segment of customers was at risk of churning, it could proactively engage them with special offers instead of simply losing them.\n", "\n", "In this blog post, we will create a simple customer churn prediction model using the [Telco Customer Churn dataset](https://www.kaggle.com/blastchar/telco-customer-churn). We chose a decision tree to model churned customers, pandas for data crunching and matplotlib for visualizations. We do all of this in Python.\n", "The code can be used with another dataset with a few minor adjustments to train the baseline model. We also provide a few references and give ideas for new features and improvements.\n", "\n", "You can run this code by downloading this [Jupyter notebook]({{site.url}}/assets/notebooks/2019-01-25-churn-prediction).\n", "\n", "Follow me on [Twitter](https://twitter.com/romanorac) to get the latest updates.\n", "\n", "Let's get started." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Requirements" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import platform\n", "import pandas as pd\n", "import sklearn\n", "import numpy as np\n", "import graphviz\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "python version 3.7.0\n", "pandas version 0.23.4\n", "sklearn version 0.19.2\n", "numpy version 1.15.1\n", "graphviz version 0.10.1\n", "matplotlib version 2.2.3\n" ] } ], "source": [ "print('python version', platform.python_version())\n", "print('pandas version', pd.__version__)\n", "print('sklearn version', sklearn.__version__)\n", "print('numpy version', np.__version__)\n", "print('graphviz version', graphviz.__version__)\n", "print('matplotlib version', matplotlib.__version__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preprocessing\n", "\n", "We use pandas to read the dataset and preprocess it. The Telco dataset has one customer per row, with many columns (features).\n", "There aren't any rows with all missing values or duplicates (this rarely happens with real-world datasets).\n", "There are 11 samples that have TotalCharges set to \" \", which seems like a mistake in the data. We remove those samples and convert the column to a numeric (float) type." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(7043, 21)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('data/WA_Fn-UseC_-Telco-Customer-Churn.csv')\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " customerID gender SeniorCitizen Partner Dependents tenure PhoneService \\\n", "0 7590-VHVEG Female 0 Yes No 1 No \n", "1 5575-GNVDE Male 0 No No 34 Yes \n", "2 3668-QPYBK Male 0 No No 2 Yes \n", "3 7795-CFOCW Male 0 No No 45 No \n", "4 9237-HQITU Female 0 No No 2 Yes \n", "\n", " MultipleLines InternetService OnlineSecurity ... DeviceProtection \\\n", "0 No phone service DSL No ... No \n", "1 No DSL Yes ... Yes \n", "2 No DSL Yes ... No \n", "3 No phone service DSL Yes ... Yes \n", "4 No Fiber optic No ... No \n", "\n", " TechSupport StreamingTV StreamingMovies Contract PaperlessBilling \\\n", "0 No No No Month-to-month Yes \n", "1 No No No One year No \n", "2 No No No Month-to-month Yes \n", "3 Yes No No One year No \n", "4 No No No Month-to-month Yes \n", "\n", " PaymentMethod MonthlyCharges TotalCharges Churn \n", "0 Electronic check 29.85 29.85 No \n", "1 Mailed check 56.95 1889.5 No \n", "2 Mailed check 53.85 108.15 Yes \n", "3 Bank transfer (automatic) 42.30 1840.75 No \n", "4 Electronic check 70.70 151.65 Yes \n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(7043, 21)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.dropna(how=\"all\") # remove samples with all missing values\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(7043, 21)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df[~df.duplicated()] # remove duplicates\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(7032, 21)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "total_charges_filter 
= df.TotalCharges == \" \"\n", "df = df[~total_charges_filter]\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "df.TotalCharges = pd.to_numeric(df.TotalCharges)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploratory Data Analysis\n", "\n", "The dataset has two types of features: categorical (two or more values without any order) and numerical. Most of the feature names are self-explanatory, except for:\n", " - Partner: whether the customer has a partner or not (Yes, No),\n", " - Dependents: whether the customer has dependents or not (Yes, No),\n", " - OnlineBackup: whether the customer has online backup or not (Yes, No, No internet service),\n", " - tenure: number of months the customer has stayed with the company,\n", " - MonthlyCharges: the amount charged to the customer monthly,\n", " - TotalCharges: the total amount charged to the customer.\n", "\n", "There are 7032 customers in the dataset and 19 features, excluding customerID (non-informative) and the Churn column (the target variable). None of the categorical features has more than 4 unique values." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " customerID gender SeniorCitizen Partner Dependents tenure \\\n", "count 7032 7032 7032.000000 7032 7032 7032.000000 \n", "unique 7032 2 NaN 2 2 NaN \n", "top 7989-CHGTL Male NaN No No NaN \n", "freq 1 3549 NaN 3639 4933 NaN \n", "mean NaN NaN 0.162400 NaN NaN 32.421786 \n", "std NaN NaN 0.368844 NaN NaN 24.545260 \n", "min NaN NaN 0.000000 NaN NaN 1.000000 \n", "25% NaN NaN 0.000000 NaN NaN 9.000000 \n", "50% NaN NaN 0.000000 NaN NaN 29.000000 \n", "75% NaN NaN 0.000000 NaN NaN 55.000000 \n", "max NaN NaN 1.000000 NaN NaN 72.000000 \n", "\n", " PhoneService MultipleLines InternetService OnlineSecurity ... \\\n", "count 7032 7032 7032 7032 ... \n", "unique 2 3 3 3 ... \n", "top Yes No Fiber optic No ... \n", "freq 6352 3385 3096 3497 ... \n", "mean NaN NaN NaN NaN ... \n", "std NaN NaN NaN NaN ... \n", "min NaN NaN NaN NaN ... \n", "25% NaN NaN NaN NaN ... \n", "50% NaN NaN NaN NaN ... \n", "75% NaN NaN NaN NaN ... \n", "max NaN NaN NaN NaN ... \n", "\n", " DeviceProtection TechSupport StreamingTV StreamingMovies \\\n", "count 7032 7032 7032 7032 \n", "unique 3 3 3 3 \n", "top No No No No \n", "freq 3094 3472 2809 2781 \n", "mean NaN NaN NaN NaN \n", "std NaN NaN NaN NaN \n", "min NaN NaN NaN NaN \n", "25% NaN NaN NaN NaN \n", "50% NaN NaN NaN NaN \n", "75% NaN NaN NaN NaN \n", "max NaN NaN NaN NaN \n", "\n", " Contract PaperlessBilling PaymentMethod MonthlyCharges \\\n", "count 7032 7032 7032 7032.000000 \n", "unique 3 2 4 NaN \n", "top Month-to-month Yes Electronic check NaN \n", "freq 3875 4168 2365 NaN \n", "mean NaN NaN NaN 64.798208 \n", "std NaN NaN NaN 30.085974 \n", "min NaN NaN NaN 18.250000 \n", "25% NaN NaN NaN 35.587500 \n", "50% NaN NaN NaN 70.350000 \n", "75% NaN NaN NaN 89.862500 \n", "max NaN NaN NaN 118.750000 \n", "\n", " TotalCharges Churn \n", "count 7032.000000 7032 \n", "unique NaN 2 \n", "top NaN No \n", "freq NaN 5163 \n", "mean 2283.300441 NaN \n", "std 2266.771362 NaN \n", "min 18.800000 NaN \n", "25% 401.450000 NaN 
\n", "50% 1397.475000 NaN \n", "75% 3794.737500 NaN \n", "max 8684.800000 NaN \n", "\n", "[11 rows x 21 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe(include='all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We combine features into two lists so that we can analyze them jointly." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "categorical_features = [\n", "    \"gender\",\n", "    \"SeniorCitizen\",\n", "    \"Partner\",\n", "    \"Dependents\",\n", "    \"PhoneService\",\n", "    \"MultipleLines\",\n", "    \"InternetService\",\n", "    \"OnlineSecurity\",\n", "    \"OnlineBackup\",\n", "    \"DeviceProtection\",\n", "    \"TechSupport\",\n", "    \"StreamingTV\",\n", "    \"StreamingMovies\",\n", "    \"Contract\",\n", "    \"PaperlessBilling\",\n", "    \"PaymentMethod\",\n", "]\n", "numerical_features = [\"tenure\", \"MonthlyCharges\", \"TotalCharges\"]\n", "target = \"Churn\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Feature distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We plot distributions for numerical and categorical features to check for outliers and to compare feature distributions against the target variable." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Numerical features distribution\n", "\n", "Numeric summaries (mean, standard deviation, etc.) don't show spikes or the shape of a distribution, and it is hard to spot outliers with them. That is why we also use histograms." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " tenure MonthlyCharges TotalCharges\n", "count 7032.000000 7032.000000 7032.000000\n", "mean 32.421786 64.798208 2283.300441\n", "std 24.545260 30.085974 2266.771362\n", "min 1.000000 18.250000 18.800000\n", "25% 9.000000 35.587500 401.450000\n", "50% 29.000000 70.350000 1397.475000\n", "75% 55.000000 89.862500 3794.737500\n", "max 72.000000 118.750000 8684.800000" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[numerical_features].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At first glance, there aren't any outliers in the data. No data point is disconnected from distribution or too far from the mean value. To confirm that we would need to calculate [interquartile range (IQR)](https://www.purplemath.com/modules/boxwhisk3.htm) and show that values of each numerical feature are within the 1.5 IQR from first and third quartile. \n", "\n", "We could convert numerical features to ordinal intervals. For example, tenure is numerical, but often we don't care about small numeric differences and instead group tenure to customers with short, medium and long term tenure. One reason to convert it would be to reduce the noise, often small fluctuates are just noise." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[,\n", " ],\n", " [,\n", " ]],\n", " dtype=object)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df[numerical_features].hist(bins=30, figsize=(10, 7))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We look at distributions of numerical features in relation to the target variable. We can observe that the greater TotalCharges and tenure are the less is the probability of churn." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([,\n", " ,\n", " ],\n", " dtype=object)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(1, 3, figsize=(14, 4))\n", "df[df.Churn == \"No\"][numerical_features].hist(bins=30, color=\"blue\", alpha=0.5, ax=ax)\n", "df[df.Churn == \"Yes\"][numerical_features].hist(bins=30, color=\"red\", alpha=0.5, ax=ax)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Categorical feature distribution\n", "\n", "To analyze categorical features, we use bar charts. We observe that Senior citizens and customers without phone service are less represented in the data." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "ROWS, COLS = 4, 4\n", "fig, ax = plt.subplots(ROWS, COLS, figsize=(18, 18))\n", "row, col = 0, 0\n", "for i, categorical_feature in enumerate(categorical_features):\n", " if col == COLS - 1:\n", " row += 1\n", " col = i % COLS\n", " df[categorical_feature].value_counts().plot('bar', ax=ax[row, col]).set_title(categorical_feature)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next step is to look at categorical features in relation to the target variable. We do this only for contract feature. Users who have a month-to-month contract are more likely to churn than users with long term contracts." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5,1,'churned')" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "feature = 'Contract'\n", "fig, ax = plt.subplots(1, 2, figsize=(14, 4))\n", "df[df.Churn == \"No\"][feature].value_counts().plot('bar', ax=ax[0]).set_title('not churned')\n", "df[df.Churn == \"Yes\"][feature].value_counts().plot('bar', ax=ax[1]).set_title('churned')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Target variable distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Target variable distribution shows that we are dealing with an imbalanced problem as there are many more non-churned as churned users. The model would achieve high accuracy as it would mostly predict majority class - users who didn't churn in our example.\n", "\n", "Few things we can do to minimize the influence of imbalanced dataset:\n", "- resample data (https://imbalanced-learn.readthedocs.io/en/stable/),\n", "- collect more samples,\n", "- use precision and recall as accuracy metrics." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5,1,'churned')" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df[target].value_counts().plot('bar').set_title('churned')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Features\n", "\n", "Telco dataset is already grouped by customerID so it is difficult to add new features. When working on the churn prediction we usually get a dataset that has one entry per customer session (customer activity in a certain time). Then we could add features like: \n", " - number of sessions before buying something,\n", " - average time per session,\n", " - time difference between sessions (frequent or less frequent customer),\n", " - is a customer only in one country.\n", "\n", "Sometimes we even have customer event data, which enables us to find patterns of customer behavior in relation to the outcome (churn)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Encoding features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To prepare the dataset for modeling churn, we need to encode categorical features to numbers. This means encoding \"Yes\", \"No\" to 0 and 1 so that algorithm can work with the data. This process is called [onehot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/)." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Label encoder gender - values: ['Female', 'Male']\n", "Label encoder SeniorCitizen - values: [0, 1]\n", "Label encoder Partner - values: ['No', 'Yes']\n", "Label encoder Dependents - values: ['No', 'Yes']\n", "Label encoder PhoneService - values: ['No', 'Yes']\n", "Label encoder MultipleLines - values: ['No', 'No phone service', 'Yes']\n", "Label encoder InternetService - values: ['DSL', 'Fiber optic', 'No']\n", "Label encoder OnlineSecurity - values: ['No', 'No internet service', 'Yes']\n", "Label encoder OnlineBackup - values: ['No', 'No internet service', 'Yes']\n", "Label encoder DeviceProtection - values: ['No', 'No internet service', 'Yes']\n", "Label encoder TechSupport - values: ['No', 'No internet service', 'Yes']\n", "Label encoder StreamingTV - values: ['No', 'No internet service', 'Yes']\n", "Label encoder StreamingMovies - values: ['No', 'No internet service', 'Yes']\n", "Label encoder Contract - values: ['Month-to-month', 'One year', 'Two year']\n", "Label encoder PaperlessBilling - values: ['No', 'Yes']\n", "Label encoder PaymentMethod - values: ['Bank transfer (automatic)', 'Credit card (automatic)', 'Electronic check', 'Mailed check']\n", "Label encoder Churn - values: ['No', 'Yes']\n" ] } ], "source": [ "from sklearn.preprocessing import LabelEncoder\n", "\n", "categorical_feature_names = []\n", "label_encoders = {}\n", "for categorical in categorical_features + [target]:\n", "    label_encoders[categorical] = LabelEncoder()\n", "    df[categorical] = label_encoders[categorical].fit_transform(df[categorical])\n", "    names = label_encoders[categorical].classes_.tolist()\n", "    print('Label encoder %s - values: %s' % (categorical, names))\n", "    if categorical == target:\n", "        continue\n", "    categorical_feature_names.extend([categorical + '_' + str(name) for name in names])" ] }, { "cell_type": "code", "execution_count": 18, 
"metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " customerID gender SeniorCitizen Partner Dependents tenure \\\n", "0 7590-VHVEG 0 0 1 0 1 \n", "1 5575-GNVDE 1 0 0 0 34 \n", "2 3668-QPYBK 1 0 0 0 2 \n", "3 7795-CFOCW 1 0 0 0 45 \n", "4 9237-HQITU 0 0 0 0 2 \n", "\n", " PhoneService MultipleLines InternetService OnlineSecurity ... \\\n", "0 0 1 0 0 ... \n", "1 1 0 0 2 ... \n", "2 1 0 0 2 ... \n", "3 0 1 0 2 ... \n", "4 1 0 1 0 ... \n", "\n", " DeviceProtection TechSupport StreamingTV StreamingMovies Contract \\\n", "0 0 0 0 0 0 \n", "1 2 0 0 0 1 \n", "2 0 0 0 0 0 \n", "3 2 2 0 0 1 \n", "4 0 0 0 0 0 \n", "\n", " PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn \n", "0 1 2 29.85 29.85 0 \n", "1 0 3 56.95 1889.50 0 \n", "2 1 3 53.85 108.15 1 \n", "3 0 0 42.30 1840.75 0 \n", "4 1 2 70.70 151.65 1 \n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classifier\n", "\n", "We use sklearn, a Machine Learning library in Python, to create a classifier.\n", "The sklearn way is to use pipelines that define feature processing and the classifier. In our example, the pipeline takes a dataset in the input, it preprocesses features and trains the classifier.\n", "When trained, it takes the same input and returns predictions in the output. \n", "\n", "In the pipeline, we separately process categorical and numerical features. We onehot encode categorical features and scale numerical features by removing the mean and scaling them to unit variance.\n", "We chose a decision tree model because of its interpretability and set max depth to 3 (arbitrarily)." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import FeatureUnion, Pipeline\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.preprocessing import OneHotEncoder\n", "from sklearn import tree\n", "from sklearn.base import BaseEstimator, TransformerMixin\n", "\n", "\n", "# Transformer that selects the given DataFrame columns,\n", "# so each branch of the pipeline sees only its own features\n", "class ItemSelector(BaseEstimator, TransformerMixin):\n", "    def __init__(self, key):\n", "        self.key = key\n", "\n", "    def fit(self, x, y=None):\n", "        return self\n", "\n", "    def transform(self, df):\n", "        return df[self.key]" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "pipeline = Pipeline(\n", "    [\n", "        (\n", "            \"union\",\n", "            FeatureUnion(\n", "                transformer_list=[\n", "                    (\n", "                        \"categorical_features\",\n", "                        Pipeline(\n", "                            [\n", "                                (\"selector\", ItemSelector(key=categorical_features)),\n", "                                (\"onehot\", OneHotEncoder()),\n", "                            ]\n", "                        ),\n", "                    )\n", "                ]\n", "                + [\n", "                    (\n", "                        \"numerical_features\",\n", "                        Pipeline(\n", "                            [\n", "                                (\"selector\", ItemSelector(key=numerical_features)),\n", "                                (\"scaler\", StandardScaler()),\n", "                            ]\n", "                        ),\n", "                    )\n", "                ]\n", "            ),\n", "        ),\n", "        (\"classifier\", tree.DecisionTreeClassifier(max_depth=3, random_state=42)),\n", "    ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We split the dataset into a train set (75% of the samples) and a test set (25% of the samples). \n", "We train (fit) the pipeline on the train set and make predictions on the test set. 
" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "df_train, df_test = train_test_split(df, test_size=0.25, random_state=42)\n", "\n", "pipeline.fit(df_train, df_train[target])\n", "pred = pipeline.predict(df_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Testing the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With classification_report we calculate precision and recall with actual and predicted values.\n", "For class 1 (churned users) model achieves 0.67 precision and 0.37 recall.\n", "Precision tells us how many churned users did our classifier predicted correctly. On the other side, recall tell us how many churned users it missed. In layman terms, the classifier is not very accurate for churned users." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.81 0.94 0.87 1300\n", " 1 0.67 0.37 0.48 458\n", "\n", "avg / total 0.77 0.79 0.77 1758\n", "\n" ] } ], "source": [ "from sklearn.metrics import classification_report\n", "\n", "print(classification_report(df_test[target], pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model interpretability\n", "\n", "Decision Tree model uses Contract, MonthlyCharges, InternetService, TotalCharges, and tenure features to make a decision if a customer will churn or not. These features separate churned customers from others well based on the split criteria in the decision tree.\n", "\n", "Each customer sample traverses the tree and final node gives the prediction. 
\n", "For example, if Contract_Month-to-month is:\n", " - equal to 0, continue traversing the tree along the True branch, \n", " - equal to 1, continue traversing the tree along the False branch, \n", " - not defined, the tree outputs class 0.\n", " \n", "This is a great way to see how the model makes a decision and to check whether any features sneaked into the model that shouldn't be there." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "Tree\n", "\n",
"node 0: Contract_Month-to-month ≤ 0.5, gini = 0.392, samples = 5274, value = [3863, 1411], class = 0\n",
"  node 1 (0->1, True): MonthlyCharges ≤ 0.962, gini = 0.122, samples = 2376, value = [2221, 155], class = 0\n",
"    node 2 (1->2): Contract_One year ≤ 0.5, gini = 0.068, samples = 1787, value = [1724, 63], class = 0\n",
"      node 3 (2->3): gini = 0.022, samples = 978, value = [967, 11], class = 0\n",
"      node 4 (2->4): gini = 0.12, samples = 809, value = [757, 52], class = 0\n",
"    node 5 (1->5): TotalCharges ≤ 1.908, gini = 0.264, samples = 589, value = [497, 92], class = 0\n",
"      node 6 (5->6): gini = 0.357, samples = 279, value = [214, 65], class = 0\n",
"      node 7 (5->7): gini = 0.159, samples = 310, value = [283, 27], class = 0\n",
"  node 8 (0->8, False): InternetService_Fiber optic ≤ 0.5, gini = 0.491, samples = 2898, value = [1642, 1256], class = 0\n",
"    node 9 (8->9): tenure ≤ -1.176, gini = 0.404, samples = 1304, value = [937, 367], class = 0\n",
"      node 10 (9->10): gini = 0.496, samples = 435, value = [237, 198], class = 0\n",
"      node 11 (9->11): gini = 0.313, samples = 869, value = [700, 169], class = 0\n",
"    node 12 (8->12): tenure ≤ -0.687, gini = 0.493, samples = 1594, value = [705, 889], class = 1\n",
"      node 13 (12->13): gini = 0.422, samples = 787, value = [238, 549], class = 1\n",
"      node 14 (12->14): gini = 0.488, samples = 807, value = [467, 340], class = 0\n"
], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Feature names follow the pipeline's column order:\n", "# one-hot encoded categorical columns first, then the numerical columns\n", "dot_data = tree.export_graphviz(\n", "    pipeline.named_steps['classifier'],\n", "    out_file=None,\n", "    feature_names=categorical_feature_names + numerical_features,\n", "    class_names=[str(el) for el in pipeline.named_steps['classifier'].classes_],\n", "    filled=True,\n", "    rounded=True,\n", "    special_characters=True,\n", ")\n", "graph = graphviz.Source(dot_data)\n", "graph" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Further reading\n", "\n", "- [Handling class imbalance in customer churn prediction](https://www.sciencedirect.com/science/article/pii/S0957417408002121) - how to better handle class imbalance in churn prediction.\n", "- [A Survey on Customer Churn Prediction using Machine Learning Techniques](https://www.researchgate.net/publication/310757545_A_Survey_on_Customer_Churn_Prediction_using_Machine_Learning_Techniques) - This paper reviews the most popular machine learning algorithms used by researchers 
for churn prediction.\n", "- [Telco customer churn on kaggle](https://www.kaggle.com/blastchar/telco-customer-churn) - churn analysis on Kaggle.\n", "- [WTTE-RNN-Hackless-churn-modeling](https://ragulpr.github.io/2016/12/22/WTTE-RNN-Hackless-churn-modeling) - event-based churn prediction." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }