{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Churn prediction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Customer churn, also known as customer attrition, occurs when customers stop doing business with a company. The companies are interested in identifying segments of these customers because the price for acquiring a new customer is usually higher than retaining the old one. For example, if Netflix knew a segment of customers who were at risk of churning they could proactively engage them with special offers instead of simply losing them.\n", "\n", "In this blog post, we will create a simple customer churn prediction model using [Telco Customer Churn dataset](https://www.kaggle.com/blastchar/telco-customer-churn). We chose a decision tree to model churned customers, pandas for data crunching and matplotlib for visualizations. We will do all of that above in Python.\n", "The code can be used with another dataset with a few minor adjustments to train the baseline model. We also provide few references and give ideas for new features and improvements. \n", "\n", "You can run this code by downloading this [Jupyter notebook]({{site.url}}/assets/notebooks/2019-01-25-churn-prediction).\n", " \n", "Follow me on [twitter](https://twitter.com/romanorac) to get latest updates.\n", "\n", "Let's get started." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Requirements" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import platform\n", "import pandas as pd\n", "import sklearn\n", "import numpy as np\n", "import graphviz\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "python version 3.7.0\n", "pandas version 0.23.4\n", "sklearn version 0.19.2\n", "numpy version 1.15.1\n", "graphviz version 0.10.1\n", "matplotlib version 2.2.3\n" ] } ], "source": [ "print('python version', platform.python_version())\n", "print('pandas version', pd.__version__)\n", "print('sklearn version', sklearn.__version__)\n", "print('numpy version', np.__version__)\n", "print('graphviz version', graphviz.__version__)\n", "print('matplotlib version', matplotlib.__version__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preprocessing\n", "\n", "We use pandas to read the dataset and preprocess it. Telco dataset has one customer per line with many columns (features).\n", "There aren't any rows with all missing values or duplicates (this rarely happens with real-world datasets). \n", "There are 11 samples that have TotalCharges set to \" \", which seems like a mistake in the data. We remove those samples and set the type to numeric (float)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(7043, 21)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('data/WA_Fn-UseC_-Telco-Customer-Churn.csv')\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | customerID | \n", "gender | \n", "SeniorCitizen | \n", "Partner | \n", "Dependents | \n", "tenure | \n", "PhoneService | \n", "MultipleLines | \n", "InternetService | \n", "OnlineSecurity | \n", "... | \n", "DeviceProtection | \n", "TechSupport | \n", "StreamingTV | \n", "StreamingMovies | \n", "Contract | \n", "PaperlessBilling | \n", "PaymentMethod | \n", "MonthlyCharges | \n", "TotalCharges | \n", "Churn | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "7590-VHVEG | \n", "Female | \n", "0 | \n", "Yes | \n", "No | \n", "1 | \n", "No | \n", "No phone service | \n", "DSL | \n", "No | \n", "... | \n", "No | \n", "No | \n", "No | \n", "No | \n", "Month-to-month | \n", "Yes | \n", "Electronic check | \n", "29.85 | \n", "29.85 | \n", "No | \n", "
1 | \n", "5575-GNVDE | \n", "Male | \n", "0 | \n", "No | \n", "No | \n", "34 | \n", "Yes | \n", "No | \n", "DSL | \n", "Yes | \n", "... | \n", "Yes | \n", "No | \n", "No | \n", "No | \n", "One year | \n", "No | \n", "Mailed check | \n", "56.95 | \n", "1889.5 | \n", "No | \n", "
2 | \n", "3668-QPYBK | \n", "Male | \n", "0 | \n", "No | \n", "No | \n", "2 | \n", "Yes | \n", "No | \n", "DSL | \n", "Yes | \n", "... | \n", "No | \n", "No | \n", "No | \n", "No | \n", "Month-to-month | \n", "Yes | \n", "Mailed check | \n", "53.85 | \n", "108.15 | \n", "Yes | \n", "
3 | \n", "7795-CFOCW | \n", "Male | \n", "0 | \n", "No | \n", "No | \n", "45 | \n", "No | \n", "No phone service | \n", "DSL | \n", "Yes | \n", "... | \n", "Yes | \n", "Yes | \n", "No | \n", "No | \n", "One year | \n", "No | \n", "Bank transfer (automatic) | \n", "42.30 | \n", "1840.75 | \n", "No | \n", "
4 | \n", "9237-HQITU | \n", "Female | \n", "0 | \n", "No | \n", "No | \n", "2 | \n", "Yes | \n", "No | \n", "Fiber optic | \n", "No | \n", "... | \n", "No | \n", "No | \n", "No | \n", "No | \n", "Month-to-month | \n", "Yes | \n", "Electronic check | \n", "70.70 | \n", "151.65 | \n", "Yes | \n", "
5 rows × 21 columns
\n", "\n", " | customerID | \n", "gender | \n", "SeniorCitizen | \n", "Partner | \n", "Dependents | \n", "tenure | \n", "PhoneService | \n", "MultipleLines | \n", "InternetService | \n", "OnlineSecurity | \n", "... | \n", "DeviceProtection | \n", "TechSupport | \n", "StreamingTV | \n", "StreamingMovies | \n", "Contract | \n", "PaperlessBilling | \n", "PaymentMethod | \n", "MonthlyCharges | \n", "TotalCharges | \n", "Churn | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | \n", "7032 | \n", "7032 | \n", "7032.000000 | \n", "7032 | \n", "7032 | \n", "7032.000000 | \n", "7032 | \n", "7032 | \n", "7032 | \n", "7032 | \n", "... | \n", "7032 | \n", "7032 | \n", "7032 | \n", "7032 | \n", "7032 | \n", "7032 | \n", "7032 | \n", "7032.000000 | \n", "7032.000000 | \n", "7032 | \n", "
unique | \n", "7032 | \n", "2 | \n", "NaN | \n", "2 | \n", "2 | \n", "NaN | \n", "2 | \n", "3 | \n", "3 | \n", "3 | \n", "... | \n", "3 | \n", "3 | \n", "3 | \n", "3 | \n", "3 | \n", "2 | \n", "4 | \n", "NaN | \n", "NaN | \n", "2 | \n", "
top | \n", "7989-CHGTL | \n", "Male | \n", "NaN | \n", "No | \n", "No | \n", "NaN | \n", "Yes | \n", "No | \n", "Fiber optic | \n", "No | \n", "... | \n", "No | \n", "No | \n", "No | \n", "No | \n", "Month-to-month | \n", "Yes | \n", "Electronic check | \n", "NaN | \n", "NaN | \n", "No | \n", "
freq | \n", "1 | \n", "3549 | \n", "NaN | \n", "3639 | \n", "4933 | \n", "NaN | \n", "6352 | \n", "3385 | \n", "3096 | \n", "3497 | \n", "... | \n", "3094 | \n", "3472 | \n", "2809 | \n", "2781 | \n", "3875 | \n", "4168 | \n", "2365 | \n", "NaN | \n", "NaN | \n", "5163 | \n", "
mean | \n", "NaN | \n", "NaN | \n", "0.162400 | \n", "NaN | \n", "NaN | \n", "32.421786 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "... | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "64.798208 | \n", "2283.300441 | \n", "NaN | \n", "
std | \n", "NaN | \n", "NaN | \n", "0.368844 | \n", "NaN | \n", "NaN | \n", "24.545260 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "... | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "30.085974 | \n", "2266.771362 | \n", "NaN | \n", "
min | \n", "NaN | \n", "NaN | \n", "0.000000 | \n", "NaN | \n", "NaN | \n", "1.000000 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "... | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "18.250000 | \n", "18.800000 | \n", "NaN | \n", "
25% | \n", "NaN | \n", "NaN | \n", "0.000000 | \n", "NaN | \n", "NaN | \n", "9.000000 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "... | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "35.587500 | \n", "401.450000 | \n", "NaN | \n", "
50% | \n", "NaN | \n", "NaN | \n", "0.000000 | \n", "NaN | \n", "NaN | \n", "29.000000 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "... | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "70.350000 | \n", "1397.475000 | \n", "NaN | \n", "
75% | \n", "NaN | \n", "NaN | \n", "0.000000 | \n", "NaN | \n", "NaN | \n", "55.000000 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "... | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "89.862500 | \n", "3794.737500 | \n", "NaN | \n", "
max | \n", "NaN | \n", "NaN | \n", "1.000000 | \n", "NaN | \n", "NaN | \n", "72.000000 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "... | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "118.750000 | \n", "8684.800000 | \n", "NaN | \n", "
11 rows × 21 columns
\n", "\n", " | tenure | \n", "MonthlyCharges | \n", "TotalCharges | \n", "
---|---|---|---|
count | \n", "7032.000000 | \n", "7032.000000 | \n", "7032.000000 | \n", "
mean | \n", "32.421786 | \n", "64.798208 | \n", "2283.300441 | \n", "
std | \n", "24.545260 | \n", "30.085974 | \n", "2266.771362 | \n", "
min | \n", "1.000000 | \n", "18.250000 | \n", "18.800000 | \n", "
25% | \n", "9.000000 | \n", "35.587500 | \n", "401.450000 | \n", "
50% | \n", "29.000000 | \n", "70.350000 | \n", "1397.475000 | \n", "
75% | \n", "55.000000 | \n", "89.862500 | \n", "3794.737500 | \n", "
max | \n", "72.000000 | \n", "118.750000 | \n", "8684.800000 | \n", "
\n", " | customerID | \n", "gender | \n", "SeniorCitizen | \n", "Partner | \n", "Dependents | \n", "tenure | \n", "PhoneService | \n", "MultipleLines | \n", "InternetService | \n", "OnlineSecurity | \n", "... | \n", "DeviceProtection | \n", "TechSupport | \n", "StreamingTV | \n", "StreamingMovies | \n", "Contract | \n", "PaperlessBilling | \n", "PaymentMethod | \n", "MonthlyCharges | \n", "TotalCharges | \n", "Churn | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "7590-VHVEG | \n", "0 | \n", "0 | \n", "1 | \n", "0 | \n", "1 | \n", "0 | \n", "1 | \n", "0 | \n", "0 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "2 | \n", "29.85 | \n", "29.85 | \n", "0 | \n", "
1 | \n", "5575-GNVDE | \n", "1 | \n", "0 | \n", "0 | \n", "0 | \n", "34 | \n", "1 | \n", "0 | \n", "0 | \n", "2 | \n", "... | \n", "2 | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "0 | \n", "3 | \n", "56.95 | \n", "1889.50 | \n", "0 | \n", "
2 | \n", "3668-QPYBK | \n", "1 | \n", "0 | \n", "0 | \n", "0 | \n", "2 | \n", "1 | \n", "0 | \n", "0 | \n", "2 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "3 | \n", "53.85 | \n", "108.15 | \n", "1 | \n", "
3 | \n", "7795-CFOCW | \n", "1 | \n", "0 | \n", "0 | \n", "0 | \n", "45 | \n", "0 | \n", "1 | \n", "0 | \n", "2 | \n", "... | \n", "2 | \n", "2 | \n", "0 | \n", "0 | \n", "1 | \n", "0 | \n", "0 | \n", "42.30 | \n", "1840.75 | \n", "0 | \n", "
4 | \n", "9237-HQITU | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "2 | \n", "1 | \n", "0 | \n", "1 | \n", "0 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "2 | \n", "70.70 | \n", "151.65 | \n", "1 | \n", "
5 rows × 21 columns
\n", "