
# Data Science Salon Miami Example - Churn Rate Predictor


Churn rate predictor. Developed in Python 2 with Jupyter Notebook. Original source code is located here:


### Data Science Salon Miami Example - Churn Rate Predictor

1. 2/14/2018 · churn_rate (file:///Users/gregwerner/Downloads/churn_rate.html)

   **Table of Contents**

   1. 3Blades Churn Rate Demo
      1.1 Demo Goal
      1.2 Define the problem
      1.3 Import Data & Libraries
          1.3.0.1 Here are our features:
      1.4 Cleanse
      1.5 Transform
      1.6 Cross Validation and Model Preparation
      1.7 Spot Check Models
      1.8 Fit Model
      1.9 What Does This Mean?

   **3Blades Churn Rate Demo**

   This notebook provides an example of how to automate Machine Learning (ML) workflows with scikit-learn (http://scikit-learn.org/) and Magics (https://ipython.org/ipython-doc/3/interactive/magics.html), and applies these methodologies to a churn rate predictor. Use this demo by changing the values of k-folds, X, and y in order to explore!

   This demo follows the workflow below, adapted from Jason Brownlee via Machine Learning Mastery (http://machinelearningmastery.com/author/jasonb/):

   1. Define the problem & imports
   2. Prepare the data
   3. Cross validation and model preparation
   4. Run model
   5. Spot check algorithms
2. **Demo Goal**

   This demo's main goal is to make it easy to use the machine learning workflow to solve real-life marketing problems. It allows users to perform data science quickly and cleanly. The demo provides:

   - The ability to set X, y, and k-folds for a personalized demo experience.
   - A rich set of models, tools, and helpful hints to guide you through the process step by step.
   - The ability to code in Python and R and import from a variety of data sources, so you aren't locked in.
   - Fast run times without sacrificing folds, model types, or model accuracy.
   - The ability to change performance metrics so you have the power to interpret models in the way with which you're most comfortable.

   **Define the problem**

   Churn measures the number of customers who stop subscribing to or buying a product over time. When consumers subscribe to a service, it is important to measure how likely they are to stop subscribing. In this demo, when a customer churns, the label holds the value 1; when a customer does not churn, it holds the value 0.

   **Import Data & Libraries**

   We will now download our data and libraries. We will also list our features and view the first few lines of data.
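The 1/0 encoding described above is a one-liner in NumPy. A minimal sketch, using a hypothetical sample of the `Churn?` column (the dataset stores churn as the strings `'True.'` and `'False.'`, with trailing periods):

```python
import numpy as np

# Hypothetical sample of the "Churn?" column values.
churn_result = np.array(['True.', 'False.', 'False.', 'True.'])

# Encode churn as 1 and non-churn as 0, as described above.
y = np.where(churn_result == 'True.', 1, 0)

print(y.tolist())  # [1, 0, 0, 1]
```

The same pattern appears later in the notebook's feature-engineering cell, applied to the full `churn_df['Churn?']` column.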
3. In [1]:

   ```python
   # Import libraries
   %run ./Libraries.ipynb

   # Import data and view first few rows
   churn_df = pd.read_csv("churndata.csv")
   churn_df.head()
   ```

   ```
   /Users/Annmarie/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py:44:
   DeprecationWarning: This module was deprecated in version 0.18 in favor of the
   model_selection module into which all the refactored classes and functions are moved.
   Also note that the interface of the new CV iterators are different from that of this
   module. This module will be removed in 0.20.
   ```

   Out[1] (first five rows; output truncated at the right edge):

   |   | State | Account Length | Area Code | Phone | Int'l Plan | VMail Plan | VMail Message | Day Mins | Day Calls | Day Charge | Eve Mins | Eve Calls | Eve Charge | Night Mins | Night Calls | Night Charge | Intl Mins |
   |---|-------|----------------|-----------|-------|------------|------------|---------------|----------|-----------|------------|----------|-----------|------------|------------|-------------|--------------|-----------|
   | 0 | KS | 128 | 415 | 382-4657 | no  | yes | 25 | 265.1 | 110 | 45.07 | 197.4 | 99  | 16.78 | 244.7 | 91  | 11.01 | 10.0 |
   | 1 | OH | 107 | 415 | 371-7191 | no  | yes | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 |
   | 2 | NJ | 137 | 415 | 358-1921 | no  | no  | 0  | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.30 | 162.6 | 104 | 7.32  | 12.2 |
   | 3 | OH | 84  | 408 | 375-9999 | yes | no  | 0  | 299.4 | 71  | 50.90 | 61.9  | 88  | 5.26  | 196.9 | 89  | 8.86  | 6.6  |
   | 4 | OK | 75  | 415 | 330-6626 | yes | no  | 0  | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41  | 10.1 |
4. In [2]:

   ```python
   churn_result = churn_df['Churn?']
   y = np.where(churn_result == 'True.', 1, 0)

   # We don't need these columns
   to_drop = ['State', 'Area Code', 'Phone', 'Churn?']
   churn_feat_space = churn_df.drop(to_drop, axis=1)

   # 'yes'/'no' has to be converted to boolean values
   # NumPy converts these from boolean to 1. and 0. later
   yes_no_cols = ["Int'l Plan", "VMail Plan"]
   churn_feat_space[yes_no_cols] = churn_feat_space[yes_no_cols] == 'yes'

   # Pull out features for future use
   features = churn_feat_space.columns

   # Note: as_matrix() is deprecated in newer pandas;
   # .values or .to_numpy() is the modern equivalent.
   X = churn_feat_space.as_matrix().astype(np.float)
   ```

   Here are our features:

   1. Account Length
   2. International Plan
   3. Voicemail Plan
   4. Voicemail Messages
   5. Day Minutes
   6. Day Calls
   7. Day Charge
   8. Eve Minutes
   9. Eve Calls
   10. Eve Charge
   11. Night Minutes
   12. Night Calls
   13. Night Charge
   14. International Minutes
   15. International Calls
   16. International Charge
   17. Customer Service Calls
5. **Cleanse**

   Now let's do data cleansing. To do this we will isolate the target data, drop any columns we don't need, and convert any 'yes'/'no' values to 1/0. This makes it easy to make sense of all of our data. At this point, all missing values have been removed. We then pull out our features and place them in a matrix for analysis.

   It is also important that we standardize our data. This is a common requirement for many models and can be done with scikit-learn's preprocessing module, which rescales each feature to zero mean and unit variance. Once completed, we can graph our distribution and move on to transforming our data to prepare for modeling.

   In [3]:

   ```python
   # Isolate target data, remove unneeded columns
   %run ./FeatureEngineering.ipynb

   # This is important
   from sklearn.preprocessing import StandardScaler
   scaler = StandardScaler()
   X = scaler.fit_transform(X)
   ```

   **Transform**

   Now it is time to transform our data. In the following cells we will take our data and begin to optimize it. To do this we will look at which features are the most important to the data set. We will check the relationships between the variables and make sure that we have the best features in their best possible format.

   We find that the following three features are the most important to the data set:

   1. International Plan
   2. Voicemail Plan
   3. Account Length

   ```
   Feature space holds 3333 observations and 17 features
   Unique target labels: [0 1]
   ```
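What `StandardScaler.fit_transform` does can be reproduced by hand: per column, subtract the mean and divide by the standard deviation. A minimal NumPy sketch on toy data (not the actual churn features):

```python
import numpy as np

# Toy feature matrix standing in for the churn features.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# What StandardScaler.fit_transform does per column:
# subtract the mean, divide by the (population) standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Each column now has zero mean and unit variance.
print(X_std.mean(axis=0))  # ~[0. 0.]
print(X_std.std(axis=0))   # [1. 1.]
```

Standardizing matters here because features like Day Mins (hundreds) and CustServ Calls (single digits) live on very different scales, which would otherwise dominate distance-based models such as SVC and KNN.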
6. In [4]:

   ```python
   # Check for collinearity
   churn_df.corr()
   ```
7. Out[4] (correlation matrix; the rightmost columns and the final digits of the last column shown are truncated in the original output):

   |                | Account Length | Area Code | VMail Message | Day Mins | Day Calls | Day Charge | Eve Mins | Eve Calls | Eve Charge | Night Mins | Night Calls |
   |----------------|----------------|-----------|---------------|----------|-----------|------------|----------|-----------|------------|------------|-------------|
   | Account Length | 1.000000  | -0.012463 | -0.004628 | 0.006216  | 0.038470  | 0.006214  | -0.006757 | 0.019260  | -0.006745 | -0.008955 | -0.0131 |
   | Area Code      | -0.012463 | 1.000000  | -0.001994 | -0.008264 | -0.009646 | -0.008264 | 0.003580  | -0.011886 | 0.003607  | -0.005825 | 0.0165  |
   | VMail Message  | -0.004628 | -0.001994 | 1.000000  | 0.000778  | -0.009548 | 0.000776  | 0.017562  | -0.005864 | 0.017578  | 0.007681  | 0.0071  |
   | Day Mins       | 0.006216  | -0.008264 | 0.000778  | 1.000000  | 0.006750  | 1.000000  | 0.007043  | 0.015769  | 0.007029  | 0.004323  | 0.0229  |
   | Day Calls      | 0.038470  | -0.009646 | -0.009548 | 0.006750  | 1.000000  | 0.006753  | -0.021451 | 0.006462  | -0.021449 | 0.022938  | -0.0195 |
   | Day Charge     | 0.006214  | -0.008264 | 0.000776  | 1.000000  | 0.006753  | 1.000000  | 0.007050  | 0.015769  | 0.007036  | 0.004324  | 0.0229  |
   | Eve Mins       | -0.006757 | 0.003580  | 0.017562  | 0.007043  | -0.021451 | 0.007050  | 1.000000  | -0.011430 | 1.000000  | -0.012584 | 0.0075  |
   | Eve Calls      | 0.019260  | -0.011886 | -0.005864 | 0.015769  | 0.006462  | 0.015769  | -0.011430 | 1.000000  | -0.011423 | -0.002093 | 0.0077  |
   | Eve Charge     | -0.006745 | 0.003607  | 0.017578  | 0.007029  | -0.021449 | 0.007036  | 1.000000  | -0.011423 | 1.000000  | -0.012592 | 0.0075  |
   | Night Mins     | -0.008955 | -0.005825 | 0.007681  | 0.004323  | 0.022938  | 0.004324  | -0.012584 | -0.002093 | -0.012592 | 1.000000  | 0.0112  |
   | Night Calls    | -0.013176 | 0.016522  | 0.007123  | 0.022972  | -0.019557 | 0.022972  | 0.007586  | 0.007710  | 0.007596  | 0.011204  | 1.0000  |
   | Night Charge   | -0.008960 | -0.005845 | 0.007663  | 0.004300  | 0.022927  | 0.004301  | -0.012593 | -0.002056 | -0.012601 | 0.999999  | 0.0111  |
   | Intl Mins      | 0.009514  | -0.018288 | 0.002856  | -0.010155 | 0.021565  | -0.010157 | -0.011035 | 0.008703  | -0.011043 | -0.015207 | -0.0136 |
   | Intl Calls     | 0.020661  | -0.024179 | 0.013957  | 0.008033  | 0.004574  | 0.008032  | 0.002541  | 0.017434  | 0.002541  | -0.012353 | 0.0003  |
   | Intl Charge    | 0.009546  | -0.018395 | 0.002884  | -0.010092 | 0.021666  | -0.010094 | -0.011067 | 0.008674  | -0.011074 | -0.015180 | -0.0136 |
   | CustServ Calls | -0.003796 | 0.027572  | -0.013263 | -0.013423 | -0.018942 | -0.013427 | -0.012985 | 0.002423  | -0.012987 | -0.009288 | -0.0128 |
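Notice in the matrix above that Day Mins and Day Charge (and likewise the Eve, Night, and Intl minutes/charge pairs) are perfectly correlated, since charges are billed at a fixed rate per minute. A small NumPy sketch of how such collinear pairs can be flagged programmatically (toy data; the 0.17 per-minute rate is hypothetical):

```python
import numpy as np

# Toy data mirroring the pattern in the matrix above: "charge" is
# billed at a fixed rate per minute, so it is a scalar multiple of
# "minutes" and the two are perfectly correlated.
rng = np.random.default_rng(0)
minutes = rng.uniform(0, 300, size=100)
charge = 0.17 * minutes                          # perfectly collinear
calls = rng.integers(0, 150, size=100).astype(float)

X = np.column_stack([minutes, charge, calls])
corr = np.corrcoef(X, rowvar=False)

# Flag feature pairs whose absolute correlation exceeds a threshold.
threshold = 0.99
pairs = [(i, j)
         for i in range(corr.shape[0])
         for j in range(i + 1, corr.shape[1])
         if abs(corr[i, j]) > threshold]
print(pairs)  # [(0, 1)] -> minutes vs. charge
```

Dropping one feature from each flagged pair is a common cleanup step, since a perfectly collinear duplicate adds no information.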
9. In [5]:

   ```python
   # %load ./Visualizations.py.html
   from IPython.core.magic import (register_line_magic, register_cell_magic,
                                   register_line_cell_magic)

   register_cell_magic('Visualizations')
   get_ipython().magic(u'run ./Libraries.ipynb')

   # Isolate target data
   churn_result = churn_df["Churn?"]
   y = np.where(churn_result == 'True.', 1, 0)

   fig = plt.figure(figsize=(18, 6), dpi=1600)
   alpha = alpha_scatterplot = 0.2
   alpha_bar_chart = 0.55

   ax1 = plt.subplot2grid((2, 3), (0, 0))
   # Plot a bar graph of those who churn vs. those who do not
   churn_result.value_counts().plot(kind='bar', alpha=alpha_bar_chart)
   ax1.set_xlim(-1, 2)
   plt.title("Distribution of churn, (1 = Churn)")
   ```

   ```
   Out[5]: <matplotlib.text.Text at 0x11f6dca10>
   ```

   [Bar chart: distribution of churned vs. non-churned customers]
10. **Cross Validation and Model Preparation**

    Now let's set up the parameters for the model cross validation; we will use K-folds. In the next cell, change the `n_folds` argument to your desired number of folds and then run! How well will your ML algorithms perform?

    Resampling methods estimate performance on new data by making predictions for data whose answers are already known. It is important to evaluate ML algorithms on data different from the data used to train them, to avoid overfitting. This notebook uses K-folds to avoid overfitting and to ensure we make good predictions!

    In [6]:

    ```python
    # Set folds by changing the "n_folds" argument
    kf = KFold(len(y), n_folds=25, shuffle=True)
    ```

    In [7]:

    ```python
    # Run a classifier over the kfolds object
    def run_cv(X, y, clf_class, **kwargs):
        y_pred = y.copy()
        # Iterate through folds
        for train_index, test_index in kf:
            X_train, X_test = X[train_index], X[test_index]
            y_train = y[train_index]
            # Initialize a classifier with keyword arguments
            clf = clf_class(**kwargs)
            clf.fit(X_train, y_train)
            y_pred[test_index] = clf.predict(X_test)
        return y_pred
    ```

    In [8]:

    ```python
    run_cv
    ```

    ```
    Out[8]: <function __main__.run_cv>
    ```
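`run_cv` iterates over the splits produced by `KFold`. The splitting itself is simple to sketch in plain NumPy: shuffle the indices, cut them into K roughly equal test folds, and train on the remainder of each split. This is a simplified stand-in for scikit-learn's `KFold`, not the library implementation:

```python
import numpy as np

def kfold_indices(n_samples, n_folds, seed=0):
    """Sketch of K-fold splitting: shuffle indices, cut into n_folds
    roughly equal test folds; the rest of each split is for training."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, n_folds)
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate(
            [folds[j] for j in range(n_folds) if j != k])
        yield train_idx, test_idx

# Every sample lands in exactly one test fold across the K splits,
# so run_cv ends up with an out-of-fold prediction for every row.
all_test = np.concatenate([test for _, test in kfold_indices(10, 5)])
print(sorted(all_test.tolist()))  # [0, 1, 2, ..., 9]
```

This also shows why `run_cv` can fill `y_pred[test_index]` fold by fold and return a complete prediction vector: the test folds partition the data.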
11. **Spot Check Models**

    For this data set we will test support vector machines, random forests, and K-nearest neighbors. Note that random forest performs the best. To exclude a model, add a # before its line.

    In [9]:

    ```python
    def accuracy(y_true, y_pred):
        # NumPy interprets True and False as 1. and 0.
        return np.mean(y_true == y_pred)
    ```

    In [10]:

    ```python
    print "Support vector machines:"
    print "%.3f" % accuracy(y, run_cv(X, y, SVC))
    print "Random forest:"
    print "%.3f" % accuracy(y, run_cv(X, y, RF))
    print "K-nearest-neighbors:"
    print "%.3f" % accuracy(y, run_cv(X, y, KNN))
    ```

    ```
    Support vector machines:
    0.921
    Random forest:
    0.945
    K-nearest-neighbors:
    0.893
    ```
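The `accuracy` helper above works because NumPy treats `True`/`False` as `1.`/`0.`, so the mean of the elementwise comparison is the fraction of correct predictions. A quick check on toy labels (assumed for illustration), written in Python 3 syntax:

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Elementwise comparison gives a boolean array; its mean is the
    # fraction of positions where prediction matches the true label.
    return np.mean(y_true == y_pred)

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])  # one of five labels is wrong
print(accuracy(y_true, y_pred))  # 0.8
```

One caveat worth noting: on an imbalanced target like churn, plain accuracy can look strong even for a model that mostly predicts the majority class, which is why the notebook suggests swapping in other performance metrics.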