Applied Python for correlation on churn and stocks datasets
1. Applied Python for correlation on churn & stocks datasets
PRESENTED BY MAHMOUD FOUAD DARWISH
2. Correlation
Correlation is a statistic that measures the degree to which two variables move in relation to each other.
Correlation measures association, but doesn’t show if x causes y or vice versa.
3. Correlation Types
Positive correlation: when x goes up or down, we expect y to move in the same direction.
Negative correlation: when x goes up or down, we expect y to move in the opposite direction.
Zero correlation: x and y show no linear relationship, so the movement of x tells us nothing about the direction of y.
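As a minimal sketch of these three cases, the Pearson coefficient can be computed with NumPy's `corrcoef`. The arrays and the `pearson` helper below are made up for illustration and are not taken from either dataset.

```python
import numpy as np

# Hypothetical toy data, chosen only to illustrate the three cases.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = 2 * x + 1        # moves in the same direction as x
y_neg = -3 * x + 10      # moves in the opposite direction
y_zero = (x - 3) ** 2    # symmetric around x = 3, so no linear trend


def pearson(a, b):
    """Return the Pearson correlation coefficient of two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


r_pos = pearson(x, y_pos)
r_neg = pearson(x, y_neg)
r_zero = pearson(x, y_zero)
print(r_pos, r_neg, r_zero)
```

In the last case y is fully determined by x, yet the coefficient is zero: a zero correlation only rules out a linear relationship.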
5. Churn Dataset
The churn dataset used here is publicly available and is mentioned in the book [*Discovering Knowledge in
Data*](https://www.amazon.com/dp/0470908742/) by Daniel T. Larose. The author attributed the dataset
to the University of California Irvine Repository of Machine Learning Datasets.
Mobile phone service providers keep historical records of customers who churn, that is, who leave for
another provider. These records help identify at-risk customers before they leave, so the provider can try
to retain them.
The dataset file contains 3,333 records. Each record uses 21 attributes to describe the profile of a customer
of an unknown US mobile phone service provider.
7. Dataset Description
State: The US state in which the customer resides, indicated by a two-letter abbreviation, for example OH
or NJ
Account Length: The number of days that this account has been active
Area Code: The three-digit area code of the corresponding customer’s phone number
Phone: The seven-digit phone number
Int’l Plan: Whether the customer has an international calling plan: yes/no
VMail Plan: Whether the customer has a voice mail feature: yes/no
VMail Message: The average number of voice mail messages per month
Day Mins: The total number of calling minutes used during the day
8. Dataset Description Cont.
Day Calls: The total number of calls placed during the day
Day Charge: The billed cost of daytime calls
Eve Mins, Eve Calls, Eve Charge: The minutes used, number of calls, and billed cost for calls placed during the evening
Night Mins, Night Calls, Night Charge: The minutes used, number of calls, and billed cost for calls placed during nighttime
Intl Mins, Intl Calls, Intl Charge: The minutes used, number of calls, and billed cost for international calls
CustServ Calls: The number of calls placed to Customer Service
Churn?: Whether the customer left the service: true/false
9. Data Exploration - Describe
The first step is to call the describe function to see how the values of individual attributes are distributed.
It computes summary statistics for numeric attributes, such as the mean, minimum, maximum, and standard
deviation.
display(churn.describe())
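To make the output concrete, here is a small sketch on a stand-in DataFrame. The column names follow the dataset description, but the `sample` frame and its values are invented for illustration. `display` is available inside notebooks; `print` is used here so the snippet also runs as a plain script.

```python
import pandas as pd

# Hypothetical stand-in for the churn DataFrame; values are invented.
sample = pd.DataFrame({
    "Day Mins": [265.1, 161.6, 243.4, 299.4],
    "CustServ Calls": [1, 1, 0, 2],
    "Int'l Plan": ["no", "no", "no", "yes"],
})

# By default describe() summarises only the numeric columns, so the
# text column "Int'l Plan" is left out of the result.
summary = sample.describe()
print(summary)

# Individual statistics can be read back by row label and column name.
mean_day_mins = summary.loc["mean", "Day Mins"]
print(mean_day_mins)
```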
10. Data Exploration - Histogram
A histogram is a plot that lets you discover, and show, the underlying frequency distribution (shape) of a
set of continuous data. This allows the inspection of the data for its underlying distribution (e.g., normal
distribution), outliers, skewness, etc.
hist = churn.hist(bins=30, sharey=True, figsize=(10, 10))
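The binning behind a histogram can be sketched without plotting. The example below assumes a synthetic right-skewed sample in place of a real churn column, counts values per bin with `numpy.histogram`, and checks the skewness with `Series.skew`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for a numeric churn column.
rng = np.random.default_rng(0)
values = pd.Series(rng.exponential(scale=2.0, size=1000))

# Split the observed range into 30 equal-width bins and count values per bin.
counts, edges = np.histogram(values, bins=30)

print(counts.sum())   # total number of binned values
print(len(edges))     # number of bin edges (one more than the number of bins)
print(values.skew())  # sample skewness; positive means a longer right tail
```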
11. Data Exploration - crosstab
We use the crosstab function to show a frequency table for each categorical feature, along with its count of
unique values.
for column in churn.select_dtypes(include=['object']).columns:
    display(pd.crosstab(index=churn[column], columns='% observations', normalize='columns'))
    print("# of unique values {}".format(churn[column].nunique()))
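The same call can be tried on a single made-up column to see what `normalize='columns'` produces. The `plan` Series below is hypothetical; only the `crosstab` arguments mirror the loop above.

```python
import pandas as pd

# Hypothetical categorical column; values invented for illustration.
plan = pd.Series(["no", "no", "yes", "no"], name="Int'l Plan")

# normalize='columns' divides each count by its column total,
# so the table holds shares of observations rather than raw counts.
table = pd.crosstab(index=plan, columns="% observations", normalize="columns")
print(table)
print("# of unique values {}".format(plan.nunique()))
```

Because the shares in the single column sum to 1, the table reads directly as the proportion of customers in each category.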
21. Historical stock prices dataset loading
To read historical prices for the US stock market, we rely on the pandas-datareader library to load stock
data from Yahoo Finance.
31. Conclusion
We can use Python to compute correlations between different attributes using corr, scatter_matrix, and the
seaborn heatmap.
We applied Python correlation functions to two different datasets: the churn dataset and the stocks
dataset.
We can automatically read financial stock prices and load the data using the pandas and pandas-datareader
libraries.
We can describe datasets using describe, histograms, and other Python functions.
We can plot graphs using the plot function in the matplotlib library.
We used a notebook and Anaconda to run all the Python code in this presentation
successfully, with no issues.
32. Future work
Apply machine learning models to both datasets, taking the correlation findings into account.